Three interactive visualizations — from a single-parameter loss curve, to a 2-parameter surface, to the difference between convex and non-convex landscapes.
One parameter θ, one loss value L(θ). The gradient is the slope at the current point (green dashed tangent). Each step moves θ opposite the gradient: θ ← θ − η·(dL/dθ), where η is the learning rate.
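The update rule above can be sketched in a few lines. This is a minimal illustration, not the visualization's actual code; the quadratic loss L(θ) = (θ − 3)² is a made-up example whose minimum sits at θ = 3.

```python
# Minimal sketch of the 1-D update rule: theta <- theta - eta * dL/dtheta.
# The loss L(theta) = (theta - 3)^2 is an illustrative choice, not from the demo.

def grad_descent_1d(theta, eta=0.1, steps=50):
    for _ in range(steps):
        grad = 2 * (theta - 3)      # dL/dtheta for L = (theta - 3)^2
        theta = theta - eta * grad  # step against the slope
    return theta

print(round(grad_descent_1d(theta=0.0), 4))  # → 3.0
```

With η = 0.1 the error shrinks by a constant factor each step, so 50 iterations land essentially on the minimum.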
MSE (mean squared error) is convex — one smooth bowl, so gradient descent always converges (given a suitable learning rate). Cross-entropy uses a sigmoid: ŷ = 1/(1 + e^(−θ·x)), L = −log(ŷ) for a positive example — also convex but asymmetric. The non-convex example has two local minima; where you land depends on the starting point.
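The cross-entropy case can be checked numerically. This is a sketch under the setup stated above (a single positive example, ŷ = sigmoid(θ·x)); the function names are illustrative.

```python
import math

# Cross-entropy loss for one positive example: y_hat = sigmoid(theta * x),
# L = -log(y_hat). Convex in theta, but asymmetric around the minimum.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_positive(theta, x):
    y_hat = sigmoid(theta * x)
    loss = -math.log(y_hat)
    grad = -(1.0 - y_hat) * x   # dL/dtheta via the chain rule
    return loss, grad

loss, grad = cross_entropy_positive(theta=0.0, x=1.0)
print(round(loss, 4), round(grad, 4))  # → 0.6931 -0.5
```

At θ = 0 the model is maximally uncertain (ŷ = 0.5), giving loss ln 2 and a negative gradient, so descent pushes θ toward larger values where the positive example gets a higher probability.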
Your linear model has two parameters: slope a and intercept b. The MSE loss is a surface — a paraboloid bowl. Data is normalised (zero mean, unit variance) so the loss scale is well-behaved and gradient descent converges reliably across all learning rates shown.
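Descending that paraboloid looks like the following sketch. The synthetic data, variable names (a, b, eta), and learning rate are all illustrative assumptions, not the visualization's internals; the normalisation step mirrors the zero-mean, unit-variance preprocessing described above.

```python
import numpy as np

# Gradient descent on the two-parameter model y = a*x + b with MSE loss,
# on synthetic data normalised to zero mean and unit variance.

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)  # hypothetical ground truth
x = (x - x.mean()) / x.std()                    # normalise inputs
y = (y - y.mean()) / y.std()                    # normalise targets

a, b, eta = 0.0, 0.0, 0.1
for _ in range(200):
    err = a * x + b - y
    a -= eta * 2 * np.mean(err * x)  # dMSE/da
    b -= eta * 2 * np.mean(err)      # dMSE/db

print(round(a, 3), round(b, 6))
```

Because both x and y are normalised, the optimum slope equals their correlation (close to 1 here) and the optimum intercept is 0, which is what the iterates converge to.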
A linear model with MSE produces a perfect paraboloid — one global minimum, guaranteed. Add nonlinear activations and the surface buckles: multiple local minima, saddle points, flat plateaus. Drag either surface to rotate it.
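A one-dimensional toy makes the initialization-dependence concrete. The double-well loss L(θ) = (θ² − 1)² is a standard illustrative example (not from the demo) with local minima at θ = ±1.

```python
# Non-convex toy loss L(theta) = (theta^2 - 1)^2 with minima at theta = -1 and +1.
# The same descent routine lands in different minima depending on the start point.

def descend(theta, eta=0.05, steps=200):
    for _ in range(steps):
        grad = 4 * theta * (theta ** 2 - 1)  # dL/dtheta
        theta -= eta * grad
    return theta

print(round(descend(-0.5), 3), round(descend(0.5), 3))  # left and right minima
```

Starting left of the hump at θ = 0 rolls into the left basin; starting right of it rolls into the right one — the same sensitivity to initialization that the non-convex surface above exhibits.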