CS-E400218 · Aalto University · Interactive Demo

Gradient Descent Explorer

Three interactive visualizations — from a single-parameter loss curve, to a 2-parameter surface, to the difference between convex and non-convex landscapes.

01

Loss function & gradient descent — one parameter

One parameter θ, one loss value L(θ). The gradient is the slope at the current point (green dashed tangent). Each step moves θ in the direction opposite the gradient: θ ← θ − η·(dL/dθ).
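The update rule above can be sketched in a few lines. The quadratic loss L(θ) = θ² is an assumption inferred from the demo's readouts (θ = 3 gives L = 9 and dL/dθ = 6); the actual demo may use a different default curve.

```python
def grad_descent(theta=3.0, eta=0.3, steps=10):
    """One-parameter gradient descent on L(θ) = θ² (assumed loss)."""
    history = [theta]
    for _ in range(steps):
        grad = 2 * theta            # dL/dθ for L(θ) = θ²
        theta = theta - eta * grad  # θ ← θ − η·(dL/dθ)
        history.append(theta)
    return history

hist = grad_descent()  # first step: 3 − 0.3·6 = 1.2
```

With η = 0.3 each step shrinks θ by the factor 1 − 0.3·2 = 0.4, so the iterates decay geometrically toward the minimum at θ = 0.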

MSE (mean squared error) is convex — one smooth bowl, so gradient descent converges for any sufficiently small η. Cross-entropy uses a sigmoid: ŷ = 1/(1 + e^(−θ·x)), with L = −log(ŷ) for a positive example — also convex but asymmetric. The non-convex option has two local minima; where you land depends on the starting point.
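The cross-entropy case can be checked by hand. A minimal sketch, assuming a single feature value x = 1 (the demo does not state its x); the gradient simplifies to dL/dθ = −(1 − ŷ)·x for a positive label.

```python
import math

def bce_loss_and_grad(theta, x=1.0):
    """Cross-entropy loss and gradient for a positive example,
    with a sigmoid prediction ŷ = 1/(1 + e^(−θ·x)).
    x = 1.0 is an assumed feature value for illustration."""
    y_hat = 1.0 / (1.0 + math.exp(-theta * x))
    loss = -math.log(y_hat)       # L = −log(ŷ) for label 1
    grad = -(1.0 - y_hat) * x     # dL/dθ = −(1 − ŷ)·x
    return loss, grad
```

At θ = 0 the prediction is ŷ = 0.5, so the loss is log 2 and the gradient is −0.5: the loss keeps falling as θ grows, which is the asymmetry the caption mentions.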

Readouts: θ = 3.000 · L(θ) = 9.000 · dL/dθ = 6.000 · step 0 · η = 0.3
Click Step to move one gradient descent step, or Run to animate.
02

Loss over two parameters — y = ax + b

Your linear model has two parameters: slope a and intercept b. The MSE loss is a surface — a paraboloid bowl. Data is normalised (zero mean, unit variance) so the loss scale is well-behaved and gradient descent converges reliably across all learning rates shown.
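The two-parameter descent can be sketched end to end. The dataset below is made up for illustration; the normalisation (zero mean, unit variance) and the learning rate η = 0.4 mirror the demo's setup.

```python
def normalise(v):
    """Shift to zero mean, scale to unit variance."""
    m = sum(v) / len(v)
    s = (sum((t - m) ** 2 for t in v) / len(v)) ** 0.5
    return [(t - m) / s for t in v]

# Hypothetical near-linear data, then normalised as in the demo.
xs = normalise([1, 2, 3, 4, 5])
ys = normalise([2.1, 3.9, 6.2, 7.8, 10.1])

a, b, eta, n = 0.0, 0.0, 0.4, len(xs)
for _ in range(50):
    err = [a * x + b - y for x, y in zip(xs, ys)]
    da = 2 / n * sum(e * x for e, x in zip(err, xs))  # dL/da
    db = 2 / n * sum(err)                             # dL/db
    a, b = a - eta * da, b - eta * db
```

On normalised data the optimum is a = corr(x, y) and b = 0, which is why the bowl is well-conditioned and every learning rate shown in the demo converges.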

Readouts: slope a · intercept b · MSE loss · step 0
Panels: Contour map of L(a, b) · Data fit (current model) · Loss over steps
Controls: η (learning rate) = 0.4 · Start position
The contour rings are like elevation lines on a topographic map. The red dot is the current model; the centre is the minimum.
03

Convex vs non-convex loss landscape

A linear model with MSE produces a perfect paraboloid — one global minimum, guaranteed. Add nonlinear activations and the surface buckles: multiple local minima, saddle points, flat plateaus. Drag either surface to rotate it.

Convex · linear model + MSE
One global minimum — always converges
Non-convex · neural network
Multiple local minima, saddle points, plateaus
The left surface has exactly one valley. The right has several traps — gradient descent may settle in any of them depending on the starting position and η.
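The basin-dependence is easy to reproduce in one dimension. A minimal sketch using the toy non-convex loss L(θ) = θ⁴ − 2θ² (an assumed stand-in for the demo's neural-network surface), which has two minima at θ = ±1:

```python
def descend(theta, eta=0.05, steps=200):
    """Gradient descent on the non-convex toy loss
    L(θ) = θ⁴ − 2θ², with minima at θ = ±1."""
    for _ in range(steps):
        grad = 4 * theta**3 - 4 * theta  # dL/dθ
        theta -= eta * grad
    return theta

# Same algorithm, same η — different basins:
left, right = descend(-1.5), descend(1.5)
```

Starting left of the hump at θ = 0 lands in the valley at −1; starting right of it lands at +1. Nothing in the update rule changes — only the initial point.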