CS-E400218 · Aalto University · Interactive Demo

Gradient Descent Explorer

Three interactive visualizations — from a single-parameter loss curve, to a 2-parameter surface, to the difference between convex and non-convex landscapes.

01

Loss function & gradient descent — one parameter

One parameter θ, one loss value L(θ). The gradient is the slope at the current point (green dashed tangent). Each step moves θ in the direction opposite the gradient: θ ← θ − η·(dL/dθ).
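The update rule above can be sketched in a few lines. The quadratic loss L(θ) = θ² is an assumption inferred from the demo's readouts (θ = 3 gives L = 9 and dL/dθ = 6); the actual demo may use a different default curve.

```python
def grad_descent(theta=3.0, eta=0.3, steps=10):
    """One-parameter gradient descent on L(θ) = θ² (assumed loss)."""
    history = [theta]
    for _ in range(steps):
        grad = 2 * theta            # dL/dθ for L(θ) = θ²
        theta = theta - eta * grad  # θ ← θ − η·(dL/dθ)
        history.append(theta)
    return history

hist = grad_descent()  # first step: 3 − 0.3·6 = 1.2
```

With η = 0.3 each step shrinks θ by the factor 1 − 0.3·2 = 0.4, so the iterates decay geometrically toward the minimum at θ = 0.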

MSE (mean squared error) is convex — one smooth bowl, so gradient descent converges for any sufficiently small η. Cross-entropy uses a sigmoid: ŷ = 1/(1 + e^(−θ·x)), with L = −log(ŷ) for a positive example — also convex but asymmetric. The non-convex option has two local minima; where you land depends on the starting point.
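The cross-entropy case can be checked by hand. A minimal sketch, assuming a single feature value x = 1 (the demo does not state its x); the gradient simplifies to dL/dθ = −(1 − ŷ)·x for a positive label.

```python
import math

def bce_loss_and_grad(theta, x=1.0):
    """Cross-entropy loss and gradient for a positive example,
    with a sigmoid prediction ŷ = 1/(1 + e^(−θ·x)).
    x = 1.0 is an assumed feature value for illustration."""
    y_hat = 1.0 / (1.0 + math.exp(-theta * x))
    loss = -math.log(y_hat)       # L = −log(ŷ) for label 1
    grad = -(1.0 - y_hat) * x     # dL/dθ = −(1 − ŷ)·x
    return loss, grad
```

At θ = 0 the prediction is ŷ = 0.5, so the loss is log 2 and the gradient is −0.5: the loss keeps falling as θ grows, which is the asymmetry the caption mentions.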

Readouts: θ = 3.000 · L(θ) = 9.000 · dL/dθ = 6.000 · step 0 · η = 0.3
Click Step to move one gradient descent step, or Run to animate.
02

Loss over two parameters — y = ax + b

Your linear model has two parameters: slope a and intercept b. The MSE loss is a surface — a paraboloid bowl. Data is normalised (zero mean, unit variance) so the loss scale is well-behaved and gradient descent converges reliably across all learning rates shown.
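The two-parameter descent can be sketched end to end. The dataset below is made up for illustration; the normalisation (zero mean, unit variance) and the learning rate η = 0.4 mirror the demo's setup.

```python
def normalise(v):
    """Shift to zero mean, scale to unit variance."""
    m = sum(v) / len(v)
    s = (sum((t - m) ** 2 for t in v) / len(v)) ** 0.5
    return [(t - m) / s for t in v]

# Hypothetical near-linear data, then normalised as in the demo.
xs = normalise([1, 2, 3, 4, 5])
ys = normalise([2.1, 3.9, 6.2, 7.8, 10.1])

a, b, eta, n = 0.0, 0.0, 0.4, len(xs)
for _ in range(50):
    err = [a * x + b - y for x, y in zip(xs, ys)]
    da = 2 / n * sum(e * x for e, x in zip(err, xs))  # dL/da
    db = 2 / n * sum(err)                             # dL/db
    a, b = a - eta * da, b - eta * db
```

On normalised data the optimum is a = corr(x, y) and b = 0, which is why the bowl is well-conditioned and every learning rate shown in the demo converges.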

Readouts: slope a · intercept b · MSE loss · step 0
Panels: Contour map of L(a, b) · Data fit (current model) · Loss over steps
Controls: η (learning rate) = 0.4 · Start position
The contour rings are like elevation lines on a topographic map. The red dot is the current model; the centre is the minimum.
03

Convex vs non-convex loss landscape

A linear model with MSE produces a perfect paraboloid — one global minimum, guaranteed. Add nonlinear activations and the surface buckles: multiple local minima, saddle points, flat plateaus. Drag either surface to rotate it.

Convex · linear model + MSE
One global minimum — always converges
Non-convex · neural network
Multiple local minima, saddle points, plateaus
The left surface has exactly one valley. The right has several traps — gradient descent may settle in any of them depending on the starting position and η.
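The basin-dependence is easy to reproduce in one dimension. A minimal sketch using the toy non-convex loss L(θ) = θ⁴ − 2θ² (an assumed stand-in for the demo's neural-network surface), which has two minima at θ = ±1:

```python
def descend(theta, eta=0.05, steps=200):
    """Gradient descent on the non-convex toy loss
    L(θ) = θ⁴ − 2θ², with minima at θ = ±1."""
    for _ in range(steps):
        grad = 4 * theta**3 - 4 * theta  # dL/dθ
        theta -= eta * grad
    return theta

# Same algorithm, same η — different basins:
left, right = descend(-1.5), descend(1.5)
```

Starting left of the hump at θ = 0 lands in the valley at −1; starting right of it lands at +1. Nothing in the update rule changes — only the initial point.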