Feel your way down the hill

Nearly every neural network you've ever used (your phone's autocorrect, the spam filter, GPT) is trained by the same idea. It fits on one line:

$$x_{t+1} = x_t - \eta\,\nabla L(x_t)$$

That's it. A model has parameters $x$, a loss $L(x)$ measuring how wrong it is, and a learning rate $\eta$. At each step you look at the slope of the loss and walk the other way. Repeat a few million times and you have ChatGPT.
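
To make the rule concrete, here is a minimal sketch in Python. The names `numerical_slope`, `descend`, and `toy_loss` are made up for this example, and the loss $x^2$ is a stand-in, not a loss used elsewhere in this article. The slope is estimated with a central difference rather than an exact derivative, which is fine for a toy but is not how real models get their gradients.

```python
def numerical_slope(loss, x, h=1e-6):
    """Estimate dL/dx at x with a central difference (an approximation)."""
    return (loss(x + h) - loss(x - h)) / (2 * h)


def descend(loss, x0, eta, steps):
    """Apply x <- x - eta * slope(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - eta * numerical_slope(loss, x)
    return x


def toy_loss(x):
    # Stand-in loss for illustration only; its minimum is at x = 0.
    return x ** 2


x_final = descend(toy_loss, x0=3.0, eta=0.1, steps=100)
print(x_final)  # ends up very close to 0, the minimum of the stand-in loss
```

Passing the loss in as a function keeps the update rule itself to a single line inside the loop, which is the point of the equation above.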

The whole intuition lives in a 1D picture. Click the curve, drag the ball, and step.

One dimension, one update rule

The curve below is a toy loss $L(x) = \tfrac{1}{2}(x-1)^2 + \tfrac{1}{2}$. The minimum sits at $x = 1$. Click the curve, drag the red point, or pick a preset to choose your starting $x$, then step or hit Auto-run.

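[Interactive plot: the loss curve with a draggable red point, starting-point presets, a step counter, and a learning-rate slider for $\eta$.]

For example, starting at $x_t = -2.50$ with $\eta = 0.40$, the slope is $-3.50$ and one step gives:

$$x_{t+1} = -2.50 - 0.40 \cdot (-3.50) = -1.10$$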

The red dashed line is the slope $\nabla L(x)$. Each step subtracts $\eta \cdot \nabla L$ from $x$: exactly the equation above, just with the live numbers.
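
The same arithmetic can be reproduced in a few lines of Python. This is a sketch, assuming the toy loss $L(x) = \tfrac{1}{2}(x-1)^2 + \tfrac{1}{2}$, whose derivative is $x - 1$, and the starting values $x = -2.5$ and $\eta = 0.4$ from the readout. The function names `grad` and `step` are chosen for illustration.

```python
def grad(x):
    """Derivative of L(x) = 0.5 * (x - 1)**2 + 0.5."""
    return x - 1.0


def step(x, eta):
    """One gradient-descent update: x - eta * grad(x)."""
    return x - eta * grad(x)


x = -2.5
eta = 0.4
trajectory = [x]
for _ in range(5):
    x = step(x, eta)
    trajectory.append(x)

# The first update lands at about -1.10; later ones keep closing in on x = 1.
print([round(v, 3) for v in trajectory])
```

Each update shrinks the distance to the minimum by the same factor, $1 - \eta = 0.6$, which is why the ball slows down as it approaches $x = 1$.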

Now play with the presets:

  • Too small: converges, but you'd die of old age.
  • Just right: clean descent in a few steps.
  • Too big: overshoots the minimum, bounces around it.
  • Diverges: $\eta$ above $2$ literally throws the ball off the curve. The loss grows every step.

That last one is the bug every ML engineer hits in their first week. The first fix is almost always the same: lower the learning rate.
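
The four regimes can be checked numerically. For this particular toy loss, the distance to the minimum is multiplied by $(1 - \eta)$ on every step, so it shrinks when $0 < \eta < 2$ and grows when $\eta > 2$. The sketch below assumes a start at $x = -2.5$, and the four learning rates are values picked for this example; the page's presets may use different ones.

```python
def run(eta, x0=-2.5, steps=20):
    """Descend L(x) = 0.5 * (x - 1)**2 + 0.5 and return the final distance to x = 1."""
    x = x0
    for _ in range(steps):
        x = x - eta * (x - 1.0)  # the gradient of L is (x - 1)
    return abs(x - 1.0)


# Learning rates chosen for illustration only.
rates = {"too small": 0.01, "just right": 0.5, "too big": 1.8, "diverges": 2.1}
for label, eta in rates.items():
    print(f"{label:>10}  eta={eta:<4}  |x - 1| after 20 steps = {run(eta):.3g}")
```

With a rate of 1.8 the iterate flips sides of the minimum on every step but still closes in, while 2.1 flips sides and moves further away each time.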

Two dimensions, same trick

A real model doesn't have one parameter; it has billions. But the picture barely changes: instead of a curve there's a surface, and $\nabla L$ is a vector pointing uphill. You still subtract $\eta \nabla L$ from your position.

Here $L(x, y) = x^2 + 2y^2$, a bowl that's steeper along $y$ than $x$. The blue arrows show the descent direction everywhere. Drag the red point and notice that the gradient (red) always crosses the contours at right angles, and the resulting step (green) points the opposite way, which is usually not straight at the minimum.
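
In code, the 2D version is the 1D update applied to each coordinate. The sketch below assumes the bowl $L(x, y) = x^2 + 2y^2$, whose gradient is $(2x, 4y)$. The starting point $(3, 2)$ and the learning rate $0.1$ are arbitrary choices for illustration.

```python
def grad(x, y):
    """Gradient of L(x, y) = x**2 + 2 * y**2."""
    return 2.0 * x, 4.0 * y


def step(x, y, eta):
    """One update: subtract eta times each gradient component."""
    gx, gy = grad(x, y)
    return x - eta * gx, y - eta * gy


x, y = 3.0, 2.0   # starting point chosen for illustration
eta = 0.1         # learning rate chosen for illustration
path = [(x, y)]
for _ in range(10):
    x, y = step(x, y, eta)
    path.append((x, y))

for px, py in path[:4]:
    print(f"x={px:.3f}  y={py:.3f}")
```

With this learning rate, $x$ shrinks by a factor of 0.8 per step and $y$ by 0.6, so the path collapses onto the flatter $x$ direction first and then creeps along it.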

This asymmetry, with some directions steeper than others, is why modern optimizers like Adam adapt $\eta$ per parameter. Same idea, more accounting.
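
To give a feel for the extra accounting, here is a simplified, pure-Python sketch of an Adam-style update on the same bowl. It is not a library implementation. The values for `beta1`, `beta2`, and `eps` are the commonly quoted defaults, `lr=0.1` is chosen for this toy problem, and the function name `adam` is just a label for this example.

```python
import math


def grad(x, y):
    """Gradient of the bowl L(x, y) = x**2 + 2 * y**2."""
    return [2.0 * x, 4.0 * y]


def adam(params, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified Adam: each parameter's step is scaled by its own gradient statistics."""
    params = list(params)
    m = [0.0] * len(params)  # running mean of gradients
    v = [0.0] * len(params)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(*params)
        for i in range(len(params)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for the early steps
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


first = adam([3.0, 2.0], steps=1)
print(first)  # both coordinates move by about lr, although the gradients are 6 and 8
```

Dividing by the running gradient magnitude is what makes the step size per-parameter: on the first update the steep $y$ direction and the shallow $x$ direction move by nearly the same amount.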

The bit worth remembering

Gradient descent is not a technique you learn once and move past. It's the substrate. Every "the model is training" message in your terminal is some variant of:

  1. Compute the loss.
  2. Compute the gradient.
  3. Take a small step against it.
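
The three steps above can be sketched as a tiny training loop. This example fits a one-parameter model $y = w x$ with mean squared error. The data points are made up for illustration (they lie exactly on $y = 2x$), and the starting weight and learning rate are arbitrary choices for this toy problem.

```python
# Made-up data for illustration: points on the line y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss(w):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0      # initial parameter, chosen arbitrarily
eta = 0.05   # learning rate, chosen for this toy problem
for step in range(20):
    current_loss = loss(w)      # 1. compute the loss
    g = gradient(w)             # 2. compute the gradient
    w = w - eta * g             # 3. take a small step against it
    if step % 5 == 0:
        print(f"step {step:2d}  loss {current_loss:.6f}  w {w:.6f}")
```

The loss value is only used for logging here; the update itself needs nothing but the gradient.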

The art is in steps 2 and 3: getting the gradient cheaply (backprop), and picking $\eta$ so you don't waste compute crawling or blow up overshooting. But the idea you just dragged around? That's the whole thing.
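
For a flavor of what backprop does, here is a hand-worked sketch for a single weight. It assumes a hypothetical loss $L(w) = (\tanh(w x) - y)^2$, runs a forward pass that keeps the intermediate values, then applies the chain rule backward. Real frameworks automate this across many layers; the function names and the numbers below are arbitrary choices for illustration.

```python
import math


def forward(w, x, y):
    """Forward pass for L = (tanh(w * x) - y)**2, keeping intermediates."""
    z = w * x
    a = math.tanh(z)
    loss = (a - y) ** 2
    return loss, (x, a, y)


def backward(cache):
    """Backward pass: apply the chain rule from the loss back to w."""
    x, a, y = cache
    dL_da = 2.0 * (a - y)   # derivative of (a - y)**2 with respect to a
    da_dz = 1.0 - a ** 2    # derivative of tanh(z), reusing a = tanh(z)
    dz_dw = x               # derivative of w * x with respect to w
    return dL_da * da_dz * dz_dw


# Weight, input, and target are arbitrary values for illustration.
w, x, y = 0.5, 1.5, 0.2
loss, cache = forward(w, x, y)
analytic = backward(cache)

# Sanity check against a central-difference estimate of the same derivative.
h = 1e-6
numeric = (forward(w + h, x, y)[0] - forward(w - h, x, y)[0]) / (2 * h)
print(analytic, numeric)
```

The backward pass reuses values already computed in the forward pass, which is where the "cheaply" comes from: one extra sweep gives the gradient, instead of one extra loss evaluation per parameter.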