A blindfolded person on a mountain
Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can’t see anything, but you can feel the slope of the ground beneath your feet. What do you do? You move in the direction where the ground goes down, one step at a time. If it slopes more steeply to the left, you go left. If it drops more to the right, you go right. With each step, you feel the slope again and redirect yourself.
This strategy, so simple and natural, is exactly what neural networks use to learn. Every time an AI model improves — learning to recognize a face, translate a sentence, or generate text — it does so by descending through a mathematical landscape, one step at a time, following the slope.
It’s called gradient descent, and it’s arguably the most important algorithm in modern machine learning.
The Math, Explained Geometrically
Derivative: the slope beneath your feet
When you have a function of a single variable — think of it as a trail that goes up and down — the derivative at a point tells you how steep the trail is at that point. If the derivative is positive, you’re going uphill. If it’s negative, you’re going downhill. If it’s zero, you’re on flat ground: perhaps a peak, perhaps a dip.
Concrete example: the function f(x) = x² describes a parabola. Its derivative is f'(x) = 2x. If you’re at x = 3, the derivative is 6: you’re climbing steeply. At x = -1, the derivative is -2: you’re going downhill. At x = 0, the derivative is zero: you’re at the lowest point.
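You can sanity-check these numbers in R with a central finite difference (a numerical approximation for illustration; the step h is arbitrary and small):

```r
# Approximate f'(x) numerically and compare with the exact derivative 2x
f <- function(x) x^2
num_deriv <- function(f, x, h = 1e-6) (f(x + h) - f(x - h)) / (2 * h)

num_deriv(f, 3)   # ~ 6: climbing steeply
num_deriv(f, -1)  # ~ -2: going downhill
num_deriv(f, 0)   # ~ 0: flat ground
```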
The gradient: a multidimensional compass
In practice, the functions we care about don’t depend on just one variable. A machine learning model can have hundreds, thousands, or billions of parameters. The landscape is no longer a trail but a surface in a high-dimensional space — impossible to visualize, but perfectly manageable with mathematics.
The gradient is the vector that collects all partial derivatives of the function with respect to each parameter. If the function depends on two variables (x, y), the gradient is:
∇f = (∂f/∂x, ∂f/∂y)
If it depends on a thousand variables, the gradient is a vector with a thousand components. In every case, the gradient points in the direction of steepest ascent. To find the minimum, just go in the opposite direction: minus the gradient.
Here’s the fundamental formula of gradient descent:
θ_new = θ_old − α · ∇f(θ_old)
Let’s break it down piece by piece:
- θ represents the model’s parameters — the “knobs” the algorithm adjusts to improve
- ∇f(θ) is the gradient: it indicates the direction of steepest ascent at the current point
- α (alpha) is the learning rate: the size of the step we take at each iteration
- The minus sign makes us go in the opposite direction of the gradient, i.e. downhill
That’s all there is to it. Compute where you’re going uphill, take a step in the opposite direction, repeat.
A numerical example in R
Let’s see gradient descent in action on the function f(x) = x². We know the minimum is at x = 0. Can the algorithm find it starting from a random point?
# Gradient descent on f(x) = x^2
# The derivative is f'(x) = 2x
f <- function(x) x^2          # objective function
grad_f <- function(x) 2 * x   # derivative (gradient in 1D)

x <- 10       # starting point
alpha <- 0.1  # learning rate
n_iter <- 50  # number of iterations

path <- numeric(n_iter)
for (i in 1:n_iter) {
  path[i] <- x
  x <- x - alpha * grad_f(x)  # the fundamental rule
}

cat("Starting point: 10\n")
cat("After 50 iterations: x =", round(x, 8), "\n")
cat("Function value:", round(f(x), 10), "\n")

# Visualization
curve(x^2, from = -11, to = 11, lwd = 2, col = "steelblue",
      main = "Gradient descent on f(x) = x^2",
      xlab = "x", ylab = "f(x)")
points(path, path^2, col = "red", pch = 19, cex = 0.7)
lines(path, path^2, col = "red", lty = 2)
Running this code, you can see the algorithm start from x = 10 and quickly converge toward x = 0. The first steps are large (the slope is steep), then they shrink as we approach the bottom of the parabola. After 50 iterations, x is essentially zero.
The Learning Rate: Big Steps or Small Steps?
Back to our blindfolded person. How big is the step they take at each iteration? This is exactly the question of the learning rate (α), and the answer is less obvious than it seems.
Steps too small (α very low): the person moves with extreme caution, shifting their foot just a few centimeters at a time. They’ll eventually reach the bottom of the valley, but it could take forever. In machine learning, this means extremely long training times and high computational costs.
Steps too large (α too high): the person takes enormous leaps. Instead of gently descending into the valley, they overshoot it, end up on the other side, bounce back, and keep oscillating without ever settling. In extreme cases, the jumps get bigger and bigger and the person ends up higher than where they started. In machine learning, this is called divergence: the model gets worse instead of better.
Steps just right: a good learning rate allows you to descend quickly without oscillating. In practice, finding the right value requires experimentation. It’s one of the most artisanal aspects of machine learning.
Convergence: knowing when to stop
How does the blindfolded person know they’ve arrived? They feel the ground is flat in all directions: the gradient is (nearly) zero. In practice, the algorithm stops when the improvement between iterations becomes negligible, or when it has reached a maximum number of iterations.
The most common stopping criteria are:
- The gradient norm drops below a minimum threshold (the terrain is nearly flat)
- The difference in f(θ) between two consecutive iterations is smaller than a fixed tolerance
- The maximum number of iterations has been reached (computational budget exhausted)
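A minimal sketch combining the three criteria on our parabola (the threshold values are illustrative, not recommendations):

```r
# Gradient descent with explicit stopping criteria
f <- function(x) x^2
grad_f <- function(x) 2 * x

x <- 10
alpha <- 0.1
tol <- 1e-10       # tolerance on the improvement in f
grad_min <- 1e-8   # threshold on the gradient norm
max_iter <- 1000   # computational budget

for (i in 1:max_iter) {
  g <- grad_f(x)
  if (abs(g) < grad_min) break            # terrain nearly flat
  x_new <- x - alpha * g
  if (abs(f(x) - f(x_new)) < tol) break   # negligible improvement
  x <- x_new
}
cat("Stopped after", i, "iterations at x =", x, "\n")
```

In practice the budget criterion always comes last: it guarantees the loop ends even if the other two thresholds are never met.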
The effect of the learning rate: a visual comparison
This R code shows the effect of three different learning rate values on the same function:
# Comparing three learning rates on f(x) = x^2
gradient_descent <- function(x0, alpha, n_iter = 30) {
  x <- x0
  path <- numeric(n_iter)
  for (i in 1:n_iter) {
    path[i] <- x
    x <- x - alpha * 2 * x  # theta_new = theta_old - alpha * grad
  }
  return(path)
}

x0 <- 8  # same starting point for all

# Three different learning rates
slow  <- gradient_descent(x0, alpha = 0.01)  # too small
right <- gradient_descent(x0, alpha = 0.1)   # good compromise
fast  <- gradient_descent(x0, alpha = 0.9)   # nearly unstable

# Visualization
par(mfrow = c(1, 3))

# alpha = 0.01 (too slow)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.01 (too slow)")))
points(slow, slow^2, col = "red", pch = 19, cex = 0.6)
lines(slow, slow^2, col = "red", lty = 2)

# alpha = 0.1 (just right)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.1 (good compromise)")))
points(right, right^2, col = "darkgreen", pch = 19, cex = 0.6)
lines(right, right^2, col = "darkgreen", lty = 2)

# alpha = 0.9 (nearly unstable)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.9 (nearly unstable)")))
points(fast, fast^2, col = "orange", pch = 19, cex = 0.6)
lines(fast, fast^2, col = "orange", lty = 2)

par(mfrow = c(1, 1))
With α = 0.01, the red dots move sluggishly: after 30 iterations we’re still far from the minimum. With α = 0.1, convergence is fast and clean. With α = 0.9, the algorithm oscillates visibly at each step, bouncing from one side of the parabola to the other before settling — a slightly higher learning rate would diverge entirely.
What Can Go Wrong: Geometric Intuition
The mathematical landscape of a real model is not a nice symmetric parabola. It’s wild terrain, with secondary valleys, ridges, plateaus, and shapes that defy imagination. Here are the classic problems, explained through our landscape analogy.
Local minima: secondary valleys
Imagine terrain with multiple dips: one deep valley (the global minimum) and several shallower hollows (local minima). The blindfolded person has no way of knowing whether the valley they’re in is the deepest one. They feel flat ground beneath their feet and stop, convinced they’ve arrived. But they might be in a shallow hollow while the true minimum lies somewhere else entirely.
In practice, this is a less severe problem than once thought. Modern neural networks have so many parameters that local minima tend to have objective function values similar to the global minimum. It’s like terrain with many valleys, but all at roughly the same altitude: ending up in any one of them works fine.
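A one-variable toy landscape makes this concrete. The function below is invented purely for illustration: it has a shallow hollow near x = 1 and a deeper valley near x = -1, and where the algorithm ends up depends only on where it starts:

```r
# Two valleys: a shallow local minimum near x = 1,
# the deeper global minimum near x = -1
f <- function(x) (x^2 - 1)^2 + 0.2 * x
grad_f <- function(x) 4 * x * (x^2 - 1) + 0.2

descend <- function(x, alpha = 0.01, n_iter = 500) {
  for (i in 1:n_iter) x <- x - alpha * grad_f(x)
  x
}

descend(2)    # stops near x = 1: the shallow hollow
descend(-2)   # stops near x = -1: the true valley
f(descend(2)) > f(descend(-2))  # TRUE: flat ground in both, different depths
```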
Saddle points: the horse saddle
A more insidious problem is the saddle point. Imagine sitting on a horse saddle: move forward or backward and you descend; move left or right and you ascend. At that point the gradient is zero (the terrain feels flat), but you’re not at a minimum. You’re at a point that is a minimum in some directions and a maximum in others.
In high-dimensional spaces, saddle points are far more common than local minima. Fortunately, modern variants of gradient descent (with some noise or momentum, as we’ll see) generally manage to escape saddle points.
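The standard textbook example is f(x, y) = x² − y². This sketch shows both the flat gradient at the origin and how a tiny perturbation (playing the role of the noise mentioned above) lets the algorithm slide off:

```r
# The classic saddle: f(x, y) = x^2 - y^2
# Gradient: (2x, -2y). Exactly zero at the origin, yet the origin
# is a minimum along x and a maximum along y.
grad_f <- function(p) c(2 * p[1], -2 * p[2])

grad_f(c(0, 0))  # (0, 0): the terrain feels perfectly flat here

# Started exactly on the x-axis, plain gradient descent walks straight
# to the origin and stops. A tiny nudge off the axis is enough
# to slide off the saddle along y.
p <- c(1, 1e-6)
alpha <- 0.1
for (i in 1:100) p <- p - alpha * grad_f(p)
p  # x has collapsed to ~0, while y has grown: we escaped downhill
```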
Narrow valleys: the zigzag
Imagine a very narrow, elongated valley, like a canyon. The gradient points almost perpendicularly to the canyon walls, not along the canyon toward the bottom. The blindfolded person ends up bouncing from one wall to the other, making an inefficient zigzag instead of walking straight toward the bottom.
This happens when the problem’s variables have very different scales: some change rapidly, others slowly. It’s a common issue in practice, and one of the main motivations for advanced optimizers like Adam, which we’ll see in the final section.
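A toy two-variable canyon shows the zigzag (the coefficients are invented for illustration). The steep cross direction caps the usable learning rate, so progress along the canyon floor is painfully slow:

```r
# A canyon: f(x, y) = x^2 + 25 y^2 is far steeper across (y) than along (x)
grad_f <- function(p) c(2 * p[1], 50 * p[2])

# Stability in the steep direction requires alpha < 2/50 = 0.04,
# and that direction dictates the step size for both
p <- c(5, 1)
alpha <- 0.035
path_y <- numeric(20)
for (i in 1:20) {
  path_y[i] <- p[2]
  p <- p - alpha * grad_f(p)
}
sign(path_y[1:6])  # 1 -1 1 -1 1 -1: bouncing from wall to wall
p                  # x is still far from 0 while y has already settled
```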
From a Parabola to ChatGPT
So far we’ve seen gradient descent on a parabola: a problem with a single variable. It’s the simplest possible case. But the beauty of this algorithm is that it works in exactly the same way at any scale.
The scale of parameters
Here’s how the number of parameters grows as models become more complex:
- Simple linear regression: 2 parameters (slope and intercept). The landscape is a 3D surface, easy to visualize.
- Neural network for handwritten digit recognition: ~100,000 parameters. The landscape has 100,000 dimensions.
- ResNet-50 (image classification, 2015): ~25 million parameters.
- GPT-3 (ChatGPT’s predecessor): 175 billion parameters.
- GPT-4 and frontier models (2023-2025): estimated over one trillion parameters.
The principle is identical: compute the gradient, take a step in the opposite direction, repeat. What changes is the scale of the computation. GPT-4’s gradient is a vector with over a trillion components, computed on billions of text fragments, using thousands of processors in parallel. But the formula is the same one we saw on the parabola.
Where you see it in action (without knowing it)
Every time you interact with an AI system, gradient descent has been working behind the scenes:
- Netflix and Spotify recommending what to watch or listen to: recommendation models are trained with gradient descent on billions of user interactions
- Google Translate and automatic translators: neural networks with hundreds of millions of parameters, optimized with gradient descent on massive parallel text corpora
- Voice assistants (Siri, Alexa): speech recognition uses deep neural networks, trained with the same algorithm
- Self-driving cars: the networks that recognize pedestrians, traffic lights, and lane markings are trained with gradient descent variants
- ChatGPT, Claude, Gemini: Large Language Models are the most extreme case — gradient descent applied to billions of parameters on trillions of text tokens
The key message is this: the power of modern AI lies not in the complexity of the optimization algorithm, but in scale. Gradient descent is conceptually simple. What made the AI revolution possible is the ability to apply it to enormous models on enormous amounts of data, thanks to ever more powerful hardware.
The Evolutions: Better Shoes for Our Explorer
Vanilla gradient descent — what we’ve seen so far — works, but it has the limitations we described: it can be slow, it can oscillate, it can get stuck. Over the years, researchers have developed variants that address these problems. Without diving into the formulas, here are the key ideas.
Stochastic Gradient Descent (SGD)
Instead of computing the gradient on the entire dataset at each step (computationally very expensive), SGD computes it on a small random sample (mini-batch). It’s as if the blindfolded person, instead of feeling the entire terrain around them, only sampled a few random points. The slope estimate is noisy but correct on average, and the computation speed is enormously greater. The noise, paradoxically, is also useful: it helps escape local minima and saddle points.
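A minimal sketch of the idea, fitting a one-parameter line y = a·x by least squares on synthetic data (the dataset, batch size, and learning rate are all invented for illustration):

```r
# SGD: each step uses a small random mini-batch, not the full dataset
set.seed(42)
n <- 10000
x <- runif(n)
y <- 3 * x + rnorm(n, sd = 0.1)   # true slope: 3

a <- 0            # parameter to learn
alpha <- 0.1
batch_size <- 32
for (step in 1:2000) {
  idx <- sample(n, batch_size)                       # a few random points
  grad <- mean(2 * (a * x[idx] - y[idx]) * x[idx])   # noisy, unbiased estimate
  a <- a - alpha * grad
}
a  # close to the true slope, 3
```

Each gradient here looks at 32 points instead of 10,000, so a step costs about 0.3% of a full-batch step.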
Momentum
Imagine a ball rolling down the hill instead of a person walking. The ball builds up velocity: if the slope continues in the same direction, it accelerates. If the slope changes direction, the ball slows down before reversing. This is momentum: the algorithm “remembers” the direction it was moving and adds the current gradient to it. The result is that it crosses flat zones faster and oscillates less in narrow valleys.
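In code, momentum is a two-line change to the update: keep a velocity that blends the previous velocity with the current slope, then step along the velocity (a sketch of the classic "heavy ball" form; beta controls how much past direction is remembered):

```r
# Momentum on f(x) = x^2: the ball accumulates velocity
grad_f <- function(x) 2 * x

x <- 10
v <- 0          # velocity
alpha <- 0.1    # learning rate
beta <- 0.9     # memory of past direction
for (i in 1:200) {
  v <- beta * v + grad_f(x)   # blend old velocity with current slope
  x <- x - alpha * v          # step along the velocity, not the raw gradient
}
x  # near 0
```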
Adam: the Swiss army knife
Adam (Adaptive Moment Estimation) combines the momentum idea with a learning rate that adapts automatically for each parameter. Parameters that change little get larger steps; those that change a lot get smaller steps. It’s as if the blindfolded person had smart shoes that adjust the step length based on the terrain under each foot.
Adam has become the de facto standard for training most modern neural networks. It’s robust, requires little manual tuning, and works well across a wide range of problems. Nearly all the models you use every day — from Spotify to ChatGPT — were trained with Adam or one of its variants.
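For the curious, here is what the Adam update looks like on our parabola, written out in full (a sketch using the default hyperparameters from the original paper; in practice a library does this for you):

```r
# Adam on f(x) = x^2
grad_f <- function(x) 2 * x

x <- 10
alpha <- 0.1
beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8   # standard defaults
m <- 0; v <- 0
for (t in 1:500) {
  g <- grad_f(x)
  m <- beta1 * m + (1 - beta1) * g       # momentum: average of gradients
  v <- beta2 * v + (1 - beta2) * g^2     # average of squared gradients
  m_hat <- m / (1 - beta1^t)             # bias correction for early steps
  v_hat <- v / (1 - beta2^t)
  x <- x - alpha * m_hat / (sqrt(v_hat) + eps)   # per-parameter adaptive step
}
x  # near 0
```

Dividing by sqrt(v_hat) is what makes the step adaptive: parameters with a history of large gradients get smaller steps, and vice versa.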
Back to the Mountain
The blindfolded person we started with now has better shoes. They have a ball that builds velocity instead of legs that take rigid steps. They have soles that automatically adapt to the terrain. And above all, they’re not walking on a mountain with two or three dimensions: they’re walking through a landscape with billions of dimensions.
But the principle is exactly the same. Feel the slope. Take a step where it goes down. Repeat.
Gradient descent is not a spectacular algorithm. It lacks the elegant complexity of a genetic algorithm or the narrative charm of adversarial networks. It’s a mechanical procedure, almost trivial. But it’s the mechanical procedure upon which the entire artificial intelligence revolution rests. From Netflix suggestions to image-generating models, from self-driving cars to simultaneous translation, it all comes down to this: a function to minimize, a gradient to compute, a step to take.
The next time a voice assistant understands your question, or an automatic translator nails a nuance, remember: behind the scenes, a very sophisticated version of our blindfolded person has walked billions of steps through a landscape with billions of dimensions. And found a valley deep enough to be useful.