Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can’t see anything, but you can feel the slope of the ground beneath your feet. What do you do? You move in the direction where the ground goes down, one step at a time. If it slopes more steeply to the left, you go left. If it drops more to the right, you go right. With each step, you feel the slope again and redirect yourself.
This strategy, so simple and natural, is exactly what neural networks use to learn. Every time an AI model improves — learning to recognize a face, translate a sentence, or generate text — it does so by descending through a mathematical landscape, one step at a time, following the slope.
It’s called gradient descent, and it’s arguably the most important algorithm in modern machine learning.
The idea of following the slope to find a minimum has surprisingly ancient roots. In 1847, the French mathematician Augustin-Louis Cauchy published a method for solving systems of equations that, in essence, is already gradient descent: compute the direction of steepest ascent of a function and move in the opposite direction. Cauchy was not thinking about neural networks — their appearance was more than a century away — but he had formalized the principle that still powers artificial intelligence today.
For over a century the idea remained confined to pure mathematics. The breakthrough came in 1951, when Herbert Robbins and Sutton Monro proposed a stochastic version of the method: instead of computing the gradient on the entire problem, it is estimated on a random sample. This is the seed of what we now call Stochastic Gradient Descent (SGD) — we will discuss it in the final section. A few years later, in 1958, Frank Rosenblatt built the perceptron, one of the first machine learning models to learn from data with a gradient-style update rule. The excitement was enormous, but short-lived: in 1969, Minsky and Papert demonstrated the perceptron’s limitations, and interest in neural networks collapsed. This was the first of the so-called AI winters.
The revival came in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published the backpropagation algorithm: an efficient way to compute the gradient in neural networks with many layers. It was the key that unlocked deep learning. From that moment on, gradient descent was no longer a theoretical exercise but the practical tool used to train neural networks. The final piece of this story was added by Diederik Kingma and Jimmy Ba in 2014, with the Adam optimizer — which we will encounter again in the final section of this article. Nearly two centuries separate Cauchy from Adam, yet the principle has remained the same: feel the slope, take a step in the opposite direction.
Before we descend, we need to know what we are minimizing. The blindfolded person is looking for the lowest point in the valley — but in machine learning, what exactly is that valley?
Let’s take a concrete example. Suppose we want to predict the price of a house knowing only its area. We have five houses for which we know the area and the actual price, and our model is the simplest possible: a line through the origin, price = m × area, where m is the only parameter to adjust.
For each value of m, the model makes a prediction. If m is too low, the predictions underestimate the actual prices; if it is too high, they overestimate them. We need a number that tells us how much the model is getting wrong: this is the cost function (or loss function).
The most widely used cost function is the Mean Squared Error (MSE): for each house, we compute the difference between the predicted price and the actual price, square it, and take the average of all these errors. In formula:
MSE(m) = (1/n) ∑ᵢ (priceᵢ − m × areaᵢ)²
If we plot MSE(m) as m varies, we get a parabola-shaped curve: the same parabola we will use shortly as an example of gradient descent. This is no coincidence — the parabola is the cost function, and its lowest point is the value of m that makes the model as good as possible.
The examples that follow are available in both R and Python — choose whichever language you’re more comfortable with.
Let’s compute the cost function in R for our house example:
# Cost function: predicting the price of a house
area <- c(50, 70, 90, 120, 150)
price <- c(150, 200, 260, 340, 400) # thousands of euros
# Model: price = m * area
# MSE cost function as m varies
m_values <- seq(1, 4, by = 0.01)
mse <- sapply(m_values, function(m) mean((price - m * area)^2))
plot(m_values, mse, type = "l", lwd = 2, col = "steelblue",
     xlab = "m (slope)", ylab = "MSE",
     main = "Cost function as m varies")
m_best <- m_values[which.min(mse)]
abline(v = m_best, col = "red", lty = 2)
cat("The value of m that minimizes the error:", round(m_best, 2), "\n")

Let’s verify in Python:
import numpy as np
import matplotlib.pyplot as plt
# Cost function: predicting the price of a house
area = np.array([50, 70, 90, 120, 150])
price = np.array([150, 200, 260, 340, 400]) # thousands of euros
# Model: price = m * area
# MSE cost function as m varies
m_values = np.linspace(1, 4, 301) # equivalent to seq(1, 4, by=0.01) in R
mse = np.array([np.mean((price - m * area)**2) for m in m_values])
plt.plot(m_values, mse, lw=2, color="steelblue")
plt.xlabel("m (slope)")
plt.ylabel("MSE")
plt.title("Cost function as m varies")
m_best = m_values[np.argmin(mse)]
plt.axvline(m_best, color="red", linestyle="--")
plt.show()
print(f"The value of m that minimizes the error: {m_best:.2f}")

Now we know what to minimize: the cost function. The gradient tells us how.
When you have a function of a single variable — think of it as a trail that goes up and down — the derivative at a point tells you how steep the trail is at that point. If the derivative is positive, you’re going uphill. If it’s negative, you’re going downhill. If it’s zero, you’re on flat ground: perhaps a peak, perhaps a dip.
Concrete example: the function f(x) = x² describes a parabola. Its derivative is f'(x) = 2x. If you’re at x = 3, the derivative is 6: you’re climbing steeply. At x = -1, the derivative is -2: you’re going downhill. At x = 0, the derivative is zero: you’re at the lowest point.
In practice, the functions we care about don’t depend on just one variable. A machine learning model can have hundreds, thousands, or billions of parameters. The landscape is no longer a trail but a surface in a high-dimensional space — impossible to visualize, but perfectly manageable with mathematics.
The gradient is the vector that collects all partial derivatives of the function with respect to each parameter. If the function depends on two variables (x, y), the gradient is:
∇f = (∂f/∂x, ∂f/∂y)
If it depends on a thousand variables, the gradient is a vector with a thousand components. In every case, the gradient points in the direction of steepest ascent. To find the minimum, just go in the opposite direction: minus the gradient.
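To make the gradient concrete, here is a minimal sketch in Python that approximates it numerically with finite differences and checks the result against the exact partial derivatives. The toy function f(x, y) = x² + 3y² is our own choice for illustration, not something from the text:

```python
import numpy as np

# Toy function (our choice): f(x, y) = x^2 + 3*y^2
# Exact gradient: (df/dx, df/dy) = (2x, 6y)
def f(x, y):
    return x**2 + 3 * y**2

def numerical_gradient(f, x, y, h=1e-6):
    # Central differences: (f(x+h) - f(x-h)) / (2h) for each variable
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

grad = numerical_gradient(f, 1.0, 2.0)
print(grad)   # close to [2, 12], the exact gradient at (1, 2)
```

With a thousand variables the idea is identical: one partial derivative per variable, collected into one vector.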
Here’s the fundamental formula of gradient descent:
θ_new = θ_old − α · ∇f(θ_old)
Let’s break it down piece by piece:

- θ_old are the current values of the parameters;
- ∇f(θ_old) is the gradient of the cost function at the current point, the direction of steepest ascent;
- α is the learning rate, which controls how big a step we take;
- θ_new are the updated parameters, one step further downhill.
That’s all there is to it. Compute where you’re going uphill, take a step in the opposite direction, repeat.
Let’s see gradient descent in action on the function f(x) = x². We know the minimum is at x = 0. Can the algorithm find it starting from a random point?
# Gradient descent on f(x) = x^2
# The derivative is f'(x) = 2x
f <- function(x) x^2 # objective function
grad_f <- function(x) 2 * x # derivative (gradient in 1D)
x <- 10 # starting point
alpha <- 0.1 # learning rate
n_iter <- 50 # number of iterations
path <- numeric(n_iter)
for (i in 1:n_iter) {
  path[i] <- x
  x <- x - alpha * grad_f(x)  # the fundamental rule
}
cat("Starting point: 10\n")
cat("After 50 iterations: x =", round(x, 8), "\n")
cat("Function value:", round(f(x), 10), "\n")
# Visualization
curve(x^2, from = -11, to = 11, lwd = 2, col = "steelblue",
      main = "Gradient descent on f(x) = x^2",
      xlab = "x", ylab = "f(x)")
points(path, path^2, col = "red", pch = 19, cex = 0.7)
lines(path, path^2, col = "red", lty = 2)

Let’s verify in Python:
import numpy as np
import matplotlib.pyplot as plt
# Gradient descent on f(x) = x^2
f = lambda x: x**2 # objective function
grad_f = lambda x: 2 * x # derivative (gradient in 1D)
x = 10.0 # starting point
alpha = 0.1 # learning rate
n_iter = 50 # number of iterations
path = np.zeros(n_iter)
for i in range(n_iter):
    path[i] = x
    x = x - alpha * grad_f(x)  # the fundamental rule
print(f"Starting point: 10")
print(f"After 50 iterations: x = {x:.8f}")
print(f"Function value: {f(x):.10f}")
# Visualization
xs = np.linspace(-11, 11, 200)
plt.plot(xs, xs**2, lw=2, color="steelblue")
plt.plot(path, path**2, "ro--", markersize=4)
plt.title("Gradient descent on f(x) = x\u00b2")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

Running this code, you can see the algorithm start from x = 10 and quickly converge toward x = 0. The first steps are large (the slope is steep), then they shrink as we approach the bottom of the parabola. After 50 iterations, x is essentially zero.
Back to our blindfolded person. How big is the step they take at each iteration? This is exactly the question of the learning rate (α), and the answer is less obvious than it seems.
Steps too small (α very low): the person moves with extreme caution, shifting their foot just a few centimeters at a time. They’ll eventually reach the bottom of the valley, but it could take forever. In machine learning, this means extremely long training times and high computational costs.
Steps too large (α too high): the person takes enormous leaps. Instead of gently descending into the valley, they overshoot it, end up on the other side, bounce back, and keep oscillating without ever settling. In extreme cases, the jumps get bigger and bigger and the person ends up higher than where they started. In machine learning, this is called divergence: the model gets worse instead of better.
Steps just right: a good learning rate allows you to descend quickly without oscillating. In practice, finding the right value requires experimentation. It’s one of the most artisanal aspects of machine learning.
How does the blindfolded person know they’ve arrived? They feel the ground is flat in all directions: the gradient is (nearly) zero. In practice, the algorithm stops when the improvement between iterations becomes negligible, or when it has reached a maximum number of iterations.
The most common stopping criteria are:

- the gradient (or its norm) falls below a small threshold: the ground is essentially flat;
- the improvement in the cost function between one iteration and the next becomes negligible;
- a maximum number of iterations is reached.
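As a minimal sketch, here is the same descent on f(x) = x², this time with an explicit stopping rule: quit as soon as the gradient is (nearly) zero, with a fallback cap on iterations. The tolerance value is our own choice:

```python
# Gradient descent on f(x) = x^2 with a convergence check
grad_f = lambda x: 2 * x

x = 10.0
alpha = 0.1
tol = 1e-8          # "flat ground" threshold (our choice)
max_iter = 10_000   # safety cap

for i in range(max_iter):
    g = grad_f(x)
    if abs(g) < tol:        # stopping criterion: gradient ~ 0
        break
    x = x - alpha * g

print(i, x)   # stops long before max_iter, with x essentially zero
```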
This R code shows the effect of three different learning rate values on the same function:
# Comparing three learning rates on f(x) = x^2
gradient_descent <- function(x0, alpha, n_iter = 30) {
  x <- x0
  path <- numeric(n_iter)
  for (i in 1:n_iter) {
    path[i] <- x
    x <- x - alpha * 2 * x  # theta_new = theta_old - alpha * grad
  }
  return(path)
}
x0 <- 8 # same starting point for all
# Three different learning rates
slow <- gradient_descent(x0, alpha = 0.01) # too small
right <- gradient_descent(x0, alpha = 0.1) # good compromise
fast <- gradient_descent(x0, alpha = 0.9) # nearly unstable
# Visualization
par(mfrow = c(1, 3))
# alpha = 0.01 (too slow)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.01 (too slow)")))
points(slow, slow^2, col = "red", pch = 19, cex = 0.6)
lines(slow, slow^2, col = "red", lty = 2)
# alpha = 0.1 (just right)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.1 (good compromise)")))
points(right, right^2, col = "darkgreen", pch = 19, cex = 0.6)
lines(right, right^2, col = "darkgreen", lty = 2)
# alpha = 0.9 (nearly unstable)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.9 (nearly unstable)")))
points(fast, fast^2, col = "orange", pch = 19, cex = 0.6)
lines(fast, fast^2, col = "orange", lty = 2)
par(mfrow = c(1, 1))

Let’s compare in Python:
import numpy as np
import matplotlib.pyplot as plt
def gradient_descent(x0, alpha, n_iter=30):
    x = x0
    path = np.zeros(n_iter)
    for i in range(n_iter):
        path[i] = x
        x = x - alpha * 2 * x  # theta_new = theta_old - alpha * grad
    return path
x0 = 8.0 # same starting point for all
slow = gradient_descent(x0, alpha=0.01) # too small
right = gradient_descent(x0, alpha=0.1) # good compromise
fast = gradient_descent(x0, alpha=0.9) # nearly unstable
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
xs = np.linspace(-10, 10, 200)
for ax, data, color, title in zip(axes,
                                  [slow, right, fast],
                                  ["red", "darkgreen", "orange"],
                                  ["α = 0.01 (too slow)", "α = 0.1 (good compromise)",
                                   "α = 0.9 (nearly unstable)"]):
    ax.plot(xs, xs**2, lw=2, color="steelblue")
    ax.plot(data, data**2, "o--", color=color, markersize=4)
    ax.set_title(title)
plt.tight_layout()
plt.show()

With α = 0.01, the red dots move sluggishly: after 30 iterations we’re still far from the minimum. With α = 0.1, convergence is fast and clean. With α = 0.9, the algorithm oscillates visibly at each step, bouncing from one side of the parabola to the other before settling — a slightly higher learning rate would diverge entirely.
The mathematical landscape of a real model is not a nice symmetric parabola. It’s wild terrain, with secondary valleys, ridges, plateaus, and shapes that defy imagination. Here are the classic problems, explained through our landscape analogy.
Imagine terrain with multiple dips: one deep valley (the global minimum) and several shallower hollows (local minima). The blindfolded person has no way of knowing whether the valley they’re in is the deepest one. They feel flat ground beneath their feet and stop, convinced they’ve arrived. But they might be in a shallow hollow while the true minimum lies somewhere else entirely.
In practice, this is a less severe problem than once thought. Modern neural networks have so many parameters that local minima tend to have objective function values similar to the global minimum. It’s like terrain with many valleys, but all at roughly the same altitude: ending up in any one of them works fine.
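You can see the dependence on the starting point with a small sketch. The function f(x) = x⁴ − 3x² + x is our own toy example: it has a deep valley near x ≈ −1.30 and a shallower one near x ≈ +1.13, and plain gradient descent simply rolls into whichever valley is downhill from where it starts:

```python
# Toy function (our choice) with two valleys: f(x) = x^4 - 3x^2 + x
f = lambda x: x**4 - 3 * x**2 + x
grad_f = lambda x: 4 * x**3 - 6 * x + 1

def descend(x0, alpha=0.01, n_iter=1000):
    x = x0
    for _ in range(n_iter):
        x = x - alpha * grad_f(x)   # the usual update rule
    return x

left = descend(-2.0)    # ends near x ~ -1.30 (the deeper valley)
right = descend(2.0)    # ends near x ~ +1.13 (the shallower valley)
print(left, f(left))
print(right, f(right))
```

Both runs stop on flat ground, but only one of them has found the global minimum.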
A more insidious problem is that of saddle points. Imagine sitting on a horse saddle: if you move forward or backward, you descend; if you move left or right, you ascend. At that point the gradient is zero — the terrain feels flat — but you’re not at a minimum. You’re at a point that is a minimum in some directions and a maximum in others.
In high-dimensional spaces, saddle points are far more common than local minima. Fortunately, modern variants of gradient descent (with some noise or momentum, as we’ll see) generally manage to escape saddle points.
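A sketch of the escape mechanism, on the classic saddle f(x, y) = x² − y² (a toy function of our choosing, unbounded below, used only to illustrate the geometry). Started exactly at the saddle, plain gradient descent never moves; a microscopic nudge — the kind of noise SGD provides for free — grows at every step and slides the point off:

```python
# The classic saddle: f(x, y) = x^2 - y^2, gradient (2x, -2y).
# At (0, 0) the gradient is exactly zero.
grad = lambda x, y: (2 * x, -2 * y)

def descend(x, y, alpha=0.1, n_iter=80):
    for _ in range(n_iter):
        gx, gy = grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

stuck = descend(0.0, 0.0)        # stays exactly at the saddle
escaped = descend(0.0, 1e-6)     # the tiny nudge is amplified every step
print(stuck)     # (0.0, 0.0)
print(escaped)   # y has grown far away from the saddle
```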
Imagine a very narrow, elongated valley, like a canyon. The gradient points almost perpendicularly to the canyon walls, not along the canyon toward the bottom. The blindfolded person ends up bouncing from one wall to the other, making an inefficient zigzag instead of walking straight toward the bottom.
This happens when the problem’s variables have very different scales: some change rapidly, others slowly. It’s a common issue in practice, and one of the main motivations for advanced optimizers like Adam, which we’ll see in the final section.
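The zigzag is easy to reproduce. Here is a sketch on a toy "canyon" of our own choosing, f(x, y) = x² + 25y²: the y-direction is 25 times steeper than x, so a step size that is barely safe for y is painfully slow for x, and y bounces from one canyon wall to the other at every iteration:

```python
# A toy "canyon": f(x, y) = x^2 + 25*y^2, gradient (2x, 50y)
grad = lambda x, y: (2 * x, 50 * y)

x, y = 10.0, 1.0
alpha = 0.035          # close to the stability limit for the steep direction
ys = []
for _ in range(30):
    gx, gy = grad(x, y)
    x, y = x - alpha * gx, y - alpha * gy
    ys.append(y)

# Count how often y jumped across the canyon floor (changed sign)
sign_flips = sum(ys[i] * ys[i + 1] < 0 for i in range(len(ys) - 1))
print(x, sign_flips)   # x has barely moved toward 0; y flipped sign at every step
```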
So far we’ve seen gradient descent on a parabola: a problem with a single variable. It’s the simplest possible case. But the beauty of this algorithm is that it works in exactly the same way at any scale.
Here’s how the number of parameters grows as models become more complex:

- our house-price model: 1 parameter;
- a classic linear regression with many features: tens or hundreds of parameters;
- a small neural network: thousands to millions of parameters;
- a large language model like GPT-4: on the order of a trillion parameters.
The principle is identical: compute the gradient, take a step in the opposite direction, repeat. What changes is the scale of the computation. GPT-4’s gradient is a vector with over a trillion components, computed on billions of text fragments, using thousands of processors in parallel. But the formula is the same one we saw on the parabola.
Every time you interact with an AI system, gradient descent has been working behind the scenes:

- the recommendations that Netflix and Spotify serve you;
- the answers a voice assistant or a model like ChatGPT generates;
- the automatic translation of a sentence;
- the perception systems of self-driving cars.
The key message is this: the power of modern AI lies not in the complexity of the optimization algorithm, but in scale. Gradient descent is conceptually simple. What made the AI revolution possible is the ability to apply it to enormous models on enormous amounts of data, thanks to ever more powerful hardware.
Vanilla gradient descent — what we’ve seen so far — works, but it has the limitations we described: it can be slow, it can oscillate, it can get stuck. Over the years, researchers have developed variants that address these problems. Without diving into the formulas, here are the key ideas.
Instead of computing the gradient on the entire dataset at each step (computationally very expensive), SGD computes it on a small random sample (mini-batch). It’s as if the blindfolded person, instead of feeling the entire terrain around them, only sampled a few random points. The slope estimate is noisy but correct on average, and the computation speed is enormously greater. The noise, paradoxically, is also useful: it helps escape local minima and saddle points.
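Here is a minimal SGD sketch on the house-price example from earlier: at each step the gradient of the MSE is estimated on a random mini-batch of 2 houses instead of all 5. The batch size, learning rate, and random seed are our own choices; the learning rate is tiny because the areas (and hence the gradients) are large numbers:

```python
import numpy as np

rng = np.random.default_rng(42)               # seed for reproducibility
area = np.array([50, 70, 90, 120, 150], dtype=float)
price = np.array([150, 200, 260, 340, 400], dtype=float)

m = 1.0          # starting guess for the slope
alpha = 2e-5     # tiny: the gradients are on the order of tens of thousands
for _ in range(2000):
    idx = rng.choice(len(area), size=2, replace=False)   # random mini-batch
    a, p = area[idx], price[idx]
    grad = np.mean(-2 * a * (p - m * a))   # d(MSE)/dm estimated on the batch
    m = m - alpha * grad

print(m)   # hovers near the full-data optimum of about 2.78
```

Each batch gives a slightly different slope estimate, so m never settles exactly, but on average the noisy steps point the right way.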
Imagine a ball rolling down the hill instead of a person walking. The ball builds up velocity: if the slope continues in the same direction, it accelerates. If the slope changes direction, the ball slows down before reversing. This is momentum: the algorithm “remembers” the direction it was moving and adds the current gradient to it. The result is that it crosses flat zones faster and oscillates less in narrow valleys.
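A minimal sketch of the momentum update on the familiar f(x) = x², with a typical decay factor of 0.9 (our choice of hyperparameters):

```python
# Gradient descent with momentum on f(x) = x^2
grad_f = lambda x: 2 * x

x, v = 10.0, 0.0          # position and accumulated "velocity"
alpha, beta = 0.01, 0.9   # beta controls how much velocity survives each step
for _ in range(500):
    v = beta * v - alpha * grad_f(x)   # remember the direction of travel
    x = x + v                          # move with the accumulated velocity

print(x)   # very close to 0
```

On this easy parabola plain gradient descent would also do fine; momentum pays off in flat zones and narrow canyon-like valleys, where the accumulated velocity keeps the descent moving in the right direction.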
Adam (Adaptive Moment Estimation) combines the momentum idea with a learning rate that adapts automatically for each parameter. Parameters that change little get larger steps; those that change a lot get smaller steps. It’s as if the blindfolded person had smart shoes that adjust the step length based on the terrain under each foot.
Adam has become the de facto standard for training most modern neural networks. It’s robust, requires little manual tuning, and works well across a wide range of problems. Nearly all the models you use every day — from Spotify to ChatGPT — were trained with Adam or one of its variants.
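To make the idea tangible, here is a sketch of the Adam update rule applied to our f(x) = x², with the default hyperparameters suggested in the original paper (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the learning rate is our choice:

```python
import numpy as np

# Adam on f(x) = x^2
grad_f = lambda x: 2 * x

x = 10.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                    # first and second moment estimates
for t in range(1, 501):
    g = grad_f(x)
    m = beta1 * m + (1 - beta1) * g          # momentum-like running average
    v = beta2 * v + (1 - beta2) * g**2       # per-parameter scale of the gradient
    m_hat = m / (1 - beta1**t)               # bias correction for the warm-up
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive step

print(x)   # ends near the minimum at 0
```

Dividing by the running scale √v̂ is what makes the step size adapt per parameter: directions with consistently large gradients get smaller effective steps, and vice versa.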
The blindfolded person we started with now has better shoes. They have a ball that builds velocity instead of legs that take rigid steps. They have soles that automatically adapt to the terrain. And above all, they’re not walking on a mountain with two or three dimensions: they’re walking through a landscape with billions of dimensions.
But the principle is exactly the same. Feel the slope. Take a step where it goes down. Repeat.
Gradient descent is not a spectacular algorithm. It lacks the elegant complexity of a genetic algorithm or the narrative charm of adversarial networks. It’s a mechanical procedure, almost trivial. But it’s the mechanical procedure upon which the entire artificial intelligence revolution rests. From Netflix suggestions to image-generating models, from self-driving cars to simultaneous translation, it all comes down to this: a function to minimize, a gradient to compute, a step to take.
The next time a voice assistant understands your question, or an automatic translator nails a nuance, remember: behind the scenes, a very sophisticated version of our blindfolded person has walked billions of steps through a landscape with billions of dimensions. And found a valley deep enough to be useful.