A blindfolded person on a mountain
Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can’t see anything, but you can feel the slope of the ground beneath your feet. What do you do? You move in the direction where the ground goes down, one step at a time. If it slopes more steeply to the left, you go left. If it drops more to the right, you go right. With each step, you feel the slope again and redirect yourself.
This strategy, so simple and natural, is exactly what neural networks use to learn. Every time an AI model improves — learning to recognize a face, translate a sentence, or generate text — it does so by descending through a mathematical landscape, one step at a time, following the slope.
It’s called gradient descent, and it’s arguably the most important algorithm in modern machine learning.
The Math, Explained Geometrically
Derivative: the slope beneath your feet
When you have a function of a single variable — think of it as a trail that goes up and down — the derivative at a point tells you how steep the trail is at that point. If the derivative is positive, you’re going uphill. If it’s negative, you’re going downhill. If it’s zero, you’re on flat ground: perhaps a peak, perhaps a dip.
Concrete example: the function f(x) = x² describes a parabola. Its derivative is f'(x) = 2x. If you’re at x = 3, the derivative is 6: you’re climbing steeply. At x = -1, the derivative is -2: you’re going downhill. At x = 0, the derivative is zero: you’re at the lowest point.
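You can sanity-check these numbers in R with a central finite difference (a numerical approximation for illustration; the step h is arbitrary and small):

```r
# Approximate f'(x) numerically and compare with the exact derivative 2x
f <- function(x) x^2
num_deriv <- function(f, x, h = 1e-6) (f(x + h) - f(x - h)) / (2 * h)

num_deriv(f, 3)   # ~ 6: climbing steeply
num_deriv(f, -1)  # ~ -2: going downhill
num_deriv(f, 0)   # ~ 0: flat ground
```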
The gradient: a multidimensional compass
In practice, the functions we care about don’t depend on just one variable. A machine learning model can have hundreds, thousands, or billions of parameters. The landscape is no longer a trail but a surface in a high-dimensional space — impossible to visualize, but perfectly manageable with mathematics.
The gradient is the vector that collects all partial derivatives of the function with respect to each parameter. If the function depends on two variables (x, y), the gradient is:
∇f = (∂f/∂x, ∂f/∂y)
If it depends on a thousand variables, the gradient is a vector with a thousand components. In every case, the gradient points in the direction of steepest ascent. To find the minimum, just go in the opposite direction: minus the gradient.
Here’s the fundamental formula of gradient descent:
θ_new = θ_old − α · ∇f(θ_old)
Let’s break it down piece by piece:
- θ represents the model’s parameters — the “knobs” the algorithm adjusts to improve
- ∇f(θ) is the gradient: it indicates the direction of steepest ascent at the current point
- α (alpha) is the learning rate: the size of the step we take at each iteration
- The minus sign makes us go in the opposite direction of the gradient, i.e. downhill
That’s all there is to it. Compute where you’re going uphill, take a step in the opposite direction, repeat.
A numerical example in R
Let’s see gradient descent in action on the function f(x) = x². We know the minimum is at x = 0. Can the algorithm find it starting from a random point?
# Gradient descent on f(x) = x^2
# The derivative is f'(x) = 2x
f <- function(x) x^2          # objective function
grad_f <- function(x) 2 * x   # derivative (gradient in 1D)

x <- 10       # starting point
alpha <- 0.1  # learning rate
n_iter <- 50  # number of iterations

path <- numeric(n_iter)
for (i in 1:n_iter) {
  path[i] <- x
  x <- x - alpha * grad_f(x)  # the fundamental rule
}

cat("Starting point: 10\n")
cat("After 50 iterations: x =", round(x, 8), "\n")
cat("Function value:", round(f(x), 10), "\n")

# Visualization
curve(x^2, from = -11, to = 11, lwd = 2, col = "steelblue",
      main = "Gradient descent on f(x) = x^2",
      xlab = "x", ylab = "f(x)")
points(path, path^2, col = "red", pch = 19, cex = 0.7)
lines(path, path^2, col = "red", lty = 2)
Running this code, you can see the algorithm start from x = 10 and quickly converge toward x = 0. The first steps are large (the slope is steep), then they shrink as we approach the bottom of the parabola. After 50 iterations, x is essentially zero.
The Learning Rate: Big Steps or Small Steps?
Back to our blindfolded person. How big is the step they take at each iteration? This is exactly the question of the learning rate (α), and the answer is less obvious than it seems.
Steps too small (α very low): the person moves with extreme caution, shifting their foot just a few centimeters at a time. They’ll eventually reach the bottom of the valley, but it could take forever. In machine learning, this means extremely long training times and high computational costs.
Steps too large (α too high): the person takes enormous leaps. Instead of gently descending into the valley, they overshoot it, end up on the other side, bounce back, and keep oscillating without ever settling. In extreme cases, the jumps get bigger and bigger and the person ends up higher than where they started. In machine learning, this is called divergence: the model gets worse instead of better.
Steps just right: a good learning rate allows you to descend quickly without oscillating. In practice, finding the right value requires experimentation. It’s one of the most artisanal aspects of machine learning.
Convergence: knowing when to stop
How does the blindfolded person know they’ve arrived? They feel the ground is flat in all directions: the gradient is (nearly) zero. In practice, the algorithm stops when the improvement between iterations becomes negligible, or when it has reached a maximum number of iterations.
The most common stopping criteria are:
- The gradient norm drops below a minimum threshold (the terrain is nearly flat)
- The difference in f(θ) between two consecutive iterations is smaller than a fixed tolerance
- The maximum number of iterations has been reached (computational budget exhausted)
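A minimal sketch combining the three criteria on our parabola (the threshold values are illustrative, not recommendations):

```r
# Gradient descent with explicit stopping criteria
f <- function(x) x^2
grad_f <- function(x) 2 * x

x <- 10
alpha <- 0.1
tol <- 1e-10       # tolerance on the improvement in f
grad_min <- 1e-8   # threshold on the gradient norm
max_iter <- 1000   # computational budget

for (i in 1:max_iter) {
  g <- grad_f(x)
  if (abs(g) < grad_min) break            # terrain nearly flat
  x_new <- x - alpha * g
  if (abs(f(x) - f(x_new)) < tol) break   # negligible improvement
  x <- x_new
}
cat("Stopped after", i, "iterations at x =", x, "\n")
```

In practice the budget criterion always comes last: it guarantees the loop ends even if the other two thresholds are never met.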
The effect of the learning rate: a visual comparison
This R code shows the effect of three different learning rate values on the same function:
# Comparing three learning rates on f(x) = x^2
gradient_descent <- function(x0, alpha, n_iter = 30) {
  x <- x0
  path <- numeric(n_iter)
  for (i in 1:n_iter) {
    path[i] <- x
    x <- x - alpha * 2 * x  # theta_new = theta_old - alpha * grad
  }
  return(path)
}

x0 <- 8  # same starting point for all

# Three different learning rates
slow  <- gradient_descent(x0, alpha = 0.01)  # too small
right <- gradient_descent(x0, alpha = 0.1)   # good compromise
fast  <- gradient_descent(x0, alpha = 0.9)   # nearly unstable

# Visualization
par(mfrow = c(1, 3))

# alpha = 0.01 (too slow)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.01 (too slow)")))
points(slow, slow^2, col = "red", pch = 19, cex = 0.6)
lines(slow, slow^2, col = "red", lty = 2)

# alpha = 0.1 (just right)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.1 (good compromise)")))
points(right, right^2, col = "darkgreen", pch = 19, cex = 0.6)
lines(right, right^2, col = "darkgreen", lty = 2)

# alpha = 0.9 (nearly unstable)
curve(x^2, from = -10, to = 10, lwd = 2, col = "steelblue",
      main = expression(paste(alpha, " = 0.9 (nearly unstable)")))
points(fast, fast^2, col = "orange", pch = 19, cex = 0.6)
lines(fast, fast^2, col = "orange", lty = 2)

par(mfrow = c(1, 1))
With α = 0.01, the red dots move sluggishly: after 30 iterations we’re still far from the minimum. With α = 0.1, convergence is fast and clean. With α = 0.9, the algorithm oscillates visibly at each step, bouncing from one side of the parabola to the other before settling — a slightly higher learning rate would diverge entirely.
What Can Go Wrong: Geometric Intuition
The mathematical landscape of a real model is not a nice symmetric parabola. It’s wild terrain, with secondary valleys, ridges, plateaus, and shapes that defy imagination. Here are the classic problems, explained through our landscape analogy.
Local minima: secondary valleys
Imagine terrain with multiple dips: one deep valley (the global minimum) and several shallower hollows (local minima). The blindfolded person has no way of knowing whether the valley they’re in is the deepest one. They feel flat ground beneath their feet and stop, convinced they’ve arrived. But they might be in a shallow hollow while the true minimum lies somewhere else entirely.
In practice, this is a less severe problem than once thought. Modern neural networks have so many parameters that local minima tend to have objective function values similar to the global minimum. It’s like terrain with many valleys, but all at roughly the same altitude: ending up in any one of them works fine.
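A one-variable toy landscape makes this concrete. The function below is invented purely for illustration: it has a shallow hollow near x = 1 and a deeper valley near x = -1, and where the algorithm ends up depends only on where it starts:

```r
# Two valleys: a shallow local minimum near x = 1,
# the deeper global minimum near x = -1
f <- function(x) (x^2 - 1)^2 + 0.2 * x
grad_f <- function(x) 4 * x * (x^2 - 1) + 0.2

descend <- function(x, alpha = 0.01, n_iter = 500) {
  for (i in 1:n_iter) x <- x - alpha * grad_f(x)
  x
}

descend(2)    # stops near x = 1: the shallow hollow
descend(-2)   # stops near x = -1: the true valley
f(descend(2)) > f(descend(-2))  # TRUE: flat ground in both, different depths
```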
Saddle points: the horse saddle
A more insidious problem is the saddle point. Imagine sitting on a horse saddle: move forward or backward and you descend; move left or right and you ascend. At that point the gradient is zero (the terrain feels flat), but you’re not at a minimum. You’re at a point that is a minimum in some directions and a maximum in others.
In high-dimensional spaces, saddle points are far more common than local minima. Fortunately, modern variants of gradient descent (with some noise or momentum, as we’ll see) generally manage to escape saddle points.
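The standard textbook example is f(x, y) = x² − y². This sketch shows both the flat gradient at the origin and how a tiny perturbation (playing the role of the noise mentioned above) lets the algorithm slide off:

```r
# The classic saddle: f(x, y) = x^2 - y^2
# Gradient: (2x, -2y). Exactly zero at the origin, yet the origin
# is a minimum along x and a maximum along y.
grad_f <- function(p) c(2 * p[1], -2 * p[2])

grad_f(c(0, 0))  # (0, 0): the terrain feels perfectly flat here

# Started exactly on the x-axis, plain gradient descent walks straight
# to the origin and stops. A tiny nudge off the axis is enough
# to slide off the saddle along y.
p <- c(1, 1e-6)
alpha <- 0.1
for (i in 1:100) p <- p - alpha * grad_f(p)
p  # x has collapsed to ~0, while y has grown: we escaped downhill
```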
Narrow valleys: the zigzag
Imagine a very narrow, elongated valley, like a canyon. The gradient points almost perpendicularly to the canyon walls, not along the canyon toward the bottom. The blindfolded person ends up bouncing from one wall to the other, making an inefficient zigzag instead of walking straight toward the bottom.
This happens when the problem’s variables have very different scales: some change rapidly, others slowly. It’s a common issue in practice, and one of the main motivations for advanced optimizers like Adam, which we’ll see in the final section.
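A toy two-variable canyon shows the zigzag (the coefficients are invented for illustration). The steep cross direction caps the usable learning rate, so progress along the canyon floor is painfully slow:

```r
# A canyon: f(x, y) = x^2 + 25 y^2 is far steeper across (y) than along (x)
grad_f <- function(p) c(2 * p[1], 50 * p[2])

# Stability in the steep direction requires alpha < 2/50 = 0.04,
# and that direction dictates the step size for both
p <- c(5, 1)
alpha <- 0.035
path_y <- numeric(20)
for (i in 1:20) {
  path_y[i] <- p[2]
  p <- p - alpha * grad_f(p)
}
sign(path_y[1:6])  # 1 -1 1 -1 1 -1: bouncing from wall to wall
p                  # x is still far from 0 while y has already settled
```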
From a Parabola to ChatGPT
So far we’ve seen gradient descent on a parabola: a problem with a single variable. It’s the simplest possible case. But the beauty of this algorithm is that it works in exactly the same way at any scale.
The scale of parameters
Here’s how the number of parameters grows as models become more complex:
- Simple linear regression: 2 parameters (slope and intercept). The landscape is a 3D surface, easy to visualize.
- Neural network for handwritten digit recognition: ~100,000 parameters. The landscape has 100,000 dimensions.
- ResNet-50 (image classification, 2015): ~25 million parameters.
- GPT-3 (ChatGPT’s predecessor): 175 billion parameters.
- GPT-4 and frontier models (2023-2025): estimated over one trillion parameters.
The principle is identical: compute the gradient, take a step in the opposite direction, repeat. What changes is the scale of the computation. GPT-4’s gradient is a vector with over a trillion components, computed on billions of text fragments, using thousands of processors in parallel. But the formula is the same one we saw on the parabola.
Where you see it in action (without knowing it)
Every time you interact with an AI system, gradient descent has been working behind the scenes:
- Netflix and Spotify recommending what to watch or listen to: recommendation models are trained with gradient descent on billions of user interactions
- Google Translate and automatic translators: neural networks with hundreds of millions of parameters, optimized with gradient descent on massive parallel text corpora
- Voice assistants (Siri, Alexa): speech recognition uses deep neural networks, trained with the same algorithm
- Self-driving cars: the networks that recognize pedestrians, traffic lights, and lane markings are trained with gradient descent variants
- ChatGPT, Claude, Gemini: Large Language Models are the most extreme case — gradient descent applied to billions of parameters on trillions of text tokens
The key message is this: the power of modern AI lies not in the complexity of the optimization algorithm, but in scale. Gradient descent is conceptually simple. What made the AI revolution possible is the ability to apply it to enormous models on enormous amounts of data, thanks to ever more powerful hardware.
The Evolutions: Better Shoes for Our Explorer
Vanilla gradient descent — what we’ve seen so far — works, but it has the limitations we described: it can be slow, it can oscillate, it can get stuck. Over the years, researchers have developed variants that address these problems. Without diving into the formulas, here are the key ideas.
Stochastic Gradient Descent (SGD)
Instead of computing the gradient on the entire dataset at each step (computationally very expensive), SGD computes it on a small random sample (mini-batch). It’s as if the blindfolded person, instead of feeling the entire terrain around them, only sampled a few random points. The slope estimate is noisy but correct on average, and the computation speed is enormously greater. The noise, paradoxically, is also useful: it helps escape local minima and saddle points.
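A minimal sketch of the idea, fitting a one-parameter line y = a·x by least squares on synthetic data (the dataset, batch size, and learning rate are all invented for illustration):

```r
# SGD: each step uses a small random mini-batch, not the full dataset
set.seed(42)
n <- 10000
x <- runif(n)
y <- 3 * x + rnorm(n, sd = 0.1)   # true slope: 3

a <- 0            # parameter to learn
alpha <- 0.1
batch_size <- 32
for (step in 1:2000) {
  idx <- sample(n, batch_size)                       # a few random points
  grad <- mean(2 * (a * x[idx] - y[idx]) * x[idx])   # noisy, unbiased estimate
  a <- a - alpha * grad
}
a  # close to the true slope, 3
```

Each gradient here looks at 32 points instead of 10,000, so a step costs about 0.3% of a full-batch step.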
Momentum
Imagine a ball rolling down the hill instead of a person walking. The ball builds up velocity: if the slope continues in the same direction, it accelerates. If the slope changes direction, the ball slows down before reversing. This is momentum: the algorithm “remembers” the direction it was moving and adds the current gradient to it. The result is that it crosses flat zones faster and oscillates less in narrow valleys.
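In code, momentum is a two-line change to the update: keep a velocity that blends the previous velocity with the current slope, then step along the velocity (a sketch of the classic "heavy ball" form; beta controls how much past direction is remembered):

```r
# Momentum on f(x) = x^2: the ball accumulates velocity
grad_f <- function(x) 2 * x

x <- 10
v <- 0          # velocity
alpha <- 0.1    # learning rate
beta <- 0.9     # memory of past direction
for (i in 1:200) {
  v <- beta * v + grad_f(x)   # blend old velocity with current slope
  x <- x - alpha * v          # step along the velocity, not the raw gradient
}
x  # near 0
```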
Adam: the Swiss army knife
Adam (Adaptive Moment Estimation) combines the momentum idea with a learning rate that adapts automatically for each parameter. Parameters that change little get larger steps; those that change a lot get smaller steps. It’s as if the blindfolded person had smart shoes that adjust the step length based on the terrain under each foot.
Adam has become the de facto standard for training most modern neural networks. It’s robust, requires little manual tuning, and works well across a wide range of problems. Nearly all the models you use every day — from Spotify to ChatGPT — were trained with Adam or one of its variants.
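For the curious, here is what the Adam update looks like on our parabola, written out in full (a sketch using the default hyperparameters from the original paper; in practice a library does this for you):

```r
# Adam on f(x) = x^2
grad_f <- function(x) 2 * x

x <- 10
alpha <- 0.1
beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8   # standard defaults
m <- 0; v <- 0
for (t in 1:500) {
  g <- grad_f(x)
  m <- beta1 * m + (1 - beta1) * g       # momentum: average of gradients
  v <- beta2 * v + (1 - beta2) * g^2     # average of squared gradients
  m_hat <- m / (1 - beta1^t)             # bias correction for early steps
  v_hat <- v / (1 - beta2^t)
  x <- x - alpha * m_hat / (sqrt(v_hat) + eps)   # per-parameter adaptive step
}
x  # near 0
```

Dividing by sqrt(v_hat) is what makes the step adaptive: parameters with a history of large gradients get smaller steps, and vice versa.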
Back to the Mountain
The blindfolded person we started with now has better shoes. They have a ball that builds velocity instead of legs that take rigid steps. They have soles that automatically adapt to the terrain. And above all, they’re not walking on a mountain with two or three dimensions: they’re walking through a landscape with billions of dimensions.
But the principle is exactly the same. Feel the slope. Take a step where it goes down. Repeat.
Gradient descent is not a spectacular algorithm. It lacks the elegant complexity of a genetic algorithm or the narrative charm of adversarial networks. It’s a mechanical procedure, almost trivial. But it’s the mechanical procedure upon which the entire artificial intelligence revolution rests. From Netflix suggestions to image-generating models, from self-driving cars to simultaneous translation, it all comes down to this: a function to minimize, a gradient to compute, a step to take.
The next time a voice assistant understands your question, or an automatic translator nails a nuance, remember: behind the scenes, a very sophisticated version of our blindfolded person has walked billions of steps through a landscape with billions of dimensions. And found a valley deep enough to be useful.