{"id":3266,"date":"2023-08-21T13:25:14","date_gmt":"2023-08-21T12:25:14","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3266"},"modified":"2026-03-15T16:16:54","modified_gmt":"2026-03-15T15:16:54","slug":"the-gradient-descent-algorithm-explained-simply","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/the-gradient-descent-algorithm-explained-simply\/","title":{"rendered":"The Gradient Descent Algorithm Explained Clearly: From Intuition to Practice"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">A blindfolded person on a mountain<\/h2>\n\n\n\n<p>Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can&#8217;t see anything, but you can feel the slope of the ground beneath your feet. What do you do? You move in the direction where the ground goes down, one step at a time. If it slopes more steeply to the left, you go left. If it drops more to the right, you go right. With each step, you feel the slope again and redirect yourself.<\/p>\n\n\n\n<p>This strategy, so simple and natural, is exactly what neural networks use to learn. 
Every time an AI model improves &mdash; learning to recognize a face, translate a sentence, or generate text &mdash; it does so by descending through a mathematical landscape, one step at a time, following the slope.<\/p>\n\n\n\n<p>It&#8217;s called <strong>gradient descent<\/strong>, and it&#8217;s arguably the most important algorithm in modern machine learning.<\/p>\n\n\n\n<figure style=\"margin: 1.5em 0;text-align: center\"><img decoding=\"async\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2026\/03\/slide-esploratore-bendato.jpg\" alt=\"Infographic: the blindfolded explorer metaphor for gradient descent, with three steps: Sensor, Action, Cycle\" style=\"max-width: 100%;border: 1px solid #e0e0e0;border-radius: 6px\" \/><\/figure>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\" id=\"storia\">From Cauchy to Neural Networks: A Brief History<\/h2>\n\n\n\n<p>The idea of following the slope to find a minimum has surprisingly ancient roots. In <strong>1847<\/strong>, the French mathematician Augustin-Louis Cauchy published a method for solving systems of equations that, in essence, is already gradient descent: compute the direction of steepest ascent of a function and move in the opposite direction. Cauchy was not thinking about neural networks &mdash; their appearance was more than a century away &mdash; but he had formalized the principle that still powers artificial intelligence today.<\/p>\n\n\n\n<p>For over a century the idea remained confined to pure mathematics. The breakthrough came in <strong>1951<\/strong>, when Herbert Robbins and Sutton Monro proposed a <strong>stochastic<\/strong> version of the method: instead of computing the gradient on the entire problem, it is estimated on a random sample. This is the seed of what we now call <em>Stochastic Gradient Descent<\/em> (SGD) &mdash; we will discuss it in the final section. 
A few years later, in <strong>1958<\/strong>, Frank Rosenblatt built the <strong>perceptron<\/strong>, the first machine learning model to use gradient descent to learn from data. The excitement was enormous, but short-lived: in 1969, Minsky and Papert demonstrated the perceptron&#8217;s limitations, and interest in neural networks collapsed. This was the beginning of the first so-called <em>AI winter<\/em>.<\/p>\n\n\n\n<p>The revival came in <strong>1986<\/strong>, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published the <strong>backpropagation<\/strong> algorithm: an efficient way to compute the gradient in neural networks with many layers. It was the key that unlocked deep learning. From that moment on, gradient descent was no longer a theoretical exercise but the practical tool used to train neural networks. The final piece of this story was added by Diederik Kingma and Jimmy Ba in <strong>2014<\/strong>, with the <strong>Adam<\/strong> optimizer &mdash; which we will encounter again in the final section of this article. 
Nearly two centuries separate Cauchy from Adam, yet the principle has remained the same: feel the slope, take a step in the opposite direction.<\/p>\n\n\n\n<div style=\"border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px\">\n<h3 style=\"margin-top: 0\">What We&#8217;ll Cover<\/h3>\n<ul>\n<li><a href=\"#storia\">From Cauchy to Neural Networks: A Brief History<\/a><\/li>\n<li><a href=\"#the-math\">The Math, Explained Geometrically<\/a><\/li>\n<li><a href=\"#learning-rate\">The Learning Rate and Convergence<\/a><\/li>\n<li><a href=\"#pitfalls\">What Can Go Wrong: Geometric Intuition<\/a><\/li>\n<li><a href=\"#scale\">From a Parabola to ChatGPT<\/a><\/li>\n<li><a href=\"#evolutions\">The Evolutions: Better Shoes for Our Explorer<\/a><\/li>\n<\/ul>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-math\">The Math, Explained Geometrically<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">The Cost Function: Measuring How Wrong the Model Is<\/h3>\n\n\n\n<p>Before we descend, we need to know <strong>what<\/strong> we are minimizing. The blindfolded person is looking for the lowest point in the valley &mdash; but in machine learning, what exactly is that valley?<\/p>\n\n\n\n<p>Let&#8217;s take a concrete example. Suppose we want to predict the price of a house knowing only its area. We have five houses for which we know the area and the actual price, and our model is the simplest possible: a line through the origin, <strong>price = m &times; area<\/strong>, where <em>m<\/em> is the only parameter to adjust.<\/p>\n\n\n\n<p>For each value of <em>m<\/em>, the model makes a prediction. If <em>m<\/em> is too low, the predictions underestimate the actual prices; if it is too high, they overestimate them. 
We need a number that tells us <em>how much<\/em> the model is getting wrong: this is the <strong>cost function<\/strong> (or <em>loss function<\/em>).<\/p>\n\n\n\n<p>The most widely used cost function is the <strong>Mean Squared Error<\/strong> (MSE): for each house, we compute the difference between the predicted price and the actual price, square it, and take the average of all these errors. In formula:<\/p>\n\n\n\n<p><strong>MSE(m) = (1\/n) &sum;<sub>i<\/sub> (price<sub>i<\/sub> &minus; m &times; area<sub>i<\/sub>)&sup2;<\/strong><\/p>\n\n\n\n<p>If we plot MSE(m) as <em>m<\/em> varies, we get a parabola-shaped curve: the same parabola we will use shortly as an example of gradient descent. This is no coincidence &mdash; <strong>the parabola is the cost function<\/strong>, and its lowest point is the value of <em>m<\/em> that makes the model as good as possible.<\/p>\n\n\n\n<p>The examples that follow are available in both R and Python &mdash; choose whichever language you&#8217;re more comfortable with.<\/p>\n\n\n\n<p>Let&#8217;s compute the cost function in R for our house example:<\/p>\n\n\n\n
<pre><code class=\"language-r\"># Cost function: predicting the price of a house\narea &lt;- c(50, 70, 90, 120, 150)\nprice &lt;- c(150, 200, 260, 340, 400)  # thousands of euros\n\n# Model: price = m * area\n# MSE cost function as m varies\nm_values &lt;- seq(1, 4, by = 0.01)\nmse &lt;- sapply(m_values, function(m) mean((price - m * area)^2))\n\nplot(m_values, mse, type = \"l\", lwd = 2, col = \"steelblue\",\n     xlab = \"m (slope)\", ylab = \"MSE\",\n     main = \"Cost function as m varies\")\nm_best &lt;- m_values[which.min(mse)]\nabline(v = m_best, col = \"red\", lty = 2)\ncat(\"The value of m that minimizes the error:\", round(m_best, 2), \"\\n\")<\/code><\/pre>\n\n\n\n<p>Let&#8217;s verify in Python:<\/p>\n\n\n\n
<pre><code class=\"language-python\">import numpy as np\nimport matplotlib.pyplot as plt\n\n# Cost function: predicting the price of a house\narea = np.array([50, 70, 90, 120, 150])\nprice = np.array([150, 200, 260, 340, 400])  # thousands of euros\n\n# Model: price = m * area\n# MSE cost function as m varies\nm_values = np.linspace(1, 4, 301)  # equivalent to seq(1, 4, by=0.01) in R\nmse = np.array([np.mean((price - m * area)**2) for m in m_values])\n\nplt.plot(m_values, mse, lw=2, color=\"steelblue\")\nplt.xlabel(\"m (slope)\")\nplt.ylabel(\"MSE\")\nplt.title(\"Cost function as m varies\")\nm_best = m_values[np.argmin(mse)]\nplt.axvline(m_best, color=\"red\", linestyle=\"--\")\nplt.show()\nprint(f\"The value of m that minimizes the error: {m_best:.2f}\")<\/code><\/pre>\n\n\n\n<p>Now we know what to minimize: the cost function. The gradient tells us <em>how<\/em>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Derivative: the slope beneath your feet<\/h3>\n\n\n\n<p>When you have a function of a single variable &mdash; think of it as a trail that goes up and down &mdash; the <strong>derivative<\/strong> at a point tells you how steep the trail is at that point. If the derivative is positive, you&#8217;re going uphill. If it&#8217;s negative, you&#8217;re going downhill. If it&#8217;s zero, you&#8217;re on flat ground: perhaps a peak, perhaps a dip.<\/p>\n\n\n\n<p>Concrete example: the function f(x) = x&sup2; describes a parabola. Its derivative is f'(x) = 2x. If you&#8217;re at x = 3, the derivative is 6: you&#8217;re climbing steeply. At x = -1, the derivative is -2: you&#8217;re going downhill. 
At x = 0, the derivative is zero: you&#8217;re at the lowest point.<\/p>\n\n\n\n<figure style=\"margin: 1.5em 0;text-align: center\"><img decoding=\"async\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2026\/03\/slide-derivata-pendenza.jpg\" alt=\"Graph of the parabola f(x)=x\u00b2 with three annotated points: A (x=3, derivative=6, steep slope), B (x=-1, derivative=-2, gentle slope), C (x=0, derivative=0, minimum)\" style=\"max-width: 100%;border: 1px solid #e0e0e0;border-radius: 6px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">The gradient: a multidimensional compass<\/h3>\n\n\n\n<p>In practice, the functions we care about don&#8217;t depend on just one variable. A machine learning model can have hundreds, thousands, or billions of parameters. The landscape is no longer a trail but a surface in a high-dimensional space &mdash; impossible to visualize, but perfectly manageable with mathematics.<\/p>\n\n\n\n<p>The <strong>gradient<\/strong> is the vector that collects all partial derivatives of the function with respect to each parameter. If the function depends on two variables (x, y), the gradient is:<\/p>\n\n\n\n<p><strong>&nabla;f = (&#8706;f\/&#8706;x, &#8706;f\/&#8706;y)<\/strong><\/p>\n\n\n\n<p>If it depends on a thousand variables, the gradient is a vector with a thousand components. In every case, the gradient points in the direction of steepest ascent. 
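<\/p>\n\n\n\n<p>The multidimensional compass can be checked numerically. A minimal sketch, assuming the illustrative toy function f(x, y) = x&sup2; + y&sup2; (not part of the article&#8217;s running example): each component of the gradient is an ordinary one-variable derivative, so central finite differences along each axis should reproduce it.<\/p>

```python
import numpy as np

# Gradient of the toy function f(x, y) = x^2 + y^2 (illustrative).
# Analytically, grad f = (2x, 2y); we check it with finite differences.
def f(p):
    x, y = p
    return x**2 + y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 2 * y])  # (df/dx, df/dy)

def numerical_grad(func, p, h=1e-6):
    # Central differences: perturb one coordinate at a time
    g = np.zeros(len(p))
    for i in range(len(p)):
        step = np.zeros(len(p))
        step[i] = h
        g[i] = (func(p + step) - func(p - step)) / (2 * h)
    return g

p = np.array([3.0, -1.0])
print(grad_f(p))             # [ 6. -2.]
print(numerical_grad(f, p))  # matches to within rounding error
```

<p>Note the components: 6 is the slope of x&sup2; at x = 3 and &minus;2 is the slope of y&sup2; at y = &minus;1 &mdash; the same numbers from the parabola example, now stacked into one vector.<\/p>\n\n\n\n<p>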
To find the minimum, just go in the opposite direction: <strong>minus the gradient<\/strong>.<\/p>\n\n\n\n<p>Here&#8217;s the fundamental formula of gradient descent:<\/p>\n\n\n\n<p><strong>&theta;<sub>new<\/sub> = &theta;<sub>old<\/sub> &minus; &alpha; &middot; &nabla;f(&theta;)<\/strong><\/p>\n\n\n\n<p>Let&#8217;s break it down piece by piece:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>&theta;<\/strong> represents the model&#8217;s parameters &mdash; the &#8220;knobs&#8221; the algorithm adjusts to improve<\/li>\n<li><strong>&nabla;f(&theta;)<\/strong> is the gradient: it indicates the direction of steepest ascent at the current point<\/li>\n<li><strong>&alpha;<\/strong> (alpha) is the <em>learning rate<\/em>: the size of the step we take at each iteration<\/li>\n<li>The <strong>minus<\/strong> sign makes us go in the opposite direction of the gradient, i.e. downhill<\/li>\n<\/ul>\n\n\n\n<p>That&#8217;s all there is to it. Compute where you&#8217;re going uphill, take a step in the opposite direction, repeat.<\/p>\n\n\n\n<figure style=\"margin: 1.5em 0;text-align: center\"><img decoding=\"async\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2026\/03\/slide-anatomia-passo.jpg\" alt=\"Visual breakdown of the gradient descent formula: theta (model knobs), alpha (learning rate as caliper), nabla f (gradient as compass), minus sign (opposite direction)\" style=\"max-width: 100%;border: 1px solid #e0e0e0;border-radius: 6px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">A numerical example in R<\/h3>\n\n\n\n<p>Let&#8217;s see gradient descent in action on the function f(x) = x&sup2;. We know the minimum is at x = 0. 
Can the algorithm find it starting from a random point?<\/p>\n\n\n\n<pre><code class=\"language-r\"># Gradient descent on f(x) = x^2\n# The derivative is f'(x) = 2x\n\nf &lt;- function(x) x^2         # objective function\ngrad_f &lt;- function(x) 2 * x  # derivative (gradient in 1D)\n\nx &lt;- 10              # starting point\nalpha &lt;- 0.1          # learning rate\nn_iter &lt;- 50          # number of iterations\npath &lt;- numeric(n_iter)\n\nfor (i in 1:n_iter) {\n  path[i] &lt;- x\n  x &lt;- x - alpha * grad_f(x)  # the fundamental rule\n}\n\ncat(\"Starting point: 10\\n\")\ncat(\"After 50 iterations: x =\", round(x, 8), \"\\n\")\ncat(\"Function value:\", round(f(x), 10), \"\\n\")\n\n# Visualization\ncurve(x^2, from = -11, to = 11, lwd = 2, col = \"steelblue\",\n      main = \"Gradient descent on f(x) = x^2\",\n      xlab = \"x\", ylab = \"f(x)\")\npoints(path, path^2, col = \"red\", pch = 19, cex = 0.7)\nlines(path, path^2, col = \"red\", lty = 2)<\/code><\/pre>\n\n\n\n<p>Let&#8217;s verify in Python:<\/p>\n\n\n\n<pre><code class=\"language-python\">import numpy as np\nimport matplotlib.pyplot as plt\n\n# Gradient descent on f(x) = x^2\nf = lambda x: x**2           # objective function\ngrad_f = lambda x: 2 * x     # derivative (gradient in 1D)\n\nx = 10.0            # starting point\nalpha = 0.1          # learning rate\nn_iter = 50          # number of iterations\npath = np.zeros(n_iter)\n\nfor i in range(n_iter):\n    path[i] = x\n    x = x - alpha * grad_f(x)  # the fundamental rule\n\nprint(f\"Starting point: 10\")\nprint(f\"After 50 iterations: x = {x:.8f}\")\nprint(f\"Function value: {f(x):.10f}\")\n\n# Visualization\nxs = np.linspace(-11, 11, 200)\nplt.plot(xs, xs**2, lw=2, color=\"steelblue\")\nplt.plot(path, path**2, \"ro--\", markersize=4)\nplt.title(\"Gradient descent on f(x) = x\\u00b2\")\nplt.xlabel(\"x\")\nplt.ylabel(\"f(x)\")\nplt.show()<\/code><\/pre>\n\n\n\n<p>Running this code, you can see the algorithm start from x = 10 and quickly converge 
toward x = 0. The first steps are large (the slope is steep), then they shrink as we approach the bottom of the parabola. After 50 iterations, x is essentially zero.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"learning-rate\">The Learning Rate: Big Steps or Small Steps?<\/h2>\n\n\n\n<p>Back to our blindfolded person. How big is the step they take at each iteration? This is exactly the question of the <strong>learning rate<\/strong> (&alpha;), and the answer is less obvious than it seems.<\/p>\n\n\n\n<p><strong>Steps too small<\/strong> (&alpha; very low): the person moves with extreme caution, shifting their foot just a few centimeters at a time. They&#8217;ll eventually reach the bottom of the valley, but it could take forever. In machine learning, this means extremely long training times and high computational costs.<\/p>\n\n\n\n<p><strong>Steps too large<\/strong> (&alpha; too high): the person takes enormous leaps. Instead of gently descending into the valley, they overshoot it, end up on the other side, bounce back, and keep oscillating without ever settling. In extreme cases, the jumps get bigger and bigger and the person ends up higher than where they started. In machine learning, this is called <em>divergence<\/em>: the model gets worse instead of better.<\/p>\n\n\n\n<p><strong>Steps just right<\/strong>: a good learning rate allows you to descend quickly without oscillating. In practice, finding the right value requires experimentation. 
It&#8217;s one of the most artisanal aspects of machine learning.<\/p>\n\n\n\n<figure style=\"margin: 1.5em 0;text-align: center\"><img decoding=\"async\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2026\/03\/slide-learning-rate.jpg\" alt=\"Visual comparison of three learning rates on the parabola: alpha=0.01 too slow, alpha=0.1 efficient convergence, alpha=0.9 chaotic oscillation\" style=\"max-width: 100%;border: 1px solid #e0e0e0;border-radius: 6px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Convergence: knowing when to stop<\/h3>\n\n\n\n<p>How does the blindfolded person know they&#8217;ve arrived? They feel the ground is flat in all directions: the gradient is (nearly) zero. In practice, the algorithm stops when the improvement between iterations becomes negligible, or when it has reached a maximum number of iterations.<\/p>\n\n\n\n<p>The most common stopping criteria are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The gradient norm drops below a minimum threshold (the terrain is nearly flat)<\/li>\n<li>The difference in f(&theta;) between two consecutive iterations is smaller than a fixed tolerance<\/li>\n<li>The maximum number of iterations has been reached (computational budget exhausted)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">The effect of the learning rate: a visual comparison<\/h3>\n\n\n\n<p>This R code shows the effect of three different learning rate values on the same function:<\/p>\n\n\n\n
<pre><code class=\"language-r\"># Comparing three learning rates on f(x) = x^2\ngradient_descent &lt;- function(x0, alpha, n_iter = 30) {\n  x &lt;- x0\n  path &lt;- numeric(n_iter)\n  for (i in 1:n_iter) {\n    path[i] &lt;- x\n    x &lt;- x - alpha * 2 * x  # theta_new = theta_old - alpha * grad\n  }\n  return(path)\n}\n\nx0 &lt;- 8  # same starting point for all\n\n# Three different learning rates\nslow   &lt;- gradient_descent(x0, alpha = 0.01)   # too small\nright  &lt;- gradient_descent(x0, alpha = 0.1)    # good compromise\nfast   &lt;- gradient_descent(x0, alpha = 0.9)    # nearly unstable\n\n# Visualization\npar(mfrow = c(1, 3))\n\n# alpha = 0.01 (too slow)\ncurve(x^2, from = -10, to = 10, lwd = 2, col = \"steelblue\",\n      main = expression(paste(alpha, \" = 0.01 (too slow)\")))\npoints(slow, slow^2, col = \"red\", pch = 19, cex = 0.6)\nlines(slow, slow^2, col = \"red\", lty = 2)\n\n# alpha = 0.1 (just right)\ncurve(x^2, from = -10, to = 10, lwd = 2, col = \"steelblue\",\n      main = expression(paste(alpha, \" = 0.1 (good compromise)\")))\npoints(right, right^2, col = \"darkgreen\", pch = 19, cex = 0.6)\nlines(right, right^2, col = \"darkgreen\", lty = 2)\n\n# alpha = 0.9 (nearly unstable)\ncurve(x^2, from = -10, to = 10, lwd = 2, col = \"steelblue\",\n      main = expression(paste(alpha, \" = 0.9 (nearly unstable)\")))\npoints(fast, fast^2, col = \"orange\", pch = 19, cex = 0.6)\nlines(fast, fast^2, col = \"orange\", lty = 2)\n\npar(mfrow = c(1, 1))<\/code><\/pre>\n\n\n\n<p>Let&#8217;s compare in Python:<\/p>\n\n\n\n
<pre><code class=\"language-python\">import numpy as np\nimport matplotlib.pyplot as plt\n\ndef gradient_descent(x0, alpha, n_iter=30):\n    x = x0\n    path = np.zeros(n_iter)\n    for i in range(n_iter):\n        path[i] = x\n        x = x - alpha * 2 * x  # theta_new = theta_old - alpha * grad\n    return path\n\nx0 = 8.0  # same starting point for all\nslow  = gradient_descent(x0, alpha=0.01)   # too small\nright = gradient_descent(x0, alpha=0.1)    # good compromise\nfast  = gradient_descent(x0, alpha=0.9)    # nearly unstable\n\nfig, axes = plt.subplots(1, 3, figsize=(14, 4))\nxs = np.linspace(-10, 10, 200)\nfor ax, data, color, title in zip(axes,\n        [slow, right, fast],\n        [\"red\", \"darkgreen\", \"orange\"],\n        [\"\\u03b1 = 0.01 (too slow)\", \"\\u03b1 = 0.1 (good compromise)\",\n         \"\\u03b1 = 0.9 (nearly unstable)\"]):\n    ax.plot(xs, xs**2, lw=2, color=\"steelblue\")\n    ax.plot(data, data**2, \"o--\", color=color, markersize=4)\n    ax.set_title(title)\nplt.tight_layout()\nplt.show()<\/code><\/pre>\n\n\n\n<p>With &alpha; = 0.01, the red dots move sluggishly: after 30 iterations we&#8217;re still far from the minimum. With &alpha; = 0.1, convergence is fast and clean. With &alpha; = 0.9, the algorithm oscillates visibly at each step, bouncing from one side of the parabola to the other before settling &mdash; a slightly higher learning rate would diverge entirely.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"pitfalls\">What Can Go Wrong: Geometric Intuition<\/h2>\n\n\n\n<p>The mathematical landscape of a real model is not a nice symmetric parabola. It&#8217;s wild terrain, with secondary valleys, ridges, plateaus, and shapes that defy imagination. Here are the classic problems, explained through our landscape analogy.<\/p>\n\n\n\n<figure style=\"margin: 1.5em 0;text-align: center\"><img decoding=\"async\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2026\/03\/slide-insidie-topografiche.jpg\" alt=\"Three pitfalls of gradient descent visualized in 3D: local minimum (secondary valley), saddle point (zero gradient but not a minimum), narrow valley (inefficient zigzag)\" style=\"max-width: 100%;border: 1px solid #e0e0e0;border-radius: 6px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Local minima: secondary valleys<\/h3>\n\n\n\n<p>Imagine terrain with multiple dips: one deep valley (the global minimum) and several shallower hollows (local minima). The blindfolded person has no way of knowing whether the valley they&#8217;re in is the deepest one. They feel flat ground beneath their feet and stop, convinced they&#8217;ve arrived. But they might be in a shallow hollow while the true minimum lies somewhere else entirely.<\/p>\n\n\n\n<p>In practice, this is a less severe problem than once thought. Modern neural networks have so many parameters that local minima tend to have objective function values similar to the global minimum. 
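<\/p>\n\n\n\n<p>This starting-point lottery is easy to reproduce. A minimal sketch, assuming the illustrative double-well function f(x) = x<sup>4<\/sup> &minus; 4x&sup2; + x (not one of the article&#8217;s examples): the same update rule lands in a different valley depending on where the blindfolded explorer starts.<\/p>

```python
# Gradient descent on an illustrative double-well function:
# f(x) = x^4 - 4x^2 + x, with derivative f'(x) = 4x^3 - 8x + 1.
# It has a deep valley near x = -1.47 and a shallower one near x = +1.35.
def grad(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, alpha=0.01, n_iter=500):
    for _ in range(n_iter):
        x = x - alpha * grad(x)  # the same fundamental rule as before
    return x

print(round(descend(-2.0), 2))  # -1.47: the deeper, global valley
print(round(descend(2.0), 2))   # 1.35: stuck in the shallower valley
```

<p>Starting on the right-hand slope, the explorer settles into the shallower hollow and never discovers that a deeper valley exists on the left.<\/p>\n\n\n\n<p>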
It&#8217;s like terrain with many valleys, but all at roughly the same altitude: ending up in any one of them works fine.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Saddle points: the horse saddle<\/h3>\n\n\n\n<p>A more insidious problem is that of <strong>saddle points<\/strong>. Imagine sitting on a horse saddle: if you move forward or backward, you descend; if you move left or right, you ascend. At that point the gradient is zero &mdash; the terrain feels flat &mdash; but you&#8217;re not at a minimum. You&#8217;re at a point that is a minimum in some directions and a maximum in others.<\/p>\n\n\n\n<p>In high-dimensional spaces, saddle points are far more common than local minima. Fortunately, modern variants of gradient descent (with some noise or momentum, as we&#8217;ll see) generally manage to escape saddle points.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Narrow valleys: the zigzag<\/h3>\n\n\n\n<p>Imagine a very narrow, elongated valley, like a canyon. The gradient points almost perpendicularly to the canyon walls, not along the canyon toward the bottom. The blindfolded person ends up bouncing from one wall to the other, making an inefficient zigzag instead of walking straight toward the bottom.<\/p>\n\n\n\n<p>This happens when the problem&#8217;s variables have very different scales: some change rapidly, others slowly. It&#8217;s a common issue in practice, and one of the main motivations for advanced optimizers like Adam, which we&#8217;ll see in the final section.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scale\">From a Parabola to ChatGPT<\/h2>\n\n\n\n<p>So far we&#8217;ve seen gradient descent on a parabola: a problem with a single variable. It&#8217;s the simplest possible case. 
But the beauty of this algorithm is that it works in exactly the same way at any scale.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The scale of parameters<\/h3>\n\n\n\n<p>Here&#8217;s how the number of parameters grows as models become more complex:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Simple linear regression<\/strong>: 2 parameters (slope and intercept). The landscape is a 3D surface, easy to visualize.<\/li>\n<li><strong>Neural network for handwritten digit recognition<\/strong>: ~100,000 parameters. The landscape has 100,000 dimensions.<\/li>\n<li><strong>ResNet-50<\/strong> (image classification, 2015): ~25 million parameters.<\/li>\n<li><strong>GPT-3<\/strong> (ChatGPT&#8217;s predecessor): 175 billion parameters.<\/li>\n<li><strong>GPT-4<\/strong> and frontier models (2023-2025): estimated over one trillion parameters.<\/li>\n<\/ul>\n\n\n\n<p>The principle is identical: compute the gradient, take a step in the opposite direction, repeat. What changes is the scale of the computation. GPT-4&#8217;s gradient is a vector with over a trillion components, computed on billions of text fragments, using thousands of processors in parallel. 
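<\/p>\n\n\n\n<p>The jump in scale changes the arithmetic, not the rule. A minimal sketch, assuming the toy objective f(&theta;) = &sum; &theta;<sub>i<\/sub>&sup2; over a million parameters (purely illustrative): the update is still the same single line, just vectorized.<\/p>

```python
import numpy as np

# The same update rule applied to a million parameters at once.
# Toy objective (illustrative): f(theta) = sum(theta_i^2), so grad = 2*theta.
rng = np.random.default_rng(0)
theta = rng.normal(size=1_000_000)  # a million "knobs"
alpha = 0.1

for _ in range(100):
    theta = theta - alpha * 2 * theta  # theta_new = theta_old - alpha * grad

print(float(np.abs(theta).max()))  # essentially zero after 100 steps
```

<p>Real training differs in where the gradient comes from (backpropagation over batches of data) and in how long it takes, not in the shape of the update.<\/p>\n\n\n\n<p>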
But the formula is the same one we saw on the parabola.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Where you see it in action (without knowing it)<\/h3>\n\n\n\n<p>Every time you interact with an AI system, gradient descent has been working behind the scenes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Netflix and Spotify<\/strong> recommending what to watch or listen to: recommendation models are trained with gradient descent on billions of user interactions<\/li>\n<li><strong>Google Translate<\/strong> and automatic translators: neural networks with hundreds of millions of parameters, optimized with gradient descent on massive parallel text corpora<\/li>\n<li><strong>Voice assistants<\/strong> (Siri, Alexa): speech recognition uses deep neural networks, trained with the same algorithm<\/li>\n<li><strong>Self-driving cars<\/strong>: the networks that recognize pedestrians, traffic lights, and lane markings are trained with gradient descent variants<\/li>\n<li><strong>ChatGPT, Claude, Gemini<\/strong>: Large Language Models are the most extreme case &mdash; gradient descent applied to billions of parameters on trillions of text tokens<\/li>\n<\/ul>\n\n\n\n<p>The key message is this: <strong>the power of modern AI lies not in the complexity of the optimization algorithm, but in scale<\/strong>. Gradient descent is conceptually simple. What made the AI revolution possible is the ability to apply it to enormous models on enormous amounts of data, thanks to ever more powerful hardware.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evolutions\">The Evolutions: Better Shoes for Our Explorer<\/h2>\n\n\n\n<p>Vanilla gradient descent &mdash; what we&#8217;ve seen so far &mdash; works, but it has the limitations we described: it can be slow, it can oscillate, it can get stuck. Over the years, researchers have developed variants that address these problems. 
Without diving into the formulas, here are the key ideas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Stochastic Gradient Descent (SGD)<\/h3>\n\n\n\n<p>Instead of computing the gradient on the entire dataset at each step (computationally very expensive), SGD computes it on a small random sample (<em>mini-batch<\/em>). It&#8217;s as if the blindfolded person, instead of feeling the entire terrain around them, only sampled a few random points. The slope estimate is noisy but correct on average, and the computation speed is enormously greater. The noise, paradoxically, is also useful: it helps escape local minima and saddle points.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Momentum<\/h3>\n\n\n\n<p>Imagine a ball rolling down the hill instead of a person walking. The ball builds up velocity: if the slope continues in the same direction, it accelerates. If the slope changes direction, the ball slows down before reversing. This is momentum: the algorithm &#8220;remembers&#8221; the direction it was moving and adds the current gradient to it. The result is that it crosses flat zones faster and oscillates less in narrow valleys.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Adam: the Swiss army knife<\/h3>\n\n\n\n<p><strong>Adam<\/strong> (<em>Adaptive Moment Estimation<\/em>) combines the momentum idea with a learning rate that adapts automatically for each parameter. Parameters that change little get larger steps; those that change a lot get smaller steps. It&#8217;s as if the blindfolded person had smart shoes that adjust the step length based on the terrain under each foot.<\/p>\n\n\n\n<p>Adam has become the de facto standard for training most modern neural networks. It&#8217;s robust, requires little manual tuning, and works well across a wide range of problems. 
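<\/p>\n\n\n\n<p>The momentum idea fits in a few lines. A minimal sketch, assuming the illustrative canyon f(x, y) = x&sup2; + 100y&sup2; (not from the article&#8217;s examples): the steep y-walls cap the stable learning rate, so plain descent crawls along x, while the velocity term lets the ball keep rolling.<\/p>

```python
import numpy as np

# Plain gradient descent vs. momentum on an illustrative narrow valley:
# f(x, y) = x^2 + 100*y^2, with gradient (2x, 200y). The steep y-wall
# forces a tiny learning rate, so plain descent crawls along x.
def grad(p):
    return np.array([2 * p[0], 200 * p[1]])

def plain_gd(p, alpha=0.009, n_iter=100):
    for _ in range(n_iter):
        p = p - alpha * grad(p)
    return p

def momentum_gd(p, alpha=0.009, beta=0.9, n_iter=100):
    v = np.zeros_like(p)
    for _ in range(n_iter):
        v = beta * v - alpha * grad(p)  # remember the previous direction
        p = p + v                       # roll like a ball, not step like feet
    return p

start = np.array([10.0, 1.0])
print(plain_gd(start))     # x is still far from 0 after 100 steps
print(momentum_gd(start))  # much closer to the minimum at (0, 0)
```

<p>SGD would add one more ingredient on top: estimating the gradient on a random mini-batch instead of computing it exactly. Adam would further give each coordinate its own adaptive step size.<\/p>\n\n\n\n<p>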
Nearly all the models you use every day &mdash; from Spotify to ChatGPT &mdash; were trained with Adam or one of its variants.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Back to the Mountain<\/h2>\n\n\n\n<p>The blindfolded person we started with now has better shoes. They have a ball that builds velocity instead of legs that take rigid steps. They have soles that automatically adapt to the terrain. And above all, they&#8217;re not walking on a mountain with two or three dimensions: they&#8217;re walking through a landscape with billions of dimensions.<\/p>\n\n\n\n<p>But the principle is exactly the same. Feel the slope. Take a step where it goes down. Repeat.<\/p>\n\n\n\n<p>Gradient descent is not a spectacular algorithm. It lacks the elegant complexity of a genetic algorithm or the narrative charm of adversarial networks. It&#8217;s a mechanical procedure, almost trivial. But it&#8217;s the mechanical procedure upon which the entire artificial intelligence revolution rests. From Netflix suggestions to image-generating models, from self-driving cars to simultaneous translation, it all comes down to this: a function to minimize, a gradient to compute, a step to take.<\/p>\n\n\n\n<p>The next time a voice assistant understands your question, or an automatic translator nails a nuance, remember: behind the scenes, a very sophisticated version of our blindfolded person has walked billions of steps through a landscape with billions of dimensions. And found a valley deep enough to be useful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A blindfolded person on a mountain Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can&#8217;t see anything, but you can feel the slope of the ground beneath your feet. What do you do? 
You move in the direction where the ground goes down, one step &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/the-gradient-descent-algorithm-explained-simply\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;The Gradient Descent Algorithm Explained Clearly: From Intuition to Practice&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":"[{\"content\":\"Objective function and cost function are terms that are often used interchangeably, but they carry different nuances depending on the context.<br>In general, an objective function is a function one wants to optimize, that is, maximize or minimize, according to some criterion. A cost function is a type of objective function that measures the \u201ccost\u201d or \u201closs\u201d associated with a solution or a prediction. In other words, a cost function is an objective function one wants to minimize.<br>For example, to find the line that best fits a set of points, one can use linear regression. In this case, the objective function is the mean squared error between the points and the line, that is, the sum of the squared vertical distances between the points and the line divided by the number of points. This objective function is also a cost function, because it represents the cost of having an imperfect line. The goal is to find the parameters of the line that minimize this cost.<br>However, not all objective functions are cost functions. Some objective functions can be utility, gain, or likelihood functions, among others. These are functions one wants to maximize, because they represent the benefit or the probability associated with a solution or a prediction. In that case, one speaks not of cost or loss, but of optimization.<br>
For example, to find the probabilistic model that best describes a set of data, one can use the method of maximum likelihood. In this case, the objective function is the probability of generating the set of data given the model, that is, the product of the probabilities of each data point given the model. This objective function is not a cost function but a likelihood function. The goal is to find the parameters of the model that maximize this likelihood.\",\"id\":\"ec92aa27-198c-4cb8-9137-d3df169896e0\"}]"},"categories":[161,299],"tags":[1192],"class_list":["post-3266","post","type-post","status-publish","format-standard","hentry","category-statistics","category-ai","tag-gradient-descent"],"lang":"en","translations":{"en":3266,"it":3046},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"paolo","author_link":"https:\/\/www.gironi.it\/blog\/author\/paolo\/"},"uagb_comment_info":4,"uagb_excerpt":"A blindfolded person on a mountain Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can&#8217;t see anything, but you can feel the slope of the ground beneath your feet. What do you do? 
You move in the direction where the ground goes down, one step&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3266","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3266"}],"version-history":[{"count":8,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3266\/revisions"}],"predecessor-version":[{"id":3554,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3266\/revisions\/3554"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3266"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3266"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}