Machine Learning and Regression: the Complete Path, from the Line to the Algorithms

There is a border that many imagine to be sharp and that in fact does not exist: the one between “old-school” statistics and machine learning. On one side regression, which at school is presented as a line drawn through a cloud of points; on the other the algorithms that learn from data, wrapped in an almost mysterious aura. In reality machine learning is not a world apart: it is the natural continuation of the same idea that drives regression — using what we have observed to say something about what we have not yet seen. A line that predicts sales from advertising spend and a decision tree that classifies visitors into customers and non-customers answer, at bottom, the same question: given what I know, what is it reasonable to expect?

Understanding this link, however, does not mean leaping straight onto the most fashionable algorithms. It means walking a road that starts from the simplest relationship between two quantities — correlation — climbs up to the regression models that bring several variables into play at once, learns to recognise when those models creak, and only then crosses the border towards the algorithms we call machine learning proper. Tackled in this order, decision trees, gradient descent and dimensionality reduction stop looking like magic and reveal themselves for what they are: ingenious extensions of ideas we already knew.

This page is that road, in order. We do not re-explain the theory here: each stage is an article on the blog, and the order in which we have arranged them is the order in which it makes sense to read them. Anyone starting from scratch can follow them in sequence; anyone with some grounding can jump to the group they need. The two sections that follow — first regression, then machine learning — are the two slopes of a single ridge. We start with the slope we know best.

Regression

Before the algorithms that learn, we need to master the oldest and most reliable way of tying variables together: explaining or predicting a quantity from one or more others. It is the heart of applied statistics, and it is also the ground on which many machine learning models rest, often without saying so. Anyone who can truly read a regression already holds half of the conceptual tools they will need later on.

Correlation is the non-negotiable starting point. It measures whether and how strongly two quantities move together (ad spend and conversions, time on page and purchase rate), summing up the strength and direction of their bond in a single number between -1 and +1. It is the first question to ask in front of two variables, and grasping it well also teaches the most important lesson of the whole path: correlation is not causation.
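
As a minimal sketch of what that single number is, the Pearson coefficient can be computed by hand in a few lines of Python. The function name `pearson_r` and the ad-spend figures are invented for illustration; they do not come from any real campaign.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two series vary together, and how each varies on its own
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Hypothetical weekly figures, made up for this example
ad_spend = [10, 20, 30, 40, 50]
conversions = [12, 25, 31, 48, 55]

r = pearson_r(ad_spend, conversions)
print(round(r, 2))  # about 0.99
```

A value this close to +1 says the two series rise together almost in lockstep; it says nothing about which one, if either, drives the other.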

Simple linear regression takes the next step: from noticing that two quantities move together to building a model that uses one to predict the other. It is the line drawn through the points, yes, but above all it is the first real predictive model, the one from which everything more complex sets out. Here we learn the concepts — coefficients, residuals, goodness of fit — that will return at every later stage.
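
To make coefficients, residuals and goodness of fit concrete, here is a small sketch of the least-squares line computed from its closed-form formulas. The helper name `fit_line` and the data are assumptions made for the example, not taken from any of the articles.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (x) and conversions (y)
ad_spend = [10, 20, 30, 40, 50]
conversions = [12, 25, 31, 48, 55]

slope, intercept = fit_line(ad_spend, conversions)

# Residuals: what the line fails to explain for each observation
residuals = [y - (intercept + slope * x) for x, y in zip(ad_spend, conversions)]

# Goodness of fit: share of the variation in y that the line accounts for
mean_y = sum(conversions) / len(conversions)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in conversions)
r_squared = 1 - ss_res / ss_tot
```

On these made-up numbers the slope comes out at 1.09 and the intercept at 1.5: each extra unit of spend is associated with roughly one extra conversion, and the line accounts for about 98% of the variation.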

Multiple regression generalises the idea to the realistic case: not a single cause, but many acting together. Sales do not depend only on the advertising budget, but also on the season, the price, the channel. Learning to make several predictors live in the same model — and to interpret their weights without fooling ourselves — is the leap that carries us from textbook statistics to real problems.
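
One way to see several predictors living in the same model is to solve the normal equations directly. The sketch below does so in plain Python; the helper names `solve` and `fit_multiple` are hypothetical, and the sales figures were generated without noise from 5 + 2 * budget - 3 * price precisely so that the fitted weights can be checked against a known answer.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Least-squares weights, intercept first, from (X'X) beta = X'y."""
    x = [[1.0] + list(row) for row in rows]  # leading 1.0 carries the intercept
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * y for xi, y in zip(x, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical observations: (advertising budget, price)
predictors = [(10, 4), (20, 5), (30, 3), (40, 6), (50, 4), (60, 5)]
# Noise-free sales built as 5 + 2 * budget - 3 * price
sales = [13, 30, 56, 67, 93, 110]

intercept, w_budget, w_price = fit_multiple(predictors, sales)
```

Each weight reads as the change in sales for one extra unit of that predictor with the other held fixed, which is the interpretation that becomes fragile when predictors are strongly related to one another.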

Multicollinearity, heteroscedasticity and autocorrelation are the three most common ways in which a seemingly flawless regression betrays us. Predictors too tightly bound to one another, errors whose spread is not constant, residuals that stay correlated from one observation to the next over time: recognising these symptoms is what separates someone who applies a model with their eyes closed from someone who knows when to trust it. It is the stage that teaches caution.
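
As an illustration of one of the three symptoms, autocorrelation, the Durbin-Watson statistic compares each residual with the one before it. The sketch below is a bare-bones version with invented residuals; the function name `durbin_watson` is our own choice, and a real analysis would take the residuals from a fitted model.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals, ordered in time
dragging = [1, 1, 1, -1, -1, -1]     # errors that persist from one period to the next
alternating = [1, -1, 1, -1, 1, -1]  # errors that flip sign every period

dw_dragging = durbin_watson(dragging)        # about 0.67
dw_alternating = durbin_watson(alternating)  # about 3.33
```

The statistic ranges from 0 to 4. Values near 2 suggest little first-order autocorrelation, values towards 0 point to positive autocorrelation, and values towards 4 to negative autocorrelation.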

Logistic regression closes the section by shifting the aim from the how much to the whether: no longer predicting a numeric value, but the probability that an event happens, be it a click, a conversion or a customer who churns. It is the model that acts as a hinge towards machine learning, because it is at once a regression in its own right and one of the most widely used classifiers of all. Anyone who masters it has already set a foot on the other slope.
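
The shift from a number to a probability can be sketched with a single predictor. The code below fits the two coefficients by plain gradient descent on the average log-loss, which is one simple option among several and not necessarily what a statistics package does internally. The names `fit_logistic` and `predict_probability`, the learning rate and the data are all assumptions made for the example.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient descent on the mean log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_b0 = grad_b1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus outcome
            grad_b0 += error
            grad_b1 += error * x
        b0 -= learning_rate * grad_b0 / n
        b1 -= learning_rate * grad_b1 / n
    return b0, b1

# Hypothetical data: minutes on page and whether the visitor converted (1) or not (0)
minutes = [1, 2, 3, 4, 5, 6, 7, 8]
converted = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(minutes, converted)

def predict_probability(x):
    """Estimated probability of conversion after x minutes on the page."""
    return sigmoid(b0 + b1 * x)
```

Turning the probability into a yes/no decision only takes a threshold, for example classifying as a likely customer anyone above 0.5, and that is the step at which the regression becomes a classifier.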

Machine learning

Once regression is behind us, the border is crossed almost without noticing. The algorithms in this section share the same goal, learning from data to predict or classify, but they pursue it with more flexible tools, able to capture relationships a line would never see. Here the vocabulary changes a little (training, feature, overfitting), but the underlying logic remains the one we have built so far.

The basics of machine learning are the entry map to this territory. What it really means to “train” a model, what the difference is between supervised and unsupervised learning, why a model that does brilliantly on the training data can fail miserably on new data: these are the ideas that give meaning to everything that follows, and it is wise to fix them before touching a single algorithm.
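
The gap between training data and new data is easy to reproduce. In the sketch below a deliberately naive model, a one-nearest-neighbour "memoriser" chosen only for illustration, repeats the answer of the closest training point. The data are simulated as y = 2x plus random noise; the seed, sample sizes and function names are arbitrary assumptions.

```python
import random

random.seed(0)  # fixed seed so the simulated data are reproducible

def make_data(n):
    """Simulate n points from y = 2x plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = random.uniform(0, 10)
        y = 2 * x + random.gauss(0, 3)
        data.append((x, y))
    return data

train = make_data(30)  # data the model is allowed to see
test = make_data(30)   # fresh data from the same process

def memoriser(x):
    """Predict by returning the y of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_error = mean_squared_error(memoriser, train)
test_error = mean_squared_error(memoriser, test)
```

On the training set the memoriser is perfect, because every point is its own nearest neighbour; on the test set it pays for having learned the noise along with the signal. This is why a model is judged on data it has not seen.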

Decision trees are often the first machine learning algorithm worth meeting, because they reason in the most human way there is: a sequence of yes/no questions that, step by step, separate the cases into ever more homogeneous groups. They are intuitive, can be read at a glance, and are the building block from which far more powerful models such as random forests are assembled. They are the ideal bridge between regression and the more abstract algorithms.
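
A single yes/no question of a tree can be sketched as the search for the threshold that leaves the two resulting groups as homogeneous as possible. The example uses Gini impurity, one common measure of homogeneity, on one feature with binary labels; the function names and the visitor data are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels: 0 means perfectly homogeneous."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of 1s in the group
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best question 'value <= threshold?'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical visitors: pages viewed and whether they became customers (1) or not (0)
pages_viewed = [1, 2, 3, 4, 6, 7, 8, 9]
is_customer = [0, 0, 1, 0, 1, 1, 1, 1]

threshold, impurity = best_split(pages_viewed, is_customer)
print(threshold, impurity)  # 5.0 0.1875
```

A full tree repeats the same search inside each of the two groups, and across all available features, until the groups are pure enough or a stopping rule is reached.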

Gradient descent is the engine that, behind the scenes, makes an enormous number of models work — from regression itself up to neural networks. It is the method by which an algorithm learns: adjusting its own parameters one small step at a time, descending along the surface of the error to the lowest point it can reach. Grasping its idea, simple and powerful, reveals what really happens when we say that “the model is training”.
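
The idea fits in a dozen lines. The sketch below fits a line by repeatedly nudging slope and intercept against the gradient of the mean squared error; the learning rate, the number of steps and the data (built exactly as y = 2x + 1) are choices made for this example only.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by minimising the mean squared error."""
    slope, intercept = 0.0, 0.0  # arbitrary starting point on the error surface
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        # One small step downhill
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = gradient_descent(xs, ys)
```

The learning rate is the length of each step: too small and the descent crawls, too large and it overshoots the valley and can diverge. On this data the procedure settles on a slope of 2 and an intercept of 1, the same answer the closed-form regression formulas would give.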

Principal component analysis (PCA) tackles the opposite problem to prediction: not adding information, but reducing it without losing the essential. When the variables in play number in the dozens, PCA compresses them into a few new dimensions that capture most of their variability, making the data readable and the models leaner. It is the most elegant example of unsupervised learning, and it closes the path by showing that machine learning serves not only to predict, but also to see better what we have in front of us.
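
On a toy scale the compression looks like this. The sketch finds the first principal component of two correlated variables by power iteration on their covariance matrix, a simple numerical method used here for illustration; libraries typically rely on other decompositions. The function name, the iteration count and the data are assumptions, and two variables stand in for the dozens a real case would have.

```python
from math import sqrt

def first_principal_component(points, iterations=200):
    """Return the main direction of variation and the share of variance it captures."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centred = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centred data
    cov = [
        [sum(row[i] * row[j] for row in centred) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeated multiplication converges to the top eigenvector
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(dims))
    total_variance = sum(cov[i][i] for i in range(dims))
    return v, eigenvalue / total_variance

# Hypothetical measurements of two strongly related variables
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

direction, explained = first_principal_component(points)
```

Here a single new dimension captures more than 99% of the variability of the two original variables, so replacing both with it loses almost nothing.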

Where to start

If this is your first contact with the subject, there is only one entry point: correlation and, right after it, simple linear regression. They are the two stages from which everything else takes on meaning; anyone who tackles the more sophisticated algorithms without this base will, sooner or later, come back here, to the elementary question of how one quantity depends on another.

This is one of the thematic paths we are building to navigate the blog’s articles: regression and machine learning are the way data move from description to prediction. Anyone who wants the foundations that come before, describing and summarising the data, will find them in the basic statistics path. Anyone who wants to understand how we establish whether an effect is real or only apparent, the ground on which every model must eventually be put to the proof, can move on to the path devoted to inferential statistics, the toolbox from which machine learning too ends up drawing.