The Central Limit Theorem: Why Statistics Works (Even When Data Isn’t Normal)

Throughout the previous articles, we examined the normal distribution and its properties. Then we moved on: we built confidence intervals, conducted hypothesis tests, and calculated margins of error. At every step the normal distribution was present, like a quiet thread running through everything.

But there’s a question we may have asked ourselves without yet finding a satisfying answer: why does the normal distribution work so well, even when our data aren’t normal? Who said that organic traffic, conversion rates, or session durations follow a bell curve? In most cases, they don’t.

The answer lies in one of the most elegant and powerful results in all of mathematics: the Central Limit Theorem (often abbreviated as CLT). It’s the theorem that, in a sense, justifies all of inferential statistics.
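A small simulation can make the claim concrete. The sketch below is only an illustration, and every choice in it is an assumption made for the example: the data come from an exponential distribution with rate 1 (strongly skewed, with mean 1 and standard deviation 1), each sample has 50 observations, and 5000 samples are drawn. If the theorem holds, the sample means should cluster around 1 with a spread near 1/sqrt(50), and roughly 68% of them should fall within one standard deviation of their center, as they would for a normal distribution.

```python
import random
import statistics


def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n Exponential(1) draws."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


rng = random.Random(42)  # fixed seed so that runs are repeatable
means = sample_means(n=50, trials=5000, rng=rng)

center = statistics.fmean(means)   # theory: close to 1
spread = statistics.stdev(means)   # theory: close to 1 / sqrt(50), about 0.141

# Share of sample means within one standard deviation of the center.
# For a normal distribution this share is about 0.68.
within_one = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"center of sample means: {center:.3f}")
print(f"spread of sample means: {spread:.3f}")
print(f"share within one sd:    {within_one:.3f}")
```

The individual observations are far from bell-shaped, yet their averages behave almost like draws from a normal distribution.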


How to Use Decision Trees to Classify Data

Decision trees are a type of machine learning algorithm that splits data with a tree of logical rules and uses those rules to predict the class of new observations. They are easy to interpret and adapt to different types of data, but they can also suffer from overfitting, excessive complexity, and sensitivity to class imbalance.
Let’s understand a bit more about them and walk through a simple example in R.
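To show what "dividing data based on logical rules" means, here is a minimal sketch of the smallest possible tree: a single split on one numeric feature, chosen by Gini impurity. It is written in plain Python rather than R so that it runs without extra packages. The data, the class names, and the helper names (`gini`, `best_threshold`, `predict`) are all made up for illustration, and a real tree would repeat this search recursively over many features.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the lowest weighted Gini."""
    points = sorted(set(values))
    best_t, best_score = None, float("inf")
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


def majority(labels):
    """Most frequent label in a group."""
    return max(set(labels), key=labels.count)


# Hypothetical training data: one measurement per item and its class.
lengths = [1.4, 1.3, 1.5, 1.7, 4.5, 4.7, 5.1, 4.0]
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]

threshold, score = best_threshold(lengths, classes)
left_label = majority([y for v, y in zip(lengths, classes) if v <= threshold])
right_label = majority([y for v, y in zip(lengths, classes) if v > threshold])


def predict(value):
    """Apply the learned rule: one question, two possible answers."""
    return left_label if value <= threshold else right_label


print(f"rule: length <= {threshold:.2f} -> {left_label}, otherwise {right_label}")
print(predict(1.6), predict(4.9))
```

On this toy data the two classes are perfectly separable, so a single rule is enough. Overfitting appears when a tree keeps adding rules to fit noise in the training set.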


The Gradient Descent Algorithm Explained Clearly: From Intuition to Practice

A blindfolded person on a mountain

Imagine standing on a mountainous terrain, completely blindfolded. Your goal: reach the lowest point in the valley. You can’t see anything, but you can feel the slope of the ground beneath your feet. What do you do? You move in the direction where the ground goes down, one step at a time. If it slopes more steeply to the left, you go left. If it drops more to the right, you go right. With each step, you feel the slope again and redirect yourself.

This strategy, so simple and natural, is essentially what neural networks use to learn. Every time an AI model improves (learning to recognize a face, translate a sentence, or generate text), it does so by descending through a mathematical landscape, one step at a time, following the slope.

It’s called gradient descent, and it’s arguably the most important algorithm in modern machine learning.
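The blindfolded walk translates into very few lines of code. The sketch below is a minimal illustration under assumptions chosen for the example: the "landscape" is the one-dimensional function f(x) = (x - 3)^2, whose lowest point is at x = 3, the step size (learning rate) is 0.1, and the walk starts at x = -5. The names `gradient_descent`, `f`, and `grad_f` are hypothetical.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the slope returned by `grad`."""
    x = start
    for _ in range(steps):
        slope = grad(x)               # feel the slope under your feet
        x -= learning_rate * slope    # take a small step downhill
    return x


def f(x):
    """Toy landscape: a parabola with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad_f(x):
    """Derivative of f, i.e. the slope at x."""
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, start=-5.0)
print(f"reached x = {x_min:.6f}, f(x) = {f(x_min):.2e}")
```

With this function and step size, each step shrinks the distance to the minimum by a constant factor of 0.8, so the walk converges quickly. A learning rate that is too large would overshoot the valley instead of settling into it.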

Infographic: the blindfolded explorer metaphor for gradient descent, with three steps: Sensor, Action, Cycle

The Hypergeometric Distribution

We have seen that the binomial distribution assumes an effectively infinite population, a condition that can be realized in practice by sampling from a finite population of size N with replacement.

If this does not occur, meaning if we are sampling from a population without replacement, we must use the hypergeometric distribution. (In reality, if N is large relative to the sample size, the hypergeometric probability mass function tends towards the binomial.)

The hypergeometric distribution is used to calculate the probability of obtaining a certain number of successes in a series of binary trials (yes or no) that are dependent: each draw changes the composition of what remains, so the probability of success varies from one draw to the next.

The hypergeometric distribution allows us to answer questions like:

If I draw a sample of size n from a population of size N, in which M elements meet certain requirements, what is the probability of drawing x elements that meet those requirements?
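A question of this kind can be answered by counting combinations. The sketch below assumes the usual notation, which is stated here for the example: N is the population size, M is the number of elements in the population that meet the requirements, n is the number of elements drawn without replacement, and x is the number of drawn elements that meet the requirements. The probability is then C(M, x) * C(N - M, n - x) / C(N, n). The urn numbers are invented for illustration.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """P(X = x) when drawing n items without replacement from N, M of them successes."""
    if x < max(0, n - (N - M)) or x > min(n, M):
        return 0.0  # impossible outcome
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def binom_pmf(x, n, p):
    """P(X = x) for n independent trials with constant success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Hypothetical urn: 20 balls, 7 of them red; draw 5 without replacement.
p_small = hypergeom_pmf(2, N=20, M=7, n=5)
print(f"P(exactly 2 red), N = 20:    {p_small:.4f}")

# Same proportion of red balls (35%), but a much larger population.
p_large = hypergeom_pmf(2, N=20000, M=7000, n=5)
p_binom = binom_pmf(2, n=5, p=0.35)
print(f"P(exactly 2 red), N = 20000: {p_large:.4f}")
print(f"binomial with p = 0.35:      {p_binom:.4f}")
```

With the small urn, the hypergeometric and binomial results differ visibly. With the large population they nearly coincide, because removing five balls barely changes the proportion of red ones.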
