<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>regola empirica &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/tag/regola-empirica/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Wed, 20 Nov 2024 15:22:46 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>The Normal Distribution</title>
		<link>https://www.gironi.it/blog/en/the-normal-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-normal-distribution/#respond</comments>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Mon, 29 Oct 2018 15:16:00 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[chebyshev]]></category>
		<category><![CDATA[gaussiana]]></category>
		<category><![CDATA[regola empirica]]></category>
		<category><![CDATA[standardizzata]]></category>
		<category><![CDATA[z score]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3321</guid>

					<description><![CDATA[The concept of the normal distribution is one of the key elements in the field of statistical research. Very often, the data we collect shows typical characteristics, so typical that the resulting distribution is simply called&#8230; &#8220;normal&#8221;. In this post, we will look at the characteristics of this distribution, as well as touch on some &#8230; <a href="https://www.gironi.it/blog/en/the-normal-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Normal Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[ <p>The concept of the normal distribution is one of the key elements in the field of statistical research. Very often, the data we collect shows typical characteristics, so typical that the resulting distribution is simply called&#8230; &#8220;normal&#8221;. In this post, we will look at the characteristics of this distribution, as well as touch on some other concepts of notable importance such as:</p>   <ul class="wp-block-list"> <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#regolaempirica">the empirical rule</a></li>   <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#zscore">the standardized variable</a> &#8211; The concept of Z score</li>   <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#chebishev">Chebyshev&#8217;s inequality</a></li> </ul>   <hr class="wp-block-separator has-css-opacity"/>   <span id="more-3321"></span>  				<div class="wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-30b1fca4      "
					data-scroll= "1"
					data-offset= "30"
					style=""
				>
				<div class="uagb-toc__wrap">
						<div class="uagb-toc__title">
							What we will discuss						</div>
																						<div class="uagb-toc__list-wrap ">
						<ol class="uagb-toc__list"><li class="uagb-toc__list"><a href="#visualizing-the-normality-of-our-data" class="uagb-toc-link__trigger">Visualizing the &quot;normality&quot; of our data</a></li><li class="uagb-toc__list"><a href="#transforming-the-data" class="uagb-toc-link__trigger">Transforming the data</a></li><li class="uagb-toc__list"><a href="#the-empirical-rule" class="uagb-toc-link__trigger">The empirical rule</a></li><li class="uagb-toc__list"><a href="#standardizing-is-useful-and-beautiful-the-z-score" class="uagb-toc-link__trigger">Standardizing is useful (and beautiful&#8230;). The Z score.</a><ul class="uagb-toc__list"><li class="uagb-toc__list"><a href="#lets-do-a-quick-example" class="uagb-toc-link__trigger">Let&#039;s do a quick example</a></li></ul></li><li class="uagb-toc__list"><a href="#and-now-the-fun-part-lets-do-some-practical-examples" class="uagb-toc-link__trigger">And now the fun part: let&#039;s do some practical examples!</a></li><li class="uagb-toc__list"><a href="#chebyshevs-inequality" class="uagb-toc-link__trigger">Chebyshev&#039;s inequality</a></li></ol>					</div>
									</div>
				</div>
			  <hr class="wp-block-separator has-css-opacity"/>   <p>In previous posts, we have seen examples of probability distributions for discrete variables: for example, the <a href="https://www.gironi.it/blog/distribuzioni-di-probabilita-distribuzioni-discrete-la-binomiale/">Binomial</a>, the <a href="https://www.gironi.it/blog/la-distribuzione-geometrica/">Geometric</a>, the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/">Poisson</a> distribution&#8230;</p>   <p>The <strong>normal</strong> distribution is a <strong>continuous probability distribution</strong>; in fact, it is the most famous and most used of the continuous probability distributions. We recall that a continuous variable can take an infinite number of values within any given interval.</p>   <p>The normal distribution has a <strong>bell</strong> shape, is also called <strong>Gaussian</strong> &#8211; after the famous mathematician who made a fundamental contribution to this field &#8211; and is <strong>symmetrical about its mean</strong>. It extends indefinitely in both directions, but most of the area &#8211; that is, the probability &#8211; is collected around the mean. 
The curve appears to change shape at two points, which we call <strong>inflection points</strong>, and which coincide with a <strong>distance of one standard deviation above and below the mean</strong>.</p>   <p>With a couple of lines of R, I generate the characteristic shape of this distribution:</p>   <figure class="wp-block-image is-resized"><img fetchpriority="high" decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/09/image.png" alt="plot of the Gaussian or normal curve" class="wp-image-918" width="647" height="413" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/09/image.png 863w, https://www.gironi.it/blog/wp-content/uploads/2018/09/image-300x192.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <h2 class="wp-block-heading">Visualizing the &#8220;normality&#8221; of our data</h2>   <p>R offers several tools to evaluate the deviation of a distribution from a theoretical normal distribution.</p>   <p>One of these is the <strong>qqnorm()</strong> function, which plots the distribution against the theoretical normal quantiles (qq = quantile-quantile):</p>   <pre class="wp-block-preformatted">qqnorm(variable)
qqline(variable)</pre>   <p>I verify this with an example by generating a normal distribution:</p>   <pre class="wp-block-preformatted">x &lt;- rnorm(100, 5, 10)
qqnorm(x)
qqline(x)</pre>   <p>The result is this, and as you can see, we have visual confirmation of the substantial normality of the distribution:</p>   <figure class="wp-block-image"><img decoding="async" width="679" height="432" src="https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot.png" alt="qq plot" class="wp-image-1114" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot.png 679w, https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot-300x191.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 
984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <h2 class="wp-block-heading">Transforming the data</h2>   <p>When the <strong>asymmetry of a distribution depends on the fact that a variable extends over several orders of magnitude</strong>, we have an easy way to make our distribution symmetrical and similar to a normal distribution: <strong>transform the variable into its logarithm</strong>:</p>   <pre class="wp-block-preformatted">qqnorm(log10(variable))
qqline(log10(variable))</pre>   <p>But how do I calculate the central tendency in this case? If I use something like mean(log10(variable)), I no longer have the unit of measurement&#8230; To recover it, I can use the <strong>antilogarithm</strong>, i.e., calculate: 10<sup>result</sup>. However, it must always be kept in mind that <strong>this is the <a href="https://www.gironi.it/blog/statistica-descrittiva-misure-di-posizione/#la-media-geometrica" target="_blank" rel="noreferrer noopener">geometric mean</a></strong>.</p>   <hr class="wp-block-separator has-css-opacity"/>   <p>Good: we have our dataset and we have verified that the distribution is reasonably similar to a normal distribution. It&#8217;s time to find some practical applications to put our new knowledge to use!</p>   <h2 class="wp-block-heading" id="regolaempirica">The empirical rule</h2>   <p>The empirical rule is one of the pillars of statistics. 
Without going into too much theoretical detail, the gist is this: <strong>the percentages of data from a normal distribution within 1, 2, and 3 standard deviations from the mean are approximately 68%, 95%, and 99.7%.</strong> It is a rule of such importance and common use that it is worth restating with greater emphasis&#8230;</p>   <figure class="wp-block-pullquote is-style-default has-light-gray-background-color has-background" style="font-style:normal;font-weight:600"><blockquote><p>THE EMPIRICAL RULE<br>The percentages of data from a normal distribution within <br>1, 2, and 3 standard deviations from the mean <br>are approximately <br>68%, 95%, and 99.7%.</p></blockquote></figure>   <p></p>   <h2 class="wp-block-heading" id="zscore">Standardizing is useful (and beautiful&#8230;). The Z score.</h2>   <p>The <strong>standardized normal distribution</strong> is a normal distribution with a <strong>zero mean</strong> and a <strong>standard deviation</strong> of <strong>one</strong>. <br><br><em>* In the blog, I use the Italian terms &#8220;deviazione standard&#8221; and &#8220;scarto quadratico medio&#8221; interchangeably&#8230; since they express the same concept and are both commonly used.</em></p>   <p>That is, with:</p>  
\(
\mu = 0,\ \sigma = 1
\)

  <p>Any normal distribution can be converted into a standardized normal distribution by setting the mean to zero and expressing the deviations from the mean in units of standard deviation, what the English speakers very effectively call <em>Z-score</em>.</p>   <p class="has-light-gray-background-color has-background">A Z-score measures the distance between a data point and the mean, using standard deviations. Therefore, a Z-score can be positive (the observation is above the mean) or negative (below the mean). A Z-score of -1, for example, will indicate that our observation falls one standard deviation below the mean. Obviously, a Z-score equal to 0 is equivalent to the mean.</p>   <p><strong>The Z-score is a &#8220;pure&#8221; value, so it provides us with a &#8220;measure&#8221; of extraordinary effectiveness. In practice, it is an index that allows me to compare values between different distributions (as long as they are &#8220;normal,&#8221; of course), using a &#8220;standard&#8221; measure.</strong> <br><br>The calculation, as we have seen, is almost trivial: simply <strong>divide the deviation by the standard deviation</strong>:</p>  
\(
Z = \frac{\text{deviation}}{\text{standard deviation}}
\)
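<p>The mechanics can be sketched in a few lines of Python, using only the standard library (the sample values below are made up for illustration):</p>

```python
from statistics import mean, pstdev

# A small made-up sample (any unit of measurement will do)
x = [61, 72, 58, 65, 70, 64, 68, 59, 66, 67]

mu = mean(x)
sigma = pstdev(x)  # population standard deviation

# Standardize: divide each deviation from the mean by the standard deviation
z = [(v - mu) / sigma for v in x]

# By construction, the standardized values have mean 0 and standard deviation 1
print(mean(z), pstdev(z))
```

<p>On this standardized scale, each value is its own Z-score.</p>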

  <p>Under these conditions, we know that approximately 68% of the area under the standardized normal curve is within 1 standard deviation from the mean, 95% within two, and 99.7% within three.<br>That is:</p>  
\(
68.26\% \ \text{within} \ \mu \pm \sigma \\
95.44\% \ \text{within} \ \mu \pm 2\sigma \\
99.74\% \ \text{within} \ \mu \pm 3\sigma
\)
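<p>These percentages can be checked without tables, with a quick Python sketch using only the standard library: for a standard normal, P(|Z| &lt; k) = erf(k/&#8730;2).</p>

```python
from math import erf, sqrt

# For a standard normal, P(|Z| < k) = erf(k / sqrt(2))
for k in (1, 2, 3):
    p = erf(k / sqrt(2))
    print(f"within {k} standard deviation(s): {p:.2%}")
```

<p>The exact values are 68.27%, 95.45%, and 99.73%; the table-based figures above differ only in the last digit because of rounding.</p>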

  <p>To find the probabilities &#8211; that is, the areas &#8211; for problems involving the normal distribution, the value X is converted into the corresponding Z-score:</p>  
\(
Z = \frac{X - \mu}{\sigma}
\)

  <p>Then the value of Z is looked up in the table to find the corresponding probability under the curve.</p>   <p>Does it seem difficult? It&#8217;s actually very easy, and a lot of fun. And with R, or with the TI-83, it&#8217;s really a breeze!<br> </p>   <p class="has-light-gray-background-color has-background">The importance of the Z-score lies also (and especially) in its <strong>extreme practical utility</strong>: in fact, it allows us to meaningfully compare observations drawn from populations with different means and standard deviations, using a common scale. This is why the process is called <em>standardization</em>: it allows <strong>comparing observations between variables that have different distributions</strong>. Using the table (or the calculator, or the computer) we can quickly calculate probabilities and percentiles, and identify any extreme values (<em>outliers</em>). </p>   <p>Since sigma is positive, Z will be positive if X&gt;mu and negative if X&lt;mu. The value of Z represents the number of standard deviations the value lies above or below the mean.</p>   <h5 class="wp-block-heading">Let&#8217;s do a quick example</h5>   <p>I have observations of some phenomenon that have a mean value of 65:</p>  
\(
\mu = 65
\)

  <p>The standard deviation is 10:</p>  
\(
\sigma = 10
\)

  <p>And I observe a value of 81:</p>  
\(
X = 81
\)

  <p>The value of the Z-score is calculated in a moment:</p>  
\(
Z = \frac{X - \mu}{\sigma} = \frac{81 - 65}{10} = \frac{16}{10} = 1.6
\)

  <p>The observed value, on a standard scale, falls 1.6 standard deviations above the mean. To find out, therefore, what percentage of observations fall below the observed value, I just need to consult the table:</p>   <div class="wp-block-uagb-image uagb-block-b287af8d wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-none"><figure class="wp-block-uagb-image__figure"><img decoding="async" srcset="https://www.gironi.it/blog/wp-content/uploads/2023/02/tabella-z.png " src="https://www.gironi.it/blog/wp-content/uploads/2023/02/tabella-z.png" alt="Z score table" class="uag-image-2697" width="" height="" title="" loading="lazy"/><figcaption class="uagb-image-caption"><em>The Z score table in action&#8230;</em></figcaption></figure></div>   <p>Looking up my Z value of 1.6 in the table (row 1.6, column 0.00), I find the value 0.9452, which is equivalent to saying that 94.52% of the observed values are less than 81.</p>   <p>Obviously, I could have obtained the value in R without using the table, simply with:</p>   <pre class="wp-block-preformatted">pnorm(1.6)</pre>   <p>For those using Python:</p>   <pre class="wp-block-preformatted">from scipy.stats import norm

p = norm.cdf(1.6)
print(p)</pre>   <h2 class="wp-block-heading">And now the fun part: let&#8217;s do some practical examples!</h2>   <p><strong>Example 1</strong></p>   <p>What is the probability of an event with a Z-score &lt; 2.47?</p>   <p>I take the <a href="http://www.matapp.unimib.it/~fcaraven/did0607/tavola_normale.pdf" target="_blank" rel="noreferrer noopener">table</a> and see that Z = 2.47 corresponds to 0.9932.</p>   <p>Therefore, 99.32% of the values lie below 2.47 standard deviations above the mean.<br><br>Representing the situation graphically, what I am asked to do is find the area of the gray surface, that is, the area under the curve to the left of the point with abscissa Z=2.47:</p>   <figure class="wp-block-image"><img decoding="async" width="679" height="432" 
src="https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1.png" alt="normale 1" class="wp-image-1091" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1.png 679w, https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1-300x191.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <p>In R, the calculation is very simple. All I have to do is type:</p>   <pre class="wp-block-preformatted">pnorm(2.47)</pre>   <p>The <strong>pnorm() function allows us to obtain the cumulative probability curve of the normal distribution</strong>. In other words, it allows us to calculate the relative area (remembering that the total area is 1) under the curve, from the given value of Z to +infinity or -infinity.<br><br>By default, R uses the lower tail, i.e., it finds the area from -infinity to Z.<br>To compute the area between Z and +infinity, I just need to set lower.tail=FALSE.<br><br></p>   <p><strong>Example 2</strong></p>   <p>What is the probability of a Z-score value &gt; 1.53 ?</p>   <p>From the table, I find the value 0.937, so I deduce that 93.7% of the values are below a Z-score of 1.53.<br>So, to find out how many are above: 100-93.7 = 6.3%</p>   <p>In R, all I have to do is type:</p>   <pre class="wp-block-preformatted">1 - pnorm(1.53)</pre>   <p><strong>Example 3</strong></p>   <p>What is the probability of &#8220;drawing&#8221; a random value of less than 3.65, given a normal distribution with a mean = 5 and a standard deviation = 2.2 ?</p>   <p>I immediately find the Z-score for the value 3.65:</p>  
\(
Z = \frac{3.65 - 5}{2.2} = \frac{-1.35}{2.2} \simeq -0.61
\)
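<p>For those using Python, the same probability can also be computed without the table, using only the standard library (the normal_cdf helper below is my own, built from the identity &#934;(x) = &#189;(1 + erf(x/&#8730;2)); it is not part of the post&#8217;s original code):</p>

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X < x) for a normal distribution with mean mu and standard deviation sigma
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 3: mean 5, standard deviation 2.2, value 3.65
print(normal_cdf(3.65, mu=5, sigma=2.2))  # ~0.2697
```

<p>The small difference from the table value 0.2709 is again due to rounding Z to -0.61.</p>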

  <p>I look for this value in the table: 0.2709. Therefore, there is a 27.09% probability that a value less than 3.65 will &#8220;come out&#8221; of a random selection with a mean of 5 and a standard deviation of 2.2.</p>   <p>If I wanted to use a scientific calculator, with the TI-83 I would just have to type:</p>   <pre class="wp-block-preformatted">normalcdf(-1e99,3.65,5,2.2)</pre>   <p>While with a Casio fx I would just have to follow these steps:</p>   <pre class="wp-block-preformatted">MENU
STAT
DIST
NORM
Ncd
Data: Variable
Lower: -100
Upper: 3.65
sigma: 2.2
mu: 5
EXECUTE</pre>   <p>The result is obviously slightly different from that obtained from the table, because in the table I rounded the value of the division (3.65-5)/2.2 to -0.61, omitting the remaining decimal part&#8230;</p>   <p><strong>Example 4: finding probabilities between 2 Z-scores</strong></p>   <p>This is the most fun case of all. Actually, you just find the two probabilities and subtract&#8230;</p>   <p>What is the probability associated with a value between Z=1.2 and Z=2.31?</p>   <p>I think of my normal curve: first I find the area to the left of Z<sub>2</sub>. Then I find the area to the left of Z<sub>1</sub>. Then I subtract the two values to find the area between the two, which is the probability sought.</p>   <p>Or I use R and simply write:</p>   <pre class="wp-block-preformatted">pnorm(2.31) - pnorm(1.2)</pre>   <p>and the result, in this case 10.46%, is found in a moment!</p>   <p>Wait, but what if I wanted to calculate the value of Z starting from a cumulative probability? Just use the inverse function of pnorm(), which in R is qnorm(). 
For example, to find the value of Z with an area of 0.5, I type:</p>   <pre class="wp-block-preformatted">qnorm(0.5)</pre>   <p>and I will get the result, which is clearly 0 (the mean of a standardized normal has a value of 0, and the mean divides the normal into two equal areas&#8230;).</p>   <p>For those using Python, the code is:</p>   <pre class="wp-block-preformatted">from scipy.stats import norm

q = norm.ppf(0.5)
print(q)</pre>   <h2 class="wp-block-heading" id="chebishev">Chebyshev&#8217;s inequality</h2>   <p>The most important characteristic of Chebyshev&#8217;s inequality is that <strong>it applies to any probability distribution whose mean and standard deviation are known</strong>.</p>   <p>When dealing with a distribution of unknown type, or one that is certainly not normal, Chebyshev&#8217;s inequality comes to our aid, stating that:<br><br>given a positive real value k, the probability that the random variable X takes a value between:</p>  
\( \mu - k\sigma \ \text{and} \ \mu + k\sigma \)

  <p>is at least:</p>  
\( 1 - \frac{1}{k^{2}} \)

  <p>In other words: suppose we know the mean and standard deviation of a set of data, which do not follow a normal distribution. We can say that for every value k &gt;0 at least a fraction (1-1/k<sup>2</sup>) of the data falls within the interval between:</p>  
\( \mu - k\sigma \ \text{and} \ \mu + k\sigma \)

  <p>As always, an example is useful to clarify everything. I take an example dataset&#8230; the average salaries paid by U.S. baseball teams in 2016:</p>   <style type="text/css"> table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; } </style> <table class="tableizer-table"> <thead><tr class="tableizer-firstrow"><th>Team</th><th>Salary ($M)</th></tr></thead><tbody> <tr><td>Arizona Diamondbacks</td><td>91.995583</td></tr> <tr><td>Atlanta Braves</td><td>77.073541</td></tr> <tr><td>Baltimore Orioles</td><td>141.741213</td></tr> <tr><td>Boston Red Sox</td><td>198.328678</td></tr> <tr><td>Chicago Cubs</td><td>163.805667</td></tr> <tr><td>Chicago White Sox</td><td>113.911667</td></tr> <tr><td>Cincinnati Reds</td><td>80.905951</td></tr> <tr><td>Cleveland Indians</td><td>92.652499</td></tr> <tr><td>Colorado Rockies</td><td>103.603571</td></tr> <tr><td>Detroit Tigers</td><td>192.3075</td></tr> <tr><td>Houston Astros</td><td>89.0625</td></tr> <tr><td>Kansas City Royals</td><td>136.564175</td></tr> <tr><td>Los Angeles Angels</td><td>160.98619</td></tr> <tr><td>Los Angeles Dodgers</td><td>248.321662</td></tr> <tr><td>Miami Marlins</td><td>64.02</td></tr> <tr><td>Milwaukee Brewers</td><td>51.2</td></tr> <tr><td>Minnesota Twins</td><td>99.8125</td></tr> <tr><td>New York Mets</td><td>128.413458</td></tr> <tr><td>New York Yankees</td><td>221.574999</td></tr> <tr><td>Oakland Athletics</td><td>80.613332</td></tr> <tr><td>Philadelphia Phillies</td><td>91.616668</td></tr> <tr><td>Pittsburgh Pirates</td><td>95.840999</td></tr> <tr><td>San Diego Padres</td><td>94.12</td></tr> <tr><td>San Francisco Giants</td><td>166.744443</td></tr> <tr><td>Seattle Mariners</td><td>139.804258</td></tr> <tr><td>St. Louis Cardinals</td><td>143.514</td></tr> <tr><td>Tampa Bay Rays</td><td>60.065366</td></tr> <tr><td>Texas Rangers</td><td>158.68022</td></tr> <tr><td>Toronto Blue Jays</td><td>131.905327</td></tr> <tr><td>Washington Nationals</td><td>142.501785</td></tr> </tbody></table>   <p>The mean is: 125.3896<br>The standard deviation is: 48.64039</p>   <p>Chebyshev&#8217;s inequality</p>  
\( 1 - \frac{1}{k^{2}} \)

  <p>tells us that, for k = 1.5, at least 55.56% of the data in this case falls in the interval:</p>  
\(
(\mu - 1.5\sigma,\ \mu + 1.5\sigma) = (52.42902,\ 198.3502)
\)
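<p>As a cross-check, here is a short Python sketch over the same salary data. Remember that Chebyshev only guarantees a lower bound, so the observed fraction can be (and here is) much higher:</p>

```python
from statistics import mean, stdev

# 2016 MLB team payrolls in $M (the dataset above)
salaries = [
    91.995583, 77.073541, 141.741213, 198.328678, 163.805667,
    113.911667, 80.905951, 92.652499, 103.603571, 192.3075,
    89.0625, 136.564175, 160.98619, 248.321662, 64.02,
    51.2, 99.8125, 128.413458, 221.574999, 80.613332,
    91.616668, 95.840999, 94.12, 166.744443, 139.804258,
    143.514, 60.065366, 158.68022, 131.905327, 142.501785,
]

mu, sigma = mean(salaries), stdev(salaries)
k = 1.5

# Chebyshev: at least 1 - 1/k^2 of the data lies within k standard deviations of the mean
bound = 1 - 1 / k ** 2
lo, hi = mu - k * sigma, mu + k * sigma
frac = sum(lo < s < hi for s in salaries) / len(salaries)

print(f"guaranteed: {bound:.2%}, observed: {frac:.2%}")
```

<p>Note the sketch uses the sample standard deviation (stdev), as R&#8217;s sd() does; with the population version the interval shifts slightly, but the bound still holds.</p>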

  <div style="height:50px" aria-hidden="true" class="wp-block-spacer"></div> ]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-normal-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
