We have seen that the binomial distribution assumes independent trials with a constant probability of success. This holds for an infinite population or, in practice, when sampling from a finite population of size N with replacement.
If instead we sample from a finite population without replacement, we must use the hypergeometric distribution. (When N is large compared with the sample size, the hypergeometric probability mass function approaches the binomial one.)
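The convergence towards the binomial can be checked numerically. The sketch below uses R's built-in `dhyper` and `dbinom` functions; the population sizes, the 20% share of successes, and the variable names are illustrative assumptions, not values taken from the text.

```r
# Illustrative assumptions: 20% of the population are "successes",
# we draw n = 5 elements and ask for exactly x = 2 successes.
p <- 0.2
n <- 5
x <- 2

# Growing population sizes, always with the same share of successes
N_values <- c(30, 300, 3000, 30000)
M_values <- round(p * N_values)   # successes in each population

# Hypergeometric probability (without replacement) for each population size
hyper <- dhyper(x, M_values, N_values - M_values, n)

# Binomial probability (with replacement), independent of N
binom <- dbinom(x, n, p)

# Side-by-side comparison: the gap shrinks as N grows
data.frame(N = N_values,
           hypergeometric = hyper,
           binomial = binom,
           abs_diff = abs(hyper - binom))
```

With a small population the two probabilities differ visibly; as N grows while the sample stays at 5 elements, removing a drawn element barely changes the composition of the population and the gap closes.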
The hypergeometric distribution gives the probability of obtaining a certain number of successes in a series of binary trials (yes or no) that are dependent: since the drawn elements are not put back, the probability of success changes from one draw to the next.
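To see what "dependent trials with a variable probability of success" means in practice, here is a minimal sketch with a hypothetical population of 30 elements, 6 of which count as successes (the numbers and variable names are assumptions chosen for illustration). It compares two draws without replacement with two draws with replacement.

```r
# Hypothetical population: 30 elements, 6 of which are "successes"
N <- 30
M <- 6

# Probability of a success on the first draw
p_first <- M / N

# Without replacement, a success on the first draw leaves
# M - 1 successes among N - 1 elements for the second draw
p_second_given_first <- (M - 1) / (N - 1)

# Probability of two successes in two draws
p_both_without <- p_first * p_second_given_first  # without replacement
p_both_with <- p_first^2                          # with replacement

# The same two values from the built-in distributions
p_hyper <- dhyper(2, M, N - M, 2)
p_binom <- dbinom(2, 2, M / N)

c(without_replacement = p_both_without, dhyper = p_hyper,
  with_replacement = p_both_with, dbinom = p_binom)
```

The product of the step-by-step probabilities matches `dhyper`, while the constant-probability product matches `dbinom`.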
The hypergeometric distribution allows us to answer questions like:
If I draw a sample of n elements from a population of N elements, M of which meet certain requirements, what is the probability that exactly x of the sampled elements meet those requirements?
Expressed as a formula:
\( f(x|N,M,n)=\frac{C^{M}_{x}\times C^{N-M}_{n-x}}{C^N_n} \)
where \( C^a_b \) is the number of ways of choosing b elements out of a.
We know that a batch of 30 pieces contains 6 malfunctioning pieces.
If I take a sample of 5 pieces, what is the probability of finding exactly 2 defective pieces?
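Let's write down the data: N = 30 (pieces in the batch), M = 6 (defective pieces in the batch), n = 5 (sample size), x = 2 (defective pieces we want in the sample).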
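Substituting N = 30, M = 6, n = 5 and x = 2 into the formula, the calculation by hand can be sketched as follows (same notation as the formula above):

\[
f(2|30,6,5)=\frac{C^{6}_{2}\times C^{24}_{3}}{C^{30}_{5}}=\frac{15\times 2024}{142506}=\frac{30360}{142506}\approx 0.2130
\]

That is, about a 21.3% chance of finding exactly 2 defective pieces in the sample.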
Let’s see how to solve the same problem in R:
# Definition of the hypergeometric distribution parameters
x <- 2   # I want to know the probability of finding 2 defective pieces
n <- 5   # the size of my sample
M <- 6   # the total malfunctioning pieces present in the batch
N <- 30  # the total number of pieces in my batch

# Probability calculation with the dhyper function
prob <- dhyper(x, M, N - M, n)
prob
and I get the output:
[1] 0.2130437
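As a cross-check, the same value can be rebuilt from the formula with R's `choose` function. This is a sketch: the names `prob_formula`, `prob_dhyper` and `probs` are introduced here only for illustration.

```r
# Same parameters as in the batch example
x <- 2   # defective pieces wanted in the sample
n <- 5   # sample size
M <- 6   # defective pieces in the batch
N <- 30  # pieces in the batch

# Formula applied directly: C(M, x) * C(N - M, n - x) / C(N, n)
prob_formula <- choose(M, x) * choose(N - M, n - x) / choose(N, n)

# Built-in hypergeometric probability
prob_dhyper <- dhyper(x, M, N - M, n)

# TRUE if the two approaches agree
all.equal(prob_formula, prob_dhyper)

# Probabilities for every possible number of defective pieces (0 to 5);
# they should add up to 1
probs <- dhyper(0:n, M, N - M, n)
sum(probs)
```

Note the argument order of `dhyper`: number of successes in the sample, successes in the population, failures in the population, sample size.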
Let's look at another example: an urn contains 10 white balls and 5 black ones. Drawing 4 balls without replacement, what is the probability of getting 3 white and 1 black? Here N = 15, M = 10 (the white balls, which count as successes), n = 4 and x = 3, so:
\( f(3|15,10,4)=\frac{C^{10}_{3}\times C^{5}_{1}}{C^{15}_{4}}=\frac{120\times 5}{1365}\approx 0.4396 \)
As before, in R we can use the dhyper function to calculate the probability of drawing 3 white balls and 1 black ball from the described urn.
Here's the R code:
# Definition of the hypergeometric distribution parameters
x <- 3   # Number of white balls drawn
n <- 4   # Number of balls drawn
M <- 10  # Number of white balls in the urn
N <- 15  # Total number of balls

# Probability calculation with the dhyper function
prob <- dhyper(x, M, N - M, n)
prob
The probability of drawing 3 white balls and 1 black ball is therefore 0.4395604, or about 43.96%.
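Because `dhyper` counts successes from the group passed as its second argument, it is worth checking which colour is being treated as the success. The sketch below computes both readings for the urn; the variable names are illustrative assumptions.

```r
# Urn composition and number of balls drawn
white <- 10
black <- 5
drawn <- 4

# White as "success": 3 white and 1 black
p_3white <- dhyper(3, white, black, drawn)

# Black as "success": 3 black and 1 white
p_3black <- dhyper(3, black, white, drawn)

# The first case rebuilt from the formula, as a cross-check
p_3white_formula <- choose(white, 3) * choose(black, 1) /
  choose(white + black, drawn)

c(three_white = p_3white,                   # 600 / 1365, about 0.4396
  three_black = p_3black,                   # 100 / 1365, about 0.0733
  three_white_formula = p_3white_formula)
```

Swapping the second and third arguments answers a different question, so the group sizes should always be matched to the colour whose count is passed as the first argument.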
For an accessible walk through the discrete distributions — hypergeometric included — Finalmente ho capito la statistica by Maurizio De Pra (Italian edition) covers them with plenty of worked examples.