Teaching Series #2 - Central Limit Theorem
"...who desires an end desires the means..."
In scientific research, we rarely measure an entire population. Why? Because it is very hard. Can you measure every human height on Earth? I bet you can’t. Instead, we work with samples. The challenge is: how do we generalise findings from a small sample to a vast, unknown population, especially if we don't know the population's distribution?
Well, the Central Limit Theorem (CLT for short) is your answer! Even if individual data points in your population aren't normally distributed (salary data is a classic example - it piles up at one end, right?), the CLT allows you to assume that the averages you obtain from sufficiently large samples will follow an approximately normal distribution. This means you can confidently apply powerful properties of the normal distribution and statistical methods that rely on normality to make inferences about the population mean.
For example, if a company wants to know the average lifespan of a new, let's say, light bulb they are producing, they can't test every bulb (that would take forever, and they would simply not sell any). Instead, they take many samples of bulbs, record their average lifespans, and thanks to the Central Limit Theorem, they can build a normal distribution of these averages to estimate the true average lifespan of all bulbs. Simple as that!
But how, you may ask? What is the magic? Well, while a full mathematical proof is quite advanced, we can get a good intuitive sense of why it happens.
Imagine your population is extremely skewed. Take, for example, the number of hours people spend playing video games in a day. Most people might play 0-2 hours, but a few dedicated gamers might play 10-14 hours, creating a distribution heavily skewed to the right.
Now, when you take a random sample (say, 30 people) and calculate the average:
Some samples will have mostly low values: This will result in a low sample mean.
Some samples will have mostly high values: This will result in a high sample mean (though less likely if the population is skewed low).
Most samples will have a mix: This is the key! Because you're drawing randomly, most samples will likely contain a mix of low, medium, and perhaps a few high values. When you average these mixed values, they tend to "pull" the mean towards the center of the overall population's distribution. The extreme values get averaged out by the more common, central values.
It's like tossing a coin many times. A single toss is either heads or tails (not normal!). But if you take groups of 100 coin tosses and record the number of heads in each group, that distribution of "number of heads" will start to look normal. Why? Because while you might get a group with 30 heads or 70 heads occasionally, it's far more likely to get a group with around 50 heads. The extremes become less probable, and the centre becomes more probable, creating that bell shape.
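If you want to see this averaging-out effect for yourself, here is a minimal simulation sketch in Python with NumPy. The gaming-hours population is modelled as an exponential distribution purely for illustration (any skewed shape would do): even though the population is heavily skewed, the means of samples of 30 come out nearly symmetric.

```python
import numpy as np

rng = np.random.default_rng(42)

# A deliberately skewed "population": daily gaming hours, modelled here
# (purely for illustration) as an exponential distribution with mean 2 hours.
population = rng.exponential(scale=2.0, size=100_000)

# Draw many random samples of size 30 and record each sample's mean.
sample_means = np.array([rng.choice(population, size=30).mean()
                         for _ in range(5_000)])

def skewness(x):
    # Simple skewness measure: 0 means symmetric, > 0 means right-skewed.
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(f"Population mean:          {population.mean():.2f}")
print(f"Mean of sample means:     {sample_means.mean():.2f}")
print(f"Skewness of population:   {skewness(population):.2f}")   # clearly positive
print(f"Skewness of sample means: {skewness(sample_means):.2f}") # much closer to 0
```

Plot a histogram of sample_means and you should see the familiar bell shape emerging around the population mean.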
The CLT is silently used behind many everyday things and scientific breakthroughs. Just think of polling and surveys (you cannot ask every single voter), quality control in manufacturing (you cannot measure and test every single product coming off the line), medical research and clinical trials (you cannot test a new drug on every single patient), financial analysis, machine learning, and so on.
Key conditions of the CLT
Now, let's talk about the key conditions that need to be met for the Central Limit Theorem to hold true. While the CLT is incredibly robust, it's not a magical wand that works in every single scenario. Think of these as the ingredients for the "magic" to happen:
Random Samples: This is fundamental. Each sample must be drawn randomly from the population. If your samples are biased (e.g., you only pick your favourite gamers), then your sample means won't accurately reflect the population, and the CLT won't apply correctly.
Independence: The observations within each sample and the samples themselves must be independent of each other. This means that picking one gamer shouldn't influence which other gamer you pick in your study.
Sufficiently Large Sample Size (n): This is arguably the most crucial condition. The "sufficiently large" part is a bit flexible, but a common rule of thumb is n ≥ 30. The larger the sample size, the more closely the distribution of sample means will resemble a normal distribution, even if the original population is very far from normal. If your sample size is too small, the distribution of sample means might still reflect the original population's non-normal shape.
Let's talk about an example. Imagine you're the head chef at a popular restaurant, and you're very proud of your best dessert: a perfectly balanced carrot cake. You want to ensure that each cake, on average, contains 100 grams of sugar. However, you know that your bakers, being human, might sometimes add a little more or a little less sugar to individual cakes. The amount of sugar in each individual cake might be slightly variable, and perhaps even a bit skewed (maybe they're more likely to underfill slightly than overfill a lot).
You decide to implement a quality control measure. Every day, you randomly select 35 carrot cakes that were baked that morning and carefully weigh the amount of sugar in each. You then calculate the average sugar content for that batch of 35 cakes. You do this every day for a month.
Based on what we have already discussed about the Central Limit Theorem, what would you expect the distribution of these daily average sugar weights to look like after a month, even if the individual cakes' sugar weights aren't perfectly normally distributed? And why is that important for your quality control?
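Before reading on, here is a rough simulation of that scenario. The skewed "underfill" distribution below is entirely made up (the exact shape doesn't matter much to the CLT); the point is that each day's average of 35 cakes clusters tightly and symmetrically around 100 g.

```python
import numpy as np

rng = np.random.default_rng(0)

def bake_one_day(n_cakes=35):
    # Hypothetical per-cake model: the target is 100 g, but the amount is a
    # bit variable and skewed (an occasional large underfill). The exact
    # shape is made up; the CLT doesn't much care.
    shortfall = rng.gamma(shape=2.0, scale=1.5, size=n_cakes)  # mean 3 g
    return 103.0 - shortfall                                   # mean 100 g

# One month of quality control: 30 daily averages of 35 cakes each.
daily_means = np.array([bake_one_day().mean() for _ in range(30)])

print(f"Daily averages: min {daily_means.min():.1f} g, "
      f"max {daily_means.max():.1f} g, mean {daily_means.mean():.1f} g")
```

Even with only 30 daily averages, you should see them bunch up around 100 g; a day whose average drifts far from that is a red flag for your quality control.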
We talked about taking many, many samples, calculating an average (or mean) for each sample, and then looking at the distribution of those averages. This distribution of sample means is precisely what we call a sampling distribution of the sample mean.
In simple terms, a sampling distribution is the distribution of a statistic (like the mean, median, or standard deviation) that we would get if we drew an infinite number of samples of a certain size from a population.
The Central Limit Theorem specifically tells us about the shape of the sampling distribution of the sample mean: it tells us that as the sample size increases, this sampling distribution will become approximately normal.
The "standard deviation" of this sampling distribution of sample means has a special name: the Standard Error of the Mean (SEM).
Just like a regular standard deviation tells you how spread out individual data points are around the mean, the standard error of the mean tells you how much variability or "spread" there is among the sample means themselves. Here's the formula for the standard error of the mean:
SE_x̄ = σ / √n
where σ is the population standard deviation and n is the sample size.
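A quick sanity check of this formula, assuming (just for this demo) a population with a known σ of 15 and samples of size 36: the spread of many simulated sample means matches σ / √n.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 15.0   # population standard deviation (assumed known for this demo)
n = 36         # sample size

theoretical_sem = sigma / np.sqrt(n)   # the formula: sigma / sqrt(n)

# Empirical check: draw 10,000 samples of size n and measure how spread out
# their means are.
sample_means = rng.normal(loc=100, scale=sigma, size=(10_000, n)).mean(axis=1)
empirical_sem = sample_means.std()

print(f"Theoretical SEM: {theoretical_sem:.2f}")  # 2.50
print(f"Empirical SEM:   {empirical_sem:.2f}")    # roughly 2.50
```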
Law of Large Numbers
Two concepts often get confused: the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN). They both involve large samples, but they tell us different, though complementary, things. The Law of Large Numbers is perhaps more intuitive. It essentially states that as the sample size grows, the sample mean will get closer and closer to the true population mean.
Key takeaway for LLN: It's about where the sample mean converges (approaches the true mean).
The Central Limit Theorem, as we've discussed, is about the shape of the distribution of those sample means. It tells us that as the sample size increases, this distribution of sample means will become approximately normal. Key takeaway for CLT: It's about the distributional shape of sample means (normal).
The LLN guarantees accuracy, while the CLT guarantees normality of the sampling distribution.
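A small sketch that separates the two ideas, using a fair six-sided die (true mean 3.5): the running average below illustrates the LLN by creeping towards 3.5 as n grows, while the CLT is about what the histogram of many such averages would look like.

```python
import numpy as np

rng = np.random.default_rng(7)

true_mean = 3.5                           # mean of a fair six-sided die
rolls = rng.integers(1, 7, size=100_000)  # 100,000 simulated rolls

# Law of Large Numbers: the running sample mean approaches the true mean.
for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>7}: sample mean = {rolls[:n].mean():.3f}")
```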
Confidence Intervals
Imagine you want to estimate the average height of all adults in your country. You take a random sample of 1000 people and find their average height. The Law of Large Numbers tells you this sample average is likely close to the true population average. But how close? That's where the CLT and confidence intervals come in.
Because the CLT tells us the distribution of these sample means is normal, we can use the properties of the normal distribution (like how much data falls within 1, 2, or 3 standard errors) to construct a confidence interval.
For a normal distribution:
Approximately 68% of the data falls within 1 standard deviation of the mean.
Approximately 95% of the data falls within 1.96 standard deviations of the mean.
Approximately 99.7% of the data falls within 3 standard deviations of the mean.
So, for a 95% confidence interval, we typically use 1.96 standard errors from the mean.
Since a confidence interval is an estimate of the population mean based on a sample mean, we are interested in the variability of those sample means, which is described by the standard error.
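Here is a minimal sketch of that calculation. The height data below is simulated (the numbers are made up), but the recipe is the real one: sample mean plus or minus 1.96 standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample: heights (in cm) of 1,000 randomly chosen adults.
heights = rng.normal(loc=170, scale=9, size=1_000)

mean = heights.mean()
sem = heights.std(ddof=1) / np.sqrt(len(heights))  # standard error of the mean

lower, upper = mean - 1.96 * sem, mean + 1.96 * sem  # 95% confidence interval
print(f"Sample mean: {mean:.1f} cm")
print(f"95% CI: ({lower:.1f}, {upper:.1f}) cm")
```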
Hypothesis Testing: The Bakery Example
Imagine a baker wants to test if a new cake leads to an increase in sales. Here's the procedure, step-by-step:
Step 1: Formulate the Hypotheses
We set up two competing statements:
Null Hypothesis (H0): This is the "status quo" or the assumption of no effect/no change. It's the statement we try to find evidence against.
In our case: H0: The new cake does not affect sales (i.e., the average sale with the new cake is the same as without it, or equal to the historical average). Let's say the historical average sale is some imaginary number, 100 units per day. So, H0 : μ = 100.
Alternative Hypothesis (Ha or H1 ): This is what we are trying to prove. It's the statement of an effect or change. In our case: Ha: The new cake increases sales (i.e., the average sale with the new cake is greater than 100 units per day). So, Ha: μ > 100.
Step 2: Collect Data
The baker offers the new cake on a random sample of, say, n = 50 days. At the end of each of those days, they record the day's sales, and after the 50 days they calculate the sample mean. Let's say the sample mean comes out to be 105 sales per day. They also calculate the sample standard deviation (s).
Step 3: Assume the Null Hypothesis is True (The "What If" Scenario)
This is the crucial step where the CLT comes into play! We ask ourselves: "If the null hypothesis (H0:μ = 100) were true, what would the sampling distribution of sample means look like?"
Thanks to the Central Limit Theorem, we know that if we took many samples (50 days) from a population where the true mean sale was 100 (i.e., H0 is true), the distribution of those sample means would be approximately normal.
This normal distribution would be centred at the assumed population mean (100 in this case, from H0 ).
Its spread would be determined by the standard error of the mean, SE_x̄ = σ / √n, where σ is the population standard deviation and n is the sample size.
Step 4: Calculate the Test Statistic
We calculate a "test statistic" (often a Z-score or a t-score) that tells us how many standard errors away our observed sample mean (105 sales) is from the assumed population mean under the null hypothesis (μ = 100). The formula for a Z-score (when population standard deviation is known or sample size is large):
Z = (observed sample mean - population mean under the null hypothesis) / standard error of the mean, or in symbols, Z = (x̄ - μ₀) / SE_x̄
Let's imagine, for the sake of this example's simplicity, that our calculated Z-score for the sample mean of 105 sales per day is, say, 2.0. This means our observed sample mean of 105 is 2.0 standard errors above the assumed population mean of 100.
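For completeness, here is that arithmetic spelled out. The sample standard deviation of about 17.7 is an assumed figure, chosen only so the Z-score lands on the 2.0 used in this example.

```python
import numpy as np

mu_0 = 100     # population mean under H0
x_bar = 105    # observed sample mean (sales per day)
n = 50         # sample size (days)
s = 17.68      # assumed sample standard deviation, chosen for illustration

sem = s / np.sqrt(n)        # standard error of the mean
z = (x_bar - mu_0) / sem    # how many standard errors above mu_0 are we?
print(f"Z-score: {z:.2f}")  # ~2.00
```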

Step 5: Determine the p-value
This is where we answer the question: "How likely is it to observe our sample mean if the null hypothesis were true?"
What it means: The p-value is the probability of obtaining a sample mean as extreme as, or more extreme than, the one we actually observed (105 sales per day), assuming that the null hypothesis (H0 :μ=100) is true.
In our bakery example, with a Z-score of 2.0 and an alternative hypothesis of μ > 100, we're asking: "If the new cake has no effect (mean is 100), how likely is it that we would randomly get a sample mean of 105 or higher just by chance?"
How do we calculate p-value:
Since we know the sampling distribution of sample means is normal (thanks to CLT), and we've converted our sample mean into a Z-score, we can use the properties of the standard normal distribution (or a Z-table/calculator).
A Z-score of 2.0 corresponds to a certain area under the standard normal curve. For a "greater than" alternative hypothesis (Ha:μ>100), the p-value is the area in the tail of the distribution beyond our calculated Z-score.
For Z = 2.0, the p-value is approximately 0.0228 (meaning about 2.28%). You can find this value in a standard Z-table or with a calculator.
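If you would rather compute it than look it up, a one-liner with SciPy's standard normal distribution gives the same number (norm.sf is the "area to the right" of a Z-score):

```python
from scipy.stats import norm

z = 2.0
p_value = norm.sf(z)              # area to the right of z; same as 1 - norm.cdf(z)
print(f"p-value: {p_value:.4f}")  # approximately 0.0228
```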

Step 6: Make a Decision and Conclusion
We compare our p-value to a pre-determined significance level (α), often set at 0.05 (or 5%). This α is our threshold for "rare enough" to reject the null hypothesis.
If p-value ≤ α: We reject the null hypothesis. This means our observed result is unlikely to have happened by random chance if H0 were true. We have statistically significant evidence to support the alternative hypothesis.
In our example, 0.0228 ≤ 0.05. So, we reject H0.
Conclusion: We have significant evidence to conclude that the new cake does increase sales.
If p-value > α: We fail to reject the null hypothesis. This means our observed result is not unusual enough to conclude that H0 is false. We don't have enough statistical evidence to support the alternative hypothesis. (If our p-value had been, say, 0.15, we would fail to reject H0 .)

Remember the scary-sounding one-tailed and two-tailed tests? Well, here's how each works (a small code sketch of all three follows this list):
For a One-Tailed Test (Ha : μ > something): You're interested in the area under the distribution to the right of your Z-score.
For a One-Tailed Test (Ha : μ < something): You're interested in the area under the distribution to the left of your Z-score.
For a Two-Tailed Test (Ha : μ ≠ something): You're interested in the probability of getting an extreme result in either direction.
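A tiny sketch of all three cases for our Z = 2.0, again using SciPy's standard normal:

```python
from scipy.stats import norm

z = 2.0

p_right = norm.sf(z)            # Ha: mu > something -> right-tail area
p_left = norm.cdf(z)            # Ha: mu < something -> left-tail area
p_two = 2 * norm.sf(abs(z))     # Ha: mu != something -> both tails

print(f"right-tailed: {p_right:.4f}")  # 0.0228
print(f"left-tailed:  {p_left:.4f}")   # 0.9772
print(f"two-tailed:   {p_two:.4f}")    # 0.0455
```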
Final notes: the Central Limit Theorem is fundamental because it allows us to assume a normal sampling distribution, which then lets us calculate probabilities (p-values) and build confidence intervals, giving us the tools to make informed decisions from sample data!
That’s all for this lesson. Keep learning!
Sincerely, MO