Confidence Intervals: Quantifying Uncertainty in Estimates


You measure the average height of 100 people. The sample mean is 170 cm.

What's the true population mean? You don't know. You only measured 100 people out of billions. Your sample mean is an estimate, and estimates have error.

But you can quantify that error. You can say: "I'm 95% confident the true mean is between 168 and 172 cm."

That's a confidence interval. It's not a single number—it's a range that likely contains the truth. And "likely" is quantified: 95% confidence means that if you repeated this process 1,000 times, about 950 of those intervals would contain the true mean.

Confidence intervals are everywhere. Every poll margin of error. Every clinical trial result. Every A/B test conclusion. They're the mechanism that lets you go from "this is what I measured" to "this is probably true, give or take."

But here's the catch: confidence intervals don't mean what most people think they mean. Even researchers get this wrong.

A 95% confidence interval does not mean "there's a 95% chance the true value is in this range." That sounds right. It's wrong. And the difference matters.

This article explains what confidence intervals actually are, how they're calculated, and why the common interpretation is subtly but catastrophically incorrect.


The Setup: Point Estimates Have Uncertainty

You run a survey. You measure your sample. You calculate a statistic—maybe the mean, maybe a proportion, maybe a difference between groups.

That statistic is a point estimate. It's your best guess at the population parameter. But it's just one estimate from one sample.

If you sampled again, you'd get a slightly different number. Sample 1,000 times, and you'd get a distribution of estimates—scattered around the true value, with some spread.

The Central Limit Theorem tells us that distribution is approximately normal (for large enough samples). And the spread of that distribution—the standard error—tells us how much sampling variability to expect.

Confidence intervals use that standard error to build a range: "The true value is probably somewhere in here."


The 95% Confidence Interval: What It Actually Means

Let's be precise.

You collect a sample. You calculate the sample mean ($\bar{x}$) and standard error (SE). You construct a 95% confidence interval:

$$\text{CI}_{95\%} = \bar{x} \pm 1.96 \times \text{SE}$$

The Correct Interpretation

"If we repeated this sampling procedure many times, approximately 95% of the confidence intervals we constructed would contain the true population mean."

Read that again. It's a statement about the procedure, not about this specific interval.

Imagine you run 1,000 experiments. Each time, you sample 100 people, calculate the mean, and build a 95% CI.

  • ~950 of those intervals will contain the true population mean.
  • ~50 of those intervals will miss it (just by random chance).

You don't know which intervals are right and which are wrong. But on average, 95% of them are right. That's what "95% confidence" means.
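A quick way to internalize the procedural reading is to simulate it. The sketch below is a minimal illustration (Python with NumPy assumed; the true mean, spread, and seed are arbitrary choices for the demo): it repeats the sampling procedure many times and counts how often the 95% interval covers the truth.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, true_sd = 170.0, 10.0      # hypothetical population values
n, n_experiments = 100, 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, true_sd, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(f"Coverage: {covered / n_experiments:.3f}")   # should land near 0.95
```

Coverage is a property of the loop, not of any single interval the loop produces.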

The Incorrect (But Common) Interpretation

"There's a 95% probability the true mean is in this interval."

This sounds intuitive. It's what most people—including researchers—think confidence intervals mean.

But it's wrong. Here's why:

In frequentist statistics (which is what confidence intervals come from), the population parameter is fixed. It's not random. It's a specific number—you just don't know what it is.

Your interval is random—it changes with each sample. But the parameter doesn't.

So saying "there's a 95% probability the parameter is in this interval" is incoherent in the frequentist framework. The parameter is either in the interval or it isn't. There's no probability about it.

Analogy: I flip a coin, hide the result, and build a "confidence interval" based on my guess. I might say "I'm 95% confident it's heads." But the coin is already either heads or tails. My confidence doesn't change that. My statement is about my procedure (how often it's right), not about this coin.

Why This Matters

The misinterpretation leads to overconfidence.

People see a narrow confidence interval and think: "The true value is almost certainly in here." But that's not what it means.

It means: "The procedure that generated this interval is usually right. But this specific interval might be one of the 5% that's wrong."

That distinction is subtle. But it's the difference between rigorous reasoning and wishful thinking.


How Confidence Intervals Are Calculated

Let's work through the math.

For a Mean (Known Variance or Large Sample)

You have a sample of size $n$. You calculate:

  • Sample mean: $\bar{x}$
  • Sample standard deviation: $s$
  • Standard error: $\text{SE} = \frac{s}{\sqrt{n}}$

For a 95% confidence interval:

$$\text{CI}_{95\%} = \bar{x} \pm 1.96 \times \text{SE}$$

Where does 1.96 come from?

It's the Z-score corresponding to the middle 95% of a normal distribution. If you go 1.96 standard deviations on either side of the mean, you capture 95% of the area under the curve.

For other confidence levels:

  • 90% CI: $\bar{x} \pm 1.645 \times \text{SE}$
  • 99% CI: $\bar{x} \pm 2.576 \times \text{SE}$

Higher confidence = wider interval. You're more certain the true value is inside, but you're less precise about where.
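As a sketch of the arithmetic (Python with NumPy and SciPy assumed; the sample data are simulated purely for illustration), the critical value for any confidence level comes from the normal quantile function, which is where 1.645, 1.96, and 2.576 come from:

```python
import numpy as np
from scipy import stats

def mean_ci_z(x, confidence=0.95):
    """Normal-approximation CI for a mean (large-sample case)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))
    z = stats.norm.ppf(0.5 + confidence / 2)   # 1.96 for 95%, 2.576 for 99%
    return xbar - z * se, xbar + z * se

heights = np.random.default_rng(0).normal(170, 10, size=100)  # toy data
print(mean_ci_z(heights, 0.95))
print(mean_ci_z(heights, 0.99))   # same data, wider interval
```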

For a Mean (Small Sample, Unknown Variance)

If $n < 30$ and you don't know the population variance, use the t-distribution instead of the normal distribution.

$$\text{CI}_{95\%} = \bar{x} \pm t^* \times \text{SE}$$

Where $t^*$ is the critical value from the t-distribution with $n-1$ degrees of freedom.

The t-distribution has heavier tails than the normal distribution, so $t^*$ is larger than 1.96. That makes the interval wider—accounting for the extra uncertainty when you have less data.

As $n$ increases, the t-distribution converges to the normal distribution. By $n = 30$, they're nearly identical.
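A minimal sketch of the small-sample version (Python with SciPy assumed; the sample values are invented):

```python
import numpy as np
from scipy import stats

def mean_ci_t(x, confidence=0.95):
    """t-based CI for a mean when the population variance is unknown."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    t_star = stats.t.ppf(0.5 + confidence / 2, df=n - 1)   # > 1.96 for small n
    return xbar - t_star * se, xbar + t_star * se

small_sample = np.array([168, 172, 171, 169, 175, 166, 170, 173, 168, 171.0])
print(mean_ci_t(small_sample))   # t* with df = 9 is about 2.262, so wider than a z interval
```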

For a Proportion

You have a sample of $n$ people, and $p$ is the sample proportion (e.g., 60% support a policy).

Standard error for a proportion:

$$\text{SE} = \sqrt{\frac{p(1-p)}{n}}$$

95% confidence interval:

$$\text{CI}_{95\%} = p \pm 1.96 \times \text{SE}$$

Example: You poll 1,000 voters. 520 support Candidate A. Sample proportion: $p = 0.52$.

$$\text{SE} = \sqrt{\frac{0.52 \times 0.48}{1000}} \approx 0.0158$$

$$\text{CI}_{95\%} = 0.52 \pm 1.96 \times 0.0158 \approx [0.489, 0.551]$$

So you'd report: "52% support Candidate A, ±3.1% margin of error."

That ±3.1% is the half-width of the confidence interval—what journalists call the "margin of error."
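The poll arithmetic can be checked in a few lines (Python assumed; the numbers are the ones from the example above):

```python
import math

p, n = 0.52, 1000
se = math.sqrt(p * (1 - p) / n)    # ≈ 0.0158
margin = 1.96 * se                 # ≈ 0.031, i.e. ±3.1 points
print(f"95% CI: [{p - margin:.3f}, {p + margin:.3f}]")   # ≈ [0.489, 0.551]
```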


What Affects the Width of a Confidence Interval?

Three factors:

1. Sample Size

Larger samples = narrower intervals.

Standard error decreases as $\frac{1}{\sqrt{n}}$. Double your sample size, and your SE shrinks by a factor of $\sqrt{2} \approx 1.41$.

Diminishing returns: Going from $n = 100$ to $n = 400$ cuts your SE in half. Going from $n = 1,000$ to $n = 4,000$ also only cuts it in half, and now the same gain costs 3,000 extra observations instead of 300. Precision gets expensive.

2. Variability in the Data

More spread in the population = wider intervals.

If everyone in your population is nearly identical, a small sample tells you a lot. If the population is highly variable, you need more data to pin down the mean.

Standard deviation ($s$) appears in the numerator of SE. High $s$ means high SE means wide intervals.

3. Confidence Level

Higher confidence = wider intervals.

  • 90% CI: ±1.645 SE
  • 95% CI: ±1.96 SE
  • 99% CI: ±2.576 SE

You're trading off precision (narrow interval) against confidence (high probability of being right).

Most fields use 95% as the standard. It's a convention, not a law of nature.


Confidence Intervals for Differences

Often, you're not estimating a single mean. You're comparing two groups.

Example: Does the drug work better than placebo?

You measure the mean improvement in the drug group ($\bar{x}_1$) and the placebo group ($\bar{x}_2$). You calculate the difference: $\bar{x}_1 - \bar{x}_2$.

The confidence interval for the difference is:

$$\text{CI} = (\bar{x}_1 - \bar{x}_2) \pm t^* \times \text{SE}_{\text{diff}}$$

Where:

$$\text{SE}_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Interpretation: If the 95% confidence interval for the difference does not include zero, the groups are significantly different at the 5% level.

If the 95% CI is [2, 8], you can be confident the drug works better than placebo (the difference is positive).

If the 95% CI is [-1, 5], you can't rule out "no difference" (zero is inside the interval). The result is inconclusive.
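Here is a sketch of the two-group calculation (Python with NumPy and SciPy assumed; the group data are simulated, not real trial results). It uses the unpooled standard error from the formula above, with Welch-Satterthwaite degrees of freedom, an approximation the text glosses over:

```python
import numpy as np
from scipy import stats

def diff_ci(x1, x2, confidence=0.95):
    """CI for a difference in means using the unpooled (Welch) standard error."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2
    se_diff = np.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom (an approximation)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    t_star = stats.t.ppf(0.5 + confidence / 2, df=df)
    d = x1.mean() - x2.mean()
    return d - t_star * se_diff, d + t_star * se_diff

rng = np.random.default_rng(1)
drug = rng.normal(12, 4, size=50)      # hypothetical improvement scores
placebo = rng.normal(8, 4, size=50)
print(diff_ci(drug, placebo))   # if the interval excludes zero, the difference is significant at 5%
```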


Common Misinterpretations (and Why They're Wrong)

Let's catalog the errors.

Misinterpretation 1: "There's a 95% probability the true value is in this interval."

Wrong. The true value is either in the interval or it isn't. The 95% refers to the procedure, not this specific interval.

Correct: "If we repeated this procedure, 95% of the intervals we constructed would contain the true value."

Misinterpretation 2: "Values inside the interval are more likely than values outside."

Wrong. All values inside the interval are equally plausible (from a frequentist perspective). The interval doesn't give you a probability distribution over the parameter.

If your 95% CI is [168, 172], you can't say "170 is more likely than 171." They're both inside the interval. That's all you know.

(If you want probability distributions over parameters, you need Bayesian statistics, which uses credible intervals instead of confidence intervals. Different framework.)

Misinterpretation 3: "Wider intervals mean more uncertainty."

Correct—but incomplete. Wider intervals also mean higher confidence.

A 99% CI is wider than a 95% CI for the same data. That doesn't mean you're "more uncertain"—it means you're demanding higher confidence that the interval contains the truth.

You choose the confidence level. The width follows.


The Bayesian Alternative: Credible Intervals

Frequentist confidence intervals are conceptually slippery. They answer a question nobody asked: "How often would this procedure work?"

Bayesian credible intervals answer the intuitive question: "What's the probability the parameter is in this range?"

In Bayesian statistics, you treat the parameter as a random variable with a probability distribution (the posterior distribution). You calculate an interval that contains, say, 95% of the posterior probability.

Interpretation: "There's a 95% probability the true value is in this interval."

That's what most people think a confidence interval means. And in Bayesian stats, it's actually true.

The catch: Bayesian inference requires a prior—your initial belief about the parameter before seeing the data. The posterior updates that prior using Bayes' theorem. The result depends on your prior.
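As a concrete, hypothetical illustration, here is the earlier poll redone as a Bayesian estimate with a flat Beta(1, 1) prior on the proportion (that prior choice is an assumption for the demo). With a conjugate Beta prior the posterior is another Beta, so the 95% credible interval comes straight from posterior quantiles (Python with SciPy assumed):

```python
from scipy import stats

successes, n = 520, 1000          # 520 of 1,000 voters support Candidate A
prior_a, prior_b = 1, 1           # flat Beta(1, 1) prior: an assumption

# Beta prior + binomial likelihood -> Beta posterior (conjugacy)
post_a = prior_a + successes
post_b = prior_b + (n - successes)

lo, hi = stats.beta.ppf([0.025, 0.975], post_a, post_b)
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

With this much data and a flat prior, the credible interval nearly coincides with the frequentist CI; the interpretations still differ.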

Frequentist stats avoids priors. It only uses the data. That's philosophically clean but conceptually awkward.

Which framework is "right"? Depends who you ask. Both are mathematically rigorous. Both have trade-offs.

But if you're using confidence intervals, know that you're in the frequentist framework—and the intuitive interpretation doesn't apply.


Confidence Intervals vs. Hypothesis Tests

There's a deep connection.

A 95% confidence interval tells you which null hypotheses you'd reject at the 5% significance level.

If your 95% CI for a mean is [168, 172], then:

  • You'd reject the null hypothesis that the true mean is 165 (it's outside the interval).
  • You'd fail to reject the null that the true mean is 170 (it's inside the interval).

In fact, you can think of hypothesis testing as just checking whether a specific value is inside or outside the confidence interval.

Confidence intervals are more informative. They tell you the range of plausible values, not just "reject" or "don't reject."

That's why many statisticians prefer reporting CIs over p-values. CIs quantify effect size and uncertainty simultaneously.
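The duality is easy to check numerically. The sketch below (Python with SciPy; the summary statistics are invented to match the [168, 172] example) computes a two-sided z-test p-value for two candidate null values and compares it with the interval check:

```python
from scipy import stats

xbar, se = 170.0, 1.02            # invented so that the 95% CI is roughly [168, 172]
ci = (xbar - 1.96 * se, xbar + 1.96 * se)

for null_value in (165, 170):
    z = (xbar - null_value) / se
    p = 2 * stats.norm.sf(abs(z))                     # two-sided p-value
    inside = ci[0] <= null_value <= ci[1]
    print(f"H0: mu = {null_value}  inside CI: {inside}  p = {p:.4f}")
# p < 0.05 exactly when the null value falls outside the 95% CI
```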


Practical Issues: When Confidence Intervals Mislead

Assumption Violations

Confidence intervals assume:

  1. Random sampling. If your sample is biased, the interval is meaningless.
  2. Independence. If observations are correlated (e.g., repeated measures, clustered data), standard errors are wrong.
  3. Normality (for small samples). The CLT guarantees approximate normality of the sample mean only as $n$ grows. For small $n$ with heavily skewed or heavy-tailed data, the nominal coverage breaks down.

Violate these assumptions, and your "95% confidence" might be 70% or 85% in reality.

Multiple Comparisons

If you calculate 20 confidence intervals, you'd expect 1 of them to miss the truth (just by chance). That's the 5% error rate.

But researchers often report only the intervals that "look interesting." That's selection bias. If you calculated 100 intervals and only reported the 10 that excluded zero, you're drastically inflating false positives.

The solution: Adjust for multiple comparisons (Bonferroni correction, false discovery rate control). Or report all intervals, not just the "significant" ones.
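A Bonferroni adjustment is simple to apply to intervals: to keep the family-wise error rate at 5% across $m$ intervals, build each one at the $1 - 0.05/m$ level. A minimal sketch (Python with SciPy; $m = 20$ is just an example):

```python
from scipy import stats

m = 20                              # number of intervals in the family
alpha_family = 0.05
alpha_each = alpha_family / m       # Bonferroni: 0.0025 per interval
z = stats.norm.ppf(1 - alpha_each / 2)
print(f"Use ±{z:.2f} × SE instead of ±1.96 × SE")   # about ±3.02 × SE
```

Each interval gets wider; in exchange, the chance that any of the 20 misses the truth stays near 5%.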

Overlapping Intervals Don't Test Differences

You compare two groups. Group A has CI [10, 14]. Group B has CI [12, 16]. The intervals overlap.

Does that mean the groups are not significantly different?

No. Overlapping confidence intervals for individual groups don't tell you about the confidence interval for the difference.

You need to explicitly test the difference. The CI for $(\bar{x}_A - \bar{x}_B)$ might not include zero, even if the individual CIs overlap.
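A quick numeric check (Python; the group summaries are invented specifically so the individual intervals overlap while the difference is still significant):

```python
import math

mean_a, se_a = 12.0, 1.0
mean_b, se_b = 15.5, 1.0

ci_a = (mean_a - 1.96 * se_a, mean_a + 1.96 * se_a)   # ≈ [10.0, 14.0]
ci_b = (mean_b - 1.96 * se_b, mean_b + 1.96 * se_b)   # ≈ [13.5, 17.5], overlaps ci_a

diff = mean_b - mean_a
se_diff = math.sqrt(se_a**2 + se_b**2)                # ≈ 1.41, not se_a + se_b
ci_diff = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
print(ci_a, ci_b, ci_diff)                            # difference CI ≈ [0.7, 6.3], excludes zero
```

The key is that the standard error of the difference grows like the square root of the sum of squares, not like the sum, so the interval for the difference is narrower than the "eyeball overlap" test assumes.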


The Coherence Connection: Intervals as Uncertainty Quantification

Here's the conceptual thread.

Confidence intervals quantify epistemic uncertainty. You don't know the true value. The interval represents the range of values consistent with your data (under your assumptions).

In information-theoretic terms, the width of the interval reflects entropy. Narrow intervals = low entropy = high information. Wide intervals = high entropy = low information.

And this maps to M = C/T. Meaning is coherence over time. Statistical inference detects coherence—patterns that persist beyond random noise. Confidence intervals tell you how much coherence you've detected.

A narrow CI says: "The signal is strong. The pattern is clear." A wide CI says: "The signal is weak. We're still mostly uncertain."

And the confidence level (95%, 99%) is a threshold—how much evidence do you demand before you act on a claim? Higher confidence = more conservative = fewer false positives.

That's a tunable parameter. Science conventionally uses 95%. But in high-stakes domains (medical safety, structural engineering), you might demand 99% or 99.9%. The math doesn't change. Just the threshold.


Practical Workflow: How to Use Confidence Intervals

Here's the process (a short code sketch tying the steps together follows the list):

1. Calculate your point estimate (mean, proportion, difference, etc.).

2. Calculate the standard error. You need the sample standard deviation and sample size.

3. Choose your confidence level. 95% is standard. Adjust if needed.

4. Calculate the critical value (Z or t, depending on sample size).

5. Construct the interval: Point estimate ± (critical value × SE).

6. Interpret carefully. The interval tells you the range of plausible values, not the probability the parameter is inside.

7. Report it. Always report the interval, not just the point estimate. "Mean = 50" is incomplete. "Mean = 50, 95% CI [48, 52]" is informative.
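Here is the sketch promised above, tying steps 1 through 5 together for a sample mean (Python with NumPy and SciPy assumed; the data are simulated for illustration):

```python
import numpy as np
from scipy import stats

def confidence_interval(x, confidence=0.95):
    """Point estimate, SE, critical value, and interval for a sample mean."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    estimate = x.mean()                                       # 1. point estimate
    se = x.std(ddof=1) / np.sqrt(n)                           # 2. standard error
    t_star = stats.t.ppf(0.5 + confidence / 2, df=n - 1)      # 3-4. level -> critical value
    return estimate, (estimate - t_star * se, estimate + t_star * se)   # 5. interval

sample = np.random.default_rng(7).normal(50, 5, size=80)      # toy data
est, (lo, hi) = confidence_interval(sample)
print(f"Mean = {est:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")       # 7. report both
```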


What's Next

Confidence intervals quantify uncertainty in your estimates. But they don't tell you whether an effect is real or just random noise.

That's the domain of hypothesis testing—the machinery for deciding "Is this pattern signal or noise?"

Next up: Hypothesis Testing Explained—how we formalize the question "Is the effect real?"


Further Reading

  • Cumming, G. (2014). "The new statistics: Why and how." Psychological Science, 25(1), 7-29.
  • Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). "The fallacy of placing confidence in confidence intervals." Psychonomic Bulletin & Review, 23(1), 103-123.
  • Neyman, J. (1937). "Outline of a theory of statistical estimation based on the classical theory of probability." Philosophical Transactions of the Royal Society A, 236(767), 333-380.
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). "Moving to a world beyond 'p < 0.05'." The American Statistician, 73(sup1), 1-19.

This is Part 5 of the Statistics series, exploring how we extract knowledge from data. Next: "Hypothesis Testing Explained."

