Sampling and Populations: Part Representing Whole


In 1936, Literary Digest magazine conducted the largest election poll in history. They mailed 10 million ballots. They got 2.4 million responses. Their prediction: Republican Alf Landon would crush Franklin D. Roosevelt in a landslide.

Roosevelt won 46 of 48 states. The largest poll in history was catastrophically wrong.

Meanwhile, a young statistician named George Gallup surveyed only about 50,000 people—roughly 2% of the Digest's response count—and predicted the outcome almost perfectly.

What happened?

Sample size doesn't matter if your sample is garbage. The Literary Digest polled their subscribers plus car owners and telephone users. In 1936, those were wealthy people—not representative of the voting population. Their sample was huge but systematically biased.

Gallup used random sampling. His 50,000 respondents were demographically representative. Small but representative beats large but biased every time.

This is the core problem of statistical inference: How do you generalize from a sample to a population? And the answer is not "measure more." The answer is "measure right."

This article explains how sampling works, why it's both powerful and fragile, and what happens when you get it wrong.


The Conceptual Leap: From Sample to Population

Here's the fundamental logic:

Population: The entire group you care about. All humans. All voters. Every possible outcome of a process.

Sample: The subset you actually measure. 1,000 survey respondents. 500 trial participants. The data you collected.

The leap: You measure the sample and infer properties of the population.

That leap is not magic. It's mathematics. But it only works under specific conditions:

  1. The sample must be representative. It must reflect the population's structure.
  2. The sample must be random (or account for non-randomness). Systematic bias breaks everything.
  3. The sample must be large enough. Too small, and random variation dominates.

Get those right, and you can make astonishingly accurate claims about billions of people based on a few thousand measurements. Get them wrong, and you're just extrapolating noise.


Why Sampling Works: The Law of Large Numbers

Flip a coin once. You get heads or tails. 50/50 probability doesn't help you predict that flip.

Flip a coin 10 times. You might get 7 heads, 3 tails. Still noisy.

Flip a coin 1,000 times. You'll get close to 500 heads, 500 tails.

Flip a coin 1,000,000 times. You'll get very close to 50/50.

This is the Law of Large Numbers: As your sample size increases, the sample mean converges to the population mean.

More formally: If you repeatedly sample from a population and calculate the mean each time, those sample means will cluster tighter and tighter around the true population mean as sample size grows.

Why this matters:

You can't measure every voter. But if you randomly sample 1,000 voters, the proportion who support Candidate A in your sample will be close to the true proportion in the population.

How close? That's quantifiable—it's the standard error, which decreases as sample size grows.

But here's the kicker: the Law of Large Numbers assumes your samples are independent and identically distributed (i.i.d.). If you're sampling from a biased pool, large numbers don't save you. You just converge on the wrong answer with high confidence.
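To see both points in action, here's a minimal simulation in Python (using NumPy; the seed, sample sizes, and the 60% figure for the biased pool are all arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fair coin: the running proportion of heads converges to 0.5
flips = rng.integers(0, 2, size=1_000_000)  # 0 = tails, 1 = heads
for n in (10, 1_000, 1_000_000):
    print(f"n={n:>9,}: proportion of heads = {flips[:n].mean():.4f}")

# Biased pool: sampling from a skewed subpopulation (60% support instead
# of a true 50%) converges just as tightly -- to the wrong answer.
biased = rng.random(1_000_000) < 0.60
print(f"biased pool, n=1,000,000: estimated support = {biased.mean():.4f}")
```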


Random Sampling: The Gold Standard

Random sampling means every member of the population has an equal chance of being selected.

Why does this matter?

Because randomness eliminates systematic bias. If you pick 1,000 people truly at random from the U.S. population, you'll get a representative mix of ages, genders, income levels, political affiliations, etc.—not because you tried to balance those factors, but because randomness naturally produces proportional representation at scale.

Simple Random Sampling

The purest form. You list every member of the population, assign each a number, and use a random number generator to pick your sample.

Pros: Unbiased. Every subset is equally likely.

Cons: Requires a complete list of the population (often impossible). Inefficient for rare subgroups.
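A sketch of what this looks like in code, assuming you actually have the full sampling frame (the person labels here are placeholders):

```python
import random

# The complete sampling frame: every member of the population, listed.
population = [f"person_{i}" for i in range(100_000)]

random.seed(1)
sample = random.sample(population, k=1_000)  # every subset of size 1,000 equally likely
print(sample[:5])
```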

Stratified Sampling

You divide the population into strata (subgroups), then randomly sample within each stratum.

Example: You want to poll voters. You know the population is 51% female, 49% male. So for a 1,000-person survey, you sample 510 women and 490 men.

Pros: Ensures representation of key subgroups. Reduces variance.

Cons: Requires knowing the population's structure in advance.
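A minimal sketch of proportional allocation using the 51/49 split from the example (the stratum labels and sampling frame are made up for illustration):

```python
import random

random.seed(1)

# Hypothetical sampling frame with a known stratum label for each person.
population = (
    [("woman", f"w_{i}") for i in range(51_000)]
    + [("man", f"m_{i}") for i in range(49_000)]
)

# Allocate the 1,000-person sample proportionally to each stratum,
# then draw a simple random sample *within* each stratum.
targets = {"woman": 510, "man": 490}
sample = []
for stratum, size in targets.items():
    members = [person for label, person in population if label == stratum]
    sample.extend(random.sample(members, size))

print(len(sample))  # 1000, with exactly 510 women and 490 men guaranteed
```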

Cluster Sampling

You divide the population into clusters (e.g., counties, schools, city blocks), randomly select clusters, then measure everyone in those clusters.

Pros: Logistically easier. You don't need a list of every individual—just a list of clusters.

Cons: Higher variance than simple random sampling. If clusters differ systematically, bias creeps in.
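A rough sketch, assuming you have a list of clusters (here, hypothetical city blocks) but no list of individuals:

```python
import random

random.seed(1)

# Hypothetical clusters: 200 city blocks, each with 20-60 residents.
clusters = {
    f"block_{b}": [f"resident_{b}_{i}" for i in range(random.randint(20, 60))]
    for b in range(200)
}

# Randomly select whole clusters, then measure everyone inside them.
chosen_blocks = random.sample(list(clusters), k=10)
sample = [person for block in chosen_blocks for person in clusters[block]]
print(len(chosen_blocks), len(sample))
```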

Systematic Sampling

You sample every kth member. E.g., every 10th person on a list.

Pros: Simple. Works if the list is randomly ordered.

Cons: Fails catastrophically if there's periodicity in the list. (Imagine sampling every 7th day for weekly trends—you'd miss everything.)
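In code, systematic sampling is essentially a slice, ideally with a random starting offset (the roster and interval here are illustrative):

```python
import random

random.seed(1)
roster = [f"person_{i}" for i in range(10_000)]

k = 10                       # sampling interval
start = random.randrange(k)  # random starting offset, 0..k-1
sample = roster[start::k]    # every 10th person from that offset
print(len(sample))           # ~1,000
```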


The Central Limit Theorem: Why Sampling Gets Easier at Scale

Here's the magic result that makes statistical inference possible.

The Central Limit Theorem (CLT): If you take repeated random samples from any population (with finite variance) and calculate the sample mean each time, those sample means will be normally distributed, even if the population itself is not normal.

And the mean of those sample means equals the population mean. The standard deviation of those sample means (the standard error) equals $\sigma/\sqrt{n}$, so it shrinks as the sample size $n$ grows.

What this means in practice:

You don't need to know the population's distribution. You just need a large enough sample. The CLT guarantees that your sample mean will be approximately normally distributed around the true mean.

This is why confidence intervals work. This is why hypothesis tests work. The CLT is the foundation of almost all inferential statistics.

How large is "large enough"?

Rule of thumb: $n \geq 30$. For highly skewed distributions, you might need $n \geq 50$ or more. For near-normal distributions, even $n = 10$ might suffice.

But the CLT assumes i.i.d. samples. If your sampling process is biased, all bets are off.
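Here's a minimal demonstration with a deliberately skewed population (an exponential distribution; the sample size and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: a heavily right-skewed exponential distribution with mean 1.
population_mean = 1.0
n = 50

# Draw 10,000 independent samples of size n and record each sample mean.
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}  (population mean = {population_mean})")
print(f"std of sample means:  {sample_means.std(ddof=1):.3f}  (theory: sigma/sqrt(n) = {1/np.sqrt(n):.3f})")
# A histogram of sample_means would look approximately bell-shaped,
# even though the underlying population is strongly skewed.
```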


Sampling Error vs. Sampling Bias

Two different problems. Often confused.

Sampling Error: Random Variation

Sampling error is the difference between your sample statistic and the true population parameter due to random chance.

Even with perfect random sampling, your sample mean won't exactly equal the population mean. There's noise. If you sample again, you'll get a slightly different mean.

This is not a problem. It's just randomness. And it's quantifiable—you can calculate the standard error and build confidence intervals around your estimate.

Sampling error decreases with larger samples. More data = less noise.
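For example, here's the standard error of a sample proportion shrinking as $n$ grows (the 52% support figure is just an illustration):

```python
import math

p_hat = 0.52  # observed sample proportion (illustrative)
for n in (100, 1_000, 10_000):
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    print(f"n={n:>6,}: standard error = {se:.4f}  (95% CI half-width ~ {1.96 * se:.4f})")
# At n = 1,000 the half-width is about 0.031 -- the familiar +/-3% of national polls.
```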

Sampling Bias: Systematic Skew

Sampling bias happens when your sampling method systematically over- or under-represents certain groups.

Examples:

  • Polling only landline phones (excludes young people).
  • Medical trials that exclude women (fails to capture sex-specific effects).
  • Psychology experiments on college students (WEIRD populations: Western, Educated, Industrialized, Rich, and Democratic).
  • Survivor bias (only analyzing successful cases, ignoring failures).

Sampling bias does NOT decrease with larger samples. You just get more precise estimates of the wrong value.

The Literary Digest debacle: 2.4 million responses, massively biased sample. Gallup: 50,000 responses, representative sample. Bias beats size.


Common Sampling Pitfalls

Convenience Sampling

You sample whoever is easy to reach—students in your class, people on the street, users who respond to your survey.

The problem: People who are easy to reach differ systematically from people who aren't. Students differ from non-students. Respondents differ from non-respondents.

This is not random sampling. It's sampling bias disguised as practicality.

Voluntary Response Bias

You post a survey online. People choose whether to respond.

The problem: People with strong opinions respond. People who don't care ignore it. Your sample over-represents extremes.

Internet polls are notoriously terrible because of this. "97% of our users love this feature!" Yeah, because the people who hate it didn't bother responding.

Survivorship Bias

You analyze only the cases that "survived" some selection process, ignoring the cases that didn't.

Famous example: During WWII, analysts studied bullet holes in returning aircraft to decide where to add armor. Planes came back with holes in the wings and fuselage but not the engines.

Statistician Abraham Wald pointed out: You're only seeing the planes that survived. The ones shot in the engines didn't come back. Armor the engines, not the wings.

The lesson: If your sample is filtered by an outcome you care about, your conclusions will be backwards.

Non-Response Bias

You send out 10,000 surveys. Only 1,000 respond. You analyze those 1,000.

The problem: The 9,000 who didn't respond might differ systematically from the 1,000 who did. If non-respondents are disproportionately poor, young, or disengaged, your sample is biased.

This is a huge problem in polling. Response rates have collapsed over the past 30 years. If only 5% of people answer polls, are they representative? Probably not.


Sample Size: How Much Data Do You Need?

Everyone asks: "How big should my sample be?"

The answer is: It depends on what you're trying to measure and how precise you want to be.

The Formula

For estimating a population mean with a desired margin of error:

$$n = \left( \frac{Z \cdot \sigma}{E} \right)^2$$

Where:

  • $n$ = required sample size
  • $Z$ = Z-score for your confidence level (e.g., 1.96 for 95% confidence)
  • $\sigma$ = population standard deviation (often unknown, estimated from a pilot study)
  • $E$ = desired margin of error

Key insights:

  1. Precision costs quadratically. To cut your margin of error in half, you need 4× the data (see the sketch after this list).
  2. Variability matters. High-variance populations require larger samples.
  3. Confidence matters. 99% confidence requires more data than 95% confidence.
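Here's that formula as a small helper, with illustrative numbers ($\sigma = 15$ estimated from a hypothetical pilot study); note how halving the margin of error roughly quadruples the required sample:

```python
import math

def required_sample_size(sigma: float, margin_of_error: float, z: float = 1.96) -> int:
    """Sample size needed to estimate a mean within +/- margin_of_error."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

print(required_sample_size(sigma=15, margin_of_error=2))  # ~217
print(required_sample_size(sigma=15, margin_of_error=1))  # ~865 -- about 4x the data for half the error
```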

Diminishing Returns

Going from 10 to 100 samples massively improves precision. Going from 1,000 to 10,000 barely helps.

This is why national polls use ~1,000 respondents. That's enough to get ±3% margin of error at 95% confidence. Going to 10,000 would only improve it to ±1%, which isn't worth the cost.

Rare Events Require Larger Samples

If you're studying something common (50% of people have the trait), $n = 1,000$ works great.

If you're studying something rare (1% of people have the trait), you need way more data. With $n = 1,000$, you'll only see ~10 cases. That's too few for reliable inference.

For rare events, you need $n$ large enough that you expect dozens or hundreds of positive cases.


Stratification: When You Need to Guarantee Representation

Sometimes random sampling isn't enough. If your population has important subgroups, you might under-sample rare groups by chance.

Example: You're polling a country that's 80% ethnic majority, 20% ethnic minority. With simple random sampling of 1,000 people, you'd expect ~200 from the minority—but you might get 180 or 220 just by chance.

Solution: Stratified sampling. You divide the sample proportionally: 800 from the majority, 200 from the minority. Now you guarantee proportional representation.

Benefit: Lower variance. More precise estimates for each subgroup.

Cost: You need to know the population's structure in advance (the proportions of each stratum).


Weighting: Fixing Non-Representative Samples

Sometimes you can't get a representative sample. Certain groups are hard to reach. So you oversample them, then weight the results to adjust for the imbalance.

Example: Your survey respondents skew older (because young people don't answer surveys). You know the population is:

  • 50% ages 18-40
  • 50% ages 40+

But your sample is:

  • 30% ages 18-40
  • 70% ages 40+

Solution: Weight the 18-40 responses more heavily (multiply by 50/30) and the 40+ responses less (multiply by 50/70). This corrects for the imbalance.
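A minimal sketch of that correction, with made-up responses just to show the mechanics:

```python
# Post-stratification weights: population share / sample share.
weights = {
    "18-40": 0.50 / 0.30,  # under-represented group gets weight > 1
    "40+":   0.50 / 0.70,  # over-represented group gets weight < 1
}

# Hypothetical responses (1 = approves, 0 = does not), tagged by age band.
responses = [("18-40", 1), ("18-40", 0), ("40+", 0), ("40+", 0), ("40+", 1)]

weighted_sum = sum(weights[group] * value for group, value in responses)
total_weight = sum(weights[group] for group, _ in responses)
print(f"weighted approval rate: {weighted_sum / total_weight:.3f}")
```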

Caution: Weighting only fixes known imbalances. If there are unknown biases, weighting can't help.


The Population Doesn't Always Exist

Here's a conceptual problem that breaks people's brains.

In survey research, the population is concrete: all U.S. voters, all consumers, etc.

But in experimental science, the population is hypothetical.

You run a drug trial with 500 participants. What's the population? "All people who might take the drug"? That includes people not yet born. People who don't exist yet.

Or consider physics: you measure the speed of light 1,000 times. What's the population? "All possible measurements"? That's an abstract ensemble, not a real group.

In these cases, the population is a theoretical construct—the infinite set of possible observations under the same conditions. Your sample is a finite draw from that infinite possibility space.

This works mathematically. But it requires a conceptual shift: the population isn't "out there." It's a model.


The Coherence Connection: Sampling as Compression

Here's the deeper pattern.

Sampling is lossy compression. You can't measure everything, so you measure a subset. You lose information. But if your sample is representative, you preserve the essential structure—the coherence—of the whole.

This connects to M = C/T. The population has some underlying structure (coherence). A representative sample captures that structure. A biased sample distorts it.

In information theory terms, a representative sample has high mutual information with the population. The sample tells you about the whole. A biased sample has low mutual information—it tells you about itself but not the population.

And here's the key: randomness is what preserves structure at scale. If you cherry-pick samples, you inject your own biases. If you sample randomly, the biases cancel out. Randomness is the closest thing we have to "letting reality speak for itself."

That's why random sampling is the foundation of science. It's the mechanism that lets partial knowledge generalize.


Practical Workflow: How to Sample Well

Here's the checklist:

1. Define your population precisely. "All humans" is too vague. "All U.S. adults aged 18+ who are registered voters" is specific.

2. Use random sampling if possible. It's the gold standard. Everything else is a compromise.

3. Stratify if key subgroups matter. Ensures representation. Reduces variance.

4. Calculate required sample size. Use the formula. Don't just guess.

5. Check for bias. Compare your sample demographics to the known population. If they differ, your sample is biased.

6. Weight if necessary. Correct for known imbalances. But acknowledge limitations.

7. Report your sampling method. Be transparent. Let readers judge representativeness.

8. Don't confuse sample size with representativeness. 10,000 biased samples < 1,000 random samples.


What's Next

Sampling lets you make claims about populations. But how confident should you be in those claims?

That's where confidence intervals come in. They quantify the uncertainty in your estimates—the range of values the true population parameter is likely to fall within.

Next up: Confidence Intervals Explained—how we go from "our sample mean is 50" to "the population mean is probably between 48 and 52."


Further Reading

  • Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.
  • Lohr, S. L. (2019). Sampling: Design and Analysis (2nd ed.). CRC Press.
  • Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). W.W. Norton & Company. (Chapter on sampling.)
  • Wald, A. (1943). "A method of estimating plane vulnerability based on damage of survivors." Statistical Research Group, Columbia University.

This is Part 4 of the Statistics series, exploring how we extract knowledge from data. Next: "Confidence Intervals Explained."


Previous: Descriptive Statistics: Summarizing Data Next: Confidence Intervals: Quantifying Uncertainty in Estimates