Synthesis: Statistics as the Science of Learning from Data
We've covered the machinery: descriptive statistics, sampling, confidence intervals, hypothesis testing, regression, ANOVA, chi-square. The tools of statistical inference.
But step back. What is statistics actually doing?
Here's the reframe: Statistics is formalized coherence detection. It's a rigorous method for distinguishing structure (patterns that persist, replicate, generalize) from noise (random fluctuation, sampling variation, entropy).
Every statistical test asks the same fundamental question: "Is there enough coherence here to warrant belief?"
- Hypothesis testing: Is this pattern coherent enough to reject randomness?
- Confidence intervals: What range of values is coherent with this data?
- Regression: How much coherent variance does X explain in Y?
- P-values: How surprised should I be if there's no coherent structure?
And when statistics breaks—when we get replication failures, false positives, biased literatures—it's because we mistook noise for coherence or failed to detect coherence that exists.
This article synthesizes the series: what statistics measures, why it works, where it fails, and how to think about it as signal detection in a noisy world.
Statistics as Entropy Reduction
In information theory, entropy measures uncertainty. Maximum entropy = uniform randomness. No structure. No predictability.
Statistics detects departures from maximum entropy. It finds structure—correlations, group differences, predictable relationships.
A low p-value says: "This data has lower entropy than random noise would produce." There's structure here.
A high R² says: "X reduces Y's entropy. Knowing X makes Y more predictable."
A narrow confidence interval says: "The parameter has low epistemic entropy. We've pinned it down."
Conversely, high variance, null results, and wide intervals signal high entropy—the data is noisy, uncertain, unstructured.
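To make that concrete, here's a minimal sketch (in Python, assuming numpy; the slope and noise level are arbitrary choices) showing how knowing X shrinks our uncertainty about Y, which is exactly what a nonzero R² reports:

```python
import numpy as np

rng = np.random.default_rng(42)

# Y depends on X plus noise: knowing X should reduce our uncertainty about Y.
n = 10_000
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)

# Marginal spread of Y (ignoring X entirely).
sd_marginal = y.std()

# Conditional spread of Y after a least-squares fit on X.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
sd_conditional = residuals.std()

r_squared = 1 - residuals.var() / y.var()

print(f"SD of Y ignoring X: {sd_marginal:.2f}")      # ~2.24
print(f"SD of Y given X:    {sd_conditional:.2f}")   # ~1.00
print(f"R^2:                {r_squared:.2f}")        # ~0.80
```

For Gaussian variables, entropy grows with the log of the standard deviation, so the drop from roughly 2.24 to 1.00 is literally a reduction in conditional entropy.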
And this connects to M = C/T (meaning equals coherence over time). Statistical significance is a claim about coherence—a claim that the pattern will replicate, that it's not just noise in this sample.
But here's the key: coherence detection requires assumptions. If your assumptions fail (non-independence, confounding, measurement error), you detect false coherence or miss true coherence.
The Two-Way Error: Signal Detection Theory
Every statistical test is a signal detection problem. You're deciding: signal or noise?
Four outcomes:
| | Signal Present | Signal Absent |
|---|---|---|
| You Say "Signal" | True Positive (Hit) | False Positive (Type I) |
| You Say "Noise" | False Negative (Type II) | True Negative (Correct Rejection) |
The trade-off: You can't minimize both errors simultaneously. Lowering your threshold (easier to call "signal") increases hits but also false alarms. Raising it decreases false alarms but increases misses.
Science chose to minimize false positives (Type I, α = 0.05) and largely ignore false negatives (Type II, β often 0.20-0.80).
The result: A literature biased toward flukes. We overstate effect sizes (the winner's curse). We miss real but subtle effects (low power). We confuse "the data would be unlikely if there were no effect" with "the effect is probably real" (base rate neglect).
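Here's base rate neglect in numbers (a sketch; the 10% base rate and 35% power are illustrative assumptions, not estimates from any particular field):

```python
# Of 1,000 hypotheses tested at alpha = 0.05, how many "discoveries" are real?
base_rate = 0.10   # share of tested hypotheses that are actually true (assumed)
alpha = 0.05       # Type I error rate
power = 0.35       # 1 - beta, typical of an underpowered literature (assumed)

true_positives = 1000 * base_rate * power          # real effects we detect
false_positives = 1000 * (1 - base_rate) * alpha   # flukes that cross the threshold
ppv = true_positives / (true_positives + false_positives)

print(f"True positives per 1000 tests:  {true_positives:.0f}")    # 35
print(f"False positives per 1000 tests: {false_positives:.0f}")   # 45
print(f"Share of 'significant' results that are real: {ppv:.0%}") # ~44%
```

Even with α = 0.05, fewer than half of these "discoveries" are real, because false positives from the many true nulls swamp the hits.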
The fix: Balance both errors. Pre-register studies. Report null results. Demand replication. Use Bayesian methods that quantify evidence rather than binary thresholds.
Why Statistics Fails: Violating the Coherence Assumption
Statistics assumes the pattern you detect is stable (will replicate) and generalizable (applies beyond your sample).
But it can fail in systematic ways:
1. Sampling Bias
Your sample doesn't represent the population. The pattern exists in your sample but not in reality.
- Polling by landline in 2024 (younger people rarely have landlines).
- Medical trials that exclude women (drug effects can differ by sex).
- Psychology studies run on WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples, which often don't generalize to other cultures.
Result: You detect coherence that's local (to your biased sample) but not global.
2. P-Hacking and HARKing
You torture the data until it confesses. Multiple tests, optional stopping, selective reporting.
Result: You detect noise and call it signal. The false-positive rate explodes.
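A quick simulation (assuming numpy and scipy; the ten outcomes and group size of 30 are arbitrary choices) shows how fast the false-positive rate explodes when you test many outcomes and report whichever comes out significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 2_000
n_outcomes = 10      # the researcher tests 10 outcomes and reports the best one
n_per_group = 30

false_positives = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_outcomes):
        # The null is true everywhere: both groups come from the same distribution.
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:      # "significant" on at least one outcome
        false_positives += 1

# Nominal rate is 5%; with 10 shots at significance it is roughly 40%.
print(f"False-positive rate: {false_positives / n_studies:.0%}")
```

With ten independent shots at α = 0.05, the chance of at least one false positive is 1 - 0.95^10 ≈ 0.40, and optional stopping or flexible exclusions push it higher still.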
3. Publication Bias
Only "significant" results get published. The file drawer fills with null results.
Result: The literature systematically overstates effect sizes. You think effects are larger and more reliable than they are.
4. Low Power
Your test has 30% power. Even if the effect is real, you'll miss it 70% of the time.
Result: You fail to detect true coherence. Type II error dominates.
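Power is easy to estimate by simulation under an assumed effect size. This sketch (assuming numpy/scipy; the effect size d = 0.3 and the sample sizes are illustrative) shows why small studies miss real effects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def estimated_power(n_per_group, effect_size_d, alpha=0.05, n_sims=5_000):
    """Monte Carlo estimate of power for a two-sample t-test."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(effect_size_d, 1, n_per_group)  # a real effect of size d
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A real but modest effect (d = 0.3) with a small sample is usually missed.
print(f"n=30 per group:  power ~ {estimated_power(30, 0.3):.0%}")   # roughly 20%
print(f"n=175 per group: power ~ {estimated_power(175, 0.3):.0%}")  # roughly 80%
```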
5. Confounding
You detect X → Y, but actually Z → X and Z → Y. X and Y correlate without causing each other.
Result: You detect spurious coherence. The pattern replicates (because Z exists) but the causal story is wrong.
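Here's a minimal simulation of that structure (assuming numpy): Z drives both X and Y, X has no effect on Y at all, yet the raw correlation looks impressive until Z is controlled.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 50_000
z = rng.normal(0, 1, n)          # the confounder
x = z + rng.normal(0, 1, n)      # X is driven by Z, not by Y
y = z + rng.normal(0, 1, n)      # Y is driven by Z, not by X

# The raw correlation looks like a real X -> Y relationship.
print(f"corr(X, Y):         {np.corrcoef(x, y)[0, 1]:.2f}")   # ~0.50

# Partial out Z from both X and Y, then correlate the residuals.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
print(f"corr(X, Y) given Z: {np.corrcoef(x_resid, y_resid)[0, 1]:.2f}")  # ~0.00
```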
6. Overfitting
You fit a model with too many parameters. It memorizes your training data—including the noise.
Result: Perfect fit on training data, catastrophic failure on test data. You modeled noise, not signal.
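A classic demonstration (a sketch assuming numpy; the degree-15 polynomial and 20 training points are arbitrary choices) fits an overly flexible model to a few noisy points, then checks it against fresh data from the same process:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # true signal plus noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(1_000)

for degree in (3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    # The flexible fit chases the noise in the 20 training points
    # and pays for it on data it has never seen.
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.2f}")
```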
What Statistics Can't Do
Let's be honest about the limits.
1. Statistics Can't Give You Priors
Frequentist statistics calculates $P(\text{data} | H_0)$. It doesn't give you $P(H_0 | \text{data})$—the probability the hypothesis is true given the data.
To get that, you need Bayes' theorem, which requires a prior—your belief before seeing the data.
P-values don't tell you "how likely it is the effect is real." They tell you "how surprising the data would be if the effect weren't real." That's not the same.
Solution: Use Bayesian inference. Or acknowledge the limits of frequentist tests.
2. Statistics Can't Prove Causation
Correlation isn't causation. Regression detects associations. Only experiments, or careful causal-inference designs (quasi-experiments, instrumental variables, causal graphs built on explicit assumptions), can establish causation.
Example: Ice cream sales correlate with drowning. Statistics detects the correlation. It can't tell you ice cream doesn't cause drowning. You need domain knowledge (hot weather causes both).
Solution: Design experiments. Use causal inference tools (RCTs, natural experiments, IVs, causal graphs).
3. Statistics Can't Fix Bad Data
Garbage in, garbage out. If your sample is biased, your measurement is noisy, your variables are confounded—no amount of statistical sophistication saves you.
Solution: Design better studies. Measure carefully. Sample representatively. Control confounds.
4. Statistics Can't Tell You What's Important
"Statistically significant" doesn't mean "practically important." With 10 million observations, a 0.01% effect becomes significant. But does it matter?
Solution: Report effect sizes, confidence intervals, and contextual relevance. Not just p-values.
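For a feel of the gap between the two, here's a sketch (assuming numpy/scipy; the 0.01-standard-deviation difference is an illustrative stand-in for a tiny effect) with ten million observations in total:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n = 5_000_000                      # 10 million observations in total
a = rng.normal(0.00, 1, n)
b = rng.normal(0.01, 1, n)         # true difference: 1% of a standard deviation

result = stats.ttest_ind(a, b)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var() + b.var()) / 2)

print(f"p-value:   {result.pvalue:.1e}")   # vanishingly small: "highly significant"
print(f"Cohen's d: {cohens_d:.3f}")        # ~0.01: almost certainly negligible in practice
```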
The Bayesian Alternative: Quantifying Evidence Continuously
Frequentist statistics is binary: reject or don't reject. $p < 0.05$ or $p \geq 0.05$. There's no middle ground.
Bayesian inference quantifies strength of evidence continuously.
- You start with a prior: your belief about the hypothesis before seeing data.
- You observe data.
- You update your belief using Bayes' theorem to get a posterior: your belief after seeing data.
The posterior tells you: "Given this data, how likely is the hypothesis?"
That's the question most people think p-values answer. Only Bayesian methods actually answer it.
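A minimal concrete case (assuming scipy; the conversion-rate setup and the uniform Beta(1, 1) prior are illustrative choices): for a proportion, the Beta prior is conjugate to the binomial likelihood, so the update is just addition, and the posterior supports direct probability statements about the hypothesis.

```python
from scipy import stats

# Prior: Beta(1, 1), a uniform belief about an unknown proportion.
prior_a, prior_b = 1, 1

# Data: 38 successes in 100 trials.
successes, trials = 38, 100

# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
posterior = stats.beta(prior_a + successes, prior_b + trials - successes)

print(f"Posterior mean:        {posterior.mean():.3f}")
low, high = posterior.interval(0.95)
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
print(f"P(rate > 0.5 | data):  {1 - posterior.cdf(0.5):.4f}")
```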
Bayes factors quantify evidence ratios:
- BF = 10: Data are 10× more likely under H₁ than H₀.
- BF = 0.1: Data are 10× more likely under H₀ than H₁.
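One simple way to compute such a Bayes factor (a sketch assuming scipy; the coin-flip setup and the uniform prior on θ under H₁ are illustrative, and real analyses need care in choosing that prior): compare the marginal likelihood of the data under a point null θ = 0.5 with the marginal likelihood under H₁, where θ is integrated over its prior.

```python
from scipy import stats
from scipy.integrate import quad

# Data: 62 heads in 100 flips of a possibly biased coin.
successes, trials = 62, 100

# Marginal likelihood under H0: theta fixed at 0.5.
m0 = stats.binom.pmf(successes, trials, 0.5)

# Marginal likelihood under H1: theta ~ Uniform(0, 1), integrated out.
m1, _ = quad(lambda theta: stats.binom.pmf(successes, trials, theta), 0, 1)

bf_10 = m1 / m0
print(f"BF_10 = {bf_10:.1f}")
# ~2.2: only weak evidence for H1, even though the two-sided p-value
# for 62/100 is about 0.02. "p < 0.05" is not the same as strong evidence.
```

A result that clears p < 0.05 can thus correspond to only weak evidence once the alternative is spelled out.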
Advantages:
- Intuitive interpretation (probability of hypothesis given data).
- Accumulates evidence across studies.
- No arbitrary thresholds.
- Handles small samples gracefully.
Disadvantages:
- Requires priors (contentious in some fields).
- Computationally intensive (MCMC, etc.).
- Less established conventions (what prior to use?).
The future: Many fields are shifting toward Bayesian methods, including psychology, medicine, and machine learning. Frequentist methods won't disappear, but Bayesian inference increasingly complements them.
The Replication Movement: Fixing What's Broken
The replication crisis forced a reckoning. The field realized: we've been doing this wrong.
What's changing:
1. Pre-registration. Specify hypothesis, method, and analysis before collecting data. Prevents p-hacking and HARKing.
2. Open data and code. Share everything. Let others verify and reanalyze. Transparency prevents fraud and error.
3. Replication studies. Journals now publish replications. Null results matter. One "significant" finding is weak evidence. Five replications are strong.
4. Larger samples. Power analysis is standard. Underpowered studies are recognized as unreliable.
5. Multi-site collaborations. Instead of one lab with n=50, ten labs with n=500 total. Increases power and generalizability.
6. Bayesian methods. Quantifying evidence, not just binary decisions. More nuanced, less prone to threshold effects.
7. Meta-analysis. Systematically combining results across studies. More reliable than any single study.
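As one concrete piece of that toolkit, here's a minimal fixed-effect meta-analysis (a sketch assuming numpy; the five effect estimates and standard errors are made up for illustration), which pools studies by weighting each estimate by the inverse of its variance:

```python
import numpy as np

# Illustrative effect estimates (e.g., standardized mean differences)
# and their standard errors from five hypothetical studies.
effects = np.array([0.42, 0.15, 0.30, 0.05, 0.25])
std_errs = np.array([0.20, 0.12, 0.15, 0.10, 0.18])

# Fixed-effect model: weight each study by 1 / variance.
weights = 1 / std_errs**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled effect: {pooled:.2f}  (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Real meta-analyses usually also fit random-effects models and check for heterogeneity and publication bias; this shows only the core pooling step.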
The result: Science is becoming more rigorous, transparent, and reliable. But the transition is slow and contentious.
Practical Wisdom: How to Use Statistics Rigorously
Here's the workflow:
1. Pre-register. Lock in your hypothesis, sample size, and analysis plan before you start.
2. Power your study. Calculate required sample size. Don't run underpowered studies.
3. Measure carefully. Reliable instruments. Representative samples. Control confounds.
4. Analyze transparently. Report all outcomes, all tests. Don't cherry-pick.
5. Adjust for multiple comparisons. If you test 20 hypotheses, correct your α (Bonferroni, FDR); see the sketch after this list.
6. Report effect sizes and CIs. Not just "significant" or "not significant." Show magnitude and uncertainty.
7. Interpret cautiously. "Significant" means "surprising under null," not "probably true." Acknowledge limits.
8. Replicate. One study is a data point. Replication is evidence.
9. Be Bayesian when appropriate. Quantify evidence. Update beliefs continuously. Don't worship thresholds.
10. Think causally. Use DAGs, experiments, natural experiments, IVs. Don't confuse correlation with causation.
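For step 5, here's what the two most common corrections look like (a sketch assuming numpy; the twenty p-values are simulated, with three small ones standing in for real effects): Bonferroni controls the chance of any false positive, while Benjamini-Hochberg controls the expected share of false discoveries.

```python
import numpy as np

rng = np.random.default_rng(11)

# Twenty p-values: seventeen from true nulls (uniform) plus three real effects.
p_values = np.sort(np.concatenate([rng.uniform(size=17), [0.0004, 0.002, 0.01]]))
m = len(p_values)
alpha = 0.05

# Bonferroni: control the chance of ANY false positive (strict).
bonferroni_rejections = np.sum(p_values < alpha / m)

# Benjamini-Hochberg: reject all p-values up to the largest rank i
# satisfying p_(i) <= (i / m) * alpha.
thresholds = alpha * np.arange(1, m + 1) / m
passing = np.nonzero(p_values <= thresholds)[0]
bh_rejections = passing.max() + 1 if passing.size else 0

print(f"Uncorrected (p < 0.05): {np.sum(p_values < alpha)} rejections")
print(f"Bonferroni:             {bonferroni_rejections} rejections")
print(f"Benjamini-Hochberg:     {bh_rejections} rejections")
```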
The Deeper Point: Science as Collective Coherence Detection
Science isn't individual tests. It's a social process of collective signal detection.
One study detects potential signal. It might be real. It might be noise.
Replication across labs, populations, and methods strengthens the signal. Or reveals it was noise.
Meta-analysis combines weak signals into stronger evidence. Or shows the effect doesn't exist.
Theory integrates findings into a coherent framework. Or shows they're contradictory.
Statistics is the local tool. Science is the global process. And the global process works—when we use the tools rigorously, replicate findings, and correct errors.
The replication crisis wasn't statistics failing. It was us failing to use statistics correctly. We treated p-values as magic. We ignored power. We p-hacked. We filed away null results.
The solution isn't "abandon statistics." It's use statistics rigorously, transparently, and humbly.
The Coherence Synthesis
Back to the core framework: M = C/T. Meaning equals coherence over time (or coherence over tension).
Statistics quantifies coherence. Low p-values, narrow CIs, high R²—these are claims about pattern stability. They say: "This structure will replicate. It's not just noise."
But coherence requires time (replication) and context (generalizability). One study is a moment. Replication across studies, labs, and populations is coherence over time.
And when we confuse one moment (one significant result) with coherence (the reliable pattern), we mistake noise for meaning.
Real coherence:
- Replicates across samples.
- Generalizes across contexts.
- Resists perturbation (remains after controlling confounds).
- Integrates with theory.
Spurious coherence:
- One fluke result.
- P-hacked, publication-biased.
- Disappears when replicated.
- Contradicts other evidence.
Statistics detects potential coherence. Science confirms it.
Further Reading
- Ioannidis, J. P. (2005). "Why most published research findings are false." PLoS Medicine, 2(8), e124.
- Open Science Collaboration. (2015). "Estimating the reproducibility of psychological science." Science, 349(6251), aac4716.
- McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). CRC Press.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
- Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). "Moving to a world beyond 'p < 0.05'." The American Statistician, 73(sup1), 1-19.
This is Part 13 of the Statistics series, exploring how we extract knowledge from data. This concludes the series. Return to the Series Hub.