P-Values: What They Actually Mean
Here's a sentence that sounds right but is catastrophically wrong:
"P < 0.05 means there's a 95% chance my hypothesis is true."
That's what most people think p-values mean. Including researchers. Including people who publish papers using p-values every day.
And it's completely false.
The p-value does not tell you the probability your hypothesis is true. It doesn't tell you the probability the null is false. It doesn't even tell you the probability your result will replicate.
So what does it tell you?
A p-value is the probability of observing data at least as extreme as yours, if the null hypothesis were true.
That's it. That's the definition. And it's not what anyone thinks it means.
This misunderstanding isn't trivial. It's why the replication crisis happened. It's why journals are full of false positives. It's why "statistically significant" has become a cargo-cult ritual instead of rigorous reasoning.
This article explains what p-values actually measure, why the intuitive interpretation is wrong, and how to think about them correctly—so you don't accidentally break science.
The Definition (Read This Slowly)
P-value: The probability of obtaining data at least as extreme as what you observed, assuming the null hypothesis is true.
Let's unpack that.
"Data at least as extreme": Not just your exact result. Any result as far from the null (or farther) in the direction you observed.
"If the null hypothesis were true": This is conditional. You're calculating a probability under the assumption that the null is correct.
Example: You flip a coin 100 times, get 65 heads. You test the null: "The coin is fair (50% heads)."
The p-value is: "If the coin were fair, how often would I get 65 or more heads (or 35 or fewer, for a two-tailed test) in 100 flips?"
You calculate this using the binomial distribution. The answer: $p \approx 0.003$. That's surprising. So you reject the null—probably not a fair coin.
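If you want to check that number yourself, here's a minimal sketch with SciPy (the exact two-tailed binomial test comes out around 0.0035; the normal approximation rounds down to roughly 0.003):

```python
from scipy import stats

# Exact two-tailed binomial test: 65 heads in 100 flips vs. a fair coin (p = 0.5).
result = stats.binomtest(65, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)  # ≈ 0.0035

# Same idea by hand: probability mass in both tails of Binomial(100, 0.5).
p_two_tailed = stats.binom.sf(64, 100, 0.5) + stats.binom.cdf(35, 100, 0.5)
print(p_two_tailed)   # ≈ 0.0035
```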
But notice: the p-value does not tell you "the probability the coin is biased." It tells you "the probability of this data if the coin were fair."
Those are not the same.
What P-Values Do NOT Mean
Let's catalog the errors.
Error 1: "P-value is the probability the null hypothesis is true."
Wrong. The p-value is calculated assuming the null is true. It's $P(\text{data} | H_0)$, not $P(H_0 | \text{data})$.
To get $P(H_0 | \text{data})$, you'd need Bayes' theorem. And that requires a prior—your belief about how likely $H_0$ is before seeing the data. Frequentist p-values don't use priors. So they can't give you $P(H_0 | \text{data})$.
Analogy: You test positive for a rare disease. The test has a 1% false-positive rate. Does that mean there's a 99% chance you have the disease?
No. If the disease is very rare (say, 1 in 10,000), most positive tests are false positives. You need to account for the base rate—the prior probability of the disease.
P-values don't account for base rates. They assume you're starting from "no effect" (the null) and only look at the data.
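To put numbers on the analogy (assuming, for simplicity, a test that catches every true case and a disease prevalence of 1 in 10,000):

$$P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease}) \, P(\text{disease})}{P(+)} = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.01 \times 0.9999} \approx 0.01$$

Despite the test's 1% false-positive rate, a positive result means only about a 1% chance of disease, because true cases are so rare to begin with.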
Error 2: "P < 0.05 means there's a 95% chance the effect is real."
Wrong. The p-value is not the probability the effect is real. It's the probability of your data under the null.
Even if $p = 0.01$, that doesn't mean "99% chance the effect exists." It means "1% chance of data this extreme if there's no effect."
Those statements are not equivalent. To convert one to the other, you'd need to know the prior probability that an effect exists. And p-values don't use priors.
Error 3: "A smaller p-value means a larger or more important effect."
Wrong. P-values confound effect size and sample size.
You can get $p < 0.001$ with a tiny, meaningless effect—if your sample is huge.
Conversely, you can get $p = 0.20$ with a massive, important effect—if your sample is tiny.
Example:
- Study A: 10,000 people. Drug improves symptoms by 0.5%. $p = 0.001$. (Statistically significant, practically useless.)
- Study B: 50 people. Drug improves symptoms by 30%. $p = 0.08$. (Not significant, but huge effect—just underpowered.)
The p-value tells you about statistical detectability, not effect magnitude.
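A quick sketch of that confounding, using the standard two-sample t formula (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

def two_sample_p(diff, sd, n_per_group):
    """Two-tailed p-value for an observed mean difference between two equal-sized,
    equal-variance groups."""
    se = sd * np.sqrt(2 / n_per_group)          # standard error of the difference
    t = diff / se
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t), df)           # area in both tails

# Same outcome variability, wildly different sample sizes:
print(two_sample_p(diff=0.05, sd=1.0, n_per_group=10_000))  # tiny effect, p ≈ 0.0004
print(two_sample_p(diff=0.80, sd=1.0, n_per_group=10))      # big effect,  p ≈ 0.09
```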
Error 4: "P = 0.051 means no effect. P = 0.049 means there's an effect."
Wrong. The p-value is continuous. There's nothing magical about 0.05.
$p = 0.051$ and $p = 0.049$ are functionally identical. Both say: "This data is somewhat surprising under the null." One is arbitrarily just below the threshold, one just above.
Treating 0.05 as a hard cutoff—"significant" vs. "not significant"—distorts reasoning. It makes people think there's a qualitative difference between $p = 0.049$ (effect exists!) and $p = 0.051$ (no effect!). There isn't.
Error 5: "Non-significant means the null is true."
Wrong. "Fail to reject the null" doesn't mean "accept the null."
Maybe there is an effect, but your sample was too small to detect it. Maybe your measurement was too noisy. Maybe the effect exists, but only in certain subgroups.
"Non-significant" means "insufficient evidence," not "no effect."
And here's the kicker: with a small sample, you can fail to reject the null even when the effect is enormous. That's low statistical power. We'll return to this.
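How bad can this get? A one-line power calculation with statsmodels gives a feel for it (the 15-per-group sample size is an arbitrary illustration):

```python
from statsmodels.stats.power import TTestIndPower

# Probability of reaching p < 0.05 for a genuinely large effect (Cohen's d = 0.8)
# with only 15 participants per group.
power = TTestIndPower().power(effect_size=0.8, nobs1=15, alpha=0.05)
print(power)  # roughly 0.55: the study misses a big, real effect almost half the time
```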
The Prosecutor's Fallacy: Inverting Conditional Probabilities
The core confusion is a textbook example of the prosecutor's fallacy—inverting conditional probabilities.
What the p-value gives you: $P(\text{data} | H_0)$. "How likely is this data, if the null is true?"
What people think it gives them: $P(H_0 | \text{data})$. "How likely is the null, given this data?"
These are not the same. And confusing them is formally invalid.
Bayes' theorem shows the relationship:
$$P(H_0 | \text{data}) = \frac{P(\text{data} | H_0) \cdot P(H_0)}{P(\text{data})}$$
To get $P(H_0 | \text{data})$, you need:
- $P(\text{data} | H_0)$ — that's the p-value.
- $P(H_0)$ — the prior probability of the null. How likely was "no effect" before you ran the experiment?
- $P(\text{data})$ — the total probability of the data under all hypotheses.
P-values only give you the first term. Without priors, you can't invert the probability.
Example:
You're testing a psychic. They guess 70 out of 100 coin flips correctly. $p = 0.0001$. Very unlikely under the null (random guessing).
Does that mean there's a 99.99% chance they're psychic?
No. Because the prior probability of psychic powers is extremely low. Even with strong data, the posterior probability of psychic powers is still tiny—most likely, they cheated or you made a measurement error.
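Here's a rough back-of-the-envelope version. The one-in-a-million prior and the 70%-accuracy model of a "psychic" are illustrative assumptions, not measured quantities:

```python
from scipy import stats

prior_odds = 1e-6 / (1 - 1e-6)   # assumed prior: one-in-a-million chance of psychic powers

# Likelihood of exactly 70/100 correct guesses under each hypothesis
# (modelling a "psychic" as someone who guesses correctly 70% of the time).
p_data_psychic = stats.binom.pmf(70, 100, 0.7)
p_data_chance  = stats.binom.pmf(70, 100, 0.5)

bayes_factor   = p_data_psychic / p_data_chance    # how strongly the data favour "psychic"
posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)  # well under 1%, despite the impressive-looking p-value
```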
P-values ignore priors. So they can't tell you what's "probably true."
How P-Values Are Actually Calculated
Let's demystify the mechanics.
Step 1: State the Null Hypothesis
Example: "The drug has no effect. Mean improvement = 0."
Step 2: Collect Data and Calculate a Test Statistic
You measure outcomes in drug and placebo groups. You calculate the difference in means and scale it by the standard error:
$$t = \frac{\bar{x}_{\text{drug}} - \bar{x}_{\text{placebo}}}{\text{SE}_{\text{diff}}}$$
This $t$-value tells you: "How many standard errors is the observed difference from zero?"
Step 3: Calculate the P-Value
You ask: "If the null is true (mean difference = 0), what's the probability of seeing a $t$-value this large (or larger)?"
You look up the $t$-value in the t-distribution (or use software). The tail area beyond your $t$-value is the p-value.
Example:
- $t = 2.5$ on a t-distribution with 100 degrees of freedom.
- The p-value (two-tailed) is $p \approx 0.014$.
Interpretation: "If the drug had no effect, we'd see a difference this large (or larger) about 1.4% of the time."
That's surprising. So we reject the null.
But notice: we're not saying "there's a 98.6% chance the drug works." We're saying "the null is implausible given this data."
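To reproduce that tail-area lookup with SciPy:

```python
from scipy import stats

t_value, df = 2.5, 100

# Two-tailed p-value: total probability in both tails beyond |t| = 2.5.
p = 2 * stats.t.sf(t_value, df)
print(p)  # ≈ 0.014
```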
The Replication Crisis: What Happens When P-Values Are Misunderstood
Here's where the wheels come off.
The Base Rate Fallacy
Imagine 1,000 researchers test hypotheses that are actually false (no real effect—the null is true).
With $\alpha = 0.05$, we'd expect ~50 of them to get $p < 0.05$ just by chance. That's the false-positive rate.
Those 50 "significant" results get published. The 950 "non-significant" results don't.
The literature now contains 50 false positives and 0 true findings. 100% of published results are false.
Now add hypotheses that are true. Assume 100 true effects exist, each tested once. If statistical power is 80%, we detect 80 of them.
Published results:
- 50 false positives (from the 1,000 false nulls).
- 80 true positives (from the 100 true effects).
False discovery rate: $50 / (50 + 80) \approx 38\%$.
Nearly 40% of "significant" results are false positives—even with no p-hacking, no publication bias, and 80% power.
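The arithmetic above generalizes to a one-line formula. Here's a small sketch you can use to see how the false discovery rate moves with power and with the share of true hypotheses:

```python
def false_discovery_rate(n_false_nulls, n_true_effects, alpha=0.05, power=0.80):
    """Expected share of 'significant' results that are false positives."""
    false_positives = alpha * n_false_nulls
    true_positives = power * n_true_effects
    return false_positives / (false_positives + true_positives)

print(false_discovery_rate(1_000, 100))              # ≈ 0.38, the scenario above
print(false_discovery_rate(1_000, 100, power=0.30))  # ≈ 0.63 once studies are underpowered
```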
Now add p-hacking (researchers trying multiple tests until they get $p < 0.05$). Add publication bias (journals rejecting null results). Add low power (many studies have <50% power).
The false discovery rate skyrockets. Some estimates put it at >50% in psychology, medicine, and economics.
This is the replication crisis. The published literature is full of effects that don't exist—because p-values were misinterpreted as "probability the finding is true."
The Solution: Adjust for Multiple Testing
If you run 20 tests, you'd expect 1 to be "significant" just by chance.
Solution: Adjust your threshold. Use Bonferroni correction: divide $\alpha$ by the number of tests.
For 20 tests, use $\alpha = 0.05 / 20 = 0.0025$.
Now the chance of any false positive across all 20 tests (the family-wise error rate) stays at or below 5%, rather than 5% per test.
Or: Use False Discovery Rate (FDR) control (Benjamini-Hochberg procedure), which is less conservative.
But the key point: you can't ignore multiple comparisons. Every extra test you run inflates your false-positive rate.
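Both corrections are one call in statsmodels. The 20 p-values below are made up for illustration:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# 20 p-values from 20 tests (illustrative numbers).
p_values = np.array([0.001, 0.003, 0.006, 0.008, 0.010, 0.020, 0.040, 0.060, 0.090, 0.120,
                     0.180, 0.250, 0.310, 0.400, 0.480, 0.570, 0.660, 0.750, 0.850, 0.940])

# Bonferroni: controls the chance of ANY false positive across all 20 tests.
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the expected proportion of false discoveries.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(reject_bonf.sum(), reject_bh.sum())  # 1 vs. 5 here: BH is the less conservative of the two
```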
P-Hacking: How to Lie With P-Values
Even if you're not deliberately cheating, there are a dozen ways to manipulate p-values.
1. Optional Stopping
You collect data. You check if $p < 0.05$. Not yet? Collect more data. Check again. Repeat until you get significance.
The problem: This inflates the false-positive rate. If you peek at the data repeatedly, you're effectively running multiple tests. The $\alpha = 0.05$ guarantee no longer holds.
The fix: Pre-register your sample size. Collect all data before analyzing. Or use sequential analysis methods that account for multiple looks.
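A small simulation makes the inflation visible. Both groups are drawn from the same distribution, so every "discovery" here is a false positive (the peeking schedule is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_experiment(max_n=200, peek_every=10, alpha=0.05):
    """Null is TRUE (no effect). Peek every 10 subjects per group and stop
    as soon as p < alpha. Returns True if we ever declare 'significance'."""
    a, b = rng.normal(size=max_n), rng.normal(size=max_n)  # same distribution for both groups
    for n in range(peek_every, max_n + 1, peek_every):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            return True
    return False

false_positive_rate = np.mean([peeking_experiment() for _ in range(2_000)])
print(false_positive_rate)  # well above 0.05 — peeking quietly inflates the error rate
```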
2. Selective Reporting
You measure 10 outcomes. One is significant. You report only that one.
The problem: Same as above—multiple comparisons. If you're cherry-picking, your false-positive rate is way higher than 5%.
The fix: Report all outcomes. Or adjust $\alpha$ for multiple comparisons.
3. Excluding Outliers
You run the test. $p = 0.08$. You remove a few "outliers." Now $p = 0.04$. Significant!
The problem: "Outliers" are subjective. If you're deciding what to exclude after seeing the results, you're data-fishing.
The fix: Pre-specify your outlier exclusion criteria. Or use robust statistics that don't depend on excluding data.
4. Subgroup Analysis
Your main result is $p = 0.12$. But you notice that among men aged 25-35 who live in urban areas, $p = 0.03$.
The problem: If you slice the data enough ways, you'll find some subgroup where the result is significant—just by chance.
The fix: Pre-specify subgroups. Or adjust for multiple comparisons. Or treat subgroup findings as exploratory (not confirmatory).
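A minimal simulation of the subgroup trap: the outcome below is pure noise with no treatment effect anywhere, yet slicing by three made-up binary covariates still hands you eight chances at a "significant" subgroup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Pure noise: the outcome has NO relationship to treatment.
n = 800
treatment = rng.integers(0, 2, size=n).astype(bool)
outcome = rng.normal(size=n)
covariates = rng.integers(0, 2, size=(n, 3))   # e.g. sex, age band, urban/rural

subgroup_p = []
for pattern in range(8):                       # all 2**3 covariate combinations
    bits = [(pattern >> i) & 1 for i in range(3)]
    in_subgroup = np.all(covariates == bits, axis=1)
    p = stats.ttest_ind(outcome[in_subgroup & treatment],
                        outcome[in_subgroup & ~treatment]).pvalue
    subgroup_p.append(p)

print(min(subgroup_p))  # the best-looking slice drifts toward "significance" by chance alone
```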
The ASA Statement: What Professional Statisticians Say About P-Values
In 2016, the American Statistical Association released an official statement on p-values—the first time in its 177-year history it took a position on a statistical practice.
Why? Because p-values were being systematically misused, and it was breaking science.
The ASA's Six Principles
- P-values can indicate how incompatible the data are with a specified statistical model. Translation: P-values measure surprise, not truth.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Translation: Stop thinking $p < 0.05$ means "95% chance the effect is real."
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. Translation: Stop treating $p = 0.049$ as fundamentally different from $p = 0.051$.
- Proper inference requires full reporting and transparency. Translation: Report everything, not just the significant results.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. Translation: "Significant" doesn't mean "important."
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Translation: P-values are weak evidence. Use them alongside effect sizes, confidence intervals, replication, and theory.
This statement was a rebuke to decades of p-value worship in science.
Better Alternatives to P-Values
If p-values are so problematic, what should we use instead?
1. Effect Sizes and Confidence Intervals
Always report the magnitude of the effect, not just whether it's "significant."
- Effect size: Cohen's $d$, $r$, odds ratio, etc. How large is the difference?
- Confidence interval: What's the range of plausible values?
These tell you what matters—not just "is there an effect?" but "how big is it?"
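A small sketch of reporting both, using the usual pooled-variance formulas (the simulated drug/placebo data are just for illustration):

```python
import numpy as np
from scipy import stats

def effect_size_and_ci(x, y, confidence=0.95):
    """Cohen's d and a confidence interval for the mean difference (pooled-variance version)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    pooled_sd = np.sqrt(pooled_var)
    d = diff / pooled_sd                                   # Cohen's d
    se = pooled_sd * np.sqrt(1 / nx + 1 / ny)              # SE of the mean difference
    t_crit = stats.t.ppf((1 + confidence) / 2, df=nx + ny - 2)
    return d, (diff - t_crit * se, diff + t_crit * se)

rng = np.random.default_rng(1)
drug, placebo = rng.normal(0.4, 1.0, 80), rng.normal(0.0, 1.0, 80)
print(effect_size_and_ci(drug, placebo))  # how big is the effect, and how uncertain is it?
```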
2. Bayesian Methods
Instead of p-values, calculate Bayes factors: the ratio of evidence for the alternative vs. the null.
- $\text{BF} = 10$: The data are 10× more likely under the alternative than the null.
- $\text{BF} = 0.1$: The data are 10× more likely under the null.
Bayes factors quantify evidence. They're interpretable. They accumulate across studies. And they let you update beliefs continuously.
The catch: They require priors. But that's a feature, not a bug—it makes your assumptions explicit.
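For a flavor of how a Bayes factor is computed, here's a minimal sketch using the coin example from earlier, assuming a uniform prior on the coin's bias under the alternative (other priors give different numbers):

```python
from scipy import stats

k, n = 65, 100   # 65 heads in 100 flips

# H0: fair coin. Likelihood of the data under the null.
p_data_h0 = stats.binom.pmf(k, n, 0.5)

# H1: unknown bias with a uniform prior on [0, 1]. Integrating the binomial likelihood
# over that prior gives the beta-binomial marginal likelihood, which is 1 / (n + 1) for any k.
p_data_h1 = 1 / (n + 1)

bf_10 = p_data_h1 / p_data_h0
print(bf_10)  # ≈ 11: the data favour "not fair" about 11-to-1 under this prior,
              # even though p ≈ 0.003 sounds far more dramatic
```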
3. Estimation, Not Testing
Instead of asking "Is there an effect?" (hypothesis testing), ask "How big is the effect?" (estimation).
Report point estimates, confidence intervals, and effect sizes. Don't reduce findings to "significant" or "not significant."
This shifts focus from binary decisions to quantitative evidence. Which is what science is supposed to be about.
4. Pre-Registration and Replication
Pre-register your hypothesis, method, and analysis plan before collecting data. This prevents p-hacking and HARKing.
And value replication. One $p < 0.05$ is weak evidence. Five independent replications? Strong evidence.
The Coherence Connection: P-Values as Surprise Quantification
Here's the conceptual core.
P-values measure how surprising your data is under a null model.
Surprise is information. In information theory, surprise = negative log probability. Rare events carry more information.
A low p-value says: "This data is improbable under the null. The null is a bad model."
And this connects to M = C/T. Meaning emerges when patterns persist (coherence over time). A low p-value detects coherence—structure that's unlikely to be random noise.
But here's the key: low p-values don't guarantee coherence. They can arise from:
- Real patterns (what we want).
- Multiple testing (p-hacking).
- Misspecified models (wrong assumptions).
- Publication bias (selective reporting).
P-values detect surprise. But surprise can come from signal or from flawed methodology.
That's why replication matters. If the pattern replicates across independent samples, that's coherence. One low p-value is just one noisy data point.
Practical Workflow: How to Use P-Values Correctly
Here's the checklist:
1. Pre-register your hypothesis and analysis plan. Specify what you're testing before you collect data.
2. Set your significance level before analyzing. Don't adjust it after seeing the results.
3. Report the exact p-value. Not just "$p < 0.05$." Report "$p = 0.023$." Let readers see how close you were to the threshold.
4. Report effect sizes and confidence intervals. P-values alone are incomplete. How large is the effect? What's the range of uncertainty?
5. Interpret cautiously. Low p-values suggest the null is implausible—but they don't prove your hypothesis.
6. Adjust for multiple comparisons. If you're testing multiple hypotheses, correct your $\alpha$ (Bonferroni, FDR, etc.).
7. Don't dichotomize. The p-value is continuous; don't collapse it into "significant" vs. "not significant" buckets. Treat it as the graded measure it is.
8. Replicate. One study is one data point. Seek independent replication before drawing strong conclusions.
What's Next
P-values quantify surprise under the null. But they don't tell you what kind of error you're making when you reject or fail to reject.
That's the domain of Type I and Type II errors—false positives and false negatives. And understanding the trade-off between them is critical for designing good studies.
Next up: Type I and Type II Errors—the two ways hypothesis testing goes wrong.
Further Reading
- Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA's statement on p-values: context, process, and purpose." The American Statistician, 70(2), 129-133.
- Ioannidis, J. P. (2005). "Why most published research findings are false." PLoS Medicine, 2(8), e124.
- Goodman, S. (2008). "A dirty dozen: twelve p-value misconceptions." Seminars in Hematology, 45(3), 135-140.
- Benjamin, D. J., et al. (2018). "Redefine statistical significance." Nature Human Behaviour, 2(1), 6-10.
- Nuzzo, R. (2014). "Scientific method: statistical errors." Nature, 506(7487), 150-152.
This is Part 7 of the Statistics series, exploring how we extract knowledge from data. Next: "Type I and Type II Errors."
Previous: Hypothesis Testing: Is the Effect Real? Next: Type I and Type II Errors: False Positives and False Negatives