Type I and Type II Errors: False Positives and False Negatives

Every hypothesis test can go wrong in exactly two ways.

Type I Error (False Positive): You reject the null when it's actually true. You conclude there's an effect when there isn't. You say the drug works when it doesn't.

Type II Error (False Negative): You fail to reject the null when it's actually false. You miss a real effect. You say the drug doesn't work when it does.

These are not symmetrical. Type I errors get published. Type II errors disappear into file drawers. And that asymmetry has systematically distorted the scientific literature.

Here's the uncomfortable truth: we've optimized science to minimize Type I errors while ignoring Type II errors. We demand $p < 0.05$ (5% false-positive rate). But we tolerate 50% or 80% false-negative rates without blinking.

The result? Journals full of false positives. Real effects overlooked because studies were underpowered. A literature that's biased, unreliable, and hard to replicate.

This article explains the two types of errors, the trade-off between them, and why we've been thinking about statistical testing backwards for decades.


The Two-by-Two Table

Let's formalize this.

Reality: Either the null hypothesis is true (no effect) or false (effect exists).

Decision: Either you reject the null or you don't.

That gives you four possibilities:

| | Null True (No Effect) | Null False (Effect Exists) |
|---|---|---|
| Reject Null | Type I Error (False Positive) | Correct (True Positive) |
| Fail to Reject | Correct (True Negative) | Type II Error (False Negative) |

Two ways to be right. Two ways to be wrong.

Type I Error: False Positive

You conclude there's an effect when there isn't.

Example: You test a drug. It doesn't work. But by random chance, your sample showed improvement. You conclude the drug works and publish. That's a false positive.

Probability of Type I Error: $\alpha$ (the significance level). Usually 0.05.

That means: If you run this test 100 times and the null is always true, you'll get about 5 false positives on average.

We set $\alpha$ in advance. It's the false-positive rate we're willing to tolerate.
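To see that concretely, here's a minimal simulation sketch in Python (numpy and scipy assumed; the sample size, seed, and number of runs are arbitrary). With the null true by construction, roughly 5% of runs come out "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 30, 10_000

false_positives = 0
for _ in range(n_sims):
    # The null is true: both groups come from the same distribution
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_sims:.3f}")  # ~0.05
```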

Type II Error: False Negative

You fail to detect an effect that exists.

Example: You test a drug. It does work. But your sample was small, or your measurement was noisy, or you just got unlucky. You conclude "no significant effect" and don't publish. That's a false negative.

Probability of Type II Error: $\beta$ (often unknown or ignored).

Statistical Power: $1 - \beta$. The probability you do detect the effect if it exists.

Typical target: 80% power ($\beta = 0.20$). That means: If the effect is real, you'll detect it 80% of the time. And miss it 20% of the time.

But many studies have way less power. Some have 20% or 30%. That means they'll miss the effect 70-80% of the time—even if it's real.
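A sketch of the same idea for power, assuming a real but modest effect ($d = 0.3$) and only 30 subjects per group (both numbers are illustrative). The effect exists in every simulated run, yet the test catches it only about a fifth of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, d, n, n_sims = 0.05, 0.3, 30, 10_000

detections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)   # the effect is real: d standard deviations
    if stats.ttest_ind(treated, control).pvalue < alpha:
        detections += 1

print(f"Estimated power: {detections / n_sims:.2f}")  # roughly 0.2 at d = 0.3, n = 30
```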


The Trade-Off: You Can't Minimize Both

Here's the core tension.

Lowering $\alpha$ (stricter threshold) makes Type I errors rarer but Type II errors more common.

If you demand $p < 0.01$ instead of $p < 0.05$, you'll have fewer false positives. But you'll also reject the null less often—so you'll miss more real effects.

Increasing power (lowering $\beta$) requires larger samples or larger effects.

If you want 90% power instead of 80%, you need more data. That costs time and money.

There's no free lunch. You're always balancing false positives against false negatives. Science has chosen to minimize $\alpha$ and largely ignore $\beta$. That choice has consequences.
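If statsmodels is installed, its power calculator makes the trade-off explicit. A small sketch (the effect size of 0.5 and the sample sizes are illustrative choices):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Stricter alpha means lower power at the same effect size and sample size.
for alpha in (0.05, 0.01):
    power = analysis.power(effect_size=0.5, nobs1=64, alpha=alpha)
    print(f"alpha = {alpha}: power ~ {power:.2f}")

# Higher power demands more data at the same alpha and effect size.
for target in (0.80, 0.90):
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=target)
    print(f"power = {target}: ~{n:.0f} subjects per group")
```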


Why Science Obsesses Over Type I Errors

Why do we care so much about false positives?

Historical reason: Ronald Fisher (1925) argued that science should be conservative. We shouldn't claim an effect exists unless the evidence is strong. A false positive is a claim that enters the literature. A false negative is just... silence.

Philosophical reason: Science is supposed to avoid asserting falsehoods. A false positive is a positive claim that's wrong. A false negative is just a lack of discovery.

Cultural reason: "Statistically significant" results get published. Null results don't. So journals, careers, and funding are biased toward rejecting the null.

This creates asymmetric incentives. Researchers are rewarded for Type I risks (claiming effects) and punished for Type II risks (missing effects). The system selects for false positives.


The Power Problem: Most Studies Are Underpowered

Here's the dirty secret: many published studies have power below 50%.

That means even if the effect is real, they'll miss it more often than they detect it.

Why Low Power Happens

1. Small samples are cheaper and faster.

Running a trial with 50 people is easier than 500. But 50 people might give you 30% power. You're gambling that you'll get lucky.

2. Effect sizes are often overestimated.

The "winner's curse": The first study to detect an effect usually overestimates its size (because small effects are hard to detect, so the first detection is likely a lucky outlier). Future studies, expecting that inflated effect size, underpower their designs.

3. Power analysis is often skipped.

Researchers don't calculate required sample size in advance. They just use "whatever we can afford" or "whatever is conventional in this field."

4. Null results don't get published.

If your underpowered study finds "no effect," you don't publish. Only the lucky few who detect the effect (or get false positives) publish. The literature becomes biased.

The Consequences of Low Power

1. High false-negative rate.

You miss real effects. That's bad for progress—important discoveries get overlooked.

2. Published effects are inflated.

When you do detect an effect with low power, it's likely because your sample overestimated the true effect size. The published literature overstates effect magnitudes.

3. Low replication rates.

If your original study barely had power to detect the effect, follow-up studies (with similar power) will often fail to replicate—not because the effect is fake, but because both studies are underpowered.

4. Winner's curse dominates the literature.

The first study on any topic is likely to overestimate the effect. That study gets cited, drives follow-ups, and sets expectations. But it's systematically biased.


Calculating Power: What You Need to Know

Power depends on four factors:

1. Effect Size

Larger effects = easier to detect = higher power.

If the drug reduces symptoms by 50%, you'll detect it with a small sample. If it reduces them by 2%, you need a huge sample.

Effect size is measured in standardized units (Cohen's $d$):

  • Small effect: $d = 0.2$
  • Medium effect: $d = 0.5$
  • Large effect: $d = 0.8$

You can't control the true effect size—it's a property of reality. But you need to estimate it to design your study.
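For reference, here's a minimal way to compute Cohen's $d$ from two samples using the pooled standard deviation (the function name and the example scores are my own):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * np.var(a, ddof=1) +
                  (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Example: two small groups of made-up scores
print(cohens_d([5.1, 6.2, 5.8, 6.5, 5.9], [4.8, 5.0, 5.3, 4.6, 5.2]))  # a large effect
```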

2. Sample Size

Larger samples = higher power.

Power grows with $\sqrt{n}$, but with diminishing returns: the standard error shrinks like $1/\sqrt{n}$, so halving the detectable effect size takes 4× the data.

This is the easiest factor to control. But data is expensive.

3. Significance Level ($\alpha$)

Stricter $\alpha$ (e.g., 0.01 instead of 0.05) = lower power.

If you demand stronger evidence, you'll reject the null less often—including when you should.

$\alpha$ is a policy choice. Conventionally 0.05. But some fields (particle physics, genetics) use 0.001 or stricter.

4. Variance

Lower noise = higher power.

If your measurements are precise, you can detect smaller effects. If they're noisy, you need larger effects or larger samples.

You can reduce noise through better instruments, more controlled conditions, or repeated measurements.

The Formula (for t-tests)

Power can be calculated exactly for specific tests. For a two-sample t-test:

$$\text{Power} \approx \Phi\left( \delta \sqrt{\frac{n}{2}} - z_{\alpha/2} \right)$$

Where:

  • $\delta$ = effect size (in standardized units)
  • $n$ = sample size per group
  • $z_{\alpha/2}$ = critical value for your $\alpha$
  • $\Phi$ = cumulative distribution function of the standard normal

This is essentially what power calculators use (most refine the approximation with the exact noncentral $t$ distribution). You plug in three of the four variables (effect size, $n$, $\alpha$, power) and solve for the fourth.
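Here is that formula coded directly (scipy assumed); it reproduces the familiar benchmark that $d = 0.5$ with 64 per group gives about 80% power:

```python
from scipy.stats import norm

def approx_power(delta, n, alpha=0.05):
    """Normal approximation: Phi(delta * sqrt(n/2) - z_{alpha/2}),
    for a two-sided, two-sample test with n subjects per group."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta * (n / 2) ** 0.5 - z_crit)

print(f"{approx_power(0.5, 64):.2f}")   # ~0.80
print(f"{approx_power(0.2, 64):.2f}")   # ~0.20 -- small effects need far more data
```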


Pre-Study Power Analysis: The Right Way to Design Research

Here's the workflow:

Step 1: Estimate the Effect Size

Look at prior studies. What effect size did they find? Discount by 20-30% (to account for publication bias and winner's curse).

If there are no prior studies, make an educated guess. What's the smallest effect size that would be practically meaningful?

Step 2: Choose Your Significance Level

Usually $\alpha = 0.05$. Adjust if needed (stricter for high-stakes or confirmatory claims, more lenient when missing a real effect is the costlier error).

Step 3: Choose Your Desired Power

Convention: 80% ($\beta = 0.20$). But 90% is better if you can afford it.

Step 4: Calculate Required Sample Size

Use a power calculator (G*Power, R's pwr package, online tools). Input effect size, $\alpha$, and desired power. It outputs required $n$.

Step 5: Collect That Much Data

Don't stop early because you got significance. Don't add more data after seeing null results. Stick to your pre-registered $n$.

If you can't afford the required sample size, don't run the study. An underpowered study wastes resources and produces unreliable results.
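As a sketch of Steps 1-4 in code (statsmodels assumed; the prior effect size and the 25% discount are purely illustrative numbers):

```python
from statsmodels.stats.power import TTestIndPower

prior_d = 0.5                # effect size reported in an earlier study (illustrative)
planned_d = prior_d * 0.75   # discount ~25% for winner's curse / publication bias

n_per_group = TTestIndPower().solve_power(effect_size=planned_d, alpha=0.05, power=0.80)
print(f"Required sample size: {n_per_group:.0f} per group")   # ~113 per group
```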


Post-Hoc Power Analysis: Why It's Useless

After your study, you get a non-significant result. Someone suggests: "Let's calculate the power. Maybe we were underpowered."

Don't. Post-hoc power analysis is logically incoherent.

Here's why:

Power depends on the true effect size. But if your study found no significant effect, you don't know the true effect size. It could be zero. It could be large but you missed it.

Post-hoc power uses the observed effect size from your sample. But that's circular—you're using your data to explain why your data didn't reach significance.

What post-hoc power actually tells you: "If the true effect were exactly what I observed, how likely was I to detect it?" That's not a useful question.

The right approach: If you get a null result, report the confidence interval around the effect. That tells you the range of effect sizes compatible with your data. If the interval includes large effects, you were underpowered. If it's tightly around zero, the null is plausible.
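A minimal sketch of that reporting approach, using a pooled-variance 95% confidence interval (the data here are simulated stand-ins for a small null-result study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treated = rng.normal(0.2, 1.0, 20)   # simulated stand-in for a small study
control = rng.normal(0.0, 1.0, 20)

n1, n2 = len(treated), len(control)
diff = treated.mean() - control.mean()
pooled_var = ((n1 - 1) * treated.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)

print(f"difference = {diff:.2f}, "
      f"95% CI = [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
# A wide interval spanning zero *and* large effects says "underpowered",
# not "no effect".
```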


The Asymmetry Problem: Why Type II Errors Are Ignored

Science treats Type I and Type II errors very differently.

Type I errors (false positives):

  • Highly stigmatized. "You claimed something that's not true."
  • Journals demand $p < 0.05$ to prevent them.
  • Careers suffer if you publish false positives that fail to replicate.

Type II errors (false negatives):

  • Invisible. "You just didn't find anything."
  • No one checks power. Underpowered studies are published all the time.
  • Null results don't get published, so false negatives vanish.

The result: Science minimizes false positives at the expense of false negatives. We accept false-negative rates of 20%, 50%, even 80% without comment.

And this creates perverse incentives:

  • Researchers run underpowered studies (cheap, fast).
  • If they get lucky and find significance, they publish.
  • If they don't, they file it away and try something else.
  • The published literature becomes full of flukes and overestimates.

The Neyman-Pearson Framework: Balancing Both Errors

Jerzy Neyman and Egon Pearson (1930s) developed a framework that treats Type I and Type II errors symmetrically.

They argued: You should choose $\alpha$ and $\beta$ based on the costs of each error type.

Example: Medical Testing

Type I error (false positive): Diagnose a healthy person with a disease.

  • Cost: Unnecessary treatment, anxiety, side effects.

Type II error (false negative): Miss a disease in a sick person.

  • Cost: Disease progresses untreated, possibly fatal.

Which error is worse? It depends on the disease.

  • For cancer screening: Type II errors are catastrophic (you miss cancer). Set $\beta$ low, even if it means higher $\alpha$ (more false alarms).
  • For a benign condition: Type I errors might be worse (unnecessary surgery). Set $\alpha$ low, tolerate higher $\beta$.

The key insight: There's no universal "right" threshold. It's a trade-off based on consequences.

But science has adopted a one-size-fits-all approach: $\alpha = 0.05$, ignore $\beta$. That's philosophically unjustified.


Multiple Testing: When Type I Errors Compound

Every test you run has a 5% false-positive rate. Run 20 tests, and you'd expect 1 false positive just by chance.

But researchers often run dozens or hundreds of tests (testing multiple outcomes, subgroups, timepoints). The family-wise error rate (probability of any false positive) skyrockets.

Example:

You test 20 hypotheses. Each has $\alpha = 0.05$. What's the probability of at least one false positive?

$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64$$

64% chance of a false positive. Your "significance threshold" is meaningless.
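The arithmetic, for a few family sizes:

```python
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: P(at least one false positive) = {fwer:.2f}")
# 1 test: 0.05, 5 tests: 0.23, 20 tests: 0.64, 100 tests: 0.99
```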

The Solution: Adjust for Multiple Comparisons

Bonferroni correction: Divide $\alpha$ by the number of tests. For 20 tests, use $\alpha = 0.05 / 20 = 0.0025$.

Pros: Simple. Guarantees family-wise error rate stays at 5%.

Cons: Very conservative. Increases Type II errors (false negatives). You'll miss real effects.

Alternative: False Discovery Rate (FDR):

Instead of controlling the probability of any false positive, control the proportion of false positives among your "significant" results.

The Benjamini-Hochberg procedure does this. It's less conservative than Bonferroni but still protects against multiple testing.
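Both corrections are a single call in statsmodels. A sketch (the ten p-values below are made up for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Ten made-up p-values from a family of tests
pvals = np.array([0.001, 0.008, 0.012, 0.021, 0.049,
                  0.110, 0.200, 0.340, 0.620, 0.910])

for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} of {len(pvals)} hypotheses rejected")
# Bonferroni rejects fewer (stricter, more Type II risk);
# Benjamini-Hochberg rejects more while controlling the FDR.
```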


The Replication Debate: Type I or Type II?

The replication crisis revealed that many "significant" findings don't replicate.

Two interpretations:

1. The original studies were false positives (Type I errors).

P-hacking, low priors, publication bias inflated the literature with noise. The effects were never real.

2. The replication studies were false negatives (Type II errors).

The original effects are real, but they're smaller than reported (winner's curse). Replication studies, often with similar or smaller samples, lack power to detect the true (smaller) effect.

The truth: Both. Some original findings are false positives. Some replication failures are false negatives. Disentangling them requires meta-analysis across many studies.

But the key point: if we ignore Type II errors, we can't interpret replication failures. "Failed to replicate" could mean "the effect is fake" or "the replication was underpowered."


The Coherence Connection: Error Rates as Signal Detection

Here's the conceptual link.

Hypothesis testing is signal detection. You're trying to distinguish signal (real effect) from noise (random variation).

Type I error: You mistake noise for signal. You see a pattern that isn't there.

Type II error: You mistake signal for noise. You miss a pattern that is there.

In information theory, this is the classic detection problem. You're setting a threshold. Above it, you say "signal." Below it, you say "noise."

Lowering the threshold: More Type I errors (false alarms), fewer Type II errors (misses).

Raising the threshold: Fewer Type I errors, more Type II errors.

The optimal threshold depends on the signal-to-noise ratio and the costs of each error type.
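A small numerical sketch of that trade-off, assuming noise and signal are unit-variance normals centered at 0 and 1.5 (arbitrary choices):

```python
from scipy.stats import norm

# Noise ~ N(0, 1), signal ~ N(1.5, 1).
# Raising the threshold trades false alarms (Type I) against misses (Type II).
for threshold in (0.5, 1.0, 1.5, 2.0):
    false_alarm = 1 - norm.cdf(threshold, loc=0.0)   # noise classified as signal
    miss = norm.cdf(threshold, loc=1.5)              # signal classified as noise
    print(f"threshold {threshold:.1f}: false alarms {false_alarm:.2f}, misses {miss:.2f}")
```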

And this maps to M = C/T. Coherence is signal. Tension (or noise) is entropy. Statistical tests quantify how much coherence exists—whether the pattern is stable enough to replicate.

But here's the key: low-powered studies can't detect weak coherence. Even if the pattern is real (low entropy, high signal), a small sample won't reliably detect it. That's a Type II error.

Science has optimized for avoiding false claims of coherence (Type I). But in doing so, it misses true coherence (Type II). That's not epistemic humility—it's epistemic blindness.


Practical Workflow: Designing Studies to Minimize Both Errors

Here's how to balance Type I and Type II risks:

1. Set your $\alpha$ based on context.

High-stakes claims (medical treatments, policy changes): use $\alpha = 0.01$ or stricter.

Exploratory research: $\alpha = 0.05$ is fine.

2. Do a power analysis.

Don't run a study without knowing your power. Aim for 80% minimum. 90% is better.

3. Pre-register your hypothesis and sample size.

Prevents p-hacking, HARKing, and optional stopping. Locks in your error rates.

4. Report confidence intervals and effect sizes.

Don't just say "significant" or "not significant." Show the magnitude and uncertainty.

5. Adjust for multiple comparisons.

If you're testing multiple hypotheses, use Bonferroni or FDR control.

6. Value replication.

One study is weak evidence. Multiple independent replications are strong.

7. Be transparent about non-significant results.

Null findings matter. Publish them. They prevent publication bias and inform future power analyses.


What's Next

Type I and Type II errors are about binary decisions: reject or don't reject. But often, you're not testing a simple null. You're modeling relationships between variables.

That's the domain of linear regression—one of the most powerful tools in statistics for prediction, explanation, and causal inference.

Next up: Linear Regression Explained—how to fit lines to data and make predictions.


Further Reading

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.
  • Button, K. S., et al. (2013). "Power failure: why small sample size undermines the reliability of neuroscience." Nature Reviews Neuroscience, 14(5), 365-376.
  • Ioannidis, J. P. (2005). "Why most published research findings are false." PLoS Medicine, 2(8), e124.
  • Neyman, J., & Pearson, E. S. (1933). "On the problem of the most efficient tests of statistical hypotheses." Philosophical Transactions of the Royal Society A, 231(694-706), 289-337.
  • Open Science Collaboration. (2015). "Estimating the reproducibility of psychological science." Science, 349(6251), aac4716.

This is Part 8 of the Statistics series, exploring how we extract knowledge from data. Next: "Linear Regression Explained."



Previous: P-Values: What They Actually Mean Next: Linear Regression: Fitting Lines to Data