ANOVA: Comparing Multiple Groups
You have three groups. Group A, Group B, Group C. You want to know: do they differ?
You could run three t-tests: A vs. B, B vs. C, A vs. C. But that's wrong. Multiple comparisons inflate your false-positive rate. Three tests at α = 0.05 give you roughly a 14% chance of at least one false positive (1 − 0.95³ ≈ 0.14).
The solution: Analysis of Variance (ANOVA). One test. Multiple groups. Controlled error rate.
ANOVA asks: "Is there more variance between groups than within groups?" If yes, the groups differ. If no, you have no evidence that they differ.
It's the foundation of experimental design. Every time you see "significant main effect" or "interaction effect," that's ANOVA. This article explains how it works, when to use it, and what it actually tests.
The Core Logic: Variance Decomposition
ANOVA decomposes total variance into two parts:
1. Between-group variance: How much do group means differ from each other?
2. Within-group variance: How much do individuals within each group differ from their group mean?
The F-statistic:
$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$
If F is large: Groups differ more than individuals within groups. The group membership explains variance. Reject the null.
If F is small: Groups are as variable as individuals. Group membership doesn't matter. Fail to reject.
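The variance-ratio logic above can be run in one line with SciPy's `f_oneway`, which performs a one-way ANOVA. A minimal sketch, with made-up scores for three groups:

```python
from scipy import stats

# Hypothetical exam scores for three groups (illustrative numbers only)
a = [85, 90, 88, 92, 87]
b = [78, 75, 80, 77, 79]
c = [91, 94, 89, 93, 95]

# f_oneway returns the F-statistic and its p-value
f_stat, p_value = stats.f_oneway(a, b, c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Here the group means (roughly 88, 78, and 92) differ far more than individuals within each group, so F comes out large and p comes out small.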
When to Use ANOVA
Use ANOVA when:
- You have 3+ groups (categorical independent variable).
- You have one continuous outcome (dependent variable).
- You want to test if any groups differ (omnibus test).
Examples:
- Test three teaching methods (lecture, discussion, flipped classroom) on exam scores.
- Compare four drug dosages (0mg, 10mg, 20mg, 30mg) on symptom reduction.
- Test five diets (keto, paleo, vegan, Mediterranean, control) on weight loss.
Don't use ANOVA when:
- You have only 2 groups → use t-test.
- Your outcome is binary → use logistic regression or chi-square.
- Groups aren't independent → use repeated measures ANOVA or mixed models.
The Math: Sum of Squares
ANOVA partitions the total sum of squares (SST) into components:
$$\text{SST} = \text{SSB} + \text{SSW}$$
SST (Total): How much do all data points vary from the grand mean?
$$\text{SST} = \sum (Y_i - \bar{Y}_{\text{grand}})^2$$
SSB (Between groups): How much do group means vary from the grand mean?
$$\text{SSB} = \sum n_j (\bar{Y}_j - \bar{Y}_{\text{grand}})^2$$
SSW (Within groups): How much do individual points vary from their group mean?
$$\text{SSW} = \sum \sum (Y_{ij} - \bar{Y}_j)^2$$
F-statistic:
$$F = \frac{\text{SSB} / (k-1)}{\text{SSW} / (N-k)}$$
Where:
- $k$ = number of groups
- $N$ = total sample size
- $k-1$ = degrees of freedom between groups
- $N-k$ = degrees of freedom within groups
Interpretation: F is the ratio of mean squares. If F is large (and p < 0.05), groups differ significantly.
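The decomposition above can be computed by hand and checked against SciPy's `f_oneway`. A sketch with made-up data:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for k = 3 groups (illustrative numbers only)
groups = [np.array([85, 90, 88, 92, 87]),
          np.array([78, 75, 80, 77, 79]),
          np.array([91, 94, 89, 93, 95])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k, N = len(groups), len(all_data)

# SSB: group means vs. grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SSW: individuals vs. their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# SST: individuals vs. grand mean
sst = ((all_data - grand_mean) ** 2).sum()

f_manual = (ssb / (k - 1)) / (ssw / (N - k))
f_scipy, _ = stats.f_oneway(*groups)
print(f"SST = {sst:.1f}, SSB + SSW = {ssb + ssw:.1f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}")
```

The printout confirms both identities: SST equals SSB + SSW, and the hand-computed F matches SciPy's.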
Assumptions
ANOVA assumes:
1. Independence: Observations are independent. One person's score doesn't affect another's.
2. Normality: Residuals are normally distributed (or sample is large enough for CLT to apply).
3. Homogeneity of variance: All groups have equal variance (homoscedasticity).
Check: Levene's test for equal variances. If violated, use Welch's ANOVA or transform the data.
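Levene's test is available as `scipy.stats.levene`. A quick check on made-up groups:

```python
from scipy import stats

# Hypothetical groups (illustrative numbers only)
a = [85, 90, 88, 92, 87]
b = [78, 75, 80, 77, 79]
c = [91, 94, 89, 93, 95]

# Null hypothesis: all groups have equal variance
stat, p = stats.levene(a, b, c)
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
```

A large p-value means no evidence of unequal variances, so the homogeneity assumption looks acceptable; a small one is the cue to switch to Welch's ANOVA.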
Post-Hoc Tests: Which Groups Differ?
ANOVA tells you: "At least one group differs." It doesn't tell you which groups differ.
For that, you need post-hoc tests:
1. Tukey HSD: Compares all pairs. Controls family-wise error rate.
2. Bonferroni: Divides α by number of comparisons. Very conservative.
3. Scheffé: Most conservative. Works for complex comparisons.
4. Dunnett: Compares all groups to a control. More powerful if you only care about control comparisons.
Always adjust for multiple comparisons. Don't just run t-tests on all pairs—that inflates false positives.
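SciPy ships a Tukey HSD implementation, `scipy.stats.tukey_hsd` (SciPy ≥ 1.8). A sketch with made-up groups:

```python
from scipy import stats

# Hypothetical groups (illustrative numbers only)
a = [85, 90, 88, 92, 87]
b = [78, 75, 80, 77, 79]
c = [91, 94, 89, 93, 95]

res = stats.tukey_hsd(a, b, c)
print(res)          # pairwise mean differences with adjusted CIs and p-values
pvals = res.pvalue  # k x k matrix of family-wise-adjusted p-values
```

Each off-diagonal entry of `pvals` is the adjusted p-value for one pairwise comparison, so you can read off exactly which pairs differ.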
Two-Way ANOVA: Testing Interactions
What if you have two categorical predictors?
Example: Test effect of teaching method (3 levels) and class size (2 levels: small, large) on exam scores.
Two-way ANOVA tests:
- Main effect of teaching method: Do teaching methods differ (averaging across class sizes)?
- Main effect of class size: Do class sizes differ (averaging across teaching methods)?
- Interaction effect: Does the effect of teaching method depend on class size?
Interaction is key. If teaching method works in small classes but not large, that's an interaction. Main effects alone miss that.
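An interaction can be seen directly in the cell means: if the small-vs-large gap changes across teaching methods, the factors interact. This is not a full two-way ANOVA, just a sketch of the difference-of-differences idea, with made-up cell means:

```python
import numpy as np

# Hypothetical mean exam scores (rows: method, cols: class size)
#                      small  large
cell_means = np.array([[82.0, 80.0],   # lecture
                       [88.0, 79.0],   # discussion
                       [90.0, 78.0]])  # flipped

# Main effects: average over the other factor
method_means = cell_means.mean(axis=1)  # per teaching method
size_means = cell_means.mean(axis=0)    # per class size

# Interaction check: is the small-vs-large gap constant across methods?
gaps = cell_means[:, 0] - cell_means[:, 1]
print(gaps)  # unequal gaps indicate an interaction
```

Here the gap grows from 2 to 9 to 12 points, so the benefit of the method depends on class size. For the actual significance test, statsmodels' `anova_lm` on an OLS fit with an interaction term is the standard route.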
Repeated Measures ANOVA: Same Subjects, Multiple Times
What if you measure the same people multiple times?
Example: Measure stress before, during, and after an intervention (within-subjects design).
Regular ANOVA assumes independence. But measurements from the same person are correlated.
Repeated measures ANOVA accounts for this. It partitions variance into:
- Between-subjects variance.
- Within-subjects variance (time effects).
- Residual variance.
Assumption: Sphericity—variance of differences between conditions is equal. Test with Mauchly's test. If violated, use Greenhouse-Geisser correction.
Effect Size: Eta-Squared and Partial Eta-Squared
ANOVA gives you F and p. But how large is the effect?
Eta-squared ($\eta^2$):
$$\eta^2 = \frac{\text{SSB}}{\text{SST}}$$
Proportion of total variance explained by group membership.
- Small effect: $\eta^2 \approx 0.01$
- Medium: $\eta^2 \approx 0.06$
- Large: $\eta^2 \approx 0.14$
Partial eta-squared ($\eta_p^2$): Proportion of variance explained after accounting for other factors (used in multi-way ANOVA).
Always report effect size alongside p-value.
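Eta-squared falls straight out of the sums of squares. A sketch with made-up groups:

```python
import numpy as np

# Hypothetical groups (illustrative numbers only)
groups = [np.array([85, 90, 88, 92, 87]),
          np.array([78, 75, 80, 77, 79]),
          np.array([91, 94, 89, 93, 95])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sst = ((all_data - grand_mean) ** 2).sum()

eta_sq = ssb / sst  # proportion of total variance explained by group
print(f"eta^2 = {eta_sq:.3f}")
```

With these well-separated groups, group membership explains most of the variance, a very large effect by the benchmarks above.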
When ANOVA Fails
1. Unequal variances (heteroscedasticity): Use Welch's ANOVA or transform data.
2. Non-normality: For small samples, use Kruskal-Wallis (nonparametric alternative).
3. Outliers: Robust ANOVA or trimmed means.
4. Unbalanced designs (different group sizes): Use Type II or Type III sums of squares.
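The nonparametric fallback is one call away: `scipy.stats.kruskal` runs the Kruskal-Wallis H-test on ranks instead of raw values. A sketch with made-up groups:

```python
from scipy import stats

# Hypothetical groups (illustrative numbers only)
a = [85, 90, 88, 92, 87]
b = [78, 75, 80, 77, 79]
c = [91, 94, 89, 93, 95]

# Kruskal-Wallis: rank-based test that at least one group's
# distribution is shifted relative to the others
h, p = stats.kruskal(a, b, c)
print(f"H = {h:.2f}, p = {p:.4f}")
```

Because it operates on ranks, it tolerates skew and outliers that would distort the F-test, at the cost of some power when normality actually holds.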
Further Reading
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
- Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing Experiments and Analyzing Data: A Model Comparison Perspective (3rd ed.). Routledge.
This is Part 11 of the Statistics series. Next: "Chi-Square Tests."