Descriptive Statistics: Summarizing Data


You measure 10,000 people's heights. Now what?

You can't just list all 10,000 numbers. No one can think about that much data at once. You need to compress it—extract the essential features without losing the meaningful structure.

That's what descriptive statistics does. It takes a dataset and gives you a handful of numbers that capture the shape, center, and spread. Mean, median, mode, range, variance, standard deviation—these aren't arbitrary. They're optimized summaries, each revealing something different about your data.

And here's the thing: choosing the wrong summary can lie to you. The mean looks stable until you realize it's being dragged by outliers. The median hides bimodality. The correlation coefficient says "no relationship" when there's a perfect curve.

Descriptive statistics isn't just calculation. It's translation—turning raw data into insight. And if you don't understand what each metric actually measures, you'll summarize yourself into nonsense.

This article unpacks the core descriptive tools: what they mean, when to use them, and how they can mislead you if you're not careful.


The Problem: Raw Data Is Uninterpretable

Imagine I give you this list:

Heights (in cm): 172, 165, 180, 158, 175, 169, 182, 177, 163, 171, 168, 174, 179, 166, 173, 170, 176, 164, 181, 167...

And it goes on for 10,000 entries.

What do you know after reading that? Nothing actionable. You can't hold 10,000 numbers in your head. You can't see the pattern. You need compression.

Descriptive statistics gives you that compression. Instead of 10,000 numbers, you get:

  • Mean: 170 cm
  • Median: 170 cm
  • Standard deviation: 8 cm
  • Range: 145 cm to 195 cm

Suddenly, you know something. The typical height is around 170 cm. Most people cluster within 8 cm of that. The extremes span 50 cm. You've gone from "incomprehensible list" to "structured insight."
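Here's a minimal sketch of that compression in Python, using the built-in statistics module (the 20 heights from above stand in for the full 10,000):

```python
import statistics

# A small illustrative sample; imagine 10,000 entries in practice.
heights = [172, 165, 180, 158, 175, 169, 182, 177, 163, 171,
           168, 174, 179, 166, 173, 170, 176, 164, 181, 167]

print("Mean:  ", statistics.mean(heights))    # arithmetic average
print("Median:", statistics.median(heights))  # middle of the sorted values
print("Stdev: ", statistics.stdev(heights))   # sample standard deviation (n - 1)
print("Range: ", min(heights), "to", max(heights))
```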

But here's the catch: every summary loses information. The mean tells you nothing about the shape of the distribution. The range tells you nothing about density. If you pick the wrong summary for your question, you'll miss what matters.


Measures of Central Tendency: Where's the Middle?

The first question you ask about data: What's typical? What's the "middle" or "center" of the distribution?

There are three main answers, and they're not interchangeable.

Mean: The Arithmetic Average

Definition: Add all values, divide by the count.

$$\text{Mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The mean is the center of mass of your data. If you plotted your data as weights on a number line, the mean is where you'd balance it.

When it works: Normally distributed data, no extreme outliers.

When it fails: Skewed distributions, outliers.

Example: You measure income in a neighborhood.

  • 9 people earn $50,000/year.
  • 1 person earns $5,000,000/year (a tech founder).

The mean income is $545,000. Does that describe anyone? No. The mean has been dragged by the outlier. It's technically correct but functionally useless for describing "typical" income.

The mean is sensitive to outliers. One extreme value shifts it dramatically. That's sometimes useful (in physics, when calculating centers of mass). But in social data with fat tails, the mean can be wildly misleading.

Median: The Middle Value

Definition: Sort the data. The median is the value in the middle.

If you have 99 data points, the median is the 50th value. Half the data is above it, half below.

When it works: Skewed data, outliers, ordinal data (rankings).

When it fails: When you care about totals (e.g., total income affects tax revenue).

Back to the income example:

  • 9 people earn $50,000.
  • 1 person earns $5,000,000.

The median income is $50,000. Sort the ten values and take the middle: with an even count, the median is the average of the 5th and 6th sorted values, and both are $50,000. This actually describes a typical person.

The median is robust to outliers. The tech founder could earn $50 million or $500 million—the median stays $50,000. That makes it better for summarizing skewed distributions.

But the median loses information about the extremes. If you're calculating total tax revenue, the median is useless—you need to know about the high earners.
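A quick sketch of that robustness, reusing the neighborhood incomes:

```python
import statistics

incomes = [50_000] * 9 + [5_000_000]

print(statistics.mean(incomes))    # about 545,000: dragged by the outlier
print(statistics.median(incomes))  # 50,000: the typical resident

# Make the founder 100x richer; the median doesn't move.
incomes[-1] = 500_000_000
print(statistics.median(incomes))  # still 50,000
```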

Mode: The Most Common Value

Definition: The value that appears most often.

When it works: Categorical data, discrete distributions, multimodal data.

When it fails: Continuous data (where exact repeats are rare), or unimodal symmetric distributions (where the mode adds little beyond the mean and median).

Example: You survey 100 people about their favorite color.

  • 35 say blue.
  • 25 say red.
  • 20 say green.
  • 20 say yellow.

The mode is blue. That's the most common answer. Mean and median don't even make sense here—you can't "average" colors.
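In code, finding the mode is just a frequency count. A sketch with the survey data above; statistics.multimode (Python 3.8+) also handles ties:

```python
import statistics

answers = ["blue"] * 35 + ["red"] * 25 + ["green"] * 20 + ["yellow"] * 20

print(statistics.mode(answers))       # 'blue', the single most common answer
print(statistics.multimode(answers))  # ['blue']; would list every winner in a tie
```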

For continuous data, the mode is tricky. If you measure heights to the nearest cm, you might have a mode. But if you measure to the nearest 0.1 cm, exact repeats become rare. In that case, you think about modal regions—density peaks in the distribution.

The mode reveals multimodality. If your data has two peaks (e.g., bimodal height distribution for men and women combined), the mode shows you that. Mean and median hide it.

When to Use Which?

| Use | Metric |
| --- | --- |
| Normally distributed, no outliers | Mean |
| Skewed data, outliers present | Median |
| Categorical data, or multimodal distributions | Mode |

And often, you report all three. If mean and median differ substantially, that tells you something—your data is skewed or has outliers.


Measures of Spread: How Scattered Is the Data?

Knowing the center isn't enough. You also need to know how much the data varies.

Consider two datasets:

  • Dataset A: Heights are 169, 170, 170, 171, 171 cm (mean = 170).
  • Dataset B: Heights are 150, 160, 170, 180, 190 cm (mean = 170).

Same mean. Completely different distributions. Dataset A is tightly clustered. Dataset B is spread out. You need a metric for that spread.

Range: Maximum Minus Minimum

Definition: Range = Max - Min.

The simplest measure of spread. If heights range from 150 to 190 cm, the range is 40 cm.

When it works: Quick-and-dirty summary, identifying outliers.

When it fails: Ignores everything except the two extremes.

The range is fragile. One outlier—a 210 cm basketball player—and your range explodes, even though 99.9% of the data is tightly clustered. It tells you the extremes exist but nothing about the distribution's shape.

Variance: Average Squared Deviation

Definition: The average of squared distances from the mean.

$$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

This measures how much data points deviate from the mean on average. If variance is small, data clusters near the mean. If large, data is spread out. (One detail: when estimating variance from a sample rather than a full population, you divide by n - 1 instead of n. That's Bessel's correction, and it's what most software reports by default.)

Why squared deviations? Two reasons:

  1. Positive and negative deviations cancel out. If you just averaged $(x_i - \bar{x})$, values above and below the mean cancel to zero. Squaring makes everything positive.
  2. It's mathematically convenient. Variance has nice properties in calculus and probability theory. It makes later derivations work cleanly.

But variance has a problem: units are squared. If you measure height in cm, variance is in cm². That's unintuitive. You can't directly compare "the variance is 64 cm²" to the original data.
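Both points are easy to verify numerically. A sketch using Dataset B from earlier:

```python
# Raw deviations from the mean always sum to (essentially) zero;
# squaring them is what produces a useful spread measure.
data = [150, 160, 170, 180, 190]
mean = sum(data) / len(data)  # 170.0

raw_sum = sum(x - mean for x in data)                       # 0.0 -- deviations cancel
variance = sum((x - mean) ** 2 for x in data) / len(data)   # 200.0, in cm^2

print(raw_sum, variance)
```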

Standard Deviation: The Square Root of Variance

Definition: $\sigma = \sqrt{\text{Variance}}$.

Standard deviation brings you back to the original units. If variance is 64 cm², standard deviation is 8 cm. Now you can say: "Most people's heights are within 8 cm of the mean."

When it works: Normally distributed data.

When it fails: Skewed distributions, heavy tails.

Standard deviation is the most common spread metric. And for normally distributed data, it has a beautiful property:

  • ~68% of data falls within 1 standard deviation of the mean.
  • ~95% falls within 2 standard deviations.
  • ~99.7% falls within 3 standard deviations.

If mean height is 170 cm and standard deviation is 8 cm:

  • ~68% of people are 162–178 cm.
  • ~95% are 154–186 cm.
  • ~99.7% are 146–194 cm.

That's the empirical rule, and it only holds for normal distributions. If your data is skewed—like income—standard deviation is less interpretable.
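You can check the empirical rule by simulation. A sketch assuming the same mean of 170 cm and standard deviation of 8 cm:

```python
import random

# Simulate 100,000 heights from a normal distribution (mean 170, sd 8).
random.seed(0)
heights = [random.gauss(170, 8) for _ in range(100_000)]

for k in (1, 2, 3):
    inside = sum(1 for h in heights if 170 - k * 8 <= h <= 170 + k * 8)
    print(f"within {k} sd: {inside / len(heights):.1%}")
# Prints roughly 68.3%, 95.4%, 99.7%
```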

Interquartile Range (IQR): The Middle 50%

Definition: IQR = Q3 - Q1 (the 75th percentile minus the 25th percentile).

The IQR tells you the range of the middle half of your data. It ignores the top 25% and bottom 25%, focusing on the central bulk.

When it works: Skewed data, outliers, non-normal distributions.

When it fails: When you care about the tails (e.g., risk modeling).

The IQR is robust to outliers, just like the median. If Bill Gates walks into a bar, the IQR barely budges. That makes it better than standard deviation for messy, real-world data.
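A sketch of that robustness using NumPy's percentile function (the income figures are illustrative):

```python
import numpy as np

incomes = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 75], dtype=float)  # in $1,000s

def iqr(x):
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

print(iqr(incomes), incomes.std())        # baseline spread

# Bill Gates walks into the bar.
with_gates = np.append(incomes, 100_000_000)
print(iqr(with_gates), with_gates.std())  # IQR barely moves; std explodes
```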

Choosing a Spread Metric

| Use | Metric |
| --- | --- |
| Normally distributed, no outliers | Standard deviation |
| Skewed data, outliers | IQR |
| Quick summary | Range |

And again, report multiple metrics. If standard deviation is huge but IQR is small, you have outliers. The combination tells the story.


Distribution Shape: Beyond Center and Spread

Mean and standard deviation are great for normal distributions. But not all data is normal. Sometimes the shape itself matters.

Skewness: Is It Symmetric?

Skewness measures whether data is symmetric or lopsided.

  • Right-skewed (positive skew): Long tail on the right. Mean > Median. Examples: income, wealth, city sizes.
  • Left-skewed (negative skew): Long tail on the left. Mean < Median. Examples: age at death (most people die old), test scores (most students do well).
  • Symmetric (zero skew): Balanced. Mean ≈ Median. Example: height.

Why it matters: If you report the mean for right-skewed data, you're overstating "typical" values. The mean gets pulled toward the tail. The median is more representative.

Income is the canonical example. The mean household income in the U.S. is around $90,000. The median is around $70,000. That gap tells you: right-skewed. A small number of ultra-wealthy households drag the mean up.
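You can watch the gap between mean and median emerge from skew directly. A sketch using a log-normal sample, a standard stand-in for income-like data:

```python
import random
import statistics

random.seed(1)
# Log-normal draws: heavily right-skewed, like income.
incomes = [random.lognormvariate(11, 1) for _ in range(100_000)]

print(f"mean:   {statistics.mean(incomes):,.0f}")    # pulled up by the long right tail
print(f"median: {statistics.median(incomes):,.0f}")  # noticeably smaller
```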

Kurtosis: How Heavy Are the Tails?

Kurtosis measures the "tailedness" of a distribution. High kurtosis means more extreme outliers than you'd expect in a normal distribution.

  • High kurtosis (leptokurtic): Fat tails, sharp peak. Example: financial returns (rare but extreme crashes).
  • Low kurtosis (platykurtic): Thin tails, flat peak. Example: uniform distributions.
  • Normal kurtosis (mesokurtic): The normal distribution is the reference.

Why it matters: Financial risk models assume normal distributions. But asset returns have fat tails—crashes happen more often than normal models predict. Ignoring kurtosis led to underestimating risk before 2008.

Kurtosis is advanced. Most people don't calculate it. But noticing heavy tails—visually, or through summary stats—can prevent disaster.
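If you do want the number, SciPy computes it. A sketch comparing a normal sample with a heavy-tailed one (Student's t with 3 degrees of freedom as the fat-tailed stand-in):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
fat_tailed = rng.standard_t(df=3, size=100_000)  # heavy tails, like asset returns

# SciPy reports "excess" kurtosis by default: 0 for a normal distribution.
print(kurtosis(normal_sample))  # close to 0
print(kurtosis(fat_tailed))     # well above 0: fat tails
```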


Visualizing Distributions: The Picture Tells the Story

Numbers are great. Pictures are better.

Histograms: Binned Frequency Counts

A histogram divides your data into bins and counts how many values fall in each bin. It shows the shape directly.

You can instantly see:

  • Is it symmetric or skewed?
  • Is it unimodal (one peak) or multimodal (multiple peaks)?
  • Are there outliers?

Histograms make problems obvious. If you just calculated the mean, you might miss that your data is bimodal—two separate clusters pretending to be one.

Box Plots: Summarizing with Quartiles

A box plot shows:

  • The median (line in the middle).
  • Q1 and Q3 (the box).
  • Whiskers, typically extending to the most extreme values within 1.5 × IQR of the box.
  • Outliers (points beyond the whiskers).

It's a compact summary of center, spread, and outliers. You can compare multiple groups side-by-side easily.
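Here's a minimal matplotlib sketch of both views, using a deliberately bimodal sample (simulated female and male heights mixed together, as described above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two clusters pretending to be one: simulated female and male heights.
heights = np.concatenate([rng.normal(162, 6, 5_000), rng.normal(176, 7, 5_000)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(heights, bins=50)  # the histogram shows two peaks
ax1.set_title("Histogram: bimodality is visible")
ax2.boxplot(heights)        # the box plot hides the two peaks
ax2.set_title("Box plot: center, spread, outliers")
plt.show()
```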

Density Plots: Smooth Approximations

A density plot is a smoothed histogram. Instead of discrete bins, you get a continuous curve showing the probability density.

It's elegant and reveals fine structure. But it can also over-smooth—hiding real features if you choose the wrong bandwidth.
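A sketch of that bandwidth trade-off with SciPy's Gaussian KDE, reusing the same kind of bimodal sample (the bw_method values are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(162, 6, 5_000), rng.normal(176, 7, 5_000)])

xs = np.linspace(140, 200, 400)
narrow = gaussian_kde(heights, bw_method=0.1)(xs)  # small bandwidth: both peaks show
wide = gaussian_kde(heights, bw_method=1.0)(xs)    # large bandwidth: smoothed into one bump

# Plot `narrow` and `wide` against `xs`: the wide curve hides the bimodality.
```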

The Takeaway: Always Visualize

Never trust summary statistics without visualizing the data. Anscombe's quartet proves this: four datasets with identical means, variances, and correlations—but completely different shapes. The statistics lie. The plots tell the truth.
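Two of Anscombe's four datasets are enough to see it. Set I is roughly linear with noise; set II is a perfect parabola. Their summary statistics agree anyway:

```python
import numpy as np

# Anscombe's quartet, sets I and II (Anscombe, 1973).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    print(y.mean(), y.var(ddof=1), np.corrcoef(x, y)[0, 1])
# Both lines print ~7.50, ~4.13, ~0.816 -- yet y2 is an exact curve in x.
```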


When Descriptive Statistics Mislead

Descriptive statistics are tools. Tools can be misused.

Simpson's Paradox: Aggregation Reverses the Trend

Imagine a hospital with two doctors.

Doctor A:

  • Treats 10 low-risk patients: 9 survive (90% survival rate).
  • Treats 90 high-risk patients: 50 survive (56% survival rate).
  • Overall: 59/100 survive (59%).

Doctor B:

  • Treats 90 low-risk patients: 80 survive (89% survival rate).
  • Treats 10 high-risk patients: 4 survive (40% survival rate).
  • Overall: 84/100 survive (84%).

Doctor B has a higher overall survival rate. But within each risk category, Doctor A performs better. How?

Simpson's Paradox: The trend reverses when you aggregate. Doctor A treats harder cases—mostly high-risk. Doctor B treats easier cases—mostly low-risk. The overall survival rate is confounded by case mix.
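The arithmetic is worth running yourself. A sketch:

```python
# (survivors, patients) per risk group for each doctor.
doctors = {
    "A": {"low": (9, 10), "high": (50, 90)},
    "B": {"low": (80, 90), "high": (4, 10)},
}

for name, groups in doctors.items():
    total_lived = total_seen = 0
    for risk, (lived, seen) in groups.items():
        print(f"Doctor {name}, {risk}-risk: {lived / seen:.0%}")
        total_lived += lived
        total_seen += seen
    print(f"Doctor {name}, overall: {total_lived / total_seen:.0%}")
# A wins both subgroups (90% vs 89%, 56% vs 40%) but loses overall (59% vs 84%).
```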

The lesson: Aggregated statistics can mislead if groups differ in composition. Always disaggregate and check subgroups.

The Ecological Fallacy: Aggregates Don't Describe Individuals

You find: "States with higher education spending have lower test scores."

You conclude: "Spending more on education makes students worse."

Wrong. That's the ecological fallacy—inferring individual behavior from aggregate data.

Maybe states with higher spending have more immigrant populations (who face language barriers). Maybe they have more disabled students (who get more resources but score lower). The state-level correlation doesn't tell you what happens if one student gets more resources.

The lesson: Aggregate statistics describe aggregates, not individuals. Don't assume the group-level pattern applies to every member.

Misleading Averages: When the Mean Lies

A company reports: "Average employee salary is $120,000."

Sounds great. But:

  • The CEO earns $5 million.
  • 99 employees earn $70,000.

The mean is $120,000. The median is $70,000. The mean is technically true and functionally misleading.

This isn't lying, exactly. But it's presenting the summary that tells the story you want. Always ask: "Mean or median? Why that one?"


The Coherence Connection: Compression as Pattern Detection

Here's the deeper point.

Descriptive statistics are lossy compression. You take a high-dimensional dataset and reduce it to a few numbers. Information is lost. But what you keep is the pattern—the structure, the signal, the coherence.

A dataset with low variance is predictable. High coherence. You can guess the next value with confidence. A dataset with high variance is unpredictable. Low coherence. Each observation is a surprise.

In information-theoretic terms, variance tracks uncertainty: for a normal distribution, the entropy grows with the logarithm of the variance. More spread = more uncertainty = more information required to specify any given value.

And this connects to M = C/T. Meaning arises when patterns persist (coherence over time). Descriptive statistics quantify that persistence. Low standard deviation = high coherence = high meaning. The data has structure you can rely on.

Conversely, maximum entropy is uniform randomness. All values equally likely. No structure. No summary helps—every value is as "typical" as any other. That's the limit where descriptive statistics break down.


Practical Workflow: How to Describe Data

Here's the checklist:

1. Visualize first. Always plot the data before calculating anything. Look for outliers, skew, multimodality.

2. Report center: mean, median, mode. If mean and median differ substantially, note it. Your data is skewed.

3. Report spread: range, IQR, standard deviation. If standard deviation is large relative to the mean, your data has high variance.

4. Check for outliers. Are there extreme values? Do they matter? Should you remove them or report them separately?

5. Describe the shape. Symmetric or skewed? Unimodal or multimodal? Heavy tails?

6. Contextualize. What does "high variance" mean for your domain? Is an 8 cm standard deviation in height "large"? No. Is an 8% standard deviation in death rates "large"? Yes.

7. Don't over-summarize. If your data has meaningful structure—subgroups, time trends, spatial patterns—don't collapse it into one number. Disaggregate.
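As a starting point, here's a sketch of steps 2 through 5 rolled into one helper (the function name and output format are just illustrative):

```python
import numpy as np
from scipy import stats

def describe(x):
    """Hypothetical one-stop summary covering center, spread, and shape."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(), "median": med,
        "std": x.std(ddof=1), "iqr": q3 - q1,
        "min": x.min(), "max": x.max(),
        "skewness": stats.skew(x), "excess_kurtosis": stats.kurtosis(x),
    }

# If mean and median diverge, or the std dwarfs the IQR, go back to step 1 and plot.
```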


What's Next

Descriptive statistics summarize data. But they don't generalize. They describe your sample, period.

To make claims about the population—to infer beyond your data—you need inferential statistics. And that starts with understanding sampling.

Next up: Sampling and Populations—how we go from "these 100 people" to "all humans."


Further Reading

  • Anscombe, F. J. (1973). "Graphs in statistical analysis." The American Statistician, 27(1), 17-21.
  • Cleveland, W. S. (1985). The Elements of Graphing Data. Wadsworth Advanced Books and Software.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  • Wainer, H. (1984). "How to display data badly." The American Statistician, 38(2), 137-147.

This is Part 3 of the Statistics series, exploring how we extract knowledge from data. Next: "Sampling and Populations."


Previous: What Is Statistics? Making Sense of Data | Next: Sampling and Populations: Part Representing Whole