Logarithms and Information: Why Entropy Uses Log
Information is surprise, and surprise is logarithmic.
When you flip a fair coin, learning the outcome gives you exactly 1 bit of information. Why 1 bit? Because you eliminated half the possibilities. You divided your uncertainty by 2.
Roll a fair die and learn the result? About 2.58 bits. You eliminated 5/6 of the possibilities. The base-2 logarithm of 6 is about 2.58.
This isn't a metaphor. Claude Shannon showed in 1948 that the information gained from a message that singles out one of N equally likely possibilities is exactly the logarithm of N. That discovery launched the digital age.
Logarithms aren't just mathematically convenient for information theory. They're necessary. Information adds when events combine, and only logarithms turn multiplication into addition.
Shannon Entropy: The Core Formula
For a random source with possible outcomes x₁, x₂, ..., xₙ with probabilities p₁, p₂, ..., pₙ:
H = -∑ pᵢ log₂(pᵢ)
This is called Shannon entropy, measured in bits.
Why the negative sign? Because the log of a probability (which is ≤ 1) is negative or zero, so the negative sign makes H non-negative.
Fair coin:
H = -[0.5 log₂(0.5) + 0.5 log₂(0.5)] = -[0.5(-1) + 0.5(-1)] = 1 bit

Fair die:
H = -6 × (1/6) log₂(1/6) = log₂(6) ≈ 2.58 bits

Biased coin (90% heads):
H = -[0.9 log₂(0.9) + 0.1 log₂(0.1)] ≈ -[0.9(-0.152) + 0.1(-3.32)] ≈ 0.47 bits
Less uncertainty means less information when the outcome is revealed. A biased coin is more predictable, so learning its outcome gives less information.
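Here is a minimal sketch of the formula in Python; the function name `shannon_entropy` is just illustrative.

```python
import math

def shannon_entropy(probs):
    """H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(shannon_entropy([1/6] * 6))    # fair die    -> ~2.585 bits
print(shannon_entropy([0.9, 0.1]))   # biased coin -> ~0.47 bits
```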
Why Logarithms? The Additivity Requirement
Shannon required that information satisfy a crucial property: when two independent events occur, the total information should be the sum of the individual informations.
Flip a coin (2 outcomes) and roll a die (6 outcomes). Together, there are 12 possible outcomes. The information should be:
I(coin) + I(die) = I(coin and die)
For this to work, we need a function f with:

f(2) + f(6) = f(12) = f(2 × 6)
What function satisfies f(a) + f(b) = f(a × b)?
Only logarithms, up to a constant factor that amounts to choosing the base. That's the fundamental property: log(a) + log(b) = log(ab).
So information MUST be logarithmic. There's no other option if we want information to add when events combine.
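A quick numerical check of that additivity, reusing the `shannon_entropy` sketch from above:

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = [0.5, 0.5]
die = [1/6] * 6
# Joint distribution of two independent events: products of the marginals.
joint = [pc * pd for pc in coin for pd in die]

print(shannon_entropy(coin) + shannon_entropy(die))  # 1 + 2.585 = ~3.585 bits
print(shannon_entropy(joint))                        # same ~3.585 bits
```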
Bits, Nats, and Hartleys
The base of the logarithm determines the unit:
Base 2 → bits (binary digits)
- Standard in computing
- 1 bit = information from one binary choice
Base e → nats (natural units)
- Common in physics and mathematics
- 1 nat ≈ 1.44 bits
Base 10 → hartleys (or bans)
- Named after Ralph Hartley
- 1 hartley ≈ 3.32 bits
Converting between them:
- bits = nats × log₂(e) ≈ nats × 1.44
- bits = hartleys × log₂(10) ≈ hartleys × 3.32
The formulas are the same; only the units change.
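A small sketch of those conversions, computing one entropy in all three bases:

```python
import math

p = [0.5, 0.5]  # fair coin

bits     = -sum(x * math.log2(x) for x in p)
nats     = -sum(x * math.log(x) for x in p)
hartleys = -sum(x * math.log10(x) for x in p)

print(bits)                      # 1.0
print(nats * math.log2(math.e))  # 1.0, nats converted back to bits
print(hartleys * math.log2(10))  # 1.0, hartleys converted back to bits
```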
Self-Information: The Surprise of a Single Event
The information content of a single event with probability p is:
I(x) = -log₂(p) = log₂(1/p)
This measures surprise. The less likely an event, the more surprising it is when it occurs, and the more information it conveys.
| Event | Probability | Self-information |
|---|---|---|
| Fair coin heads | 0.5 | 1 bit |
| Roll a 6 | 1/6 | 2.58 bits |
| Specific card from deck | 1/52 | 5.7 bits |
| Win lottery (1 in 300M) | 3.3×10⁻⁹ | 28.2 bits |
Certain events (p = 1) carry 0 bits—no surprise, no information. Impossible events would carry infinite information—but they don't happen.
Entropy H is the expected self-information: the average surprise per event.
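A sketch reproducing the table's numbers; the helper name `self_information` is just for illustration.

```python
import math

def self_information(p):
    """Surprise of a single event: -log2(p) bits."""
    return -math.log2(p)

print(self_information(0.5))      # 1.0 bit
print(self_information(1/6))      # ~2.585 bits
print(self_information(1/52))     # ~5.7 bits
print(self_information(1/300e6))  # ~28.2 bits
```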
Data Compression: The Entropy Limit
Shannon's source coding theorem says: you cannot compress data below its entropy.
If a source has entropy H bits per symbol, you need at least H bits per symbol on average to represent it losslessly.
English text has about 1-1.5 bits of entropy per character (after accounting for redundancy), even though ASCII uses 8 bits per character. That's why text compresses well—there's lots of redundancy to exploit.
Random data has maximum entropy. A perfectly random byte has 8 bits of entropy, using all 8 bits of storage. Random data doesn't compress.
Compression algorithms like ZIP, GZIP, and PNG work by getting close to the entropy limit. The remaining gap is the algorithm's inefficiency.
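As a rough illustration (the sample data and helper names are invented for this sketch), here is an order-0 byte-histogram entropy estimate next to what zlib actually produces. The histogram ignores structure between bytes, so zlib can dip below it on repetitive text, while random bytes sit near 8 bits per byte and barely shrink.

```python
import math
import os
import zlib
from collections import Counter

def byte_entropy(data):
    """Order-0 estimate: entropy of the byte-frequency histogram, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog " * 200  # highly redundant
rand = os.urandom(len(text))                                  # essentially incompressible

for label, data in [("text", text), ("random", rand)]:
    packed = zlib.compress(data, 9)
    print(f"{label}: {byte_entropy(data):.2f} bits/byte, "
          f"{len(data)} bytes -> {len(packed)} bytes compressed")
```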
Channel Capacity: The Transmission Limit
For communication channels, Shannon defined channel capacity C—the maximum rate at which information can be transmitted reliably.
C = B × log₂(1 + S/N)
where B is bandwidth (Hz), S is signal power, and N is noise power.
This is the Shannon-Hartley theorem. The logarithm appears because information is inherently logarithmic.
Double the signal-to-noise ratio? You don't double the capacity. At high SNR you add roughly log₂(2) = 1 bit per second per hertz. Diminishing returns, measured logarithmically.
This theorem sets fundamental limits on data rates. Your WiFi, your phone, fiber optics—all operate within these logarithmic bounds.
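A sketch of the formula with made-up example numbers:

```python
import math

def channel_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley: C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# A hypothetical 20 MHz channel at two signal-to-noise ratios:
print(channel_capacity(20e6, 100))  # ~133 Mbit/s
print(channel_capacity(20e6, 200))  # ~153 Mbit/s: doubling SNR adds ~20 Mbit/s, about 1 bit/s/Hz
```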
Relative Entropy: Comparing Distributions
Kullback-Leibler divergence measures how one probability distribution differs from another:
D_KL(P || Q) = ∑ P(x) log₂[P(x) / Q(x)]
This is the extra bits needed if you use a code optimized for distribution Q when the true distribution is P.
KL divergence is always non-negative (Gibbs' inequality). It's zero only when P = Q.
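A minimal sketch, assuming P and Q are given as aligned probability lists with Q(x) > 0 wherever P(x) > 0:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]
biased = [0.9, 0.1]

print(kl_divergence(fair, biased))  # ~0.737 bits: extra cost of assuming the wrong coin
print(kl_divergence(fair, fair))    # 0.0
```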
Applications:
- Machine learning: loss functions for classification
- Statistics: comparing models
- Physics: non-equilibrium thermodynamics
Thermodynamic Entropy: The Physical Connection
Boltzmann's entropy formula:
S = k_B × ln(W)
where W is the number of microstates and k_B is Boltzmann's constant.
This is the same logarithmic structure as Shannon entropy, just in different units (using natural log and physical constants).
The connection isn't coincidental. Thermodynamic entropy measures missing information about the microscopic state. Shannon entropy measures missing information about a message. They're the same concept in different contexts.
This unification was recognized by Jaynes, who showed that thermodynamics can be derived from information-theoretic principles.
1 bit of Shannon entropy corresponds to k_B × ln(2) ≈ 9.6 × 10⁻²⁴ joules per kelvin of thermodynamic entropy.
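The arithmetic behind that number, as a quick sketch:

```python
import math

k_B = 1.380649e-23  # Boltzmann's constant, in joules per kelvin

print(k_B * math.log(2))  # ~9.57e-24 J/K of thermodynamic entropy per bit
```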
Mutual Information: Shared Knowledge
Mutual information measures how much knowing X tells you about Y:
I(X; Y) = H(X) + H(Y) - H(X, Y)
Or equivalently:
I(X; Y) = ∑∑ P(x,y) log₂[P(x,y) / (P(x)P(y))]
If X and Y are independent, mutual information is zero—knowing one tells you nothing about the other.
If X completely determines Y, mutual information equals H(Y)—knowing X eliminates all uncertainty about Y.
Mutual information is symmetric: I(X; Y) = I(Y; X). The information X gives about Y equals the information Y gives about X.
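A small sketch computing mutual information from a joint distribution table; the example distributions are invented for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum p(x,y) * log2[p(x,y) / (p(x) p(y))] over a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent X and Y: every joint probability is the product of the marginals.
independent = {(x, y): 0.25 for x in "ab" for y in "cd"}
# X completely determines Y (here Y = X), so I(X;Y) = H(Y) = 1 bit.
dependent = {("a", "a"): 0.5, ("b", "b"): 0.5}

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```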
Cross-Entropy: Measuring Code Mismatch
Cross-entropy between true distribution P and assumed distribution Q:
H(P, Q) = -∑ P(x) log₂ Q(x)
This measures the average number of bits needed to encode samples from P using a code optimized for Q.
Always: H(P, Q) ≥ H(P), with equality only when P = Q.
In machine learning, cross-entropy loss is the standard for classification. You're penalized for predicting low probability for events that actually occur—exactly what the log captures.
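A short sketch showing the gap between H(P, Q) and H(P), which is exactly the KL divergence from earlier:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits to encode samples from P using a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist    = [0.5, 0.5]  # P: the source really is a fair coin
assumed_dist = [0.9, 0.1]  # Q: the code assumes a heavily biased coin

print(entropy(true_dist))                      # 1.0 bit, the unavoidable minimum H(P)
print(cross_entropy(true_dist, assumed_dist))  # ~1.737 bits: H(P,Q) = H(P) + D_KL(P||Q)
```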
Why All These Formulas Use Logarithms
Every information-theoretic quantity involves logarithms for the same fundamental reason:
Information must add when events combine.
Two independent sources together should give information equal to the sum of their individual informations. Only logarithms satisfy this.
This isn't convention. It's mathematical necessity. If you define information any other way, you get contradictions when combining independent sources.
The logarithm also matches our intuitive sense of "surprise" and "uncertainty." Something 8 times less likely is 3 bits more surprising (log₂(8) = 3). This scales naturally.
The Information-Theoretic Worldview
Information theory reveals that:
- Uncertainty is quantifiable. Entropy gives a precise number to vagueness.
- Compression has fundamental limits. You can't beat entropy, only approach it.
- Communication has fundamental limits. Channel capacity is real and measurable.
- Information is physical. Erasing a bit dissipates at least k_B T ln(2) of energy (Landauer's principle).
- Everything is connected. Shannon entropy, thermodynamic entropy, and quantum information share the same mathematical core.
The logarithm isn't decoration. It's the skeleton of information itself. When Shannon put log into his formula, he didn't invent a convention. He discovered a law.
Information is the log of surprise. That's not philosophy. That's mathematics.
Part 7 of the Logarithms series.
Previous: Logarithmic Scales: When Numbers Span Many Orders of Magnitude
Next: Synthesis: Logarithms as the Language of Growth and Scale