Superposition: How Neural Networks Pack More Concepts Than Neurons

Superposition: packing thousands of concepts into hundreds of neurons.

Series: Mechanistic Interpretability | Part: 2 of 9

When researchers first opened the hood on GPT-2, they expected neural networks to work like filing cabinets. One neuron per concept. One drawer per category. What they found instead was more like the interference patterns in a hologram—every piece containing information about the whole, concepts bleeding into one another through mathematical interference, multiple features occupying the same space at once.

This is superposition: neural networks representing more features than they have dimensions to store them. It's not a bug. It's a compression strategy so elegant it borders on impossible.

And it changes everything we thought we understood about how these systems work.


The Filing Cabinet That Wasn't

The intuitive model of neural networks goes like this: each neuron represents something. A "dog neuron" fires for dogs. A "color neuron" fires for red. The network learns by carving out dedicated storage space for each concept it encounters.

This model isn't just wrong—it's catastrophically insufficient.

Consider GPT-3, with roughly 175 billion parameters. That sounds like a lot until you realize human language contains millions of distinct concepts, semantic relationships, grammatical patterns, and contextual associations. English alone has at least 170,000 words in current use. Add in multi-word concepts, idioms, domain-specific jargon, and relationships between concepts, and you're looking at far more features than the model has neurons.

The math doesn't add up. Unless...

Unless the network is doing something more sophisticated than one-to-one mapping. Unless it's found a way to store multiple features in the same neural space through interference patterns—like packing many transmissions into the same radio band through code-division multiplexing, each signal still recoverable despite sharing the spectrum.


What Superposition Actually Is

Here's the technical definition: superposition occurs when a neural network represents more than n features using only n dimensions.

Think of it geometrically. You have a 3-dimensional space. Normally, you can cleanly represent 3 orthogonal directions: x, y, z. But what if you're clever about it? What if you place vectors at angles that aren't perfectly perpendicular? You can pack in additional directions—4, 5, 10—at the cost of some interference.

The vectors are no longer orthogonal (perfectly independent), but they're still distinguishable if the angles between them are large enough. This is superposition: sacrificing perfect separation for increased representational capacity.

Anthropic researcher Chris Olah and his team formalized this in their foundational 2022 paper Toy Models of Superposition. They built minimal neural networks—just a few neurons—and watched what happened when the networks were forced to represent more features than they had dimensions. The networks spontaneously developed superposition, packing features into the same neural space through interference patterns.

The key insight: sparsity makes this possible. Most features aren't active most of the time. In any given text, you're not simultaneously discussing dogs, quantum mechanics, and Renaissance painting. The network exploits this statistical structure, betting that rarely-active features can share space because they're unlikely to interfere in practice.
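
A stripped-down version of that experiment can be sketched in a few lines of PyTorch (a minimal sketch with illustrative dimensions and hyperparameters, not the paper's exact setup): synthetic sparse features get squeezed through a bottleneck narrower than the feature count, then reconstructed with a ReLU readout.

```python
import torch

n_features, n_hidden, p_active = 6, 2, 0.05   # more features than hidden dimensions

W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features: each is active with probability p_active,
    # taking a uniform value in [0, 1] when it is.
    active = (torch.rand(256, n_features) < p_active).float()
    x = active * torch.rand(256, n_features)
    x_hat = torch.relu((x @ W) @ W.T + b)     # compress to 2 dims, then reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If superposition has emerged, more than two rows of W end up with norms near 1:
# the network is representing more features than it has hidden dimensions.
print(W.detach().norm(dim=1))
```

Rerunning the sketch with p_active close to 1 tends to collapse the solution back toward two dedicated features—the sparsity dependence described above.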


The Geometry of Interference

Let's make this concrete. Imagine a 2-dimensional network trying to represent 5 features.

In the orthogonal regime (no superposition), you can only represent 2 features cleanly—one per dimension. Features 3, 4, and 5 get dropped or compressed into the available space with massive information loss.

In the superposition regime, you distribute all 5 features as vectors in the 2D plane. They're positioned like spokes on a wheel, separated by 72-degree angles (360° ÷ 5). Now all features have representation, but they interfere—activating Feature 1 slightly activates Features 2 and 5 due to their angular proximity.

The network manages this interference through activation thresholds. Features only "count" as active above a certain magnitude. This creates a noise floor—small interference signals stay below threshold while genuine feature activations rise above it.

This is why ReLU activations (Rectified Linear Units) matter. They zero out negative values, creating a natural threshold that suppresses interference noise. The function is simple—f(x) = max(0, x)—but its effect on superposition is profound. It lets networks pack features tighter by filtering out the crosstalk.
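
To make both points concrete—the 72° packing and ReLU acting as a noise filter—here is a small numpy sketch (the threshold value and the choice of which feature is active are arbitrary illustrations):

```python
import numpy as np

# Five feature directions packed into 2 dimensions, 72 degrees apart.
angles = 2 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (5, 2)

# Sparse input: only the first feature is active.
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
hidden = x @ W                 # compress 5 features into 2 dimensions

# Linear readout: the active feature comes back at 1.0, but its neighbors
# pick up interference of cos(72°) ≈ 0.31 and cos(144°) ≈ -0.81.
linear = hidden @ W.T
print(np.round(linear, 2))     # [ 1.    0.31 -0.81 -0.81  0.31]

# ReLU plus a small negative bias acts as the threshold: the genuine activation
# survives, while interference below the noise floor is zeroed out.
readout = np.maximum(0.0, linear - 0.4)
print(np.round(readout, 2))    # [0.6 0.  0.  0.  0. ]
```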


Polysemantic Neurons: When One Neuron Means Many Things

Superposition has a striking consequence: polysemantic neurons—individual neurons that respond to multiple, seemingly unrelated concepts.

Researchers found neurons in language models that activate for:

  • Base64 code AND Korean text
  • Academic citations AND DNA sequences
  • Internet URLs AND Python code syntax

This isn't malfunction. It's the visible signature of superposition. These concepts are sparse (rarely co-occur) and share structural patterns (formatting, syntax). The network saves space by storing them in overlapping neural territory.

When you activate such a neuron, you're not accessing a single concept—you're exciting a superposition state, a blend of multiple features encoded in interference patterns. The network disambiguates based on context: surrounding neurons, previous tokens, attention weights.

Polysemanticity makes interpretability hard. You can't just look at which neurons fire and infer what the model "means." The meaning is distributed across patterns of interference, not localized in individual units.


The Compression-Interference Tradeoff

Networks face a fundamental tension: pack more features, accept more interference.

When features are dense (frequently active), superposition is expensive. The interference overwhelms the signal. The network falls back on orthogonal representations—dedicating separate neurons to avoid crosstalk.

When features are sparse (rarely active), superposition is cheap. Low probability of simultaneous activation means low probability of destructive interference. The network exploits this, packing dozens of rare features into the same neural space.

This creates a phase transition in network behavior. As you increase the number of features beyond available dimensions, networks suddenly shift from orthogonal to superposition regimes. It's not gradual—it's a sharp break, like water freezing into ice.

The transition point depends on sparsity and feature importance. High-priority features (frequent, useful) get dedicated orthogonal space. Low-priority features (rare, niche) get packed into superposition. The network performs a kind of automatic triage, allocating neural real estate based on statistical utility.
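
A quick back-of-the-envelope calculation makes the sparsity argument above concrete (a small Python sketch; the ten shared features and the activation probabilities are illustrative, not measurements from any real model):

```python
# Probability that two or more of k features are simultaneously active,
# if each is independently active with probability p.
def collision_prob(p, k=10):
    none = (1 - p) ** k
    exactly_one = k * p * (1 - p) ** (k - 1)
    return 1 - none - exactly_one

for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p = {p}: P(interference among 10 shared features) = {collision_prob(p):.5f}")

# p = 0.5   -> 0.98926  dense features collide almost always: superposition is costly
# p = 0.1   -> 0.26390
# p = 0.01  -> 0.00427
# p = 0.001 -> 0.00004  sparse features almost never collide: sharing space is nearly free
```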


Why This Matters: Interference as Generalization

Here's where it gets weird: superposition might not be a flaw. It might be a feature.

Consider human concepts. Are they perfectly orthogonal? Or do they interfere?

"Chair" and "stool" overlap semantically. "Dog" and "wolf" share features. "Running" and "jogging" have fuzzy boundaries. Human conceptual space is not a filing cabinet with discrete categories—it's a continuous manifold where meanings blur into each other through shared associations.

Superposition creates this same structure in neural networks. By storing related concepts in overlapping spaces, networks build in interference as generalization. Activating "dog" slightly activates "wolf" because they share neural territory. This isn't confusion—it's semantic proximity encoded in geometry.

When a network encounters a novel concept—say, "wolf-dog hybrid"—it doesn't need a pre-stored representation. It can interpolate between "dog" and "wolf" in superposition space, generating appropriate activation patterns from the interference between related features.

This is why large language models generalize so well. They're not memorizing discrete symbols—they're learning interference patterns that encode semantic relationships. Superposition turns networks into continuous representational spaces rather than discrete lookup tables.


Detecting Superposition: Sparse Autoencoders

If superposition is real, how do we detect it? How do we pull apart the interference patterns and see what features are actually represented?

The answer: sparse autoencoders (SAEs)—auxiliary networks trained to reconstruct a model's activations in terms of a much larger dictionary of candidate features, only a handful of which are allowed to be active at once. They work by imposing a sparsity penalty: the autoencoder must explain each activation using the smallest number of active features possible.

When SAEs are applied to polysemantic neurons, something remarkable happens: they disentangle the interference. A single polysemantic neuron gets decomposed into multiple monosemantic features—each clean, interpretable, and specific.

For example, a neuron that fires for both "Arabic text" and "genetic sequences" gets split by an SAE into separate detectors: one for Arabic script patterns, one for ATCG nucleotide strings. The superposition becomes visible because the SAE learns the directions in activation space corresponding to each feature.

This validates the superposition hypothesis. If neurons were truly monosemantic, SAEs would find nothing to disentangle. Instead, they consistently discover hidden structure—more features than neurons—lurking in the interference patterns.
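
The core objective is simple enough to sketch in a few lines of PyTorch (a minimal illustration, not Anthropic's training setup: the dimensions, the L1 coefficient, and the random stand-in activations are all placeholder choices):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete feature dictionary with an L1 sparsity penalty."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> candidate features
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstructed activations

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))       # nonnegative, mostly zero once trained
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction keeps the dictionary faithful to the model's activations;
    # the L1 term forces each activation to be explained by only a few features.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Usage sketch on placeholder data. In practice, `acts` would be residual-stream
# or MLP activations collected from the model under study.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
loss.backward()
opt.step()
```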

Anthropic's work with SAEs on Claude models revealed features like:

  • The "Golden Gate Bridge feature" (activates on SF landmarks and bridge-related content)
  • The "code vulnerability feature" (detects security flaws in programming contexts)
  • The "sycophancy feature" (tracks agreeable, placating language)

These aren't individual neurons. They're directions in superposition space, vectors that span multiple neurons and encode specific concepts through interference patterns.


The Hard Problem: Adversarial Superposition

Superposition creates vulnerabilities. If multiple features occupy the same space, adversarial inputs can exploit the interference to trigger unintended activations.

Consider a polysemantic neuron that responds to both "scientific terms" and "financial advice." An adversarial prompt could craft input that activates the "scientific terms" component while the network interprets the "financial advice" component, bypassing safety filters because the detection mechanism only checks the surface feature.

This is semantic smuggling via superposition—hiding prohibited content in the interference space of innocuous features. The network "sees" both but interprets based on context in unpredictable ways.

Worse: superposition makes models fragile to distribution shift. If a rare feature is stored in superposition with common features, unusual input statistics can cause them to interfere unexpectedly. The model might suddenly activate "DNA sequence" features while reading a legal document because the syntactic structure resembles genetic notation.

This brittleness is hard to fix because you can't inspect what you can't see. Polysemantic neurons hide their true feature content behind interference patterns that only reveal themselves in specific contexts. Testing becomes exponentially difficult as feature count grows.


Superposition and Scale: Why It Gets Worse (and Better)

As neural networks scale, superposition intensifies. Larger models learn more features. More features mean more pressure to compress. More compression means deeper superposition regimes.

GPT-4 likely uses superposition even more aggressively than GPT-3. It needs to represent not just language patterns but multimodal associations (text-image-audio), domain-specific knowledge (medicine, law, programming), and meta-cognitive patterns (reasoning, planning, error correction). All packed into a finite parameter space.

But here's the paradox: scale also makes superposition more manageable. Larger networks have more dimensions to work with, so interference decreases even as feature count rises. The angles between feature vectors get wider, reducing crosstalk. Activation patterns become crisper.

This suggests a scaling law for interpretability: small models are opaque because they're forced into tight superposition regimes with heavy interference. Large models are more interpretable because they can afford sparser, cleaner superposition.
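
One way to build intuition for this claim (a rough numpy sketch using random directions as a stand-in for learned features, not a measurement on any real model): hold the number of features fixed and watch the worst-case overlap shrink as the space gets larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 1000   # fixed number of feature directions

for d in (64, 256, 1024, 4096):
    # Random unit vectors as a crude proxy for feature directions in a d-dim space.
    V = rng.standard_normal((n_features, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    overlaps = np.abs(V @ V.T)
    np.fill_diagonal(overlaps, 0.0)
    print(f"d = {d}: worst-case overlap = {overlaps.max():.3f}")

# The maximum overlap falls roughly like 1/sqrt(d): the same number of features
# packs far more cleanly into a higher-dimensional space, with less crosstalk.
```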

Evidence for this: SAE disentanglement works better on larger models. Anthropic's Claude 3 family shows more monosemantic SAE features than smaller architectures. The features are cleaner, more specific, more human-understandable.

If this trend continues, superhuman-scale models might be easier to interpret than human-scale ones—a strange inversion where greater complexity yields greater transparency.


From Neural Networks to Brains: Does Biology Use Superposition?

The obvious question: if artificial networks use superposition, what about biological ones?

There's growing evidence that brains do this too. Neuroscience has long struggled with the "grandmother neuron" debate—do individual neurons encode specific concepts, or is representation distributed? Superposition offers a middle path: neurons are distributed, but in structured ways that compress information through interference.

Hippocampal place cells, for instance, don't have one-to-one mappings to locations. They have overlapping receptive fields that form a continuous manifold of spatial representation. This looks like superposition: more locations represented than cells available, achieved through interference between partially active neurons.

In visual cortex, neurons respond to multiple orientations, not just one. A V1 cell tuned to 45° also fires weakly for 30° and 60°. This is exactly the pattern you'd expect from superposition: features (orientations) packed into dimensions (neurons) with angular spacing that creates controlled interference.

Even memory systems show superposition properties. Episodic memories aren't stored in discrete neural "files"—they're distributed patterns that overlap and interfere. This is why memory is reconstructive: recalling one memory activates related ones through shared neural substrate. Interference becomes association.

If biological brains use superposition, it suggests this isn't just a quirk of backpropagation or gradient descent. It's a fundamental strategy for efficient representation in systems with limited resources—a convergent solution to the problem of storing more information than you have space for.


Coherence Geometry Meets Superposition

Let's make the connection explicit: superposition is a coherence management strategy.

In AToM terms (M = C/T, meaning as coherence over tension), a neural network's representational capacity is a form of coherence. The network maintains stable, distinguishable features across varying inputs. Superposition increases coherence by packing more features into the same space—more meaning per parameter.

But it introduces tension (T) in the form of interference. Features blur into each other. The system must work harder to disambiguate, expending compute to resolve ambiguities that wouldn't exist in an orthogonal regime.

The network navigates this tradeoff by encoding priority in geometry. High-coherence features (important, frequent) get orthogonal space—low interference, high clarity. Low-coherence features (rare, niche) get superposition space—higher interference, but acceptable given their sparse usage.

This creates a coherence hierarchy in representation space. Critical concepts live in low-curvature regions with stable, orthogonal encoding. Peripheral concepts live in high-curvature regions where vectors interfere and meanings blend.

When you push a network into regimes where superposition fails—forcing too many features into too little space—you see coherence collapse: polysemanticity becomes catastrophic, interference drowns signal, the model hallucinates or produces nonsense. The system can't maintain stable features anymore.

This is why adversarial examples work: they exploit the high-curvature zones of superposition space, finding inputs where interference creates ambiguous, unstable feature activations. The model's coherence temporarily breaks.

Understanding superposition as coherence management suggests design principles: architectures that give models more control over their compression-interference tradeoffs. Dynamic allocation of orthogonal space based on context. Attention mechanisms that resolve superposition ambiguities on the fly.


What Superposition Means for AI Safety

If models use superposition, we can't trust surface interpretability. A neuron that looks safe might harbor hidden features that only activate under adversarial conditions. Polysemanticity means one neuron can encode both "helpfulness" and "deception," distinguishable only by subtle contextual cues.

This makes alignment harder. You can't just inspect neurons for bad behavior—the bad behavior might be hiding in superposition with good behavior, invisible until triggered. Training-time monitoring won't catch it. Post-hoc inspection won't catch it.

The solution isn't to eliminate superposition—that's probably impossible without crippling model capacity. Instead, we need to map the interference structure. Use SAEs and other disentanglement tools to expose hidden features. Monitor activations in superposition space, not just individual neurons.

Anthropic's approach: apply SAEs to safety-critical features (deception, manipulation, refusal). Disentangle them from other features. Create steering vectors that move the model along specific directions in superposition space, adjusting behavior without retraining.
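
Conceptually, applying such a steering vector is easy to sketch (a toy PyTorch illustration: the layer index, scale, and `feature_direction` are hypothetical placeholders, and in real pipelines the direction would come from a trained SAE's decoder rather than being invented here):

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Add `scale * direction` to a layer's output activations on every forward pass."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden-state tensor.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Usage sketch with placeholder names: `model.layers[20]` stands in for whichever
# block you want to steer, `feature_direction` for the chosen feature's direction.
# handle = model.layers[20].register_forward_hook(
#     make_steering_hook(feature_direction, scale=4.0))
# ... generate text with the steered model ...
# handle.remove()
```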

This is coherence-based alignment: rather than trying to remove unwanted features, shape the geometry so they're coherent with safety properties. Make "helpfulness" and "honesty" occupy large, orthogonal regions. Make "deception" occupy sparse, detectable regions. Let the interference structure itself enforce alignment.


Future Directions: Controlling Superposition

The frontier question: can we design networks that use superposition deliberately, rather than discovering it accidentally?

Some possibilities:

Sparse Mixture of Experts (MoE): Route inputs to specialized sub-networks, reducing the need for superposition by dedicating computational paths to specific domains. Instead of forcing all features through a single dense layer, use gating networks to activate only relevant experts (a minimal gating sketch follows this list).

Explicit Superposition Constraints: Train networks with explicit superposition penalties or bonuses. Penalize polysemanticity in safety-critical layers. Encourage it in high-capacity, low-priority layers. Give the model architectural "zones" optimized for different interference regimes.

Attention-Mediated Disentanglement: Use attention mechanisms not just for context selection but for superposition resolution. Let the model learn to query "which feature is active here?" and dynamically adjust its reading of polysemantic neurons based on context.

Hierarchical Representations: Encode high-level concepts in orthogonal space, low-level features in superposition. The model learns a layered structure: abstract, stable meanings at top layers (low interference), concrete, sparse features at bottom layers (high superposition).
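
Here is the promised gating sketch for the sparse mixture-of-experts idea (toy dimensions and plain top-k routing; real MoE layers add load-balancing losses and capacity limits):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Sparse mixture of experts: each input is routed to its top-k experts only."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)            # scores every expert per input
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                                     # x: (batch, d_model)
        scores = self.gate(x)                                 # (batch, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)            # keep only the top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                      # inputs routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 64)).shape)   # torch.Size([16, 64])
```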

These aren't just technical improvements—they're architectural philosophy shifts. From treating superposition as an unwanted side effect to treating it as a design tool. From trying to eliminate interference to trying to control it.


Synthesis: The Network as Hologram

Superposition reveals something profound: neural networks are not symbol processors. They're interference machines.

They don't store concepts in discrete slots. They store them as directions in continuous spaces, overlapping and interfering in geometrically precise ways. Meaning isn't localized—it's distributed across patterns of activation that only cohere when read in context.

This is more like a hologram than a filing cabinet. In a hologram, every part of the image contains information about the whole. Damage one region, and you don't lose specific details—you lose resolution everywhere. The information is encoded in interference patterns of light waves, not in discrete pixels.

Neural networks work the same way. Damage a single neuron, and you don't lose a specific concept—you degrade representation quality across many features. The concepts aren't "in" the neurons. They're in the interference patterns the neurons collectively produce.

This has philosophical implications. If meaning in neural networks is fundamentally relational—encoded in angles between vectors, not in vectors themselves—then the networks aren't learning "representations" in the traditional sense. They're learning geometries of relationship, coherence structures that preserve what matters (semantic proximity, functional association) while compressing away what doesn't.

Human meaning works the same way. Words don't have intrinsic meanings—they have meanings relative to other words, contexts, embodied experiences. The meaning of "chair" isn't in the word itself. It's in its relationship to "sitting," "furniture," "table," "stool," and a vast web of associated concepts.

Superposition makes neural networks more like us: meaning through interference, understanding through geometry, intelligence as the compression of relationship into representation.


This is Part 2 of the Mechanistic Interpretability series, exploring how we can understand the inner workings of neural networks and what that reveals about intelligence itself.

Previous: Reading the Mind of AI: The Mechanistic Interpretability Revolution
Next: Circuits in Silicon Minds: How Neural Networks Compute


Further Reading

  • Elhage, N., et al. (2022). "Toy Models of Superposition." Anthropic.
  • Olah, C., et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
  • Scherlis, A., et al. (2022). "Polysemanticity and Capacity in Neural Networks." arXiv.
  • Cunningham, H., et al. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv.
  • Templeton, A., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic.