Sparse Autoencoders: Extracting the Dictionary of Neural Concepts
Series: Mechanistic Interpretability | Part: 5 of 9
The problem seemed insurmountable. Neural networks work brilliantly, achieving superhuman performance on increasingly complex tasks. But when you try to understand how they work—when you peer inside to find the concepts they've learned—you encounter something that looks less like orderly reasoning and more like a high-dimensional mess. Neurons don't represent single concepts. Instead, each activation represents a jumbled superposition of many features, packed together through the mathematical accident of having more concepts than neurons to represent them.
This is the superposition problem we explored earlier: neural networks compress hundreds or thousands of concepts into far fewer neurons, creating a representation that's informationally efficient but interpretively opaque. It's like trying to read a book where every word is a homonym with dozens of meanings, and you need the full context of every other word to disambiguate each one.
For years, this seemed like a fundamental barrier to interpretability. If neurons don't correspond to concepts, what hope do we have of understanding the conceptual vocabulary of a neural network?
Then researchers at Anthropic and elsewhere realized something crucial: just because the network stores concepts in superposition doesn't mean we can't extract them into a cleaner representation. What if we could find the true dictionary of features the network is using—the actual concepts it has learned—even though they're compressed into a smaller number of neurons?
The answer is sparse autoencoders: a technique for decomposing neural activations into their constituent features, revealing the interpretable vocabulary hidden beneath the superposition.
The Dictionary Learning Perspective
To understand sparse autoencoders, we need to shift our mental model of what neural networks are doing.
Traditional view: A neural network is a composition of learned transformations. Each layer applies a nonlinear function to its inputs, gradually transforming raw data into useful representations. Neurons are the basic units of representation.
Dictionary learning view: A neural network learns a dictionary of features—a collection of meaningful concepts like "this is a curve," "this is red," "this is a face," or in language models, "this token is part of a quotation" or "this clause is sarcastic." These features are the network's actual vocabulary, the primitive concepts it uses to understand and generate outputs.
But here's the key insight: the features are not the same as the neurons. Due to superposition, many features are represented as sparse patterns of activation across multiple neurons. Each neuron participates in representing many features, and each feature activates a sparse subset of neurons.
This is exactly analogous to dictionary learning in signal processing, where you try to represent a signal as a sparse linear combination of elements from an overcomplete dictionary. A sound might be decomposed into a few frequency components from a much larger basis of possible frequencies. An image might be represented as a sparse combination of edges and textures from a large dictionary of primitive visual patterns.
Sparse autoencoders bring this same principle to neural network activations: decompose the activation vector at any layer into a sparse combination of features from a larger learned dictionary.
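To make the dictionary picture concrete, here is a toy sketch in NumPy (the dimensions, dictionary, and sparse code are all invented for illustration): a dense 512-dimensional "activation" that is secretly just a weighted sum of five atoms from a 4,096-entry dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 512, 4096          # neurons vs. dictionary size (toy numbers)
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)  # unit-norm atoms

# A sparse code: only a handful of the 4,096 features are active at once.
codes = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
codes[active] = rng.uniform(0.5, 2.0, size=5)

# The observed activation is a dense 512-dim vector, but it is "really"
# a weighted sum of just five dictionary atoms.
activation = codes @ dictionary
print(activation.shape, np.count_nonzero(codes))   # (512,) 5
```

The SAE's job is the inverse problem: given only the dense activation, recover a dictionary and a sparse code like these.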
The Architecture: Autoencoders with a Sparsity Constraint
A sparse autoencoder (SAE) is remarkably simple in structure. It consists of three components:
- An encoder: Takes the neural network's activation vector as input and maps it to a higher-dimensional space. If the network layer has 512 neurons, the encoder might map to 4,096 or 16,384 dimensions—creating an overcomplete representation with far more dimensions than the original activation.
- A decoder: Maps from this high-dimensional space back to the original activation dimension, attempting to reconstruct the network's original activation.
- A sparsity penalty: Encourages the high-dimensional representation to be sparse—meaning most dimensions should be zero for any given input, with only a few active at once.
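In code, a minimal version of this architecture might look like the following sketch (PyTorch, with a plain ReLU encoder and hypothetical dimensions; real implementations typically add details such as unit-normalized decoder columns and bias handling that are omitted here).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: overcomplete ReLU encoder + linear decoder."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # 512 -> 4096
        self.decoder = nn.Linear(n_features, d_model)   # 4096 -> 512

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative feature activations, pushed toward sparsity by the loss
        x_hat = self.decoder(f)           # attempted reconstruction of the original activation
        return x_hat, f
```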
The training objective has two competing goals:
Reconstruction loss: The autoencoder should accurately reconstruct the original activation. If you pass a network activation through the encoder and then the decoder, you should get back something very close to what you started with.
Sparsity loss: The encoded representation should use as few dimensions as possible. For any given input, only a small fraction of the learned features should be active.
Mathematically, if x is the original activation, f(x) is the encoder output, and g(f(x)) is the decoder reconstruction:
Loss = ||x - g(f(x))||² + λ ||f(x)||₁
The first term is reconstruction error (L2 norm of the difference). The second term is the L1 sparsity penalty, which encourages most elements of f(x) to be exactly zero. The hyperparameter λ controls the trade-off: higher λ means sparser representations but potentially worse reconstruction; lower λ means better reconstruction but less interpretable features.
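Using the module sketched above, the two-term objective can be written directly. This is a sketch only; the exact reductions and normalizations vary across implementations.

```python
def sae_loss(x, x_hat, f, lam: float = 1e-3):
    # Reconstruction term: squared L2 distance between input and reconstruction.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    # Sparsity term: L1 penalty on the feature activations.
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + lam * sparsity
```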
The magic happens through the interaction of these two objectives. The autoencoder can't just memorize the input—it has to find a small number of meaningful features that, when combined, can reconstruct the original activation. The sparsity constraint forces it to discover the true concepts the network is using, the actual dictionary of features that were compressed into the lower-dimensional neuronal representation through superposition.
Training on Internal Activations
Here's a crucial detail: sparse autoencoders aren't trained on the network's inputs and outputs. They're trained on the network's internal activations—the vectors of neural responses at specific layers during the network's normal operation.
The process works like this:
- Collect a large dataset of activations: Run your neural network (let's say a language model) on a large corpus of text. For each token processed, record the activation vector at the layer you want to interpret. You might end up with millions of activation vectors, each representing the network's internal state when processing a particular token in context. This dataset captures the distribution of internal states the network actually visits during real computation—not hypothetical or adversarially constructed states, but the activations that occur during the network's ordinary operation.
- Train the sparse autoencoder: Using these collected activations as training data, optimize the autoencoder's encoder and decoder weights to minimize reconstruction loss while maintaining sparsity. The autoencoder learns to decompose each activation into a sparse combination of features from its learned dictionary. This is typically done with standard gradient descent, treating the SAE as a separate network learning a useful transformation of the base model's activations. The training can take days or weeks for large models, requiring careful tuning of the sparsity coefficient λ to balance reconstruction quality against feature interpretability. (A sketch of the collection and training steps appears after this list.)
- Interpret the learned features: Once trained, examine what each dimension of the encoder output represents. What inputs cause a particular feature to activate? What does the decoder weight vector for that feature look like? What happens downstream in the network when that feature is active? Researchers use a combination of automated analysis (finding maximum-activating examples from large datasets) and manual inspection (reading through examples to identify semantic patterns) to characterize each feature.
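Here is a rough sketch of the collection and training steps, reusing the SparseAutoencoder and sae_loss sketches from above. The model name, layer index, corpus, and hyperparameters are placeholders; a real run uses a large corpus, far more optimization steps, and much larger dictionaries.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: collect activations at one layer (model, layer, and corpus are placeholders).
model_name, layer, device = "gpt2", 6, "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

corpus = ["The Golden Gate Bridge spans the bay.", "def parse(line): return line.split(',')"]
activations = []
with torch.no_grad():
    for text in corpus:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (1, seq_len, d_model); keep one row per token.
        activations.append(out.hidden_states[layer].squeeze(0))
acts = torch.cat(activations)             # (n_tokens, d_model)

# Step 2: train the SAE on the collected activations.
sae = SparseAutoencoder(d_model=acts.shape[-1], n_features=8 * acts.shape[-1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(1000):                   # real runs use far more data and steps
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    x_hat, f = sae(batch)
    loss = sae_loss(batch, x_hat, f, lam=1e-3)
    opt.zero_grad()
    loss.backward()
    opt.step()
```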
The result is a learned dictionary where each dimension corresponds to a (hopefully) interpretable feature. The encoder tells you "which features are active in this activation," and the decoder tells you "what pattern of neural activation corresponds to each feature."
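The interpretation step is mostly analysis. A minimal sketch of its automated part, finding which token positions in the collected corpus most strongly activate a chosen feature, is below (again reusing the toy objects above; the feature index is arbitrary, and a real pipeline would map positions back to tokens and surrounding context).

```python
# Which tokens in the collected corpus most strongly activate feature 123?
feature_idx, top_k = 123, 10               # hypothetical feature index
with torch.no_grad():
    _, f = sae(acts)                       # (n_tokens, n_features)
scores = f[:, feature_idx]
top = torch.topk(scores, k=min(top_k, scores.shape[0]))
for value, token_pos in zip(top.values, top.indices):
    print(f"activation {value.item():.3f} at token position {int(token_pos)}")
```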
This approach is fundamentally unsupervised—the SAE learns its dictionary purely from the statistics of activations, without any human labeling of concepts. The interpretability emerges from the structure of the network's learned representations, not from external supervision. This is why the discoveries can be surprising: features like base64 encoding or specific cultural references weren't anticipated by researchers but were found to exist in the network's internal ontology.
What Sparse Autoencoders Reveal
When researchers at Anthropic applied sparse autoencoders to Claude and other language models, they found something remarkable: the learned features often correspond to strikingly interpretable concepts.
Some examples from published research:
The Golden Gate Bridge feature: A feature that activates specifically for mentions of the Golden Gate Bridge, but also for related concepts like San Francisco landmarks, bridge architecture, and even metaphorical uses like "bridging ideas." What's remarkable is the feature's semantic coherence—it captures not just the literal landmark but the cluster of associated meanings that make the Golden Gate Bridge culturally significant. This is concept learning, not just pattern matching.
The sarcasm feature: A feature that activates on sarcastic statements, even when the surface-level words are positive. The network has learned to detect the rhetorical move of meaning-inversion, recognizing that "Oh great, just what I needed" in certain contexts means the opposite of what the words literally say. This demonstrates that SAE features can capture pragmatic and rhetorical structure, not just semantic content.
Code features: Features that activate specifically on function definitions, loop constructs, or error handling patterns in programming contexts. One particularly interesting discovery was a feature that activated on code indentation—the network had learned that whitespace structure carries syntactic meaning in Python and YAML, treating formatting as a distinct conceptual dimension.
Grammatical features: Features corresponding to syntactic structures—one might activate on the subject of a sentence, another on relative clauses, another on conditional statements. These features reveal that language models learn explicit grammatical knowledge, even though they're never trained on parse trees or grammatical rules. The grammar emerges from statistical patterns in text.
Base64 encoding feature: In a particularly striking example, researchers found a feature that activated specifically on base64-encoded text, even though this is a relatively rare pattern in training data. The network had learned to recognize this specific encoding scheme as a distinct concept, suggesting it had formed an internal model of the encoding transformation itself. The network doesn't just predict tokens—it learns about the processes that generate them.
What makes these features interpretable is their selectivity and composition. Each feature activates on a relatively narrow set of related inputs (selectivity), and the full activation can be understood as a weighted sum of multiple features combining to represent the full meaning of the input (composition).
This is profoundly different from looking at individual neurons. A single neuron typically participates in representing many unrelated features—it might be slightly active for "the word 'the'," "text in Spanish," "beginning of a sentence," and "technical documentation" all at once, with no coherent interpretation. But a feature extracted by a sparse autoencoder often has a clean, interpretable meaning.
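Composition can be checked directly on a single activation: the SAE's reconstruction is exactly the weighted sum of the active features' decoder directions (plus the decoder bias in this sketch). Continuing with the toy objects from the earlier sketches:

```python
with torch.no_grad():
    x = acts[0]                                   # one token's activation
    x_hat, f = sae(x.unsqueeze(0))
    active = torch.nonzero(f[0] > 0).squeeze(-1)  # indices of the currently active features
    # Rebuild the reconstruction by hand from the active features' decoder columns.
    manual = f[0, active] @ sae.decoder.weight.T[active] + sae.decoder.bias
    print(len(active), torch.allclose(manual, x_hat[0], atol=1e-4))
```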
The Geometry of Disentanglement
From a geometric perspective, sparse autoencoders are performing disentanglement: they're finding a basis for the activation space where each basis vector corresponds to a relatively independent source of variation.
The original neuronal basis—the activation of each individual neuron—is entangled. Changing the meaning of what's being processed requires changing many neurons simultaneously in a coordinated way, and each neuron participates in many different meanings.
The SAE-learned basis is disentangled. Each feature direction in the high-dimensional space corresponds to a relatively independent concept. You can increase the "Golden Gate Bridge-ness" of a representation by moving along one feature direction, and increase the "sarcasm-ness" by moving along another, and these operations are largely independent.
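This independence is what feature steering exploits: add a multiple of one feature's decoder direction to an activation and you nudge the representation along that one concept. A hedged sketch, with an arbitrary feature index and scale (real steering experiments intervene on the model's activations during a live forward pass rather than on stored vectors):

```python
with torch.no_grad():
    x = acts[0]                                         # one token's activation
    bridge_feature, scale = 123, 4.0                    # hypothetical feature index and strength
    direction = sae.decoder.weight[:, bridge_feature]   # decoder column for that feature, shape (d_model,)
    steered = x + scale * direction                     # moves along one concept's direction only
```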
This connects to the concept of manifolds in neural networks. The network's actual data lies on a lower-dimensional manifold embedded in the high-dimensional activation space. But the neuronal coordinates—the activations of individual neurons—don't align with the natural coordinates of this manifold. The SAE is learning coordinates that do align with the manifold's structure, revealing the true dimensionality and organization of the learned representation.
In information geometry terms, we might say the SAE is finding a coordinate system where the Fisher information metric becomes more diagonal—where different features provide relatively independent information about the network's epistemic state.
Limitations and Open Questions
Sparse autoencoders are currently the state of the art in feature extraction, but they're not a complete solution to interpretability.
Computational cost: Training SAEs requires collecting millions of activations and optimizing large networks (often with more parameters than the original model layer being interpreted). This is feasible for research labs but remains expensive.
Feature completeness: Are we extracting all the features the network uses, or only the ones that happen to be discoverable through this particular autoencoding approach? There's no guarantee that the SAE dictionary is complete.
Cross-layer interpretation: SAEs are typically trained on single layers. How do features compose across layers? How does a feature in layer 10 relate to features in layer 11? We're still developing methods to track feature evolution through depth.
Polysemanticity still exists: Even SAE features aren't perfectly monosemantic. Many features still activate on multiple distinct concepts, though they are far less entangled than raw neurons. This suggests we may need even larger dictionaries or different architectures.
The grounding problem: Even when we can name what a feature detects, do we understand why the network learned that particular feature? What computational role does it play in the network's overall behavior? Identifying features is not the same as understanding the circuits that connect them.
Toward a Complete Mechanistic Picture
Sparse autoencoders represent a crucial step toward comprehensive mechanistic interpretability: they give us access to the network's conceptual vocabulary, the dictionary of features it has learned.
But a dictionary alone doesn't tell you how language works. You also need grammar—the rules for how concepts combine. In neural network terms, you need to understand the circuits: the computational paths by which features in one layer influence features in downstream layers, ultimately producing the network's outputs.
This is where circuit analysis comes in. With SAE-extracted features as nodes, researchers can now map out how features connect: which features cause which other features to activate, how features combine to make predictions, what computations are performed by particular architectural components.
The combination of sparse autoencoders (for feature extraction) and circuit analysis (for computational structure) is bringing us closer to a complete mechanistic understanding. We're beginning to read neural networks not as inscrutable matrices of floating-point numbers, but as interpretable programs expressed in a learned language.
The next frontier is understanding how these mechanistic descriptions connect to information geometry—how the geometric structure of the network's learned representations relates to the computational functions those representations serve. This is where mechanistic interpretability meets the mathematics of meaning.
This is Part 5 of the Mechanistic Interpretability series, exploring how to reverse-engineer the algorithms learned by neural networks.
Previous: Grokking: When Neural Networks Suddenly Understand
Next: Where Interpretability Meets Information Geometry
Further Reading
- Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic Transformer Circuits Thread.
- Cunningham, H., et al. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv preprint.
- Olah, C., et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
- Elhage, N., et al. (2022). "Toy Models of Superposition." Anthropic Transformer Circuits Thread.