Where Interpretability Meets Information Geometry
Series: Mechanistic Interpretability | Part: 6 of 9
When Anthropic's researchers visualized what they called "features" inside Claude—the concepts their AI had learned to represent—they weren't just looking at activations in neural networks. They were looking at geometry. The features existed as directions in a high-dimensional space, and those directions had relationships to one another that could be measured, mapped, and understood through the mathematics of curved manifolds.
This wasn't metaphor. The same tools differential geometers use to study the curvature of spacetime apply to the shape of meaning inside artificial minds.
And that shape? It looks remarkably similar to what information theorists predict when systems organize themselves to minimize surprise, maximize coherence, and persist through time.
The Geometry of Features
In previous articles, we've explored how neural networks use superposition to pack in more concepts than they have neurons, how sparse autoencoders can extract those concepts as interpretable features, and how circuits wire those features together to perform computations.
But what is a "feature," really?
In mechanistic interpretability, a feature is a direction in activation space. When a neural network processes input, each layer produces a vector—a point in a high-dimensional space. Features are specific directions in that space that correspond to recognizable concepts. The "Golden Gate Bridge" feature points one way. The "love poetry" feature points another. The "sarcasm" feature yet another.
These aren't arbitrary. They're statistically structured. Similar concepts cluster together. Abstract concepts often emerge as combinations of more concrete ones. And the relationships between features—their angles, their distances, their overlaps—encode semantic relationships.
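To make that concrete, here is a minimal numpy sketch, with illustrative stand-in vectors rather than real model features: a feature's activation on an input is just the projection of the activation vector onto the feature's unit direction, and the cosine between two feature directions is a crude measure of how related they are.

```python
# A minimal sketch with illustrative stand-in vectors (not real model features).
# A feature is a unit direction in activation space; how strongly it "fires" on
# an input is the projection of the activation vector onto that direction, and
# the cosine between two feature directions measures how related they are.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical feature directions (in practice these come from a sparse autoencoder).
golden_gate = unit(rng.normal(size=d_model))
bridge = unit(golden_gate + 0.5 * unit(rng.normal(size=d_model)))  # a related concept
sarcasm = unit(rng.normal(size=d_model))                           # an unrelated concept

# An activation vector some layer might produce on a Golden-Gate-heavy input.
activation = 3.0 * golden_gate + 0.5 * sarcasm + 0.1 * rng.normal(size=d_model)

print("Golden Gate activation  :", round(float(activation @ golden_gate), 2))  # strong
print("cos(golden_gate, bridge) :", round(float(golden_gate @ bridge), 2))     # high
print("cos(golden_gate, sarcasm):", round(float(golden_gate @ sarcasm), 2))    # near zero
```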
This is where information geometry enters.
Information geometry is the study of probability distributions as geometric objects. Instead of thinking of a probability distribution as a list of numbers, you think of it as a point on a manifold—a curved surface in a high-dimensional space. The curvature of that manifold tells you how different distributions relate to one another, how easy it is to move between them, and which ones are "close" in a meaningful sense.
When neural networks learn, they're not just adjusting weights. They're navigating a curved manifold of possible representations, searching for regions where the geometry supports the task.
The Fisher Metric and the Curvature of Learning
The central object in information geometry is the Fisher information metric, a Riemannian metric on the space of probability distributions that supplies its notions of distance and curvature. Think of it as quantifying how sensitively a model's predictions change when its parameters are nudged at any given point.
High curvature means small changes in parameters produce large changes in predictions. Low curvature means the system is stable: robust to perturbations.
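In symbols (these are the standard definitions, not anything specific to a particular network), the Fisher metric is the expected outer product of the score function, and it governs how far the model's output distribution moves, in KL-divergence terms, under a small parameter nudge δ:

```latex
% Fisher information metric (standard definition) and its link to local KL divergence
F_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
  \frac{\partial \log p_\theta(x)}{\partial \theta_i}
  \frac{\partial \log p_\theta(x)}{\partial \theta_j}
\right],
\qquad
D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right)
\approx \tfrac{1}{2}\, \delta^{\top} F(\theta)\, \delta .
```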
For neural networks, this has profound implications.
When a network is training, gradient descent is navigating a loss landscape whose geometry is anything but flat. The Fisher metric determines which directions in parameter space are "easy" to move in and which are "hard." Natural gradient descent, an optimization method that takes this curvature into account, moves more efficiently by respecting the geometry of the problem rather than treating all directions as equal.
Here's the mathematical intuition: ordinary gradient descent treats parameter space as if it were flat—Euclidean. It moves in the direction of steepest descent measured by the simple L2 norm. But parameter space isn't flat. It's curved. A step of size ε in one direction might produce a tiny change in the network's outputs, while the same-sized step in another direction might produce catastrophic changes.
Natural gradient descent respects this curvature. It uses the Fisher metric to measure distances properly, taking steps that are small in the geometry of the probability distributions the network produces, not just in the geometry of parameter space. This is why natural gradient methods converge faster and more reliably—they're not fighting the geometry, they're following it.
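Here is a toy sketch of that preconditioning step, under deliberately simple assumptions: a single Bernoulli parameter fit by gradient ascent on the log-likelihood, where the exact Fisher information 1/(θ(1−θ)) rescales the raw gradient into a natural-gradient step. In a real network the Fisher matrix has to be approximated (diagonal or Kronecker-factored estimates are common), but the logic is the same.

```python
# A toy sketch (not any particular paper's method): vanilla vs. natural gradient
# ascent on the log-likelihood of a Bernoulli(theta) model. The exact Fisher
# information is 1 / (theta * (1 - theta)); the natural gradient preconditions
# the raw gradient by its inverse.
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.9, size=1000)   # samples from a coin with true theta = 0.9

def grad_log_lik(theta, x):
    # d/d(theta) of the mean Bernoulli log-likelihood
    return np.mean(x / theta - (1 - x) / (1 - theta))

def fit(natural, theta=0.5, lr=0.05, steps=100):
    for _ in range(steps):
        g = grad_log_lik(theta, data)
        if natural:
            fisher = 1.0 / (theta * (1.0 - theta))   # exact Fisher information
            g = g / fisher                           # natural gradient = F^{-1} * gradient
        theta = float(np.clip(theta + lr * g, 1e-4, 1 - 1e-4))
    return theta

print("vanilla gradient ascent:", round(fit(natural=False), 3))
print("natural gradient ascent:", round(fit(natural=True), 3))
```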
But here's the deeper insight: the curvature of the parameter space shapes the structure of learned representations.
Networks that minimize loss while respecting the geometry of the data end up organizing their features in ways that reflect the intrinsic structure of the world they're modeling. The "Golden Gate Bridge" feature isn't just a random direction. It's a direction that aligns with statistical regularities in the training data, positioned in a region of the manifold where related concepts—"bridge," "San Francisco," "landmark," "red"—are geometrically nearby.
This is why visualization techniques like PCA, t-SNE, and UMAP reveal meaningful structure when applied to neural representations. They're not imposing structure. They're revealing the geometry that was already there.
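As a hedged illustration with synthetic "activations" standing in for a real model's, projecting a batch of activation vectors onto their top principal components is often enough to expose cluster structure that was already present in the high-dimensional space:

```python
# A minimal sketch with synthetic "activations": two concept clusters living in
# a 256-dimensional space. PCA does not impose this structure; it only finds
# the low-dimensional subspace where the variance (and the clusters) already live.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d, n_per_cluster = 256, 200

center_a = rng.normal(size=d)            # stand-in for one concept neighborhood
center_b = rng.normal(size=d)            # stand-in for another
acts = np.vstack([
    center_a + 0.3 * rng.normal(size=(n_per_cluster, d)),
    center_b + 0.3 * rng.normal(size=(n_per_cluster, d)),
])

coords = PCA(n_components=2).fit_transform(acts)   # project onto the top-2 PC plane
print(coords.shape)                                # (400, 2)
print("cluster A mean:", coords[:n_per_cluster].mean(axis=0).round(2))
print("cluster B mean:", coords[n_per_cluster:].mean(axis=0).round(2))
```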
Manifolds, Attractors, and Coherence
Here's where interpretability meets coherence theory.
In AToM (the framework underlying this site), coherence is defined as the degree to which a system's trajectories are smooth, predictable, and low-curvature over time. High-coherence systems occupy regions of state space where nearby trajectories converge. Low-coherence systems occupy regions where nearby trajectories diverge wildly—what we call "high curvature" zones.
Neural networks exhibit the same pattern.
When a network "groks" a concept—when it transitions from memorization to true understanding (as we explored in Grokking)—it's moving from a high-curvature region of the loss landscape to a low-curvature one. The representation becomes smoother. The features become more linear. The geometry becomes more stable.
This is what dynamical systems theorists call an attractor basin: a region of the manifold toward which nearby trajectories naturally converge.
And this is what coherence theory predicts: systems that persist, that generalize, that exhibit what we recognize as "understanding," are systems whose representations live in low-curvature regions where small perturbations don't destroy meaning.
High curvature = fragility. Low curvature = robustness.
In neural networks, just as in brains, in ecosystems, in societies.
The Empirical Evidence: Feature Geometry in the Wild
This isn't just theory. Mechanistic interpretability researchers are actively mapping the geometric structure of neural representations.
Anthropic's work on "Towards Monosemanticity" (Bricken et al., 2023) used sparse autoencoders to extract interpretable features from a small transformer language model, an approach the team has since scaled up to Claude, and then examined the features' geometric relationships. They found:
- Features cluster semantically. Concepts like "emotions" form geometric neighborhoods, and so do "programming languages," "historical periods," and "scientific fields." (A toy sketch of measuring this kind of clustering follows the list.)
- Superposition creates high-dimensional polytopes. Multiple features can activate simultaneously, creating complex geometric structures where meanings overlap in controlled ways.
- Abstract features emerge as geometric combinations. The feature for "metaphor" isn't learned independently—it emerges as a structured relationship between concrete concepts.
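Here is the promised sketch of how that clustering can be quantified, assuming you already have a matrix of decoder directions from a sparse autoencoder (replaced here by synthetic stand-ins): normalize the directions and look at pairwise cosine similarities.

```python
# A rough sketch with synthetic stand-ins for SAE decoder directions: two
# hypothetical "neighborhoods" of three features each. Normalizing the
# directions and taking pairwise cosine similarities makes the neighborhoods
# visible as blocks of high similarity.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
base_a, base_b = rng.normal(size=d_model), rng.normal(size=d_model)

W = np.vstack(
    [base_a + 0.5 * rng.normal(size=d_model) for _ in range(3)] +   # features 0-2
    [base_b + 0.5 * rng.normal(size=d_model) for _ in range(3)]     # features 3-5
)
W = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit decoder directions

sims = W @ W.T                                     # pairwise cosine similarities
np.fill_diagonal(sims, -1.0)                       # ignore self-similarity
print(np.round(sims, 2))
print("nearest neighbor of each feature:", sims.argmax(axis=1))  # 0-2 and 3-5 pair up
```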
The Neural Tangent Kernel results of Jacot, Gabriel, and Hongler (2018) showed that sufficiently wide neural networks train as if governed by a fixed tangent kernel: essentially, the local geometry of the function space they are exploring. That kernel shapes generalization, and in this geometric picture, networks that settle into smooth, low-curvature solutions generalize better than those that memorize brittle, high-curvature ones.
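The object in question is the (empirical) neural tangent kernel, which for a network fθ is just the inner product of parameter gradients at two inputs:

```latex
% Empirical neural tangent kernel (Jacot et al., 2018): the inner product of
% parameter gradients of the network output at two inputs x and x'.
\Theta(x, x') = \big\langle \nabla_\theta f_\theta(x),\, \nabla_\theta f_\theta(x') \big\rangle
```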
MIT's "Neural Manifold Analysis" (Chung & Abbott, 2021) applied topological data analysis to neural activity in both artificial and biological networks, revealing that task-relevant information is encoded in the topology of the neural manifold—its holes, its dimensionality, its curvature. They found that learning corresponds to smoothing the manifold, reducing its complexity while preserving task-relevant structure.
The pattern is consistent: meaningful representations live on smooth, low-curvature manifolds, and learning is the process of finding them.
Why This Matters Beyond AI
If neural networks organize meaning geometrically, and if that geometry follows predictable patterns from information theory, then we have a bridge between interpretability and general theories of coherence.
Consider:
Biological brains also encode information as manifold structure. Neuroscientists studying neural "coding manifolds" have found that different brain regions represent information using different geometric strategies—some favor high-dimensional separability, others favor low-dimensional smoothness. When you learn a new skill, your brain isn't just changing weights. It's restructuring its manifold to create low-curvature pathways for that skill.
Motor learning provides a striking example. When you first learn to play piano, each finger movement requires conscious attention—your motor cortex is navigating a high-curvature region where small errors produce wildly different outputs. After years of practice, the manifold has smoothed. The same motor sequence now lives in a low-curvature region where you can execute it automatically, even while holding a conversation. The neural geometry has been sculpted by training.
Semantic spaces in language exhibit geometric structure. Word embeddings like Word2Vec and GloVe explicitly treat words as points on a manifold, and the distances between them correspond to semantic similarity. The famous "king - man + woman = queen" example isn't magic—it's vector arithmetic on a curved manifold where analogical relationships are preserved by the geometry.
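Here is a minimal sketch of that arithmetic using off-the-shelf GloVe vectors via gensim's downloader (the specific embedding file is an arbitrary choice; any pretrained word vectors exposing a KeyedVectors interface would do):

```python
# A minimal sketch using pretrained GloVe vectors fetched by gensim's downloader
# (this downloads an embedding file on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")     # 50-dimensional GloVe vectors

# "king - man + woman" is just vector arithmetic on the embedding manifold;
# the nearest remaining word is usually "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```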
But it goes deeper. Large language models like GPT-4 and Claude learn representations that preserve not just word relationships but conceptual relationships across multiple levels of abstraction. "Democracy" and "voting" are close in the manifold. So are "democracy" and "ancient Athens." And "democracy" and "distributed consensus." The geometry encodes a web of associations that reflects the statistical structure of human knowledge.
Conceptual development in humans might follow the same pattern. When children learn abstract concepts, they're not just accumulating facts—they're reorganizing their representational manifold to create new dimensions, new clusters, new low-curvature regions where reasoning becomes fluid.
Consider learning algebra. Initially, symbolic manipulation feels arbitrary—high curvature, fragile understanding. Then something clicks. The symbols become objects you can move around, relationships you can see. The manifold has smoothed. What was once effortful becomes intuitive. This is what Piaget called "accommodation"—not just adding information, but restructuring the geometry of cognition itself.
This suggests a radical possibility: meaning itself is a geometric property.
What we call "understanding" is what it feels like to occupy a low-curvature region of a representational manifold. What we call "confusion" is what it feels like to be in a high-curvature region where small changes in input produce wildly different interpretations. What we call "insight" is the transition from one to the other—the moment when the manifold smooths out and relationships become clear.
Coherence as Curvature: The Connection to M=C/T
This maps directly onto the AToM equation: M = C/T.
Meaning (M) equals Coherence (C) over Time (T)—or over Tension, depending on the context.
In geometric terms:
- Coherence = low curvature. A system is coherent when its state-space trajectories are smooth, predictable, and stable under perturbation.
- Time/Tension = the interval over which coherence is evaluated. A system might be coherent locally (low curvature in a small region) but incoherent globally (high curvature across larger scales).
- Meaning = the degree to which patterns persist. High meaning corresponds to representations that remain stable, interpretable, and actionable across time and context.
For neural networks:
- Superposition is a high-tension, high-curvature state. Many features packed into few neurons, with fragile dependencies.
- Sparse, disentangled representations are low-curvature states. Features occupy independent dimensions, relationships are stable, perturbations are contained.
- Grokking is the transition from high-curvature (memorization) to low-curvature (generalization). The network finds a smoother manifold where the task becomes easy.
For biological brains:
- Confusion is high curvature. Small changes in input produce large, unpredictable changes in interpretation.
- Expertise is low curvature. The representational manifold has been smoothed to the point where the task is effortless.
- Learning is manifold sculpting—finding low-curvature paths through representational space.
This isn't analogy. It's structural convergence. Artificial systems, biological systems, and mathematical theories of information all point to the same conclusion: coherent systems live on smooth manifolds, and meaning is what happens when curvature is low enough for trajectories to persist.
What Interpretability Teaches Us About Coherence
Mechanistic interpretability gives us tools to measure coherence in neural systems with precision that's impossible in biology.
We can:
- Quantify feature geometry using distance metrics, curvature measures, and topological invariants. Tools from topological data analysis—persistent homology, Betti numbers, geodesic distances—let us characterize the shape of representational manifolds in ways that generalize across architectures and domains.
- Map coherence dynamics by tracking how manifolds change during training. We can watch in real time as high-curvature regions smooth out, as features disentangle, as the geometry reorganizes itself from chaotic to coherent.
- Test coherence predictions by deliberately perturbing networks and measuring whether high-curvature regions are more fragile (they are) and whether low-curvature regions generalize better (they do). Adversarial examples, distributional shifts, and ablation studies all confirm: geometric stability predicts functional robustness. (A toy version of this perturbation test is sketched after the list.)
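Here is the promised toy version of that perturbation test, using a small random tanh network rather than any particular trained model: estimate local sensitivity as the average ratio of output change to input change under small random perturbations. In this toy, larger weights make the function sharper, the analogue of a fragile, high-curvature region.

```python
# A hedged sketch with a toy two-layer tanh network (random weights, no training):
# local sensitivity is estimated as the mean of ||f(x + eps) - f(x)|| / ||eps||
# over small random perturbations. Scaling the weights up makes the function
# "sharper", i.e., more sensitive and more fragile.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 32, 64, 8

def make_net(scale):
    W1 = scale * rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
    W2 = scale * rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
    return lambda x: W2 @ np.tanh(W1 @ x)

def local_sensitivity(f, x, eps=1e-3, trials=100):
    ratios = []
    for _ in range(trials):
        delta = eps * rng.normal(size=x.shape)
        ratios.append(np.linalg.norm(f(x + delta) - f(x)) / np.linalg.norm(delta))
    return float(np.mean(ratios))

x = rng.normal(size=d_in)
print("smooth net sensitivity:", round(local_sensitivity(make_net(1.0), x), 2))
print("sharp net sensitivity :", round(local_sensitivity(make_net(5.0), x), 2))
```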
This gives us a proof of concept: coherence-as-geometry is not just a useful metaphor for thinking about neural networks. It's a measurable, predictive framework that explains when and why networks succeed or fail.
More importantly, it gives us a methodology that could transfer to other domains. If we can measure manifold curvature in artificial networks, perhaps we can estimate it in biological networks through neural recording. If we can engineer low-curvature solutions in AI, perhaps we can design interventions—educational, therapeutic, organizational—that help human systems find their own low-curvature paths.
And if it works for artificial minds, there's every reason to believe it applies to biological minds, to social systems, to any domain where patterns must persist through time under constraint.
The geometry of meaning is universal.
The Open Questions
We're still at the beginning of understanding the full implications of geometric interpretability. Key questions remain:
How does curvature scale across layers? Early layers in deep networks represent low-level features with high curvature (many possible edges, textures, colors). Late layers represent abstract concepts with (often) lower curvature. But the transition isn't always smooth. Understanding how curvature evolves through the network could reveal fundamental principles of hierarchical representation.
What determines which manifolds are learnable? Not all geometric structures are equally easy for neural networks to discover. Some tasks produce smooth, low-dimensional manifolds quickly. Others produce high-dimensional, tangled manifolds that require enormous data and compute to smooth out. What distinguishes them?
Can we engineer low-curvature representations directly? If we understand the geometry of coherence, can we design training procedures, architectures, or loss functions that explicitly encourage low-curvature solutions? This could lead to networks that generalize faster, interpret more easily, and align more reliably.
Do biological brains use the same geometric principles? The evidence suggests they do, but the details matter. How do spiking dynamics, neuromodulation, and embodied interaction shape neural manifolds in ways that artificial networks don't capture?
What does this mean for AI alignment? If coherence is geometric, then alignment might be a question of manifold shaping. Safe AI systems might be those that occupy low-curvature regions where perturbations (adversarial inputs, distributional shifts, value drift) produce small, predictable changes rather than catastrophic failures.
Geometry All the Way Down
What mechanistic interpretability reveals is that features are not the end of the story—geometry is.
The concepts neural networks learn, the circuits they build, the representations they encode—all of these emerge from the underlying geometric structure of the space they're navigating.
And that geometry follows principles we recognize from information theory, from dynamical systems, from coherence theory: systems that persist occupy smooth regions of their state space. Systems that generalize find low-curvature manifolds. Systems that exhibit what we call "understanding" are systems whose representations are geometrically stable.
This is the deeper unification interpretability offers. Not just a taxonomy of what features exist, but a theory of why those features take the form they do.
The geometry of artificial minds mirrors the geometry of meaning itself.
And that geometry, it turns out, is the geometry of coherence.
This is Part 6 of the Mechanistic Interpretability series, exploring how reverse-engineering AI reveals the geometry of meaning.
Previous: Sparse Autoencoders: Extracting the Dictionary of Neural Concepts
Next: What AI Interpretability Teaches Us About Biological Brains
Further Reading
- Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic.
- Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS.
- Chung, S., & Abbott, L. F. (2021). "Neural Population Geometry: An Approach for Understanding Biological and Artificial Neural Networks." Current Opinion in Neurobiology.
- Amari, S. (2016). Information Geometry and Its Applications. Springer.
- Saxe, A. M., et al. (2019). "A Mathematical Theory of Semantic Development in Deep Neural Networks." PNAS.
- Golub, M. D., & Sussillo, D. (2018). "FixedPointFinder: A TensorFlow toolbox for identifying and characterizing fixed points in recurrent neural networks." Journal of Open Source Software.