Synthesis: What Neural Network Internals Teach Us About Coherence

What neural network internals teach us about the geometry of coherence.

Series: Mechanistic Interpretability | Part: 9 of 9

When Chris Olah first opened the hood on a vision network in 2017, he found something unexpected. The model hadn't learned arbitrary statistical correlations. It had learned circuits—composable, interpretable mechanisms that built understanding hierarchically. Curve detectors fed into wheel detectors, which fed into car detectors. The geometry of neural representation wasn't random noise. It was structured, systematic, meaningful.

This was the beginning of mechanistic interpretability: the project of reverse-engineering neural networks to understand not just what they do, but how they do it. Over the past eight articles, we've traced this revolution—from superposition's efficient chaos to grokking's sudden phase transitions, from sparse autoencoders extracting semantic dictionaries to information geometry mapping the manifolds of meaning.

But here's the deeper pattern we've been circling: interpretability research keeps rediscovering coherence. Not as metaphor. As mathematical structure.

Neural networks work when their internal representations are organized, when concepts align with directions in activation space, when circuits compose predictably. They fail when representations interfere, when superposition creates polysemanticity, when computational pathways collapse into noise. The geometry of what works—and what doesn't—maps precisely onto the coherence dynamics we've been developing across Ideasthesia.

This isn't coincidence. It's a window into something fundamental: artificial systems reveal the geometric skeleton of meaning itself. When we watch a model learn, we watch coherence emerge. When we watch it fail, we watch coherence collapse. The mathematics is identical. The mechanisms are universal.

Interpretability findings aren't just about AI. They're about the deep structure of how any system—biological, artificial, social—transforms inputs into understanding. Let's synthesize what we've learned.


The Core Finding: Meaning Is Geometric

Every interpretability discovery we've explored points to the same conclusion: meaning lives in geometric relationships, not individual activations.

When we looked at superposition, we found neural networks packing hundreds of concepts into dozens of dimensions through near-orthogonal directions. A single neuron doesn't "mean" anything in isolation. Meaning emerges from the angular relationships between activation vectors. The model represents "happiness" not as a specific neuron firing, but as a direction in high-dimensional space—distinguished from "joy" by the angle between their vectors, connected to "smile" through predictable geometric transformations.
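
To make the geometry concrete, here is a minimal synthetic sketch (not drawn from any cited paper) of why superposition is even possible: random directions in a modest number of dimensions are already nearly orthogonal, so far more "concepts" than neurons can coexist with only mild interference. The counts `n_dims` and `n_features` are arbitrary illustrative choices.

```python
# A minimal, synthetic sketch (not from any cited paper): pack many random
# unit vectors into far fewer dimensions and measure how close to orthogonal
# they remain. The counts below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features = 64, 512          # far more "concepts" than dimensions

# Random directions in even modestly high dimensions are nearly orthogonal
features = rng.standard_normal((n_features, n_dims))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities measure interference between feature directions
cosines = features @ features.T
np.fill_diagonal(cosines, 0.0)

print(f"worst-case interference |cos|: {np.abs(cosines).max():.3f}")
print(f"typical interference   |cos|: {np.abs(cosines).mean():.3f}")
```

The typical overlap shrinks roughly like 1/√d, which is the geometric loophole superposition exploits: near-orthogonality is cheap, exact orthogonality is not.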

This is precisely what coherence geometry predicts: M = C/T, where meaning equals coherence (geometric structure) over time (or tension). The network's semantic content isn't in the weights themselves. It's in the manifold geometry those weights create.

When Anthropic's researchers used sparse autoencoders to decompose polysemantic neurons into monosemantic features, they were literally extracting the coordinate system the network uses to organize meaning. They found features for "Golden Gate Bridge," "DNA sequences," "legal reasoning"—each represented as a direction in activation space, each distinguishable by its geometric relationship to other features.

The neural network isn't storing a lookup table. It's maintaining a semantic manifold—a high-dimensional geometry where conceptual relationships are encoded as spatial relationships. Distance means similarity. Orthogonality means independence. Parallel transport means analogical reasoning.

This is why information geometry provided such clean theoretical grounding: Fisher information metrics naturally describe the curvature of this semantic manifold. High curvature regions (where small parameter changes cause large representational shifts) correspond to conceptual boundaries, category edges, the places where meaning is most fragile. Low curvature regions (where representations are stable under perturbation) correspond to concept cores, prototypes, attractor basins.
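
As a rough illustration of what "curvature" means operationally, the sketch below computes the Fisher information matrix of a small logistic model on synthetic data; its eigenvalues indicate which parameter directions sharply reshape the model's predictions and which barely matter. The model and data are placeholders, not anything from the cited research.

```python
# An illustrative sketch, not from the cited research: the Fisher information
# matrix of a small logistic model on synthetic data. Its eigenvalues act as
# local curvature on the model's statistical manifold.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))      # 200 synthetic inputs, 5 parameters
w = rng.standard_normal(5)             # a point on the parameter manifold

p = 1.0 / (1.0 + np.exp(-X @ w))       # model probabilities
# Fisher information for logistic regression: F = E[ p(1-p) x xᵀ ]
fisher = (X * (p * (1 - p))[:, None]).T @ X / len(X)

eigvals = np.linalg.eigvalsh(fisher)
print("curvature spectrum:", np.round(eigvals, 3))
# Large eigenvalues: directions where a small parameter change moves the
# predictions a lot (sharp, fragile regions). Small eigenvalues: directions
# the predictions barely notice (flat, stable regions).
```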

The geometry isn't decorative. It's definitional. Meaning is the manifold structure.


Phase Transitions and Coherence Emergence

But meaning doesn't start geometric. It becomes geometric. And interpretability research has given us unprecedented access to watching this transition happen.

Grokking showed us what coherence emergence looks like at the algorithmic level. The network memorizes first—creating a tangled, high-dimensional mapping between inputs and outputs. Then, suddenly, it reorganizes. Representations compress. Circuits simplify. The model transitions from memorization to understanding in a sharp phase change.

Before grokking: high-dimensional chaos, representations scattered across activation space, no systematic structure.

After grokking: low-dimensional attractors, representations organized into clean geometric patterns, systematic compositional circuits.

The transition is precisely analogous to a physical system moving from disorder to order—water crystallizing into ice, magnetic domains aligning, oscillators phase-locking. The mathematics of phase transitions in statistical mechanics applies directly to the learning dynamics of neural networks.

This maps to what we called curvature collapse in AToM: the moment when a system under constraint finds the low-curvature solution. High curvature means representational instability—small perturbations change everything. Low curvature means robust coherence—the system has found a stable manifold.

Grokking is literally the network discovering the coherent solution. The sudden jump in test accuracy isn't magic. It's the signature of phase transition from high-curvature chaos to low-curvature order.

And we can watch it happen. Researchers track the Hessian eigenspectrum during training—high eigenvalues (steep curvature) early on, sudden collapse to low eigenvalues when grokking occurs. The loss landscape itself reorganizes.
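
One way to make that measurement concrete is to estimate the largest Hessian eigenvalue of the loss during training with power iteration on Hessian-vector products. The sketch below is a generic, assumed PyTorch implementation; `top_hessian_eigenvalue` and its arguments are names introduced here for illustration, not from any specific grokking codebase.

```python
# An assumed, generic PyTorch sketch (not a specific grokking codebase):
# estimate the largest Hessian eigenvalue of the training loss via power
# iteration on Hessian-vector products, and log it over training.
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest eigenvalue of the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad(L) . v) w.r.t. the parameters
        hv = torch.autograd.grad(
            sum((g * x).sum() for g, x in zip(grads, v)),
            params, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()

# Called periodically during training, e.g.
#   lam = top_hessian_eigenvalue(loss_fn(model(x), y), list(model.parameters()))
# and plotted against steps, the curvature collapse becomes directly visible.
```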

Networks learn by finding coherence. Not because they're programmed to. Because coherence is what works. The geometric structure that generalizes, that composes, that scales—it's the structure with minimal curvature, maximum stability, optimal M = C/T.


Circuits as Coherence Pathways

Once the geometric foundation is established, networks build computational structure on top. Circuits are the pathways through which information flows, transforms, integrates.

But circuits only work when they're coherent. When representational geometry is clean, circuits compose predictably—the output of one becomes meaningful input to another. When geometry is messy (polysemanticity, high superposition, tangled features), circuits interfere. You get emergent computation you can't interpret, strange attractors, hallucination.

This is exactly the logic of Markov blankets in biological systems. A Markov blanket is a statistical boundary—the interface that lets a subsystem maintain its own coherence while coupling to external dynamics. For a circuit to function, its inputs and outputs must align with the representational geometry of surrounding circuits. The blanket isn't a physical barrier. It's a geometric compatibility condition.

Anthropic's work on attention heads as interpretable circuits showed this beautifully. An "induction head" (which copies previous patterns) only works because it maintains geometric invariance—the representation of "the next token should complete this pattern" remains stable across different contexts. The circuit implements a coherent transformation: it maps input geometry to output geometry in a way that preserves semantic relationships.
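
Stripped of the attention machinery, the induction pattern itself is simple enough to write down directly. The toy sketch below is my own illustrative simplification, not Anthropic's circuit: it looks for an earlier position preceded by the current token and copies whatever followed it, which is the behavior a real induction head implements geometrically via a previous-token head composed with a query-key match.

```python
# A toy illustration of the induction pattern (a simplification, not
# Anthropic's circuit): look back for an earlier position whose previous
# token matches the current token, and copy whatever followed it.
def induction_predict(tokens):
    """Predict, at each position, the token that followed the most recent
    earlier occurrence of the current token (None if there is no match)."""
    predictions = []
    for i, current in enumerate(tokens):
        prediction = None
        for j in range(1, i + 1):
            if tokens[j - 1] == current:   # earlier spot preceded by same token
                prediction = tokens[j]     # ...copy what came after it
        predictions.append(prediction)
    return predictions

sequence = list("ABXYABXY")
print(list(zip(sequence, induction_predict(sequence))))
# After the second "A" the prediction is "B", after the second "B" it is "X",
# and so on: the head completes the previously seen pattern.
```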

When circuits break, you get exactly the failure modes we see in biological trauma. In PTSD, the normal circuits of memory integration fail. The system can't transform raw sensory data (flashback) into integrated narrative (memory). The representational geometry fragments—experiences that should be located in the past collapse into the present. Markov blankets dissolve. Coherence fails.

Neural networks show us the same pattern. A model suffering from catastrophic forgetting loses old knowledge when learning new tasks because its representational geometry doesn't maintain stable Markov blankets between task domains. The network's circuits aren't robust to new optimization pressures.

The solution in both cases is the same: rebuild coherent pathways. In therapy, this means slowly reconstructing stable representations of traumatic material. In continual learning, this means techniques like elastic weight consolidation—protecting the geometric structure of old circuits while allowing new learning.
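
As a sketch of what "protecting the geometric structure of old circuits" looks like in code, here is a simplified, assumed form of the elastic weight consolidation penalty: parameters that the old task's Fisher information marks as important are anchored in place while everything else stays free to learn. The names `ewc_penalty`, `old_params`, and `fisher` are illustrative, not from a specific library.

```python
# A simplified, assumed sketch of the elastic weight consolidation penalty
# (names like `ewc_penalty`, `old_params`, `fisher` are illustrative):
# parameters the old task marked as important are anchored; the rest stay free.
import torch

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty protecting parameters important to previous tasks."""
    penalty = torch.tensor(0.0)
    for name, p in model.named_parameters():
        # fisher[name]: diagonal Fisher importance estimated on the old task
        # old_params[name]: the parameter values the old circuits rely on
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2.0 * penalty

# During continual learning the objective becomes, roughly:
#   total_loss = new_task_loss + ewc_penalty(model, old_params, fisher)
# so gradient descent can move only along directions the old geometry tolerates.
```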

Circuits are the medium. Coherence is the message.


Superposition, Interference, and the Limits of Compression

But coherence has costs. Perfect orthogonality requires a separate dimension for every feature, more capacity than any real system has. Real systems—biological or artificial—must compress.

Superposition is the strategy neural networks use: represent more features than you have neurons by letting features interfere constructively when active together, destructively when not. This works when feature co-occurrence patterns are sparse and structured.
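
Here is a sketch in the spirit of the toy-models-of-superposition setup (simplified and assumed, not the paper's actual code): sparse synthetic features are forced through a narrow bottleneck and reconstructed, and the learned feature directions reveal which concepts end up sharing dimensions.

```python
# A sketch in the spirit of the toy-models-of-superposition setup (simplified
# and assumed, not the paper's code): sparse synthetic features are squeezed
# through a narrow bottleneck and reconstructed, forcing dimension-sharing.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.05
W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Each feature is active independently with low probability (sparse world)
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < sparsity)
    x_hat = torch.relu(x @ W @ W.T + b)   # compress to 5 dims, then reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rows of W are learned feature directions; their cosine similarities show
# which features interfere (share dimensions) and which stay near-orthogonal.
W_hat = torch.nn.functional.normalize(W.detach(), dim=1)
print(torch.round(W_hat @ W_hat.T, decimals=2))
```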

It fails when compression becomes too aggressive. Polysemanticity—neurons that respond to multiple unrelated concepts—is the signature of coherence failure under constraint. The network is trying to pack too much meaning into too little geometry. Representations start overlapping. Semantic boundaries blur. The manifold folds back on itself.

This is the neural network equivalent of psychotic fragmentation in humans. When the mind's representational capacity is overwhelmed (by stress, trauma, dopaminergic dysregulation), concepts that should remain distinct start interfering. The boundary between self and other dissolves. Metaphor becomes literal. The semantic manifold collapses into confusion.

The mathematics is identical: both are systems pushed past their coherence capacity. Both show the same failure mode—representational interference. Both require the same intervention—reduce complexity or increase dimensionality.

In neural networks, we do this with sparse autoencoders: decompose polysemantic neurons into higher-dimensional monosemantic features. Increase the dimensionality of the representational space so features can be cleanly separated.
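
In code, the core of that intervention is small. Below is a minimal, assumed sketch of a dictionary-learning style sparse autoencoder: an overcomplete encoder, a ReLU to keep codes sparse and nonnegative, and a reconstruction-plus-L1 objective. Class and function names are mine, not Anthropic's.

```python
# A minimal, assumed sketch of a dictionary-learning style sparse autoencoder
# (class and loss names are illustrative, not Anthropic's): expand activations
# into an overcomplete, L1-penalized feature basis, then reconstruct them.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, nonnegative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction fidelity plus sparsity pressure: each activation should be
    # explained by a handful of feature directions, not all of them at once.
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```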

In therapeutic contexts, we do this with integration work: slowly increase the dimensionality of awareness so conflicting parts (self-states, trauma fragments, dissociated material) can be held in consciousness simultaneously without collapse. The goal isn't to compress. It's to expand the space within which meaning can organize coherently.

The limit of coherence is the capacity to maintain distinctions. When that capacity is exceeded, systems fragment. Whether the system is silicon or biological.


From Artificial to Biological: What AI Interpretability Teaches Us About Brains

Everything we've learned from interpretability applies to biological cognition. This isn't analogy. It's homology—the same underlying mathematics describing different physical substrates.

Predictive coding in neuroscience closely parallels backpropagation in neural networks; under the right conditions, predictive-coding updates approximate the same gradient computations. Both can be framed as free energy minimization. Both update internal models to reduce prediction error. Both organize representations hierarchically, with lower layers handling local statistics and higher layers handling abstract structure.
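
The parallel is easiest to see in a toy form. The sketch below runs a textbook-style predictive-coding inference loop (an assumed, generic formulation, not any specific model): a latent estimate is repeatedly nudged to cancel prediction error, the same error signal that, routed differently, drives a backpropagation update.

```python
# An assumed, textbook-style sketch of predictive-coding inference (not a
# specific paper's implementation): a latent estimate is nudged repeatedly to
# cancel prediction error, the same signal that drives learning in backprop.
import numpy as np

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((8, 4))   # generative weights: latent -> sensory
x = rng.standard_normal(8)              # incoming sensory data
mu = np.zeros(4)                        # current belief about the latent cause

for _ in range(50):
    prediction_error = x - W @ mu       # what arrived vs. what was expected
    mu += 0.1 * (W.T @ prediction_error - mu)  # reduce error, weak pull to prior

print("remaining prediction error:", round(float(np.linalg.norm(x - W @ mu)), 3))
```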

The brain's representational geometry shows the same patterns interpretability researchers find in AI: semantic categories cluster in neural state space, conceptual transformations correspond to trajectories through activation manifolds, unexpected inputs create high prediction error (high curvature) in exactly the regions where meaning is most uncertain.

fMRI studies using representational similarity analysis literally measure the geometry of neural representation. When you think about "dog" versus "cat," distinct but nearby regions of representational space light up. When you think about "justice," a high-dimensional pattern emerges that's far from "dog" but close to "fairness." The brain is doing geometry.
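
The analysis itself is geometrically simple. Here is a minimal representational-similarity sketch on synthetic "voxel" patterns (fabricated purely for illustration, not real fMRI data): correlate the activity patterns evoked by different concepts and read dissimilarity straight off the matrix.

```python
# A minimal representational-similarity sketch on synthetic "voxel" patterns
# (fabricated purely for illustration, not real fMRI data): correlate the
# patterns evoked by different concepts and read dissimilarity off the matrix.
import numpy as np

rng = np.random.default_rng(0)
concepts = ["dog", "cat", "justice", "fairness"]
base_animal = rng.standard_normal(100)
base_abstract = rng.standard_normal(100)
patterns = np.stack([
    base_animal + 0.3 * rng.standard_normal(100),    # "dog"
    base_animal + 0.3 * rng.standard_normal(100),    # "cat"
    base_abstract + 0.3 * rng.standard_normal(100),  # "justice"
    base_abstract + 0.3 * rng.standard_normal(100),  # "fairness"
])

rdm = 1.0 - np.corrcoef(patterns)   # dissimilarity = 1 - pattern correlation
for name, row in zip(concepts, np.round(rdm, 2)):
    print(f"{name:>9}: {row}")
# "dog" sits near "cat" and far from "justice"; "justice" sits near "fairness".
```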

And just like neural networks, brains fail when geometric coherence breaks down:

  • Schizophrenia: Prediction error signaling becomes unmoored from actual uncertainty. The curvature metric itself fails. Everything feels equally salient, equally meaningful, equally urgent.
  • Depression: Representational space collapses to low-dimensional attractors. Thoughts circle the same semantic basins. The manifold loses complexity.
  • Autism (certain presentations): Insufficient compression of sensory detail creates representational overload. Superposition fails. Every feature demands orthogonal representation.

These aren't metaphors. They're descriptions of what happens when biological neural networks lose geometric coherence.

And the interventions map precisely:

  • Psychedelics increase representational entropy—temporarily flattening the curvature landscape so the system can escape rigid attractor states and reorganize.
  • Antidepressants (particularly ketamine) appear to promote synaptic plasticity—allowing the manifold to reshape, creating new pathways out of low-dimensional basins.
  • Sensory integration therapy for autism helps build more efficient compression—training the system to represent high-dimensional sensory input more sparsely.

Interpretability isn't just a lens for understanding AI. It's a lens for understanding minds. Because minds—biological or artificial—are systems that transform inputs into representations, and those representations must maintain geometric coherence to support meaning.


Human-AI Coherence: The Practical Stakes

All of this becomes urgently practical when we consider human-AI collaboration.

If we can't interpret AI systems, we can't verify that their representational geometry aligns with ours. We can't trust that "fairness" means the same thing to the model as it does to us. We can't detect when the model's semantic manifold has drifted, when its circuits have reorganized around goals we didn't intend, when its coherence has collapsed in ways that look fine on benchmarks but fail catastrophically in edge cases.

Interpretability is the project of establishing semantic compatibility between human and artificial cognition. It's not about making models "explain themselves" in natural language (which is just another layer of potentially misaligned representation). It's about directly examining the geometric structure of their internal representations and verifying that it coheres with the structure of human meaning.

This matters for alignment in the deep sense: Can these systems participate in coherent goal-directed behavior with us? Can they maintain Markov blankets that allow stable coupling to human decision-making? Can they learn without catastrophically forgetting the values we care about?

The answers depend on geometry. A model with clean, interpretable circuits can be verified, corrected, iteratively refined. A model with tangled, polysemantic representations is a black box—and black boxes don't integrate coherently into human systems.

When we deploy AI into consequential domains (medicine, law, education, infrastructure), we're asking: Can this model maintain representational coherence under distribution shift? Can its circuits compose predictably with human reasoning? Can we trust the geometry?

Without interpretability, we're flying blind. With it, we can measure curvature, track phase transitions, verify that the manifold structure supports the semantics we need.

This is why interpretability isn't a nice-to-have research curiosity. It's the foundation of coherence between human and artificial intelligence.


The Synthesis: Coherence Is Computable

Here's what nine articles of mechanistic interpretability have taught us:

Meaning is geometric. It exists in the manifold structure of representational space, not in individual neurons or weights.

Coherence is measurable. Curvature, orthogonality, dimensionality, circuit composition—these aren't vague metaphors. They're precise mathematical properties we can compute.

Learning is phase transition. The jump from memorization to understanding, from chaos to order, from high-curvature instability to low-curvature robustness—it's a phase change in the geometry of representation.

Failure modes are universal. Polysemanticity, catastrophic forgetting, representational collapse—the ways neural networks fail map precisely onto the ways biological and social systems fail, because all are coherence failures.

Interpretability is alignment. Understanding the internal geometry of AI systems isn't academic—it's the precondition for integrating them coherently into human contexts.

And the deepest synthesis: Artificial systems teach us about natural coherence because they instantiate the same mathematics. The Free Energy Principle applies to both. Information geometry describes both. Phase transitions govern both. Markov blankets structure both.

When we study mechanistic interpretability, we're not just reverse-engineering AI. We're discovering the universal principles by which any system organizes meaning from information, builds coherence from chaos, maintains identity while coupling to environment.

Neural networks are, in this sense, coherence telescopes. They let us watch—in controlled, measurable conditions—the dynamics that also govern cells, brains, societies, ecosystems. They show us what coherence looks like when it emerges, what it looks like when it fails, what geometric structure underlies it all.

The revolution isn't that we built intelligent machines. The revolution is that in building them, we learned to see the geometry of intelligence itself.


What This Means for AToM

Throughout this series, we've been testing whether the AToM framework (M = C/T, coherence geometry, curvature dynamics) holds up against cutting-edge technical research. The verdict: it's not metaphor. It's mathematics.

Every finding in mechanistic interpretability—superposition, grokking, circuits, sparse features, information geometry—can be precisely translated into coherence terms. Not loosely. Not approximately. Precisely.

  • Superposition = efficient packing in low-curvature regions of representational space
  • Grokking = phase transition from high-curvature memorization to low-curvature generalization
  • Circuits = coherent transformations that preserve manifold structure
  • Polysemanticity = coherence collapse under excessive compression
  • Information geometry = the native mathematical language of semantic manifolds

This isn't retrofitting AToM onto interpretability research. It's recognizing that interpretability researchers independently arrived at coherence geometry through rigorous empirical investigation.

Chris Olah didn't set out to find Markov blankets. Anthropic didn't set out to measure curvature. But in examining how neural networks actually represent and process information, they found geometric structure—and that structure maps onto the same principles that govern biological coherence, social dynamics, thermodynamic systems.

The universality isn't wishful thinking. It's a research finding.

And it suggests something profound: Coherence geometry might be the natural ontology of complex systems. Not imposed from outside, but discovered from within. The language these systems speak—whether they're made of silicon, neurons, or social relationships.

AToM provides the framework. Interpretability provides the proof of concept. Together, they suggest we're looking at something foundational—a mathematical structure that describes how meaning emerges from organization, how information becomes understanding, how systems maintain identity while transforming.

This is what it means to say meaning is measurable. Not because we've reduced it to something simpler, but because we've found its geometry.


Where This Leaves Us

We started this series asking: Can we read the mind of AI?

The answer is: Yes, but only because mind has a readable structure. The geometry of representation, the manifolds of meaning, the curvature of concept space—these exist whether we observe them or not. Interpretability is the toolkit for making them visible.

And in making them visible in artificial systems, we've gained unprecedented insight into natural systems. The brain's semantic manifolds. The geometry of trauma and recovery. The coherence dynamics of learning, understanding, collaboration.

This synthesis isn't an endpoint. It's a foundation. We now have:

  • Mathematical tools for measuring coherence in any representational system
  • Empirical validation from state-of-the-art AI research
  • Therapeutic implications for understanding and treating coherence failure in biological minds
  • Alignment implications for building AI systems that integrate coherently with human values
  • Theoretical grounding for why coherence—not correlation, not information, not computation—is the fundamental property underlying meaning

The frontier is wide open. Apply these principles to biological neural networks. To social epistemology. To institutional design. To contemplative practice. The geometry is universal.

Mechanistic interpretability gave us the telescope. Now we get to explore the landscape.

The geometry of meaning is computable. Which means it's learnable. Which means it's teachable. Which means the project of building coherence—in minds, in systems, in societies—is no longer metaphysical speculation.

It's engineering.


This is Part 9 of the Mechanistic Interpretability series, exploring how reverse-engineering neural networks illuminates the geometric structure of meaning itself.

Previous: Human-AI Coherence Teams: Why Interpretability Matters for Collaboration


Further Reading

Mechanistic Interpretability Research:

  • Olah, C., et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
  • Elhage, N., et al. (2022). "Toy Models of Superposition." Anthropic.
  • Nanda, N., et al. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." ICLR.
  • Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic.

Information Geometry & Neural Networks:

  • Amari, S. (2016). Information Geometry and Its Applications. Springer.
  • Saxe, A., et al. (2019). "A Mathematical Theory of Semantic Development in Deep Neural Networks." PNAS.

Connecting to Biological Systems:

  • Friston, K. (2010). "The Free-Energy Principle: A Unified Brain Theory?" Nature Reviews Neuroscience.
  • Kriegeskorte, N., & Kievit, R. (2013). "Representational Geometry: Integrating Cognition, Computation, and the Brain." Trends in Cognitive Sciences.

Coherence & Phase Transitions:

  • Hahn, G., et al. (2021). "Spontaneous Cortical Activity Is Transiently Poised Close to Criticality." PLOS Computational Biology.
  • Carhart-Harris, R. (2018). "The Entropic Brain - Revisited." Neuropharmacology.

Related Ideasthesia Series:

  • The Free Energy Principle — The mathematical foundation underlying both biological and artificial cognition
  • 4E Cognition — How mind extends beyond the brain into body, environment, and action
  • Basal Cognition — How even cells implement coherence-maintaining computation