What AI Interpretability Teaches Us About Biological Brains

What silicon minds teach us about biological brains.

Series: Mechanistic Interpretability | Part: 7 of 9

For most of neuroscience's history, we've been stuck with a fundamental limitation: we can't see what we're studying. We can record electrical activity. We can ablate regions and watch behavior change. We can image blood flow and call it "activation." But we can't directly observe the computational mechanisms that produce thought.

Then we built neural networks—and suddenly we could.

Mechanistic interpretability doesn't just help us understand AI. It offers something neuroscience has desperately needed: a model system where we can actually see the computations happening. Where we can trace every activation, manipulate every weight, and test hypotheses about neural computation with unprecedented precision.

The insights flow both ways. But increasingly, they're flowing from silicon to wetware.


The Model System Neuroscience Always Needed

In biology, model systems are essential. We use Drosophila to understand genetics because fruit flies breed fast and have simple, accessible genomes. We use C. elegans to map neural circuits because it has exactly 302 neurons and we can watch all of them.

We use these systems not because they're identical to humans, but because they're simple enough to fully understand while remaining complex enough to be relevant.

Neural networks are becoming the model system for computational neuroscience.

Consider what mechanistic interpretability researchers can do that neuroscientists can't:

  • Complete circuit mapping: Every connection, every weight, every computation laid bare
  • Precise interventions: Activate specific features, ablate specific circuits, manipulate individual neurons
  • Repeated experiments: Train hundreds of networks with controlled variations, free of biology's one-brain-per-subject constraint
  • Ground truth access: We know what the network was trained to do, unlike biological systems evolved for unknown fitness landscapes
  • Scale manipulation: Study systems from tiny toy models to frontier-scale transformers

This isn't just methodologically convenient. It enables an entirely different epistemology.

When Anthropic researchers discovered superposition—the phenomenon where neural networks represent more features than they have neurons—they didn't infer it from indirect measurements. They demonstrated it mathematically, then showed the circuits implementing it, then confirmed it with causal interventions.

That's the kind of understanding neuroscience dreams of.


What Silicon Reveals About Wetware

The discoveries from AI interpretability are starting to reshape how we think about biological brains. Not because neural networks are brains—they're not—but because they solve similar computational problems under similar constraints.

Superposition Is Everywhere

Biological neurons were never going to have one-to-one mappings with concepts. The human brain has roughly 86 billion neurons but represents an effectively unbounded number of concepts, memories, and patterns.

The discovery of superposition in neural networks—that networks can represent many more features than they have dimensions by using high-dimensional geometry and sparse activation patterns—suggests a solution to this apparent impossibility.

Your brain doesn't need a "grandmother neuron" that fires only for your grandmother. It can represent her, and millions of other concepts, in a high-dimensional space where interference is minimized through geometry.

This isn't metaphor. The mathematics of superposition (as formalized by researchers like Chris Olah and the Anthropic interpretability team) applies to any system that needs to compress high-dimensional information into lower-dimensional representations.
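A toy numpy sketch makes the geometry concrete. The sizes, the sparsity level, and the use of random (rather than learned) feature directions below are illustrative assumptions, not anyone's published setup, but they show how many more features than dimensions can coexist with little interference:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_model = 10_000, 512   # far more features than dimensions
k_active = 5                        # sparse activation: few features at once

# Give each feature a random direction in the d-dimensional space.
# Random high-dimensional vectors are nearly orthogonal, so features
# interfere with one another only slightly.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Activate a small random subset of features and superpose their vectors.
active = rng.choice(n_features, size=k_active, replace=False)
x = W[active].sum(axis=0)

# Read each feature back by projecting onto its direction.
readout = W @ x
recovered = np.argsort(readout)[-k_active:]

print(sorted(active.tolist()), sorted(recovered.tolist()))  # typically the same set
print("largest off-feature interference:",
      np.abs(np.delete(readout, active)).max())             # well below the ~1.0 signal
```

Learned feature directions can pack even more efficiently than random ones; random vectors are simply the quickest way to see the effect.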

When neuroscientists find neurons that respond to seemingly unrelated stimuli—the famous neuron in one patient's brain that fired for both Jennifer Aniston and Lisa Kudrow—they're likely seeing biological superposition. Not confusion, but efficient packing.

Circuits, Not Maps

The traditional approach to neuroscience has been localizationist: try to map which brain region does what. The hippocampus does memory. The amygdala does fear. Broca's area does language production.

Mechanistic interpretability research shows why this approach is fundamentally limited.

When researchers trace circuits in neural networks, they don't find that "layer 4 does edge detection" or "attention head 3 does logic." They find circuits—specific computational paths involving multiple components working together to implement algorithms.

The same neurons participate in multiple circuits. The same attention heads compute different things depending on context. Function emerges from dynamic interaction patterns, not static location.

This aligns with increasing evidence from biological neuroscience. The same neurons in the hippocampus participate in spatial navigation, episodic memory, and planning. The amygdala's role in fear is inseparable from its role in salience detection and memory consolidation.

Brains, like neural networks, are massively multiplexed. Components have functions, but those functions are context-dependent and circuit-mediated, not localized and modular.

Emergence of Abstraction

One of the most striking findings from interpretability research is the spontaneous emergence of abstraction hierarchies.

Train a vision model on images and it develops edge detectors in early layers, texture patterns in middle layers, and object representations in late layers. Train a language model on text and it develops representations of syntax, semantic relationships, and partial world models—without being explicitly taught any of these representations.

This wasn't programmed. It emerged from the optimization pressure to minimize prediction error on the training distribution.

The parallel to biological development is profound. Human brains aren't born with concepts like "justice" or "recursion" or "electron" hardcoded. These emerge through learning, shaped by the same fundamental pressure: minimize surprise, maximize prediction accuracy.

The free energy principle—the theory that all biological systems minimize prediction error—suggests that biological and artificial neural networks converge on similar solutions because they're solving the same problem under similar constraints.

When we see grokking in neural networks—sudden transitions from memorization to generalization—it suggests that biological learning might involve similar phase transitions. The "aha moment" might not be metaphorical but reflect actual geometric reorganization of neural representations.

Distributed, Not Localized

Classic neuroscience assumes that damage to a specific region should impair a specific function. But brains show remarkable degeneracy—multiple neural pathways can implement the same function.

Neural networks show the same property. Ablate specific neurons or attention heads and the network often compensates. Function is distributed across the system, not localized to individual components.

This happens because optimization under gradient descent doesn't create single-point-of-failure architectures. It creates redundant, overlapping circuits that can maintain function under perturbation.
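The intervention itself is simple to express. Below is a minimal numpy sketch of the pattern: run a small two-layer network with and without one hidden unit zeroed and compare the outputs. The weights are random stand-ins purely for illustration; in a real experiment they would come from a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a small two-layer network (random here, purely to
# illustrate the intervention; in practice they come from training).
W1 = rng.normal(scale=0.1, size=(64, 32))    # input -> hidden
W2 = rng.normal(scale=0.1, size=(32, 10))    # hidden -> output

def forward(x, ablate_unit=None):
    """Forward pass, optionally zeroing out one hidden unit (an ablation)."""
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    if ablate_unit is not None:
        h[:, ablate_unit] = 0.0              # knock out a single unit
    return h @ W2

x = rng.normal(size=(100, 64))               # a batch of inputs
baseline = forward(x)

# How much does each single-unit ablation change the output?
effects = [np.abs(forward(x, ablate_unit=u) - baseline).mean()
           for u in range(32)]
print("mean output change per ablated unit:", np.round(effects, 3))
```

In a trained, over-parameterized network, most of these single-unit effects tend to be small, which is the redundancy described above.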

Evolution, operating under similar optimization pressures over much longer timescales, would be expected to produce similarly redundant architectures. And it does.

Stroke patients often recover functions that should be impossible if the damaged region were the sole implementer. Development shows equipotentiality in early neural tissue—different regions can take on the functions of damaged neighbors.

Neural networks suggest this isn't mysterious plasticity. It's what happens when function emerges from distributed circuits optimized for robustness.


The Reverse Flow: Neuroscience Informing AI

The influence isn't unidirectional. Neuroscience has shaped AI development from the beginning.

Attention mechanisms—the foundation of transformers—were inspired by theories of selective attention in biological vision. The insight that brains don't process all inputs equally but dynamically allocate processing resources led to the architectural innovation that powers GPT and Claude.

Recurrent architectures were explicitly modeled on biological neural circuits with feedback loops. Even though transformers have largely replaced RNNs for many tasks, the insight that computation requires temporal dynamics came from neuroscience.

Reinforcement learning is grounded in theories of dopamine signaling and reward prediction errors from neuroscience. The temporal difference learning algorithm that powers much of modern RL was directly inspired by observed neural mechanisms.
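As a concrete illustration, here is a minimal tabular TD(0) sketch on a toy chain environment. The environment and constants are hypothetical placeholders; the point is the `delta` term, the reward prediction error that phasic dopamine responses are often compared to:

```python
# Tabular TD(0) value learning on a toy chain of states.
n_states, alpha, gamma = 5, 0.1, 0.9
V = [0.0] * n_states                           # value estimate for each state

def step(s):
    """Toy environment: always move right; reward only at the final state."""
    s_next = min(s + 1, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for _ in range(200):                           # repeated episodes
    s = 0
    while s != n_states - 1:
        s_next, r = step(s)
        delta = r + gamma * V[s_next] - V[s]   # reward prediction error
        V[s] += alpha * delta                  # TD(0) update
        s = s_next

print([round(v, 2) for v in V])                # values climb toward the rewarded state
```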

Predictive processing—the theory that brains are prediction machines minimizing prediction error—has become a core framework for understanding both biological and artificial intelligence. Karl Friston's free energy principle provides a mathematical formalization that applies equally to neurons and neural networks.
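In one common formulation (paraphrased here, not quoted from Friston's papers), the quantity being minimized is the variational free energy, written for observations o, hidden states s, a generative model p, and an approximate posterior q:

```latex
F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o)
```

Minimizing F with respect to q both sharpens the system's internal model and tightens a bound on surprise, the negative log probability of what was observed, which is the "prediction error" reading used above.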

The relationship is bidirectional and accelerating.


Where the Analogy Breaks

But neural networks are not brains. The differences matter.

Energy Constraints

Biological neurons operate under brutal energy constraints. The human brain uses about 20 watts—roughly the same as a dim lightbulb. Frontier AI models consume megawatts during training.

This forces biological systems toward extreme efficiency. Sparsity isn't optional—it's mandatory. Biological neurons fire rarely. Synaptic transmission is metabolically expensive. The brain can't afford dense activation patterns.

Neural networks, running on nearly unlimited computational budgets during training, don't face the same pressure. They can use dense representations and waste energy on computations that biological systems could never afford.

This means biological brains may have evolved solutions to computational problems that we haven't yet discovered in AI—solutions forced by a resource scarcity that creates a very different optimization landscape.

Temporal Dynamics

Biological neurons are fundamentally temporal. They integrate signals over time, exhibit complex firing patterns, and participate in oscillatory dynamics at multiple frequencies.

Most neural networks (especially transformers) process inputs as static patterns. They lack the rich temporal dynamics of biological circuits.

This difference might be fundamental. Consciousness, as Anil Seth and others argue, might require temporal binding—the integration of information across time at multiple scales. Static feedforward processing, no matter how deep, might be insufficient.

Recent architectures incorporating recurrence and temporal processing (like state space models) might be converging on solutions that biology discovered long ago.

Embodiment

Biological brains evolved in bodies, embedded in environments, optimized for action and survival. Perception and cognition are inseparable from motor control and environmental interaction.

Current AI systems, trained on static datasets, lack embodiment. They don't have bodies, don't interact with persistent environments, don't learn through the tight perception-action loops that shaped biological intelligence.

4E cognition—the framework that intelligence is embodied, embedded, enacted, and extended—suggests this isn't a minor detail. It's constitutive. Intelligence might not be a property of brains but of brain-body-environment systems.

If true, understanding biological intelligence through disembodied neural networks has fundamental limits. We might be studying a projection of intelligence into a lower-dimensional space, missing essential aspects that only emerge through embodied interaction.


The Convergence Hypothesis

Despite the differences, a striking pattern emerges: convergent solutions to shared computational problems.

Both biological and artificial neural networks face the problem of learning useful representations from high-dimensional, noisy data. Both operate under constraints (metabolic energy for biology, parameters and compute for AI). Both are shaped by optimization pressures toward solutions that generalize.

The result is convergence.

Both develop hierarchical representations. Both use sparse coding. Both exhibit superposition. Both organize information geometrically. Both show sudden phase transitions in learning. Both require mechanisms to route information flexibly (attention in transformers, gating in biology).

This convergence doesn't happen because neural networks are copying brains. It happens because they're solving the same problems under analogous constraints.

This is profoundly important for both fields.

For AI, it suggests that biological intelligence isn't arbitrary. The brain's solutions are necessary for certain classes of problems. Understanding biological mechanisms might reveal architectural innovations we haven't yet discovered.

For neuroscience, it means that insights from AI interpretability are likely to transfer. If sparse autoencoders can extract meaningful features from neural networks, they might work on neural recordings. If circuits discovered in transformers implement general algorithms, those algorithms might appear in cortical circuits.

The geometry of representation, the dynamics of learning, the emergence of abstraction—these might be universal features of systems that learn to predict.


Mechanistic Interpretability as Neuroscience Methodology

The most immediate practical impact is methodological.

Techniques developed for AI interpretability are starting to be applied to neuroscience data:

Sparse dictionary learning (the foundation of sparse autoencoders) is being used to find interpretable features in calcium imaging data. Instead of treating neurons as the fundamental units, researchers extract features that might be more meaningful.
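A schematic version of that analysis, using scikit-learn on a synthetic stand-in for a calcium-imaging matrix, might look like the sketch below; the component count and sparsity penalty are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Synthetic stand-in for a calcium-imaging matrix: rows are timepoints,
# columns are neurons (real data would be dF/F traces).
n_time, n_neurons = 2000, 300
activity = np.maximum(0.0, rng.normal(size=(n_time, n_neurons)))

# Learn a sparse dictionary: each timepoint is approximated as a sparse
# combination of learned "features" that span groups of neurons.
model = MiniBatchDictionaryLearning(n_components=40, alpha=1.0, random_state=0)
codes = model.fit_transform(activity)        # (n_time, n_components), mostly zero
features = model.components_                 # (n_components, n_neurons)

print("fraction of nonzero codes:", np.mean(codes != 0))
print("feature matrix shape:", features.shape)
```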

Causal interventions inspired by circuit ablation studies in AI are being adapted for optogenetics. Instead of just recording what neurons do, researchers can test specific hypotheses about computational function by activating or silencing precise circuits.

Representational similarity analysis—comparing the geometry of representations in neural networks and biological brains—is revealing where the systems converge and diverge. When the same patterns appear in both, it suggests they're capturing something fundamental about the computational problem.
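A minimal version of that comparison: build a representational dissimilarity matrix (RDM) for each system over the same stimuli, then rank-correlate the two. The response matrices below are random placeholders standing in for model activations and neural recordings:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder response matrices over the same 50 stimuli: rows are stimuli,
# columns are units (model) or neurons/voxels (brain).
model_responses = rng.normal(size=(50, 512))
brain_responses = rng.normal(size=(50, 200))

# RDM: pairwise (1 - correlation) between stimulus response patterns,
# returned as a condensed upper-triangle vector.
rdm_model = pdist(model_responses, metric="correlation")
rdm_brain = pdist(brain_responses, metric="correlation")

# RSA score: rank correlation between the two dissimilarity structures.
rho, p = spearmanr(rdm_model, rdm_brain)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p:.3f}")
```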

Probing techniques from interpretability research are being adapted to decode what information is present in neural population activity. Not just correlation, but causal structure.
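A bare-bones example of the idea, with synthetic data standing in for recorded population activity: fit a linear probe to decode a stimulus label, then compare it against a label-shuffled control:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic population activity (trials x neurons) plus a binary stimulus label.
n_trials, n_neurons = 400, 120
labels = rng.integers(0, 2, size=n_trials)
activity = rng.normal(size=(n_trials, n_neurons))
activity[:, 0] += 0.8 * labels               # plant a weak, decodable signal

# Linear probe: can the label be read out from the population?
probe = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(probe, activity, labels, cv=5).mean()

# Shuffle control: the same probe on permuted labels should sit near chance.
chance = cross_val_score(probe, activity, rng.permutation(labels), cv=5).mean()
print(f"probe accuracy = {accuracy:.2f}, shuffled control = {chance:.2f}")
```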

This methodological transfer is accelerating. As tools for understanding neural networks become more sophisticated, neuroscientists are adapting them for biological systems.


What This Means for Understanding Mind

The deepest implication is philosophical.

For most of history, minds were mysterious. We had subjective experience but no mechanism. Consciousness, thought, meaning—these seemed irreducible to physical process.

Neural networks haven't solved the hard problem of consciousness. But they've dissolved many of the "easy" problems—or shown them to be harder than expected.

How do representations form? Through optimization over training distributions, shaped by loss functions and architecture.

How does abstraction emerge? Through hierarchical composition of features under pressure to generalize.

How can finite systems represent infinite domains? Through compositional structure and high-dimensional geometry.

How do systems learn without explicit programming? Through gradient descent on prediction error.

These aren't complete answers for biological minds. But they're existence proofs that these problems have mechanistic solutions.

When we see similar phenomena in brains and in neural networks—superposition, emergent abstraction, circuit-mediated function—we can increasingly treat them as instances of general principles rather than biological mysteries.

This doesn't diminish the brain. It situates it. Brains are extraordinary, but they're extraordinary instances of broader principles governing how complex systems learn to model their worlds.

Mechanistic interpretability, applied to both silicon and wetware, is revealing those principles.


From Model Systems to Theories of Mind

The relationship between AI interpretability and neuroscience is entering a new phase.

Initially, AI borrowed from neuroscience—neurons, networks, learning rules. Then the fields diverged—neuroscience studied brains while AI chased engineering performance.

Now they're converging again, but at a higher level. Not borrowing architectures but discovering shared principles.

Neural networks are becoming model systems for studying computation in complex, adaptive systems. Not because they are brains, but because they implement similar solutions to similar problems in a form we can fully observe and manipulate.

The insights from mechanistic interpretability—superposition, circuits, sparse features, emergent abstraction—are increasingly looking like theories of neural computation rather than just AI engineering insights.

When we understand how transformers implement in-context learning, we have hypotheses about how cortex implements analogical reasoning. When we understand grokking in neural networks, we have models for phase transitions in human learning. When we extract interpretable features with sparse autoencoders, we have methods for finding functional units in neural recordings.

This is what model systems are for: generating insights that transfer to the system you really care about.

Neuroscience has always used model systems to understand brains. Drosophila for genetics. C. elegans for circuits. Ferrets for visual development.

Now we can add: neural networks for computational mechanisms.

The difference is that these model systems are not simplified organisms. They're simplified minds—and we built them ourselves.


This is Part 7 of the Mechanistic Interpretability series, exploring how we crack open the black box of neural networks to understand the algorithms they learn.

Previous: Where Interpretability Meets Information Geometry


Further Reading

  • Olah, C., et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
  • Elhage, N., et al. (2022). "Toy Models of Superposition." Anthropic.
  • Friston, K. (2010). "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience.
  • Saxe, A., et al. (2019). "On the information bottleneck theory of deep learning." Journal of Statistical Mechanics.
  • Kriegeskorte, N., & Douglas, P. K. (2018). "Cognitive computational neuroscience." Nature Neuroscience.
  • Lindsay, G. W. (2021). "Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future." Journal of Cognitive Neuroscience.
  • Yamins, D. L., & DiCarlo, J. J. (2016). "Using goal-driven deep learning models to understand sensory cortex." Nature Neuroscience.