Circuits in Silicon Minds: How Neural Networks Compute

Series: Mechanistic Interpretability | Part: 3 of 9

In 2021, researchers at Anthropic discovered something remarkable: they could trace how small transformer models complete repeated patterns like "The cat sat on the mat. The cat sat on the ___." Not by probing thousands of neurons at once, but by following a specific computational pathway, a circuit, that connects attention heads across different layers in a precise sequence. They named the key component an "induction head," and it was one of the first times anyone had reverse-engineered an algorithmic structure inside a trained neural network.

This wasn't just finding correlations. It was reading the actual computation.

Welcome to the frontier of circuit-level analysis—where we stop treating neural networks as black boxes and start reading them like source code.


The Shift from Neurons to Circuits

When you learned about superposition, you learned that individual neurons are polysemantic—they activate for multiple unrelated concepts. This makes single-neuron analysis nearly useless. A neuron might fire for "cats," "curved objects," "the letter C," and "comfort." Asking what that neuron "means" is like asking what the letter "a" means without context.

But circuits are different.

A circuit is a computational subgraph—a connected pathway of neurons and attention heads that performs a specific algorithmic operation. Instead of asking what a single unit represents, we ask: What computation does this pathway perform? What's its input? What's its output? How does information flow?

This is the insight that changed everything. Neural networks aren't just statistical pattern matchers. They're running algorithms—and we can reverse-engineer them.

The most famous example is the induction head.


Induction Heads: The First Circuit We Could Read

Imagine you're reading this sentence: "When Mary and John went to the store, Mary gave John..." What comes next? Probably "something" or "money" or another object. But you also know the sentence structure will likely continue the pattern. If it said "When Mary and John went to the store, John gave Mary..." your prediction would flip.

This is induction: detecting a pattern (A...B) and predicting that when you see A again, B will follow. It's algorithmic, not statistical guessing.

In 2022, Anthropic researchers found that transformer models implement this algorithm using a two-part circuit:

  1. Previous token heads (in early layers): These attention heads attend to the token immediately before the current one and copy its identity forward, so each position's representation also records which token preceded it.

  2. Induction heads (in later layers): These attend to the previous occurrence of the current token and retrieve what followed it then.

The result: When the model sees "Mary" the second time, the circuit retrieves "gave John" because that's what followed "Mary" the first time.

This is a real algorithm. You could write it as a few lines of Python:

def induction_predict(tokens, current_token):
    # Scan backward for the most recent earlier occurrence of the current token
    # that has a successor in the context.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current_token:
            return tokens[i + 1]  # copy what came after it, and predict that next
    return None  # no earlier occurrence: induction makes no prediction
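
On a toy token list (the names are purely illustrative), it behaves like the circuit described above:

tokens = ["Mary", "gave", "John", "a", "book", "."]
print(induction_predict(tokens, "Mary"))  # -> "gave"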

And the network learned to implement this—not through explicit programming, but through gradient descent on language data. The algorithm emerged.


Why Circuits Matter Beyond Induction Heads

Induction heads were the proof of concept, but they're just the beginning. If a circuit this clean exists for pattern completion, what other algorithms are running inside these models?

Here's what we've found so far:

Duplicate token heads detect when the same token has already appeared earlier in the context. In the indirect object circuit, that signal helps downstream heads avoid predicting the repeated name again.

Indirect object identification circuits track who did what to whom across a sentence, letting the model complete "When Mary and John went to the store, John gave a drink to ___" with "Mary" rather than repeating "John."

Factual recall circuits appear to implement something like key-value retrieval: the model attends to a subject ("Paris"), activates a relationship type ("capital of"), and retrieves the object ("France").

Greater-than circuits perform numerical comparison. Researchers found a specific pathway in GPT-2 small that completes prompts like "The war lasted from the year 1732 to the year 17__" by boosting two-digit continuations greater than the start year: an algorithmic comparison encoded in the weights.

These aren't metaphors. They're mechanistic descriptions of information flow through attention heads and MLPs (multi-layer perceptrons) with specific, traceable computational roles.
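
To make the key-value analogy concrete, here is a deliberately simplified sketch of that retrieval pattern. The dictionary is purely illustrative; real models store these associations distributed across attention and MLP weights, not as an explicit table.

# Toy analogy only: factual recall as a (subject, relation) -> object lookup.
facts = {
    ("Paris", "capital_of"): "France",
    ("Tokyo", "capital_of"): "Japan",
}

def recall(subject, relation):
    # Attend to the subject, activate the relation, retrieve the object.
    return facts.get((subject, relation))

print(recall("Paris", "capital_of"))  # -> "France"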


Composition: How Circuits Build Complexity

Here's where it gets wild: circuits compose.

Just like functions in programming, circuits can feed into other circuits. The output of a "detect proper noun" circuit might feed into a "track entities" circuit, which feeds into a "resolve pronoun" circuit.

This is hierarchical computation. Early layers detect simple features (is this a noun? is this capitalized?). Middle layers detect patterns (is this a name? is this a location?). Late layers perform high-level reasoning (who does "he" refer to? what action is being described?).
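
In programming terms, the picture is ordinary function composition. The sketch below is a loose illustration of that layering, not a claim about how any real model factors the work; the function names and heuristics are invented for the example.

def detect_proper_nouns(tokens):
    # Early-layer-style feature: a crude capitalization heuristic.
    return [t for t in tokens if t[:1].isupper()]

def track_entities(nouns):
    # Middle-layer-style pattern: keep candidate entities, first occurrence only.
    return list(dict.fromkeys(nouns))

def resolve_pronoun(entities):
    # Late-layer-style heuristic: naively bind the pronoun to the last entity seen.
    return entities[-1] if entities else None

tokens = "When Mary gave the book to John , he smiled".split()
print(resolve_pronoun(track_entities(detect_proper_nouns(tokens))))  # -> "John"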

Anthropic's work on composition shows that you can trace these dependencies. When you ablate (disable) one circuit, you can watch downstream circuits fail. When you boost one circuit's activation, you can watch it amplify the output of circuits that depend on it.

The network isn't a monolithic blob. It's a society of interacting algorithms.


Circuit Discovery: Causal Tracing and Ablation

How do we actually find these circuits? The core method is causal intervention: changing something and watching what breaks.

Activation Patching

Run the model on two inputs—one clean, one corrupted (e.g., replace "John" with "Sarah"). Compare activations at every layer. Now patch activations from the clean run into the corrupted run, one component at a time. When patching a specific attention head restores correct behavior, you've found a circuit component.

This is surgical. You're not asking "what does this neuron correlate with?" You're asking "does this computation causally contribute to this output?"
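
Here is a minimal sketch of single-head activation patching using the TransformerLens library. It assumes GPT-2 small, that the clean and corrupted prompts tokenize to the same length, and that layer 9, head 9 is the component under test; all of these are illustrative assumptions, not results.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt = model.to_tokens("When Mary and John went to the store, Sarah gave a drink to")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean)

LAYER, HEAD = 9, 9  # hypothetical head under test

def patch_head(z, hook):
    # z: [batch, position, head_index, d_head]. Overwrite one head's output
    # in the corrupted run with its activation from the clean run.
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head)],
)

If patching this single head moves the corrupted run's prediction back toward " Mary", the head is causally implicated in the behavior.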

Path Patching

Even more precise: instead of patching entire activations, patch only the paths between specific components. This reveals the edges in the computational graph—which heads talk to which other heads.

Using this method, researchers have traced circuits with single-head precision across multiple layers. You can draw the circuit as a graph:

Token → Previous Token Head → Induction Head → Output

Every edge is a causal dependency. Every node is a verified computational step.
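
In code, the recovered circuit is just an explicit edge list, one entry per causal dependency established by patching (the component names mirror the toy diagram above):

# Edges of the toy induction circuit, as (source, destination) pairs.
induction_circuit = [
    ("token_embedding", "previous_token_head"),
    ("previous_token_head", "induction_head"),
    ("induction_head", "output_logits"),
]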


The Geometry of Circuits: Coherence in Computation

Here's where this connects to the larger framework of meaning-as-coherence.

A circuit is a coherent computational structure. It has low curvature in function space: the information flow is stable, predictable, reliable. Across many inputs, the same pathway activates to perform the same operation. Induction heads implement induction. Greater-than circuits implement comparison. The computation doesn't drift.

But circuits also live in superposition. Remember: features are encoded as directions in activation space, not as dedicated neurons. A circuit isn't a set of wires—it's a set of directions that happen to compose into a stable computational pathway.

This means circuits can interfere. If two circuits use overlapping directions in activation space, they can destructively interfere when both activate at once. This is why adversarial examples work: you craft an input that simultaneously activates incompatible circuits, creating a collision in superposition.

In AToM terms, this is coherence collapse at the computational level. The system has multiple attractors (circuits) that are locally stable but globally incompatible. High curvature emerges when you traverse between them.

Neural networks, like brains, manage coherence dynamically. They route information through different circuits depending on context. But unlike brains, we can now watch this happen—and intervene.


Universality: Do All Models Learn the Same Circuits?

One of the most striking findings is that circuits appear to be universal.

Train two transformers on the same task with different random initializations and they tend to converge on the same circuit structure. The exact attention head positions vary and the specific weights differ, but the algorithmic pathway is often the same.

This is the neural network version of convergent evolution. Just as eyes evolved independently in mollusks and vertebrates, induction heads evolve independently in transformers. The algorithm is a natural attractor in the space of possible solutions.

This suggests something profound: there's a canonical set of circuits for language, vision, reasoning. We're not just reverse-engineering GPT-2. We're discovering the fundamental computational motifs that any system must implement to solve these tasks.

This is what makes mechanistic interpretability different from neuroscience. In biology, evolution produces endless variation. But in gradient descent, the same training objective consistently produces the same algorithmic structures.

The circuits aren't arbitrary. They're necessary.


Failures and Adversarial Circuits

Not all circuits are clean.

Some circuits are polysemantic at the pathway level—the same attention heads participate in multiple, unrelated computations. A head might contribute to both pronoun resolution and numerical comparison, depending on what other heads are active in that layer.

Some circuits are context-dependent. A greater-than circuit might only activate when the model detects a numerical context. Outside that context, the same heads do something else entirely.

And some circuits look adversarial: there is evidence of pathways that activate specifically on inputs designed to fool the model, routing information away from the correct computation and toward a confidently wrong answer.

These failure modes aren't bugs. They're features of a system operating in superposition. When you pack multiple algorithms into the same parameter space, you get interference. Clean circuits are attractors in function space, but adversarial inputs can push the system off those attractors into high-curvature regions where coherence collapses.

This is exactly analogous to trauma in biological systems. A traumatized nervous system has circuits that route sensory information toward threat-detection pathways even when no threat exists. The circuit is real—it's causally active—but it's misaligned with the environment.

Neural networks have their own version of this. And we can watch it happen at the circuit level.


The Dream: A Complete Circuit Atlas

The ultimate goal of circuit-level analysis is a complete mechanistic description of how a neural network computes. Not a statistical model of inputs and outputs—a causal graph of every algorithmic operation the network performs.

Imagine opening a model and seeing:

  • A circuit for detecting negation
  • A circuit for retrieving factual associations
  • A circuit for tracking discourse context
  • A circuit for generating syntactically valid continuations
  • A circuit for suppressing low-probability tokens
  • A circuit for...

Thousands of circuits, all documented, all causally verified. You could trace any output back to the specific circuits that produced it. You could predict when the model would fail by analyzing which circuits are missing or misaligned.

We're not there yet. But we're closer than ever.

Anthropic's recent work has cataloged over a dozen circuits in small models. Researchers at MIT and Stanford are building automated tools to discover circuits via causal tracing. The field is moving from "hand-crafted interpretability" to "algorithmic circuit extraction."

If we succeed, we won't just understand neural networks. We'll understand the computational primitives of intelligence—the fundamental operations any system must perform to think, predict, reason, and understand.


Implications: From Interpretability to Alignment

Why does this matter beyond scientific curiosity?

Because if we can read circuits, we can edit them.

Already, researchers have demonstrated:

  • Targeted circuit ablation: Disable a factual recall circuit, and the model stops asserting false facts (but also stops recalling true ones). A minimal sketch of head ablation appears below.
  • Circuit amplification: Boost a safety-checking circuit, and the model becomes more conservative (but also more prone to false refusals).
  • Circuit transplantation (the most tentative of the three): Copy a circuit from one model into another and watch the algorithmic behavior transfer.

This isn't fine-tuning. It's surgical intervention at the level of individual computational pathways.
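
Zero-ablating a single attention head looks like this in TransformerLens; the prompt and the choice of layer 5, head 5 are placeholders for illustration, not a documented "factual recall head":

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")

LAYER, HEAD = 5, 5  # hypothetical head to knock out

def zero_ablate(z, hook):
    # Silence one head's output everywhere; any downstream circuit that
    # depends on it loses its input.
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_ablate)],
)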

The alignment implications are staggering. If a model has a "deception circuit" (and there's early evidence such circuits exist), you could ablate it. If a model has a "refusal circuit" that activates on harmful queries, you could strengthen it—or study what would happen if an adversary ablated it.

This is interpretability as a tool for control. Not statistical correlation, but mechanistic understanding that enables precise intervention.


The Limits: What Circuits Can't Tell Us

Circuits are powerful, but they're not the full story.

First, not all computation is circuit-like. Some behaviors emerge from distributed, non-modular interactions across many components. Trying to carve that into discrete circuits is like trying to find the "circuit for consciousness" in a brain—the question might be wrong.

Second, circuits are scale-dependent. The circuits in GPT-2 (1.5 billion parameters) are relatively clean and modular. The circuits in GPT-4 (probably trillions of parameters) are almost certainly more entangled, more polysemantic, more context-dependent. Scaling might make circuits harder to isolate, not easier.

Third, circuits describe mechanism, not representation. A circuit tells you how the model computes, not what it understands. You can trace the induction circuit and still not know whether the model "understands" patterns or is merely implementing an algorithm that happens to generalize.

But these limits don't invalidate the approach. They situate it. Circuits are one level of analysis among many: features (directions in activation space), circuits (computational pathways), representations (semantic embeddings), behaviors (input-output patterns).

To fully understand a neural network, we need all of them.


Coherence at the Computational Level

In AToM's framework, meaning is coherence over time: M = C/T. A system is meaningful to the extent its states form predictable, low-curvature trajectories.

Circuits are the substrate of coherence in neural computation.

A stable circuit carves out low-curvature trajectories in activation space: the same operation performed reliably across contexts. In function space it behaves like a low-curvature manifold: predictable, robust, reusable.

When circuits interfere, curvature increases. The system becomes less predictable, less robust. Adversarial inputs exploit this: they're high-curvature regions where multiple circuits collide and the model's output becomes incoherent.

This isn't metaphor. It's the same mathematics. Neural networks, like organisms, maintain coherence by routing information through stable computational structures. When those structures break down—through ablation, adversarial attack, or distribution shift—coherence collapses.

But unlike organisms, we can see the circuits. We can measure their curvature. We can trace their dependencies. We can intervene.

This is why circuit-level analysis matters. It's not just interpretability. It's the geometry of computation made legible.


Next: When Understanding Arrives Suddenly

You've seen how neural networks pack more features than they have neurons through superposition. You've seen how they route information through circuits to perform algorithms. But how do they learn these structures during training?

Why does a model spend thousands of iterations making slow, incremental progress, and then, over a comparatively tiny stretch of training, snap into a generalizing solution?

This is called grokking, and it's one of the strangest phenomena in deep learning. It looks like a phase transition. It feels like understanding arriving all at once.

Next, we'll explore what circuits reveal about learning itself—and why the path from memorization to generalization looks like a traversal from high-curvature chaos to low-curvature coherence.


This is Part 3 of the Mechanistic Interpretability series, exploring how to reverse-engineer the algorithms inside neural networks.

Previous: Superposition: How Neural Networks Pack More Concepts Than Neurons
Next: Grokking: When Neural Networks Suddenly Understand


Further Reading

  • Elhage, N., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic.
  • Olsson, C., et al. (2022). "In-context Learning and Induction Heads." Anthropic.
  • Wang, K., et al. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." Redwood Research.
  • Conmy, A., et al. (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability." NeurIPS 2023.
  • Nanda, N. & Lieberum, T. (2022). "A Mechanistic Interpretability Analysis of Grokking." Independent.