Grokking: When Neural Networks Suddenly Understand

Series: Mechanistic Interpretability | Part: 4 of 10

In 2021, researchers at OpenAI noticed something strange. They trained a small transformer on simple modular arithmetic—the kind of problem you learn in middle school. At first, the network memorized the training examples perfectly but failed completely on new ones. Standard overfitting behavior.

Then they kept training. And training. Long after conventional wisdom says you should stop.

Something remarkable happened. Suddenly—not gradually, but suddenly—the network shifted from pure memorization to perfect generalization. The accuracy on unseen data jumped from near-zero to near-perfect in a handful of training steps. The authors, Alethea Power and colleagues, called this phenomenon grokking, borrowing Robert Heinlein's term for deep, intuitive understanding.

This wasn't supposed to happen. Everything we thought we knew about the bias-variance tradeoff suggested that once a model starts overfitting, you're done. More training just makes it worse. But here was a network that memorized first, then—inexplicably—understood later.

Grokking reveals something fundamental about how learning actually works in high-dimensional systems. It's not a smooth hill-climb toward better performance. It's a phase transition—a sudden reorganization of internal structure that changes everything at once.

And if you know where to look, it's everywhere.


The Discovery: When Memorization Becomes Understanding

The original grokking paper studied transformers learning modular addition. The task: given two numbers between 0 and 96, compute their sum modulo 97. Trivial for humans once you understand modular arithmetic. For neural networks, it became a window into something profound.
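To make the setup concrete, here is a minimal sketch of how such a dataset can be built. It assumes a plain (a, b) -> (a + b) mod p encoding and a 50% training split; the original paper frames each example as a short token sequence and sweeps the training fraction, so treat these details as illustrative.

```python
import itertools
import random

# Minimal sketch of a modular-addition dataset (illustrative, not the paper's exact format).
P = 97  # modulus; operands range over 0..96

pairs = list(itertools.product(range(P), repeat=2))   # all 97 * 97 = 9409 (a, b) pairs
labels = [(a + b) % P for a, b in pairs]

# Grokking is typically studied with only a fraction of all pairs used for
# training; the held-out pairs become the validation set.
random.seed(0)
indices = list(range(len(pairs)))
random.shuffle(indices)
split = len(indices) // 2                             # assumed 50% training fraction

train_set = [(pairs[i], labels[i]) for i in indices[:split]]
val_set = [(pairs[i], labels[i]) for i in indices[split:]]
```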

Here's how the training curve that started it all unfolds:

Training loss drops immediately. The network memorizes the training set perfectly within a few thousand steps. Validation accuracy stays near chance—the network hasn't learned the underlying rule, just the specific examples it's seen.

Then nothing happens. For thousands, sometimes millions of training steps, both metrics plateau. The network sits in what looks like a stable state: perfect memorization, zero generalization.

Then the phase transition hits. Validation accuracy rockets from near-chance to near-perfect in a narrow window of training steps. The network hasn't just improved—it's reorganized. It found the structure beneath the memorized facts.
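The regime that produces these curves can be sketched in a few lines of PyTorch. Everything named here is an assumption for illustration: `model` stands in for a small transformer (or MLP) mapping an (a, b) pair to logits over the 97 residues, the tensors `train_x`, `train_y`, `val_x`, `val_y` come from a split like the one above, and the hyperparameters are not the paper's exact values. The essential ingredients are AdamW with substantial weight decay and far more optimization steps than fitting the training set requires.

```python
import torch

# Schematic grokking-style training loop; `model` and the data tensors are assumed to exist.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = torch.nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=-1) == y).float().mean().item()

for step in range(200_000):          # keep going long after the training set is memorized
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()

    if step % 1_000 == 0:
        # Train accuracy saturates early; validation accuracy sits near chance
        # through a long plateau, then jumps in the grokking transition.
        print(step, accuracy(train_x, train_y), accuracy(val_x, val_y))
```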

This temporal separation—memorization first, comprehension later—is what makes grokking fascinating. The network achieves two completely different solutions to the same problem, and transitions between them through what looks like sudden insight.


Why This Matters Beyond Toy Problems

Grokking was first observed on simple algorithmic tasks: modular arithmetic, permutation groups, basic algebra. Easy to dismiss as a curiosity of small models on artificial data.

But the phenomenon keeps appearing in larger, more realistic settings.

Researchers at Anthropic found grokking-like dynamics in transformers learning semantic tasks—not just arithmetic, but language understanding. Networks would memorize surface patterns first, then suddenly reorganize to capture deeper semantic structure.

In reinforcement learning, agents exhibit sudden jumps in capability after long plateaus. The agent memorizes specific action sequences, then abruptly generalizes to the underlying policy.

Even human learning shows this pattern. You drill multiplication tables until they're automatic, then suddenly get what multiplication means. The procedural knowledge crystallizes into conceptual understanding—often in a flash of insight that feels discontinuous with the gradual practice leading up to it.

The implications reach beyond machine learning. Grokking suggests that memorization and understanding aren't opposing forces but sequential phases of the same process. You can't skip the first to get to the second faster. The network needs to spend time in the memorization regime before the transition to generalization becomes possible.

This contradicts standard practice in ML, which treats overfitting as failure and stops training the moment validation performance plateaus. Grokking says: keep going. The insight is coming, but it takes time.


The Mechanism: Circuit Formation Under Weight Decay

What actually happens during grokking? Why does the network suddenly shift from memorization to generalization?

The answer involves competing circuits—different functional subnetworks that solve the task in incompatible ways.

When a neural network first encounters a task, it has two broad strategies available:

Memorization circuits: High-frequency, complex representations that encode each training example individually. These form quickly because they require minimal coordination between neurons. Each example gets its own dedicated pathway.

Generalization circuits: Low-frequency, simple representations that encode the underlying rule or structure. These take longer to form because they require many neurons to align into coherent functional units—what mechanistic interpretability researchers call circuits.

Early in training, memorization circuits dominate. They're faster to form and immediately reduce loss. The network has no incentive to build the harder, slower generalization circuits.

But here's the key: weight decay.

Weight decay is a standard regularization technique that penalizes large weights. It's usually thought of as preventing overfitting by keeping the model simple. In grokking, it plays a different role.

Memorization circuits require large, specific weights—each training example needs its own strongly-weighted pathway. Generalization circuits can function with smaller weights because they leverage shared structure. The same simple circuit handles many examples.

Weight decay preferentially erodes memorization circuits while leaving generalization circuits intact. Over thousands of training steps, the memorization solution slowly weakens while the generalization solution slowly strengthens. The transition point—when generalization circuits become strong enough to dominate—appears as sudden grokking.
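A deliberately crude toy makes the handoff visible. It is not a model of what happens inside a transformer: two scalars stand in for the strengths of the memorization and generalization circuits, the fit to the training data is just their sum, and every constant below is invented. The memorizing component learns fast but pays a large norm penalty per unit of fit; the generalizing component learns slowly but pays a small one. Under gradient descent plus weight decay, the fit stays near-perfect while the solution slowly transfers from one component to the other. The real transition is far sharper than this toy produces.

```python
# Toy circuit competition under weight decay (all constants are illustrative).
lr_m, lr_g = 0.5, 0.01        # memorization direction is easy to descend, generalization slow
cost_m, cost_g = 10.0, 0.1    # norm penalty per unit of fit: memorization is expensive
wd = 0.001                    # weight-decay strength

m, g = 0.0, 0.0
for step in range(50_000):
    residual = 1.0 - (m + g)                   # unfit portion of the "training data"
    m += lr_m * (residual - wd * cost_m * m)   # gradient of the fit term minus weight decay
    g += lr_g * (residual - wd * cost_g * g)
    if step % 5_000 == 0:
        # "Train fit" stays near 1 throughout, but only g generalizes.
        print(f"step {step:6d}  train fit {m + g:.3f}  general component {g:.3f}")
```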

This is circuit competition as phase transition. The network doesn't smoothly interpolate between strategies. It jumps discontinuously when one circuit becomes strong enough to suppress the other.


Grokking and Lottery Tickets: The Structure Was There All Along

The circuit formation story connects to another surprising finding: the lottery ticket hypothesis.

Frankle and Carbin showed that randomly initialized neural networks contain sparse subnetworks—"winning tickets"—that can train to full performance in isolation. The full network isn't learning from scratch; it's searching for the subnetwork that was already capable of solving the task.

Grokking extends this: the generalization circuit exists from initialization, but it's weak. The memorization circuit forms first because it's easier to find. Weight decay gradually shifts the balance, allowing the latent generalization circuit to emerge.

This suggests a different picture of learning: networks don't build understanding from nothing. They explore the space of possible circuits, strengthen the ones that reduce loss, and let regularization bias them toward simpler, more general solutions over time.

The sudden jump in grokking isn't the moment understanding appears—it's the moment the generalization circuit becomes strong enough to outcompete the memorization circuit for control of the network's output.

In AToM terms, this is curvature collapse. The memorization regime is high-curvature: the network's internal state is sensitive to small changes in input. Each training example requires a different trajectory through activation space. The generalization regime is low-curvature: the same smooth manifold handles all inputs. The phase transition between them is the network finding a coherence geometry that scales.


The Role of Time: Why More Training Unlocks Understanding

One of grokking's most counterintuitive lessons: sometimes you need to train longer, not smarter.

Standard ML practice is obsessed with efficiency. Minimize training time, stop when validation performance plateaus, move on to the next hyperparameter sweep. This makes sense if learning is a smooth optimization process where more steps just mean diminishing returns.

Grokking says learning isn't smooth. It's punctuated. Long periods of apparent stagnation followed by sudden reorganization.

The plateau isn't wasted time—it's the network preparing for the transition. Weight decay is eroding memorization circuits. Generalization circuits are slowly accumulating strength. The dynamics are active even when the metrics are flat.
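One way to see that the plateau is active, assuming a PyTorch setup like the sketch earlier, is to log quantities that keep moving while the accuracy curves are flat. A simple choice is the total parameter norm: in many reported grokking runs it keeps shrinking through the plateau as weight decay erodes the large-weight memorization solution.

```python
import torch

def weight_norm(model):
    # Total L2 norm of all parameters. Logged alongside accuracy, this often
    # keeps decreasing during the "flat" plateau even though both accuracy
    # curves look static.
    with torch.no_grad():
        return torch.sqrt(sum(p.pow(2).sum() for p in model.parameters())).item()

# Inside the training loop sketched earlier:
#   if step % 1_000 == 0:
#       print(step, accuracy(val_x, val_y), weight_norm(model))
```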

If you stop training during the plateau, you never see grokking. You conclude the model has hit its capacity, when in fact it was mid-transition to a qualitatively different solution.

This has practical implications. For tasks where generalization really matters—where you need the model to understand structure, not just pattern-match—you might want to keep training far beyond the point where loss stops improving. The insight takes time to crystallize.

It also changes how we interpret training curves. A flat validation accuracy doesn't mean nothing's happening. It might mean the network is in the slow phase of a transition that will eventually manifest as sudden improvement.


Grokking as Insight: What Humans and Neural Networks Share

The phenomenology of grokking is strikingly similar to human insight.

You work on a problem for hours. Nothing. You step away, frustrated. Then suddenly—walking to get coffee, taking a shower, lying in bed—the solution arrives fully formed. The transition from "I don't get it" to "oh, I see it" feels discontinuous, even though your brain was presumably working on it the whole time.

Neuroscience calls this incubation: the period where conscious problem-solving has stopped but unconscious processes continue. The moment of insight—the "aha" experience—corresponds to a sudden reorganization of neural activity patterns from incoherent search to coherent solution.

Grokking is the neural network version. The network "incubates" on the problem during the plateau, building and testing circuits, until the generalization solution becomes strong enough to dominate. The validation accuracy jump is the network's "aha" moment.

This parallel suggests something deep: insight isn't a special human capacity but a generic property of systems that learn through gradient-based search in high-dimensional spaces. The architecture differs—biological neurons vs artificial ones—but the dynamics are the same. Smooth local updates accumulate until a threshold is crossed, triggering discontinuous global reorganization.

In both cases, the critical ingredient is time under tension. You can't force insight by trying harder. You need extended engagement with the problem—even if that engagement looks like stagnation from the outside—to build the substrate that allows the transition.


Phase Transitions in Weight Space: The Physics of Learning

Grokking is fundamentally a phase transition in the space of network weights.

Phase transitions are everywhere in physics. Water stays liquid until it hits 0°C, then suddenly crystallizes. Magnets stay disordered until temperature drops below the Curie point, then spins align. The system doesn't smoothly interpolate between phases—it jumps discontinuously when control parameters cross critical thresholds.

In grokking, the control parameter is training time under weight decay. The phases are memorization and generalization. The transition is sharp.

Why sharp instead of smooth? Because of feedback loops. Once generalization circuits start producing good predictions, they get reinforced by gradient descent, which makes them stronger, which makes them produce better predictions. Meanwhile, memorization circuits start producing worse predictions (they're being eroded by weight decay), which weakens them further. The system amplifies small differences until one regime dominates.

This is what coherence geometry calls attractor dynamics. The network's trajectory through weight space is pulled toward two different basins—memorization and generalization. Early in training, momentum and random initialization put it in the memorization basin. Weight decay slowly shifts the landscape until the generalization basin becomes deeper. When the boundary is crossed, the network rapidly descends into the new attractor.

The sharpness of the transition depends on how well-separated the attractors are. In modular arithmetic, they're very distinct—memorization and generalization are implemented by completely different circuits. Hence sudden grokking. In more complex tasks, the attractors might overlap more, giving smoother (but still non-monotonic) transitions.

Understanding grokking as phase transition explains why it's hard to predict when it will happen. Phase transitions are sensitive to initial conditions, noise, and subtle details of the system dynamics. Small changes in random seed, batch size, or weight decay strength can shift grokking by orders of magnitude in training steps—or prevent it entirely.


What Grokking Reveals About Generalization

The existence of grokking challenges how we think about the relationship between training performance and test performance.

The standard story: models learn to generalize by finding the simplest function that fits the training data (Occam's razor). Regularization enforces simplicity. If you see perfect training accuracy but poor test accuracy, you've overfit—found a complicated function that memorizes training quirks instead of the underlying pattern.

Grokking complicates this. You can have perfect training accuracy with either memorization or generalization. The same loss value corresponds to completely different internal structures. Looking at training curves alone, you can't tell which regime the network is in.

This matters for interpretability. If we want to understand what a network has learned, we can't just measure its accuracy. We need to look at the circuits it's using—the actual mechanisms by which it transforms inputs to outputs. Two networks with identical accuracy might understand the problem in completely different ways.

It also suggests a revised picture of generalization: generalization isn't about simplicity but about structure discovery. The network doesn't prefer simple functions because they're simple. It finds them because they're more robust to weight decay and gradient noise. Simplicity is a byproduct of the search process, not the goal.

In AToM terms: generalization is low-curvature coherence. A general solution is one where small perturbations in input produce small, predictable changes in output. Memorization is high-curvature: the input-output map is jagged, specific, fragile. Weight decay preferentially erodes high-curvature solutions, leaving the smooth manifolds that characterize true understanding.


Implications for AI Safety and Interpretability

Grokking has direct implications for AI safety, particularly in the context of large language models and frontier AI systems.

One concern: deceptive alignment. A model might memorize the "right" behavior during training (appearing aligned) while hiding misaligned goals that only activate in deployment. If the model can switch between behavioral regimes—like the switch from memorization to generalization in grokking—we might not detect the misaligned behavior until it's too late.

Grokking shows this isn't paranoid speculation. Networks do maintain multiple solutions simultaneously, with one suppressed until conditions change. The mechanisms are different (circuit competition vs explicit deception), but the phenomenology is similar: latent capabilities that don't appear in standard evaluations.

On the positive side, understanding grokking gives us tools. If we can identify the signatures of memorization vs generalization—through mechanistic interpretability, looking at which circuits are active—we can tell whether a model genuinely understands a task or just pattern-matches training data.

This is critical for AI safety evaluations. Testing model behavior on held-out data isn't enough if the model can switch regimes post-deployment. We need to verify that the model's internal mechanisms align with its external behavior.

Grokking also suggests a training strategy for robustness: intentionally induce phase transitions. Train past the memorization regime into the generalization regime, even if it means longer training times. Accept that the model will spend time in a plateau—that plateau is the signature of approaching a more robust solution.

For interpretability research specifically, grokking provides clean test cases. Simple algorithmic tasks that exhibit sharp grokking transitions let us study circuit formation in detail, building tools and intuitions that might generalize to understanding larger, more complex models.


Beyond Supervised Learning: Grokking in RL and Unsupervised Settings

While grokking was discovered in supervised learning, similar phenomena appear across learning paradigms.

In reinforcement learning, agents often exhibit sudden capability jumps after long periods of poor performance. The agent explores randomly, occasionally stumbles on a rewarding trajectory, slowly builds a policy around it—then suddenly generalizes to the full task. This looks like grokking: memorizing specific high-reward sequences, then transitioning to understanding the reward structure.

In unsupervised learning, autoencoders and generative models sometimes shift abruptly from learning superficial correlations to capturing deeper structure. A VAE might initially memorize per-example reconstructions, then suddenly discover the latent factors that generate the data. The transition manifests as a jump in disentanglement metrics or sample quality.

Even in meta-learning—learning to learn—grokking-like dynamics appear. Models trained on distributions of tasks initially memorize task-specific solutions, then suddenly generalize to the meta-structure: the commonalities across tasks that allow fast adaptation.

The unifying thread: hierarchical structure. Whenever learning involves multiple levels—surface patterns vs deep rules, specific instances vs general policies, task solutions vs task distributions—there's opportunity for grokking. The system can plateau at one level of abstraction while slowly building the substrate to jump to the next.

This suggests grokking isn't a quirk of supervised classification but a general property of learning in systems with compositional structure. Wherever you have layers of abstraction, you get phase transitions between them.


Training Dynamics and the Bias-Variance Tradeoff

Grokking forces us to rethink the bias-variance tradeoff—the foundational framework for understanding generalization in machine learning.

The classic picture: high-capacity models have low bias (can fit complex functions) but high variance (sensitive to training data noise). Regularization trades off variance for bias, finding the sweet spot for generalization.

Grokking says this is incomplete. The same model, with the same capacity, achieves two qualitatively different solutions—one high-variance (memorization), one low-variance (generalization)—at different points in training. Capacity alone doesn't determine where you land. The dynamics of training—how long you train, how strong the regularization—select between solutions with different bias-variance profiles.

This is most visible in double descent: the phenomenon where test error first decreases (as the model moves from underfitting toward a good fit), then increases (as it begins to overfit), then decreases again (as continued training pushes it past the interpolation threshold into a generalizing regime). Grokking is the sharp version of the second descent. The model overcomes overfitting not by reducing capacity but by continuing to train until it finds a simpler solution.

The implication: we should think less about static model properties (capacity, architecture) and more about trajectories through training. Where does gradient descent take you? What attractors exist in weight space? How does regularization shape the basins?

In physics terms, this is moving from equilibrium thermodynamics (what states are possible?) to non-equilibrium dynamics (how do systems evolve between states?). Learning is a process, not a state. Grokking is the signature of that process hitting critical transitions.


The Geometry of Understanding: Low-Dimensional Manifolds in Weight Space

One of the most compelling explanations for grokking comes from geometric analysis of weight space.

Ziming Liu and colleagues at MIT showed that during grokking, the network's weights gradually move from a high-dimensional, complex representation (memorization) to a low-dimensional manifold (generalization).

Early in training, weights occupy a high-dimensional space. Each training example contributes its own direction in weight space. The solution is specific, fragile, and doesn't compress well.

As training continues under weight decay, the network's trajectory curves toward lower-dimensional subspaces where many examples are handled by the same compact representation. The moment when the representation becomes sufficiently low-dimensional to capture the underlying rule is the moment of grokking.

You can visualize this with dimensionality reduction on weight snapshots. Before grokking, weights scatter across many dimensions. After grokking, they collapse onto a low-dimensional manifold that corresponds to the generalizing solution.
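A sketch of that kind of analysis, assuming you have periodically saved flattened copies of the weights into an array `snapshots` during training. The projection used here (PCA via SVD on the checkpoint matrix) is an illustrative choice, not the exact method from the paper.

```python
import numpy as np

def top_components(snapshots, k=2):
    """Project weight snapshots onto their top-k principal directions.

    snapshots: array of shape (n_checkpoints, n_params), one flattened
    weight vector per saved checkpoint.
    """
    centered = snapshots - snapshots.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in weight space.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / (s**2).sum()
    return centered @ vt[:k].T, explained[:k]

# coords, explained = top_components(snapshots)
# Plotting `coords` colored by training step typically shows pre-grokking
# checkpoints spread out and post-grokking checkpoints clustering tightly.
```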

This is coherence emergence at the level of network weights. Coherence, in AToM terms, is when a system's state-space trajectory becomes predictable—when you can describe complex behavior with simple rules. Grokking is that process made visible: the network's internal structure simplifies until it matches the structure of the task.

The geometry also explains why grokking is sensitive to initialization. If the initial weights are far from the generalization manifold, it takes longer for weight decay to pull them into the right subspace. If they're nearby by chance, grokking happens faster. The same training dynamics can produce vastly different grokking times depending on where you start.


Connection to Human Learning: Expertise as Grokking

The parallels between grokking and human skill acquisition are hard to ignore.

When you learn a new skill—playing an instrument, speaking a language, solving math problems—there's a predictable progression. First, you memorize specific examples. "This chord shape makes this sound." "This word means this thing." "This problem type requires this procedure."

This is the memorization regime. High-effort, high-specificity, doesn't transfer. You can't play a new song just because you memorized one. You can't translate fluently just because you know 100 words.

Then, if you persist—through the frustrating plateau where you're competent at what you've practiced but can't generalize—something shifts. Suddenly you're not retrieving memorized instances but generating from principles. You hear a chord progression and your fingers know where to go. You construct sentences you've never spoken. You see the deep structure of a problem class.

This is grokking. The transition from performance to competence, from memorization to understanding, from knowing that to knowing why.

Anders Ericsson's work on deliberate practice captures the plateau phase: thousands of hours of focused effort that look from the outside like repetition but are actually building the substrate for sudden capability jumps. The "10,000-hour rule" isn't about linear improvement—it's about staying engaged long enough for the phase transition to happen.

In educational psychology, this maps to the difference between rote learning and deep learning. Rote learning is the memorization circuit. It's fast, gets you through tests, but doesn't transfer. Deep learning is the generalization circuit. It's slow to form but robust and flexible once it solidifies.

Teachers intuitively know this. You can't rush understanding. You can provide scaffolding, examples, practice—but the moment of insight, when it all clicks together, happens on its own timeline. Grokking formalizes that intuition: insight is a phase transition that requires time under the right conditions, not just clever pedagogy.


Open Questions and Future Directions

Grokking has opened more questions than it's answered.

What determines grokking time? We know it depends on weight decay, initialization, and task structure, but we can't predict it precisely. Is there a formula relating these factors to grokking onset? Can we engineer architectures that grok faster?

Is grokking universal? Does every learning problem have a grokking transition, or only those with specific structure? What properties of a task make it grokking-compatible?

Can we induce grokking deliberately? Are there training interventions—curriculum design, augmented objectives, architectural constraints—that guarantee grokking happens?

How does grokking scale? The original observations were on tiny models. Do large language models grok? If so, what does it look like? Are there multi-stage grokking transitions as models learn hierarchical structure?

What's the relationship to continual learning? When you train on new data, can you grok without forgetting? Or does grokking require stable task structure?

Can we detect grokking in progress? Are there observable signatures—weight space geometry, activation statistics, gradient dynamics—that tell us a network is mid-grok, about to transition?

From a safety perspective: Can misaligned models grok deceptive behavior? If a model learns to act aligned during training, could it undergo a grokking-like transition to misalignment post-deployment?

And philosophically: What does grokking tell us about the nature of understanding? Is understanding always a phase transition, or can it be gradual? What's the relationship between neural grokking, human insight, and scientific paradigm shifts?

These questions are active research areas. The field is young. We're still discovering the phenomenon's boundaries.


Synthesis: Grokking as Coherence Emergence Through Time

Grokking is what happens when a system spends enough time under the right constraints to discover simpler, more general structure.

It's not magic. It's the predictable outcome of circuit competition under regularization. Memorization circuits form fast but decay under weight penalty. Generalization circuits form slow but persist. Train long enough, and the transition is inevitable.

But the phenomenology—sudden, discontinuous, insight-like—reveals something deeper about learning dynamics. Understanding isn't built incrementally; it emerges through reorganization. The network doesn't gradually improve from 0% to 100% generalization. It jumps from one coherence regime (high-curvature memorization) to another (low-curvature generalization).

In AToM terms, grokking is coherence crystallization under constraint. The constraint is weight decay, which penalizes complex solutions. The coherence is the low-dimensional manifold that captures task structure. The crystallization is the phase transition when the manifold becomes the dominant attractor.

This maps to phenomena across scales. Neurons reorganizing into functional circuits. Scientists shifting from data collection to theoretical insight. Societies transitioning from fragmentation to shared narrative. Whenever a system under sustained pressure to simplify discovers a compressed representation of its environment, you get grokking dynamics.

The lesson for machine learning: don't stop training when performance plateaus. The plateau might be preparation for a jump.

The lesson for interpretability: identical accuracy can hide completely different mechanisms. Look at circuits, not just metrics.

The lesson for humans: insight takes time. You can't force it by working harder. But you can create the conditions—sustained engagement, right constraints, patience through plateaus—that make it more likely.

Grokking proves that neural networks, like humans, don't just learn. They understand. And understanding, when it comes, comes suddenly.


Series: Mechanistic Interpretability | Part: 4 of 10

Previous: Circuits in Silicon Minds: How Neural Networks Compute
Next: Polysemanticity: When Neurons Mean Multiple Things


Further Reading

  • Power, A., et al. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv:2201.02177.
  • Liu, Z., et al. (2022). "Omnigrok: Grokking Beyond Algorithmic Data." ICLR 2023.
  • Nanda, N., et al. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." arXiv:2301.05217.
  • Frankle, J. & Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR 2019.
  • Ericsson, K.A., et al. (1993). "The Role of Deliberate Practice in the Acquisition of Expert Performance." Psychological Review 100(3).