In machine learning, we expect a model to either learn or overfit. What we don’t expect is for a model to overfit first and then — much later, with no changes — suddenly start generalizing well. This phenomenon is called grokking, and it has puzzled researchers since its discovery. A new paper finally explains why it happens and proves it mathematically — in the simplest possible setting.
What is Grokking?
Grokking was first observed in 2022 on small algorithmic tasks (like modular arithmetic). The pattern is striking:
- The model quickly reaches near-perfect accuracy on training data
- Test accuracy remains terrible for a long time
- Then, seemingly out of nowhere, test accuracy jumps up
It looks like the model “memorizes” first and “understands” much later. Training loss says “I’m done”, but the model keeps improving under the hood.
This raised a fundamental question: is grokking a deep mystery of neural networks, or does it have a simple explanation?
The Answer: Ridge Regression Shows the Way
The authors prove that grokking occurs even in ridge regression — the simplest possible learning setting. No deep networks, no complex architectures. Just linear regression with L2 regularization:
$$\min_\theta \frac{1}{n} \|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2$$
where $\lambda$ is the weight decay parameter.
If grokking happens here, it’s not about neural network magic — it’s about the fundamental dynamics of gradient descent with regularization.
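This claim is easy to check numerically. Below is a minimal sketch (all dimensions, seeds, and hyperparameters are illustrative choices, not the paper's): over-parameterized linear regression trained by plain gradient descent with small weight decay. Training loss collapses almost immediately, while test loss stays high for thousands of steps and only then comes down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                        # n samples, m parameters (m >> n)
X = rng.normal(size=(n, m)) / np.sqrt(m)
theta_star = np.zeros(m)
theta_star[:5] = 1.0                  # illustrative ground-truth parameters
y = X @ theta_star

X_test = rng.normal(size=(500, m)) / np.sqrt(m)
y_test = X_test @ theta_star

eta, lam = 0.5, 1e-3                  # learning rate, small weight decay
theta = rng.normal(size=m)            # random init: noise in every direction

losses = {}
for step in range(1, 20001):
    grad = (2 / n) * X.T @ (X @ theta - y) + 2 * lam * theta
    theta -= eta * grad
    if step in (200, 20000):
        losses[step] = (np.mean((X @ theta - y) ** 2),
                        np.mean((X_test @ theta - y_test) ** 2))

# by step 200 the model has memorized the training set, but test loss is
# still dominated by initialization noise in data-orthogonal directions;
# by step 20000 weight decay has shrunk that noise away
```

The three phases described in the next section are all visible in this one loop: fast training-loss collapse, a long plateau in test loss, and late generalization.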
Three Phases of Grokking
The paper formally proves three distinct phases during gradient descent training:
Phase 1: Fast Overfitting
The model quickly fits the training data. How fast? The speed is controlled by the smallest non-zero eigenvalue of $\Phi^\top \Phi$, where $\Phi$ is the empirical feature matrix:
$$t_{\text{overfit}} \leq \frac{n \cdot \ln\!\left(6b^2\|\theta^{(0)}\|_2^2 / \varepsilon\right)}{2\eta \cdot \lambda_{\min}^+(\Phi^\top \Phi)}$$
In plain terms: training loss drops fast because the model has enough parameters to memorize the data.
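To see what enters this bound, note that the non-zero eigenvalues of the large $m \times m$ matrix $\Phi^\top \Phi$ can be read off from the much smaller $n \times n$ Gram matrix. A sketch with arbitrary dimensions ($\Phi$ here is just a random Gaussian feature matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 200
Phi = rng.normal(size=(n, m)) / np.sqrt(m)

# Phi^T Phi is m x m but has rank n; its non-zero eigenvalues
# coincide with those of the n x n Gram matrix Phi @ Phi.T
gram_evals = np.linalg.eigvalsh(Phi @ Phi.T)
lam_min_plus = gram_evals.min()   # all n Gram eigenvalues are non-zero here

# the overfitting-time bound scales as 1 / lam_min_plus: a poorly
# conditioned feature matrix slows memorization down
```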
Phase 2: Prolonged Poor Generalization
Here’s the key. After overfitting, the model has learned the right answer for training points. But its parameters in directions orthogonal to the training data remain close to their initial (random) values.
These “extra” directions don’t affect training loss — the model already fits the data. But they hurt generalization because they add noise to predictions on new inputs.
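This split can be verified directly: project the parameters onto the row space of the data and its orthogonal complement. A sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 200
X = rng.normal(size=(n, m)) / np.sqrt(m)
theta = rng.normal(size=m)          # parameters still full of init noise

# orthonormal basis for the row space of X via the thin SVD
_, _, Vt = np.linalg.svd(X, full_matrices=False)   # Vt has shape (n, m)
theta_par = Vt.T @ (Vt @ theta)     # component seen by the training data
theta_perp = theta - theta_par      # component invisible to training

# the orthogonal part contributes nothing to training predictions...
print(np.abs(X @ theta_perp).max())     # ~0 up to float error
# ...but it is large, and a fresh test input generically feels it
print(np.linalg.norm(theta_perp))
```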
Phase 3: Late Generalization
Weight decay slowly shrinks all parameters, including those noisy orthogonal components. Eventually, they become small enough that the model generalizes well.
The generalization time is bounded below by:
$$t_{\text{generalize}} \geq \frac{1}{4\eta\lambda} \cdot \ln\left(\frac{(m-n)\nu^2}{2} \cdot \left(\sqrt{\frac{c}{\lambda_{\min}(\Sigma)}} + \|\theta^*\|_2\right)^{-2}\right)$$
where $\nu$ is the initialization scale and $\Sigma$ is the feature covariance matrix.
The critical factor: this scales as $1/\lambda$ — the smaller the weight decay, the longer the grokking delay.
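The $1/\lambda$ scaling has a one-line explanation: a component orthogonal to the data feels only the weight-decay term of the gradient, so it shrinks geometrically by a factor $(1 - 2\eta\lambda)$ per step. A quick sketch (all values illustrative):

```python
def steps_to_shrink(eta, lam, init=1.0, eps=1e-3):
    """Steps until a data-orthogonal component decays from `init` below `eps`.

    Such a component evolves as v *= (1 - 2*eta*lam) each step,
    because only weight decay acts on it.
    """
    steps, v = 0, init
    while abs(v) > eps:
        v *= 1 - 2 * eta * lam
        steps += 1
    return steps

eta = 0.1
t_large = steps_to_shrink(eta, lam=1e-2)
t_small = steps_to_shrink(eta, lam=1e-3)
# shrinking lam by 10x stretches the delay by roughly 10x
print(t_large, t_small)
```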
Why Does This Happen?
The intuition is surprisingly clean:
- Over-parameterization ($m \gg n$): The model has many more parameters than data points
- Random initialization: Parameters start at random values in all directions
- Fast memorization: The model quickly finds parameters that fit training data
- Slow regularization: Weight decay slowly removes the “junk” in unused directions
The gap between Phase 1 and Phase 3 is the grokking delay. It exists because memorization is fast (depends on data) but regularization is slow (depends on weight decay).
Controlling Grokking
The paper’s most practical insight: grokking is controllable through hyperparameters.
| Parameter | Effect on Grokking |
|---|---|
| Weight decay $\lambda$ ↑ | Shorter delay (faster regularization) |
| Weight decay $\lambda$ ↓ | Longer delay (can be arbitrarily long) |
| Learning rate $\eta$ ↑ | Shorter delay |
| Over-parameterization ↑ | More pronounced grokking |
| Initialization scale ↑ | More pronounced grokking |
You can make grokking disappear by increasing weight decay. Or make it arbitrarily long by decreasing it.
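The first two table rows can be checked empirically by measuring when test loss first drops below a threshold for two values of $\lambda$. A sketch reusing the over-parameterized linear setup (all sizes, seeds, and thresholds are illustrative):

```python
import numpy as np

def generalization_step(lam, eta=0.5, threshold=0.1, max_steps=100000):
    """First GD step at which test loss drops below `threshold`."""
    rng = np.random.default_rng(0)        # same data for every lam
    n, m = 20, 200
    X = rng.normal(size=(n, m)) / np.sqrt(m)
    theta_star = np.zeros(m)
    theta_star[:5] = 1.0
    y = X @ theta_star
    X_te = rng.normal(size=(500, m)) / np.sqrt(m)
    y_te = X_te @ theta_star
    theta = rng.normal(size=m)
    for step in range(1, max_steps + 1):
        theta -= eta * ((2 / n) * X.T @ (X @ theta - y) + 2 * lam * theta)
        if np.mean((X_te @ theta - y_te) ** 2) < threshold:
            return step
    return max_steps

t_big = generalization_step(lam=1e-2)
t_tiny = generalization_step(lam=1e-3)
# larger weight decay -> much shorter grokking delay
print(t_big, t_tiny)
```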
Beyond Ridge Regression
The theory is proven for ridge regression, but the authors validate it experimentally on neural networks:
Random-feature networks
Two-layer ReLU networks with a frozen hidden layer (effectively ridge regression in feature space). The theoretical predictions match the observed training dynamics precisely.
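A frozen-first-layer network really is ridge regression on random ReLU features, which is why the theory transfers directly. A sketch of that equivalence, with illustrative sizes and an arbitrary target (the closed-form ridge solution stands in for gradient-descent training of the output layer):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, width = 50, 5, 400
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                  # an arbitrary nonlinear target

W = rng.normal(size=(d, width))      # first layer: random and frozen
Phi = np.maximum(X @ W, 0.0)         # hidden ReLU activations = fixed features

# training only the output layer with weight decay is exactly
# ridge regression on Phi
lam = 1e-4
theta = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(width),
                        Phi.T @ y / n)
train_mse = np.mean((Phi @ theta - y) ** 2)
```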
Full neural networks
Two-layer networks with both layers trained. The grokking behavior qualitatively matches the theoretical predictions — the same hyperparameter dependencies hold.
This suggests the mechanism is universal: it’s not about the model architecture, but about the dynamics of gradient descent with regularization in an over-parameterized setting.
What This Means
Grokking is not mysterious
The paper’s central message: grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions.
It happens when:
- The model is over-parameterized
- Weight decay is small
- There’s a gap between memorization speed and regularization speed
Practical implications
If you observe grokking in practice:
- Increase weight decay — the simplest fix
- Increase learning rate — speeds up both phases
- Monitor test loss longer — your model may not be done yet
- Don’t stop early — early stopping during Phase 2 kills generalization
Theoretical implications
This is the first rigorous end-to-end proof of grokking with quantitative bounds. It connects grokking to well-understood phenomena in optimization theory:
- Implicit bias of gradient descent
- Role of regularization in generalization
- Spectral properties of feature matrices
Technical Details
Over-parameterization is key
The setting requires $m \gg n$ (many more parameters than samples). In the under-parameterized regime, there’s no room for “junk” directions — the model is forced to generalize from the start.
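The contrast is concrete: when $m < n$, the rows of the data matrix already span the whole parameter space, so there is no direction for initialization noise to hide in. A quick sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 20                       # under-parameterized: n >> m
X = rng.normal(size=(n, m)) / np.sqrt(m)

# the rows of X span all of R^m, so no parameter direction is
# orthogonal to the data: gradient descent sees every coordinate
rank = np.linalg.matrix_rank(X)
print(rank == m)                     # no "junk" subspace exists
```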
Eigenvalue separation
The grokking delay depends on the spectral gap of the feature covariance matrix. Large gaps between eigenvalues create conditions where some directions converge fast (fitting training data) while others converge slowly (generalization).
Weight decay as implicit feature selection
Weight decay doesn’t just prevent overfitting — it performs implicit feature selection by slowly removing components that don’t contribute to fitting the training data. In the grokking regime, this selection happens much later than memorization.
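The end state of this selection is explicit in the ridge solution itself: it carries exactly zero weight in every direction the data never probes. A sketch (sizes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 20, 200
X = rng.normal(size=(n, m)) / np.sqrt(m)
y = rng.normal(size=n)

lam = 1e-2
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(m),
                              X.T @ y / n)

# the ridge solution lies entirely in the row space of X:
# its component orthogonal to the data is zero (up to float error)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
perp = theta_ridge - Vt.T @ (Vt @ theta_ridge)
print(np.linalg.norm(perp))          # ~0
```

Gradient descent with weight decay converges to this point; the grokking delay is just how long the orthogonal components take to get there.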
Limitations
- Linear theory: The main proofs apply to ridge regression; neural network results are empirical
- Distribution assumptions: Requires specific properties of the feature distribution
- Fixed design: Analysis focuses on fixed training data, not online learning
- Finite-width effects: Very wide networks may behave differently
Summary
Grokking — the delayed generalization phenomenon — finally has a rigorous explanation:
- Over-parameterized models can memorize training data in many directions
- Random initialization places parameters in noisy directions orthogonal to the data
- Gradient descent fits training data fast using the data-aligned directions
- Weight decay slowly removes noise in unused directions
- Generalization emerges once the noise is sufficiently reduced
The grokking delay scales as $1/\lambda$: it is governed chiefly by weight decay. No architecture changes needed, no mysterious emergent property. Just the interplay of memorization speed and regularization speed.
Sometimes the simplest explanation is the right one.
Links
- Paper: arXiv:2601.19791