In machine learning, we expect a model to either learn or overfit. What we don’t expect is for a model to overfit first and then — much later, with no changes — suddenly start generalizing well. This phenomenon is called grokking, and it has puzzled researchers since its discovery. A new paper finally explains why it happens and proves it mathematically — in the simplest possible setting.
What is Grokking?
Grokking was first observed in 2022 on small algorithmic tasks (like modular arithmetic). The pattern is striking:
- The model quickly reaches near-perfect accuracy on training data
- Test accuracy remains terrible for a long time
- Then, seemingly out of nowhere, test accuracy jumps up
It looks like the model “memorizes” first and “understands” much later. Training loss says “I’m done”, but the model keeps improving under the hood.
This raised a fundamental question: is grokking a deep mystery of neural networks, or does it have a simple explanation?
The Answer: Ridge Regression Shows the Way
The authors prove that grokking occurs even in ridge regression — the simplest possible learning setting. No deep networks, no complex architectures. Just linear regression with L2 regularization:
$$\min_\theta \frac{1}{n} \|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2$$
where $\lambda$ is the weight decay parameter.
If grokking happens here, it’s not about neural network magic — it’s about the fundamental dynamics of gradient descent with regularization.
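This claim is easy to check numerically. Below is a minimal sketch (all dimensions, seeds, and hyperparameters are illustrative choices, not the paper's): over-parameterized linear regression trained by plain gradient descent with small weight decay. Training loss collapses almost immediately, while test loss stays high for thousands of steps and only then comes down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                        # n samples, m parameters (m >> n)
X = rng.normal(size=(n, m)) / np.sqrt(m)
theta_star = np.zeros(m)
theta_star[:5] = 1.0                  # illustrative ground-truth parameters
y = X @ theta_star

X_test = rng.normal(size=(500, m)) / np.sqrt(m)
y_test = X_test @ theta_star

eta, lam = 0.5, 1e-3                  # learning rate, small weight decay
theta = rng.normal(size=m)            # random init: noise in every direction

losses = {}
for step in range(1, 20001):
    grad = (2 / n) * X.T @ (X @ theta - y) + 2 * lam * theta
    theta -= eta * grad
    if step in (200, 20000):
        losses[step] = (np.mean((X @ theta - y) ** 2),
                        np.mean((X_test @ theta - y_test) ** 2))

# by step 200 the model has memorized the training set, but test loss is
# still dominated by initialization noise in data-orthogonal directions;
# by step 20000 weight decay has shrunk that noise away
```

The three phases described in the next section are all visible in this one loop: fast training-loss collapse, a long plateau in test loss, and late generalization.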
Three Phases of Grokking
The paper formally proves three distinct phases during gradient descent training:
Phase 1: Fast Overfitting
The model quickly fits the training data. How fast? The speed is controlled by the smallest non-zero eigenvalue of $\Phi^\top \Phi$, where $\Phi$ is the empirical feature matrix:
$$t_{\text{overfit}} \leq \frac{n \cdot \ln\!\left(6b^2\|\theta^{(0)}\|_2^2 / \varepsilon\right)}{2\eta \cdot \lambda_{\min}^+(\Phi^\top \Phi)}$$
In plain terms: training loss drops fast because the model has enough parameters to memorize the data.
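To see what enters this bound, note that the non-zero eigenvalues of the large $m \times m$ matrix $\Phi^\top \Phi$ can be read off from the much smaller $n \times n$ Gram matrix. A sketch with arbitrary dimensions ($\Phi$ here is just a random Gaussian feature matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 200
Phi = rng.normal(size=(n, m)) / np.sqrt(m)

# Phi^T Phi is m x m but has rank n; its non-zero eigenvalues
# coincide with those of the n x n Gram matrix Phi @ Phi.T
gram_evals = np.linalg.eigvalsh(Phi @ Phi.T)
lam_min_plus = gram_evals.min()   # all n Gram eigenvalues are non-zero here

# the overfitting-time bound scales as 1 / lam_min_plus: a poorly
# conditioned feature matrix slows memorization down
```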
Phase 2: Prolonged Poor Generalization
Here’s the key. After overfitting, the model has learned the right answer for training points. But its parameters in directions orthogonal to the training data remain close to their initial (random) values.
These “extra” directions don’t affect training loss — the model already fits the data. But they hurt generalization because they add noise to predictions on new inputs.
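This split can be verified directly: project the parameters onto the row space of the data and its orthogonal complement. A sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 200
X = rng.normal(size=(n, m)) / np.sqrt(m)
theta = rng.normal(size=m)          # parameters still full of init noise

# orthonormal basis for the row space of X via the thin SVD
_, _, Vt = np.linalg.svd(X, full_matrices=False)   # Vt has shape (n, m)
theta_par = Vt.T @ (Vt @ theta)     # component seen by the training data
theta_perp = theta - theta_par      # component invisible to training

# the orthogonal part contributes nothing to training predictions...
print(np.abs(X @ theta_perp).max())     # ~0 up to float error
# ...but it is large, and a fresh test input generically feels it
print(np.linalg.norm(theta_perp))
```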
Phase 3: Late Generalization
Weight decay slowly shrinks all parameters, including those noisy orthogonal components. Eventually, they become small enough that the model generalizes well.
The generalization time is bounded below by:
$$t_{\text{generalize}} \geq \frac{1}{4\eta\lambda} \cdot \ln\left(\frac{(m-n)\nu^2}{2} \cdot \left(\sqrt{\frac{c}{\lambda_{\min}(\Sigma)}} + \|\theta^*\|_2\right)^{-2}\right)$$
where $\nu$ is the initialization scale and $\Sigma$ is the feature covariance matrix.
The critical factor: this scales as $1/\lambda$ — the smaller the weight decay, the longer the grokking delay.
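The $1/\lambda$ scaling has a one-line explanation: a component orthogonal to the data feels only the weight-decay term of the gradient, so it shrinks geometrically by a factor $(1 - 2\eta\lambda)$ per step. A quick sketch (all values illustrative):

```python
def steps_to_shrink(eta, lam, init=1.0, eps=1e-3):
    """Steps until a data-orthogonal component decays from `init` below `eps`.

    Such a component evolves as v *= (1 - 2*eta*lam) each step,
    because only weight decay acts on it.
    """
    steps, v = 0, init
    while abs(v) > eps:
        v *= 1 - 2 * eta * lam
        steps += 1
    return steps

eta = 0.1
t_large = steps_to_shrink(eta, lam=1e-2)
t_small = steps_to_shrink(eta, lam=1e-3)
# shrinking lam by 10x stretches the delay by roughly 10x
print(t_large, t_small)
```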
Why Does This Happen?
The intuition is surprisingly clean:
- Over-parameterization ($m \gg n$): The model has many more parameters than data points
- Random initialization: Parameters start at random values in all directions
- Fast memorization: The model quickly finds parameters that fit training data
- Slow regularization: Weight decay slowly removes the “junk” in unused directions
The gap between Phase 1 and Phase 3 is the grokking delay. It exists because memorization is fast (depends on data) but regularization is slow (depends on weight decay).
Controlling Grokking
The paper’s most practical insight: grokking is controllable through hyperparameters.
| Parameter | Effect on Grokking |
|---|---|
| Weight decay $\lambda$ ↑ | Shorter delay (faster regularization) |
| Weight decay $\lambda$ ↓ | Longer delay (can be arbitrarily long) |
| Learning rate $\eta$ ↑ | Shorter delay |
| Over-parameterization ↑ | More pronounced grokking |
| Initialization scale ↑ | More pronounced grokking |
You can make grokking disappear by increasing weight decay. Or make it arbitrarily long by decreasing it.
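The first two table rows can be checked empirically by measuring when test loss first drops below a threshold for two values of $\lambda$. A sketch reusing the over-parameterized linear setup (all sizes, seeds, and thresholds are illustrative):

```python
import numpy as np

def generalization_step(lam, eta=0.5, threshold=0.1, max_steps=100000):
    """First GD step at which test loss drops below `threshold`."""
    rng = np.random.default_rng(0)        # same data for every lam
    n, m = 20, 200
    X = rng.normal(size=(n, m)) / np.sqrt(m)
    theta_star = np.zeros(m)
    theta_star[:5] = 1.0
    y = X @ theta_star
    X_te = rng.normal(size=(500, m)) / np.sqrt(m)
    y_te = X_te @ theta_star
    theta = rng.normal(size=m)
    for step in range(1, max_steps + 1):
        theta -= eta * ((2 / n) * X.T @ (X @ theta - y) + 2 * lam * theta)
        if np.mean((X_te @ theta - y_te) ** 2) < threshold:
            return step
    return max_steps

t_big = generalization_step(lam=1e-2)
t_tiny = generalization_step(lam=1e-3)
# larger weight decay -> much shorter grokking delay
print(t_big, t_tiny)
```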
Beyond Ridge Regression
The theory is proven for ridge regression, but the authors validate it experimentally on neural networks:
Random-feature networks
Two-layer ReLU networks with a frozen hidden layer (effectively ridge regression in feature space). The theoretical predictions match the observed training dynamics precisely.
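A frozen-first-layer network really is ridge regression on random ReLU features, which is why the theory transfers directly. A sketch of that equivalence, with illustrative sizes and an arbitrary target (the closed-form ridge solution stands in for gradient-descent training of the output layer):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, width = 50, 5, 400
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                  # an arbitrary nonlinear target

W = rng.normal(size=(d, width))      # first layer: random and frozen
Phi = np.maximum(X @ W, 0.0)         # hidden ReLU activations = fixed features

# training only the output layer with weight decay is exactly
# ridge regression on Phi
lam = 1e-4
theta = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(width),
                        Phi.T @ y / n)
train_mse = np.mean((Phi @ theta - y) ** 2)
```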
Full neural networks
Two-layer networks with both layers trained. The grokking behavior qualitatively matches the theoretical predictions — the same hyperparameter dependencies hold.
This suggests the mechanism is universal: it’s not about the model architecture, but about the dynamics of gradient descent with regularization in an over-parameterized setting.
What This Means
Grokking is not mysterious
The paper’s central message: grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions.
It happens when:
- The model is over-parameterized
- Weight decay is small
- There’s a gap between memorization speed and regularization speed
Practical implications
If you observe grokking in practice:
- Increase weight decay — the simplest fix
- Increase learning rate — speeds up both phases
- Monitor test loss longer — your model may not be done yet
- Don’t stop early — early stopping during Phase 2 kills generalization
Theoretical implications
This is the first rigorous end-to-end proof of grokking with quantitative bounds. It connects grokking to well-understood phenomena in optimization theory:
- Implicit bias of gradient descent
- Role of regularization in generalization
- Spectral properties of feature matrices
Technical Details
Over-parameterization is key
The setting requires $m \gg n$ (many more parameters than samples). In the under-parameterized regime, there’s no room for “junk” directions — the model is forced to generalize from the start.
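The contrast is concrete: when $m < n$, the rows of the data matrix already span the whole parameter space, so there is no direction for initialization noise to hide in. A quick sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 20                       # under-parameterized: n >> m
X = rng.normal(size=(n, m)) / np.sqrt(m)

# the rows of X span all of R^m, so no parameter direction is
# orthogonal to the data: gradient descent sees every coordinate
rank = np.linalg.matrix_rank(X)
print(rank == m)                     # no "junk" subspace exists
```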
Eigenvalue separation
The grokking delay depends on the spectral gap of the feature covariance matrix. Large gaps between eigenvalues create conditions where some directions converge fast (fitting training data) while others converge slowly (generalization).
Weight decay as implicit feature selection
Weight decay doesn’t just prevent overfitting — it performs implicit feature selection by slowly removing components that don’t contribute to fitting the training data. In the grokking regime, this selection happens much later than memorization.
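The end state of this selection is explicit in the ridge solution itself: it carries exactly zero weight in every direction the data never probes. A sketch (sizes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 20, 200
X = rng.normal(size=(n, m)) / np.sqrt(m)
y = rng.normal(size=n)

lam = 1e-2
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(m),
                              X.T @ y / n)

# the ridge solution lies entirely in the row space of X:
# its component orthogonal to the data is zero (up to float error)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
perp = theta_ridge - Vt.T @ (Vt @ theta_ridge)
print(np.linalg.norm(perp))          # ~0
```

Gradient descent with weight decay converges to this point; the grokking delay is just how long the orthogonal components take to get there.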
Limitations
- Linear theory: The main proofs apply to ridge regression; neural network results are empirical
- Distribution assumptions: Requires specific properties of the feature distribution
- Fixed design: Analysis focuses on fixed training data, not online learning
- Finite-width effects: Very wide networks may behave differently
Summary
Grokking — the delayed generalization phenomenon — finally has a rigorous explanation:
- Over-parameterized models can memorize training data in many directions
- Random initialization places parameters in noisy directions orthogonal to the data
- Gradient descent fits training data fast using the data-aligned directions
- Weight decay slowly removes noise in unused directions
- Generalization emerges once the noise is sufficiently reduced
The grokking delay scales as $1/\lambda$: it is governed chiefly by weight decay. No architecture changes needed, no mysterious emergent property. Just the interplay of memorization speed and regularization speed.
Sometimes the simplest explanation is the right one.
Links
- Paper: arXiv:2601.19791