To Grok Grokking: Why Neural Networks Sometimes Understand Late

In machine learning, we expect a model either to learn or to overfit. What we don't expect is for a model to overfit first and then, much later and with no changes to the training setup, suddenly start generalizing well. This phenomenon is called grokking, and it has puzzled researchers since its discovery. A new paper finally explains why it happens and proves it mathematically in the simplest possible setting.

What is Grokking?

Grokking was first observed in 2022 on small algorithmic tasks, such as modular arithmetic. The pattern is striking: ...
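As an illustration, a modular-arithmetic task of the kind used in the original grokking experiments can be generated in a few lines. This is a minimal sketch; the modulus, train fraction, and function names here are assumptions for illustration, not the paper's exact setup:

```python
import random

def modular_addition_dataset(p=97, train_frac=0.3, seed=0):
    """Build the (a + b) mod p task often used in grokking studies.

    Each example maps a pair (a, b) to (a + b) % p. With a small train
    fraction, a network can memorize the train split long before it
    starts generalizing to the held-out pairs.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    split = int(train_frac * len(pairs))
    train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
    test = [((a, b), (a + b) % p) for a, b in pairs[split:]]
    return train, test

train, test = modular_addition_dataset()
print(len(train), len(test))  # 2822 6587
```

Tracking train and test accuracy separately on such a split is what reveals the grokking pattern: train accuracy saturates early while test accuracy stays near chance for a long time before jumping.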

January 27, 2026