Associative memory is the ability to store patterns and retrieve them when presented with partial or noisy inputs. Inspired by how the human brain recalls memories, associative memory models are recurrent neural networks that converge to stored patterns over time. The tutorial ‘Modern Methods in Associative Memory’ by Krotov et al. offers an accessible overview for newcomers and a rigorous mathematical treatment for experts, bridging classical ideas with cutting-edge developments in deep learning.

Classical Hopfield Networks

Introduced in 1982 by John Hopfield, the Hopfield network uses binary neurons $s_i \in \{-1,+1\}$ and symmetric weights $w_{ij}$. The network energy is

$$ E(s) = -\frac{1}{2}\sum_{i,j} w_{ij}\, s_i s_j $$

and the asynchronous update rule

$$ s_i \to \operatorname{sign}\Bigl(\sum_j w_{ij}\, s_j\Bigr) $$

guarantees that $E$ decreases with each update. Stored patterns $\{\xi^\mu\}_{\mu=1}^{P}$ are embedded via the Hebbian rule

$$ w_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu $$

However, capacity is limited to $\alpha_c N$ patterns with $\alpha_c \approx 0.14$.
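To make the storage and retrieval rules concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not from the tutorial) that stores a few random patterns with the Hebbian rule and recovers one of them from a corrupted cue using asynchronous sign updates:

```python
import numpy as np

def store(patterns):
    """Hebbian weights w_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with no self-connections."""
    P, N = patterns.shape
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s):
    """Hopfield energy E(s) = -1/2 * s^T W s."""
    return -0.5 * s @ W @ s

def retrieve(W, s, sweeps=10, seed=0):
    """Asynchronous updates s_i <- sign(sum_j w_ij s_j), visiting neurons in random order."""
    rng = np.random.default_rng(seed)
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Store 3 random +-1 patterns over N = 100 neurons, then recall from a noisy cue.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 100))
W = store(patterns)
noisy = patterns[0] * rng.choice([1, -1], size=100, p=[0.8, 0.2])  # flip ~20% of bits
print("energy before:", energy(W, noisy))
recalled = retrieve(W, noisy)
print("energy after:", energy(W, recalled))
print("overlap with stored pattern:", recalled @ patterns[0] / 100)
```

With only three patterns over a hundred neurons the network is far below the $\alpha_c N$ limit, so the energy drops with the updates and the corrupted cue falls back into the correct basin of attraction.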

Dense and Modern Hopfield Networks

To overcome capacity limits, researchers extended interactions to higher orders. In Dense Associative Memory, the energy takes the form

$$ E(s) = -\sum_{\mu=1}^{P} F\Bigl(\tfrac{1}{N}\sum_{i=1}^N \xi_i^\mu s_i\Bigr) $$

where $F$ is a non-quadratic function (e.g., a polynomial of degree $d$), boosting capacity to $O(N^{d-1})$. The modern Hopfield network uses

$$ E(s) = -\frac{1}{\beta}\log\sum_{\mu=1}^{P} \exp\Bigl(\beta\sum_{i=1}^N \xi_i^\mu s_i\Bigr) $$

which yields the update

$$ s_i^{\text{new}} = \sum_{\mu=1}^{P} \xi_i^\mu \,\operatorname{softmax}_{\mu}\!\bigl(\beta\, \xi^\mu \!\cdot s\bigr) $$

where the softmax normalizes over the pattern index $\mu$. This continuous model achieves storage capacity exponential in the number of neurons, $O(e^N)$.
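A single retrieval step of this model is easy to write out explicitly. The sketch below (again with illustrative names, assuming the stored patterns are the rows of a matrix) performs one softmax-weighted update of a noisy query:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def modern_hopfield_update(xi, s, beta=8.0):
    """One retrieval step: s_new = sum_mu softmax_mu(beta * xi^mu . s) * xi^mu.
    xi has shape (P, N): one stored pattern per row."""
    weights = softmax(beta * xi @ s)   # similarity of the state to each stored pattern
    return weights @ xi                # convex combination of stored patterns

# P = 50 random patterns in N = 64 dimensions; one update from a noisy query.
rng = np.random.default_rng(0)
xi = rng.choice([-1.0, 1.0], size=(50, 64))
query = xi[7] + 0.5 * rng.standard_normal(64)   # corrupted version of pattern 7
recalled = modern_hopfield_update(xi, query)
print("cosine similarity to pattern 7:",
      recalled @ xi[7] / (np.linalg.norm(recalled) * np.linalg.norm(xi[7])))
```

Because the softmax concentrates sharply on the best-matching pattern when $\beta$ is large, a single step is typically enough to land essentially on the stored memory.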

Lagrangian Framework

A key contribution of the tutorial is the Lagrangian perspective. By introducing auxiliary memory variables $h_\mu$ and convex-conjugate (Legendre) pairs, one defines Lagrangian functions whose gradients determine the neuron activations; the resulting energy and its descent dynamics yield the modern update rules in a principled way, unifying energy minimization and gradient dynamics.
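As a worked sketch of this construction (following the large-associative-memory form of Krotov and Hopfield; the symbols below are illustrative and the tutorial's own notation may differ), one pairs the feature neurons $s_i$ with memory neurons $h_\mu$, picks Lagrangians $L_s$ and $L_h$, and takes activations to be their gradients:

$$ g_i = \frac{\partial L_s}{\partial s_i}, \qquad f_\mu = \frac{\partial L_h}{\partial h_\mu}, \qquad E = \Bigl(\sum_i s_i g_i - L_s\Bigr) + \Bigl(\sum_\mu h_\mu f_\mu - L_h\Bigr) - \sum_{\mu,i} f_\mu\, \xi_i^\mu\, g_i $$

The dynamics $\tau\,\dot h_\mu = \sum_i \xi_i^\mu g_i - h_\mu$ and $\tau\,\dot s_i = \sum_\mu \xi_i^\mu f_\mu - s_i$ can be shown to decrease $E$; choosing $L_h = \tfrac{1}{\beta}\log\sum_\mu e^{\beta h_\mu}$ turns $f$ into a softmax and recovers the modern Hopfield update above.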

Connections to Transformers and Diffusion Models

Remarkably, the modern Hopfield update aligns with the scaled dot-product attention used in Transformers: queries and keys correspond to the current state and stored patterns, and the softmax weights implement associative retrieval. Similarly, diffusion models, which iteratively denoise samples, can be seen as continuous energy minimization akin to associative updates.
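To see the correspondence in code, the sketch below (illustrative names; a single head with no learned projections) computes one step of associative retrieval as a plain dot-product-attention call, with the stored patterns serving as both keys and values and $\beta$ playing the role of the usual $1/\sqrt{d}$ scaling:

```python
import numpy as np

def attention(Q, K, V, scale):
    """Scaled dot-product attention: softmax(scale * Q K^T) V.
    Transformers use scale = 1/sqrt(d); the Hopfield update uses scale = beta."""
    scores = scale * (Q @ K.T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Associative retrieval as one attention call: the query is the current state,
# and the stored patterns act as both keys and values.
rng = np.random.default_rng(1)
patterns = rng.choice([-1.0, 1.0], size=(32, 64))     # P = 32 patterns, N = 64 neurons
state = patterns[3] + 0.4 * rng.standard_normal(64)   # corrupted query

retrieved = attention(state[None, :], patterns, patterns, scale=1.0)[0]
print("recovered pattern 3:", bool(np.all(np.sign(retrieved) == patterns[3])))
```

Adding learned query/key/value projections, multiple heads, and residual connections on top of this single retrieval step is what turns it into a full Transformer attention layer.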