In many real-world tasks—like forecasting the paths of cars at a busy intersection, coordinating fleets of delivery robots, or simulating pedestrian movement—models must reason about not just where things are, but how they face or rotate relative to each other. That’s the SE(2) geometry: 2D position + heading.
Traditional Transformer models that account for rotation and translation invariance (SE(2)-invariant) need to compute relative poses between every pair of objects. If you have $n$ objects, this leads to memory cost growing like $O(n^2)$—which becomes prohibitively expensive when $n$ is large.
🚀 What this paper brings
The new method from Ethan Pronovost et al. introduces an attention mechanism that:
- Is truly SE(2)-invariant (unchanged under global translations and rotations of the scene),
- Uses only linear memory in the number of objects ($O(n)$ instead of $O(n^2)$).
How? By approximating the relative pose encoding (translation + rotation) with a truncated Fourier series whose approximation error is kept below $10^{-3}$. That encoding is baked into the attention computation itself, so the model never has to explicitly compare every pair.
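To make the key trick concrete, here is a minimal NumPy sketch (not the paper's exact construction) of the heading component: a smooth $2\pi$-periodic encoding of the *relative* heading is approximated by a truncated Fourier series, and each harmonic is factored into per-object features via the angle-difference identities, so the pairwise encoding is recovered as a plain dot product. The function names `fourier_coeffs` and `approx_pairwise` and the example encoding `f` are illustrative assumptions.

```python
import numpy as np

def fourier_coeffs(f, n_harmonics, n_grid=4096):
    """Numerical Fourier coefficients of a 2*pi-periodic function f."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    vals = f(theta)
    a0 = vals.mean()
    a = np.array([2.0 * (vals * np.cos(m * theta)).mean()
                  for m in range(1, n_harmonics + 1)])
    b = np.array([2.0 * (vals * np.sin(m * theta)).mean()
                  for m in range(1, n_harmonics + 1)])
    return a0, a, b

# Illustrative relative-heading encoding: smooth, peaks when headings align.
f = lambda t: np.exp(np.cos(t))
a0, a, b = fourier_coeffs(f, n_harmonics=8)

def approx_pairwise(theta_q, theta_k):
    """f(theta_q - theta_k) for all pairs, via per-object features only.

    Uses cos(m*(p-q)) = cos(mp)cos(mq) + sin(mp)sin(mq) (and the sine
    analogue), so each object stores just 2*n_harmonics numbers and the
    pairwise table is an ordinary matrix product of those features.
    """
    m = np.arange(1, len(a) + 1)
    cq, sq = np.cos(np.outer(theta_q, m)), np.sin(np.outer(theta_q, m))
    ck, sk = np.cos(np.outer(theta_k, m)), np.sin(np.outer(theta_k, m))
    phi_q = np.concatenate([cq, sq], axis=1)                  # query features
    psi_k = np.concatenate([a * ck - b * sk, a * sk + b * ck], axis=1)  # key features
    return a0 + phi_q @ psi_k.T
```

With 8 harmonics the maximum error against the exact pairwise evaluation of this `f` is comfortably below $10^{-3}$, matching the error budget quoted above; the same factorization idea is what lets the encoding ride inside the attention dot product instead of an $n \times n$ table.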
🧠 How they do it in practice
Instead of storing a big $n \times n$ matrix of relative poses, they embed each object’s pose into keys and queries in such a way that dot‑product attention implicitly factors in geometry:
$$ \text{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T + \text{SE2Enc}(Q,K)}{\sqrt{d_k}}\right) V $$
Thanks to the Fourier-based encoding, $\text{SE2Enc}$ can be computed in linear memory, without ever materializing the $n \times n$ table of relative poses.
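The "geometry baked into keys and queries" idea can be sketched as follows, assuming a RoPE-style treatment of the heading angle (the translation part is handled analogously by the paper's Fourier construction, but is omitted here). Each agent's heading rotates consecutive 2D feature pairs of its query and key vectors, so the dot product depends only on *relative* headings; `rotate_pairs`, `scores`, and `freqs` are illustrative names, not the paper's API.

```python
import numpy as np

def rotate_pairs(x, theta, freqs):
    """Rotate consecutive 2D feature pairs of x by freqs[j] * theta.

    x:     (n, d) features, d even; pair j is (x[:, 2j], x[:, 2j+1]).
    theta: (n,) heading of each agent.
    freqs: (d // 2,) rotation frequency per feature pair.
    """
    n, d = x.shape
    xp = x.reshape(n, d // 2, 2)
    ang = np.outer(theta, freqs)            # (n, d//2) rotation angles
    c, s = np.cos(ang), np.sin(ang)
    out = np.stack([c * xp[..., 0] - s * xp[..., 1],
                    s * xp[..., 0] + c * xp[..., 1]], axis=-1)
    return out.reshape(n, d)

def scores(q, k, theta_q, theta_k, freqs):
    """Attention logits; each pairwise score depends only on theta_q - theta_k."""
    return rotate_pairs(q, theta_q, freqs) @ rotate_pairs(k, theta_k, freqs).T
```

Because each per-pair rotation cancels to a function of the heading *difference*, adding the same offset to every agent's heading (rotating the whole scene) leaves all attention scores unchanged, while memory stays $O(n)$: only the rotated per-agent vectors are stored, never a pairwise pose matrix.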
🌍 Everyday analogy
Imagine you and a friend are at a crowded train station with many travelers, and you want to anticipate where everyone will move next. Usually, you would have to consider the direction and position of every single person relative to every other person—that’s quadratic scaling. The SE(2)-Invariant Linear Attention method, in contrast, is like each person carrying a compact summary of their orientation and location in a “portable code.” You glance at each passerby’s code and instantly infer interactions without comparing every pair—making your mental model vastly more scalable.
📈 Why it matters
- Handles scenes with many agents (e.g. self-driving car environments, delivery drone swarms).
- Maintains geometric consistency regardless of how the scene is rotated or shifted.
- Efficient enough for real-time inference even when dozens or hundreds of agents are involved.
⚡ Results
Pronovost et al. show that the method:
- Outperforms non-invariant models, especially on tasks involving prediction or planning among moving agents.
- Beats prior SE(2)-invariant methods that require quadratic memory, like GTA or RoPE-style encodings.
- Earned the Best Paper Award at the RSS 2025 Equivariant Systems Workshop—validation from the community.
🔍 Pitfalls & future directions
- The Fourier approximation introduces a small error (<$10^{-3}$), trading precision for efficiency.
- Currently limited to 2D motion (SE(2)); it does not yet handle full 3D rotational invariance (SE(3)), which remains a challenge for future work.
📎 Links
- Based on the publication 📄 arXiv:2507.18597