A robot that can fold laundry, pack medication, and pour tea - controlled by a single model - sounds like science fiction. But it’s exactly what’s needed for real deployment. The problem? The best robot controllers are either closed-source (π0.5), too slow (reasoning models that generate hundreds of tokens before moving), or tied to hardware most labs can’t afford. MolmoAct2 (Fang, Duan et al., Allen AI / UW / Stanford / NVIDIA / MIT, May 2026) tackles all of these at once: it’s fully open (weights, code, data), runs at 55.79 Hz, deploys on platforms costing under $6,000, and achieves 97.2% success on LIBERO - ahead of the open and closed baselines it is compared against. The secret? Let the robot’s action generator peek into the language model’s brain at every layer, not just the final output.
The Problem: Why Current Robot Controllers Fall Short
Vision-Language-Action (VLA) models represent the dream of robotics: one model that maps pixels and language to motor commands across any task. But in 2026, the field faces four compounding failures:
Closed frontier models - π0.5, Gemini Robotics, and others release at most weights. Training data, recipes, and full pipelines remain proprietary. You can’t reproduce, adapt, or audit them.
Reasoning overhead - Models that “think” before acting (chain-of-thought, world-model rollouts) generate hundreds of tokens before emitting a single motor command. This destroys real-time control.
Hardware lock-in - The few deployable open VLAs require expensive platforms (e.g., Franka arms at $30k+). Academic labs and independent researchers are locked out.
Brittle generalization - Zero-shot performance remains too unreliable for production. Fine-tuning helps but doesn’t close the gap to dependable deployment.
MolmoAct2 tackles each axis simultaneously.
What MolmoAct2 Does Differently - Five Axes
The architecture advances over its predecessor MolmoAct along five dimensions:
| Axis | What’s New |
|---|---|
| Backbone | Molmo2-ER - specialized for spatial & embodied reasoning |
| Data | 720h bimanual YAM + filtered DROID + SO-100/101 (largest open bimanual dataset) |
| Tokenizer | OpenFAST - open-weight action tokenizer, 5 embodiments |
| Architecture | Flow-matching expert with per-layer KV-cache conditioning |
| Reasoning | MolmoAct2-Think - adaptive depth tokens for changed regions only |
Let’s unpack each.
Molmo2-ER: A VLM That Understands Space
General-purpose VLMs (GPT-5, Gemini) are trained on web-scale image-text data. They’re excellent at “what is in this image?” but terrible at “how far is object A from object B?” or “if I move forward, will I hit the table?” These are exactly the questions a robot policy needs answered.
Molmo2-ER is Molmo2 specialized for embodied reasoning through a two-stage recipe:
Stage 1 - Embodied specialization (20K steps): Starting from the Molmo2 checkpoint, fine-tune on a 3.3M-sample corpus covering six capability pillars: image embodied QA, video embodied QA, pixel-accurate pointing, object detection, multi-image/ego-exo correspondence, and abstract reasoning. Add 8% text-only data to preserve language competence.
Stage 2 - Joint refinement (1.5K steps): Interleave the embodied corpus with Molmo2’s original multimodal data. Sweep the embodied/general ratio, finding p=0.5 optimal.
The result? Molmo2-ER scores 63.8% across 13 embodied reasoning benchmarks - surpassing GPT-5 (57.9%), Gemini Robotics ER-1.5 Thinking (61.3%), and its own base model Molmo2 by +17 points.
The Architecture: From Discrete Tokens to Continuous Control
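MolmoAct2 follows a three-stage training pipeline: pre-training → post-training → fine-tuning. The key idea is to bridge discrete language modeling and continuous robot control: the VLM first learns to predict discrete action tokens, and a flow-matching action expert attached to it then produces continuous trajectories.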
OpenFAST: Turning Robot Motions into Tokens
Robot actions are continuous (joint angles, end-effector positions) and embodiment-specific (7-DOF arm vs. 14-DOF bimanual). You can’t just shove them into a language model. OpenFAST solves this:
- Take a 1-second action trajectory
- Apply a frequency-domain transform (DCT-like)
- Quantize the coefficients
- Apply byte-pair encoding → compact discrete token sequence from a 2048-token vocabulary
Trained on 1M trajectories across 5 embodiments (YAM, SO-100/101, DROID Franka, BC-Z, BridgeData V2), OpenFAST is the first fully open action tokenizer with transparent training data.
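The first three tokenization steps can be illustrated with a small NumPy sketch. This is not OpenFAST's actual code: the function names, the orthonormal DCT-II, and the `scale` quantization factor are assumptions chosen for illustration, and the byte-pair encoding stage that would compress the resulting integer sequence is left out.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is the inverse."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)           # rescale the DC row for orthonormality
    return basis

def encode_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Map a (T, D) action chunk to integer DCT coefficients (hypothetical scale)."""
    coeffs = dct_matrix(actions.shape[0]) @ actions    # transform along time
    return np.round(coeffs * scale).astype(np.int64)   # uniform quantization

def decode_chunk(codes: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Invert encode_chunk up to quantization error."""
    return dct_matrix(codes.shape[0]).T @ (codes / scale)

# Toy 1-second trajectory: 30 time steps, 7 action dimensions.
t = np.linspace(0.0, 1.0, 30)[:, None]
actions = np.sin(np.pi * t * np.arange(1, 8)[None, :])

codes = encode_chunk(actions)          # integers that a BPE stage would compress
recon = decode_chunk(codes)
max_err = float(np.abs(recon - actions).max())
```

A larger `scale` keeps more precision at the cost of a wider range of integer symbols, which is the trade-off a learned BPE vocabulary then has to absorb.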
During pre-training (200K steps), MolmoAct2 simply predicts these discrete action tokens alongside text - a unified next-token objective. 90% robot data, 10% multimodal.
Per-Layer KV Connection: The Key Innovation
Here’s where things get interesting. Pre-training gives you a VLM that can predict discrete actions. But for deployment, you need continuous control - smooth, precise trajectories. This is where the flow-matching action expert comes in during post-training.
The expert is a DiT-style network (same depth as the VLM: 36 layers) that generates continuous action trajectories via flow matching. The critical question: how does the expert access the VLM’s understanding?
Previous approach (GR00T N1.7, others): condition on the VLM’s final hidden states. This gives the action expert only a summary - like reading only the conclusion of a paper.
MolmoAct2’s approach: tap into the VLM’s KV cache at every layer:
$$\tilde{K}_\ell = \text{reshape}(P_K K^{\text{vlm}}_\ell), \quad \tilde{V}_\ell = \text{reshape}(P_V V^{\text{vlm}}_\ell) \tag{1}$$
where:
- $K^{\text{vlm}}_\ell, V^{\text{vlm}}_\ell$ - keys and values from VLM self-attention at layer $\ell$
- $P_K, P_V$ - learned linear projections aligning VLM dimensions to expert width
- $\tilde{K}_\ell, \tilde{V}_\ell$ - projected tensors used by the expert’s cross-attention
Interpretation: Early VLM layers encode low-level spatial features (edges, positions, distances). Later layers encode high-level semantics (object identity, task understanding). By cross-attending at every layer, the action expert gets the full hierarchy - from pixel-level geometry to abstract task comprehension.
The cross-attention at each expert block:
$$\text{CA}(Q_\ell, \tilde{K}_\ell, \tilde{V}_\ell) = \text{softmax}\left(\frac{Q_\ell \tilde{K}_\ell^\top}{\sqrt{d_h}}\right)\tilde{V}_\ell \tag{2}$$
where $Q_\ell$ is the expert’s query at layer $\ell$ and $d_h$ is the head dimension.
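Equations (1) and (2) can be traced with a toy NumPy sketch. All sizes, the random tensors, and the helper names (`project_kv`, `cross_attention`) are assumptions for illustration. The code uses a row-vector convention, so the projection is written `K @ P` rather than `P K`, and whether the projections are shared across layers is not specified here, so the sketch simply draws one pair per layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_kv(kv_vlm: np.ndarray, proj: np.ndarray, n_heads: int) -> np.ndarray:
    """Eq. (1): (S, d_vlm) -> (n_heads, S, d_h) via a linear map and a reshape."""
    s = kv_vlm.shape[0]
    out = kv_vlm @ proj                        # align VLM width to expert width
    d_h = out.shape[1] // n_heads
    return out.reshape(s, n_heads, d_h).transpose(1, 0, 2)

def cross_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Eq. (2): q is (H, A, d_h); k and v are (H, S, d_h)."""
    d_h = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    return softmax(scores, axis=-1) @ v

# Toy sizes (hypothetical): 3 layers, 5 VLM tokens, 4 action tokens.
n_layers, S, A = 3, 5, 4
d_vlm, d_exp, H = 16, 8, 2

outputs = []
for layer in range(n_layers):
    k_vlm = rng.normal(size=(S, d_vlm))       # stand-in for the cached VLM keys
    v_vlm = rng.normal(size=(S, d_vlm))       # stand-in for the cached VLM values
    P_K = rng.normal(size=(d_vlm, d_exp))
    P_V = rng.normal(size=(d_vlm, d_exp))
    q = rng.normal(size=(H, A, d_exp // H))   # expert queries at this layer
    k_tilde = project_kv(k_vlm, P_K, H)
    v_tilde = project_kv(v_vlm, P_V, H)
    outputs.append(cross_attention(q, k_tilde, v_tilde))
```

Each expert layer reads from the VLM layer at the same depth, which is why the expert is built with the same number of layers as the VLM.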
Crucially, during post-training the KV cache is detached - the flow loss trains only the expert and its projections, not the VLM. This “knowledge insulation” prevents the continuous objective from corrupting the VLM’s representations. During fine-tuning, this constraint is relaxed.
Flow Matching: Learning to Denoise Actions
The expert learns via flow matching. Given a target action chunk $a$ and noise $\epsilon \sim \mathcal{N}(0, I)$:
$$x_t = (1-t)\epsilon + ta, \quad u^\star = a - \epsilon \tag{3}$$
$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{a,\epsilon,t}\left[\|m \odot (f_\theta(x_t, t, c) - u^\star)\|_2^2\right] \tag{4}$$
where:
- $x_t$ - noisy action at flow time $t \in [0,1]$
- $u^\star$ - target velocity (direction from noise to data)
- $f_\theta$ - the action expert
- $c$ - VLM context (task, observations, states)
- $m$ - mask zeroing padded dimensions
Interpretation: The expert sees a partially-noisy version of the target action and learns to predict which direction to “push” to recover the clean action. At inference, start from pure noise and integrate the predicted velocity field - out comes a continuous trajectory.
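To make the inference step concrete, here is a minimal NumPy sketch of Equation (3) plus Euler integration. The trained expert $f_\theta$ is replaced by a hypothetical `oracle_velocity` that already knows the target chunk, so this only demonstrates the mechanics of the interpolant and the integrator, not a learned policy. The step count and array shapes are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 7))        # toy target action chunk
eps = rng.normal(size=a.shape)      # Gaussian noise sample

def interpolate(a: np.ndarray, eps: np.ndarray, t: float) -> np.ndarray:
    """x_t = (1 - t) * eps + t * a, as in Eq. (3)."""
    return (1.0 - t) * eps + t * a

u_star = a - eps                    # target velocity from Eq. (3)

def oracle_velocity(x: np.ndarray, t: float, a: np.ndarray) -> np.ndarray:
    """Stand-in for f_theta: the exact velocity toward one known target."""
    return (a - x) / (1.0 - t)

def euler_sample(velocity_fn, x0: np.ndarray, n_steps: int) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

sample = euler_sample(lambda x, t: oracle_velocity(x, t, a), eps, n_steps=10)
```

On the straight path between `eps` and `a`, the oracle velocity equals `u_star` at every `t`, so the Euler steps land on `a`. A trained expert only approximates this field, which is why the number of integration steps becomes a quality and latency knob.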
For denser supervision, each example is evaluated at $K$ independent noise levels:
$$\mathcal{L}_{\text{flow}}(a, c) = \frac{1}{K}\sum_{i=1}^{K}\|m \odot (f_\theta(x_{t_i}, t_i, c) - (a - \epsilon_i))\|_2^2 \tag{5}$$
MolmoAct2 uses $K=4$ in post-training and $K=8$ in fine-tuning. Ablations on LIBERO show $K=8$ reaching 95.9% versus 94.15% for $K=1$.
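The multi-sample loss in Equation (5) can be sketched in a few lines of NumPy. The uniform sampling of $t$ is an assumption (the text above does not state the time distribution), and `f_theta` here is any callable standing in for the action expert. The two toy predictors are hypothetical and exist only to exercise the loss.

```python
import numpy as np

def flow_loss(f_theta, a, c, mask, K, rng):
    """Average the masked squared error over K independent (eps, t) draws."""
    losses = []
    for _ in range(K):
        eps = rng.normal(size=a.shape)
        t = rng.uniform(0.0, 1.0)              # assumed: t ~ U[0, 1)
        x_t = (1.0 - t) * eps + t * a
        err = mask * (f_theta(x_t, t, c) - (a - eps))
        losses.append(np.sum(err ** 2))        # squared L2 norm
    return float(np.mean(losses))

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 14))                  # chunk padded to 14 dims
mask = np.zeros_like(a)
mask[:, :7] = 1.0                              # only the first 7 dims are real

# Toy predictors; the context c is simply the target chunk here.
zero_predictor = lambda x_t, t, c: np.zeros_like(x_t)
exact_predictor = lambda x_t, t, c: (c - x_t) / (1.0 - t)
wrong_on_padding = lambda x_t, t, c: exact_predictor(x_t, t, c) + 100.0 * (1.0 - mask)

loss_zero = flow_loss(zero_predictor, a, a, mask, K=8, rng=rng)
loss_exact = flow_loss(exact_predictor, a, a, mask, K=8, rng=rng)
loss_padding = flow_loss(wrong_on_padding, a, a, mask, K=8, rng=rng)
```

The third predictor is badly wrong on the padded dimensions yet scores the same as the exact one, which is the point of the mask $m$: embodiments with fewer degrees of freedom are not penalized for unused action slots.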
MolmoAct2-Think: Adaptive Depth Reasoning
Robot manipulation depends on spatial information - object distance, free space, occlusion - that standard behavior cloning never explicitly represents. MolmoAct2-Think adds an intermediate step: before acting, predict a compact depth map.
The Depth Representation
Each observation’s depth map is quantized into a 10×10 grid of depth codes: 100 tokens in total, each drawn from 128 possible values. These are predicted autoregressively, just like text tokens, and their KV cache then conditions the action expert.
Why Adaptive?
Here’s the efficiency trick. Robot trajectories contain massive temporal redundancy: most of the scene doesn’t change between consecutive 30 Hz frames. So why re-predict 100 depth tokens every step?
MolmoAct2-Think maintains a depth buffer and only re-predicts cells where the RGB content changed:
$$m_{t,i} = \mathbf{1}[\cos(x_{t,i}, x_{t-1,i}) < 0.996] \tag{6}$$
where $x_{t,i}$ is the RGB patch at grid cell $i$, time $t$. Only cells with $m_{t,i} = 1$ trigger fresh depth prediction. Static cells reuse cached codes for free.
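A minimal NumPy sketch of the change mask in Equation (6) and the buffer update might look as follows. The patch size, the flattened patch layout, and the `predict_fn` callback (a stand-in for the autoregressive depth decoder) are assumptions for illustration.

```python
import numpy as np

def changed_cells(prev_patches, curr_patches, threshold=0.996):
    """Eq. (6): flag cells whose patch cosine similarity drops below threshold.

    Both inputs have shape (n_cells, patch_dim), one flattened RGB patch per cell.
    """
    num = np.sum(prev_patches * curr_patches, axis=1)
    den = np.linalg.norm(prev_patches, axis=1) * np.linalg.norm(curr_patches, axis=1)
    cos = num / np.maximum(den, 1e-8)          # guard against all-zero patches
    return cos < threshold

def update_depth_buffer(buffer, mask, predict_fn):
    """Re-predict depth codes only for flagged cells; reuse the rest."""
    new_buffer = buffer.copy()
    idx = np.flatnonzero(mask)
    new_buffer[idx] = predict_fn(idx)
    return new_buffer

rng = np.random.default_rng(0)
prev = rng.uniform(0.0, 1.0, size=(100, 48))   # 10x10 grid, 4x4x3 toy patches
curr = prev.copy()
curr[[3, 57]] = rng.uniform(0.0, 1.0, size=(2, 48))   # two cells change

mask = changed_cells(prev, curr)
depth_buffer = np.zeros(100, dtype=np.int64)
# Hypothetical predictor: pretend the decoder emits code 127 for every changed cell.
depth_buffer = update_depth_buffer(depth_buffer, mask, lambda idx: np.full(idx.shape, 127))
```

In this toy frame only two of the 100 cells trigger decoding, so the autoregressive cost is two tokens instead of one hundred.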
Result: Geometric grounding without the full latency cost. MolmoAct2-Think achieves 98.1% on LIBERO (vs 97.2% for base MolmoAct2) while the depth prediction cost scales with scene change, not scene size.
Training Tricks
Two innovations make adaptive depth work in practice:
Depth noise injection (10%): During fine-tuning, randomly corrupt 10% of depth-code inputs while keeping targets clean. This makes the action expert robust to imperfect depth predictions at inference time.
Learned per-layer depth gate: A scalar gate $g_\ell = \sigma(w_\ell^\top c_\ell + b_\ell)$ at each expert layer controls how much depth information flows through, initialized at $b = -4$ (nearly closed). The model learns how strongly each layer should use depth.
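Both tricks are small enough to sketch in NumPy. Several details below are assumptions rather than facts from the source: corruption is modeled as replacing a code with a uniformly random one, each token is corrupted independently with probability 0.1, and the gate weights start at zero so that only the bias matters at initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def depth_gate(c, w, b=-4.0):
    """g = sigmoid(w^T c + b): a scalar gate for one expert layer."""
    return sigmoid(c @ w + b)

def corrupt_depth_codes(codes, rng, rate=0.10, n_codes=128):
    """Replace roughly `rate` of the input codes with random ones (assumed scheme)."""
    corrupt = rng.random(codes.shape) < rate
    random_codes = rng.integers(0, n_codes, size=codes.shape)
    return np.where(corrupt, random_codes, codes), corrupt

rng = np.random.default_rng(0)

# Gate at initialization: zero weights (assumed) and bias -4.
c = rng.normal(size=16)                         # toy per-layer context vector
g0 = depth_gate(c, np.zeros(16))                # sigmoid(-4), about 0.018

# Noise injection: inputs are corrupted, training targets stay clean.
clean_codes = rng.integers(0, 128, size=100)    # one 10x10 grid of depth codes
noisy_inputs, corrupt = corrupt_depth_codes(clean_codes, rng)
targets = clean_codes
```

With the bias at -4 the gate passes under 2% of the depth signal at the start of training, so the expert begins close to its depth-free behavior and opens the gate only where depth helps.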
Experiments and Results
The evaluation is the most extensive for any open VLA to date: 7 benchmarks spanning simulation and real-world, across 3 embodiments.
Embodied Reasoning (Molmo2-ER)
| Model | Overall Average (13 benchmarks) |
|---|---|
| Molmo2 (base) | 46.8% |
| GPT-5 | 57.9% |
| Gemini-ER 1.5 Thinking | 61.3% |
| Molmo2-ER | 63.8% |
Out-of-the-Box Deployment
| Benchmark | π0.5-DROID | MolmoAct2 |
|---|---|---|
| MolmoSpaces (sim) | 34.5% | 37.7% |
| MolmoBot (real, DROID) | 48.4% | 87.1% |
| SO-100/101 (real) | 45.3% (π0) | 56.7% |
Fine-tuning (LIBERO)
| Model | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| π0.5 | 97.5% | 98.2% | 98.0% | 92.4% | 96.9% |
| MolmoAct2 | 97.8% | 100% | 97.8% | 93.2% | 97.2% |
| MolmoAct2-Think | 98.5% | 99.8% | 98.8% | 95.4% | 98.1% |
Real-World Bimanual YAM (8 tasks, 50 trials each)
MolmoAct2 achieves 50.1% average success - 15.0 percentage points ahead of the runner-up (OpenVLA-OFT at 35.1%). The tasks include in-the-wild deployments such as store_candy, hang_tools, prepare_pipette, and make_popcorn.
Inference Speed
| Model | Control Rate |
|---|---|
| MolmoAct2 (original) | 23.02 Hz |
| MolmoAct2 (+ caching) | 27.39 Hz |
| MolmoAct2 (+ CUDA Graphs) | 55.79 Hz |
| MolmoAct2-Think (+ CUDA Graphs) | 12.71 Hz |
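The control rates translate directly into per-step time budgets and speedups. The short calculation below uses only the numbers from the table; the dictionary keys are made-up labels.

```python
# Control rates in Hz, copied from the table above.
rates_hz = {
    "original": 23.02,
    "caching": 27.39,
    "cuda_graphs": 55.79,
    "think_cuda_graphs": 12.71,
}

# Time available per control step, in milliseconds.
period_ms = {name: 1000.0 / hz for name, hz in rates_hz.items()}

# Speedup of each base-model optimization over the unoptimized model.
# The Think variant is left out: the table gives no unoptimized rate for it.
speedup = {
    name: rates_hz[name] / rates_hz["original"]
    for name in ("caching", "cuda_graphs")
}

for name, ms in period_ms.items():
    print(f"{name}: {ms:.2f} ms per step")
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x over original")
```

The CUDA Graphs figure works out to roughly 2.42x over the original, and the fastest configuration leaves about 18 ms per control step.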
Key Ablations
- Backbone: Molmo2-ER gives +6.0 points over Molmo2 on LIBERO Long (83.6% vs 77.6%)
- KV connection: Per-layer KV (95.9%) > per-head KV (94.8%) > hidden states (94.0%)
- Flow samples: K=8 (95.9%) > K=4 (95.15%) > K=1 (94.15%)
- Fine-tuning: Full model + discrete co-training (97.2%) >> action expert only (93.05%)
Limitations
- Hardware requirements: Inference needs an H100 GPU. Not edge-deployable yet.
- Embodiment fine-tuning still required: The model generalizes across tasks on a given platform, but new platforms need fine-tuning.
- Monocular depth only: MolmoAct2-Think uses Depth Anything V2 estimates, not true stereo/LIDAR.
- 4B parameters: Still large for some deployment contexts, even at 55 Hz.
- Limited CUDA Graph gains for the Think variant: adaptive decoding speeds up by only 1.58x, versus 2.42x for the fixed-shape flow path.
Summary
What to remember:
- MolmoAct2 is the first fully open VLA (weights + code + data) to outperform closed-source competitors across simulation and real-world benchmarks
- The per-layer KV-cache connection between VLM and action expert is the key architectural innovation - it gives the robot hierarchical access to the language model’s spatial and semantic representations
- Adaptive depth reasoning (MolmoAct2-Think) provides geometric grounding while scaling cost with scene change, not scene size
- The complete data release includes the largest open bimanual dataset (720h, 34.5k demos on a $6k platform)
- With CUDA Graph optimization, MolmoAct2 runs at 55.79 Hz - fast enough for real-time closed-loop control
This is what “fully open” should mean in robotics: not just model weights, but the entire pipeline from data collection to deployment. MolmoAct2 sets the new standard.