MolmoAct2: The First Fully Open Robot Controller That Beats Closed-Source Giants

A robot that can fold laundry, pack medication, and pour tea - controlled by a single model - sounds like science fiction, but it is exactly what real deployment requires. The problem? The best robot controllers are either closed-source (π0.5), too slow (reasoning models that generate hundreds of tokens before moving), or tied to hardware most labs can’t afford. MolmoAct2 (Fang, Duan et al., Allen AI / UW / Stanford / NVIDIA / MIT, May 2026) addresses all three problems at once: it’s fully open (weights, code, data), runs at 55.79 Hz, deploys on platforms costing under $6,000, and achieves 97.2% success on LIBERO - beating every open and closed baseline. The secret? Let the robot’s action generator peek into the language model’s brain at every layer, not just the final output.

The Problem: Why Current Robot Controllers Fall Short

Vision-Language-Action (VLA) models represent the dream of robotics: one model that maps pixels and language to motor commands across any task. But in 2026, the field faces four compounding failures:

Closed frontier models - π0.5, Gemini Robotics, and others release at most weights. Training data, recipes, and full pipelines remain proprietary. You can’t reproduce, adapt, or audit them.
Reasoning overhead - Models that “think” before acting (chain-of-thought, world-model rollouts) generate hundreds of tokens before emitting a single motor command. This destroys real-time control.
Hardware lock-in - The few deployable open VLAs require expensive platforms (e.g., Franka arms at $30k+). Academic labs and independent researchers are locked out.
Brittle generalization - Zero-shot performance remains too unreliable for production. Fine-tuning helps but doesn’t close the gap to dependable deployment.

MolmoAct2 tackles each axis simultaneously.

What MolmoAct2 Does Differently - Five Axes

The architecture advances over its predecessor MolmoAct along five dimensions:

Axis	What’s New
Backbone	Molmo2-ER - specialized for spatial & embodied reasoning
Data	720h bimanual YAM + filtered DROID + SO-100/101 (largest open bimanual dataset)
Tokenizer	OpenFAST - open-weight action tokenizer, 5 embodiments
Architecture	Flow-matching expert with per-layer KV-cache conditioning
Reasoning	MolmoAct2-Think - adaptive depth tokens for changed regions only

Let’s unpack each.

Molmo2-ER: A VLM That Understands Space

General-purpose VLMs (GPT-5, Gemini) are trained on web-scale image-text data. They’re excellent at “what is in this image?” but terrible at “how far is object A from object B?” or “if I move forward, will I hit the table?” These are exactly the questions a robot policy needs answered.

Molmo2-ER is Molmo2 specialized for embodied reasoning through a two-stage recipe:

Stage 1 - Embodied specialization (20K steps): Starting from the Molmo2 checkpoint, fine-tune on a 3.3M-sample corpus covering six capability pillars: image embodied QA, video embodied QA, pixel-accurate pointing, object detection, multi-image/ego-exo correspondence, and abstract reasoning. Add 8% text-only data to preserve language competence.

Stage 2 - Joint refinement (1.5K steps): Interleave the embodied corpus with Molmo2’s original multimodal data. Sweep the embodied/general ratio, finding p=0.5 optimal.

The result? Molmo2-ER scores 63.8% averaged across 13 embodied reasoning benchmarks - surpassing GPT-5 (57.9%) and Gemini Robotics ER-1.5 Thinking (61.3%), and improving on its own base model Molmo2 (46.8%) by 17 points.

The Architecture: From Discrete Tokens to Continuous Control

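MolmoAct2 follows a three-stage training pipeline: pre-training → post-training → fine-tuning. The key idea is to bridge discrete language modeling and continuous robot control: the model first learns to predict discrete action tokens, then gains a flow-matching action expert that reads the VLM’s KV cache at every layer.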

OpenFAST: Turning Robot Motions into Tokens

Robot actions are continuous (joint angles, end-effector positions) and embodiment-specific (7-DOF arm vs. 14-DOF bimanual). You can’t just shove them into a language model. OpenFAST solves this:

Take a 1-second action trajectory
Apply a frequency-domain transform (DCT-like)
Quantize the coefficients
Apply byte-pair encoding → compact discrete token sequence from a 2048-token vocabulary

Trained on 1M trajectories across 5 embodiments (YAM, SO-100/101, DROID Franka, BC-Z, BridgeData V2), OpenFAST is the first fully open action tokenizer with transparent training data.
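To make the first three tokenization steps concrete, here is a minimal NumPy sketch of a DCT-then-quantize round trip on a toy action chunk. It is an illustration under stated assumptions, not OpenFAST itself: the function names, the `scale` quantization factor, and the 30-step, 7-dimensional toy trajectory are all hypothetical, and the final byte-pair-encoding step is omitted.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix (rows = frequencies)."""
    k = np.arange(n)[:, None]          # frequency index
    i = np.arange(n)[None, :]          # time index
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)     # rescale the DC row for orthonormality
    return basis * np.sqrt(2.0 / n)

def encode_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Transform a (T, D) chunk along time, then round to integer codes.
    `scale` is a hypothetical quantization factor, not a value from the paper."""
    coeffs = dct_matrix(actions.shape[0]) @ actions
    return np.round(coeffs * scale).astype(np.int64)

def decode_chunk(codes: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Invert the transform; the basis is orthonormal, so inverse = transpose."""
    return dct_matrix(codes.shape[0]).T @ (codes / scale)

# Toy smooth trajectory: 30 time steps, 7 action dimensions (assumed sizes).
t = np.linspace(0.0, 1.0, 30)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8)[None, :] * 0.25)

codes = encode_chunk(chunk)
recon = decode_chunk(codes)
max_error = float(np.abs(recon - chunk).max())
nonzero_fraction = float((codes != 0).mean())
print(f"max reconstruction error: {max_error:.4f}")
print(f"fraction of nonzero codes: {nonzero_fraction:.2f}")
```

For smooth motions most of the energy typically lands in the low-frequency coefficients, so many integer codes come out as zero. That is the kind of redundancy a byte-pair encoder can then compress into a short token sequence.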

During pre-training (200K steps), MolmoAct2 simply predicts these discrete action tokens alongside text - a unified next-token objective. 90% robot data, 10% multimodal.

Per-Layer KV Connection: The Key Innovation

Here’s where things get interesting. Pre-training gives you a VLM that can predict discrete actions. But for deployment, you need continuous control - smooth, precise trajectories. This is where the flow-matching action expert comes in during post-training.

The expert is a DiT-style network (same depth as the VLM: 36 layers) that generates continuous action trajectories via flow matching. The critical question: how does the expert access the VLM’s understanding?

Previous approach (GR00T N1.7, others): condition on the VLM’s final hidden states. This gives the action expert only a summary - like reading only the conclusion of a paper.

MolmoAct2’s approach: tap into the VLM’s KV cache at every layer:

$$\tilde{K}_\ell = \text{reshape}(P_K K^{\text{vlm}}_\ell), \quad \tilde{V}_\ell = \text{reshape}(P_V V^{\text{vlm}}_\ell) \tag{1}$$

where:

$K^{\text{vlm}}_\ell, V^{\text{vlm}}_\ell$ - keys and values from VLM self-attention at layer $\ell$
$P_K, P_V$ - learned linear projections aligning VLM dimensions to expert width
$\tilde{K}_\ell, \tilde{V}_\ell$ - projected tensors used by the expert’s cross-attention

Interpretation: Early VLM layers encode low-level spatial features (edges, positions, distances). Later layers encode high-level semantics (object identity, task understanding). By cross-attending at every layer, the action expert gets the full hierarchy - from pixel-level geometry to abstract task comprehension.

The cross-attention at each expert block:

$$\text{CA}(Q_\ell, \tilde{K}_\ell, \tilde{V}_\ell) = \text{softmax}\left(\frac{Q_\ell \tilde{K}_\ell^\top}{\sqrt{d_h}}\right)\tilde{V}_\ell \tag{2}$$

where $Q_\ell$ is the expert’s query at layer $\ell$ and $d_h$ is the head dimension.
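Equations (1) and (2) can be sketched in NumPy as follows. This is a toy illustration with made-up sizes: the tensor shapes, the row-vector convention (`K @ P` rather than `P K`), and the choice to share one pair of projections across layers are assumptions for the example, not details confirmed by the paper.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_kv(kv_vlm: np.ndarray, proj: np.ndarray, n_heads: int) -> np.ndarray:
    """Eq. (1): project (S, d_vlm) to expert width, reshape to (n_heads, S, d_h)."""
    seq_len = kv_vlm.shape[0]
    out = kv_vlm @ proj                        # (S, n_heads * d_h)
    return out.reshape(seq_len, n_heads, -1).transpose(1, 0, 2)

def cross_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Eq. (2): q is (H, A, d_h); k and v are (H, S, d_h); returns (H, A, d_h)."""
    d_h = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n_layers, seq_len, d_vlm = 3, 5, 16            # toy sizes, not the real model's
n_heads, d_h, n_action = 2, 4, 6

p_k = rng.normal(size=(d_vlm, n_heads * d_h))  # stands in for learned P_K
p_v = rng.normal(size=(d_vlm, n_heads * d_h))  # stands in for learned P_V

outputs = []
for layer in range(n_layers):
    # Stand-ins for the keys/values cached by VLM layer `layer`.
    k_vlm = rng.normal(size=(seq_len, d_vlm))
    v_vlm = rng.normal(size=(seq_len, d_vlm))
    q = rng.normal(size=(n_heads, n_action, d_h))   # expert queries at this layer
    outputs.append(
        cross_attention(q, project_kv(k_vlm, p_k, n_heads),
                        project_kv(v_vlm, p_v, n_heads))
    )
print([o.shape for o in outputs])
```

The point of the loop is that expert layer ℓ reads the cache of VLM layer ℓ, so each expert block sees a different level of the VLM’s feature hierarchy instead of one final summary.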

Crucially, during post-training the KV cache is detached - the flow loss trains only the expert and its projections, not the VLM. This “knowledge insulation” prevents the continuous objective from corrupting the VLM’s representations. During fine-tuning, this constraint is relaxed.

Flow Matching: Learning to Denoise Actions

The expert learns via flow matching. Given a target action chunk $a$ and noise $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = (1-t)\epsilon + ta, \quad u^\star = a - \epsilon \tag{3}$$

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{a,\epsilon,t}\left[\|m \odot (f_\theta(x_t, t, c) - u^\star)\|_2^2\right] \tag{4}$$

where:

$x_t$ - noisy action at flow time $t \in [0,1]$
$u^\star$ - target velocity (direction from noise to data)
$f_\theta$ - the action expert
$c$ - VLM context (task, observations, states)
$m$ - mask zeroing padded dimensions

Interpretation: The expert sees a partially-noisy version of the target action and learns to predict which direction to “push” to recover the clean action. At inference, start from pure noise and integrate the predicted velocity field - out comes a continuous trajectory.
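The inference procedure described above (start from noise, integrate the velocity field) can be sketched with a plain Euler integrator. The example below replaces the trained expert with a hypothetical oracle: for a single known target $a$, the ideal velocity at $x_t$ is $(a - x_t)/(1-t)$, which equals $a - \epsilon$ along the path in Eq. (3). The step count of 10 is an assumption for illustration, not a setting taken from the paper.

```python
import numpy as np

def integrate_flow(velocity_fn, x0: np.ndarray, n_steps: int = 10) -> np.ndarray:
    """Euler integration of dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt                      # t stays strictly below 1
        x = x + dt * velocity_fn(x, t)
    return x

# Toy target action chunk: 2 time steps, 3 action dimensions.
target = np.array([[0.1, -0.2, 0.3],
                   [0.4, 0.0, -0.5]])

def oracle_velocity(x: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical stand-in for the trained expert f_theta(x_t, t, c)."""
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=target.shape)   # x_0 is pure Gaussian noise
action = integrate_flow(oracle_velocity, noise, n_steps=10)
print(np.abs(action - target).max())
```

With the oracle, the straight-line path lands on the target after the last step. A learned expert only approximates this field, so the number of integration steps becomes a trade-off between latency and accuracy.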

For denser supervision, each example is evaluated at $K$ independent noise levels:

$$\mathcal{L}_{\text{flow}}(a, c) = \frac{1}{K}\sum_{i=1}^{K}\|m \odot (f_\theta(x_{t_i}, t_i, c) - (a - \epsilon_i))\|_2^2 \tag{5}$$

The model uses $K=4$ in post-training and $K=8$ in fine-tuning. In the ablations, $K=8$ reaches 95.9% on LIBERO versus 94.15% for $K=1$.
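As a rough sketch of the multi-sample objective in Eq. (5), the function below averages the masked squared error over $K$ independent draws of noise and flow time. The uniform sampling of $t$, the function name, and the toy shapes are assumptions; the source does not state how $t$ is sampled.

```python
import numpy as np

def flow_loss(f, a: np.ndarray, c, mask: np.ndarray, k: int,
              rng: np.random.Generator) -> float:
    """Average masked flow-matching loss over k (noise, time) samples, Eq. (5)."""
    total = 0.0
    for _ in range(k):
        eps = rng.normal(size=a.shape)          # epsilon_i ~ N(0, I)
        t = rng.uniform(0.0, 1.0)               # t_i; uniform is an assumption
        x_t = (1.0 - t) * eps + t * a           # noisy action, Eq. (3)
        target_velocity = a - eps               # u* = a - epsilon
        residual = mask * (f(x_t, t, c) - target_velocity)
        total += float(np.sum(residual ** 2))
    return total / k

# Toy chunk: 4 time steps, 3 action dims, with the last dim treated as padding.
a = np.array([[0.1, 0.2, 0.0],
              [0.2, 0.1, 0.0],
              [0.3, 0.0, 0.0],
              [0.4, -0.1, 0.0]])
mask = np.array([1.0, 1.0, 0.0])

def zero_model(x_t, t, c):
    """Untrained placeholder that always predicts zero velocity."""
    return np.zeros_like(x_t)

loss = flow_loss(zero_model, a, c=None, mask=mask, k=4,
                 rng=np.random.default_rng(0))
print(f"loss with a zero predictor: {loss:.3f}")
```

Because the VLM context `c` is shared by all $K$ samples, the extra draws reuse one forward pass of the backbone and only repeat the comparatively cheap expert computation, which is one plausible reason the denser supervision is affordable.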

MolmoAct2-Think: Adaptive Depth Reasoning

Robot manipulation depends on spatial information - object distance, free space, occlusion - that standard behavior cloning never explicitly represents. MolmoAct2-Think adds an intermediate step: before acting, predict a compact depth map.

The Depth Representation

Each observation is quantized into a 10×10 grid of depth codes: 100 tokens in total, each taking one of 128 possible values. These are predicted autoregressively, just like text tokens, and their KV cache then conditions the action expert.

Why Adaptive?

Here’s the efficiency trick. Robot trajectories contain massive temporal redundancy: most of the scene doesn’t change between consecutive 30Hz frames. So why re-predict 100 depth tokens every step?

MolmoAct2-Think maintains a depth buffer and only re-predicts cells where the RGB content changed:

$$m_{t,i} = \mathbf{1}[\cos(x_{t,i}, x_{t-1,i}) < 0.996] \tag{6}$$

where $x_{t,i}$ is the RGB patch at grid cell $i$, time $t$. Only cells with $m_{t,i} = 1$ trigger fresh depth prediction. Static cells reuse cached codes for free.
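The change test in Eq. (6) and the depth buffer update can be sketched as follows. This is a minimal illustration assuming the cosine similarity is taken over flattened raw RGB patches; the helper names, the 40×40 toy frame, and the `predict_fn` callback standing in for the model’s depth decoding are hypothetical.

```python
import numpy as np

def to_patches(frame: np.ndarray, grid: int = 10) -> np.ndarray:
    """Split an (H, W, C) frame into grid*grid flattened patches (row-major)."""
    h, w, c = frame.shape
    ph, pw = h // grid, w // grid
    cells = frame[: ph * grid, : pw * grid].reshape(grid, ph, grid, pw, c)
    return cells.transpose(0, 2, 1, 3, 4).reshape(grid * grid, -1)

def changed_cells(frame, prev_frame, grid: int = 10, threshold: float = 0.996):
    """Eq. (6): flag cells whose cosine similarity to the previous frame is low."""
    x, y = to_patches(frame, grid), to_patches(prev_frame, grid)
    tiny = 1e-8                          # keeps all-black patches at similarity 1
    cos = ((x * y).sum(axis=1) + tiny) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + tiny)
    return cos < threshold

def update_depth_buffer(buffer, mask, predict_fn):
    """Re-predict depth codes only for flagged cells; reuse the rest."""
    new_buffer = buffer.copy()
    idx = np.flatnonzero(mask)
    new_buffer[idx] = predict_fn(idx)
    return new_buffer

prev = np.ones((40, 40, 3))              # toy frame: 10x10 grid of 4x4 patches
cur = prev.copy()
cur[0:4, 4:8] = 0.0                      # alter the patch at grid row 0, column 1
cur[0, 4, 0] = 1.0

mask = changed_cells(cur, prev)
buffer = np.zeros(100, dtype=np.int64)   # cached depth codes from the last step
buffer = update_depth_buffer(buffer, mask, lambda idx: np.full(len(idx), 7))
print(int(mask.sum()), "cell(s) re-predicted")
```

In this toy case only one of the 100 cells is re-decoded, which is the sense in which the cost follows scene change and not scene size.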

Result: Geometric grounding without the full latency cost. MolmoAct2-Think achieves 98.1% on LIBERO (vs 97.2% for base MolmoAct2) while the depth prediction cost scales with scene change, not scene size.

Training Tricks

Two innovations make adaptive depth work in practice:

Depth noise injection (10%): During fine-tuning, randomly corrupt 10% of depth-code inputs while keeping targets clean. This makes the action expert robust to imperfect depth predictions at inference time.
Learned per-layer depth gate: A scalar gate $g_\ell = \sigma(w_\ell^\top c_\ell + b_\ell)$ at each expert layer controls how much depth information flows through, initialized at $b = -4$ (nearly closed). The model learns how strongly each layer should use depth.
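Both tricks are small enough to sketch directly. In the example below, corrupting a code means replacing it with a uniformly random code, and the gate weights start at zero; both are assumptions for illustration, since the source only gives the 10% rate and the bias initialization $b = -4$. The function names and the 8-dimensional context vector are hypothetical.

```python
import numpy as np

def corrupt_depth_codes(codes: np.ndarray, rng: np.random.Generator,
                        rate: float = 0.10, vocab: int = 128):
    """Return a noisy copy of the input codes; the clean array stays the target."""
    noisy = codes.copy()
    flip = rng.random(codes.shape) < rate            # pick ~10% of positions
    noisy[flip] = rng.integers(0, vocab, size=int(flip.sum()))
    return noisy, flip

def depth_gate(c: np.ndarray, w: np.ndarray, b: float = -4.0) -> float:
    """Scalar gate g = sigmoid(w^T c + b) for one expert layer."""
    return float(1.0 / (1.0 + np.exp(-(w @ c + b))))

rng = np.random.default_rng(0)
clean_codes = rng.integers(0, 128, size=100)         # one 10x10 depth grid
noisy_codes, flipped = corrupt_depth_codes(clean_codes, rng)

context = rng.normal(size=8)                         # stand-in for c_l
gate_at_init = depth_gate(context, w=np.zeros(8))    # zero weights: assumption
print(f"gate at initialization: {gate_at_init:.4f}")
```

With zero weights the gate equals $\sigma(-4) \approx 0.018$, so each layer starts out almost ignoring depth and has to learn to open the gate where depth actually helps.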

Experiments and Results

The evaluation is the most extensive for any open VLA to date: 7 benchmarks spanning simulation and real-world, across 3 embodiments.

Embodied Reasoning (Molmo2-ER)

Model	Overall Average (13 benchmarks)
Molmo2 (base)	46.8%
GPT-5	57.9%
Gemini-ER 1.5 Thinking	61.3%
Molmo2-ER	63.8%

Out-of-the-Box Deployment

Benchmark	π0.5-DROID	MolmoAct2
MolmoSpaces (sim)	34.5%	37.7%
MolmoBot (real, DROID)	48.4%	87.1%
SO-100/101 (real)	45.3% (π0)	56.7%

Fine-tuning (LIBERO)

Model	Spatial	Object	Goal	Long	Average
π0.5	97.5%	98.2%	98.0%	92.4%	96.9%
MolmoAct2	97.8%	100%	97.8%	93.2%	97.2%
MolmoAct2-Think	98.5%	99.8%	98.8%	95.4%	98.1%

Real-World Bimanual YAM (8 tasks, 50 trials each)

MolmoAct2 achieves 50.1% average success, 15 percentage points above the runner-up (OpenVLA-OFT at 35.1%). Tasks include in-the-wild deployment: store_candy, hang_tools, prepare_pipette, make_popcorn.

Inference Speed

Model	Control Rate
MolmoAct2 (original)	23.02 Hz
MolmoAct2 (+ caching)	27.39 Hz
MolmoAct2 (+ CUDA Graphs)	55.79 Hz
MolmoAct2-Think (+ CUDA Graphs)	12.71 Hz
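As a quick arithmetic check on the table, the snippet below converts each reported control rate into a time budget per control step and computes the speedup from the original configuration to the CUDA Graphs one. It uses only the numbers in the table; the variable names are illustrative.

```python
rates_hz = {
    "MolmoAct2 (original)": 23.02,
    "MolmoAct2 (+ caching)": 27.39,
    "MolmoAct2 (+ CUDA Graphs)": 55.79,
    "MolmoAct2-Think (+ CUDA Graphs)": 12.71,
}

# Time available per control step at each reported rate, in milliseconds.
period_ms = {name: 1000.0 / hz for name, hz in rates_hz.items()}

# Speedup of the CUDA Graphs configuration over the original one.
cuda_graph_speedup = (rates_hz["MolmoAct2 (+ CUDA Graphs)"]
                      / rates_hz["MolmoAct2 (original)"])

for name, ms in period_ms.items():
    print(f"{name}: {ms:.1f} ms per step")
print(f"CUDA Graphs speedup over original: {cuda_graph_speedup:.2f}x")
```

At 55.79 Hz the whole pipeline has roughly 18 ms per step, and the ratio of about 2.42 matches the fixed-shape speedup quoted in the Limitations section.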

Key Ablations

Backbone: Molmo2-ER gives +6 points over Molmo2 on LIBERO Long (83.6% vs 77.6%)
KV connection: Per-layer KV (95.9%) > per-head KV (94.8%) > hidden states (94.0%)
Flow samples: K=8 (95.9%) > K=4 (95.15%) > K=1 (94.15%)
Fine-tuning: Full model + discrete co-training (97.2%) » action expert only (93.05%)

Limitations

Hardware requirements: Inference needs an H100 GPU. Not edge-deployable yet.
Embodiment fine-tuning still required: The model generalizes across tasks on a given platform, but new platforms need fine-tuning.
Monocular depth only: MolmoAct2-Think uses Depth Anything V2 estimates, not true stereo/LIDAR.
4B parameters: Still large for some deployment contexts, even at 55 Hz.
Think variant gains less from CUDA Graphs: Its adaptive decoding speeds up by 1.58x, versus 2.42x for the fixed-shape flow path.

Summary

What to remember:

MolmoAct2 is the first fully open VLA (weights + code + data) to outperform closed-source competitors across simulation and real-world benchmarks
The per-layer KV-cache connection between VLM and action expert is the key architectural innovation - it gives the robot hierarchical access to the language model’s spatial and semantic representations
Adaptive depth reasoning (MolmoAct2-Think) provides geometric grounding while scaling cost with scene change, not scene size
The complete data release includes the largest open bimanual dataset (720h, 34.5k demos on a $6k platform)
With CUDA Graph optimization, MolmoAct2 runs at 55.79 Hz - fast enough for real-time closed-loop control

This is what “fully open” should mean in robotics: not just model weights, but the entire pipeline from data collection to deployment. MolmoAct2 sets the new standard.

Sources and materials:

📄 Paper: MolmoAct2: Action Reasoning Models for Real-world Deployment
💻 Project: Allen AI Blog / MolmoAct2
