Imagine you want a model that can actually use your phone — tap, swipe, type, navigate apps, book a flight. The model exists. The benchmarks exist. So why, in 2026, can you still not pip install a GUI agent and have it do anything on your real device? The answer is almost never the model. It is the infrastructure around the model: the training environment, the evaluation harness, and the deployment stack, each of which is typically closed, fragmented, or both.

ClawGUI is the first open-source release that builds all three, and — crucially — connects them into one pipeline.

Motivation: Three Gaps, One Ceiling

Progress on GUI agents is bottlenecked less by model capacity than by a collection of very concrete engineering failures:

  • Closed RL stacks. Most published RL-trained GUI agents rely on infrastructure that is sandbox-only, single-device, and not open-sourced. You cannot reproduce UI-TARS-2 in your lab, let alone extend it.
  • Drifting evaluation. Every benchmark (ScreenSpot-Pro, OSWorld-G, AndroidControl, MMBench-GUI, UI-Vision, MobileWorld) ships with its own scoring script. Resolutions, prompt templates, and judge policies differ per model, and most papers publish metrics without their inference configuration. When the authors of ClawGUI tried to reproduce published numbers across 11+ models and 6 benchmarks, they found that undocumented prompt or resolution choices are the single largest source of irreproducibility.
  • Broken deployment. Even when a policy trains well, it almost never makes it to a physical phone in a user’s hand. No chat integration, no personalized memory, no HarmonyOS or iOS support, no fallback for apps without accessibility APIs.

Each gap has been addressed in isolation. None has been addressed together — until ClawGUI.

The Key Idea

Think of a GUI agent as a robot learning to use a smartphone. To do this well you need three things simultaneously:

  1. A gym where the phone can crash, be reset, and the agent can try again — thousands of times in parallel, on real and virtual hardware.
  2. A ruler that everyone measures progress with in the same way, so numbers between papers actually mean something.
  3. A delivery mechanism that puts the trained robot in real users’ hands, across chat platforms they already use.

ClawGUI builds a gym (ClawGUI-RL), a ruler (ClawGUI-Eval), and a delivery mechanism (ClawGUI-Agent). Crucially, the three are wired together as one pipeline: a fix in one component flows into the others. A model trained in ClawGUI-RL is evaluated by ClawGUI-Eval by default, and deployed via ClawGUI-Agent with no extra glue.

The technical centerpiece is the training recipe. In a long-horizon GUI task — “book me the cheapest flight to Warsaw on Friday” — a naive policy gets one bit of feedback at the very end: did it succeed? Over 30 taps and swipes, that is almost no gradient signal. ClawGUI-RL fixes this by coupling two ideas: a Process Reward Model (PRM) that scores every individual action, and GiGPO (Group-in-Group Policy Optimization), an advantage estimator that assigns per-step credit without a learned value network.

ClawGUI-RL: Scalable Online RL Training

Environment Management

The training environment has to survive hostile conditions. Android emulators drift into unhealthy states during long runs; real devices get thermal-throttled or lose network. The ClawGUI Environment Manager runs dozens of Docker Android emulators (via MobileWorld) behind a unified interface, with a four-phase lifecycle per episode:

Task Reset → Task Evaluation → Spare Server Rotation → Teardown.

The spare server rotation queue is the non-obvious piece. When an emulator starts failing health checks, it is swapped out for a warm spare instead of being restarted in place. Without this, a single stalled container kills the training run.
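The rotation logic can be sketched in a few lines. This is a minimal illustration, not the actual Environment Manager API — the class, method names, and dict-based health flags are all hypothetical; a real check would probe ADB or emulator responsiveness.

```python
from collections import deque

class EnvironmentPool:
    """Sketch of warm-spare rotation (names and structure hypothetical)."""

    def __init__(self, active, spares):
        self.active = list(active)    # emulators currently serving rollouts
        self.spares = deque(spares)   # pre-booted, health-checked spares

    def health_check(self, env) -> bool:
        # Placeholder: a real check would probe the emulator's responsiveness.
        return env.get("healthy", True)

    def rotate_if_unhealthy(self, idx) -> bool:
        """Swap a failing emulator for a warm spare instead of restarting in place."""
        if self.health_check(self.active[idx]):
            return False
        failed = self.active[idx]
        self.active[idx] = self.spares.popleft()  # instant swap: training never blocks
        self.spares.append(self.reset(failed))    # re-provision at the back of the queue
        return True

    def reset(self, env):
        # Placeholder for container teardown + re-boot.
        env["healthy"] = True
        return env
```

The key property is that the training loop only ever pays the cost of a `popleft`; the expensive restart happens off the critical path.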

For real-device training, the environment exposes physical Android / cloud phones. Outcome rewards in this regime cannot come from system-level verification (the OS does not know whether “book a flight” succeeded), so ClawGUI uses an MLLM-as-judge instead.

Reward Design: Outcome + Process

The total reward is the simplest possible combination:

$$ R = R_{\text{outcome}} + R_{\text{step}} \tag{1} $$

where:

  • $R$ — total reward used by the advantage estimator
  • $R_{\text{outcome}}$ — binary episode-end reward (1 for success, 0 for failure), from system-level verification in virtual environments or from an MLLM judge on real devices
  • $R_{\text{step}}$ — dense per-step score from the PRM, judging whether the current action meaningfully contributes to task completion, given (previous screenshot, current screenshot, action history)

Interpretation: The outcome term encodes the true objective. The step term is a local proxy that gives gradient signal everywhere along the trajectory. Without it, a misclick on step 4 and the decisive tap on step 29 receive identical credit — which is why episode-level RL stalls on long horizons.

The step loop is:

  1. Execute rollout, collecting $(s_t, a_t, s_{t+1})$ tuples.
  2. After each step, query the PRM (Qwen3.5-72B in experiments) with $(s_t, s_{t+1}, a_{0:t})$ to obtain $R_{\text{step},t}$.
  3. At episode end, compute $R_{\text{outcome}}$.
  4. Form $R_t = R_{\text{outcome}}\cdot \mathbb{1}[t=T] + R_{\text{step},t}$ and feed it into GiGPO.
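Step 4, combining the two reward terms per eq. (1), is simple enough to sketch directly. The function name is mine; the PRM and outcome judges are assumed to have already produced their scores.

```python
def combine_rewards(step_rewards, outcome_reward):
    """Per-step total reward, eq. (1): R_t = R_outcome * 1[t == T] + R_step_t.

    step_rewards:   list of PRM scores, one per executed action
    outcome_reward: binary episode-end reward (1.0 success, 0.0 failure)
    """
    T = len(step_rewards) - 1  # index of the final step
    return [
        r_step + (outcome_reward if t == T else 0.0)
        for t, r_step in enumerate(step_rewards)
    ]
```

So a successful three-step episode with PRM scores `[0.5, 0.0, 1.0]` yields `[0.5, 0.0, 2.0]`: dense signal everywhere, with the true objective concentrated on the terminal step.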

GiGPO: Anchor-State Grouping

Standard GRPO gives one uniform advantage to every step in an episode — a trajectory is either “good” or “bad” overall. That is wasteful. GiGPO splits credit into two additive components:

$$ A_t^{\text{GiGPO}} = A^{\text{episode}}(\tau) + A^{\text{step}}\bigl(s_t, a_t \mid \mathcal{G}(s_t)\bigr) \tag{2} $$

where:

  • $\tau$ — full trajectory
  • $A^{\text{episode}}(\tau)$ — macro advantage: z-score normalization of episode returns across the group of $G$ rollouts sharing the same task (standard GRPO)
  • $\mathcal{G}(s_t)$ — anchor-state sub-group: the set of all steps, across the $G$ rollouts, that reached the same intermediate state $s_t$
  • $A^{\text{step}}$ — micro advantage computed within $\mathcal{G}(s_t)$ from discounted returns

Interpretation: At the episode level you keep GRPO’s “was this trajectory globally better than its siblings?” At the step level you ask a sharper question: among all rollouts that passed through this exact intermediate state, which chose the better next action? This is per-step credit assignment without a value network and without extra rollouts.

The full algorithm:

Algorithm: GiGPO with PRM rewards
1. For each task, run G rollouts (G = 8 in the paper).
2. Collect combined per-step reward R_t via eq. (1).
3. Episode-level: normalize episode returns within the group -> A^episode.
4. Bucket step tuples across all G rollouts by anchor state s_t.
5. Within each bucket, compute discounted-return-normalized advantages -> A^step.
6. Set A_t = A^episode + A^step and update policy (verl / verl-agent backbone).

No extra critic, no extra rollouts, no change to policy architecture.
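The six steps above can be sketched end to end. This is a minimal illustration under the assumption that an anchor state is any hashable fingerprint of the screen (e.g. a screenshot hash); function and variable names are mine, not from the paper or the verl codebase, and group-relative z-scoring stands in for whatever exact normalization the authors use.

```python
import math
from collections import defaultdict

def gigpo_advantages(rollouts, gamma=0.95, eps=1e-8):
    """Sketch of eq. (2): A_t = A_episode(tau) + A_step(s_t, a_t | G(s_t)).

    rollouts: list of G trajectories for one task; each trajectory is a list
              of (anchor_state, reward) pairs, anchor_state being hashable.
    Returns one per-step advantage list per rollout.
    """
    # Macro: z-score of episode returns across the group (standard GRPO).
    ep_returns = [sum(r for _, r in traj) for traj in rollouts]
    mu = sum(ep_returns) / len(ep_returns)
    sd = math.sqrt(sum((g - mu) ** 2 for g in ep_returns) / len(ep_returns))
    a_episode = [(g - mu) / (sd + eps) for g in ep_returns]

    # Discounted return-to-go at every step of every rollout.
    togo = []
    for traj in rollouts:
        g, rev = 0.0, []
        for _, r in reversed(traj):
            g = r + gamma * g
            rev.append(g)
        togo.append(rev[::-1])

    # Micro: bucket steps across rollouts by anchor state, z-score per bucket.
    buckets = defaultdict(list)
    for i, traj in enumerate(rollouts):
        for t, (s, _) in enumerate(traj):
            buckets[s].append((i, t))
    a_step = [[0.0] * len(traj) for traj in rollouts]
    for members in buckets.values():
        vals = [togo[i][t] for i, t in members]
        m = sum(vals) / len(vals)
        s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
        for (i, t), v in zip(members, vals):
            a_step[i][t] = (v - m) / (s + eps)

    # Additive combination, per step.
    return [
        [a_episode[i] + a_step[i][t] for t in range(len(traj))]
        for i, traj in enumerate(rollouts)
    ]
```

Note that a singleton bucket (an anchor state visited by only one rollout) gets a micro advantage of zero, so the estimator gracefully degrades to plain GRPO wherever rollouts do not overlap.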

ClawGUI-Eval: Infer → Judge → Metric

The evaluation harness is built on a single observation: inference and judging should be decoupled. Today’s de facto standard is a monolithic eval script per benchmark; if the judge is buggy or you want to try a new scoring rule, you have to rerun every model’s inference on every GPU. ClawGUI-Eval splits this into three stages:

  • Infer. Runs locally (HuggingFace Transformers, multi-GPU via multiprocessing with shard-level checkpointing) or through any OpenAI-compatible API. Crucially, it saves raw predictions to disk.
  • Judge. Benchmark-specific scoring over those predictions: point-in-box for standard grounding, polygon + refusal-aware judging for OSWorld-G, multi-action matching for AndroidControl.
  • Metric. Aggregates with breakdowns by platform / element type / task category.

Because Infer is separated from Judge, the community can contribute new judges without rerunning inference. ClawGUI-Eval ships the raw prediction archives for re-judging.
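As a concrete sketch of what the decoupling buys: re-judging a saved prediction archive needs nothing but the file on disk. The JSONL record format below is hypothetical (the real archives surely carry more fields); `point_in_box` is the standard grounding rule named above.

```python
import json

def point_in_box(pred_xy, box) -> bool:
    """Standard grounding judge: predicted click must land inside the GT bbox."""
    x, y = pred_xy
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def judge_file(pred_path) -> float:
    """Re-judge a saved prediction archive without rerunning any inference.

    Assumed (hypothetical) record format, one JSON object per line:
      {"pred": [x, y], "bbox": [x1, y1, x2, y2]}
    """
    hits = total = 0
    with open(pred_path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            hits += point_in_box(rec["pred"], rec["bbox"])
    return hits / max(total, 1)
```

Swapping in a polygon or refusal-aware judge means replacing one pure function over the archive — no GPUs involved.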

With this pipeline, the authors measured 95.8% reproduction against official numbers across 48 (model × benchmark) cells: 46 reproduced within $\pm 2\%$ of the published value or strictly higher. The two failure cells both involved models whose official eval configs were never publicly disclosed — an empirical validation that undocumented configuration, not methodology, is what breaks reproducibility.

An interesting side finding: closed-source frontier models (Gemini, Seed) required a “Zoom” paradigm — 25% or 50% crop tiles — to reproduce their published ScreenSpot-Pro numbers. Standard inference underestimates these systems because they were evaluated with a different input regime.

ClawGUI-Agent: Hybrid CLI-GUI Control with Memory

Deployment has two quiet design choices worth naming:

  • Hybrid CLI + GUI control. Pure CLI is fast and precise but does not cover all apps. Pure GUI covers everything but is slow for tasks that a CLI or API call could complete directly. ClawGUI-Agent decides per step which channel to use, falling back to GUI when CLI does not apply.
  • Personalized memory. A vector-embedding store with top-k retrieval and duplicate merging, so the agent remembers a user’s preferences across sessions (home airport, preferred cuisine) without leaking state between users.
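The memory mechanics — top-k retrieval plus duplicate merging — can be shown with a toy sketch. Everything here is illustrative: a real system would embed with a sentence encoder, while this uses a bag-of-letters vector so the mechanics stay visible; the class name and threshold are mine.

```python
import math

class PreferenceMemory:
    """Toy per-user preference store (embedding and names hypothetical)."""

    def __init__(self, dedup_threshold=0.9):
        self.entries = []  # list of (text, vector); one store per user
        self.dedup_threshold = dedup_threshold

    @staticmethod
    def _embed(text):
        # Stand-in embedding: 26-dim letter-count vector.
        v = [0.0] * 26
        for ch in text.lower():
            if ch.isalpha():
                v[ord(ch) - 97] += 1.0
        return v

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)

    def add(self, text):
        """Insert a preference; near-duplicates are merged, not appended."""
        vec = self._embed(text)
        for i, (_, old_vec) in enumerate(self.entries):
            if self._cos(vec, old_vec) >= self.dedup_threshold:
                self.entries[i] = (text, vec)  # merge: newer phrasing wins
                return
        self.entries.append((text, vec))

    def retrieve(self, query, k=3):
        """Top-k entries by cosine similarity to the query."""
        qv = self._embed(query)
        ranked = sorted(self.entries, key=lambda e: self._cos(qv, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

Because each store is keyed to one user, retrieval can never surface another user's preferences — the isolation property the bullet above asks for.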

ClawGUI-Agent runs across Android, HarmonyOS, and iOS, and is exposed via 12+ chat platforms (Feishu, DingTalk, Telegram, Discord, Slack, QQ, …). ClawGUI-Eval itself is exposed as a natural-language-triggered skill: you can literally chat “evaluate UI-TARS-7B on ScreenSpot-Pro” into a Slack thread.

Experiments

ClawGUI-2B — a 2B parameter model fine-tuned end-to-end in ClawGUI-RL starting from MAI-UI-2B — is the headline result.

| Model | Params | MobileWorld GUI-Only SR |
| --- | --- | --- |
| MAI-UI-2B (base, untrained) | 2B | 11.1% |
| Qwen3-VL-32B | 32B | 11.9% |
| UI-Venus-72B | 72B | 16.4% |
| ClawGUI-2B (ours) | 2B | 17.1% |

A 2B model, trained in the right pipeline, beats 32B and 72B untrained competitors on real-world mobile GUI tasks.

Ablation: What Does the Dense Reward Buy?

| Variant | Advantage estimator | Reward | MobileWorld GUI-Only SR |
| --- | --- | --- | --- |
| Baseline | GRPO | outcome only | 14.5% |
| Full | GiGPO | outcome + PRM step | 17.1% |

That is +2.6 points of absolute SR, or +17.9% relative, from moving to dense step-level supervision with anchor-state grouping. For long-horizon agent training this is a large effect, and it comes without a learned value network.

Where ClawGUI Still Loses

For context, agentic frameworks pairing a closed planner with a grounding module reach much higher numbers on the same benchmark:

| System | MobileWorld GUI-Only SR |
| --- | --- |
| Gemini-3-Pro + UI-Ins-7B | 55.6% |
| GPT-5 + UI-Ins-7B | 54.0% |
| Claude-4.5-Sonnet + UI-Ins-7B | 47.8% |
| ClawGUI-2B (end-to-end) | 17.1% |

This is a different regime — closed planners plus specialized grounding modules — and is not apples-to-apples. The useful reading is: a small open end-to-end policy still has a long runway before it catches a large closed planner, but infrastructure parity has closed the first critical gap.

Why It Works: Infrastructure Is the Real Bottleneck

The quiet thesis of this paper is that recent GUI agent progress has been gated not by ideas but by plumbing. Three concrete pieces of plumbing dominate:

  1. Environment reliability. Spare server rotation alone is the difference between a training run that finishes and one that silently stalls overnight.
  2. Reward density. Outcome-only RL does not work at 30+ step horizons; a judge-as-PRM is a cheap way to inject gradient signal everywhere.
  3. Evaluation hygiene. Most irreproducibility is not a modeling problem; it is undocumented resolutions and prompt templates. Decoupling Infer from Judge makes this auditable.

The paper’s implicit claim — that a 2B model in a clean pipeline can beat 72B in a messy one — is an infrastructure claim, not a modeling claim.

Limitations

To the authors’ credit, the limitations are spelled out honestly:

  • Absolute SR still low. 17.1% is far from useful-for-everyone territory, and well below agentic-framework upper bounds (55.6%).
  • Real-device rewards are judge-noisy. Without system-level verification, the outcome reward on physical phones comes from an MLLM judge that can be wrong.
  • PRM cost. Qwen3.5-72B as the step judge is expensive at inference time and may limit rollout throughput.
  • Lenient reproduction criterion. The $\pm 2\%$ tolerance in the 95.8% reproduction number absorbs some small configuration drift and does not guarantee bit-exact reproduction.
  • Evaluation breadth. End-to-end validation is reported on 117 MobileWorld GUI-Only tasks; broader end-to-end numbers (desktop, HarmonyOS) are not yet available.

Takeaways

  1. GUI agent research is a full-stack problem: training, evaluation, and deployment must be open and connected. ClawGUI is the first release that makes all three open at once.
  2. Dense step-level supervision is essential for long-horizon agent RL. A Process Reward Model plus GiGPO gives per-step credit without a learned critic and is worth +17.9% relative SR on MobileWorld.
  3. 95.8% of reported GUI-benchmark numbers are reproducible when model-specific inference configurations are pinned — the remaining gap is undocumented prompts and resolutions, not methodology.
  4. A well-engineered 2B policy can beat untrained 72B competitors on real mobile tasks. Parameter count is not the current bottleneck; infrastructure is.
  5. Agentic frameworks with closed planners still win on absolute SR. End-to-end open policies have room to grow, but they now have a pipeline to grow in.

From my perspective, the most important line of the paper is the one that is never written explicitly: the bottleneck in GUI agents has been engineering, not science. ClawGUI shifts that bottleneck — and in doing so probably unlocks a wave of small-team research that was previously impossible.

