MLLog.dev

SWE-Explore: The Benchmark That Finally Asks — Did Your Coding Agent Read the Right Code?

Think of a doctor diagnosing a patient. You could evaluate the doctor solely by whether the patient recovered. That matters – but it tells you nothing about whether the right tests were ordered, the right lab results were read, or the doctor simply got lucky with a broad-spectrum antibiotic. If you want to improve the diagnostic process, you need to instrument the intermediate steps. Coding-agent benchmarks have the same problem. SWE-bench, SWE-bench Verified, SWE-bench Multilingual – they all give you a single bit per issue: pass or fail. Did the patch make the tests green? Useful, but it hides an enormous amount of signal. When an agent fails, why did it fail? Wrong reasoning? Wrong edit? Or did it simply never look at the right code? ...

SkillOpt: Training Agent Skills Like Neural Network Weights - Without Touching the Model

You can’t fine-tune GPT-5.5. You can’t fine-tune Claude. You can’t fine-tune most of the models you actually deploy in production. Yet somehow, we expect these frozen models to handle spreadsheet automation, mathematical olympiads, and multi-step search tasks - all from a hand-written system prompt. The paper “SkillOpt: Executive Strategy for Self-Evolving Agent Skills” (arXiv 2605.23904, May 2026) asks: what if the system prompt itself was the trainable parameter? What if we applied the full discipline of deep learning - learning rates, validation splits, negative feedback - to a natural-language document instead of model weights? The result: SkillOpt wins or ties on all 52 evaluated (model, benchmark, harness) cells, achieving gains of up to +39 absolute points on procedural benchmarks and producing compact skill files of just 300-2,000 tokens that transfer across models, harnesses, and benchmarks. ...

MolmoAct2: The First Fully Open Robot Controller That Beats Closed-Source Giants

A robot that can fold laundry, pack medication, and pour tea - controlled by a single model - sounds like science fiction. But it’s exactly what’s needed for real deployment. The problem? The best robot controllers are either closed-source (π0.5), too slow (reasoning models that generate hundreds of tokens before moving), or tied to hardware most labs can’t afford. MolmoAct2 (Fang, Duan et al., Allen AI / UW / Stanford / NVIDIA / MIT, May 2026) tackles all of these at once: it’s fully open (weights, code, data), runs at 55.79 Hz, deploys on platforms costing under $6,000, and achieves 97.2% success on LIBERO - beating every open and closed baseline. The secret? Let the robot’s action generator peek into the language model’s brain at every layer, not just the final output. ...

RecursiveMAS: What If Your Multi-Agent System Was Just One Big Recursive Neural Network?

Multi-agent systems built from LLMs have a dirty secret: the agents talk to each other in text. That sounds natural - after all, text is what LLMs do - but it’s catastrophically wasteful. Every time Agent A finishes reasoning and passes its output to Agent B, the system decodes hidden states into tokens, ships those tokens over, and re-encodes them back into hidden states. Information gets destroyed. Gradients die at the text boundary. And you’re paying for a full vocabulary projection at every handoff. The paper “Recursive Multi-Agent Systems” (Yang, Zou, Pan et al., UIUC/Stanford/NVIDIA/MIT, April 2026) asks: what if we just… didn’t do that? What if the agents shared their thoughts directly, in continuous latent space, and the entire system looped like a single recursive neural network? The result is RecursiveMAS - a framework that adds only 0.31% trainable parameters (13.12M) while delivering +8.3% average accuracy, 2.4x inference speedup, and 75.6% token reduction. ...

Tstars-Tryon 1.0: Virtual Try-On as Multi-Image Editing at Taobao Scale

A user opens the Taobao app, picks a model photo, and drops in six reference images: a coat, an inner shirt, pants, shoes, a hat and a bag. They tap a button. Less than seven seconds later, a fresh photo appears — same face, same background, every garment placed correctly with the coat unzipped, revealing the inner shirt. Multiply this by tens of millions of requests per service window, and you get a sense of what Tstars-Tryon 1.0 is solving. This is not the lab-clean VITON-HD setting where one t-shirt gets pasted onto a fashion model in a studio. This is virtual try-on at e-commerce scale, on real-world photos, with stacked outfits and accessories — and it is running today. ...

ClawGUI: A Full-Stack Open-Source Pipeline for GUI Agents

Imagine you want a model that can actually use your phone — tap, swipe, type, navigate apps, book a flight. The model exists. The benchmarks exist. So why, in 2026, can you still not pip install a GUI agent and have it do anything on your real device? The answer is almost never the model. It is the infrastructure around the model: the training environment, the evaluation harness, and the deployment stack, each of which is typically closed, fragmented, or both. ...

SkillClaw: Making LLM Agent Skills Evolve Collectively

Imagine 8 people in a company using the same AI assistant. Each of them hits the same problems — wrong API port, missing file, malformed argument — and each time independently discovers a workaround. The next day, someone else falls into the exact same hole. The system doesn’t learn from its users’ experience. What if a nightly “editorial shift” automatically analyzed all the day’s interactions, drew conclusions, and served improved procedures to everyone the next morning? ...

TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture

Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model proposes tokens, and a large verifier approves or rejects them in parallel. The draft model is a lightweight language model that quickly proposes candidate tokens; the verifier is the full-size target model, which processes all candidates in one forward pass and accepts those matching its own distribution, guaranteeing output quality identical to standard autoregressive decoding. Same output distribution, fewer expensive forward passes. ...
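To make the propose-then-verify loop concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is not taken from the TAPS paper: the toy `draft_next` and `verifier_next` functions, the block size `k`, and the integer "tokens" are all assumptions chosen for illustration. A real verifier scores every drafted position in one batched forward pass; the sketch only counts that pass, and it omits the bonus token that implementations usually emit when every proposal is accepted. The sampling variant, which preserves the verifier's full output distribution, uses a rejection-sampling rule that is not shown here.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(prefix: List[Token], next_token: NextToken, max_new_tokens: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per new token."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(next_token(out))
    return out


def speculative_decode(
    prefix: List[Token],
    draft_next: NextToken,
    verifier_next: NextToken,
    k: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verifier passes)."""
    out = list(prefix)
    target_len = len(prefix) + max_new_tokens
    verifier_passes = 0
    while len(out) < target_len:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        n = min(k, target_len - len(out))
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(n):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2) The verifier checks every proposed position. In a real system this
        #    is a single batched forward pass, so it is counted once here.
        verifier_passes += 1
        base = list(out)
        for i, tok in enumerate(proposal):
            expected = verifier_next(base + proposal[:i])
            out.append(expected)  # the emitted token is always the verifier's choice
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return out, verifier_passes


# Hypothetical toy models: the verifier counts upward modulo 10,
# and the draft agrees with it except right after a 4.
def verifier_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


baseline = greedy_decode([0], verifier_next, max_new_tokens=8)
tokens, passes = speculative_decode([0], draft_next, verifier_next, k=4, max_new_tokens=8)
print(tokens == baseline, passes)
```

In this toy run the speculative output matches plain greedy decoding with the verifier, but it needs 3 verifier passes instead of 8. The better the draft model agrees with the verifier, the more tokens each pass accepts.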

Demystifying Video Reasoning: Models Don't Think in Frames - They Think in Denoising Steps

Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But how do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard. That answer is wrong. The paper “Demystifying Video Reasoning” shows that reasoning in video diffusion models doesn’t unfold across frames. It unfolds across denoising steps - the iterative process that turns noise into a coherent video. The authors call this Chain-of-Steps (CoS), and it fundamentally changes how we understand what these models are doing. ...

Seoul World Model: AI That Generates Video of Real Cities From Street Photos

What if you could fly a virtual camera through any street in a real city - not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos? That’s exactly what the Seoul World Model (SWM) does. The paper “Grounding World Simulation Models in a Real-World Metropolis” introduces a city-scale world model, a neural network that learns the dynamics and visual appearance of an environment so it can ‘imagine’ new views and trajectories it has never seen directly. This one generates video grounded in real geography, not in imagined scenes. ...