A robot that can fold laundry, pack medication, and pour tea - controlled by a single model - sounds like science fiction. But it’s exactly what’s needed for real deployment. The problem? The best robot controllers are either closed-source (π0.5), too slow (reasoning models that generate hundreds of tokens before moving), or tied to hardware most labs can’t afford. MolmoAct2 (Fang, Duan et al., Allen AI / UW / Stanford / NVIDIA / MIT, May 2026) tackles all of these problems at once: it’s fully open (weights, code, data), runs at 55.79 Hz, deploys on platforms costing under $6,000, and achieves 97.2% success on LIBERO - beating every open and closed baseline. The secret? Let the robot’s action generator peek into the language model’s brain at every layer, not just the final output. ...
RecursiveMAS: What If Your Multi-Agent System Was Just One Big Recursive Neural Network?
Multi-agent systems built from LLMs have a dirty secret: the agents talk to each other in text. That sounds natural - after all, text is what LLMs do - but it’s catastrophically wasteful. Every time Agent A finishes reasoning and passes its output to Agent B, the system decodes hidden states into tokens, ships those tokens over, and re-encodes them back into hidden states. Information gets destroyed. Gradients die at the text boundary. And you’re paying for a full vocabulary projection at every handoff. The paper “Recursive Multi-Agent Systems” (Yang, Zou, Pan et al., UIUC/Stanford/NVIDIA/MIT, April 2026) asks: what if we just… didn’t do that? What if the agents shared their thoughts directly, in continuous latent space, and the entire system looped like a single recursive neural network? The result is RecursiveMAS - a framework that adds only 0.31% trainable parameters (13.12M) while delivering +8.3% average accuracy, 2.4x inference speedup, and 75.6% token reduction. ...
Tstars-Tryon 1.0: Virtual Try-On as Multi-Image Editing at Taobao Scale
A user opens the Taobao app, picks a model photo, and drops in six reference images: a coat, an inner shirt, pants, shoes, a hat and a bag. They tap a button. Less than seven seconds later, a fresh photo appears — same face, same background, every garment placed correctly with the coat unzipped, revealing the inner shirt. Multiply this by tens of millions of requests per service window, and you get a sense of what Tstars-Tryon 1.0 is solving. This is not the lab-clean VITON-HD setting where one t-shirt gets pasted onto a fashion model in a studio. This is virtual try-on at e-commerce scale, on real-world photos, with stacked outfits and accessories — and it is running today. ...
ClawGUI: A Full-Stack Open-Source Pipeline for GUI Agents
Imagine you want a model that can actually use your phone — tap, swipe, type, navigate apps, book a flight. The model exists. The benchmarks exist. So why, in 2026, can you still not pip install a GUI agent and have it do anything on your real device? The answer is almost never the model. It is the infrastructure around the model: the training environment, the evaluation harness, and the deployment stack, each of which is typically closed, fragmented, or both. ...
SkillClaw: Making LLM Agent Skills Evolve Collectively
Imagine 8 people in a company using the same AI assistant. Each of them hits the same problems — wrong API port, missing file, malformed argument — and each time independently discovers a workaround. The next day, someone else falls into the exact same hole. The system doesn’t learn from its users’ experience. What if a nightly “editorial shift” automatically analyzed all the day’s interactions, drew conclusions, and served improved procedures to everyone the next morning? ...
TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture
Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model (a lightweight language model that quickly proposes candidate tokens) proposes tokens, and a large verifier (the full-size target language model) approves or rejects them in parallel. The verifier processes all candidates in one forward pass and accepts those matching its own distribution, which guarantees output quality identical to standard autoregressive decoding. Same output distribution, fewer expensive forward passes. ...
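The propose-then-verify loop described above can be sketched in a few lines. This is a minimal toy, assuming the standard acceptance rule from the speculative sampling literature (accept a drafted token with probability min(1, p/q), otherwise resample from the leftover target mass); the teaser does not say which variant TAPS uses. The names `sample`, `speculative_step`, `toy_draft`, and `toy_target` are hypothetical, and the "models" are fixed probability tables rather than real networks.

```python
import random

def sample(dist, rng):
    """Draw one token from a {token: weight} mapping (weights need not sum to 1)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]

def speculative_step(draft, target, context, k, rng):
    """Run one draft-then-verify round and return the tokens it emits.

    `draft` and `target` are hypothetical callables mapping a context
    (list of tokens) to a {token: probability} dict.
    """
    # 1. The draft model proposes k tokens autoregressively (cheap).
    ctx = list(context)
    proposals = []
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        proposals.append((tok, q))
        ctx.append(tok)

    # 2. The verifier checks each proposal. A real system scores all
    #    prefixes in a single batched forward pass; this toy loops instead.
    ctx = list(context)
    out = []
    for tok, q in proposals:
        p = target(ctx)
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            ctx.append(tok)
            continue
        # Rejected: resample from the part of the target mass that the
        # draft under-covered, then stop this round.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out

    # 3. Every proposal was accepted: take one extra token from the target.
    out.append(sample(target(ctx), rng))
    return out

# Toy, context-independent distributions (illustrative values only).
def toy_draft(ctx):
    return {"a": 0.3, "b": 0.5, "c": 0.2}

def toy_target(ctx):
    return {"a": 0.6, "b": 0.3, "c": 0.1}

if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step(toy_draft, toy_target, [], k=4, rng=rng))
```

Each round emits between 1 and k + 1 tokens. The accept-or-resample rule is what keeps the emitted tokens distributed according to the target model, so a better-matched draft raises the acceptance rate without changing the output distribution.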
Demystifying Video Reasoning: Models Don't Think in Frames - They Think in Denoising Steps
Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But how do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard. That answer is wrong. The paper “Demystifying Video Reasoning” shows that reasoning in video diffusion models doesn’t unfold across frames. It unfolds across denoising steps - the iterative process that turns noise into a coherent video. The authors call this Chain-of-Steps (CoS), and it fundamentally changes how we understand what these models are doing. ...
Seoul World Model: AI That Generates Video of Real Cities From Street Photos
What if you could fly a virtual camera through any street in a real city - not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos? That’s exactly what the Seoul World Model (SWM) does. The paper “Grounding World Simulation Models in a Real-World Metropolis” introduces a city-scale world model (a neural network that learns the dynamics and visual appearance of an environment, allowing it to ‘imagine’ new views and trajectories it has never seen directly) that generates video grounded in real geography, not in imagined scenes. ...
Lost in Stories: How LLMs Lose the Thread in Long Narratives
Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten. Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time — and the results are sobering. Even the best models produce an average of one consistency error per 10,000 words, and human experts catch only 17% of them. ...
Utonia: One Encoder For All Point Clouds
A LiDAR on a self-driving car, a depth camera in a home robot, a satellite scanner, and a CAD model from a 3D printer: each produces a point cloud, a set of 3D points (x, y, z) representing the shape of an object or scene, where each point can carry additional attributes such as color, normal, and intensity. But these clouds differ radically in density, scale, and geometry. Until now, each domain required its own model. The paper “Utonia: Toward One Encoder for All Point Clouds” breaks this pattern - one encoder, 137M parameters, five domains, and emergent behaviors nobody expected. ...