Imagine you want a model that can actually use your phone — tap, swipe, type, navigate apps, book a flight. The model exists. The benchmarks exist. So why, in 2026, can you still not pip install a GUI agent and have it do anything on your real device? The answer is almost never the model. It is the infrastructure around the model: the training environment, the evaluation harness, and the deployment stack, each of which is typically closed, fragmented, or both. ...
SkillClaw: Making LLM Agent Skills Evolve Collectively
Imagine 8 people in a company using the same AI assistant. Each of them hits the same problems — wrong API port, missing file, malformed argument — and each time independently discovers a workaround. The next day, someone else falls into the exact same hole. The system doesn’t learn from its users’ experience. What if a nightly “editorial shift” automatically analyzed all the day’s interactions, drew conclusions, and served improved procedures to everyone the next morning? ...
TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture
Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model (a lightweight language model that quickly proposes candidate tokens) drafts a run of tokens, and a large verifier (the full-size target model, which checks all the candidates in one forward pass and accepts those matching its own distribution) approves or rejects them in parallel. Same output distribution, fewer expensive forward passes. ...
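The draft-then-verify loop is easy to see in miniature. Below is a minimal sketch of the *greedy* variant, with toy callables standing in for the draft and verifier models; the lossless sampling version used in practice (and presumably in the paper) instead does rejection sampling over the two models' token distributions. All names here (`speculative_step`, `draft_next`, `target_next`) are illustrative, not from the paper.

```python
def speculative_step(target_next, draft_next, seq, k=4):
    """One round of greedy speculative decoding.

    target_next / draft_next: callables mapping a token sequence to the
    next token (toy stand-ins for the verifier and draft models).
    Returns the tokens to append: the accepted draft prefix plus one
    verifier token (a correction on mismatch, or a bonus token if the
    whole draft was accepted).
    """
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    cur = list(seq)
    for _ in range(k):
        t = draft_next(cur)
        proposal.append(t)
        cur.append(t)

    # 2. The verifier scores all k positions "in parallel": for each
    #    prefix, what would the target model emit next?
    accepted = []
    for t in proposal:
        expected = target_next(list(seq) + accepted)
        if t == expected:
            accepted.append(t)          # draft matched the target
        else:
            accepted.append(expected)   # first mismatch: take the target's token
            return accepted
    # All k draft tokens accepted: the verifier's pass yields one bonus token.
    accepted.append(target_next(list(seq) + accepted))
    return accepted
```

Because every accepted token is, by construction, exactly what the verifier would have emitted greedily, the output is identical to plain greedy decoding with the target model alone; the draft only changes how many tokens each expensive verifier pass yields.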
Demystifying Video Reasoning: Models Don't Think in Frames - They Think in Denoising Steps
Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But how do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard. That answer is wrong. The paper “Demystifying Video Reasoning” shows that reasoning in video diffusion models doesn’t unfold across frames. It unfolds across denoising steps - the iterative process that turns noise into a coherent video. The authors call this Chain-of-Steps (CoS), and it fundamentally changes how we understand what these models are doing. ...
Seoul World Model: AI That Generates Video of Real Cities From Street Photos
What if you could fly a virtual camera through any street in a real city — not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos? That’s exactly what the Seoul World Model (SWM) does. The paper “Grounding World Simulation Models in a Real-World Metropolis” introduces a city-scale world model (a neural network that learns the dynamics and visual appearance of an environment, allowing it to ‘imagine’ views and trajectories it has never seen directly) that generates video grounded in real geography — not in imagined scenes. ...
Lost in Stories: How LLMs Lose the Thread in Long Narratives
Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten. Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time — and the results are sobering. Even the best models produce an average of one consistency error per 10,000 words, and human experts catch only 17% of them. ...
Utonia: One Encoder For All Point Clouds
A LiDAR on a self-driving car, a depth camera in a home robot, a satellite scanner, and a CAD model from a 3D printer — each produces a point cloud (a set of 3D points (x, y, z) representing the shape of an object or scene, where each point can carry attributes like color, normal, or intensity), but with radically different density, scale, and geometry. Until now, each domain required its own model. The paper “Utonia: Toward One Encoder for All Point Clouds” breaks this pattern — one encoder, 137M parameters, five domains, and emergent behaviors nobody expected. ...
SAGE: Your Reasoning Model Knows When to Stop Thinking — You Just Won't Let It
Reasoning models generate long chains of thought to arrive at answers. But what if over half of those “thoughts” are useless noise, and the model has known the answer for a while — it just doesn’t know it can stop? The paper “Does Your Reasoning Model Implicitly Know When to Stop Thinking?” discovers that this is exactly the case, and proposes SAGE — a method that cuts token usage by 40-50% while maintaining or improving accuracy. ...
When GPT Discovers Physics: A Breakthrough in Gluon Theory
What happens when you ask artificial intelligence to solve a problem that theoretical physicists have worked on for decades? In a new publication from a team at Princeton, Harvard, Cambridge, and OpenAI, GPT-5.2 Pro (the latest version of OpenAI’s language model, capable of advanced mathematical reasoning and formulating scientific hypotheses) was the first to propose a key formula describing gluon scattering — a formula that was then proven by another internal OpenAI model and verified by scientists by hand. ...
OPUS: How to Train LLMs 6x Faster by Choosing the Right Data
Training large language models requires astronomical amounts of data and compute. But what if most of that data is redundant — providing no new information, because the model already ‘knows’ the patterns it contains? The paper “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration” introduces a framework that achieves comparable results with 6x fewer tokens (the basic units of text LLMs process — a token can be a word, part of a word, or a character) by intelligently selecting what the model should learn from at each step. ...