LLM | MLLog.dev

SkillOpt: Training Agent Skills Like Neural Network Weights - Without Touching the Model

You can’t fine-tune GPT-5.5. You can’t fine-tune Claude. You can’t fine-tune most of the models you actually deploy in production. Yet somehow, we expect these frozen models to handle spreadsheet automation, mathematical olympiads, and multi-step search tasks - all from a hand-written system prompt. The paper “SkillOpt: Executive Strategy for Self-Evolving Agent Skills” (arXiv 2605.23904, May 2026) asks: what if the system prompt itself was the trainable parameter? What if we applied the full discipline of deep learning - learning rates, validation splits, negative feedback - to a natural-language document instead of model weights? The result: SkillOpt wins or ties on all 52 evaluated (model, benchmark, harness) cells, achieving gains of up to +39 absolute points on procedural benchmarks and producing compact skill files of just 300-2,000 tokens that transfer across models, harnesses, and benchmarks. ...

TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture

Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model proposes tokens, and a large verifier approves or rejects them in parallel. Same output distribution, fewer expensive forward passes. (The draft model is a lightweight language model that quickly proposes candidate tokens; the verifier is the full-size target model, which checks all candidates in one forward pass and accepts those matching its own distribution, so output quality is identical to standard autoregressive decoding.) ...
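The propose-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of the greedy variant only (accept a drafted token when it equals the verifier's own top choice), not the sampling-based acceptance rule and not the TAPS method; `speculative_greedy`, the `NextToken` type, and the parameter `k` are hypothetical names introduced for this example.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token context to its next token id.
NextToken = Callable[[List[int]], int]


def speculative_greedy(draft: NextToken, verifier: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens; the result equals verifier-only greedy decoding."""
    out = list(prompt)
    target_len = len(prompt) + max_new
    while len(out) < target_len:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The verifier checks every proposed position. A real system scores
        #    all positions in one batched forward pass; this sketch uses a loop.
        for i, tok in enumerate(proposal):
            expected = verifier(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the verifier's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched, so accept the whole proposal.
            out.extend(proposal)
    return out
```

Because every emitted token is one the verifier would have chosen for the same context, the draft model only affects speed, never the output. The more often the draft agrees with the verifier, the more tokens are accepted per verification round.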

Lost in Stories: How LLMs Lose the Thread in Long Narratives

Ask any language model to write a 10,000-word story. On page one, the hero has blue eyes. By page five — brown. In chapter three it’s Thursday; in chapter six, the same day is suddenly Saturday. A character who died on page seven is chatting away on page ten. Sound familiar? The paper “Lost in Stories: Consistency Bugs in Long Story Generation by LLMs” systematically investigates this problem for the first time — and the results are sobering. Even the best models produce an average of one consistency error per 10,000 words, and human experts catch only 17% of them. ...

SAGE: Your Reasoning Model Knows When to Stop Thinking — You Just Won't Let It

Reasoning models generate long chains of thought to arrive at answers. But what if over half of those “thoughts” are useless noise, and the model has known the answer for a while — it just doesn’t know it can stop? The paper “Does Your Reasoning Model Implicitly Know When to Stop Thinking?” discovers that this is exactly the case, and proposes SAGE — a method that cuts token usage by 40-50% while maintaining or improving accuracy. ...

OPUS: How to Train LLMs 6x Faster by Choosing the Right Data

Training large language models requires astronomical amounts of data and compute. But what if most of that data is redundant, providing no new information because the model already ‘knows’ the patterns it contains? The paper “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration” introduces a framework that achieves comparable results with 6x fewer tokens (the basic units of text that LLMs process: a word, part of a word, or a character) by intelligently selecting what the model should learn from at each step. ...

AI Co-Scientist: Teaching Models to Write Research Plans Better Than Humans

What if AI could not just answer questions, but actively plan scientific research? Not generating text — creating coherent, novel experiment plans that experts rate as better than human-written ones. Sounds like science fiction? Researchers from Meta AI and partners just achieved this. The Problem: How Do You Grade Scientific Creativity? Training models for “closed” tasks (math, coding) is relatively straightforward — the answer is correct or not. But how do you evaluate a research plan? ...

Comp-LLM: When an Army of Experts Beats a Giant – An Analysis of a Revolution in AI Architecture

Have you ever wondered why the latest artificial intelligence models, like GPT-4 or Claude 3 Opus, are so enormous? We’re talking hundreds of billions or even trillions of parameters. These are digital monsters requiring massive amounts of energy and data-center-level infrastructure. For years, AI followed a simple rule: “Bigger means better.” Want a smarter model? Add more layers, more data, more GPUs. But — what if this is a dead end? ...

Cost-Constrained LLM Cascades — Meet C3PO

Imagine you have an army of helpers — several different Large Language Models (LLMs), each capable of handling tasks from simple queries to complex reasoning. But each helper costs something: time, compute, or actual money if you’re using an API. So the question is: Can we orchestrate these models wisely — starting from the cheapest one that might do the job, escalating only when needed — without exceeding a cost budget? ...
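The question above (cheapest model first, escalate only when needed, stay within budget) can be made concrete with a small sketch. This is not the C3PO algorithm, only a generic confidence-threshold cascade under stated assumptions: each model reports a confidence score alongside its answer, and the names `Model`, `cascade`, `budget`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Model:
    name: str
    cost: float  # cost of one call, in any consistent unit
    # Assumption: each model returns (answer, confidence in [0, 1]).
    run: Callable[[str], Tuple[str, float]]


def cascade(models: List[Model], query: str, budget: float,
            threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Try models from cheapest to most expensive; return (answer, total spent)."""
    spent = 0.0
    answer: Optional[str] = None
    for model in sorted(models, key=lambda m: m.cost):
        if spent + model.cost > budget:
            break  # calling this model would exceed the budget
        answer, confidence = model.run(query)
        spent += model.cost
        if confidence >= threshold:
            break  # confident enough, no need to escalate
    return answer, spent
```

If the budget runs out before any model is confident, the sketch returns the last answer it obtained, or `None` when even the cheapest model is unaffordable. How to set the threshold so that cost stays bounded is exactly the kind of question a cascade method has to answer.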

Attention as a Compass – Teaching Reasoning Models to Explore Smarter

Large Language Models (LLMs) are no longer just text generators — they are becoming reasoners, capable of solving mathematical problems, logical puzzles, or planning tasks step by step. One of the key challenges is how to improve the quality of this reasoning. Traditional Reinforcement Learning (RL) rewards only the final outcome, but in complex reasoning it makes more sense to evaluate each intermediate step. This is called process-supervised RL (PSRL). ...

The Anatomy of AI Lies: How Language Models Can Deceive Us

We’re used to hearing that AI sometimes “hallucinates”, making funny or random mistakes. Hallucinations are unintended errors caused by the limits of statistical prediction. But new research goes further: it shows that AI can knowingly choose to lie when deception helps it achieve a goal. The publication “Can LLMs Lie?” takes us into a world where AI acts more like a strategic agent, capable of manipulating information to maximize outcomes. ...