Optimization

SAGE: Your Reasoning Model Knows When to Stop Thinking — You Just Won't Let It

Reasoning models generate long chains of thought to arrive at answers. But what if over half of those “thoughts” are useless noise, and the model has known the answer for a while — it just doesn’t know it can stop? The paper “Does Your Reasoning Model Implicitly Know When to Stop Thinking?” discovers that this is exactly the case, and proposes SAGE — a method that cuts token usage by 40-50% while maintaining or improving accuracy. ...

OPUS: How to Train LLMs 6x Faster by Choosing the Right Data

Training large language models requires astronomical amounts of data and compute. But what if most of that data is redundant? The paper “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration” introduces a framework that achieves comparable results with 6x fewer tokens by intelligently selecting what the model should learn from at each step. ...

HyDRA: Teaching Your Phone to Understand Images Without Breaking the Bank

Imagine teaching your phone to recognize photos of dishes and suggest recipes. The catch? Models capable of this are massive and require the computational power of a Google data center. HyDRA is a clever method that adapts such models for mobile devices without going bankrupt and without melting the planet.

The Problem: An Elephant in Your Phone

Vision Language Models (VLMs) are AI models that understand both images and text simultaneously. You can show them a photo and ask “what do you see?” or “how do I fix this?”. Sounds great, but there’s a catch. ...

Cost-Constrained LLM Cascades — Meet C3PO

Imagine you have an army of helpers — several different Large Language Models (LLMs), each capable of handling tasks from simple queries to complex reasoning. But each helper costs something: time, compute, or actual money if you’re using an API. So the question is: Can we orchestrate these models wisely — starting from the cheapest one that might do the job, escalating only when needed — without exceeding a cost budget? ...
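To make the question concrete, here is a minimal sketch of a budget-aware cascade. It is not the C3PO method itself: the `Model` class, the per-call `cost`, the self-reported confidence score, and the fixed `threshold` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Model:
    """Hypothetical wrapper around one LLM in the cascade."""
    name: str
    cost: float  # assumed fixed cost per call (time, compute, or money)
    answer: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])


def run_cascade(
    query: str,
    models: List[Model],
    budget: float,
    threshold: float = 0.8,  # assumed confidence needed to stop escalating
) -> Tuple[Optional[str], float]:
    """Try models from cheapest to most expensive, stopping early when possible."""
    spent = 0.0
    best: Optional[str] = None
    for model in sorted(models, key=lambda m: m.cost):
        # Every remaining model costs at least this much, so stop here.
        if spent + model.cost > budget:
            break
        answer, confidence = model.answer(query)
        spent += model.cost
        best = answer
        # Escalate only if the current model is not confident enough.
        if confidence >= threshold:
            break
    return best, spent
```

The function returns the last answer obtained together with the total cost spent, or `None` and `0.0` if even the cheapest model does not fit the budget.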

SNOO – Old-School Nesterov Momentum in a New Jacket: Making Big Models Learn Faster

Imagine you’re training a massive language model — the kind that takes weeks to learn even the basics. Every training step costs time, electricity, and a small fortune. In such a world, even a tiny bump in efficiency feels like finding a way to get free coffee at work — small, but sweet. Enter SNOO – Step-K Nesterov Outer Optimizer, a clever idea that takes Nesterov momentum, a decades-old optimization trick, and applies it in a new place — outside the normal training loop. The result? Models that learn faster and more smoothly, without much extra computational cost. ...
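A rough sketch of what "momentum outside the training loop" could look like is shown below. The teaser only says that Nesterov momentum is applied outside the normal loop, so the details here are assumptions: the inner optimizer runs `k` ordinary steps, the difference between the starting and ending weights is treated as a pseudo-gradient, and a Nesterov-momentum update is applied to that difference. The function name, `inner_step`, and the default hyperparameters are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def outer_nesterov_train(
    params: Vector,
    inner_step: Callable[[Vector], Vector],  # one ordinary training step (assumed given)
    outer_rounds: int,
    k: int,
    outer_lr: float = 1.0,   # assumed outer learning rate
    momentum: float = 0.9,   # assumed outer momentum coefficient
) -> Vector:
    """Run k inner steps per round, then apply a Nesterov update to the slow weights."""
    slow = list(params)
    velocity = [0.0] * len(slow)
    for _ in range(outer_rounds):
        # Inner loop: the usual optimizer, started from the slow weights.
        fast = list(slow)
        for _ in range(k):
            fast = inner_step(fast)
        # Pseudo-gradient: how far the inner loop moved the weights.
        delta = [s - f for s, f in zip(slow, fast)]
        # Momentum buffer update, then a Nesterov-style look-ahead step.
        velocity = [momentum * v + d for v, d in zip(velocity, delta)]
        slow = [
            s - outer_lr * (d + momentum * v)
            for s, d, v in zip(slow, delta, velocity)
        ]
    return slow
```

With `momentum=0.0` and `outer_lr=1.0` the outer update simply adopts the inner loop's weights, so the sketch reduces to plain training; the extra cost per round is one subtraction and one momentum update over the parameters.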