Reasoning models generate long chains of thought to arrive at answers. But what if over half of those “thoughts” are useless noise, and the model has known the answer for a while — it just doesn’t know it can stop? The paper “Does Your Reasoning Model Implicitly Know When to Stop Thinking?” discovers that this is exactly the case, and proposes SAGE — a method that cuts token usage by 40-50% while maintaining or improving accuracy.


The Problem: Thinking That Hurts

Modern reasoning models like DeepSeek-R1 and Qwen3 have been trained to produce long Chain-of-Thought (CoT) sequences before outputting an answer. The problem is that longer thinking doesn’t always mean better.

The authors measure this with the RFCS (Ratio of First Correct Step) metric:

$$\text{RFCS} = \frac{\text{index of the first step containing the correct answer}}{\text{total number of steps}}$$

The results are alarming: over half of correct responses contain massive amounts of redundant steps after the model has already found the solution. For example, DeepSeek-1.5B found the correct answer in 500 tokens, then generated another 452 tokens of redundancy.
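The metric can be sketched in a few lines. This is a toy illustration only: the `rfcs` helper and the substring-based answer check are my assumptions, not the paper's code.

```python
def rfcs(steps, correct_answer):
    """Ratio of First Correct Step: 1-based index of the first reasoning
    step that contains the correct answer, divided by the total number of
    steps. Values well below 1.0 mean the model kept thinking long after
    it had already solved the problem."""
    for i, step in enumerate(steps, start=1):
        if correct_answer in step:
            return i / len(steps)
    return None  # the answer never appears in the chain

# Toy chain: the answer "42" first appears in step 2 of 5.
chain = ["Let x = 6*7.", "So x = 42.", "Check: 6*7 = 42.",
         "Wait, let me verify again: 42.", "Answer: 42."]
print(rfcs(chain, "42"))  # → 0.4
```

An RFCS of 0.4 means 60% of the chain was generated after the answer was already found.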

Longer chains of thought can actually decrease accuracy — the model “overthinks” its own correct solution.


The Key Discovery: Hidden Self-Awareness

The most important finding of this paper is surprisingly simple:

Reasoning models know when to stop thinking — but standard sampling methods hide this ability.

How was this discovered? The authors compared two confidence measures:

Next-Token Probability (ϕ)

The standard measure — how confident the model is about the next token:

$$\phi(y_i) = \log \pi_\theta(y_i | y_{<i}, x)$$

When the model looks at the termination token </think> (the special token that ends the reasoning phase and begins generating the final answer), this measure shows low confidence. The model “doesn’t know” whether it should stop.

Cumulative Log-Probability (Φ)

A new measure — the average confidence across the entire path so far:

$$\Phi(y_{\leq k}) = \frac{1}{k} \sum_{i=1}^{k} \log \pi_\theta(y_i | y_{<i}, x)$$

Under this measure, the </think> token consistently ranks first among candidates. The model is confident it should stop; standard sampling just can’t see it.
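The contrast between the two measures can be shown with made-up numbers. The probabilities below are hypothetical, chosen only to illustrate why a path-level average can rank a `</think>`-terminated path highest even when the next-token probability of `</think>` is low.

```python
import math

def next_token_logprob(logprobs, token):
    # phi: log-probability of one candidate as the *next* token
    return logprobs[token]

def cumulative_logprob(path_logprobs):
    # Phi: mean log-probability over the whole path generated so far
    return sum(path_logprobs) / len(path_logprobs)

# Hypothetical next-token distribution at the current step:
step_logprobs = {"</think>": math.log(0.05), "Wait": math.log(0.40)}

# Two hypothetical paths: one ends with the unlikely-looking </think>,
# the other keeps "thinking" through individually likelier tokens.
path_a = [math.log(0.9), math.log(0.8), math.log(0.05)]  # ends in </think>
path_b = [math.log(0.3), math.log(0.2), math.log(0.40)]  # keeps thinking

print(next_token_logprob(step_logprobs, "</think>"))          # low phi
print(cumulative_logprob(path_a) > cumulative_logprob(path_b))  # → True
```

Under ϕ the stop token looks like a bad choice; under Φ the path that stops wins.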


SAGE: Letting the Model Stop

The Algorithm Step by Step

SAGE (Self-Aware Guided Efficient Reasoning) is a new sampling paradigm:

1. Exploration — at each step, maintain m candidate sequences. For each, generate 2m candidate tokens.

2. Selection — score each sequence using Φ (cumulative log-probability). Keep the top-m best.

3. Confident Termination — when the </think> token appears among top candidates with a high rank, end reasoning.
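Steps 1–2 can be sketched as a small beam-style loop. I read the expansion as producing a pool of 2m candidates (two continuations per beam), which matches the h ∈ [1, 2m] range below; that reading, the `toy_model` distribution, and the helper names are my assumptions, not the paper's implementation.

```python
import heapq
import math

def sage_step(beams, logprob_fn, m):
    """One SAGE exploration/selection step (illustrative sketch).

    beams: list of (tokens, logprob_sum) pairs. Each of the m beams is
    expanded with its 2 most likely next tokens (2m candidates total);
    the top-m survive, ranked by Phi = mean log-probability of the path.
    """
    candidates = []
    for tokens, lp_sum in beams:
        dist = logprob_fn(tokens)  # {token: logprob} for this prefix
        for tok, lp in heapq.nlargest(2, dist.items(), key=lambda kv: kv[1]):
            candidates.append((tokens + [tok], lp_sum + lp))
    # Selection: rank the 2m candidates by cumulative (length-averaged)
    # log-probability, i.e. the Phi measure from the previous section.
    candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
    return candidates[:m]

def toy_model(prefix):
    # Hypothetical fixed distribution, standing in for a real LRM.
    return {"a": math.log(0.5), "b": math.log(0.3), "</think>": math.log(0.2)}

beams = [(["x"], math.log(0.9)), (["y"], math.log(0.6))]
beams = sage_step(beams, toy_model, m=2)
print([tokens for tokens, _ in beams])  # the two highest-Phi paths survive
```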

Termination Criterion

The acceptance tolerance is defined by parameter h:

$$TR = \frac{h}{2m}, \quad h \in [1, 2m]$$

The model stops reasoning when </think> falls within the top-h candidates — a signal that the model is confident about stopping.
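The check itself is tiny. This is a sketch; `should_stop` and the example ranking are illustrative, not the paper's code.

```python
def should_stop(ranked_tokens, h, stop_token="</think>"):
    """Confident termination: stop once the reasoning-end token ranks
    within the top-h of the 2m candidates (tolerance TR = h / 2m)."""
    return stop_token in ranked_tokens[:h]

# 2m = 8 candidate tokens already ranked by Phi; tolerance h = 3 (TR = 0.375)
ranked = ["So", "Thus", "</think>", "Wait", "Let", "Then", "Now", "Hmm"]
print(should_stop(ranked, h=3))  # → True: </think> is ranked 3rd
print(should_stop(ranked, h=2))  # → False: not yet confident enough
```

Larger h stops earlier (more token savings); smaller h is more conservative.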

Key Observation

“As the exploration space expands during reasoning, LRM is increasingly capable of identifying precise and compact reasoning paths with high confidence.”

The more options we explore, the more confidently the model identifies the right stopping point.


SAGE-RL: Teaching the Model Efficient Thinking

Better sampling alone isn’t enough. SAGE-RL integrates efficient reasoning patterns into the model itself using reinforcement learning:

Training Procedure

For a group of G=8 responses:

  • r=2 generated via SAGE(m,r) — short, efficient chains
  • G-r=6 generated via standard sampling — random, often longer

The model learns from the advantage signal: short, correct SAGE chains receive a high advantage relative to the group average, teaching the model to produce concise reasoning.
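A toy sketch of how such a group-relative advantage could be computed, GRPO-style. The mean-centering with std-normalization is the standard GRPO recipe, assumed here; the reward values are hypothetical.

```python
def group_advantages(rewards):
    """Group-relative advantage (GRPO-style): each response's reward minus
    the group mean, scaled by the group's standard deviation."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Hypothetical G = 8 group: 2 SAGE responses (both correct, reward 1)
# plus 6 standard samples (only 1 correct). Binary 0/1 reward.
rewards = [1, 1, 0, 1, 0, 0, 0, 0]
adv = group_advantages(rewards)
print(adv[0] > 0, adv[2] < 0)  # correct responses get positive advantage
```

Because the short SAGE chains are also correct, they collect the positive advantage with fewer tokens, which is exactly the efficiency pressure SAGE-RL relies on.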

Objective Function

$$J(\theta) = \mathbb{E}\left[\frac{1}{G}\left(\sum_{i \in \text{SAGE}} + \sum_{i \in \text{Random}}\right)\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\left(w_{i,t}(\theta)\hat{A}_{i,t},\ \text{clip}(w_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\right)\right]$$

where the importance ratio is:

$$w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | x, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t} | x, y_{i,<t})}$$
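Per token, the clipped min behaves like the standard PPO/GRPO surrogate. A minimal numeric sketch (the values are illustrative):

```python
def clipped_token_objective(w, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token: the minimum of the
    raw and the clipped importance-weighted advantage."""
    clipped_w = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * advantage, clipped_w * advantage)

# With a positive advantage, pushing w past 1+eps earns no extra credit:
print(clipped_token_objective(w=1.5, advantage=2.0))  # 1.2 * 2.0 = 2.4
```

The clip keeps the policy from drifting too far from π_old in a single update, regardless of how large the advantage of a short SAGE chain is.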

Reward

Simple binary 0/1 reward — correct answer or not. No separate reward model needed. The efficiency signal comes naturally from the mix of SAGE + random samples.


Results: Fewer Tokens, Better Accuracy

DeepSeek-1.5B with SAGE-GRPO

| Benchmark | Accuracy | Change | Tokens | Savings |
|-----------|----------|--------|--------|---------|
| MATH-500  | 84.8%    | +1.6%  | 2,915  | -39%    |
| AIME 2024 | 28.8%    | +3.7%  | 7,243  | -41%    |
| AIME 2025 | 26.5%    | +5.6%  | 7,479  | -36%    |

DeepSeek-7B with SAGE-GRPO

| Benchmark | Accuracy | Change | Tokens | Savings |
|-----------|----------|--------|--------|---------|
| MATH-500  | 93.0%    | +1.4%  | 2,141  | -45%    |
| AIME 2024 | 55.3%    | +3.4%  | 6,422  | -43%    |

Qwen3-8B with SAGE-GSPO

| Benchmark | Accuracy | Change | Tokens | Savings |
|-----------|----------|--------|--------|---------|
| AIME 2025 | 66.0%    | -0.7%  | 9,183  | -50%    |

The pattern is clear: 40-50% fewer tokens with equal or better accuracy.


What Happens After SAGE-RL Training?

The RFCS metric shows a dramatic behavior change:

  • Before SAGE-RL: The model frequently continued reasoning long after finding the correct answer
  • After SAGE-RL: The model stops almost immediately after the correct solution

The model literally learned to trust itself and not “rework” the problem endlessly.


Implementation Details

| Parameter         | Value                 |
|-------------------|-----------------------|
| Framework         | verl (HybridFlow)     |
| Learning rate     | 1e-6, cosine warmup   |
| KL regularization | β = 0.001             |
| Batch size        | 32 (8 GPUs)           |
| Max context       | 9,216 tokens          |
| Training          | 600 steps, Adam       |
| Temperature       | T = 1.0, top-p = 0.95 |
| Reward            | Binary 0/1            |

Summary

SAGE uncovers something fundamental: reasoning models already know when to stop thinking — you just need to let them. The key is shifting perspective from next-token probability (ϕ) to cumulative path probability (Φ).

Combined with reinforcement learning (SAGE-RL), this approach achieves:

  • 40-50% token reduction on mathematical benchmarks
  • Maintained or improved accuracy — less thinking with equal or better results
  • Simple implementation — binary reward, standard RL frameworks

The implication is profound: we don’t need models that think more. We need models that know when to stop.