Training large language models requires astronomical amounts of data and compute. But what if most of that data is redundant, providing no new information to a model that already ‘knows’ the patterns it contains? The paper “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration” introduces a framework that achieves comparable results with 6x fewer tokens by intelligently selecting what the model should learn from at each step.
The Problem: Not All Data Is Created Equal
Current LLM training pipelines treat data selection as a one-time preprocessing step: filter out low-quality content, deduplicate, and feed everything to the model. But this static approach ignores a crucial insight: the value of a data point changes as the model learns.
A sample that’s highly informative early in training might be redundant later. Conversely, a complex example might be useless initially but invaluable once the model has developed basic capabilities. OPUS addresses this by making data selection dynamic and iteration-aware.
Mathematical Foundations
Loss Function
OPUS operates on the standard language modeling objective. For a sequence z = (x₁, x₂, …, xₗ), the per-sequence negative log-likelihood is:
$$\mathcal{L}(z; \theta) = -\frac{1}{L} \sum_{i=1}^{L} \log p_\theta(x_i | x_{<i})$$
The expected loss over a data distribution Q is:
$$\mathcal{L}(\mathcal{Q}; \theta) := \mathbb{E}_{z \sim \mathcal{Q}}[\mathcal{L}(z; \theta)]$$
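To make the objective concrete, here is a minimal NumPy sketch (illustrative only, not taken from the paper) that computes the mean negative log-likelihood from per-token probabilities:

```python
import numpy as np

def sequence_nll(token_probs):
    """Per-sequence loss L(z; theta): the mean negative log-likelihood
    over the L tokens, given p_theta(x_i | x_<i) for each position i."""
    token_probs = np.asarray(token_probs, dtype=float)
    return -np.mean(np.log(token_probs))

# A model that assigns probability 0.5 to every observed token
# incurs a loss of log(2) per token.
loss = sequence_nll([0.5, 0.5, 0.5, 0.5])
```

The expectation over Q in the second equation is then just the average of this quantity over sequences drawn from the data distribution.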
Utility Definition
The core innovation of OPUS is defining utility in the optimizer-induced update space. The base utility of a selected batch S at iteration t measures the reduction in validation loss:
$$U^{(t)}(\mathcal{S}) := \mathcal{L}(\mathcal{D}_{val}; \theta_t) - \mathcal{L}(\mathcal{D}_{val}; \theta_{t+1}(\mathcal{S}))$$
For individual sample selection, OPUS computes the marginal utility — how much adding sample z to the current batch B̂ₜ improves the objective:
$$U_z^{(t)} := U^{(t)}(\hat{\mathcal{B}}_t \cup \{z\}) - U^{(t)}(\hat{\mathcal{B}}_t)$$
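A tiny worked example may help. The sketch below (toy quadratic validation loss and hand-picked gradients, not the paper's code) evaluates the batch utility as the drop in validation loss after one SGD step, and the marginal utility of a candidate as the difference of two such utilities:

```python
import numpy as np

theta = np.array([1.0, -1.0])   # current parameters theta_t
eta = 0.1                       # learning rate eta_t

def val_loss(params):
    """Toy validation loss L(D_val; theta)."""
    return 0.5 * np.sum(params ** 2)

def utility(sample_grads):
    """U^(t)(S): validation-loss reduction after one SGD step on S."""
    step = -eta * np.sum(sample_grads, axis=0)
    return val_loss(theta) - val_loss(theta + step)

batch = [np.array([0.5, 0.0])]   # gradients of the current batch B_t
z = np.array([0.0, -0.5])        # gradient of candidate sample z

# Marginal utility U_z^(t) = U(B_t ∪ {z}) - U(B_t)
marginal = utility(batch + [z]) - utility(batch)
```

Here the candidate's gradient points along a direction the batch has not yet covered, so its marginal utility is positive.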
The OPUS Framework
1. Optimizer-Induced Preconditioner
Different optimizers induce different update directions. OPUS captures this through a general update form:
$$\Delta\theta_t(\hat{\mathcal{B}}_t) = -\eta_t \sum_{z \in \hat{\mathcal{B}}_t} \mathbf{P}_t \nabla\mathcal{L}(z; \theta_t)$$
where Pₜ is the optimizer-specific preconditioner and ηₜ is the learning rate.
For AdamW, the preconditioner takes the form:
$$\mathbf{P}_t^{AdamW} := C_t \cdot \text{Diag}\left(\frac{1}{\sqrt{\hat{v}_{t-1}} + \epsilon}\right)$$
where Cₜ := αₜ(1-β₁)/(1-β₁ᵗ) and v̂ₜ₋₁ is the bias-corrected second moment estimate.
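As a sketch (hyperparameter names assumed for illustration, not taken from the paper's code), the diagonal AdamW preconditioner can be formed and applied elementwise:

```python
import numpy as np

def adamw_preconditioner(v_hat_prev, alpha_t, beta1, t, eps=1e-8):
    """P_t^AdamW = C_t * Diag(1 / (sqrt(v_hat_{t-1}) + eps)),
    with C_t = alpha_t * (1 - beta1) / (1 - beta1 ** t).
    The diagonal is returned as a vector."""
    c_t = alpha_t * (1.0 - beta1) / (1.0 - beta1 ** t)
    return c_t / (np.sqrt(v_hat_prev) + eps)

# Applying the diagonal preconditioner is an elementwise product:
v_hat = np.array([1.0, 4.0])     # second-moment estimates
grad = np.array([2.0, 2.0])
update = adamw_preconditioner(v_hat, alpha_t=1.0, beta1=0.9, t=1, eps=0.0) * grad
```

Coordinates with a larger second-moment estimate receive proportionally smaller updates, which is exactly the adaptive scaling the preconditioner encodes.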
For the Muon optimizer, the preconditioner for layer ℓ is:
$$\mathbf{P}_{t,\ell}^{Muon} := \kappa_t \mathbf{S}_{t,\ell}$$
where Sₜ,ₗ = aI + bAₜ,ₗ + cA²ₜ,ₗ is a polynomial of the momentum correlation matrix.
2. Scoring with Alignment and Redundancy
Using a first-order Taylor approximation around the current parameters:
$$\mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t + \Delta\theta_t(\{z\})) \approx \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t) + \nabla_\theta \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t)^\top \Delta\theta_t(\{z\})$$
The linearized validation gradient evolves as:
$$\nabla_\theta \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t) \approx \mathbf{g}_{val}^{(t)} + \mathbf{H}_{val}^{(t)} \Delta\theta_t(\hat{\mathcal{B}}_t)$$
With an isotropic Hessian approximation (Hᵥₐₗ ≈ I), the final utility score decomposes into two interpretable terms:
$$U_z^{(t)} \approx \underbrace{\eta_t \langle \mathbf{u}_z^{(t)}, \mathbf{g}_{proxy}^{(t)} \rangle}_{\text{Alignment}} - \underbrace{\eta_t^2 \langle \mathbf{u}_z^{(t)}, \mathbf{G}^{(t)} \rangle}_{\text{Redundancy Penalty}}$$
where:
- u_z⁽ᵗ⁾ = Pₜ∇L(z; θₜ) is the optimizer-induced update for sample z
- gₚᵣₒₓᵧ⁽ᵗ⁾ is the proxy gradient direction (from validation data)
- G⁽ᵗ⁾ := Σⱼ uⱼ⁽ᵗ⁾ is the accumulated effective direction (sum over batch samples)
The alignment term rewards samples that move in the same direction as the validation objective. The redundancy penalty discourages selecting samples similar to those already in the batch.
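The decomposition can be illustrated with toy vectors (all numbers invented for illustration; this is not the paper's code):

```python
import numpy as np

def utility_score(u_z, g_proxy, G, eta):
    """U_z ≈ eta * <u_z, g_proxy> - eta**2 * <u_z, G>."""
    alignment = eta * np.dot(u_z, g_proxy)
    redundancy = eta ** 2 * np.dot(u_z, G)
    return alignment - redundancy

eta = 0.1
g_proxy = np.array([1.0, 1.0])   # proxy (validation) gradient direction
G = np.array([5.0, 0.0])         # batch so far already covers dimension 0

# Two candidates equally aligned with the proxy gradient:
overlap = utility_score(np.array([1.0, 0.0]), g_proxy, G, eta)  # duplicates the batch
novel = utility_score(np.array([0.0, 1.0]), g_proxy, G, eta)    # adds a new direction
```

Even though both candidates align equally with the proxy gradient, the one pointing along a direction the batch already covers is penalized and scores lower.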
Efficient Computation: Ghost Gradients + CountSketch
Ghost Gradient Factorization
Computing full gradients for every candidate would be prohibitive. OPUS exploits the structure of linear layers, where the per-sample gradient can be factored as an outer product:
$$\nabla_{\mathbf{W}_r}\mathcal{L}(z; \theta_t) = \mathbf{a}_r^{(z)} \otimes \mathbf{b}_r^{(z)}$$
where aᵣ⁽ᶻ⁾ is the input activation and bᵣ⁽ᶻ⁾ is the backpropagated error signal. This “ghost” representation avoids materializing the full gradient tensor.
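The payoff of the factorization is that Frobenius inner products between per-sample gradients reduce to products of small dot products. A sketch with toy shapes (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
a1, b1 = rng.standard_normal(64), rng.standard_normal(32)  # sample 1: activation, error
a2, b2 = rng.standard_normal(64), rng.standard_normal(32)  # sample 2: activation, error

# Naive route: materialize both 32x64 gradient matrices, then compare.
g1 = np.outer(b1, a1)
g2 = np.outer(b2, a2)
naive = np.sum(g1 * g2)

# Ghost route: <a1⊗b1, a2⊗b2>_F = <a1, a2> * <b1, b2>, O(d) instead of O(d^2).
ghost = np.dot(a1, a2) * np.dot(b1, b2)
```

Both routes give the same number, but the ghost route never builds the gradient matrices.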
CountSketch Compression
To handle high-dimensional parameter spaces, OPUS applies a CountSketch projection. The implicit sketched feature for layer r is:
$$\phi^{(t,r)}(z) = \Pi_r(\mathbf{P}_{t,r}(\mathbf{a}_r^{(z)} \otimes \mathbf{b}_r^{(z)}))$$
where Πᵣ is the CountSketch projection operator.
The utility can then be approximated efficiently:
$$U_z^{(t)} \approx \eta_t \sum_{r \in \mathcal{R}} \langle \phi^{(t,r)}(z), \psi_{proxy}^{(t,r)} \rangle - \eta_t^2 \sum_{r \in \mathcal{R}} \langle \phi^{(t,r)}(z), \Phi^{(t,r)} \rangle$$
where ψₚᵣₒₓᵧ⁽ᵗ,ʳ⁾ is the sketched proxy gradient and Φ⁽ᵗ,ʳ⁾ = Σⱼ φ⁽ᵗ,ʳ⁾(zⱼ) is the running sketch history.
This reduces the overhead to just 4.7% of training compute.
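A minimal CountSketch implementation (illustrative; the bucket count and dimensions are invented) shows how inner products approximately survive the projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 1_024                       # ambient and sketch dimensions
bucket = rng.integers(0, k, size=d)        # hash h: coordinate -> bucket
sign = rng.choice([-1.0, 1.0], size=d)     # sign hash s: coordinate -> ±1

def countsketch(x):
    """Project x into R^k: out[h(i)] += s(i) * x[i]."""
    out = np.zeros(k)
    np.add.at(out, bucket, sign * x)
    return out

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)       # a vector correlated with x
exact = np.dot(x, y)
approx = np.dot(countsketch(x), countsketch(y))
```

The approximation error shrinks as k grows; OPUS relies on this to score candidates in the sketched space rather than in the full parameter space.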
Boltzmann Sampling for Diversity
Greedy selection of the highest-utility samples would cause mode collapse: the batch would concentrate on a narrow set of near-identical samples. OPUS instead uses soft probabilistic selection based on the Boltzmann distribution:
$$p_z^{(t)} \propto \exp\left(\frac{U_z^{(t)}}{\tau}\right)$$
The temperature τ > 0 controls the exploration-exploitation tradeoff:
- Low τ: Nearly deterministic selection of top samples
- High τ: More uniform sampling, greater diversity
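The two regimes can be seen in a short sketch (invented utility values):

```python
import numpy as np

def boltzmann_probs(utilities, tau):
    """p_z ∝ exp(U_z / tau), computed with the usual max-subtraction
    trick for numerical stability."""
    u = np.asarray(utilities, dtype=float)
    logits = (u - u.max()) / tau
    p = np.exp(logits)
    return p / p.sum()

utils = [1.0, 0.5, 0.1]
cold = boltzmann_probs(utils, tau=0.01)   # nearly one-hot on the best sample
hot = boltzmann_probs(utils, tau=100.0)   # close to uniform
```

At low temperature almost all probability mass lands on the top-utility sample; at high temperature the distribution flattens toward uniform, admitting more diverse selections.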
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│                      OPUS Pipeline                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Data Pool ──► Ghost Gradient ──► CountSketch           │
│                Factorization      Compression           │
│                                        │                │
│                                        ▼                │
│              ┌─────────────────────────┐                │
│              │  Utility Scoring:       │                │
│              │  U = Alignment -        │                │
│              │      Redundancy Penalty │                │
│              └─────────────────────────┘                │
│                           │                             │
│                           ▼                             │
│        Boltzmann Sampling ──► Training Batch            │
│                                                         │
└─────────────────────────────────────────────────────────┘
Experimental Results
GPT-2 Scale Experiments
On FineWeb datasets with GPT-2 Large and GPT-2 XL:
- OPUS consistently outperformed industrial baselines
- Matched or exceeded full-data training with fewer tokens
- Showed stable improvements across model sizes
Qwen3-8B Domain Adaptation
The most striking result from the paper:
“In continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens”
| Method | Tokens Used | Performance |
|---|---|---|
| Full training | 3B tokens | Baseline |
| OPUS selection | 0.5B tokens | Superior |
This represents a 6x efficiency improvement—the same or better results with one-sixth of the data.
Scaling Properties
OPUS benefits increase with:
- Larger model sizes
- Longer training runs
- More heterogeneous data sources
Why This Matters
For Practitioners
- Reduced costs: 6x fewer tokens means 6x less compute for the same results
- Faster iteration: Test hypotheses and train new models more quickly
- Better use of limited data: Especially valuable for domain-specific applications
For Research
- Principled framework: Moves beyond heuristic filtering to optimization-based selection
- Theoretical grounding: Update-space utility provides a clear objective
- Generalizable approach: Works across optimizers (AdamW, Muon, etc.) and model architectures
For the Field
Data selection is becoming the new frontier of LLM efficiency. As raw scaling hits diminishing returns, smarter data utilization becomes crucial. OPUS provides a blueprint for this future.
Summary
OPUS demonstrates that intelligent, dynamic data selection can dramatically improve LLM training efficiency. The key mathematical insights are:
- Utility in update space: Scoring that accounts for alignment with target and redundancy penalty
- Ghost + CountSketch: Efficient computation via gradient factorization and sketching
- Boltzmann sampling: Maintaining diversity through probabilistic selection
The result: 6x efficiency gains with only 4.7% computational overhead.
Links
- Based on the publication arXiv:2602.05400