Training large language models requires astronomical amounts of data and compute. But what if most of that data is redundant, providing no new information to a model that already ‘knows’ the patterns it contains? The paper “OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration” introduces a framework that achieves comparable results with 6x fewer tokens by intelligently selecting what the model should learn from at each step.
The Problem: Not All Data Is Created Equal
Current LLM training pipelines treat data selection as a one-time preprocessing step: filter out low-quality content, deduplicate, and feed everything to the model. But this static approach ignores a crucial insight: the value of a data point changes as the model learns.
A sample that’s highly informative early in training might be redundant later. Conversely, a complex example might be useless initially but invaluable once the model has developed basic capabilities. OPUS addresses this by making data selection dynamic and iteration-aware.
Mathematical Foundations
Loss Function
OPUS operates on the standard language modeling objective. For a sequence z = (x₁, x₂, …, xₗ), the per-sequence negative log-likelihood is:
$$\mathcal{L}(z; \theta) = -\frac{1}{L} \sum_{i=1}^{L} \log p_\theta(x_i | x_{<i})$$
The expected loss over a data distribution Q is:
$$\mathcal{L}(\mathcal{Q}; \theta) := \mathbb{E}_{z \sim \mathcal{Q}}[\mathcal{L}(z; \theta)]$$
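To make the objective concrete, here is a minimal NumPy sketch (illustrative only, not taken from the paper) that computes the mean negative log-likelihood from per-token probabilities:

```python
import numpy as np

def sequence_nll(token_probs):
    """Per-sequence loss L(z; theta): the mean negative log-likelihood
    over the L tokens, given p_theta(x_i | x_<i) for each position i."""
    token_probs = np.asarray(token_probs, dtype=float)
    return -np.mean(np.log(token_probs))

# A model that assigns probability 0.5 to every observed token
# incurs a loss of log(2) per token.
loss = sequence_nll([0.5, 0.5, 0.5, 0.5])
```

The expectation over Q in the second equation is then just the average of this quantity over sequences drawn from the data distribution.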
Utility Definition
The core innovation of OPUS is defining utility in the optimizer-induced update space. The base utility of a selected batch S at iteration t measures the reduction in validation loss:
$$U^{(t)}(\mathcal{S}) := \mathcal{L}(\mathcal{D}_{val}; \theta_t) - \mathcal{L}(\mathcal{D}_{val}; \theta_{t+1}(\mathcal{S}))$$
For individual sample selection, OPUS computes the marginal utility — how much adding sample z to the current batch B̂ₜ improves the objective:
$$U_z^{(t)} := U^{(t)}(\hat{\mathcal{B}}_t \cup \{z\}) - U^{(t)}(\hat{\mathcal{B}}_t)$$
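A tiny worked example may help. The sketch below (toy quadratic validation loss and hand-picked gradients, not the paper's code) evaluates the batch utility as the drop in validation loss after one SGD step, and the marginal utility of a candidate as the difference of two such utilities:

```python
import numpy as np

theta = np.array([1.0, -1.0])   # current parameters theta_t
eta = 0.1                       # learning rate eta_t

def val_loss(params):
    """Toy validation loss L(D_val; theta)."""
    return 0.5 * np.sum(params ** 2)

def utility(sample_grads):
    """U^(t)(S): validation-loss reduction after one SGD step on S."""
    step = -eta * np.sum(sample_grads, axis=0)
    return val_loss(theta) - val_loss(theta + step)

batch = [np.array([0.5, 0.0])]   # gradients of the current batch B_t
z = np.array([0.0, -0.5])        # gradient of candidate sample z

# Marginal utility U_z^(t) = U(B_t ∪ {z}) - U(B_t)
marginal = utility(batch + [z]) - utility(batch)
```

Here the candidate's gradient points along a direction the batch has not yet covered, so its marginal utility is positive.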
The OPUS Framework
1. Optimizer-Induced Preconditioner
Different optimizers induce different update directions. OPUS captures this through a general update form:
$$\Delta\theta_t(\hat{\mathcal{B}}_t) = -\eta_t \sum_{z \in \hat{\mathcal{B}}_t} \mathbf{P}_t \nabla\mathcal{L}(z; \theta_t)$$
where Pₜ is the optimizer-specific preconditioner and ηₜ is the learning rate.
For AdamW, the preconditioner takes the form:
$$\mathbf{P}_t^{AdamW} := C_t \cdot \text{Diag}\left(\frac{1}{\sqrt{\hat{v}_{t-1}} + \epsilon}\right)$$
where Cₜ := αₜ(1-β₁)/(1-β₁ᵗ) and v̂ₜ₋₁ is the bias-corrected second moment estimate.
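As a sketch (hyperparameter names assumed for illustration, not taken from the paper's code), the diagonal AdamW preconditioner can be formed and applied elementwise:

```python
import numpy as np

def adamw_preconditioner(v_hat_prev, alpha_t, beta1, t, eps=1e-8):
    """P_t^AdamW = C_t * Diag(1 / (sqrt(v_hat_{t-1}) + eps)),
    with C_t = alpha_t * (1 - beta1) / (1 - beta1 ** t).
    The diagonal is returned as a vector."""
    c_t = alpha_t * (1.0 - beta1) / (1.0 - beta1 ** t)
    return c_t / (np.sqrt(v_hat_prev) + eps)

# Applying the diagonal preconditioner is an elementwise product:
v_hat = np.array([1.0, 4.0])     # second-moment estimates
grad = np.array([2.0, 2.0])
update = adamw_preconditioner(v_hat, alpha_t=1.0, beta1=0.9, t=1, eps=0.0) * grad
```

Coordinates with a larger second-moment estimate receive proportionally smaller updates, which is exactly the adaptive scaling the preconditioner encodes.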
For the Muon optimizer, the preconditioner for layer ℓ is:
$$\mathbf{P}_{t,\ell}^{Muon} := \kappa_t \mathbf{S}_{t,\ell}$$
where Sₜ,ₗ = aI + bAₜ,ₗ + cA²ₜ,ₗ is a polynomial of the momentum correlation matrix.
2. Scoring with Alignment and Redundancy
Using a first-order Taylor approximation around the current parameters:
$$\mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t + \Delta\theta_t(\{z\})) \approx \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t) + \nabla_\theta \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t)^\top \Delta\theta_t(\{z\})$$
The linearized validation gradient evolves as:
$$\nabla_\theta \mathcal{L}(\mathcal{D}_{val}; \tilde{\theta}_t) \approx \mathbf{g}_{val}^{(t)} + \mathbf{H}_{val}^{(t)} \Delta\theta_t(\hat{\mathcal{B}}_t)$$
With an isotropic Hessian approximation (Hᵥₐₗ ≈ I), the final utility score decomposes into two interpretable terms:
$$U_z^{(t)} \approx \underbrace{\eta_t \langle \mathbf{u}_z^{(t)}, \mathbf{g}_{proxy}^{(t)} \rangle}_{\text{Alignment}} - \underbrace{\eta_t^2 \langle \mathbf{u}_z^{(t)}, \mathbf{G}^{(t)} \rangle}_{\text{Redundancy Penalty}}$$
where:
- u_z⁽ᵗ⁾ = Pₜ∇L(z; θₜ) is the optimizer-induced update for sample z
- gₚᵣₒₓᵧ⁽ᵗ⁾ is the proxy gradient direction (from validation data)
- G⁽ᵗ⁾ := Σⱼ uⱼ⁽ᵗ⁾ is the accumulated effective direction (sum over batch samples)
The alignment term rewards samples that move in the same direction as the validation objective. The redundancy penalty discourages selecting samples similar to those already in the batch.
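The decomposition can be illustrated with toy vectors (all numbers invented for illustration; this is not the paper's code):

```python
import numpy as np

def utility_score(u_z, g_proxy, G, eta):
    """U_z ≈ eta * <u_z, g_proxy> - eta**2 * <u_z, G>."""
    alignment = eta * np.dot(u_z, g_proxy)
    redundancy = eta ** 2 * np.dot(u_z, G)
    return alignment - redundancy

eta = 0.1
g_proxy = np.array([1.0, 1.0])   # proxy (validation) gradient direction
G = np.array([5.0, 0.0])         # batch so far already covers dimension 0

# Two candidates equally aligned with the proxy gradient:
overlap = utility_score(np.array([1.0, 0.0]), g_proxy, G, eta)  # duplicates the batch
novel = utility_score(np.array([0.0, 1.0]), g_proxy, G, eta)    # adds a new direction
```

Even though both candidates align equally with the proxy gradient, the one pointing along a direction the batch already covers is penalized and scores lower.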
Efficient Computation: Ghost Gradients + CountSketch
Ghost Gradient Factorization
Computing full gradients for every candidate would be prohibitive. OPUS exploits the structure of linear layers, where the per-sample gradient can be factored as an outer product:
$$\nabla_{\mathbf{W}_r}\mathcal{L}(z; \theta_t) = \mathbf{a}_r^{(z)} \otimes \mathbf{b}_r^{(z)}$$
where aᵣ⁽ᶻ⁾ is the input activation and bᵣ⁽ᶻ⁾ is the backpropagated error signal. This “ghost” representation avoids materializing the full gradient tensor.
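The payoff of the factorization is that Frobenius inner products between per-sample gradients reduce to products of small dot products. A sketch with toy shapes (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
a1, b1 = rng.standard_normal(64), rng.standard_normal(32)  # sample 1: activation, error
a2, b2 = rng.standard_normal(64), rng.standard_normal(32)  # sample 2: activation, error

# Naive route: materialize both 32x64 gradient matrices, then compare.
g1 = np.outer(b1, a1)
g2 = np.outer(b2, a2)
naive = np.sum(g1 * g2)

# Ghost route: <a1⊗b1, a2⊗b2>_F = <a1, a2> * <b1, b2>, O(d) instead of O(d^2).
ghost = np.dot(a1, a2) * np.dot(b1, b2)
```

Both routes give the same number, but the ghost route never builds the gradient matrices.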
CountSketch Compression
To handle high-dimensional parameter spaces, OPUS applies a CountSketch projection. The implicit sketched feature for layer r is:
$$\phi^{(t,r)}(z) = \Pi_r(\mathbf{P}_{t,r}(\mathbf{a}_r^{(z)} \otimes \mathbf{b}_r^{(z)}))$$
where Πᵣ is the CountSketch projection operator.
The utility can then be approximated efficiently:
$$U_z^{(t)} \approx \eta_t \sum_{r \in \mathcal{R}} \langle \phi^{(t,r)}(z), \psi_{proxy}^{(t,r)} \rangle - \eta_t^2 \sum_{r \in \mathcal{R}} \langle \phi^{(t,r)}(z), \Phi^{(t,r)} \rangle$$
where ψₚᵣₒₓᵧ⁽ᵗ,ʳ⁾ is the sketched proxy gradient and Φ⁽ᵗ,ʳ⁾ = Σⱼ φ⁽ᵗ,ʳ⁾(zⱼ) is the running sketch history.
This reduces the overhead to just 4.7% of training compute.
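A minimal CountSketch implementation (illustrative; the bucket count and dimensions are invented) shows how inner products approximately survive the projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 1_024                       # ambient and sketch dimensions
bucket = rng.integers(0, k, size=d)        # hash h: coordinate -> bucket
sign = rng.choice([-1.0, 1.0], size=d)     # sign hash s: coordinate -> ±1

def countsketch(x):
    """Project x into R^k: out[h(i)] += s(i) * x[i]."""
    out = np.zeros(k)
    np.add.at(out, bucket, sign * x)
    return out

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)       # a vector correlated with x
exact = np.dot(x, y)
approx = np.dot(countsketch(x), countsketch(y))
```

The approximation error shrinks as k grows; OPUS relies on this to score candidates in the sketched space rather than in the full parameter space.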
Boltzmann Sampling for Diversity
Greedy selection of the highest-utility samples would cause mode collapse: the batch would concentrate on a narrow set of near-identical samples. OPUS instead uses soft probabilistic selection based on the Boltzmann distribution:
$$p_z^{(t)} \propto \exp\left(\frac{U_z^{(t)}}{\tau}\right)$$
The temperature τ > 0 controls the exploration-exploitation tradeoff:
- Low τ: Nearly deterministic selection of top samples
- High τ: More uniform sampling, greater diversity
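The two regimes can be seen in a short sketch (invented utility values):

```python
import numpy as np

def boltzmann_probs(utilities, tau):
    """p_z ∝ exp(U_z / tau), computed with the usual max-subtraction
    trick for numerical stability."""
    u = np.asarray(utilities, dtype=float)
    logits = (u - u.max()) / tau
    p = np.exp(logits)
    return p / p.sum()

utils = [1.0, 0.5, 0.1]
cold = boltzmann_probs(utils, tau=0.01)   # nearly one-hot on the best sample
hot = boltzmann_probs(utils, tau=100.0)   # close to uniform
```

At low temperature almost all probability mass lands on the top-utility sample; at high temperature the distribution flattens toward uniform, admitting more diverse selections.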
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│                      OPUS Pipeline                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Data Pool ──► Ghost Gradient ──► CountSketch           │
│                Factorization      Compression           │
│                                        │                │
│                                        ▼                │
│              ┌─────────────────────────┐                │
│              │  Utility Scoring:       │                │
│              │  U = Alignment -        │                │
│              │      Redundancy Penalty │                │
│              └─────────────────────────┘                │
│                           │                             │
│                           ▼                             │
│        Boltzmann Sampling ──► Training Batch            │
│                                                         │
└─────────────────────────────────────────────────────────┘
Experimental Results
GPT-2 Scale Experiments
On FineWeb datasets with GPT-2 Large and GPT-2 XL:
- OPUS consistently outperformed industrial baselines
- Matched or exceeded full-data training with fewer tokens
- Showed stable improvements across model sizes
Qwen3-8B Domain Adaptation
The most striking result from the paper:
“In continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens”
| Method | Tokens Used | Performance |
|---|---|---|
| Full training | 3B tokens | Baseline |
| OPUS selection | 0.5B tokens | Superior |
This represents a 6x efficiency improvement—the same or better results with one-sixth of the data.
Scaling Properties
OPUS benefits increase with:
- Larger model sizes
- Longer training runs
- More heterogeneous data sources
Why This Matters
For Practitioners
- Reduced costs: 6x fewer tokens means 6x less compute for the same results
- Faster iteration: Test hypotheses and train new models more quickly
- Better use of limited data: Especially valuable for domain-specific applications
For Research
- Principled framework: Moves beyond heuristic filtering to optimization-based selection
- Theoretical grounding: Update-space utility provides a clear objective
- Generalizable approach: Works across optimizers (AdamW, Muon, etc.) and model architectures
For the Field
Data selection is becoming the new frontier of LLM efficiency. As raw scaling hits diminishing returns, smarter data utilization becomes crucial. OPUS provides a blueprint for this future.
Summary
OPUS demonstrates that intelligent, dynamic data selection can dramatically improve LLM training efficiency. The key mathematical insights are:
- Utility in update space: Scoring that accounts for alignment with target and redundancy penalty
- Ghost + CountSketch: Efficient computation via gradient factorization and sketching
- Boltzmann sampling: Maintaining diversity through probabilistic selection
The result: 6x efficiency gains with only 4.7% computational overhead.
Links
- Based on the publication arXiv:2602.05400