Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model proposes candidate tokens, and a large verifier (the full-size target model) checks those proposals in one parallel forward pass, accepting the ones that match its own distribution. Same output distribution as standard autoregressive decoding, fewer expensive forward passes.

But here’s the question nobody was asking: what happens when your draft model was trained on the wrong data?

The paper “TAPS: Task Aware Proposal Distributions for Speculative Sampling” (Zbib et al., KAUST & AUB, March 2026) shows that draft training data matters as much as architecture. A math-trained drafter excels at math but stumbles on chat. A chat-trained drafter does the opposite. And the way you combine specialists at inference time - not in weight space - determines whether you get the best of both worlds or the worst.


The Problem: One Draft Doesn’t Fit All

Most prior work on speculative decoding focuses on architecture improvements - better feature-level drafting (EAGLE, EAGLE-2, EAGLE-3), dynamic trees, cascaded drafters, self-speculative methods. But almost all draft models are trained on the same broad corpora like ShareGPT.

This creates a blind spot. If your draft is trained on conversational data but your workload is mathematical reasoning, the draft’s proposal distribution - the probabilities it assigns to next tokens - will be poorly aligned with the verifier’s behavior on that task. The speculative decoding algorithm is still lossless - you get the right answer - but the acceptance length drops, and with it, the speedup.
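As a refresher, the lossless acceptance rule underlying all of this can be sketched in a few lines. This is a minimal NumPy illustration of standard speculative sampling (accept with probability min(1, p/q), resample from the residual on rejection); the function and variable names are mine, not the paper's.

```python
import numpy as np

def speculative_accept(draft_probs, target_probs, draft_tokens, rng):
    """Standard speculative sampling acceptance (illustrative sketch).

    draft_probs, target_probs: [k, vocab] distributions at each draft position.
    draft_tokens: the k token ids proposed by the draft model.
    Returns the accepted prefix (plus one corrected token on rejection).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

When the draft distribution matches the target exactly, every token is accepted; the further they diverge, the earlier the loop breaks - which is exactly what a domain-mismatched drafter causes.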

TAPS asks five research questions:

  1. Does task-specific training improve acceptance on matched tasks?
  2. Can mixed-data training recover cross-domain robustness?
  3. How should multiple specialized drafters be combined?
  4. What signals (confidence vs. entropy) are useful for routing?
  5. How does speculative depth interact with task specialization?

Experimental Setup

The controlled design is one of the paper’s strengths. Everything is held fixed except two variables: training data and composition strategy.

Fixed Components

  • Verifier: Meta-Llama-3-8B-Instruct
  • Draft architecture: lightweight LLaMA-style decoder, one transformer layer, hidden size 4096, ~0.8B parameters
  • Tokenizer: shared between draft and verifier (eliminates tokenization mismatch)
  • Training: 20 epochs, learning rate $3 \times 10^{-5}$, batch size 8

Variable: Training Data

  • MathInstruct (70k examples) - mathematical reasoning
  • ShareGPT (70k examples) - conversational data
  • Mixed 35k+35k - balanced mix, same total count
  • Mixed 70k+70k - doubled total, both domains fully represented

Variable: Speculative Backbone

Two state-of-the-art architectures tested independently:

  • EAGLE-2 - feature-level drafting with context-dependent dynamic tree construction; it predicts future hidden features rather than tokens directly, improving on EAGLE by adapting the tree structure to the input
  • HASS (Harmonized Speculative Sampling) - improves draft-target alignment through Top-K distillation and harmonized context alignment, training on imperfect draft features rather than only clean target features

Benchmarks

| Benchmark | Domain | Examples |
|---|---|---|
| MT-Bench | Conversational | 80 |
| GSM8K | Grade-school math | 1,319 |
| MATH-500 | Competition math | 500 |
| SVAMP | Word problems | 300 |

Primary metric: acceptance length - average number of draft tokens accepted per verifier call. Higher = better alignment = faster inference.
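Concretely, the metric is just the mean of accepted-token counts across verifier calls. A trivial illustrative helper (not the paper's code):

```python
def acceptance_length(accepted_per_call):
    """Average number of draft tokens accepted per verifier forward pass.

    accepted_per_call: one integer per verifier call, counting how many
    drafted tokens that call accepted.
    """
    return sum(accepted_per_call) / len(accepted_per_call)

# e.g. five verifier calls accepting 4, 6, 5, 5, 5 tokens -> 5.0
```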


RQ1: Specialization Is Real

The results are unambiguous. Under HASS at temperature 0:

| Draft Training | MT-Bench | GSM8K | MATH-500 | SVAMP |
|---|---|---|---|---|
| MathInstruct | 2.90 | 5.02 | 5.35 | 3.13 |
| ShareGPT | 3.98 | 4.09 | 3.98 | 4.44 |

ShareGPT is 37% better on MT-Bench. MathInstruct is 23% better on GSM8K and 34% better on MATH-500. The same pattern holds for EAGLE-2.

This is not subtle. A drafter trained on the wrong domain doesn’t just lose a few percentage points - it substantially degrades the speculative decoding advantage.

Takeaway: Draft quality depends not only on architecture but on the match between training distribution and deployment workload.


RQ2: Mixed Training Helps, But Doesn’t Dominate

Can you fix specialization by mixing training data? Partially.

Under HASS at temperature 0, Mixed 70k+70k achieves the highest average acceptance length (5.18 across benchmarks). But at temperature 1, it falls to 3.69 - below Mixed 35k+35k’s 4.29.

The pattern repeats for EAGLE-2: Mixed 70k+70k is strongest at temperature 0 (average 4.48), but Mixed 35k+35k is more stable at temperature 1 (3.81 vs. 3.26).

Takeaway: Mixed training broadens coverage but doesn’t uniformly improve across decoding temperatures. It does not remove the need to tune the mixture for your decoding regime.


RQ3: Compose at Inference Time, Not in Weight Space

This is the paper’s most practically useful finding. When you have two specialized drafters, how do you combine them?

Three Strategies

1. Checkpoint Averaging - linear interpolation in parameter space:

$$\theta_{merge} = \lambda \theta_{math} + (1 - \lambda) \theta_{chat}$$

Simple but destructive. Average acceptance lengths of 2.59 (HASS) and 2.42 (EAGLE-2) - the worst results in the entire table, consistently below every single-domain specialist.
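The interpolation itself is one line per parameter tensor. A minimal sketch using NumPy state dicts (names are illustrative; a real implementation would operate on framework checkpoints):

```python
import numpy as np

def average_checkpoints(theta_math, theta_chat, lam=0.5):
    """theta_merge = lam * theta_math + (1 - lam) * theta_chat, per tensor.

    Both state dicts must share identical keys and shapes. Simple, but as
    the TAPS results show, it averages away specialist knowledge.
    """
    assert theta_math.keys() == theta_chat.keys()
    return {name: lam * theta_math[name] + (1.0 - lam) * theta_chat[name]
            for name in theta_math}
```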

2. Confidence Routing - run both drafters, pick the one with higher mean confidence:

$$\mathcal{T}^* = \arg\max_{\mathcal{T} \in \{\mathcal{T}_{math}, \mathcal{T}_{chat}\}} \text{Score}(\mathcal{T}), \quad \text{Score}(\mathcal{T}) = \frac{1}{|\mathcal{T}|} \sum_{v \in \mathcal{T}} c(v)$$

A significant step up: 4.80 (HASS) and 4.63 (EAGLE-2) average acceptance length at temperature 0.
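The routing rule from the formula above is easy to state in code: score each draft tree by the mean confidence c(v) over its nodes and pick the winner. An illustrative sketch (names are mine):

```python
import numpy as np

def route_by_confidence(trees):
    """Select the specialist whose draft tree has the highest mean confidence.

    trees: dict mapping specialist name -> list of per-node confidences c(v),
    i.e. the draft model's probability for each proposed token.
    Returns (winning specialist name, all scores).
    """
    scores = {name: float(np.mean(conf)) for name, conf in trees.items()}
    return max(scores, key=scores.get), scores
```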

3. Merged-Tree Verification - pack both draft trees under a shared root with separate attention masks, verify all candidates in one pass:

The best overall: 5.11 (HASS) and 5.03 (EAGLE-2) at temperature 0. This beats every single-domain checkpoint and every mixed-data variant.

The Results Table (Temperature 0)

| Strategy | Method | MT-Bench | GSM8K | MATH-500 | SVAMP | Average |
|---|---|---|---|---|---|---|
| Averaged | HASS | 2.29 | 2.80 | 3.12 | 2.13 | 2.59 |
| Averaged | EAGLE-2 | 2.07 | 2.53 | 2.57 | 2.50 | 2.42 |
| Confidence Routed | HASS | 3.93 | 5.01 | 5.37 | 4.89 | 4.80 |
| Confidence Routed | EAGLE-2 | 3.63 | 4.91 | 5.25 | 4.71 | 4.63 |
| Merged Trees | HASS | 4.05 | 5.42 | 5.65 | 5.31 | 5.11 |
| Merged Trees | EAGLE-2 | 3.93 | 5.32 | 5.63 | 5.25 | 5.03 |

Weight averaging destroys specialist knowledge. Inference-time composition preserves it.

Takeaway: Keep specialists separate. Combine their proposals, not their parameters.


RQ4: Confidence Beats Entropy for Routing

Both confidence and entropy are candidate signals for deciding which specialist to trust. TAPS tests both at the benchmark level using EAGLE-2.

Confidence Routing

| Benchmark | MathInstruct selected | ShareGPT selected |
|---|---|---|
| MT-Bench | 15 (18.8%) | 65 (81.2%) |
| GSM8K | 1,198 (90.8%) | 121 (9.2%) |
| MATH-500 | 485 (97.0%) | 15 (3.0%) |
| SVAMP | 279 (93.0%) | 21 (7.0%) |

Confidence routing produces near-perfect domain separation: the math specialist is selected for >90% of math examples, the chat specialist for >80% of conversation.

Entropy Routing

| Benchmark | MathInstruct selected | ShareGPT selected |
|---|---|---|
| MT-Bench | 42 (52.5%) | 38 (47.5%) |
| GSM8K | 720 (54.6%) | 599 (45.4%) |
| MATH-500 | 312 (62.4%) | 188 (37.6%) |
| SVAMP | 159 (53.0%) | 141 (47.0%) |

Entropy routing is barely better than random. It detects that rejected tokens tend to have higher entropy (useful as a diagnostic) but produces near-balanced splits that fail to separate domains.

Takeaway: Confidence is a routing signal. Entropy is a diagnostic signal. Don’t confuse the two.
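The two signals come from the same draft distribution, which makes the contrast easy to see. A small sketch computing both per token (illustrative; not the paper's code):

```python
import numpy as np

def token_signals(probs):
    """Per-token routing signals from a draft distribution over the vocab.

    confidence: probability of the argmax token (how peaked the draft is).
    entropy: -sum p log p (how spread out it is). TAPS finds that only
    confidence yields clean domain separation when routing.
    """
    confidence = float(probs.max())
    entropy = float(-(probs * np.log(probs + 1e-12)).sum())
    return confidence, entropy
```

A peaked distribution has high confidence and low entropy; a uniform one, the reverse. The two are correlated per token, but averaged over a draft tree they behave very differently as routing criteria.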


RQ5: Depth Reveals the Exploration-Exploitation Tradeoff

An elegant finding: how specialization interacts with draft tree depth.

At shallow depths (1–2 tokens), mixed-data drafts often perform best. Broader coverage means more chances of producing an acceptable first token - an exploration advantage.

At deeper depths (3–5 tokens), the task-matched specialist increasingly dominates, especially on reasoning benchmarks. Sustained agreement with the verifier requires close distribution alignment - an exploitation advantage.

This explains why merged-tree verification works so well: it provides exploration (diversity from two specialists) at shallow levels while letting the better-matched specialist drive deeper acceptance.


The Merged-Tree Algorithm

The most novel technical contribution is the merged-tree verification procedure:

  1. Generate draft trees $\mathcal{T}_{math}$ and $\mathcal{T}_{chat}$ from the same root token
  2. Merge under a shared root by concatenating nodes and remapping indices
  3. Build ancestor-preserving attention masks - nodes from one subtree cannot attend to nodes from the other
  4. Assign position IDs by tree depth
  5. Verify the entire merged tree in one verifier forward pass
  6. Extract candidate paths and apply standard speculative acceptance
  7. Commit the accepted prefix

The attention mask isolation is key: each subtree maintains its own internal ancestry, so the verifier evaluates each specialist’s proposals on their own terms. The diversity comes from having both sets of candidates available, not from cross-pollinating their computations.
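Under those conventions, steps 2–3 (merge the subtrees and build ancestor-preserving masks) reduce to a block-diagonal boolean mask. A simplified sketch, assuming each node is given as a parent index within its subtree (-1 marks the subtree root); a real implementation would also handle attention to the shared root and committed prefix:

```python
import numpy as np

def merged_tree_mask(parents_math, parents_chat):
    """Ancestor-preserving attention mask for two merged draft subtrees.

    parents_*: parent index per node within each subtree (-1 = subtree root).
    Each node may attend to itself and its own ancestors, but never to any
    node in the other specialist's subtree.
    """
    def ancestor_mask(parents):
        n = len(parents)
        m = np.zeros((n, n), dtype=bool)
        for i in range(n):
            j = i
            while j != -1:       # walk up to the subtree root
                m[i, j] = True
                j = parents[j]
        return m

    n1, n2 = len(parents_math), len(parents_chat)
    mask = np.zeros((n1 + n2, n1 + n2), dtype=bool)
    mask[:n1, :n1] = ancestor_mask(parents_math)  # math subtree: internal only
    mask[n1:, n1:] = ancestor_mask(parents_chat)  # chat subtree: internal only
    return mask
```

The off-diagonal blocks stay all-False, which is exactly the isolation property: the verifier scores each specialist's candidates as if the other's tree were not there.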


Practical Implications

For inference engineers

If your workload is domain-specific (coding, math, legal, medical), train or fine-tune your draft model on matched domain data. The architectural backbone (EAGLE-2 vs. HASS) matters less than the training distribution match.

For multi-domain deployments

Don’t average checkpoints. Either:

  • Use confidence routing (lower overhead, requires running both drafters on the prefix)
  • Use merged-tree verification (higher acceptance, requires larger verifier batch from the merged tree)

For researchers

The paper’s limitation section is refreshingly honest: one target model, two source domains, two speculative backbones, four benchmarks. The routing policy is deliberately simple (confidence-based, not learned). End-to-end latency improvements from merged-tree verification are not measured - acceptance length is the proxy. These are all clear directions for follow-up work.


Limitations

  • Single verifier (Llama-3-8B-Instruct) - unclear how results transfer to larger or different architectures
  • Two domains only - real-world deployment involves many more task types
  • No end-to-end latency - merged-tree verification increases the verifier’s batch size; whether the acceptance length gain offsets this cost is an open question
  • Simple routing policy - confidence-based, not learned; a trained router might do better
  • Acceptance length ≠ wall-clock speedup - the overhead of running multiple drafters or larger verification batches is not accounted for

Bottom Line

TAPS delivers a simple but important message: speculative decoding is a systems design problem, not just an architecture problem. The draft model is not a fixed auxiliary component - it’s a design choice that should be aligned with the deployment workload. When multiple specialists exist, composing them at inference time (via routing or merged verification) substantially outperforms merging them in weight space.

The practical recipe: train domain-specific drafters, keep them separate, combine their proposals at inference time, and use confidence (not entropy) to route between them.