TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture
Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast draft model proposes tokens, and a large verifier approves or rejects them in parallel. Same output distribution, fewer expensive forward passes. ...
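To make the propose-and-verify loop concrete, here is a minimal sketch of one draft-then-verify step over a toy three-token vocabulary. It assumes the standard speculative sampling acceptance rule (accept a drafted token `x` with probability `min(1, p[x] / q[x])`, otherwise resample from the renormalized residual `max(0, p - q)`), which may differ in detail from any particular system. The names `speculative_step` and `simulate` and the fixed distributions `p` and `q` are illustrative only; real models condition both distributions on the context and draft several tokens per step.

```python
import random


def speculative_step(p, q, rng):
    """One draft-then-verify step on a toy vocabulary.

    p: verifier (target) probabilities per token.
    q: draft model probabilities per token.
    Returns (token, accepted).
    """
    vocab = list(range(len(p)))
    # The draft model proposes a token from its own distribution q.
    x = rng.choices(vocab, weights=q)[0]
    # The verifier accepts the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


def simulate(p, q, n=20000, seed=0):
    """Run n independent steps; return (token frequencies, acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # hypothetical verifier distribution
    q = [0.3, 0.3, 0.4]  # hypothetical draft distribution
    freqs, accept_rate = simulate(p, q)
    print("output frequencies:", freqs)
    print("acceptance rate:", accept_rate)
```

Under this rule the output frequencies should track `p` no matter what `q` is, while the expected acceptance rate is the overlap between the two distributions, `sum(min(p_i, q_i))`, which is 0.7 for the toy values above. The draft model only affects speed, never the output distribution, so anything that moves `q` closer to `p` directly raises the fraction of drafted tokens that survive verification.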