<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Inference Optimization on MLLog.dev</title><link>https://mllog.dev/en/tags/inference-optimization/</link><description>Recent content in Inference Optimization on MLLog.dev</description><image><title>MLLog.dev</title><url>https://mllog.dev/images/default_mllog.png</url><link>https://mllog.dev/images/default_mllog.png</link></image><generator>Hugo -- 0.147.9</generator><language>en</language><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://mllog.dev/en/tags/inference-optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>TAPS: Why Your Draft Model's Training Data Matters More Than Its Architecture</title><link>https://mllog.dev/en/posts/taps-task-aware-speculative-decoding/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://mllog.dev/en/posts/taps-task-aware-speculative-decoding/</guid><description>&lt;p>Speculative decoding is one of the most elegant tricks in LLM inference: a small, fast &lt;span class="glossary-term" tabindex="0">
&lt;span class="glossary-word">draft model&lt;/span>
&lt;span class="glossary-tooltip">
&lt;strong>draft model&lt;/strong>
&lt;span class="glossary-def">A lightweight language model that quickly proposes candidate tokens. A larger &amp;lsquo;verifier&amp;rsquo; model then checks these proposals in parallel, accepting correct ones and rejecting wrong ones - accelerating generation without changing output quality.&lt;/span>
&lt;/span>
&lt;/span>
proposes tokens, and a large &lt;span class="glossary-term" tabindex="0">
&lt;span class="glossary-word">verifier&lt;/span>
&lt;span class="glossary-tooltip">
&lt;strong>verifier&lt;/strong>
&lt;span class="glossary-def">The full-size target language model that checks draft proposals. It processes all candidates in one forward pass and accepts those matching its own distribution, guaranteeing identical output quality to standard autoregressive decoding.&lt;/span>
&lt;/span>
&lt;/span>
approves or rejects them in parallel. Same output distribution, fewer expensive forward passes.&lt;/p></description></item></channel></rss>