Kv-Cache on LLM Stories

https://wgzesg.github.io/llm_stories/tags/kv-cache/
Recent content in Kv-Cache on LLM Stories
Generator: Hugo
Language: en
Last updated: Sat, 09 May 2026 00:00:00 +0000

ORCA and Chunked Prefill: Evening Out the Iteration
https://wgzesg.github.io/llm_stories/posts/05-orca-and-chunked-prefill/
Wed, 06 May 2026 00:00:00 +0000
How iteration-level scheduling (ORCA) and chunked prefill flatten per-iteration cost. We walk the cost variance, see what ORCA fixes and what it leaves open, and watch chunked prefill bound the longest iteration.

Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline
https://wgzesg.github.io/llm_stories/posts/06-prefill-decode-disaggregation/
Sat, 09 May 2026 00:00:00 +0000
A roofline-first argument for prefill/decode disaggregation. Defines arithmetic intensity, derives that intensity ≈ tokens-per-iteration for transformers, sweeps context length to show decode falling further below the ridge as L grows, then walks through the split and the KV-transfer cost it introduces.