ORCA and Chunked Prefill: Evening Out the Iteration

A serving batch holds many requests, each finishing at a different time, and some carry prefills 1000× the size of a decode step, so per-iteration cost swings wildly. ORCA-style iteration-level scheduling fixes the first half by letting requests join and leave the batch at every iteration; chunked prefill fixes the second by bounding the largest iteration so short work isn’t dragged behind long work.

May 6, 2026 · 17 min · Pino

Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline

Article 05 left two phases politely sharing one engine. This article shows they shouldn’t — prefill is compute-bound, decode is bandwidth-bound, and long context drives the gap wider, not smaller. Once we accept the asymmetry, splitting them is the structural fix.

May 9, 2026 · 17 min · Pino