Explorer
serving
deploy/prod
vllm-args.yaml●
Dockerfile
1 model: "Qwen3-8B-Instruct"
2 tensor_parallel_size: 4
3+enable_chunked_prefill: true # overlap prefill & decode
4+async_scheduling: true # cut per-step IPC on the CPU control plane
5+tokenizer_pool_size: 4 # move tokenization off the hot path
▶ Execution LogTOTAL STEPS: 5 · SUCCESS: 5
[1/5] GPU compute step✓44ms · GPU util 56% · not saturated
[2/5] Tail latency check✗p95 5.1s · TTFT 1.9s · SLO 2.0s breached under load
[3/5] CPU control plane✗scheduler broadcast 228ms (was 12ms) · ~5× the GPU step
[4/5] Hardware vs pipeline✓adding GPUs won't help — bottleneck is CPU-side, not compute
[5/5] Root cause + fix path✓chunked prefill + async scheduling + tokenizer pool
AI Chat
USER U
We scaled to bigger GPUs but p95 latency barely moved. Why?
Host: Scope: support-agent-prod-1
Ask anything…