Skyportal — Command Center

Explorer

serving

deploy/prod

vllm-args.yaml (modified)

Dockerfile

vllm-args.yaml

  model: "Qwen3-8B-Instruct"
  tensor_parallel_size: 4
+ enable_chunked_prefill: true  # overlap prefill & decode
+ async_scheduling: true        # cut per-step IPC on the CPU control plane
+ tokenizer_pool_size: 4        # move tokenization off the hot path
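To make the config change concrete, here is a minimal sketch of how keys like these might be turned into command-line flags by a launcher script. The helper `to_cli_flags` is hypothetical and is not part of vLLM or of this repository; it assumes the common convention of mapping `snake_case` keys to `--kebab-case` flags and emitting boolean `true` values as bare switches. Whether a given flag is accepted depends on the vLLM version in use, so treat the output as illustrative.

```python
# Hypothetical helper for illustration; not part of vLLM or of vllm-args.yaml.
def to_cli_flags(args: dict) -> list[str]:
    """Convert snake_case config keys into --kebab-case command-line flags."""
    flags: list[str] = []
    for key, value in args.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Assumption: booleans are bare switches, emitted only when true.
            if value:
                flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags


# Values mirror the file shown above.
vllm_args = {
    "model": "Qwen3-8B-Instruct",
    "tensor_parallel_size": 4,
    "enable_chunked_prefill": True,
    "async_scheduling": True,
    "tokenizer_pool_size": 4,
}

print(" ".join(to_cli_flags(vllm_args)))
```

Keeping the three new keys in the YAML file rather than hard-coding flags in the Dockerfile means the change can be reviewed and reverted as a three-line diff.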

▶ Execution Log · TOTAL STEPS: 5 · SUCCESS: 5

[1/5] GPU compute step ✓ 44ms · GPU util 56% · not saturated
[2/5] Tail latency check ✗ p95 5.1s · TTFT 1.9s · SLO 2.0s breached under load
[3/5] CPU control plane ✗ scheduler broadcast 228ms (was 12ms) · ~5× the GPU step
[4/5] Hardware vs pipeline ✓ adding GPUs won't help; the bottleneck is CPU-side, not compute
[5/5] Root cause + fix path ✓ chunked prefill + async scheduling + tokenizer pool
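The reasoning in steps 3 and 4 can be checked with a short calculation. The sketch below uses only the figures from the log (44 ms GPU step, 228 ms scheduler broadcast, 12 ms previous broadcast). It assumes, for simplicity, that the broadcast and the GPU step run back to back with no overlap; that serial model is an assumption, not something the log states, and the function names are illustrative.

```python
# Figures copied from the execution log.
GPU_STEP_MS = 44.0
BROADCAST_MS = 228.0
BROADCAST_BASELINE_MS = 12.0


def step_time_ms(gpu_ms: float, cpu_ms: float) -> float:
    # Assumption: CPU-side broadcast and GPU step are serial (no overlap).
    return gpu_ms + cpu_ms


def speedup_from_faster_gpu(gpu_factor: float) -> float:
    """End-to-end step speedup if only the GPU portion gets faster."""
    before = step_time_ms(GPU_STEP_MS, BROADCAST_MS)
    after = step_time_ms(GPU_STEP_MS / gpu_factor, BROADCAST_MS)
    return before / after


def speedup_from_fixed_broadcast() -> float:
    """End-to-end step speedup if the broadcast returns to its baseline."""
    before = step_time_ms(GPU_STEP_MS, BROADCAST_MS)
    after = step_time_ms(GPU_STEP_MS, BROADCAST_BASELINE_MS)
    return before / after


ratio = BROADCAST_MS / GPU_STEP_MS
print(f"broadcast / GPU step: {ratio:.1f}x")
print(f"speedup with a 2x faster GPU: {speedup_from_faster_gpu(2.0):.2f}x")
print(f"speedup with broadcast at baseline: {speedup_from_fixed_broadcast():.2f}x")
```

Under this model, doubling GPU speed shortens a step by under 10%, while restoring the broadcast to 12 ms shortens it almost fivefold, which is consistent with the log's conclusion that the fix is on the CPU side.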

AI Chat

USER

We scaled to bigger GPUs but p95 latency barely moved. Why?

Host: Scope: support-agent-prod-1
