Velo

LLM inference gateway · benchmark results

Requests / sec

+130% vs cold at 60% prompt reuse

p50 latency

0ms

−84% vs cold

TTFT p50

0ms

−88% vs cold

Cache hit rate

vs 0% on cold

Latency distribution

The median rides the cache fast path (around 10 ms), but the roughly 34% of requests that miss the cache still pay full backend latency, so p95 and p99 stay high. Caching improves the median, not the tail.

Go · pgvector · Redis · 16 workers × 30s · threshold 0.92 · 60% prompt reuse · 0 errors