Benchmark results

Historical benchmark results from the legacy Python FastAPI implementation are available on the Legacy Benchmarks page.

This page collects results from benchmark runs on the latest supported inference engines.

The benchmark tool used is the Hugging Face inference benchmarker.

The collected performance metrics are:

  • Queries per second (QPS)
  • Inter-token latency (ITL)
  • Time to first token (TTFT)
  • End-to-end (E2E) latency
  • Throughput measured in tokens/s for all queries running in parallel
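To make the metrics above concrete, here is a minimal sketch of how they could be derived from per-request timestamps. It uses common definitions (TTFT as the delay until the first token, ITL as the mean gap between consecutive tokens, E2E as the delay until the last token), which may differ in detail from what the benchmarker computes. The `RequestTrace` type and the function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestTrace:
    """Hypothetical record of one request; all times are in seconds."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    # Time to first token: delay between sending and the first token.
    return trace.token_times[0] - trace.sent_at


def itl(trace: RequestTrace) -> float:
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def e2e(trace: RequestTrace) -> float:
    # End-to-end latency: delay between sending and the last token.
    return trace.token_times[-1] - trace.sent_at


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    # Wall-clock window covering all requests that ran in parallel.
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    window = end - start
    total_tokens = sum(len(t.token_times) for t in traces)
    return {
        "qps": len(traces) / window,
        "throughput_tokens_per_s": total_tokens / window,
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_itl_s": mean(itl(t) for t in traces),
        "mean_e2e_s": mean(e2e(t) for t in traces),
    }
```

Note that throughput here counts tokens across all concurrent requests within one wall-clock window, so it can rise with load even while per-request latency gets worse.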

The benchmark runs at a set of fixed rates, each 50% higher than the previous one.
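The rate progression can be sketched as a geometric sequence with a growth factor of 1.5. The starting rate and the number of steps below are assumptions chosen for illustration, since this page does not state them, and `rate_schedule` is a hypothetical helper, not part of the benchmark tool.

```python
from typing import List


def rate_schedule(start_rate: float, steps: int, growth: float = 1.5) -> List[float]:
    """Return `steps` rates, each `growth` times the previous one.

    growth=1.5 corresponds to the 50% increase between consecutive rates.
    """
    rates = []
    rate = start_rate
    for _ in range(steps):
        rates.append(rate)
        rate *= growth
    return rates


# Assumed example: start at 1 query per second and run five rate levels.
print(rate_schedule(1.0, 5))  # → [1.0, 1.5, 2.25, 3.375, 5.0625]
```

Because the steps are multiplicative, the schedule samples low rates densely and high rates sparsely, which helps locate the point where latency starts to degrade without running many levels.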

Results from the Go/Gin rewrite will be added here once available.