Performance benchmarking tool for Gradio apps. Measures per-phase request timing, load tests with concurrent users, and supports mixed background traffic simulation.
```bash
pip install hf-perftest
```

Or install from source:

```bash
git clone https://github.com/gradio-app/hf-perftest.git
cd hf-perftest
pip install -e .
```

Install the hf-perftest skill/rules for your AI coding tool:
```bash
# Install for all supported tools (Claude Code, Cursor, Codex)
hf-perftest install-skill

# Or pick one
hf-perftest install-skill claude
hf-perftest install-skill cursor
hf-perftest install-skill codex
```

This writes the appropriate skill/rules file into your project so your AI assistant knows how to use hf-perftest.
Run a benchmark against a Gradio app:
```bash
hf-perftest run \
  --app apps/echo_text.py \
  --tiers 1,10,100 \
  --requests-per-user 10 \
  --output-dir benchmark_results/my_run
```

This will:

- Launch the app with `GRADIO_PROFILING=1`
- Run warmup requests
- For each tier, fire N concurrent users in burst mode for 10 rounds
- Collect client latencies and server-side per-phase traces
- Save results to `benchmark_results/my_run/<timestamp>/`
```bash
# Minimal
hf-perftest run --app apps/echo_text.py

# Full options
hf-perftest run \
  --app apps/echo_text.py \
  --tiers 1,10,100 \
  --requests-per-user 50 \
  --mode burst \
  --concurrency-limit 1 \
  --mixed-traffic \
  --num-workers 2 \
  --output-dir results
```

- `--app`: Path to the Gradio app to test (required)
- `--tiers`: Comma-separated concurrency tiers (default: `1,10,100`)
- `--requests-per-user`: Number of rounds per tier (default: 10)
- `--mode`: Load pattern: `burst` or `wave` (default: `burst`)
- `--concurrency-limit`: Concurrency limit for the app (default: 1; use `none` for unlimited)
- `--mixed-traffic`: Run background traffic (page loads, uploads, downloads) alongside predictions
- `--num-workers`: Number of Gradio workers via `GRADIO_NUM_WORKERS` (default: 1)
- `--output-dir`: Output directory (default: `benchmark_results`)
- `--port`: Port for the Gradio app (default: 7860)
- `--api-name`: API endpoint name (auto-detected if not specified)
- burst: All N users fire simultaneously per round. Measures worst-case queue contention.
- wave: Each user waits a random 0–500ms jitter before firing. Simulates realistic staggered traffic.
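The difference between the two patterns can be sketched with `asyncio`. This is a minimal illustration of the load shapes, not hf-perftest's actual implementation; `fire()` is a hypothetical stand-in for one client request:

```python
import asyncio
import random

completed: list[int] = []

async def fire(user_id: int) -> None:
    """Hypothetical stand-in for one client request."""
    await asyncio.sleep(0.01)  # simulated request latency
    completed.append(user_id)

async def burst_round(n_users: int) -> None:
    # burst: every user fires at the same instant -> worst-case queue contention
    await asyncio.gather(*(fire(u) for u in range(n_users)))

async def wave_round(n_users: int) -> None:
    # wave: each user sleeps a random 0-500 ms jitter before firing
    async def jittered(u: int) -> None:
        await asyncio.sleep(random.uniform(0, 0.5))
        await fire(u)
    await asyncio.gather(*(jittered(u) for u in range(n_users)))

asyncio.run(burst_round(10))
asyncio.run(wave_round(10))
```

In burst mode all requests hit the queue within microseconds of each other; in wave mode arrivals are spread across the jitter window, which is closer to organic traffic.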
With `--mixed-traffic`, background workers run alongside predictions to simulate realistic server load:

- Page loads: `GET /`, `/config`, `/gradio_api/info`, `/theme.css`, plus discovered JS/CSS assets
- Uploads: `POST /gradio_api/upload` with files from `sample-inputs/`
- Downloads: `GET /gradio_api/file=...` for static files
Submit benchmarks to HF Jobs infrastructure for reproducible results on standardized hardware.
Benchmark one or more apps on a branch:

```bash
hf-perftest run-remote run \
  --apps apps/echo_text.py apps/streaming_chat.py \
  --branch main \
  --tiers 1,10,100 \
  --requests-per-user 50 \
  --hardware cpu-upgrade
```

A/B-compare two branches:

```bash
hf-perftest run-remote ab \
  --apps apps/echo_text.py apps/file_heavy.py \
  --base main \
  --branch my-optimization \
  --tiers 1,10,100 \
  --requests-per-user 50 \
  --hardware cpu-upgrade
```

Benchmark a hosted Space with a prompt sidecar:

```bash
hf-perftest run-remote run \
  --apps mrfakename/z-image-turbo \
  --sidecar apps/z-image-turbo.prompts.json \
  --api-name /generate_image \
  --branch main \
  --hardware gpu-l4-1
```

- `--apps`: Paths to local Gradio app files or a HF Space ID (required)
- `--branch`/`--base`: Git branches to benchmark (resolved to commit SHAs)
- `--commit`: Direct commit SHA (overrides `--branch`)
- `--hardware`: HF Jobs hardware flavor (default: `cpu-basic`)
- `--tiers`: Comma-separated concurrency tiers (default: `1,10,100`)
- `--requests-per-user`: Rounds per tier (default: 10)
- `--mode`: Load pattern: `burst` or `wave` (default: `burst`)
- `--concurrency-limit`: App concurrency limit (default: 1)
- `--mixed-traffic`: Run background traffic alongside predictions
- `--num-workers`: Number of Gradio workers (default: 1)
- `--sidecar`: Sidecar prompt files (`.prompts.json`) to upload alongside apps
- `--timeout`: Job timeout (default: 90m)
- `--run-name`: Human-readable label (default: auto-generated)
- `--dry-run`: Print the generated script without submitting
```bash
hf-perftest result-schema
```

Prints the structure of the results directory.
| App | What it tests |
|---|---|
| `echo_text.py` | `lambda x: x` — pure framework overhead |
| `file_heavy.py` | 256x256 random image — exercises postprocess serialization |
| `image_to_image.py` | Image identity — tests file upload + download |
| `stateful_counter.py` | `gr.State` with dict — tests session state handling |
| `streaming_chat.py` | `ChatInterface` generator with 6 yields — tests streaming |
| `llm_chat.py` | LLM chat via HF Inference API — real-world chat workload |
| `text_to_image.py` | Text-to-image via HF Inference API — real-world image gen |
The instrumentation (enabled via `GRADIO_PROFILING=1`) traces six phases per request:
| Phase | What it measures |
|---|---|
| `queue_wait` | Time from event creation to processing start |
| `preprocess` | Input deserialization |
| `fn_call` | User function execution (accumulated across generator yields) |
| `postprocess` | Output serialization (e.g. numpy → image → file cache) |
| `streaming_diff` | Computing incremental diffs for streaming output |
| `total` | Wall-clock time for the full `process_api()` call |
Profiling endpoints (only available when `GRADIO_PROFILING=1`):

```bash
curl http://localhost:7860/gradio_api/profiling/traces | python -m json.tool
curl http://localhost:7860/gradio_api/profiling/summary | python -m json.tool
curl -X POST http://localhost:7860/gradio_api/profiling/clear
```

Inspect remote jobs with the HF CLI:

```bash
hf jobs logs <job_id>
hf jobs inspect <job_id>
```
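Once traces are pulled down, aggregating them is straightforward. A sketch that assumes, hypothetically, each trace is a flat mapping of phase name to seconds — the actual `/gradio_api/profiling/traces` payload may be shaped differently (check `hf-perftest result-schema`):

```python
from statistics import mean

def phase_means(traces: list[dict[str, float]]) -> dict[str, float]:
    """Average each phase's duration across all collected traces."""
    phases: dict[str, list[float]] = {}
    for trace in traces:
        for phase, seconds in trace.items():
            phases.setdefault(phase, []).append(seconds)
    return {phase: mean(vals) for phase, vals in phases.items()}

# Illustrative data only, not real benchmark output
traces = [
    {"queue_wait": 0.002, "preprocess": 0.001, "fn_call": 0.010, "total": 0.015},
    {"queue_wait": 0.004, "preprocess": 0.001, "fn_call": 0.012, "total": 0.019},
]
print(phase_means(traces))
```

Comparing `fn_call` against `total` in an aggregate like this shows how much time the framework spends outside the user function.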