Check us out at profine.ai
Profile your PyTorch code on real GPUs. Get reviewable optimizations. Ship measured speedups before the multi-hour run.
pip install profine
profine auth login # one-time
profine run-all examples/minGPT/projects/chargpt/chargpt.py --hardware 1x_a100profine prints a one-line cost summary, runs the full pipeline on Modal, and produces a benchmark report in ~10 minutes.
On Karpathy's minGPT chargpt config, median of 3 independent runs per GPU, full optimization stack applied (BF16 Mixed Precision + TF32 matmul + torch.compile max-autotune + SDPA + Fused AdamW):
| GPU | Baseline step | Optimized step | Speedup | Peak mem Δ | Correctness |
|---|---|---|---|---|---|
| A10G (24 GB) | 43.8 ms | 16.5 ms | 2.75× faster (63.7%) | −71.1% | ✓ all 3 reps |
| A100 (80 GB) | 25.2 ms | 7.5 ms | 3.48× faster (71.3%) | −68.7% | ✓ all 3 reps |
Per-run speedups (3 reps each): A10G 2.42× / 2.75× / 4.73×; A100 2.14× / 3.48× / 3.51×.
Correctness is checked by replaying baseline and optimized loss curves step-for-step on the same seed; both stay inside the BF16-widened tolerance (rtol=0.05, atol=0.01, the documented bf16-vs-fp32 drift budget) on every rep. Median loss-curve max diff: 0.013 (A10G), 0.098 (A100).
Reproducible:
profine run-all examples/minGPT/projects/chargpt/chargpt.py --hardware 1x_a100Full artifacts in examples/minGPT/profine_output/ (start with SUMMARY.md); the multi-rep comparison data lives under runs/bench_mingpt/.
Requires:
- A Modal account (the GPU backend)
- An LLM: OpenAI, Anthropic, or any OpenAI-compatible local server (Ollama, vLLM, LM Studio, llama.cpp, LiteLLM)
The fastest path is profine auth login which is an interactive prompt that saves keys to ~/.profine/auth.json (chmod 0600):
profine auth login # paste in MODAL_*, OPENAI/ANTHROPIC, HF_TOKEN
profine auth status # show what's saved (redacted)
profine auth set OPENAI_API_KEY sk-...
profine auth logout # clear all
profine auth logout OPENAI_API_KEY # clear oneEnvironment variables always win over the saved file, so CI keeps working:
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...
export OPENAI_API_KEY=... # or ANTHROPIC_API_KEY
export HF_TOKEN=... # optional, gated models onlyprofine talks to any OpenAI-compatible server. Pass --provider local plus --model, and optionally --base-url.
# Ollama (default endpoint http://localhost:11434/v1)
ollama serve &
ollama pull llama3.1:8b
profine run-all train.py --provider local --model llama3.1:8b
# vLLM (or LM Studio / llama.cpp server / LiteLLM — point --base-url at the server)
profine run-all train.py \
--provider local \
--model meta-llama/Llama-3.1-8B-Instruct \
--base-url http://localhost:8000/v1--base-url is also picked up from PROFINE_LOCAL_BASE_URL.
The agent loop expects strong instruction-following and clean JSON. Models ≤7B often fail at the
interpret/suggest/editsteps; we recommend 70B-class or larger for end-to-end reliability.
read → profile → interpret → suggest → edit → benchmark
Each stage reads the previous stage's output from profine_output/ and writes its own. run-all chains them all; the individual profine <stage> commands let you re-run any single step.
- Pre-flight cost summary: one inline line before the run, no prompt unless the estimate exceeds
$5(override viaPROFINE_COST_PROMPT_THRESHOLD). - Resume on failure: re-run the same command after any mid-pipeline crash; stages with existing artifacts under
--outputare skipped. Pass--no-resumeto force a clean run. - Probe-and-adapt: if step times in your script are slow enough that the configured
--stepswould overshoot the wall-clock budget, profine measures the actual step time after a few probe iterations and trimstotal_stepsso the run finishes inside the budget. - Auto-peel on regression: if a runtime crash on the optimized run can't be healed, profine drops the most recent optimization from the stack and re-benchmarks. Loop repeats until success or only one optimization is left.
- Honest confidence intervals: the benchmark report shows per-run p25/p50/p75 + CV. The headline speedup adds a
lo×–hi×band when the run is noisy.
| Flag | Default | Description |
|---|---|---|
--provider |
openai |
openai, anthropic, or local |
--api-key |
from auth/env | Overrides saved auth + env var |
--model |
provider default | Required for --provider local |
--base-url |
— | For --provider local; env: PROFINE_LOCAL_BASE_URL |
--seed |
42 |
LLM seed. Temperature is always 0 |
-o/--output |
profine_output |
Output directory |
--prefs |
— | Markdown of user preferences (biases ranking + edits) |
--no-telemetry |
off | Disable anonymous telemetry for this run |
Run profine env to see every PROFINE_* variable profine reads with its current resolved value.
profine run-all examples/minGPT/projects/chargpt/chargpt.py| Flag | Default | Description |
|---|---|---|
--hardware |
required | Preset name — see Hardware |
--steps |
60 |
Total measured steps |
--warmup |
30 |
Warmup steps (stripped before measurement) |
--timeout |
900 |
Modal container timeout (s). Auto-extends on timeout |
--warmstart |
off | Reuse the deployed Modal app between runs |
--top |
all | Apply top N optimizations sequentially, each stacked on the previous |
--rtol / --atol |
0.01 / 0.0001 |
Loss tolerances (auto-widened for BF16/FP16, quantization) |
--no-resume |
off | Re-run every stage from scratch |
--yes, -y |
off | Skip the cost prompt |
profine read train.pyReads model/optimizer/dataloader/precision/distributed-strategy facts via AST + LLM, plus any local modules the script imports. Output: profine_output/read/architecture_record.json.
profine profile train.pyInstruments the script and runs it on Modal with torch.profiler. Collects step times, kernel breakdown, GPU utilization, memory. Same --hardware / --steps / --warmup / --timeout / --warmstart flags as run-all. Output: profine_output/profile/profile_record.json.
profine interpret --profile-dir profine_output/profileDeterministic analysis + LLM diagnosis. Output: profine_output/interpret/bottleneck_report.json.
profine suggest --interpret-dir profine_output/interpretFilters the catalog by applicability, then ranks remaining candidates by ROI. Output: profine_output/suggest/suggestion_report.json.
profine edit train.py --suggestion-dir profine_output/suggest # top-ranked only
profine edit train.py --suggestion-dir profine_output/suggest --top 3 # stack the top 3
profine edit train.py --suggestion-dir profine_output/suggest --optimization torch_compileMulti-file aware: discovers local modules the entry script imports and edits whichever file owns the code being optimized. Patched library files land under profine_output/edit/files/<rel-path> — your source tree is never touched. With --top N, per-iteration artifacts go in profine_output/edit/NN_<entry_id>/; cumulative result at profine_output/edit/edited_train.py.
profine benchmark train.py # uses <output>/edit/edited_train.py
profine benchmark train.py --optimized profine_output/edit/edited_train.pyRuns original and optimized back-to-back on the same hardware. Files under profine_output/edit/files/ are overlaid on the optimized run. Loss tolerance auto-widens for numerics-perturbing optimizations (BF16/mixed precision: rtol 5%; quantization: rtol 10%) — when widened, the headline verdict surfaces it explicitly. Output: profine_output/benchmark/.
Hardware presets live in profine/config/hardware.yaml. Pass one explicitly via --hardware.
| Preset | GPU | VRAM | Cost/hr |
|---|---|---|---|
1x_t4 |
T4 | 16 GB | $0.59 |
1x_l4 |
L4 | 24 GB | $0.80 |
1x_a10g |
A10G | 24 GB | $1.10 |
1x_a100 |
A100 | 80 GB | $2.50 |
1x_h100 |
H100 | 80 GB | $3.95 |
Prices from modal.com/pricing. The hardware preset, optimization catalog, kernel patterns, and extractor patterns are all editable YAML — extend without code changes.
| Command | What it does |
|---|---|
profine auth login / status / set / logout |
Manage saved credentials in ~/.profine/auth.json |
profine telemetry status / enable / disable |
Anonymous telemetry consent (or PROFINE_NO_TELEMETRY=1) |
profine env |
List every PROFINE_* env var with its current value |
Apache 2.0. See LICENSE.