"GPU poor" is not a lifestyle – it's just a capacity planning mistake.
Running LLMs at scale without thinking about throughput, bandwidth, and KV cache is how you end up either (a) burning money or (b) under the bridge.
This toolkit helps you avoid both.
It answers a simple question: how many GPUs do you actually need to serve an LLM at your target load? Under the hood, it combines closed-form capacity floors with a discrete-event simulator (simpy), packaged as a Streamlit app.
uv sync
uv run streamlit run src/howmanygpus/main.py

The formulas and framing draw on:
- Making Deep Learning Go Brrrr From First Principles — prefill/decode cost model and serving intuition
- Modal — Roofline model — compute vs. memory-bandwidth binding
- Tensor Economics — LLM inference economics from first principles — capacity planning from hardware specs
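
To make the idea of a closed-form capacity floor concrete, here is a minimal sketch in the spirit of the roofline framing above. It is not the toolkit's actual code: the names `Gpu`, `Model`, and `gpu_floors` are hypothetical, the hardware figures are round illustrative numbers rather than vendor specs, and the cost model is deliberately crude (about 2 FLOPs per parameter per generated token, each decoded token reads its sequence's full KV cache, weights counted once).

```python
import math
from dataclasses import dataclass


@dataclass
class Gpu:
    # Illustrative accelerator description (assumed values, not a real spec sheet).
    mem_bytes: float  # memory capacity in bytes
    mem_bw: float     # memory bandwidth in bytes/s
    flops: float      # peak dense throughput in FLOP/s


@dataclass
class Model:
    n_params: float       # parameter count
    bytes_per_param: int  # e.g. 2 for 16-bit weights
    n_layers: int
    n_kv_heads: int
    head_dim: int
    kv_bytes: int = 2     # bytes per cached K or V element

    def weight_bytes(self) -> float:
        return self.n_params * self.bytes_per_param

    def kv_bytes_per_token(self) -> int:
        # One K and one V vector per layer per KV head.
        return 2 * self.n_layers * self.n_kv_heads * self.head_dim * self.kv_bytes


def gpu_floors(model: Model, gpu: Gpu, concurrency: int,
               ctx_len: int, tok_s_per_user: float) -> dict:
    """Return fractional GPU lower bounds per resource (hypothetical helper)."""
    total_tok_s = concurrency * tok_s_per_user
    kv_per_seq = ctx_len * model.kv_bytes_per_token()
    return {
        # Weights (counted once) plus every live sequence's KV cache must fit.
        "memory": (model.weight_bytes() + concurrency * kv_per_seq) / gpu.mem_bytes,
        # Roughly 2 FLOPs per parameter per generated token.
        "compute": total_tok_s * 2 * model.n_params / gpu.flops,
        # KV reads only: each generated token streams its sequence's KV cache.
        "bandwidth": total_tok_s * kv_per_seq / gpu.mem_bw,
    }


# Assumed example: a 70B-parameter model with grouped-query attention.
model = Model(n_params=70e9, bytes_per_param=2, n_layers=80, n_kv_heads=8, head_dim=128)
gpu = Gpu(mem_bytes=80e9, mem_bw=3.0e12, flops=1.0e15)

floors = gpu_floors(model, gpu, concurrency=64, ctx_len=4096, tok_s_per_user=20)
binding = max(floors, key=floors.get)       # which resource sets the floor
min_gpus = math.ceil(max(floors.values()))  # whole GPUs needed at minimum
print(binding, min_gpus, floors)
```

These are floors, not predictions: they ignore batching inefficiency, prefill cost, queueing, and replicated weights, which is the kind of gap a discrete-event simulation is meant to expose.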