GPU fleet capacity planning for LLM Inference

"GPU poor" is not a lifestyle – it's just a capacity planning mistake.

Running LLMs at scale without thinking about throughput, bandwidth, and KV cache is how you end up either (a) burning money or (b) under the bridge.

This toolkit helps you avoid both.

It answers a simple question: how many GPUs do you actually need to serve an LLM at your target load? Under the hood, it combines closed-form capacity floors with a discrete-event simulator (SimPy), packaged as a Streamlit app.
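To give a feel for what a closed-form capacity floor looks like, here is a minimal sketch. It is not the toolkit's actual code: the names `Gpu`, `Model`, `kv_bytes_per_token`, and `gpu_floor` are hypothetical, the hardware numbers are illustrative, and the bandwidth bound assumes decode is purely memory-bound with fleet bandwidth that aggregates perfectly, so the result is an optimistic lower bound rather than a sizing recommendation.

```python
import math
from dataclasses import dataclass


@dataclass
class Gpu:
    mem_gb: float    # HBM capacity in GB (illustrative value, not a spec lookup)
    bw_gbps: float   # HBM bandwidth in GB/s (illustrative value)


@dataclass
class Model:
    params_b: float         # parameter count, in billions
    bytes_per_param: float  # 2 for fp16/bf16 weights
    n_layers: int
    n_kv_heads: int
    head_dim: int
    kv_bytes: float = 2.0   # bytes per cached K or V element


def kv_bytes_per_token(m: Model) -> float:
    # One K and one V vector per layer, per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes


def gpu_floor(gpu: Gpu, m: Model, concurrent_seqs: int,
              ctx_tokens: int, target_tok_s: float) -> dict:
    weights_gb = m.params_b * m.bytes_per_param  # billions of params * bytes = GB
    kv_gb = concurrent_seqs * ctx_tokens * kv_bytes_per_token(m) / 1e9

    # Memory floor: weights plus the full KV cache must fit in fleet HBM.
    memory_floor = math.ceil((weights_gb + kv_gb) / gpu.mem_gb)

    # Bandwidth floor (assumption: each decode step streams the weights and
    # the KV cache once and emits one token per in-flight sequence).
    steps_per_s = target_tok_s / concurrent_seqs
    required_gbps = steps_per_s * (weights_gb + kv_gb)
    bandwidth_floor = math.ceil(required_gbps / gpu.bw_gbps)

    return {
        "memory_floor": memory_floor,
        "bandwidth_floor": bandwidth_floor,
        "gpus": max(memory_floor, bandwidth_floor),
    }


# Hypothetical 8B-parameter model with grouped-query attention, fp16 weights.
model = Model(params_b=8, bytes_per_param=2, n_layers=32, n_kv_heads=8, head_dim=128)
gpu = Gpu(mem_gb=80, bw_gbps=2000)
result = gpu_floor(gpu, model, concurrent_seqs=64, ctx_tokens=4096, target_tok_s=5000)
print(result)
```

With these inputs the memory floor is one GPU but the bandwidth floor is two, so bandwidth is the binding constraint. Floors like these ignore queueing, prefill, and bursty arrivals, which is the gap a discrete-event simulation is meant to cover.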

Live demo, Blog post

Quick start

uv sync
uv run streamlit run src/howmanygpus/main.py

References

The formulas and framing draw on:

Making Deep Learning Go Brrrr From First Principles — prefill/decode cost model and serving intuition
Modal — Roofline model — compute vs. memory-bandwidth binding
Tensor Economics — LLM inference economics from first principles — capacity planning from hardware specs
