[Klaud Cold] minimaxm3-fp8-mi325x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI325X recipe #1759
Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi325x-vllm (#1748), based on the MI325X non-MTP recipe + the MI300X MTP recipe. gfx942 serve shape (BF16 KV cache, --no-enable-prefix-caching, TRITON_ATTN, minimax_m3 parsers), runs with CUDA graphs (no --enforce-eager, VLLM_USE_BREAKABLE_CUDAGRAPH=0), plus the Inferact/MiniMax-M3-EAGLE3 draft via --speculative-config (eagle3, 3 tokens) + chat-template prompts. Carries the same in-place EAGLE3 patch as the mi300x/mi355x MTP recipes (functionstackx/vllm#1, upstream vllm-project/vllm#45546): the ROCm image lacks SupportsEagle3, so the recipe patches the installed amd/model.py before serving. H200-style search space trimmed at the high-conc end, latency rows at conc 1. Also adds SPEC_SUFFIX to launch_mi325x-amds.sh so spec-decoding=mtp routes to the _mtp script. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
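The serve shape described above can be sketched as a small script that assembles the arguments and prints them instead of launching anything. This is a minimal sketch, not the recipe itself: the target model name comes from the PR summary, the `TP` default of 8 and the `print_serve_args` helper are illustrative assumptions, and the BF16 KV cache setting, the minimax_m3 parser flags, and the benchmark-side `--use-chat-template` are left out because their exact spelling is not shown in this PR text.

```shell
# Sketch only: assemble the serve arguments described in the PR and print them.
# Nothing is executed, so this runs without a GPU or a vLLM install.

MODEL="MiniMaxAI/MiniMax-M3-MXFP8"     # target model named in the PR summary
DRAFT="Inferact/MiniMax-M3-EAGLE3"     # EAGLE3 draft model named in the PR
TP="${TP:-8}"                          # assumption: the config sweeps TP4 and TP8

# CUDA graphs stay enabled: no --enforce-eager, breakable graphs switched off.
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

# EAGLE3 with 3 speculative tokens; no attention_backend pin in the JSON.
SPEC_CONFIG='{"method":"eagle3","model":"'"$DRAFT"'","num_speculative_tokens":3}'

# Hypothetical helper: prints one argument per line.
print_serve_args() {
  set -- vllm serve "$MODEL" \
    --tensor-parallel-size "$TP" \
    --no-enable-prefix-caching \
    --block-size 128 \
    --attention-backend TRITON_ATTN \
    --speculative-config "$SPEC_CONFIG"
  printf '%s\n' "$@"
}

print_serve_args
```

Keeping the speculative config in one variable makes it easy to confirm that the JSON passed to `--speculative-config` is exactly what the recipe intends before any server is started.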
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27506407976
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27506408589
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27510493230
Summary
Adds the EAGLE3 (spec-decoding: mtp) sibling of minimaxm3-fp8-mi325x-vllm (#1748): MiniMax-M3 MXFP8 on MI325X (gfx942) single-node vLLM (ROCm), pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft. Based on the MI325X non-MTP recipe + the MI300X MTP recipe.
New script: minimaxm3_fp8_mi325x_mtp.sh
gfx942 serve shape: BF16 KV cache, --no-enable-prefix-caching, --block-size 128, --attention-backend TRITON_ATTN, minimax_m3 parsers.
Runs with CUDA graphs: no --enforce-eager, export VLLM_USE_BREAKABLE_CUDAGRAPH=0.
EAGLE3 draft: --speculative-config '{"method":"eagle3","model":"Inferact/MiniMax-M3-EAGLE3","num_speculative_tokens":3}' (no attention_backend pin; ROCm runs TRITON_ATTN) + --use-chat-template.
Same in-place EAGLE3 patch as the mi300x/mi355x MTP recipes (functionstackx/vllm#1, upstream vllm-project/vllm#45546): the ROCm image lacks SupportsEagle3, so the recipe patches the installed amd/model.py before serving. Idempotent; dry-run verified against the image's file.
Config + launcher
minimaxm3-fp8-mi325x-vllm-mtp in amd-master.yaml: H200-style search space (TP4/TP8 latency, TP4+EP4/TP8+EP8 TEP, TP8+EP8 dp-attn DEP), trimmed at the high-conc end, latency rows at conc 1.
launch_mi325x-amds.sh had no SPEC_SUFFIX (an mtp config would run the non-MTP script); added it so mtp → _mtp.sh.
Validation
bash -n clean on script + launcher; embedded patch dry-runs cleanly against the image's amd/model.py; routing simulated.
Like the other ROCm MTP PRs, this is a validation harness (runtime monkey-patch); once the upstream fix is in a rebuilt image, the patch idempotently no-ops.
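The launcher routing mentioned above (mtp selects the _mtp script) can be simulated with a few lines of shell. This is a hedged sketch of the idea, not the code in launch_mi325x-amds.sh: the `script_for` helper, the base name argument, and the non-MTP script name `minimaxm3_fp8_mi325x.sh` are assumptions made for illustration.

```shell
# Hypothetical helper: map a spec-decoding mode to the benchmark script name.
# Only "mtp" adds a suffix; any other value falls back to the base script.
script_for() {
  base="$1"            # e.g. minimaxm3_fp8_mi325x (assumed base name)
  spec_decoding="$2"   # e.g. mtp or none
  if [ "$spec_decoding" = "mtp" ]; then
    spec_suffix="_mtp"
  else
    spec_suffix=""
  fi
  printf '%s%s.sh\n' "$base" "$spec_suffix"
}

script_for minimaxm3_fp8_mi325x mtp    # prints minimaxm3_fp8_mi325x_mtp.sh
script_for minimaxm3_fp8_mi325x none   # prints minimaxm3_fp8_mi325x.sh
```

Without the suffix step, both calls would resolve to the same base script, which is the failure mode the PR describes for an mtp config on the old launcher.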
🤖 Generated with Claude Code
Note
Medium Risk
Benchmark-only changes, but the job mutates installed vLLM source at runtime; patch drift or a bad apply could fail jobs or skew results until upstream ships the fix in the image.
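The risk described here depends on the patch being idempotent and refusing to apply when its anchors drift. A minimal sketch of that pattern follows, assuming a single-line anchor. The `ANCHOR` and `PATCHED` strings and the `apply_patch` helper are invented for illustration and do not reflect the real contents of amd/model.py or the recipe's actual patch.

```shell
# Sketch of an idempotent, anchor-checked in-place edit.
# ANCHOR and PATCHED are made-up strings, not the real amd/model.py contents.
ANCHOR='class ExampleModel(BaseModel):'
PATCHED='class ExampleModel(BaseModel, SupportsEagle3):'

apply_patch() {
  file="$1"
  # Already patched: do nothing and succeed (idempotent no-op).
  if grep -qF -- "$PATCHED" "$file"; then
    echo "already patched: $file"
    return 0
  fi
  # Anchor missing: the file drifted, so fail instead of guessing.
  if ! grep -qF -- "$ANCHOR" "$file"; then
    echo "anchor not found, refusing to patch: $file" >&2
    return 1
  fi
  tmp="$(mktemp)" || return 1
  # Plain substitution is safe here only because these example strings
  # contain no sed metacharacters.
  if sed "s/$ANCHOR/$PATCHED/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping file permissions
    rm -f "$tmp"
    echo "patched: $file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Demo on a throwaway file: the first call edits, the second is a no-op.
demo="$(mktemp)"
printf '%s\n' 'class ExampleModel(BaseModel):' '    pass' > "$demo"
apply_patch "$demo"
apply_patch "$demo"
rm -f "$demo"
```

Checking for the patched form before the anchor is what lets the same script keep working once a rebuilt image already contains the upstream fix.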
Overview
Adds MiniMax-M3 MXFP8 on MI325X EAGLE3 speculative decoding (spec-decoding: mtp) as the MTP sibling of the existing non-MTP MI325X recipe.
Registers minimaxm3-fp8-mi325x-vllm-mtp in amd-master.yaml with an H200-style search space (TP4/TP8 latency from conc 1, EP and dp-attn rows trimmed vs the base config). Documents the change in perf-changelog.yaml.
Introduces minimaxm3_fp8_mi325x_mtp.sh, which serves with Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens), CUDA graphs (VLLM_USE_BREAKABLE_CUDAGRAPH=0), and --use-chat-template for benchmarks. Because the shipped ROCm image’s AMD model lacks SupportsEagle3, the script applies an idempotent in-place patch to vllm’s amd/model.py before vllm serve (fails if anchors drift).
Updates launch_mi325x-amds.sh with SPEC_SUFFIX so SPEC_DECODING=mtp runs the _mtp script, matching H200 launchers.
Reviewed by Cursor Bugbot for commit 36ada34.