[DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 #1709

Open
seungrokj wants to merge 13 commits into chore/agentx-v0.4 from amd/agentx-v0.4_rebase0611

Conversation

@seungrokj (Collaborator) commented Jun 11, 2026

Summary

  • Add qwen3.5-fp4-mi355x-sglang-agentic-hicache config: SGLang agentic-coding sweep with and without hicache offloading (TP2, EP1)
  • Add minimaxm2.5-fp4-mi355x-vllm-agentic-lmcache config: vLLM agentic-coding sweep with lmcache
  • Add new agentic benchmark scripts: minimaxm2.5_fp4_mi355x.sh, qwen3.5_fp4_mi355x.sh
  • Update existing agentic scripts: glm5.1_fp4_mi355x.sh, kimik2.5_fp4_mi355x.sh, minimaxm2.5_fp8_mi355x.sh, qwen3.5_fp8_mi355x.sh
  • Update launch_mi355x-amds.sh

Test plan

  • Verify hicache/lmcache agentic configs run correctly on MI355X
  • Confirm new agentic scripts launch without errors

🤖 Generated with Claude Code


Note

Medium Risk
Large CI sweep surface and runtime git clone/build of LMCache on benchmark nodes; KV offload and server-flag changes can fail silently or skew comparability until validated on hardware.

Overview
Expands MI355X agentic-coding coverage in amd-master.yaml with new # target matrix entries that A/B GPU-only vs HiCache (SGLang) or LMCache (vLLM/ATOM) across Qwen3.5, GLM-5.1, Kimi K2.5, MiniMax M2.5, and DeepSeek-V4, plus tuned concurrency/TP grids and image pins. Several existing agentic rows are adjusted (e.g. Qwen3.5 FP8 HiCache moves to TP4, MiniMax FP8 agentic images v0.22.1 → v0.22.0).

Agentic launchers gain offloading wiring: SGLang scripts add HiCache sizing/policies and keep radix prefix cache for replay; vLLM/ATOM scripts add LMCache MP (clone + hipcc build), larger host DRAM budgets, and revised CPU offload partitioning. Kimi K2.5 drops the large inline ROCm LMCache sitecustomize/chunked-connector patches in favor of the upstream LMCache install path. New scripts cover DSv4 SGLang/ATOM agentic, MiniMax/Qwen FP4 MI355X, and GLM/Qwen FP8 MI300X HiCache variants.

DSv4 SGLang agentic and fixed-seq recipes are realigned to a newer stack (sglang serve, dsv4 attention backend, updated env toggles; agentic adds HiCache). Slurm launcher launch_mi355x-amds.sh excludes node mia1-p01-g37.

Reviewed by Cursor Bugbot for commit d3caa2b. Bugbot is set up for automated code reviews on this repo.

…r mi355x models

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please first open a PR there (vLLM recipes and/or SGLang cookbook) before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. Failures are often just flakes, and re-running the failed jobs fixes them. If you re-run failed jobs, you remain responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread benchmarks/single_node/agentic/glm5.1_fp4_mi355x.sh
$ASYNC_SCHEDULING_ARGS
"${PREFIX_CACHE_ARGS[@]}"
"${OFFLOAD_ARGS[@]}"
)

vLLM uses wrong model

High Severity

The vLLM command serves "$MODEL" and omits --served-model-name, while the script downloads weights into MODEL_PATH and build_replay_cmd sends --model $MODEL to aiperf. That breaks the usual MODEL_PATH + served-name pairing used by sibling agentic scripts and can fail when MODEL is a Hub id but weights live under MODEL_PATH.
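A minimal sketch of the pairing the review describes, assuming the script's existing `MODEL` and `MODEL_PATH` variables; `build_vllm_serve_cmd`, `VLLM_CMD`, the default port and the placeholder model names are hypothetical, and the remaining server flags from the hunk above are omitted. The server loads the local weights when `MODEL_PATH` is set and advertises the Hub id through `--served-model-name`, so aiperf's `--model $MODEL` still resolves:

```bash
#!/usr/bin/env bash
# Hypothetical helper: collect the vllm serve argv in the global array VLLM_CMD
# so it can be inspected before being run as "${VLLM_CMD[@]}".
build_vllm_serve_cmd() {
    # Load the pre-staged local weights when MODEL_PATH is set, else the Hub id.
    local weights="${MODEL_PATH:-$MODEL}"
    # --served-model-name keeps the OpenAI model name equal to MODEL, which is
    # what the replay client sends.
    VLLM_CMD=(
        vllm serve "$weights"
        --served-model-name "$MODEL"
        --port "${PORT:-8000}"
    )
}

# Placeholder values for illustration only.
MODEL="org/example-model"
MODEL_PATH="/data/models/example-model"
build_vllm_serve_cmd
printf '%s ' "${VLLM_CMD[@]}"; echo
```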

Reviewed by Cursor Bugbot for commit 01cc2af.

--mem-fraction-static 0.8 \
--context-length $MAX_MODEL_LEN \
"${CACHE_ARGS[@]}" \
"${WARMUP_ARGS[@]}" \

SGLang ignores MODEL_PATH

Medium Severity

SGLang is started with --model-path $MODEL and no --served-model-name, after the script may download into MODEL_PATH. Matrix jobs that set a local MODEL_PATH can still point the server at the Hub id, and the OpenAI model name may not match MODEL used by aiperf.
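One way to address this, sketched under the assumption that the script exports `MODEL` (Hub id) and optionally `MODEL_PATH` (pre-staged directory); `resolve_model_path` is a hypothetical helper, not part of the existing scripts. It falls back to the Hub id only when the local directory is missing, and `--served-model-name` keeps the API name aligned with the `MODEL` value aiperf sends:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the weights location SGLang should load.
resolve_model_path() {
    if [[ -n "${MODEL_PATH:-}" && -d "$MODEL_PATH" ]]; then
        # Weights were pre-staged for this job; use them.
        printf '%s\n' "$MODEL_PATH"
    else
        # No local copy: let SGLang resolve the Hub id itself.
        printf '%s\n' "$MODEL"
    fi
}

# Usage inside the launcher (other flags unchanged):
#   python3 -m sglang.launch_server \
#       --model-path "$(resolve_model_path)" \
#       --served-model-name "$MODEL" \
#       ...
```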

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 01cc2af.

cd LMCache
pip install -r requirements/build.txt
CXX=hipcc BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
cd ..

LMCache clone not idempotent

Medium Severity

The lmcache path runs git clone https://github.com/LMCache/LMCache.git unconditionally. With set -e, a second run in the same working directory exits when LMCache already exists, so lmcache agentic jobs fail on retry or reuse of the job cwd.
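A minimal sketch of an idempotent version, assuming the build commands quoted above stay the same; `ensure_lmcache_checkout` and `build_lmcache_hip` are hypothetical helper names. The clone is skipped when a checkout already exists, and the build runs in a subshell so the caller's working directory never changes and no `cd ..` is needed:

```bash
#!/usr/bin/env bash
# Hypothetical helper: clone LMCache only if no checkout exists yet, so a retry
# in the same working directory does not abort under `set -e`.
ensure_lmcache_checkout() {
    local dir="${1:-LMCache}"
    local url="${2:-https://github.com/LMCache/LMCache.git}"
    if [[ -d "$dir/.git" ]]; then
        echo "reusing existing LMCache checkout in $dir"
    else
        git clone "$url" "$dir"
    fi
}

# Hypothetical helper: build from the checkout inside a subshell; the cd does
# not leak back to the calling script.
build_lmcache_hip() {
    local dir="${1:-LMCache}"
    (
        cd "$dir" &&
        pip install -r requirements/build.txt &&
        CXX=hipcc BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
    )
}

# Usage:
#   ensure_lmcache_checkout LMCache
#   build_lmcache_hip LMCache
```

A leftover `LMCache` directory without `.git` (for example, from an interrupted clone) still makes `git clone` fail, which is arguably the safer outcome than silently building a partial tree.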

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 01cc2af.

@seungrokj seungrokj changed the title [AMD] agentic: add hicache/lmcache configs, update agentic scripts for mi355x models [DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 Jun 11, 2026
ajith-sirra-amd and others added 3 commits June 11, 2026 12:54
Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onfig

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

python3 -m sglang.launch_server \
--attention-backend aiter \
--model-path $MODEL \

Server ignores MODEL_PATH

Medium Severity

Weights are downloaded into MODEL_PATH when the workflow sets that directory, but SGLang is started with --model-path $MODEL (Hub id) instead of MODEL_PATH. The server may load a different cache path than the one prepared for the job.

Reviewed by Cursor Bugbot for commit 32f5007.

OFFLOAD_ARGS=(
--kv-transfer-config
"{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_CONNECT_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}"
)

LMCache missing hybrid disable

High Severity

The lmcache branch omits --disable-hybrid-kv-cache-manager on vllm serve, while the new minimaxm2.5-fp8-mi355x-vllm-agentic-lmcache config exercises that path. The sibling FP4 script documents that LMCache is incompatible without disabling the hybrid KV manager.
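A sketch of the lmcache branch with the flag restored, assuming `LMCACHE_CONNECT_HOST` and `LMCACHE_PORT` are set as in the snippet above; `build_lmcache_offload_args` is a hypothetical wrapper, and the connector JSON is copied unchanged from the PR:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: fill OFFLOAD_ARGS for the lmcache offloading mode.
# Per the review, LMCache is incompatible with vLLM's hybrid KV cache manager,
# so the lmcache path opts out of it explicitly.
build_lmcache_offload_args() {
    local host="${LMCACHE_CONNECT_HOST:?LMCACHE_CONNECT_HOST must be set}"
    local port="${LMCACHE_PORT:?LMCACHE_PORT must be set}"
    OFFLOAD_ARGS=(
        --kv-transfer-config
        "{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$host\",\"lmcache.mp.port\":$port}}"
        --disable-hybrid-kv-cache-manager
    )
}
```

Keeping the flag inside the lmcache branch leaves the GPU-only and CPU-offload rows of the sweep unchanged.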

Reviewed by Cursor Bugbot for commit 32f5007.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
module = _orig_import(name, globals, locals, fromlist, level)
if name == "lmcache.v1.lazy_memory_allocator" or (
name.startswith("lmcache") and "lmcache.v1.lazy_memory_allocator" in sys.modules
):

Kimi LMCache ROCm fixes removed

High Severity

The Kimi MI355X agentic script replaces the prior ROCm LMCache install (ROCm CuPy, nixl cleanup, demand-pinned allocator, MLA block fallback, chunked connector, scheduler KV-transfer patch) with a bare git clone and HIP build. New kimik2.5-fp4-mi355x-vllm-agentic-lmcache sweeps depend on this path for Kimi MLA KV on AMD.

Reviewed by Cursor Bugbot for commit 351e729.


# ---- Resolve traces and install deps ----------------------------------------
# https://huggingface.co/datasets/semianalysisai/cc-traces-weka-with-subagents-060826
export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_060826

DSv4 atom uncapped traces

Medium Severity

This new DSv4 ATOM agentic script sets WEKA_LOADER_OVERRIDE to the uncapped 060826 trace set, while peer MI355X agentic scripts in the same PR use 060226_256k to avoid ~1M-token traces that are rejected and skew sweeps.

Reviewed by Cursor Bugbot for commit 351e729.

$ASYNC_SCHEDULING_ARGS
"${PREFIX_CACHE_ARGS[@]}"
"${OFFLOAD_ARGS[@]}"
)

MiniMax FP8 launcher regressed

High Severity

The MI355X MiniMax FP8 agentic launcher was replaced with a Kimi-style vLLM recipe. Existing minimaxm2.5-fp8-mi355x-vllm-agentic jobs (TP4/EP4, offloading=cpu) lose the prior --max-model-len, ROCM_AITER_UNIFIED_ATTN backend, MODEL_PATH-based serve, and SimpleCPU offload wiring they depended on.

Reviewed by Cursor Bugbot for commit faba18f.

device,
)
return torch.as_strided(
base,

Kimi context length dropped

Medium Severity

The launcher no longer normalizes MAX_MODEL_LEN to 262144 or passes --max-model-len to vLLM. Agentic sweeps typically leave MAX_MODEL_LEN at 0, so the replay harness and Kimi’s enforced context window can disagree and traces may be filtered or rejected differently than the server allows.
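A sketch of restoring both pieces, assuming 262144 is the Kimi K2.5 window the launcher previously enforced (the value the review cites) and that the replay command reads `MAX_MODEL_LEN` later in the script; `normalize_max_model_len`, `KIMI_MAX_MODEL_LEN` and `CONTEXT_ARGS` are hypothetical names. Unset or zero values become 262144, and other values pass through unchanged:

```bash
#!/usr/bin/env bash
# Context window cited in the review for Kimi K2.5.
KIMI_MAX_MODEL_LEN=262144

# Hypothetical helper: map the sweep's "0 = unset" convention to the real window.
normalize_max_model_len() {
    local len="${1:-0}"
    if (( len <= 0 )); then
        len=$KIMI_MAX_MODEL_LEN
    fi
    printf '%s\n' "$len"
}

MAX_MODEL_LEN="$(normalize_max_model_len "${MAX_MODEL_LEN:-0}")"
export MAX_MODEL_LEN
# Pass the same value to vllm serve so server and replay harness agree.
CONTEXT_ARGS=(--max-model-len "$MAX_MODEL_LEN")
```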

Reviewed by Cursor Bugbot for commit faba18f.

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>
# ZMQ-style host string.
LMCACHE_CONNECT_HOST="${LMCACHE_CONNECT_HOST:-tcp://$LMCACHE_HOST}"
LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$TOTAL_CPU_DRAM_GB}"
LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$((TOTAL_CPU_DRAM_GB / (8 / TP)))}"

LMCache pool wrongly partitioned

Medium Severity

LMCACHE_L1_SIZE_GB for the external LMCache MP server is derived with TOTAL_CPU_DRAM_GB / (8 / TP), the same formula used for per-rank vLLM CPU offload. The MP server owns one node pool; at TP=4 this shrinks L1 from ~3 TB to ~1.5 TB versus the prior full TOTAL_CPU_DRAM_GB default.
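The arithmetic behind the review, as a sketch: on an 8-GPU node the script's `TOTAL_CPU_DRAM_GB / (8 / TP)` gives one TP group's share, which suits vLLM's own CPU offload but not the single node-wide LMCache MP pool. `cpu_budgets` is a hypothetical helper, and 3072 GB is a placeholder total, not a measured node value:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<vllm_cpu_offload_gb> <lmcache_l1_gb>".
# Assumes 1 <= TP <= 8 (larger TP would make 8 / TP zero).
cpu_budgets() {
    local total_gb="$1" tp="$2"
    # One TP group's share of an 8-GPU node. Bash integer division truncates,
    # so e.g. TP=3 gets the same share as TP=4.
    local vllm_gb=$(( total_gb / (8 / tp) ))
    # The external LMCache MP server owns one pool for the whole node, so its
    # L1 default stays at the full budget, i.e. the prior
    # LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$TOTAL_CPU_DRAM_GB}".
    local lmcache_gb="$total_gb"
    printf '%s %s\n' "$vllm_gb" "$lmcache_gb"
}

cpu_budgets 3072 4   # prints 1536 3072
```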

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 8ca4bc1.

… config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated
Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated
…cripts and master yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread benchmarks/single_node/agentic/dsv4_fp4_mi355x_sglang.sh
--cuda-graph-max-bs "$PER_ENGINE_MAX_RUNNING" \
--disable-radix-cache \
--attention-backend dsv4 \
--max-running-requests ${CONC} \

DP max-running requests wrong

Medium Severity

When DP_ATTENTION=true, the script computes PER_ENGINE_MAX_RUNNING as CONC/TP for per-engine limits, but the server is started with --max-running-requests ${CONC}. Each DP engine may accept too many sequences versus the harness load-balancing assumption.
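A sketch of the per-engine cap, assuming (as the review does) that each DP attention engine applies `--max-running-requests` on its own and that the DP size equals TP; `max_running_requests` is a hypothetical helper. It rounds up so the TP engines together still admit at least `CONC` requests, which differs slightly from a plain `CONC / TP`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: value for --max-running-requests.
# Args: CONC, TP, DP_ATTENTION ("true" or anything else).
max_running_requests() {
    local conc="$1" tp="$2" dp_attention="$3"
    if [[ "$dp_attention" == "true" ]]; then
        # Ceiling division: tp engines x per-engine cap >= conc.
        printf '%s\n' $(( (conc + tp - 1) / tp ))
    else
        # Single engine: it sees the whole concurrency.
        printf '%s\n' "$conc"
    fi
}

# Usage in the launcher:
#   --max-running-requests "$(max_running_requests "$CONC" "$TP" "$DP_ATTENTION")"
```

If the script's existing `PER_ENGINE_MAX_RUNNING` already holds the intended per-engine value, passing it directly (as the hunk already does for `--cuda-graph-max-bs`) is the smaller change.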

Reviewed by Cursor Bugbot for commit 76d90e0.

…ript

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 14 total unresolved issues (including 12 from previous reviews).

Reviewed by Cursor Bugbot for commit 4ebc4e2.

Comment thread benchmarks/single_node/agentic/dsv4_fp4_mi355x_sglang.sh
python3 -m sglang.launch_server \
--model-path "$MODEL_PATH" --served-model-name "$MODEL" \
sglang serve \
--model-path $MODEL \

Wrong model path for serve

Medium Severity

The script downloads weights into MODEL_PATH when set, but sglang serve uses --model-path $MODEL (Hub id) instead of "$MODEL_PATH". Runs that pre-stage a local directory can ignore the prepared path and rely on a different cache location.

Reviewed by Cursor Bugbot for commit 4ebc4e2.

seungrokj and others added 2 commits June 15, 2026 11:56
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

None yet

Projects

Status: No status

2 participants