EXPERIMENT: GB drop wave1 OFED for inbox RDMA + float kernel#8776
xuexu6666 wants to merge 2 commits
Experiment (NOT for merge) to test whether the GB image can use the kernel's inbox RDMA stack instead of full DOCA/OFED. Branched off main; changes ONLY the OFED->inbox + kernel-pin variables, everything else stays at the main baseline.

- pre-install-dependencies.sh: float the kernel. Install the latest linux-azure-nvidia unpinned (like the vanilla arm64 image) instead of the pinned 6.14.0-1003.3 + PPA + curl fallback. The latest (6.14.0-1007.7) was verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib.
- gb-mai-bom.json: drop versions-wave1 (doca-ofed + MLNX verbs), doca-custom-repo, and kernel-versions. wave2 (driver) + wave3 unchanged.
- install-dependencies.sh: skip the DOCA repo setup and wave1 install; disable the staged doca-net.list so RDMA userspace resolves to distro rdma-core (not MLNX); install rdma-core + ibverbs-providers + ibverbs-utils; drop 'systemctl enable openibd'.

nvidia-peermem (wave2 driver DKMS output) now builds against inbox ib_core. Key build/hardware signals: peermem DKMS compiles; ibv_devinfo shows data_direct on CX8; distro rdma-core exposes the data_direct verbs path + SHARP; NCCL perf parity.
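The hardware signals listed above could be spot-checked on a built node with a small helper along these lines. This is a minimal sketch, not part of the PR: `has_signal` is a hypothetical name, and it only tests whether a token appears in whatever text is piped into it.

```shell
# Hypothetical helper (not in the PR): report whether a token appears in stdin.
# $1 is a human-readable label, $2 is the fixed string to look for.
has_signal() {
  local label token
  label="$1"
  token="$2"
  if grep -qF -- "$token"; then
    echo "PASS: ${label}"
  else
    echo "FAIL: ${label}"
    return 1
  fi
}

# Usage on a node (assumes ibverbs-utils from distro rdma-core is installed):
#   ibv_devinfo -v | has_signal "data_direct exposed" "data_direct"
```

Other signals (for example the peermem DKMS build) could be piped through the same helper, though the exact tokens to match would need to be confirmed against real output.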
PR Title Lint Failed ❌
Your PR title doesn't follow the expected format. Please update your PR title to follow one of the expected patterns (for example, Conventional Commits format), and the lint check will run again automatically.
Pull request overview
Experimental VHD builder changes for the NVIDIA GB Ubuntu 24.04 arm64 image to validate running on the kernel’s inbox RDMA stack (rdma-core + in-kernel mlx5/ib_core) instead of installing DOCA/OFED wave1 packages, and to float the linux-azure-nvidia kernel version.
Changes:
- Removed DOCA/OFED (wave1) package pinning/config from the GB BOM and switched userspace RDMA to distro `rdma-core` tooling.
- Updated the GB install flow to disable the staged DOCA apt source and skip `openibd` enablement.
- Simplified kernel installation logic on Ubuntu 24.04 arm64 to install unpinned `linux-azure-nvidia` from the repo.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/pre-install-dependencies.sh | Drops pinned/PPA/fallback kernel logic and installs unpinned linux-azure-nvidia for Ubuntu 24.04 arm64. |
| vhdbuilder/packer/install-dependencies.sh | Skips DOCA/OFED wave1; disables DOCA apt repo and installs distro RDMA userspace packages; removes openibd enable. |
| vhdbuilder/packer/gb-mai-bom.json | Removes wave1 + doca repo + kernel pin entries; documents experiment intent and keeps wave2/wave3 pinned sets. |
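As a rough illustration of the install-dependencies.sh change summarized in the table, disabling a staged apt source can be sketched as a rename, since apt only reads files in sources.list.d that end in `.list` or `.sources`. The function name `disable_apt_source` and the path in the usage comment are assumptions; the PR itself may remove the file instead.

```shell
# Hypothetical helper (not in the PR): disable an apt source list by renaming it
# so that apt no longer reads it on the next 'apt-get update'.
disable_apt_source() {
  local list_path
  list_path="$1"
  if [ -f "$list_path" ]; then
    mv -- "$list_path" "${list_path}.disabled"
    echo "disabled ${list_path}"
  else
    echo "not present: ${list_path}"
  fi
}

# Usage during the image build (the path is an assumption):
#   disable_apt_source /etc/apt/sources.list.d/doca-net.list
#   apt-get update
#   apt-get install -y --no-install-recommends rdma-core ibverbs-providers ibverbs-utils
```

Renaming rather than deleting keeps the staged file around for inspection, which can help when debugging an experimental build.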
    # EXPERIMENT: install the latest linux-azure-nvidia from the repo with no version pin (like
    # the vanilla arm64 image), so GB floats the kernel onto the newest azure-nvidia kernel
    # (verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib).
    apt-get update

    else
      apt-get update
      if apt-cache show "${NVIDIA_KERNEL_PACKAGE}" &> /dev/null; then
        echo "ARM64 image. Installing NVIDIA kernel and its packages alongside LTS kernel"
        wait_for_apt_locks
        sudo apt install --no-install-recommends -y "${NVIDIA_KERNEL_PACKAGE}"
        echo "after installation:"
        dpkg -l | grep "linux-.*-azure-nvidia" || true
      else
        echo "ARM64 image. NVIDIA kernel not available, skipping installation."
      fi
      echo "ARM64 image. NVIDIA kernel not available, skipping installation."

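The difference between floating and pinning the kernel comes down to the package argument passed to apt: a bare package name takes the newest candidate, while `name=version` pins it. A minimal sketch, assuming a hypothetical helper `kernel_pkg_spec` that is not in the PR:

```shell
# Hypothetical helper (not in the PR): build the apt package argument.
# With no version the install floats to the newest candidate; with a version it is pinned.
kernel_pkg_spec() {
  local pkg version
  pkg="$1"
  version="${2:-}"
  if [ -n "$version" ]; then
    echo "${pkg}=${version}"
  else
    echo "${pkg}"
  fi
}

# Usage:
#   apt-get install -y --no-install-recommends "$(kernel_pkg_spec linux-azure-nvidia)"
# Whether a string such as 6.14.0-1003.3 is the exact apt version of the package
# is an assumption; check 'apt-cache policy linux-azure-nvidia' before pinning.
```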
Build 169569810 failed: the staged doca-net.list points at the DOCA 'latest' repo, which is signed with a key (DC726C5E41B9CC50) not in the shipped keyring. Every apt-get update then emits a GPG 'is not signed' W:/E:, which the retry-wrapped apt_get_update helper treats as fatal -> 10 retries -> exit 99 -> build failure, before the GB block's rm of the list ever runs. On main this is masked because the GB block replaces 'latest' with the pinned doca/3.1.0 repo (valid key); this experiment dropped that replacement, exposing the broken 'latest' repo.

Fix: gate off the DOCA repo staging in packer_source.sh entirely (we install no OFED here - inbox mlx5/ib + distro rdma-core provide RDMA), so no apt-get update ever sees the DOCA repo. Keep the rm in install-dependencies.sh as belt-and-suspenders.
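The failure mode described above (a warning line turning into a fatal exit after retries) can be reproduced with a strict retry wrapper like the following. This is a sketch under assumptions, not the repo's apt_get_update helper: `run_strict_with_retries` is a hypothetical name, and the real helper may sleep between attempts or match different patterns.

```shell
# Hypothetical sketch (not the repo's apt_get_update helper).
# Runs a command up to $1 times. An attempt counts as failed if the command exits
# non-zero OR prints any line starting with "W:" or "E:". Returns 99 when all fail.
run_strict_with_retries() {
  local retries attempt output
  retries="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if output="$("$@" 2>&1)" && ! printf '%s\n' "$output" | grep -qE '^(W|E):'; then
      return 0
    fi
    echo "attempt ${attempt}/${retries} failed" >&2
    attempt=$((attempt + 1))
  done
  return 99
}

# Usage:
#   run_strict_with_retries 10 apt-get update
```

With a wrapper of this shape, a single unsigned repo in sources.list.d fails every attempt, which is why removing the repo before any update runs matters more than removing it later.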
Failed gate run
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169587964

Detective summary
The VHD scan reported known CIS rule

Likely cause and signature
Known Ubuntu 22.04 Gen2 containerd logfile-access CIS signal, signature

Recommended owner/action
Linux VHD/CIS owners should continue the existing repair investigation in Bug #38501652; no new repair item was created.