Skip to content

Fix CUDA cleanup and NPU/HCCL setup bugs#92

Open
morluto wants to merge 4 commits into
MoonshotAI:mainfrom
morluto:fix/accelerator-path-pr2
Open

Fix CUDA cleanup and NPU/HCCL setup bugs#92
morluto wants to merge 4 commits into
MoonshotAI:mainfrom
morluto:fix/accelerator-path-pr2

Conversation

@morluto

@morluto morluto commented Jun 30, 2026

Copy link
Copy Markdown

Summary

This fixes three accelerator-specific failure modes in CUDA cleanup, NPU device discovery, and HCCL sub-communicator setup.

What was wrong

CUDA manual unpin could reject valid pinned memory

The in-place pin path registers memory with:

cudaHostRegister(..., 0)

That flag is cudaHostRegisterDefault. During unregister, the cleanup path only accepted cudaHostRegisterMapped (0x02). On drivers that report the actual registration flag, cleanup can fail before cudaHostUnregister() runs.

NPU UUID generation scanned the wrong device IDs

npu_generate_uuid() needs to query npu-smi -i <physical_id>, but the previous logic did not reliably choose physical IDs:

  • a fixed range missed devices above 7
  • using torch.npu.device_count() alone breaks when ASCEND_RT_VISIBLE_DEVICES masks/remaps physical devices

For example, a process visible on physical NPU 4 may report only one visible device, but npu-smi still needs -i 4.

HCCL config fields were misspelled

Two HcclCommConfig assignments did not target the intended ctypes fields:

  • hccl_op_expansize_mode instead of hccl_op_expansion_mode
  • hcll_world_rank_id instead of hccl_world_rank_id

ctypes accepts unknown kwargs as ordinary Python attributes, so the real struct fields stayed at their default values.

What changed

  • Accept both cudaHostRegisterDefault (0x00) and cudaHostRegisterMapped (0x02) in manual unpin validation.
  • Scan physical NPU IDs from ASCEND_RT_VISIBLE_DEVICES when present.
  • Otherwise scan max(8, torch.npu.device_count()), preserving the old 0-7 coverage while supporting larger unmasked hosts.
  • Correct the HCCL ctypes field names.

Testing

pytest tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py -q

Result:

4 passed
ruff check checkpoint_engine/ps.py checkpoint_engine/device_utils.py checkpoint_engine/distributed/vllm_hccl.py tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py
ruff format checkpoint_engine/ps.py checkpoint_engine/device_utils.py checkpoint_engine/distributed/vllm_hccl.py tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py --check

Both passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant