Fix CUDA cleanup and NPU/HCCL setup bugs by morluto · Pull Request #92 · MoonshotAI/checkpoint-engine

morluto · 2026-06-30T22:48:43Z

Summary

This fixes three accelerator-specific failure modes in CUDA cleanup, NPU device discovery, and HCCL sub-communicator setup.

What was wrong

CUDA manual unpin could reject valid pinned memory

The in-place pin path registers memory with:

cudaHostRegister(..., 0)

A flag value of 0 is cudaHostRegisterDefault (0x00). During unregister, the cleanup path accepted only cudaHostRegisterMapped (0x02). On drivers that report the actual registration flag, the validation therefore fails before cudaHostUnregister() runs, and the memory stays registered.
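The mismatch can be sketched in a few lines of Python. This is only an illustration of the validation logic, not the code in checkpoint_engine/ps.py: the function names old_unpin_check and new_unpin_check are hypothetical, and the only values taken from the CUDA runtime are the two flag constants.

```python
# Flag values from the CUDA runtime API.
CUDA_HOST_REGISTER_DEFAULT = 0x00
CUDA_HOST_REGISTER_MAPPED = 0x02

# Flags the unpin path should treat as "registered by us".
ACCEPTED_UNPIN_FLAGS = frozenset(
    {CUDA_HOST_REGISTER_DEFAULT, CUDA_HOST_REGISTER_MAPPED}
)


def old_unpin_check(reported_flags: int) -> bool:
    """Hypothetical stand-in for the previous check: mapped memory only."""
    return reported_flags == CUDA_HOST_REGISTER_MAPPED


def new_unpin_check(reported_flags: int) -> bool:
    """Hypothetical stand-in for the fixed check: default or mapped."""
    return reported_flags in ACCEPTED_UNPIN_FLAGS


# Memory registered with cudaHostRegister(..., 0) reports the default flag.
rejected_before = not old_unpin_check(CUDA_HOST_REGISTER_DEFAULT)
accepted_now = new_unpin_check(CUDA_HOST_REGISTER_DEFAULT)
```

Using an explicit allow-list keeps the check strict for any other flag value while covering both ways the memory may have been registered.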

NPU UUID generation scanned the wrong device IDs

npu_generate_uuid() needs to query npu-smi -i <physical_id>, but the previous logic did not reliably choose physical IDs:

A fixed range of 0-7 missed devices with physical IDs above 7.
Using torch.npu.device_count() alone breaks when ASCEND_RT_VISIBLE_DEVICES masks or remaps physical devices, because the count reflects visible devices rather than physical IDs.

For example, a process restricted to physical NPU 4 may report only one visible device, but npu-smi still needs -i 4.

HCCL config fields were misspelled

Two HcclCommConfig assignments did not target the intended ctypes fields:

hccl_op_expansize_mode instead of hccl_op_expansion_mode
hcll_world_rank_id instead of hccl_world_rank_id

A ctypes Structure accepts unknown keyword arguments and stores them as ordinary Python attributes instead of raising an error, so the real struct fields stayed at their default values.
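The behaviour is easy to reproduce with a toy structure. DemoConfig below is a hypothetical one-field stand-in, not the real HcclCommConfig layout; only the field name and its misspelling come from this PR.

```python
import ctypes


class DemoConfig(ctypes.Structure):
    # Hypothetical stand-in with a single real field.
    _fields_ = [("hccl_world_rank_id", ctypes.c_uint32)]


# Misspelled keyword: no error is raised.
cfg = DemoConfig(hcll_world_rank_id=3)

real_field = cfg.hccl_world_rank_id   # the struct field, still zero-initialised
stray_attr = cfg.hcll_world_rank_id   # plain Python attribute, not in the struct

# Correct spelling writes into the struct memory.
fixed = DemoConfig(hccl_world_rank_id=3)
fixed_field = fixed.hccl_world_rank_id
```

Because the misspelled value never reaches the struct memory, the native library sees the default, which is why a typo like this passes silently at construction time.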

What changed

Accept both cudaHostRegisterDefault (0x00) and cudaHostRegisterMapped (0x02) in manual unpin validation.
Scan physical NPU IDs from ASCEND_RT_VISIBLE_DEVICES when present.
Otherwise scan IDs 0 through max(8, torch.npu.device_count()) - 1, preserving the old 0-7 coverage while supporting larger unmasked hosts.
Correct the HCCL ctypes field names.

Testing

pytest tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py -q

Result:

4 passed

ruff check checkpoint_engine/ps.py checkpoint_engine/device_utils.py checkpoint_engine/distributed/vllm_hccl.py tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py
ruff format checkpoint_engine/ps.py checkpoint_engine/device_utils.py checkpoint_engine/distributed/vllm_hccl.py tests/test_cuda_pin_memory.py tests/test_npu_device_utils.py tests/test_hccl_config.py --check

Both passed.

morluto added 4 commits June 30, 2026 22:37

fix: accept default CUDA host register flag

265362b

fix: derive NPU count from torch

a24126a

fix: correct HCCL config field names

60de1e0

fix: respect visible physical NPU IDs

b1bb779

morluto mentioned this pull request Jun 30, 2026

Validate HcclCommConfig size against CANN/HCCL headers #93

Open

morluto wants to merge 4 commits into MoonshotAI:main from morluto:fix/accelerator-path-pr2
