Summary
checkpoint_engine/distributed/vllm_hccl.py defines HcclCommConfig as a Python ctypes.Structure and passes size=312 when creating sub-communicators. The current Python structure has a larger ctypes.sizeof(...) than 312, but it is not clear whether HCCL expects the full Python mirror or a 312-byte prefix for the CANN/HCCL version this project targets.
Why this matters
If the size field is meant to match the exact C struct size, passing 312 could make the HCCL side reject the config or ignore/misread trailing fields. If 312 intentionally matches an older/prefix ABI, the Python fields after that boundary should be documented or adjusted to avoid future accidental changes.
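To make the "prefix ABI" reading concrete, here is a minimal sketch of one common convention for size-tagged structs: the consumer copies only the advertised number of bytes and treats everything after that as unset. Whether HCCL behaves this way is exactly what is unconfirmed. ConfigV1, ConfigV2, and consume are hypothetical names used only for illustration and do not come from CANN/HCCL.

```python
import ctypes


class ConfigV1(ctypes.Structure):
    # Hypothetical older layout.
    _fields_ = [("size", ctypes.c_uint32), ("mode", ctypes.c_uint32)]


class ConfigV2(ctypes.Structure):
    # Hypothetical newer layout: V1 plus one trailing field.
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("mode", ctypes.c_uint32),
        ("level", ctypes.c_uint32),
    ]


def consume(raw: bytes) -> ConfigV2:
    """Mimic a size-aware consumer: copy only the advertised bytes, zero the rest."""
    advertised = ConfigV1.from_buffer_copy(raw[: ctypes.sizeof(ConfigV1)]).size
    n = min(advertised, ctypes.sizeof(ConfigV2), len(raw))
    buf = bytearray(ctypes.sizeof(ConfigV2))
    buf[:n] = raw[:n]
    return ConfigV2.from_buffer_copy(buf)


# The caller fills in a trailing field but advertises only the V1 size.
cfg = ConfigV2(size=ctypes.sizeof(ConfigV1), mode=3, level=7)
seen = consume(bytes(cfg))
```

Under this assumed convention the trailing level value never reaches the consumer, which is the "ignore trailing fields" outcome described above. A consumer that instead requires size to equal its own struct size would reject the config.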
Suggested validation
- Compare the Python HcclCommConfig layout with the exact CANN/HCCL header version used by supported deployments.
- Confirm whether size=312 is intentional.
- If intentional, document why the Python struct may be larger than the advertised size.
- If not intentional, update the struct or derive size from the validated layout.
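A first pass at the comparison could be scripted roughly as below. This is a sketch under assumptions: ExampleCommConfig is a hypothetical stand-in with made-up fields, not the real HcclCommConfig layout, and fields_past is a helper introduced here for illustration. In practice the real class from vllm_hccl.py would be passed in instead.

```python
import ctypes


class ExampleCommConfig(ctypes.Structure):
    # Hypothetical stand-in; field names and sizes are illustrative only
    # and do NOT reflect the actual CANN/HCCL header.
    _fields_ = [
        ("reserved", ctypes.c_char * 24),
        ("name", ctypes.c_char * 128),
        ("tag", ctypes.c_char * 128),
        ("flags", ctypes.c_uint32 * 8),
        ("extra_a", ctypes.c_uint32),
        ("extra_b", ctypes.c_uint64),
    ]


ADVERTISED_SIZE = 312  # the value currently passed as size


def fields_past(struct_type, boundary):
    """Return (name, offset, size) for every field that ends after `boundary` bytes."""
    result = []
    for name, *_ in struct_type._fields_:
        descriptor = getattr(struct_type, name)  # ctypes field descriptor
        if descriptor.offset + descriptor.size > boundary:
            result.append((name, descriptor.offset, descriptor.size))
    return result


total = ctypes.sizeof(ExampleCommConfig)
trailing = fields_past(ExampleCommConfig, ADVERTISED_SIZE)
print(f"sizeof={total}, advertised={ADVERTISED_SIZE}")
for name, offset, size in trailing:
    print(f"  {name}: offset={offset}, size={size}")
```

The fields it reports are the ones to check against the target header. If the validated layout turns out to match the full Python struct, passing ctypes.sizeof(...) instead of a literal would keep the two from drifting apart.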
Related context
PR #92 fixes misspelled HCCL field names but intentionally does not change the size field because this needs ABI validation against the target HCCL headers.