Validate HcclCommConfig size against CANN/HCCL headers #93

Description

@morluto

Summary

checkpoint_engine/distributed/vllm_hccl.py defines HcclCommConfig as a Python ctypes.Structure and passes size=312 when creating sub-communicators. For the current Python structure, ctypes.sizeof(...) returns a value larger than 312. It is unclear whether, for the CANN/HCCL version this project targets, HCCL expects the size of the full Python mirror or only a 312-byte prefix.

Why this matters

If the size field is meant to match the exact C struct size, passing 312 could make the HCCL side reject the config or ignore/misread trailing fields. If 312 intentionally matches an older/prefix ABI, the Python fields after that boundary should be documented or adjusted to avoid future accidental changes.
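
To make the prefix scenario concrete, here is a minimal sketch of how a size-versioned config struct is commonly consumed: the callee copies only the advertised number of bytes and leaves later fields at their defaults. This is a generic pattern simulated in pure Python, not HCCL's confirmed behavior. CallerConfig, its fields, and read_prefix are hypothetical names invented for illustration.

```python
import ctypes


class CallerConfig(ctypes.Structure):
    # Hypothetical layout, NOT the real HcclCommConfig:
    # size at bytes 0..8, old_option at 8..12, new_option at 12..16.
    _fields_ = [
        ("size", ctypes.c_uint64),
        ("old_option", ctypes.c_uint32),
        ("new_option", ctypes.c_uint32),
    ]


def read_prefix(config, advertised_size):
    """Simulate a callee that trusts only the first `advertised_size` bytes."""
    seen = CallerConfig()  # zero-initialised, standing in for callee defaults
    n = min(advertised_size, ctypes.sizeof(CallerConfig))
    ctypes.memmove(ctypes.byref(seen), ctypes.byref(config), n)
    return seen


# The caller sets both options but advertises only the first 12 bytes.
cfg = CallerConfig(size=12, old_option=7, new_option=9)
seen = read_prefix(cfg, cfg.size)
print(seen.old_option, seen.new_option)
```

Under this assumed behavior, new_option never reaches the callee even though the caller set it, which is the "ignore trailing fields" risk described above. Whether HCCL copies a prefix, rejects the size, or does something else is what the validation needs to establish.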

Suggested validation

  • Compare the Python HcclCommConfig layout with the exact CANN/HCCL header version used by supported deployments.
  • Confirm whether size=312 is intentional.
  • If intentional, document why the Python struct may be larger than the advertised size.
  • If not intentional, update the struct or derive size from the validated layout.
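
The first two checks above could start from a small layout report like the sketch below. ExampleCommConfig is a hypothetical stand-in with made-up fields, not the real HcclCommConfig; in practice the real class from vllm_hccl.py would be passed in, and the offsets compared by hand against the target CANN/HCCL header. The helper names are also assumptions.

```python
import ctypes

ADVERTISED_SIZE = 312  # the value the issue says is passed as `size`


class ExampleCommConfig(ctypes.Structure):
    # Hypothetical stand-in layout for illustration only.
    _fields_ = [
        ("header", ctypes.c_char * 24),
        ("flags", ctypes.c_uint32 * 2),
        ("name", ctypes.c_char * 128),
        ("tag", ctypes.c_char * 128),
        ("options", ctypes.c_uint32 * 6),
        ("extra_a", ctypes.c_uint32),
        ("extra_b", ctypes.c_uint32),
    ]


def layout_report(struct_type):
    """Return (name, offset, size) for every field of a ctypes.Structure."""
    rows = []
    for name, field_type, *_ in struct_type._fields_:
        offset = getattr(struct_type, name).offset
        rows.append((name, offset, ctypes.sizeof(field_type)))
    return rows


def fields_beyond(struct_type, advertised_size):
    """Names of fields that end past the advertised size boundary."""
    return [
        name
        for name, offset, size in layout_report(struct_type)
        if offset + size > advertised_size
    ]


for name, offset, size in layout_report(ExampleCommConfig):
    print(f"{name:10s} offset={offset:4d} size={size:4d}")
print("sizeof:", ctypes.sizeof(ExampleCommConfig))
print("beyond advertised size:", fields_beyond(ExampleCommConfig, ADVERTISED_SIZE))
```

If the validated header turns out to match the full Python mirror, size could be derived with ctypes.sizeof(...) rather than hard-coded. If 312 is a deliberate prefix, the output of fields_beyond identifies which fields need a documenting comment.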

Related context

PR #92 fixes misspelled HCCL field names but intentionally does not change the size field because this needs ABI validation against the target HCCL headers.
