Skip to content

Fix update-path crashes, stale P2P state, and NCCL RDMA parsing#91

Open
morluto wants to merge 5 commits into
MoonshotAI:mainfrom
morluto:fix/core-correctness-pr1
Open

Fix update-path crashes, stale P2P state, and NCCL RDMA parsing#91
morluto wants to merge 5 commits into
MoonshotAI:mainfrom
morluto:fix/core-correctness-pr1

Conversation

@morluto

@morluto morluto commented Jun 30, 2026

Copy link
Copy Markdown

Summary

This fixes several failure paths in checkpoint update plumbing that can cause crashes, leaked resources, or incorrect RDMA device selection.

What was wrong

  • load_metas() accepted empty owners, while gather_metas() filters them out. An empty owner can later reach bucket generation and hit an assertion on a zero-sized bucket.
  • P2PStore.register_named_tensors() updated self.named_tensors before batch_register_memory() succeeded. If the engine registration failed, local state claimed tensors were registered even though the transfer engine rejected them.
  • request_inference_to_update() created an httpx.Client for every update request without closing it, leaving the client transport/connection pool alive longer than necessary.
  • NCCL_IB_HCA parsing did not fully match NCCL:
    • bare tokens should be prefix matches unless the value starts with =
    • empty exclude lists like ^ / ^= should select no devices, matching NCCL behavior

What changed

  • Filter empty owners in load_metas() before storing global metadata.
  • Update P2PStore.named_tensors only after transfer-engine registration succeeds.
  • Use httpx.Client as a context manager in request_inference_to_update().
  • Adjust RDMA parser behavior and tests to match NCCL semantics.

Review Notes

Suggested order:

  1. checkpoint_engine/device_utils.py + tests/test_rdma_parser.py
  2. checkpoint_engine/ps.py, checkpoint_engine/p2p_store.py, tests/test_p2p_store.py
  3. checkpoint_engine/api.py + tests/test_api.py

Commits are split by area:

  • fix: align RDMA device parsing with NCCL
  • fix: keep parameter metadata state consistent
  • fix: close inference update HTTP client

Testing

pytest tests/test_api.py tests/test_rdma_parser.py tests/test_assign_receiver_ranks.py tests/test_p2p_store.py -q

Result:

60 passed, 1 skipped
ruff check checkpoint_engine/device_utils.py checkpoint_engine/api.py checkpoint_engine/ps.py checkpoint_engine/p2p_store.py tests/test_rdma_parser.py tests/test_api.py tests/test_p2p_store.py
ruff format checkpoint_engine/device_utils.py checkpoint_engine/api.py checkpoint_engine/ps.py checkpoint_engine/p2p_store.py tests/test_rdma_parser.py tests/test_api.py tests/test_p2p_store.py --check

Both passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant