[codex] Add robust exec-chain profiling support #46

Open
wangyinz wants to merge 9 commits into main from codex/robust-exec-chain

Conversation

@wangyinz

Member

Summary

  • Add robust PEAK exec-chain support for exec*, fexecve, execveat, posix_spawn, and posix_spawnp paths.
  • Add non-destructive rank-local exec checkpoint output and CSV tracing controls without calling peak_fini() before exec.
  • Preserve profiling across custom envp by injecting/de-duplicating LD_PRELOAD and propagating missing PEAK_* variables when enabled.
  • Interpose dynamically linked syscall(SYS_execve/SYS_execveat) so raw syscall callers do not silently bypass the exec-chain path.
  • Add tests and README documentation for PEAK_EXEC_CHAIN, PEAK_EXEC_CHECKPOINT, PEAK_EXEC_PROPAGATE_PEAK_ENV, and PEAK_EXEC_TRACE_PATH.
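The LD_PRELOAD injection and de-duplication step for a caller-supplied envp could look roughly like the sketch below. It is not the PR's implementation: `envp_with_preload`, `free_envp`, and the other helper names are hypothetical, the library path is whatever the caller passes in, and keeping only the first `LD_PRELOAD=` entry is an assumption. Propagation of `PEAK_*` variables is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PRELOAD_KEY "LD_PRELOAD="
#define PRELOAD_KEY_LEN (sizeof PRELOAD_KEY - 1)

/* Portable replacement for strdup(); returns NULL on allocation failure. */
static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy) memcpy(copy, s, n);
    return copy;
}

/* Returns 1 if `lib` is already an item of an LD_PRELOAD value.
 * ld.so accepts both ':' and ' ' as item separators. */
static int preload_contains(const char *value, const char *lib)
{
    size_t lib_len = strlen(lib);
    while (*value) {
        size_t n = strcspn(value, ": ");
        if (n == lib_len && strncmp(value, lib, n) == 0) return 1;
        value += n;
        if (*value) value++; /* skip the separator */
    }
    return 0;
}

/* Builds "LD_PRELOAD=<lib>:<old_value>", or a plain copy of the old entry
 * when `lib` is already listed. */
static char *make_preload_entry(const char *lib, const char *old_value)
{
    int present = preload_contains(old_value, lib);
    const char *head = present ? "" : lib;
    const char *sep = (!present && *old_value) ? ":" : "";
    size_t need = PRELOAD_KEY_LEN + strlen(head) + strlen(sep)
                + strlen(old_value) + 1;
    char *entry = malloc(need);
    if (entry)
        snprintf(entry, need, "%s%s%s%s", PRELOAD_KEY, head, sep, old_value);
    return entry;
}

/* Frees an environment array returned by envp_with_preload(). */
void free_envp(char **envp)
{
    if (!envp) return;
    for (char **p = envp; *p; p++) free(*p);
    free(envp);
}

/* Hypothetical helper: returns a heap-allocated copy of `envp` in which
 * exactly one LD_PRELOAD entry exists and lists `lib` exactly once.
 * Assumption: when the caller passed several LD_PRELOAD entries, only the
 * first is kept. Returns NULL on allocation failure. */
char **envp_with_preload(char *const envp[], const char *lib)
{
    assert(lib != NULL && *lib != '\0');
    size_t count = 0;
    while (envp && envp[count]) count++;

    /* Room for every input entry, one new LD_PRELOAD entry, and NULL. */
    char **out = calloc(count + 2, sizeof *out);
    if (!out) return NULL;

    size_t o = 0;
    int seen = 0;
    for (size_t i = 0; i < count; i++) {
        int is_preload = strncmp(envp[i], PRELOAD_KEY, PRELOAD_KEY_LEN) == 0;
        if (is_preload && seen) continue; /* drop duplicate entries */
        if (is_preload) {
            seen = 1;
            out[o] = make_preload_entry(lib, envp[i] + PRELOAD_KEY_LEN);
        } else {
            out[o] = dup_string(envp[i]);
        }
        if (!out[o]) { free_envp(out); return NULL; }
        o++;
    }
    if (!seen) {
        out[o] = make_preload_entry(lib, "");
        if (!out[o]) { free_envp(out); return NULL; }
    }
    return out;
}
```

An interposed `execve` wrapper would pass the returned array to the real `execve` in place of the caller's envp and free it only if the exec fails. Copying the environment rather than editing it in place leaves the caller's array untouched on the failure path.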

Root Causes Addressed

  • Successful exec replaces the old image before PEAK finalization, losing in-memory profile data unless a non-destructive checkpoint is emitted first.
  • Custom child environments can omit LD_PRELOAD and PEAK_*, preventing the replacement image from being profiled.
  • Vista/NVHPC AArch64 exposed an inline raw-syscall miscompile in exec/checkpoint paths; both paths now share a safer raw syscall bridge.
  • Frontera exposed a timer ABI issue from resolving the compat timer_create symbol; the signal policy now resolves the GLIBC_2.3.3 interface.
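The first root cause, a checkpoint that must be written before exec without tearing down profiler state, can be illustrated with the minimal sketch below. Everything in it is an assumption made for illustration: `struct profile_entry`, `write_exec_checkpoint`, and the CSV layout are hypothetical and do not describe PEAK's actual data structures or output format. The snapshot is written to a temporary file and then renamed, and the in-memory entries are only read, so the process can keep profiling if the exec fails.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory profile record (not PEAK's real layout). */
struct profile_entry {
    const char *name;     /* profiled function name */
    unsigned long calls;  /* number of calls observed */
    double seconds;       /* accumulated time */
};

/* Writes a snapshot of `entries` to `path` without modifying them.
 * The data goes to "<path>.tmp" first and is renamed into place, so a
 * reader never sees a half-written checkpoint.
 * Returns 0 on success and -1 on failure. */
int write_exec_checkpoint(const char *path,
                          const struct profile_entry *entries, size_t n)
{
    assert(path != NULL);

    char tmp[4096];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "w");
    if (!f) return -1;

    int ok = fprintf(f, "function,calls,seconds\n") > 0;
    for (size_t i = 0; ok && i < n; i++)
        ok = fprintf(f, "%s,%lu,%.9f\n", entries[i].name,
                     entries[i].calls, entries[i].seconds) > 0;

    if (fclose(f) != 0) ok = 0;
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp); /* best-effort cleanup of the partial file */
        return -1;
    }
    return 0;
}
```

A rank-local variant would put the MPI rank or process ID into `path`, so that concurrent ranks never write to the same file. Note that stdio is not async-signal-safe, so an interposer that may run between `fork` and `exec` would more plausibly use raw `write` calls; stdio is used here only to keep the sketch short.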

Validation

  • Local full CTest: 346/346 passed.
  • Local static check: test_cuda_interceptor_consistency passed; git diff --check clean.
  • Vista smoke via hpc-test: job 791735, COMPLETED 0:0, exec-chain smoke passed.
  • Frontera smoke via hpc-test: job 7832267, COMPLETED 0:0.
  • Frontera MPI slice via hpc-test: exec checkpoint tests passed; one unrelated strict-detach MPI test failed, consistent with pre-existing broad-suite noise.
  • Frontera main baseline via hpc-test: job 7832231 failed 47 of 216 tests on main, which is evidence that the remaining broad-suite failures are not introduced by this branch.

Review

Three independent final review rounds completed with no blockers found.

@wangyinz wangyinz marked this pull request as ready for review June 29, 2026 14:44