Exploratory plot fix #424

Draft

asmacdo wants to merge 17 commits into con:main from asmacdo:plot-pcpu-fix

Conversation

asmacdo (Member) commented Apr 29, 2026

Summary

Fixes #399. The current con-duct plot reads totals.pcpu and totals.rss per record and draws a single line for each.
However, both totals double-count (multi-core for pcpu, shared pages for rss), producing the 5363% spike that started this issue.

This PR:

  • replaces the summed-totals plot with per-pid traces
  • introduces pdcpu, a delta-corrected pcpu computed at plot time from consecutive (etime, pcpu) pairs in usage.jsonl (see the sketch after this list)
  • decides identity (same physical pid vs kernel pid reuse) by comparing Δetime against Δwall_clock
  • clamps to None any aggregation-timing artifacts that would otherwise produce negative pdcpu on a continuous pid
  • filters the default chart and caps the legend so the result stays readable on real workloads
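
A minimal sketch of the pdcpu delta math, assuming illustrative field names (wall_clock, etime, pcpu) rather than the exact usage.jsonl schema; the 2-second reuse tolerance matches the first commit below:

```python
def pdcpu(prev, curr, reuse_tolerance=2.0):
    """Estimate %CPU spent between two consecutive samples of one pid.

    ps reports pcpu as a lifetime ratio, pcpu ~= 100 * cputime / etime.
    Inverting that recovers cputime at each sample; the delta over the
    interval is a time-point estimate of CPU use.
    """
    d_etime = curr["etime"] - prev["etime"]
    d_wall = curr["wall_clock"] - prev["wall_clock"]

    # Kernel pid reuse: if the pid's elapsed time did not advance roughly
    # in step with wall-clock time, this is a different process.
    if abs(d_etime - d_wall) > reuse_tolerance or d_etime <= 0:
        return None

    d_cputime = (curr["pcpu"] * curr["etime"] - prev["pcpu"] * prev["etime"]) / 100.0

    # Aggregation-timing artifact: pcpu is aggregated as max-across-samples
    # while etime is the last sample, so the delta can go negative even on
    # a continuous pid.  Clamp to None rather than plot a negative rate.
    if d_cputime < 0:
        return None
    return 100.0 * d_cputime / d_etime
```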

A handful of "REVERT ME" commits add temporary images/data for easier review.

Demo

See tmp-plot-demo.md (BEFORE ps-pcpu drops upper bound)
See tmp-plot-demo.md (CURRENT)

TODO

  • Get feedback on approach (Yarik / con-duct maintainers).
    Pre-merge alignment on shape, not just correctness.
  • Linux-only math. pdcpu inverts the procps identity
    pcpu = cputime/etime, which assumes ps reports a cumulative
    lifetime ratio. macOS ps reports a 5/8-decayed EWMA, so
    pcpu × etime ≠ cputime and the delta math is meaningless.
    Today's info.json does not record the host platform, so
    we can't tell at plot time whether a log came from Linux. We
    need either:
    - a platform field in info.json (sampler change, separate
    PR), with plot.py refusing or warning on non-Linux logs; or
    - some heuristic (e.g., detect that pcpu monotonically grows
    between two flat-rate samples) and warn.
    Until either lands, the plot silently produces wrong numbers on macOS logs.
  • Drop REVERT ME: TEMPORARY DEMO commit (the demo script, demo doc, log files, rendered output)
  • Break the memory-humanization binary->decimal fix in the graph out into a separate PR

asmacdo and others added 2 commits April 28, 2026 20:33
Fixes the 5363% peak in con#399.

- Previous plot summed totals.pcpu/totals.rss across pids per
  interval. Both sums double-count: multi-core for pcpu, shared
  pages for rss.
- Replace with per-pid lines. For pcpu, compute pdcpu
  (delta-corrected %CPU) at plot time from consecutive (etime,
  pcpu) pairs in usage.jsonl.
- Detect kernel pid reuse via Δetime ≈ Δwall (2s tolerance).
  Strict "etime regressed" misses cases where the new pid's etime
  crept above the old's (con#399 pid 3323259: 49s → 54s in 60s of
  wall = pid reuse).
- Clamp pdcpu < 0 to None. duct aggregates pcpu as
  max-across-samples but etime as the last sample, so a
  spike-then-idle pattern in the prior interval can push the
  cputime delta negative even on a continuous pid.
- Filter pids by "notable on either axis": peak pdcpu >= 0.5% OR
  peak rss >= 10MB. Cap legend at hybrid top-10 (top 5 by peak
  pdcpu unioned with top 5 by peak rss).
- vsz commented out by default.

Caveats:
- pcpu × etime is an upper bound on cputime under max-across-
  samples aggregation; pdcpu inherits the approximation.
- ~87% of pids in con#399's tox horde appear in only one record and
  don't get a pdcpu measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
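
For reference, an illustrative version of the "notable on either axis" filter and hybrid top-10 legend cap described in the commit message above; `peaks` is a hypothetical {pid: (peak_pdcpu, peak_rss)} mapping, not the actual plot.py data structure:

```python
def notable_pids(peaks, pdcpu_min=0.5, rss_min=10_000_000):
    # Keep a pid if it is notable on either axis: >=0.5% peak pdcpu
    # OR >=10 MB peak rss (byte base per the plot's convention).
    return {pid: v for pid, v in peaks.items()
            if v[0] >= pdcpu_min or v[1] >= rss_min}

def legend_pids(peaks, n=5):
    # Hybrid top-10: union of the top 5 by peak pdcpu and top 5 by peak rss.
    by_pdcpu = sorted(peaks, key=lambda p: peaks[p][0], reverse=True)[:n]
    by_rss = sorted(peaks, key=lambda p: peaks[p][1], reverse=True)[:n]
    return set(by_pdcpu) | set(by_rss)
```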
codecov bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.96%. Comparing base (2cd8fb9) to head (b748aae).
⚠️ Report is 6 commits behind head on main.

Files with missing lines    Patch %    Lines
src/con_duct/plot.py        81.73%     10 Missing and 11 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #424      +/-   ##
==========================================
- Coverage   91.87%   90.96%   -0.91%     
==========================================
  Files          15       15              
  Lines        1120     1240     +120     
  Branches      139      168      +29     
==========================================
+ Hits         1029     1128      +99     
- Misses         69       78       +9     
- Partials       22       34      +12     

☔ View full report in Codecov by Sentry.
asmacdo (Member, Author) commented Apr 29, 2026

Main branch renders for reference:

con#399: [image: main-branch-399]

fmriprep 1 subject: [image: main-branch-fmriprep_sub-CC0007]

asmacdo (Member, Author) commented Apr 29, 2026

Updated requirements from @yarikoptic review:

  • No per-pid color cycling, no cmd labels.
  • Per metric (pcpu orange, pmem blue): faint dotted per-pid traces + bold solid "envelope" at max-of-pids per timestamp ("minimum of total use") + dashed "envelope" at sum-of-pids ("maximum of total use" — may be unbounded for now).
  • Drop rss from default chart, drop top-N cap. Followup PRs can re-add rss/vsz as opt-in flags.
  • 3-entry linestyle legend: solid (max envelope) / dashed (sum envelope) / dotted (per-pid).

asmacdo and others added 2 commits April 29, 2026 13:17
Per Yarik's review on PR con#424, replace per-pid colored lines + top-N
legend with per-metric color + max/sum envelope layout.

- All pcpu pid lines share one color (orange); all pmem lines share
  another (blue). Reviewers don't need to identify which pid was
  busy, just that something was.
- For each metric, plot two envelopes across the kept pids:
  - max-across-pids: lower bound on total resource use, solid. "If
    some pid was at 50%, the total was at least 50%."
  - sum-across-pids: upper bound, dashed. Can blow past 100% on
    multi-core (per-pid pdcpu doesn't know about cores) and overstate
    memory (shared pages get counted multiple times in pmem); both
    caveats are accepted.
- Drop rss from the default chart. peak_rss is still used as a
  relevance signal so memory-only pids contribute to the pmem cloud
  and envelope.
- Drop the top-N hybrid cap. With faint same-color dotted per-pid
  lines, a cloud of dozens-to-hundreds of traces reads as background
  texture through which the envelopes are clearly visible. The
  peak_pdcpu >= 0.5% OR peak_rss >= 10MB filter still suppresses
  noise.
- Two legends: metric color (top-left, orange/blue) and color-
  agnostic linestyle key (top-right): upper bound / lower bound /
  per-pid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
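
A rough sketch of the per-metric envelope construction the commit describes; `per_pid_series` is a hypothetical {pid: [(timestamp, value), ...]} mapping rather than the real plot.py series object:

```python
from collections import defaultdict

def envelopes(per_pid_series):
    """Build the two per-metric envelopes from {pid: [(t, value), ...]}.

    max-across-pids (solid): lower bound on total use -- if some pid was
    at 50%, the total was at least 50%.
    sum-across-pids (dashed): upper bound -- may exceed 100% on multi-core
    and double-count shared pages for memory; both caveats accepted.
    """
    by_t = defaultdict(list)
    for points in per_pid_series.values():
        for t, value in points:
            if value is not None:
                by_t[t].append(value)
    ts = sorted(by_t)
    lower = [(t, max(by_t[t])) for t in ts]
    upper = [(t, sum(by_t[t])) for t in ts]
    return lower, upper
```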
asmacdo (Member, Author) commented Apr 29, 2026

I think we should consider adding a separate scale 0-100 for pmem on the right side, since pcpu is also percent but can go many times higher for multicore, which effectively squashes pmem. The issue with giving the second axis to pmem is that it would replace rss as the second y-axis.

We could

  • add an option to switch between pmem/rss
  • add a second plot for rss

wdyt?

yarikoptic added the semver-patch label (Increment the patch version when merged) on Apr 30, 2026
yarikoptic (Member) commented:
previews look much better! what about an alternative I believe I alluded to in the chat -- by default plot rss (not %mem) and state total physical RAM in the legend (if we know it)? Then it would be the right "Memory" axis with absolute sizes, and we would be all set -- and I feel that for memory an absolute value is good.

asmacdo and others added 5 commits April 30, 2026 09:55
Per Yarik's second-pass review on PR con#424, replace pmem on the shared
y-axis with rss (absolute bytes) on a secondary y-axis.

Reasoning: under SLURM, pmem = rss / host_total, where host_total is
the whole node's physical memory rather than the cgroup's allocated
memory. A job using 100% of a 4GB request on a 256GB host therefore
shows ~1.5% pmem, which the shared y-axis squashes to invisibility.
Absolute rss avoids this entirely.

- pcpu (orange): primary y-axis (%), unchanged.
- rss (blue): secondary y-axis (bytes), formatted via the existing
  HumanizedAxisFormatter + _MEMORY_UNITS.
- Legend label is best-effort: if info.json is available -- passed
  directly, or as a sibling to the usage file via {prefix}info.json
  -- read system.memory_total and render "rss (host: X.XTB)".
  Otherwise fall back to plain "rss". Plot CLI still accepts a bare
  usage file; info.json remains optional.
- Filter unchanged: peak_pdcpu >= 0.5% OR peak_rss >= 10MB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
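
A best-effort sketch of the sibling info.json lookup and legend label described above; the path derivation, helper name, and `humanize` argument (standing in for whatever byte formatter the plot uses) are illustrative assumptions:

```python
import json
from pathlib import Path

def rss_legend_label(usage_path, info_path=None, humanize=str):
    # Best effort: if no info.json was given, look for a sibling derived
    # from the usage path ({prefix}usage.jsonl -> {prefix}info.json).
    if info_path is None:
        sibling = Path(str(usage_path).replace("usage.jsonl", "info.json"))
        info_path = sibling if sibling.exists() else None
    if info_path is None:
        return "rss"
    try:
        info = json.loads(Path(info_path).read_text())
        total = info["system"]["memory_total"]
    except (OSError, KeyError, ValueError):
        return "rss"
    return f"rss (host: {humanize(total)})"
```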
Plot's _MEMORY_UNITS used 1024-base divisors with "KB/MB/GB/TB"
suffixes -- correct as KiB/MiB/GiB/TiB per IEC 80000-13, but the
suffixes claimed decimal. SummaryFormatter.naturalsize used a
parallel FILESIZE_SUFFIXES tuple at base 1000 with "kB/MB/GB"
prefixes. The data table was duplicated, the conventions disagreed,
and the plot axis lied about its base.

- Add module-level FILESIZE_UNITS in _formatter.py: single source of
  truth for byte-suffix data. Base 1000, suffixes B/kB/MB/GB/TB/PB/
  EB/ZB/YB.
- Refactor naturalsize to walk FILESIZE_UNITS instead of the local
  FILESIZE_SUFFIXES tuple. Drop FILESIZE_SUFFIXES. Behavior preserved
  (covered by test_summary_formatter_S_sizes).
- Fix the broken "%.3f" docstring example: actual output is
  "3.000 kB", not the "2.930 kB" left over from a 1024-base era.
- plot.py imports FILESIZE_UNITS directly (no alias), drops local
  _MEMORY_UNITS.
- Drop _format_bytes_compact (added in the previous commit) and
  route the rss legend label through SummaryFormatter().naturalsize
  for the same reason: avoid keeping a third byte-format helper.
- test_plot._MEMORY_UNITS parametrize cases switch to 1000-multiples
  with kB/MB/GB suffixes; drops the test_format_bytes_compact case.

Plot axis tick numbers shift slightly (e.g. "1.5MB" was 1.5 * 1024**2
bytes; same physical byte count now displays as "1.6 MB" since the
divisor is smaller). The displayed *meaning* is now correct.

Note: this commit is separable -- it can be cherry-removed and
shipped as its own PR for a distinct changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
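
A minimal, standalone sketch of the base-1000 convention this commit consolidates; the real single source of truth is FILESIZE_UNITS in _formatter.py, and the table here is shortened (the real one runs to YB):

```python
# Decimal (SI) byte units: each step is 1000x, suffixes B, kB, MB, ...
FILESIZE_UNITS = [
    ("B", 1), ("kB", 1000), ("MB", 1000**2),
    ("GB", 1000**3), ("TB", 1000**4), ("PB", 1000**5),
]

def naturalsize(nbytes, fmt="%.1f"):
    """Humanize a byte count with base-1000 divisors and SI suffixes."""
    suffix, divisor = FILESIZE_UNITS[-1]
    for s, d in FILESIZE_UNITS:
        if nbytes < d * 1000:
            suffix, divisor = s, d
            break
    return f"{fmt % (nbytes / divisor)} {suffix}"

# e.g. naturalsize(3000, "%.3f") -> "3.000 kB"; a byte count a 1024-base
# table rendered as "1.5MB" (1.5 * 1024**2 bytes) now shows as "1.6 MB".
```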
asmacdo (Member, Author) commented Apr 30, 2026

Swapped pmem out for rss: done.
Adding host memory to the legend: done. (Note: if the info.json is available we use it, otherwise we don't include it. The demo includes one of each.)

Note: I fixed a bug caused by duplication. The summary formatter used decimal humanization; the graph used binary but still used MB/GB instead of MiB/GiB. (Fixed by removing the duplication; we now report decimal everywhere.) I don't think this bugfix belongs in this PR, but it is worth including while in draft so we can review the plot changes together. I've noted removing it as a TODO in the description.

asmacdo and others added 2 commits May 4, 2026 12:31
Previously dropped pids whose peak pdcpu was <0.5% AND peak rss <10MB,
which rendered an empty plot for tiny/idle workloads (e.g. the gallery
sleep-loop sample). The per-pid lines are now dotted/faint and the
envelopes carry the signal, so the noise floor doesn't need trimming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rss "upper bound" envelope was summing each pid's per-interval peak
rss. When a workload spawns or thrashes processes within a report
interval, the per-pid peaks fall in different samples and never coexist,
so summing them invents memory pressure that wasn't there. On bursty
workloads (e.g. mriqc) this padded the line by ~2-3 GB on top of the
true measured concurrent peak.

duct already records the measured peak concurrent rss per report at
sample-time as totals.rss (max-of-sum-per-sample, see _models.py). Read
that directly for the upper-bound line. Within "observed samples only"
framing it's a true upper bound on observed concurrent rss with no
phantom coexistence.

pcpu's upper bound (sum-of-pdcpu) is unchanged. totals.pcpu is unusable
because it carries forward the lifetime-ratio masking that motivated
con#399, and there's no unified etime to delta-invert it against.
The pcpu/rss upper-bound semantics now genuinely diverge, so the shared
envelope loop is split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
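
A sketch of the two rss lines after this change, with illustrative record field names (timestamp, processes, totals are assumed keys of a usage.jsonl record, not a confirmed schema):

```python
def rss_lines(records):
    """Illustrative: build the rss upper/lower bound series per record.

    Upper bound (dashed): totals.rss as recorded by duct at sample time,
    i.e. the measured peak *concurrent* rss within the report interval.
    Lower bound (solid): max across the per-pid rss values in the record.
    Summing per-pid peaks is avoided because peaks from different samples
    within an interval never coexist and would invent memory pressure.
    """
    upper, lower = [], []
    for rec in records:
        t = rec["timestamp"]
        upper.append((t, rec["totals"]["rss"]))
        per_pid = [proc["rss"] for proc in rec["processes"].values()]
        lower.append((t, max(per_pid, default=0)))
    return upper, lower
```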
yarikoptic (Member) commented May 5, 2026

I slept on it: I think we should make it an explicit option that allows switching between plotting the "original" pcpu, which is cumulative (and behaves differently across OSes, e.g. on OSX), vs our "time-point estimate". So something like --cpu=ps-pcpu,ps-cpu-timepoint: if ps-cpu-timepoint, we do our conversion estimate; otherwise we just plot the original value (no transformation). When plotting, adjust the label to pcpu (time-point) to accent that it is our estimate. The rest stays exactly the same. WDYT?

edit: and in the gallery we will do both plots side-by-side to visualize the divergences.

asmacdo (Member, Author) commented May 5, 2026

@yarikoptic I agree. That interface seems reasonable. I'm trying to imagine how this would play out in the future when we may have multiple samplers (regular ps, calculated ps pdcpu, /proc, psutil, whatever). The info.json could capture the sampler type(s), so I think this would play nicely with that: default to the one that was collected, or force a selection if more than one sampling method was used.

asmacdo added 5 commits May 6, 2026 09:15
Pure rename, no behavior change. Prepares for an upcoming --cpu
mode flag where the same series can hold either raw ps pcpu or the
delta-corrected pdcpu.
Splits the per-pid cpu series into two paths in _build_pid_series:

- ps-pcpu (default): raw lifetime ratio from ps -o pcpu, no
  transformation. All records contribute; no entry is dropped.
- ps-cpu-timepoint: existing delta-corrected pdcpu pipeline,
  unchanged.

Y-axis label and color-legend swatch reflect the chosen mode.

Per Yarik's review on PR con#424: lossless raw cpu by default,
explicit opt-in to our derived time-point estimate. Future
samplers (psutil, /proc, ...) can extend the choices list.
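
A hedged sketch of how the --cpu mode flag could be wired with argparse; the option names follow the commit text, but the parser plumbing and label handling are assumptions, not the actual plot.py code:

```python
import argparse

CPU_MODES = ("ps-pcpu", "ps-cpu-timepoint")

parser = argparse.ArgumentParser(prog="con-duct plot")
parser.add_argument(
    "--cpu",
    choices=CPU_MODES,
    default="ps-pcpu",  # lossless raw ps value by default
    help="ps-pcpu: raw lifetime ratio from `ps -o pcpu`; "
         "ps-cpu-timepoint: delta-corrected per-interval estimate (pdcpu)",
)
args = parser.parse_args(["--cpu", "ps-cpu-timepoint"])

# The y-axis label reflects the chosen mode, accenting that the time-point
# series is our estimate rather than a raw ps reading.
cpu_label = "pcpu (time-point)" if args.cpu == "ps-cpu-timepoint" else "pcpu"
```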
…imepoint) layout

Adds --cpu ps-pcpu and --cpu ps-cpu-timepoint columns alongside the
original main rendering for both con#399 and the fmriprep dataset.
Drops the now-superseded _after.png files. tmp-plot-demo.md updated
with the new 3-column tables and matching reproduce commands.

The ps-pcpu column intentionally shows the loose phantom-coexistence
upper bound (~11k% on con#399 record 28 vs main's 5363%); we'll use
this as the visual case for dropping the sum envelope in ps-pcpu
mode in a follow-up commit.
In ps-pcpu mode the per-pid value is ps's cumulative lifetime ratio,
which procps is documented to compute inaccurately for short-lived
multi-threaded processes (mutually inconsistent reads of utime,
stime, starttime, and uptime; no atomic snapshot). Single-pid
readings can exceed physical maxima (e.g. pcpu=5347 on a 20-core
host in con#399). Summing those across pids -- already inflated by
phantom coexistence on top -- produces an upper-bound line that's
neither a faithful reading of ps nor a meaningful bound. The
per-pid trace cloud and max-across-pids lower bound carry the
visual signal; rss's totals.rss-based dashed upper bound is
unaffected.

In ps-cpu-timepoint mode the sum envelope is unchanged (per-pid
pdcpu is delta-corrected and the negative-pdcpu clamp filters out
the worst aggregation-timing artifacts).
Both 00_399_pspcpu.png and 00_fmriprep_pspcpu.png re-rendered
against the new plot.py; the dashed cpu upper bound is gone in
the ps-pcpu column. Lower bound, per-pid cloud, and rss envelopes
unchanged. tmp-plot-demo.md image paths stable so no doc edits.