feat: surface PHP fatals, disk pressure, and Action Scheduler bloat in wake briefing#2800

Merged: chubes4 merged 1 commit into main from wake-infra-signals on Jun 26, 2026

@chubes4 chubes4 commented Jun 26, 2026

Summary

WakeBriefingTask only scanned three DM-database sources — failed datamachine_jobs, processing datamachine_jobs, and datamachine_logs at level = ERROR. That made it structurally blind to the three infrastructure-layer signals that mattered most in the recent events.extrachill.com incident: a claim_actions PHP fatal storm (fatals never land in datamachine_logs), a 94%-full disk, and a 28M-row / 23GB actionscheduler_actions runaway. The operator caught it via an external fatal alarm, not WAKE.

This wires three new threshold-gated gatherers into gatherSiteSignals() alongside the existing three. Each keeps the ruthless-terseness contract: ONE grouped markdown line, only when it crosses a bar, '' otherwise — exactly like getStuckJobs() / getGroupedErrors().

Signals added

  1. PHP fatals from debug.log (getPhpFatals()) — the highest-value add. Reads a bounded tail (5 MiB) of the live debug.log, folds multi-line stack traces into one entry, filters to the rolling window by parsed [DD-Mon-YYYY HH:MM:SS UTC] timestamp, normalizes each fatal/parse-error (warnings/notices excluded) to a stable signature (file basename + numeric-stripped message), groups, and emits e.g. ⚠ 3 PHP fatal(s) in debug.log, top: "foo.php: Uncaught Error: Call to a member function claim_actions() on null" ×2. This mirrors the rotation-safe reading approach behind wp extrachill analytics errors --severity=fatal (extrachill-analytics/inc/core/php-error-log.php) — reimplemented self-contained inside the task rather than reaching across a plugin boundary, so data-machine keeps no dependency on extrachill-analytics.

  2. Disk pressure (getDiskPressure()) — disk_free_space() / disk_total_space() on ABSPATH (falls back to WP_CONTENT_DIR, then /). Surfaces when < 15% free OR < 20GB free (whichever triggers). Example: ⚠ Disk 94% full (7.9GB free).

  3. Action Scheduler bloat (getActionSchedulerBloat()) — one information_schema.TABLES read of table_rows + (data_length + index_length) for {$wpdb->prefix}actionscheduler_actions and …_logs. Surfaces when > 1,000,000 rows OR > 2GB on either table. Example: ⚠ Action Scheduler bloat: wp_actionscheduler_actions 28.1M rows / 23.9GB.
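
The fold, filter, normalize, and group steps described for getPhpFatals() can be sketched roughly as below. This is a minimal illustration, not the PR's code: the function names (sketch_fold_entries, sketch_signature, sketch_fatal_line) and the regular expressions are assumptions, and the bounded tail read is left out so that the input is just a string.

```php
<?php
// Illustrative sketch only: names and regexes are assumptions, not the PR's code.

/** Fold stack-trace continuation lines into the timestamped entry that opened them. */
function sketch_fold_entries(string $tail): array {
    $entries = [];
    foreach (preg_split('/\R/', $tail) as $line) {
        if (preg_match('/^\[\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2} UTC\]/', $line)) {
            $entries[] = $line;                              // a new log entry starts here
        } elseif ($entries) {
            $entries[count($entries) - 1] .= "\n" . $line;   // continuation of the last entry
        }
    }
    return $entries;
}

/** Return "basename: message" for an in-window fatal/parse error, else null. */
function sketch_signature(string $entry, int $since): ?string {
    $first = explode("\n", $entry, 2)[0];
    $re = '/^\[([^\]]+) UTC\] PHP (?:Fatal|Parse) error:\s+(.*)$/';
    if (!preg_match($re, $first, $m)) {
        return null;                                          // warning, notice, or noise
    }
    $when = DateTimeImmutable::createFromFormat('d-M-Y H:i:s', $m[1], new DateTimeZone('UTC'));
    if ($when === false || $when->getTimestamp() < $since) {
        return null;                                          // unparseable or outside the window
    }
    $message = $m[2];
    $file = '';
    if (preg_match('/^(.*) in (\S+?)(?::\d+| on line \d+)$/', $message, $p)) {
        $message = $p[1];
        $file = basename($p[2]) . ': ';
    }
    // Strip digits so the same fault with different ids or line numbers groups together.
    return $file . trim(preg_replace('/\d+/', '', $message));
}

/** One grouped line, or '' when no fatal falls inside the window. */
function sketch_fatal_line(string $tail, int $since): string {
    $signatures = [];
    foreach (sketch_fold_entries($tail) as $entry) {
        $sig = sketch_signature($entry, $since);
        if ($sig !== null) {
            $signatures[] = $sig;
        }
    }
    if (!$signatures) {
        return '';
    }
    $counts = array_count_values($signatures);
    arsort($counts);                                          // most frequent signature first
    $top = array_key_first($counts);
    return sprintf('⚠ %d PHP fatal(s) in debug.log, top: "%s" ×%d', count($signatures), $top, $counts[$top]);
}
```

Only the first line of each folded entry is used for the signature, so a long stack trace does not change how the fatal is grouped.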

debug.log path resolution

resolveDebugLogPath() is robust to WP_DEBUG_LOG being a bool or a path string: if it's a non-empty string, use it; else honor a real-file error_log ini target (skipping syslog/stream wrappers); else fall back to WP_CONTENT_DIR/debug.log. If the resolved log is unreadable/absent, getPhpFatals() returns '' — never fatals the task.
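
The resolution order above might look something like the following. It is a hedged sketch: the real method presumably reads WP_DEBUG_LOG, ini_get('error_log'), and WP_CONTENT_DIR directly, whereas this hypothetical sketch_resolve_debug_log_path() takes them as parameters so it can run outside WordPress. The '://' check for stream wrappers is also an assumption.

```php
<?php
// Hypothetical standalone version; the inputs stand in for WP_DEBUG_LOG,
// ini_get('error_log'), and WP_CONTENT_DIR.
function sketch_resolve_debug_log_path($wp_debug_log, string $error_log_ini, string $content_dir): string {
    // 1. WP_DEBUG_LOG set to an explicit path string.
    if (is_string($wp_debug_log) && $wp_debug_log !== '') {
        return $wp_debug_log;
    }
    // 2. A real-file error_log target; skip "syslog" and stream wrappers such as php://stderr.
    $ini = trim($error_log_ini);
    if ($ini !== '' && strtolower($ini) !== 'syslog' && strpos($ini, '://') === false) {
        return $ini;
    }
    // 3. WordPress default location.
    return rtrim($content_dir, '/') . '/debug.log';
}
```

The caller would still need to check is_readable() on the result and return '' when the check fails, which is the fail-soft behavior described above.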

Thresholds & filters

| Signal | Default | Filters |
| --- | --- | --- |
| Disk | < 15% free OR < 20GB free | datamachine_wake_briefing_disk_min_free_pct, datamachine_wake_briefing_disk_min_free_bytes |
| AS bloat | > 1,000,000 rows OR > 2GB (per table) | datamachine_wake_briefing_as_max_rows, datamachine_wake_briefing_as_max_bytes |
| Fatal tail scan | 5 MiB tail | (constant FATAL_TAIL_BYTES) |
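
As a rough illustration of how the two disk thresholds combine, here is a pure-function sketch. The name sketch_disk_line() is hypothetical, thresholds are passed as arguments rather than read through apply_filters(), and treating "GB" as 1024³ bytes is an assumption the PR does not confirm.

```php
<?php
// Hypothetical pure helper; in the task, $free and $total would come from
// disk_free_space() / disk_total_space(), which return false on failure.
function sketch_disk_line(
    float $free,
    float $total,
    float $min_free_pct = 15.0,
    float $min_free_bytes = 20 * 1024 ** 3   // assumption: GB means 1024^3 bytes
): string {
    if ($total <= 0 || $free < 0) {
        return '';                            // unprobable filesystem: fail soft
    }
    $free_pct = $free / $total * 100;
    // Healthy only when BOTH bars are cleared; crossing either one surfaces the line.
    if ($free_pct >= $min_free_pct && $free >= $min_free_bytes) {
        return '';
    }
    return sprintf('⚠ Disk %d%% full (%.1fGB free)', round(100 - $free_pct), $free / 1024 ** 3);
}
```

With these defaults, a 100GB volume with 18GB free still trips the check: it clears the percentage bar (18% free) but not the absolute one.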

All gatherers are fail-soft: any unreadable log, unprobable filesystem, or empty information_schema read returns '' and never throws.
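
For the Action Scheduler gatherer, the fail-soft shape could be sketched as follows. The helper names and the row layout (table_name, table_rows, bytes) are assumptions for illustration. In the task, the rows would come from the single information_schema.TABLES read, and an empty result simply produces no output.

```php
<?php
// Hypothetical helpers; $rows stands in for the information_schema.TABLES result.
function sketch_humanize_count(float $n): string {
    if ($n >= 1e6) {
        return sprintf('%.1fM', $n / 1e6);
    }
    if ($n >= 1e3) {
        return sprintf('%.1fK', $n / 1e3);
    }
    return (string) (int) $n;
}

function sketch_as_bloat_line(array $rows, float $max_rows = 1000000, float $max_bytes = 2 * 1024 ** 3): string {
    $parts = [];
    foreach ($rows as $row) {
        $count = (float) ($row['table_rows'] ?? 0);
        $bytes = (float) ($row['bytes'] ?? 0);        // data_length + index_length
        // Either ceiling alone is enough to name the table.
        if ($count > $max_rows || $bytes > $max_bytes) {
            $parts[] = sprintf(
                '%s %s rows / %.1fGB',
                $row['table_name'] ?? '?',
                sketch_humanize_count($count),
                $bytes / 1024 ** 3
            );
        }
    }
    // No rows, or none over the bar: stay silent.
    return $parts ? '⚠ Action Scheduler bloat: ' . implode(', ', $parts) : '';
}
```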

Per-site vs network disk handling

gatherSiteSignals() runs once per blog under the switch_to_blog loop in gatherNetworkSignals(). AS bloat and fatals are genuinely per-site (per-prefix tables, per-site debug.log semantics) and repeat correctly. Disk is host-global, so a naive per-blog emit would repeat the same warning on every site. I added a $disk_emitted instance guard so the disk line is emitted once per run (on the first site that crosses the bar) rather than deduping it into a separate network-only line — this keeps gatherSiteSignals() self-contained and correct in both site and network scope without special-casing the network path.
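
A miniature of the once-per-run guard, under stated assumptions: the SketchBriefing class and the injected $disk_probe callable are hypothetical and exist only to make the behavior runnable without WordPress. The instance flag mirrors the $disk_emitted guard described above.

```php
<?php
// Hypothetical miniature; only the instance-flag pattern reflects the PR description.
final class SketchBriefing {
    private bool $disk_emitted = false;

    /**
     * @param callable $disk_probe Returns the disk line, or '' when healthy.
     * @return string[] Signal lines for this site.
     */
    public function gatherSiteSignals(int $blog_id, callable $disk_probe): array {
        $lines = [];
        if (!$this->disk_emitted) {
            $line = $disk_probe();
            if ($line !== '') {
                $lines[] = $line;
                $this->disk_emitted = true;   // host-global: later sites skip the disk line
            }
        }
        // Per-site signals (fatals, AS bloat, jobs, errors) would be appended here.
        return $lines;
    }
}
```

Because the flag lives on the instance, a single-site run and a network run go through the same code path, and a fresh task instance starts with the guard cleared.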

Tests

New dependency-free smoke test tests/wake-briefing-infra-signals-smoke.php (same convention as retention-action-scheduler-batching-smoke.php) drives all three private gatherers via reflection:

  • Fatals: real temp debug.log with in-window grouped fatals, an out-of-window fatal, a multi-line trace, and a warning — asserts grouped count (3), top signature (×2 claim_actions), warning excluded, empty when window excludes all, and fail-soft empty when the log is absent.
  • Disk: filter-driven thresholds against the real filesystem — asserts a line when forced over the bar and empty when forced healthy.
  • AS bloat: fake $wpdb information_schema read — asserts the 28.1M/23.9GB actions table is named, the within-bounds logs table is not, humanized formatting, empty when both within bounds, the row ceiling trips independently of bytes via filter, and fail-soft empty on a missing table.
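
Driving a private gatherer via reflection, as the smoke test is described as doing, generally follows the pattern below. SketchTask and sketch_call_private() are hypothetical stand-ins. The actual test file may be structured differently.

```php
<?php
// Hypothetical stand-in for the task class; only the reflection pattern is the point.
final class SketchTask {
    private function formatCount(int $n): string {
        return $n . ' fatal(s)';
    }
}

/** Invoke a private or protected method by name, without a test framework. */
function sketch_call_private(object $target, string $method, array $args = []) {
    $ref = new ReflectionMethod($target, $method);
    if (PHP_VERSION_ID < 80100) {
        $ref->setAccessible(true);   // only needed before PHP 8.1
    }
    return $ref->invokeArgs($target, $args);
}
```

A call such as sketch_call_private(new SketchTask(), 'formatCount', [3]) then exercises the private method directly, which keeps the gatherers private in production code while still letting a plain PHP script assert on their output.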

Verification

  • php -l on both changed files: no syntax errors.
  • php -d error_reporting=0 vendor/bin/phpcs inc/Engine/AI/System/Tasks/WakeBriefingTask.php tests/wake-briefing-infra-signals-smoke.php: exit 0.
  • php tests/wake-briefing-infra-signals-smoke.php: 13/13 assertions pass.
  • php tests/retention-action-scheduler-batching-smoke.php: still passes (no regression).

Closes #2799

…n wake briefing

WakeBriefingTask only scanned three DM-database sources (failed jobs,
processing jobs, ERROR-level datamachine_logs), leaving it structurally
blind to the infrastructure-layer signals that mattered most in a recent
incident: a claim_actions PHP fatal storm (fatals never land in
datamachine_logs), a 94%-full disk, and a 28M-row/23GB
actionscheduler_actions runaway.

Adds three threshold-gated, fail-soft gatherers wired into
gatherSiteSignals() alongside the existing three, each emitting one
grouped terse line only when over its bar:

- getPhpFatals(): rotation-safe bounded tail read of debug.log, grouped
  by normalized signature, fatals/parse-errors only.
- getDiskPressure(): disk_free/total_space on ABSPATH, < 15% or < 20GB
  free, emitted once per run (host-global).
- getActionSchedulerBloat(): information_schema row/byte read for the
  actions + logs tables, > 1M rows or > 2GB.

All thresholds filterable; all gatherers return '' on any error.

Closes #2799
@chubes4 chubes4 merged commit eb8b02b into main Jun 26, 2026
2 checks passed
@chubes4 chubes4 deleted the wake-infra-signals branch June 26, 2026 18:06

Development

Successfully merging this pull request may close these issues.

Wake briefing is blind to the signals that mattered most: PHP fatals (debug.log), disk pressure, and Action Scheduler table bloat