feat: surface PHP fatals, disk pressure, and Action Scheduler bloat in wake briefing #2800
Merged
…n wake briefing

WakeBriefingTask only scanned three DM-database sources (failed jobs, processing jobs, ERROR-level datamachine_logs), leaving it structurally blind to the infrastructure-layer signals that mattered most in a recent incident: a claim_actions PHP fatal storm (fatals never land in datamachine_logs), a 94%-full disk, and a 28M-row/23GB actionscheduler_actions runaway.

Adds three threshold-gated, fail-soft gatherers wired into gatherSiteSignals() alongside the existing three, each emitting one grouped terse line only when over its bar:

- getPhpFatals(): rotation-safe bounded tail read of debug.log, grouped by normalized signature, fatals/parse-errors only.
- getDiskPressure(): disk_free/total_space on ABSPATH, < 15% or < 20GB free, emitted once per run (host-global).
- getActionSchedulerBloat(): information_schema row/byte read for the actions + logs tables, > 1M rows or > 2GB.

All thresholds filterable; all gatherers return '' on any error.

Closes #2799
Summary
WakeBriefingTask only scanned three DM-database sources: failed datamachine_jobs, processing datamachine_jobs, and datamachine_logs at level = ERROR. That made it structurally blind to the three infrastructure-layer signals that mattered most in the recent events.extrachill.com incident: a claim_actions PHP fatal storm (fatals never land in datamachine_logs), a 94%-full disk, and a 28M-row / 23GB actionscheduler_actions runaway. The operator caught it via an external fatal alarm, not WAKE.

This wires three new threshold-gated gatherers into gatherSiteSignals() alongside the existing three. Each keeps the ruthless-terseness contract: ONE grouped markdown line, only when it crosses a bar, '' otherwise, exactly like getStuckJobs() / getGroupedErrors().

Signals added
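As a rough illustration of the normalize-and-group step behind the PHP fatals signal in this section, the sketch below reduces each log entry to a "file basename + numeric-stripped message" signature and emits one grouped line. The function names, the regexes, and the choice to replace digit runs with "N" are assumptions made for the example, not the code in WakeBriefingTask.

```php
<?php
// Hypothetical helpers, not the PR's code: names and regexes are assumptions.

/** Returns a stable signature for a fatal/parse-error entry, or null for anything else. */
function sketch_fatal_signature(string $entry): ?string
{
    // Only fatals and parse errors qualify; warnings and notices are skipped.
    if (!preg_match('/PHP (?:Fatal|Parse) error:\s+(.*)$/s', $entry, $m)) {
        return null;
    }
    // Use the first line only, so a folded stack trace does not change the signature.
    $first = trim(explode("\n", $m[1], 2)[0]);

    // Split "<message> in <file>:<line>" or "<message> in <file> on line <n>".
    if (preg_match('/^(.*?) in (\S+?)(?::\d+| on line \d+)$/', $first, $p)) {
        $message = $p[1];
        $file    = basename($p[2]);
    } else {
        $message = $first;
        $file    = '';
    }
    // Assumption: digit runs become "N" so varying sizes and ids group together.
    $message = preg_replace('/\d+/', 'N', $message);

    return $file === '' ? $message : $file . ': ' . $message;
}

/** Groups entries by signature and returns one terse line, or '' when nothing qualifies. */
function sketch_group_fatals(array $entries): string
{
    $counts = [];
    foreach ($entries as $entry) {
        $signature = sketch_fatal_signature($entry);
        if ($signature !== null) {
            $counts[$signature] = ($counts[$signature] ?? 0) + 1;
        }
    }
    if ($counts === []) {
        return '';
    }
    arsort($counts); // most frequent signature first
    $top = array_key_first($counts);

    return sprintf(
        '⚠ %d PHP fatal(s) in debug.log, top: "%s" ×%d',
        array_sum($counts),
        $top,
        $counts[$top]
    );
}
```

Grouping on a signature rather than the raw message is what lets two fatals thrown from different line numbers of the same file collapse into a single ×2 entry.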
- PHP fatals from debug.log (getPhpFatals()): the highest-value add. Reads a bounded tail (5 MiB) of the live debug.log, folds multi-line stack traces into one entry, filters to the rolling window by parsed [DD-Mon-YYYY HH:MM:SS UTC] timestamp, normalizes each fatal/parse-error (warnings/notices excluded) to a stable signature (file basename + numeric-stripped message), groups, and emits e.g. ⚠ 3 PHP fatal(s) in debug.log, top: "foo.php: Uncaught Error: Call to a member function claim_actions() on null" ×2. This mirrors the rotation-safe reading approach behind wp extrachill analytics errors --severity=fatal (extrachill-analytics/inc/core/php-error-log.php), reimplemented self-contained inside the task rather than reaching across a plugin boundary, so data-machine keeps no dependency on extrachill-analytics.
- Disk pressure (getDiskPressure()): disk_free_space() / disk_total_space() on ABSPATH (falls back to WP_CONTENT_DIR, then /). Surfaces when < 15% free OR < 20GB free (whichever triggers). Example: ⚠ Disk 94% full (7.9GB free).
- Action Scheduler bloat (getActionSchedulerBloat()): one information_schema.TABLES read of table_rows + (data_length + index_length) for {$wpdb->prefix}actionscheduler_actions and …_logs. Surfaces when > 1,000,000 rows OR > 2GB on either table. Example: ⚠ Action Scheduler bloat: wp_actionscheduler_actions 28.1M rows / 23.9GB.

debug.log path resolution
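A minimal sketch of the three-step resolution order this section describes may help. It takes its inputs as parameters so it runs without WordPress; the function name is hypothetical, and the real resolveDebugLogPath() presumably reads WP_DEBUG_LOG, ini_get('error_log'), and WP_CONTENT_DIR itself. The sketch also skips any check that the ini target is an existing file, which the PR's "real-file" wording suggests the actual method performs.

```php
<?php
// Hypothetical stand-in for resolveDebugLogPath(); not the PR's implementation.

/**
 * @param mixed  $wpDebugLog  Value of WP_DEBUG_LOG: bool or a path string.
 * @param string $iniErrorLog Value of the error_log ini setting ('' when unset).
 * @param string $contentDir  Value of WP_CONTENT_DIR.
 */
function sketch_resolve_debug_log_path($wpDebugLog, string $iniErrorLog, string $contentDir): string
{
    // 1. A non-empty path string in WP_DEBUG_LOG wins.
    if (is_string($wpDebugLog) && trim($wpDebugLog) !== '') {
        return $wpDebugLog;
    }

    // 2. Otherwise honor the error_log ini target, unless it is syslog
    //    or a stream wrapper such as php://stderr.
    $ini = trim($iniErrorLog);
    if ($ini !== '' && strtolower($ini) !== 'syslog' && strpos($ini, '://') === false) {
        return $ini;
    }

    // 3. Fall back to the WordPress default location.
    return rtrim($contentDir, '/') . '/debug.log';
}
```

Passing the three inputs in, rather than reading constants inside the function, is only a convenience for the example: it keeps each branch testable in isolation.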
resolveDebugLogPath() is robust to WP_DEBUG_LOG being a bool or a path string: if it's a non-empty string, use it; else honor a real-file error_log ini target (skipping syslog / stream wrappers); else fall back to WP_CONTENT_DIR/debug.log. If the resolved log is unreadable/absent, getPhpFatals() returns '' and never fatals the task.

Thresholds & filters
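To make the threshold gating concrete, here is a minimal sketch of the disk check with its two bars passed in as parameters. The defaults (15% free, 20GB free) come from the PR description; the function name is hypothetical, and treating 1GB as 1024³ bytes is an assumption. In WordPress the two bars would presumably be read through apply_filters() with the datamachine_wake_briefing_disk_min_free_pct and datamachine_wake_briefing_disk_min_free_bytes filters named in this section.

```php
<?php
// Hypothetical sketch of a threshold-gated, fail-soft disk check; not the PR's code.

const SKETCH_GIB = 1073741824; // Assumption: "GB" in the briefing means 1024^3 bytes.

function sketch_disk_pressure_line(
    float $freeBytes,
    float $totalBytes,
    float $minFreePct = 15.0,
    float $minFreeBytes = 20.0 * SKETCH_GIB
): string {
    // Fail-soft: an unprobable filesystem yields no line instead of an error.
    if ($totalBytes <= 0 || $freeBytes < 0) {
        return '';
    }

    $freePct = $freeBytes / $totalBytes * 100;

    // Stay silent unless at least one bar is crossed.
    if ($freePct >= $minFreePct && $freeBytes >= $minFreeBytes) {
        return '';
    }

    return sprintf(
        '⚠ Disk %d%% full (%.1fGB free)',
        (int) round(100 - $freePct),
        $freeBytes / SKETCH_GIB
    );
}
```

Because the two bars are combined with OR, a small volume can trip the byte bar while still being half empty, and a very large volume can trip the percentage bar with plenty of absolute space left.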
- datamachine_wake_briefing_disk_min_free_pct, datamachine_wake_briefing_disk_min_free_bytes
- datamachine_wake_briefing_as_max_rows, datamachine_wake_briefing_as_max_bytes
- … FATAL_TAIL_BYTES)

All gatherers are fail-soft: any unreadable log, unprobable filesystem, or empty information_schema read returns '' and never throws.

Per-site vs network disk handling
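The once-per-run guard discussed in this section can be sketched as follows. The class, the callable parameter, and the method body are hypothetical stand-ins; only the $disk_emitted flag name and the "emit on the first site that crosses the bar" behavior are taken from the PR description.

```php
<?php
// Hypothetical sketch of an instance-level emit-once guard; not the PR's code.

final class SketchBriefing
{
    /** @var bool True once the host-global disk line has been emitted in this run. */
    private $disk_emitted = false;

    /**
     * @param callable $probeDisk Returns the disk warning line, or '' when under the bar.
     * @return string[] Signal lines for one site.
     */
    public function gatherSiteSignals(callable $probeDisk): array
    {
        $lines = [];

        // Disk is host-global: probe and emit it at most once per task instance.
        if (!$this->disk_emitted) {
            $line = $probeDisk();
            if ($line !== '') {
                $lines[]            = $line;
                $this->disk_emitted = true;
            }
        }

        // Genuinely per-site signals (fatals, Action Scheduler bloat) would be
        // appended here on every call.

        return $lines;
    }
}
```

Keeping the flag on the instance means a single-site run and a network loop go through the same method, with the loop simply getting an empty disk result after the first hit.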
gatherSiteSignals() runs once per blog under the switch_to_blog loop in gatherNetworkSignals(). AS bloat and fatals are genuinely per-site (per-prefix tables, per-site debug.log semantics) and repeat correctly. Disk is host-global, so a naive per-blog emit would repeat the same warning on every site. I added a $disk_emitted instance guard so the disk line is emitted once per run (on the first site that crosses the bar) rather than deduping it into a separate network-only line. This keeps gatherSiteSignals() self-contained and correct in both site and network scope without special-casing the network path.

Tests
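For readers unfamiliar with the reflection technique the smoke test relies on, the sketch below shows one way a dependency-free script can invoke a private method. SketchTask and its humanizeBytes() method are invented for the example; the actual test targets the private gatherers on WakeBriefingTask.

```php
<?php
// Hypothetical stand-in class; the real smoke test targets WakeBriefingTask.
final class SketchTask
{
    private function humanizeBytes(float $bytes): string
    {
        return sprintf('%.1fGB', $bytes / 1073741824);
    }
}

/** Invokes a private or protected method on $target and returns its result. */
function sketch_call_private(object $target, string $method, array $args = [])
{
    $ref = new ReflectionMethod($target, $method);

    // Since PHP 8.1 reflected methods are invocable without this call;
    // older versions still need it.
    if (PHP_VERSION_ID < 80100) {
        $ref->setAccessible(true);
    }

    return $ref->invokeArgs($target, $args);
}

/** Tiny assertion helper in the spirit of a framework-free smoke test. */
function sketch_check(string $label, bool $ok): bool
{
    echo ($ok ? 'PASS' : 'FAIL') . ': ' . $label . PHP_EOL;
    return $ok;
}
```

A call such as sketch_check('humanized size', sketch_call_private(new SketchTask(), 'humanizeBytes', [2.0 * 1073741824]) === '2.0GB') then exercises the private method without widening its visibility in production code.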
New dependency-free smoke test tests/wake-briefing-infra-signals-smoke.php (same convention as retention-action-scheduler-batching-smoke.php) drives all three private gatherers via reflection:

- debug.log with in-window grouped fatals, an out-of-window fatal, a multi-line trace, and a warning: asserts grouped count (3), top signature (×2 claim_actions), warning excluded, empty when window excludes all, and fail-soft empty when the log is absent.
- $wpdb information_schema read: asserts the 28.1M/23.9GB actions table is named, the within-bounds logs table is not, humanized formatting, empty when both within bounds, the row ceiling trips independently of bytes via filter, and fail-soft empty on a missing table.

Verification
- php -l on both changed files: no syntax errors.
- php -d error_reporting=0 vendor/bin/phpcs inc/Engine/AI/System/Tasks/WakeBriefingTask.php tests/wake-briefing-infra-signals-smoke.php: exit 0.
- php tests/wake-briefing-infra-signals-smoke.php: 13/13 assertions pass.
- php tests/retention-action-scheduler-batching-smoke.php: still passes (no regression).

Closes #2799