perf(maintenance): duration-margin hysteresis + branch cooldown + 250ms lease default (#169)#318
perf(maintenance): duration-margin hysteresis + branch cooldown + 250ms lease default (#169)#318hardbyte wants to merge 3 commits into
Conversation
…ms lease default (#169) Follow-up to #316 bench post-mortem. The K=3 hysteresis alone wasn't enough — bench evidence showed rotate_lease still flapping at the 50ms boundary because "delayed" samples hovered at 51ms and "recovered" at 45-49ms. K=3 accumulated quickly because every other tick crossed. The branch also kept doing real (slow) work between the flip and the recovery, contributing to the dead-tuple growth even with backoff on prune. Three coordinated changes in awa-worker: 1. Duration-margin + K-consecutive hysteresis. Samples must clearly exceed `tick_interval * 3/2` to count as overrun, or be clearly below `tick_interval * 7/10` to count as on-time. Anything in the deadband leaves both counters alone. Kills the boundary flap directly: 51ms-vs-49ms can no longer accumulate K observations. Integer ratios — no f64 in the hot path. 2. Per-branch cooldown. On flip-to-delayed, arm `cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS` (120 ticks = ~30s at the new 250ms lease_rotate cadence). While non-zero, `try_begin` returns None and the branch body is skipped — no ACCESS-EXCLUSIVE attempts, no count(*) scans, no contribution to the dead-tuple growth. Cooldown re-arms on every overrun observation while still delayed, so a chronically-failing branch stays quiet. Counters are frozen during cooldown (no body = no duration sample), so recovery requires K real on-time samples AFTER cooldown expires. 3. Default `lease_rotate_interval` 50ms → 250ms. The 50ms was over-spec'd given that prune backoff already throttles the heavy work — faster rotation just produced loop noise. 250ms still cycles the 8-slot lease ring in ~2s, which is plenty fast for reclamation. All other intervals unchanged. API change: `MaintenanceBranchTracker::begin` is now `try_begin` and returns `Option<BranchTimer>`. None signals "skip this tick's body"; all 13 select! arms updated to use `if let Some(timer) = ...`. The existing `BranchTimer::finish()` semantics are preserved. Tests: - Updated 6 existing branch-tracker tests for the new API + duration thresholds. - New: deadband_sample_does_not_advance_counters, k_consecutive_overruns_flips_and_arms_cooldown, cooldown_skips_body, cooldown_expires_then_body_runs, cooldown_rearms_on_continued_overrun, recovers_only_after_k_ontime_ticks_post_cooldown. - All 22 maintenance unit tests pass; 39 worker lib tests pass.
|
Warning Review limit reached
More reviews will be available in 51 minutes and 7 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR raises the default lease rotation interval from 50ms to 250ms and refactors the maintenance branch tracker into a cooldown-plus-hysteresis state machine (try_begin) that can suppress maintenance branch execution; maintenance loop wiring and tests are updated accordingly. ChangesLease Rotation Interval Configuration
Maintenance Branch Overrun Tracker Refactor
Sequence Diagram(s)sequenceDiagram
participant Scheduler as MaintenanceScheduler
participant Tracker as BranchTracker
participant Worker as MaintenanceTask
Scheduler->>Tracker: try_begin(branch_id, last_duration)
alt Suppressed
Tracker-->>Scheduler: None
else Allowed
Tracker-->>Scheduler: Some(BranchTimer)
Scheduler->>Worker: run maintenance branch
Worker-->>Scheduler: completed
Scheduler->>Tracker: timer.finish(branch_id, duration)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f07cf3af44
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } else if cross_threshold && state.is_delayed { | ||
| // Already delayed and another overrun arrived — | ||
| // re-arm cooldown but stay quiet about it. | ||
| state.cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS; | ||
| return None; |
There was a problem hiding this comment.
Don't re-arm cooldown from the stale overrun sample
When a branch first crosses the overrun threshold, try_begin returns None before any body runs, so record_finish is not called and last_duration remains the same overrun value throughout the 120 skipped ticks. Once the cooldown reaches zero, the next try_begin re-enters this already-delayed branch, sees that same stale last_duration with consecutive_overrun >= K, and this block re-arms cooldown and returns None again. In the real maintenance loop this means any branch that trips delayed once is skipped forever until the worker restarts; the tests mask this by manually overwriting last_duration during skipped ticks.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@awa-worker/src/maintenance.rs`:
- Around line 191-194: The branch can get stuck reusing a stale sampled duration
because try_begin returns None without consuming last_duration; to fix, ensure
each sampled duration is consumed exactly once by clearing or taking
last_duration before any early return. Update the logic around try_begin and the
cooldown handling: when you check state.cooldown_ticks_remaining and return
None, take() or set last_duration = None (or otherwise mark the sample consumed)
so that after cooldown the same overrun isn't re-evaluated; likewise, when
try_begin returns None elsewhere, consume last_duration before returning.
Reference: try_begin, last_duration, and state.cooldown_ticks_remaining.
- Around line 2528-2549: The helper replay_ticks is incorrectly mutating
tracker.branches[].last_duration before calling tracker.try_begin, which
simulates a duration even for ticks that were skipped by cooldown; change the
logic so you only set last_duration when try_begin returns Some(timer) (i.e.,
the body would run). Concretely, call tracker.try_begin(branch, tick_interval,
&metrics) first, check timer_opt.is_some(), and only then update
tracker.branches.lock().unwrap().entry(branch).or_default().last_duration =
Some(last_duration) (and simulate finishing the timer if necessary); keep
references to replay_ticks, tracker.try_begin, and the last_duration field to
locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c11d8e52-664e-44a7-98f2-163b65db6945
📒 Files selected for processing (2)
awa-worker/src/client.rsawa-worker/src/maintenance.rs
Codex and CodeRabbit both independently flagged a P1 bug on PR #318: without consuming the sample, the branch enters perpetual cooldown once it flips delayed. Trace: - Body N runs slow, record_finish writes last_duration=Some(slow). - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun saturates past K; flip + cooldown=120; return None. - Ticks N+2..N+121: cooldown gate decrements, return None. No body runs, no record_finish call, last_duration STAYS Some(slow). - Tick N+122 (cooldown drained): try_begin sees the same stale Some(slow), saturates overrun again, re-arms cooldown forever. Worker has to restart to escape. The test helper hid this by writing last_duration before every iteration — synthetic samples even on ticks where the body was skipped. Fix: `state.last_duration.take()` in try_begin's hysteresis check. Each sample is evaluated exactly once. After flip, the consumed sample leaves last_duration = None; cooldown drains without writing anything; first post-cooldown try_begin sees None and skips the eval, body runs, record_finish writes a fresh sample, next try_begin evaluates it. Test fixes: - `replay_ticks` no longer seeds last_duration on each iteration. It calls record_finish *only* when the body would have run, matching production: skipped ticks don't produce samples. - New `seed_last_duration` helper for tests that need an initial sample to drive the very first try_begin observation. - Tests updated for the new K+1 recovery cadence: the very first post-cooldown body produces a sample evaluated on the *next* try_begin, so K evaluable on-time samples require K+1 ticks post-cooldown. All 22 maintenance tests pass with the corrected semantics. The production-side fix is the take() — the test fixes are bookkeeping that exposes (rather than hides) the new contract.
|
Corrected bench verdict after the Run: Result:
Interpretation: #318 is still worth merging as a mitigation for maintenance noise / boundary flapping, and the corrected sample-consuming semantics are required. But the release-gate evidence still says sustained pinned MVCC pressure eventually hits the queue-metadata per-row update wall. This is not lease heartbeat churn anymore; it is the queue metadata model under a long pinned snapshot. Recommendation: merge #318 as the mitigation, but do not frame it as closing #169. For v0.6 either document a hard MVCC discipline/ceiling for long pinned snapshots, or pull the #295 rotation/TRUNCATE metadata redesign into the release scope if that ceiling is unacceptable. |
Codex and CodeRabbit both independently flagged a P1 bug on PR #318: without consuming the sample, the branch enters perpetual cooldown once it flips delayed. Trace: - Body N runs slow, record_finish writes last_duration=Some(slow). - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun saturates past K; flip + cooldown=120; return None. - Ticks N+2..N+121: cooldown gate decrements, return None. No body runs, no record_finish call, last_duration STAYS Some(slow). - Tick N+122 (cooldown drained): try_begin sees the same stale Some(slow), saturates overrun again, re-arms cooldown forever. Worker has to restart to escape. The test helper hid this by writing last_duration before every iteration — synthetic samples even on ticks where the body was skipped. Fix: `state.last_duration.take()` in try_begin's hysteresis check. Each sample is evaluated exactly once. After flip, the consumed sample leaves last_duration = None; cooldown drains without writing anything; first post-cooldown try_begin sees None and skips the eval, body runs, record_finish writes a fresh sample, next try_begin evaluates it. Test fixes: - `replay_ticks` no longer seeds last_duration on each iteration. It calls record_finish *only* when the body would have run, matching production: skipped ticks don't produce samples. - New `seed_last_duration` helper for tests that need an initial sample to drive the very first try_begin observation. - Tests updated for the new K+1 recovery cadence: the very first post-cooldown body produces a sample evaluated on the *next* try_begin, so K evaluable on-time samples require K+1 ticks post-cooldown. All 22 maintenance tests pass with the corrected semantics. The production-side fix is the take() — the test fixes are bookkeeping that exposes (rather than hides) the new contract.
Summary
Follow-up to #316 from the bench post-mortem you ran. K=3 hysteresis alone wasn't enough — bench evidence showed
rotate_leasestill flapping at the 50ms boundary because "delayed" samples hovered at 51ms and "recovered" at 45-49ms. The branch also kept doing real (slow) work between flip and recovery, contributing to dead-tuple growth even with prune backoff.Three coordinated changes
1. Duration-margin + K-consecutive hysteresis
last_duration > tick_interval * 3/2(1.5×)last_duration < tick_interval * 7/10(0.7×)2. Per-branch cooldown
cooldown_ticks_remaining = 120(~30s at the new 250ms lease_rotate cadence)try_beginreturnsNone→ branch body is skipped: noACCESS EXCLUSIVEattempts, nocount(*)scans, no contribution to dead-tuple growth3. Default
lease_rotate_interval50ms → 250msqueue_rotate_intervalstays 1000ms;claim_rotate_intervaldefaults to queue)API change
MaintenanceBranchTracker::begin→try_beginreturningOption<BranchTimer>. None signals "skip this tick's body"; all 13select!arms updated toif let Some(timer) = .... ExistingBranchTimer::finish()semantics preserved.Test plan
cargo fmt --allcargo clippy --workspace --all-targets --all-features -- -D warningscargo build --workspacecargo test -p awa-worker --lib— 39 passingdeadband_sample_does_not_advance_countersk_consecutive_overruns_flips_and_arms_cooldowncooldown_skips_bodycooldown_expires_then_body_runscooldown_rearms_on_continued_overrunrecovers_only_after_k_ontime_ticks_post_cooldownsingle_overrun_does_not_flip(updated for new thresholds)intermittent_overrun_does_not_flip(updated)finish_records_last_duration(updated for Option return)per_branch_state_is_independent(still passes)What this is and isn't
This addresses the noise + per-tick work issues the bench surfaced on the maintenance side. It does not attempt to close the queue-metadata bloat cliff (
queue_claim_heads/queue_enqueue_heads/queue_ring_state/queue_terminal_live_countsunder sustained pin) — that's the per-row-state-machine model's fundamental limit and is the territory of #295 (rotation+TRUNCATE engine).The next
long_horizonre-run will tell us how much of the cliff this closes. If meaningful improvement: ship 0.6.0-beta.2 with the operational-guidance doc shape from the #169 release decision. If not: time to discuss #295 scope formally.Summary by CodeRabbit