Skip to content

perf(maintenance): duration-margin hysteresis + branch cooldown + 250ms lease default (#169)#318

Open
hardbyte wants to merge 3 commits into
mainfrom
brian/issue-169-maintenance-cooldown-margin
Open

perf(maintenance): duration-margin hysteresis + branch cooldown + 250ms lease default (#169)#318
hardbyte wants to merge 3 commits into
mainfrom
brian/issue-169-maintenance-cooldown-margin

Conversation

@hardbyte
Copy link
Copy Markdown
Owner

@hardbyte hardbyte commented Jun 4, 2026

Summary

Follow-up to #316 from the bench post-mortem you ran. K=3 hysteresis alone wasn't enough — bench evidence showed rotate_lease still flapping at the 50ms boundary because "delayed" samples hovered at 51ms and "recovered" at 45-49ms. The branch also kept doing real (slow) work between flip and recovery, contributing to dead-tuple growth even with prune backoff.

Three coordinated changes

1. Duration-margin + K-consecutive hysteresis

  • Overrun sample: last_duration > tick_interval * 3/2 (1.5×)
  • On-time sample: last_duration < tick_interval * 7/10 (0.7×)
  • Anything in between is a deadband — neither counter advances
  • Still requires K=3 consecutive on either side before flipping
  • Kills the boundary flap directly: 51ms-vs-49ms can't accumulate K observations because they don't even count
  • Integer ratios — no f64 in the hot path

2. Per-branch cooldown

  • On flip-to-delayed, arm cooldown_ticks_remaining = 120 (~30s at the new 250ms lease_rotate cadence)
  • While non-zero, try_begin returns None → branch body is skipped: no ACCESS EXCLUSIVE attempts, no count(*) scans, no contribution to dead-tuple growth
  • Cooldown re-arms on every overrun observation while still delayed → chronically-failing branch stays quiet
  • Counters are frozen during cooldown (no body = no sample), so recovery requires K real on-time samples after cooldown expires

3. Default lease_rotate_interval 50ms → 250ms

  • The 50ms default was over-spec'd given prune backoff already throttles the heavy work
  • 250ms still cycles the 8-slot lease ring in ~2s — plenty fast for reclamation
  • All other intervals unchanged (queue_rotate_interval stays 1000ms; claim_rotate_interval defaults to queue)

API change

MaintenanceBranchTracker::begintry_begin returning Option<BranchTimer>. None signals "skip this tick's body"; all 13 select! arms updated to if let Some(timer) = .... Existing BranchTimer::finish() semantics preserved.

Test plan

  • cargo fmt --all
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo build --workspace
  • cargo test -p awa-worker --lib — 39 passing
  • 22 maintenance unit tests passing; 10 new for this PR:
    • deadband_sample_does_not_advance_counters
    • k_consecutive_overruns_flips_and_arms_cooldown
    • cooldown_skips_body
    • cooldown_expires_then_body_runs
    • cooldown_rearms_on_continued_overrun
    • recovers_only_after_k_ontime_ticks_post_cooldown
    • single_overrun_does_not_flip (updated for new thresholds)
    • intermittent_overrun_does_not_flip (updated)
    • finish_records_last_duration (updated for Option return)
    • per_branch_state_is_independent (still passes)

What this is and isn't

This addresses the noise + per-tick work issues the bench surfaced on the maintenance side. It does not attempt to close the queue-metadata bloat cliff (queue_claim_heads / queue_enqueue_heads / queue_ring_state / queue_terminal_live_counts under sustained pin) — that's the per-row-state-machine model's fundamental limit and is the territory of #295 (rotation+TRUNCATE engine).

The next long_horizon re-run will tell us how much of the cliff this closes. If meaningful improvement: ship 0.6.0-beta.2 with the operational-guidance doc shape from the #169 release decision. If not: time to discuss #295 scope formally.

Summary by CodeRabbit

  • Chores
    • Tuned lease rotation timing to reduce churn and improve resource stability.
    • Strengthened maintenance task scheduling with overrun detection, cooldown suppression, and per-branch isolation to avoid repeated work during unhealthy periods.
  • Tests
    • Updated unit tests to validate the new scheduling, cooldown and recovery behaviors.

…ms lease default (#169)

Follow-up to #316 bench post-mortem. The K=3 hysteresis alone wasn't
enough — bench evidence showed rotate_lease still flapping at the 50ms
boundary because "delayed" samples hovered at 51ms and "recovered" at
45-49ms. K=3 accumulated quickly because every other tick crossed.
The branch also kept doing real (slow) work between the flip and the
recovery, contributing to the dead-tuple growth even with backoff
on prune.

Three coordinated changes in awa-worker:

1. Duration-margin + K-consecutive hysteresis. Samples must clearly
   exceed `tick_interval * 3/2` to count as overrun, or be clearly
   below `tick_interval * 7/10` to count as on-time. Anything in the
   deadband leaves both counters alone. Kills the boundary flap
   directly: 51ms-vs-49ms can no longer accumulate K observations.
   Integer ratios — no f64 in the hot path.

2. Per-branch cooldown. On flip-to-delayed, arm
   `cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS` (120 ticks =
   ~30s at the new 250ms lease_rotate cadence). While non-zero,
   `try_begin` returns None and the branch body is skipped — no
   ACCESS-EXCLUSIVE attempts, no count(*) scans, no contribution to
   the dead-tuple growth. Cooldown re-arms on every overrun
   observation while still delayed, so a chronically-failing branch
   stays quiet. Counters are frozen during cooldown (no body =
   no duration sample), so recovery requires K real on-time samples
   AFTER cooldown expires.

3. Default `lease_rotate_interval` 50ms → 250ms. The 50ms was
   over-spec'd given that prune backoff already throttles the heavy
   work — faster rotation just produced loop noise. 250ms still
   cycles the 8-slot lease ring in ~2s, which is plenty fast for
   reclamation. All other intervals unchanged.

API change: `MaintenanceBranchTracker::begin` is now `try_begin` and
returns `Option<BranchTimer>`. None signals "skip this tick's body";
all 13 select! arms updated to use `if let Some(timer) = ...`. The
existing `BranchTimer::finish()` semantics are preserved.

Tests:
- Updated 6 existing branch-tracker tests for the new API + duration
  thresholds.
- New: deadband_sample_does_not_advance_counters,
  k_consecutive_overruns_flips_and_arms_cooldown,
  cooldown_skips_body, cooldown_expires_then_body_runs,
  cooldown_rearms_on_continued_overrun,
  recovers_only_after_k_ontime_ticks_post_cooldown.
- All 22 maintenance unit tests pass; 39 worker lib tests pass.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Warning

Review limit reached

@hardbyte, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes and 7 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 214c5bc8-09ae-4aa9-a0f5-a7dfbe4331cd

📥 Commits

Reviewing files that changed from the base of the PR and between 263ec46 and 7ddacce.

📒 Files selected for processing (3)
  • awa-worker/src/maintenance.rs
  • docs/architecture.md
  • docs/configuration.md
📝 Walkthrough

Walkthrough

This PR raises the default lease rotation interval from 50ms to 250ms and refactors the maintenance branch tracker into a cooldown-plus-hysteresis state machine (try_begin) that can suppress maintenance branch execution; maintenance loop wiring and tests are updated accordingly.

Changes

Lease Rotation Interval Configuration

Layer / File(s) Summary
Increase lease rotation interval with documentation
awa-worker/src/client.rs
ClientBuilder::new passes 250ms instead of 50ms as the default lease_rotate_interval, with expanded inline documentation explaining the change.

Maintenance Branch Overrun Tracker Refactor

Layer / File(s) Summary
Branch state fields
awa-worker/src/maintenance.rs
Adds consecutive_overrun, consecutive_ontime, and cooldown_ticks_remaining to MaintenanceBranchState.
try_begin gating and hysteresis logic
awa-worker/src/maintenance.rs
Replaces begin with try_begin which implements cooldown-first skipping and duration-margin hysteresis, returns Option<BranchTimer>, emits overrun observability only when entering delayed, and suppresses branch execution by returning None during delays/cooldowns.
Test hooks and cooldown snapshot
awa-worker/src/maintenance.rs
Adds cooldown_snapshot test helper exposing internal counters and cooldown state for assertions.
Maintenance loop conditional dispatch
awa-worker/src/maintenance.rs
Leader maintenance tokio::select! loop and queue/lease/claim rotate branches now call branch_tracker.try_begin(...) and run bodies / call timer.finish() only when Some(timer) is returned.
Tracker tests refactor and tick replay
awa-worker/src/maintenance.rs
Refactors tests to replay ticks via try_begin, distinguishing run vs skipped, updates assertions for deadband/hysteresis behavior, cooldown re-arming and expiry, and adjusts per-branch initial seeding for independence tests.

Sequence Diagram(s)

sequenceDiagram
  participant Scheduler as MaintenanceScheduler
  participant Tracker as BranchTracker
  participant Worker as MaintenanceTask
  Scheduler->>Tracker: try_begin(branch_id, last_duration)
  alt Suppressed
    Tracker-->>Scheduler: None
  else Allowed
    Tracker-->>Scheduler: Some(BranchTimer)
    Scheduler->>Worker: run maintenance branch
    Worker-->>Scheduler: completed
    Scheduler->>Tracker: timer.finish(branch_id, duration)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hardbyte/awa#316: Adds hysteresis and consecutive-sample logic to MaintenanceBranchTracker that this PR builds upon with suppression gating and cooldown state.

Poem

🐰 I counted ticks beneath the moonlit sky,
Overruns whispered, "rest a little, try."
Cooldown set, hysteresis in my paw,
Leases breathe at two-fifty—peaceful law.
Thump-thump, the watcher-rabbit keeps the draw.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main changes: duration-margin hysteresis, branch cooldown, and 250ms lease default.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch brian/issue-169-maintenance-cooldown-margin

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f07cf3af44

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +220 to +224
} else if cross_threshold && state.is_delayed {
// Already delayed and another overrun arrived —
// re-arm cooldown but stay quiet about it.
state.cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS;
return None;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't re-arm cooldown from the stale overrun sample

When a branch first crosses the overrun threshold, try_begin returns None before any body runs, so record_finish is not called and last_duration remains the same overrun value throughout the 120 skipped ticks. Once the cooldown reaches zero, the next try_begin re-enters this already-delayed branch, sees that same stale last_duration with consecutive_overrun >= K, and this block re-arms cooldown and returns None again. In the real maintenance loop this means any branch that trips delayed once is skipped forever until the worker restarts; the tests mask this by manually overwriting last_duration during skipped ticks.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awa-worker/src/maintenance.rs`:
- Around line 191-194: The branch can get stuck reusing a stale sampled duration
because try_begin returns None without consuming last_duration; to fix, ensure
each sampled duration is consumed exactly once by clearing or taking
last_duration before any early return. Update the logic around try_begin and the
cooldown handling: when you check state.cooldown_ticks_remaining and return
None, take() or set last_duration = None (or otherwise mark the sample consumed)
so that after cooldown the same overrun isn't re-evaluated; likewise, when
try_begin returns None elsewhere, consume last_duration before returning.
Reference: try_begin, last_duration, and state.cooldown_ticks_remaining.
- Around line 2528-2549: The helper replay_ticks is incorrectly mutating
tracker.branches[].last_duration before calling tracker.try_begin, which
simulates a duration even for ticks that were skipped by cooldown; change the
logic so you only set last_duration when try_begin returns Some(timer) (i.e.,
the body would run). Concretely, call tracker.try_begin(branch, tick_interval,
&metrics) first, check timer_opt.is_some(), and only then update
tracker.branches.lock().unwrap().entry(branch).or_default().last_duration =
Some(last_duration) (and simulate finishing the timer if necessary); keep
references to replay_ticks, tracker.try_begin, and the last_duration field to
locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c11d8e52-664e-44a7-98f2-163b65db6945

📥 Commits

Reviewing files that changed from the base of the PR and between cc9ba62 and f07cf3a.

📒 Files selected for processing (2)
  • awa-worker/src/client.rs
  • awa-worker/src/maintenance.rs

Comment thread awa-worker/src/maintenance.rs
Comment thread awa-worker/src/maintenance.rs
hardbyte and others added 2 commits June 4, 2026 15:21
Codex and CodeRabbit both independently flagged a P1 bug on PR #318:
without consuming the sample, the branch enters perpetual cooldown
once it flips delayed. Trace:

  - Body N runs slow, record_finish writes last_duration=Some(slow).
  - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun
    saturates past K; flip + cooldown=120; return None.
  - Ticks N+2..N+121: cooldown gate decrements, return None. No body
    runs, no record_finish call, last_duration STAYS Some(slow).
  - Tick N+122 (cooldown drained): try_begin sees the same stale
    Some(slow), saturates overrun again, re-arms cooldown forever.

Worker has to restart to escape. The test helper hid this by writing
last_duration before every iteration — synthetic samples even on
ticks where the body was skipped.

Fix: `state.last_duration.take()` in try_begin's hysteresis check.
Each sample is evaluated exactly once. After flip, the consumed
sample leaves last_duration = None; cooldown drains without writing
anything; first post-cooldown try_begin sees None and skips the
eval, body runs, record_finish writes a fresh sample, next try_begin
evaluates it.

Test fixes:
  - `replay_ticks` no longer seeds last_duration on each iteration.
    It calls record_finish *only* when the body would have run,
    matching production: skipped ticks don't produce samples.
  - New `seed_last_duration` helper for tests that need an initial
    sample to drive the very first try_begin observation.
  - Tests updated for the new K+1 recovery cadence: the very first
    post-cooldown body produces a sample evaluated on the *next*
    try_begin, so K evaluable on-time samples require K+1 ticks
    post-cooldown.

All 22 maintenance tests pass with the corrected semantics. The
production-side fix is the take() — the test fixes are bookkeeping
that exposes (rather than hides) the new contract.
@hardbyte
Copy link
Copy Markdown
Owner Author

hardbyte commented Jun 4, 2026

Corrected bench verdict after the last_duration.take() rescue fix + docs cleanup (7ddacce).

Run: custom-20260604T033058Z-4216b7 in postgresql-job-queue-benchmarking, Awa 7ddacce on pr-318, harness 02e3a4f, clean worktrees. Shape: warmup=30s, clean_1=10m, idle_1=idle-in-tx:70m, recovery_1=10m, 800/s, 32 workers, QUEUE_STRIPE_COUNT=4, LEASE_ROTATE_MS=250, completion batch 1024, flush 20ms.

Result:

  • Clean: steady. completion avg 800.0/s, enqueue avg 800.0/s, depth avg 11, p99 avg 63ms.
  • Idle pinned snapshot: perf(maintenance): duration-margin hysteresis + branch cooldown + 250ms lease default (#169) #318 delays the old cliff but does not remove the underlying limit. Over the full 70m pin, completion avg 796.96/s vs enqueue avg 800.0/s; final sample was completion 776.0/s, enqueue 800.4/s, p99 35.8s, depth 12,733.
  • Shape matters: around ~43m pinned it showed bounded stall/catch-up pulses, but from ~60m pinned depth began climbing monotonically. Last 20 idle samples: depth 1,153 -> 1,769 -> 3,458 -> 7,282 -> 12,733; p99 3.9s -> 6.2s -> 10.4s -> 20.6s -> 35.8s.
  • Recovery after releasing the pin was immediate. First recovery sample completed 1225/s and depth returned to 0; subsequent recovery windows were back to ~800/s, depth 0-21, p99 mostly <110ms.

Interpretation: #318 is still worth merging as a mitigation for maintenance noise / boundary flapping, and the corrected sample-consuming semantics are required. But the release-gate evidence still says sustained pinned MVCC pressure eventually hits the queue-metadata per-row update wall. This is not lease heartbeat churn anymore; it is the queue metadata model under a long pinned snapshot.

Recommendation: merge #318 as the mitigation, but do not frame it as closing #169. For v0.6 either document a hard MVCC discipline/ceiling for long pinned snapshots, or pull the #295 rotation/TRUNCATE metadata redesign into the release scope if that ceiling is unacceptable.

hardbyte added a commit that referenced this pull request Jun 5, 2026
Codex and CodeRabbit both independently flagged a P1 bug on PR #318:
without consuming the sample, the branch enters perpetual cooldown
once it flips delayed. Trace:

  - Body N runs slow, record_finish writes last_duration=Some(slow).
  - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun
    saturates past K; flip + cooldown=120; return None.
  - Ticks N+2..N+121: cooldown gate decrements, return None. No body
    runs, no record_finish call, last_duration STAYS Some(slow).
  - Tick N+122 (cooldown drained): try_begin sees the same stale
    Some(slow), saturates overrun again, re-arms cooldown forever.

Worker has to restart to escape. The test helper hid this by writing
last_duration before every iteration — synthetic samples even on
ticks where the body was skipped.

Fix: `state.last_duration.take()` in try_begin's hysteresis check.
Each sample is evaluated exactly once. After flip, the consumed
sample leaves last_duration = None; cooldown drains without writing
anything; first post-cooldown try_begin sees None and skips the
eval, body runs, record_finish writes a fresh sample, next try_begin
evaluates it.

Test fixes:
  - `replay_ticks` no longer seeds last_duration on each iteration.
    It calls record_finish *only* when the body would have run,
    matching production: skipped ticks don't produce samples.
  - New `seed_last_duration` helper for tests that need an initial
    sample to drive the very first try_begin observation.
  - Tests updated for the new K+1 recovery cadence: the very first
    post-cooldown body produces a sample evaluated on the *next*
    try_begin, so K evaluable on-time samples require K+1 ticks
    post-cooldown.

All 22 maintenance tests pass with the corrected semantics. The
production-side fix is the take() — the test fixes are bookkeeping
that exposes (rather than hides) the new contract.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant