Move queue storage cursors off hot rows#321
Conversation
|
Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughImplements sequence-backed per-lane cursor substrate and sequence_next_value usage, deterministic post-commit claim-cursor advancement, bucketed terminal live counters with rebucket migration, maintenance branch hysteresis/cooldown gating, and coordinated tests/docs updates. ChangesQueue Storage Cursor and Counter Redesign
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0adaf01532
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| -- receipts=FALSE writes directly into the partitioned leases table. | ||
| -------------------------------------------------------------------- | ||
|
|
||
| v_advance_claim_condition := 'FALSE'; |
There was a problem hiding this comment.
Advance claim cursors before releasing the lane lock
With more than one claimer on the same queue, setting this condition to FALSE means claim_ready_runtime commits the lease/receipt while the sequence cursor is still at the old lane_seq; the FOR UPDATE lock on queue_claim_heads is then released before Rust's post-commit advance_claim_cursors can run. A second claimer that wins that race reads the same sequence_next_value, selects the same ready_entries rows, and either hits duplicate-key errors in the lease/receipt tables or can duplicate the claim after ring rotation. The cursor needs to move while the lane lock is still held, or the post-commit path needs another mechanism that prevents another claim from observing the stale cursor.
Useful? React with 👍 / 👎.
| WHERE NOT EXISTS ( | ||
| SELECT 1 FROM %1$I.queue_terminal_live_counts LIMIT 1 | ||
| ) |
There was a problem hiding this comment.
Rebucket existing terminal counters during migration
On upgraded databases where queue_terminal_live_counts already has rows, the new counter_bucket column defaults every preexisting aggregate to bucket 0 and this backfill is skipped entirely. Later decrements use the actual job_id bucket in decrement_live_terminal_counters_tx, so deleting/retrying terminal jobs whose bucket is not 0 will not reduce the old aggregate row, leaving queue_counts_exact terminal totals inflated. The migration should rebuild or redistribute existing counter rows instead of skipping whenever the table is non-empty.
Useful? React with 👍 / 👎.
…ms lease default (#169) Follow-up to #316 bench post-mortem. The K=3 hysteresis alone wasn't enough — bench evidence showed rotate_lease still flapping at the 50ms boundary because "delayed" samples hovered at 51ms and "recovered" at 45-49ms. K=3 accumulated quickly because every other tick crossed. The branch also kept doing real (slow) work between the flip and the recovery, contributing to the dead-tuple growth even with backoff on prune. Three coordinated changes in awa-worker: 1. Duration-margin + K-consecutive hysteresis. Samples must clearly exceed `tick_interval * 3/2` to count as overrun, or be clearly below `tick_interval * 7/10` to count as on-time. Anything in the deadband leaves both counters alone. Kills the boundary flap directly: 51ms-vs-49ms can no longer accumulate K observations. Integer ratios — no f64 in the hot path. 2. Per-branch cooldown. On flip-to-delayed, arm `cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS` (120 ticks = ~30s at the new 250ms lease_rotate cadence). While non-zero, `try_begin` returns None and the branch body is skipped — no ACCESS-EXCLUSIVE attempts, no count(*) scans, no contribution to the dead-tuple growth. Cooldown re-arms on every overrun observation while still delayed, so a chronically-failing branch stays quiet. Counters are frozen during cooldown (no body = no duration sample), so recovery requires K real on-time samples AFTER cooldown expires. 3. Default `lease_rotate_interval` 50ms → 250ms. The 50ms was over-spec'd given that prune backoff already throttles the heavy work — faster rotation just produced loop noise. 250ms still cycles the 8-slot lease ring in ~2s, which is plenty fast for reclamation. All other intervals unchanged. API change: `MaintenanceBranchTracker::begin` is now `try_begin` and returns `Option<BranchTimer>`. None signals "skip this tick's body"; all 13 select! arms updated to use `if let Some(timer) = ...`. The existing `BranchTimer::finish()` semantics are preserved. Tests: - Updated 6 existing branch-tracker tests for the new API + duration thresholds. - New: deadband_sample_does_not_advance_counters, k_consecutive_overruns_flips_and_arms_cooldown, cooldown_skips_body, cooldown_expires_then_body_runs, cooldown_rearms_on_continued_overrun, recovers_only_after_k_ontime_ticks_post_cooldown. - All 22 maintenance unit tests pass; 39 worker lib tests pass.
Codex and CodeRabbit both independently flagged a P1 bug on PR #318: without consuming the sample, the branch enters perpetual cooldown once it flips delayed. Trace: - Body N runs slow, record_finish writes last_duration=Some(slow). - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun saturates past K; flip + cooldown=120; return None. - Ticks N+2..N+121: cooldown gate decrements, return None. No body runs, no record_finish call, last_duration STAYS Some(slow). - Tick N+122 (cooldown drained): try_begin sees the same stale Some(slow), saturates overrun again, re-arms cooldown forever. Worker has to restart to escape. The test helper hid this by writing last_duration before every iteration — synthetic samples even on ticks where the body was skipped. Fix: `state.last_duration.take()` in try_begin's hysteresis check. Each sample is evaluated exactly once. After flip, the consumed sample leaves last_duration = None; cooldown drains without writing anything; first post-cooldown try_begin sees None and skips the eval, body runs, record_finish writes a fresh sample, next try_begin evaluates it. Test fixes: - `replay_ticks` no longer seeds last_duration on each iteration. It calls record_finish *only* when the body would have run, matching production: skipped ticks don't produce samples. - New `seed_last_duration` helper for tests that need an initial sample to drive the very first try_begin observation. - Tests updated for the new K+1 recovery cadence: the very first post-cooldown body produces a sample evaluated on the *next* try_begin, so K evaluable on-time samples require K+1 ticks post-cooldown. All 22 maintenance tests pass with the corrected semantics. The production-side fix is the take() — the test fixes are bookkeeping that exposes (rather than hides) the new contract.
0adaf01 to
872149a
Compare
|
Rebased after #320 merged and resolved the migration collision:
Validation after rebase:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
docs/upgrade-0.5-to-0.6.md (1)
289-292: ⚡ Quick winCompute cursor values once to keep
lagself-consistent.On Line 289–Line 292, the snippet re-invokes
sequence_next_value(...)when computinglag. If those reads are not guaranteed stable per row, displayedenqueue_cursor/claim_cursorcan diverge fromlag. Compute once, then subtract aliases.Proposed docs snippet update
- SELECT priority, enqueue_shard, - <queue_storage_schema>.sequence_next_value(enqueues.seq_name) AS enqueue_cursor, - <queue_storage_schema>.sequence_next_value(claims.seq_name) AS claim_cursor, - <queue_storage_schema>.sequence_next_value(enqueues.seq_name) - - <queue_storage_schema>.sequence_next_value(claims.seq_name) AS lag + SELECT priority, enqueue_shard, enqueue_cursor, claim_cursor, + enqueue_cursor - claim_cursor AS lag FROM <queue_storage_schema>.queue_claim_heads AS claims JOIN <queue_storage_schema>.queue_enqueue_heads AS enqueues USING (queue, priority, enqueue_shard) + CROSS JOIN LATERAL ( + SELECT <queue_storage_schema>.sequence_next_value(enqueues.seq_name) AS enqueue_cursor, + <queue_storage_schema>.sequence_next_value(claims.seq_name) AS claim_cursor + ) cur WHERE queue = 'my_hot_queue' ORDER BY priority, enqueue_shard;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/upgrade-0.5-to-0.6.md` around lines 289 - 292, The SQL currently calls <queue_storage_schema>.sequence_next_value(enqueues.seq_name) and <queue_storage_schema>.sequence_next_value(claims.seq_name) twice (for enqueue_cursor/claim_cursor and again when computing lag), which can produce inconsistent per-row values; update the query to compute each sequence_next_value once (assign to the aliases enqueue_cursor and claim_cursor using the existing calls) and compute lag by subtracting those aliases (enqueue_cursor - claim_cursor) so lag is self-consistent; look for uses of sequence_next_value, enqueue_cursor, claim_cursor, and lag in the snippet and change the lag expression to subtract the previously aliased values rather than re-invoking sequence_next_value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@awa-model/migrations/v023_install_queue_storage_substrate.sql`:
- Around line 1262-1303: The migration must rebucket or invalidate existing
aggregates for queue_terminal_live_counts: after adding counter_bucket and
before (or as part of) replacing queue_terminal_live_counts_pkey, update
existing rows so they won't silently drift — either set counter_bucket = (job_id
% 256) if a job_id column exists, or else set terminal_counter_trusted_at = NULL
for all rows and rebuild/aggregate counts under the new PK; ensure you perform
any necessary aggregation (group by ready_slot, queue, priority, enqueue_shard,
counter_bucket) to collapse old single-bucket rows into the new bucketed shape
and then add the new PRIMARY KEY (queue_terminal_live_counts_pkey) so runtime
key changes won’t break consistency.
In `@awa-model/src/queue_storage.rs`:
- Around line 6646-6657: The code collects claim_cursor_advances and calls
self.advance_claim_cursors(pool, &claim_cursor_advances) replaying them in
caller order which can break preconditions when multiple advances target the
same lane out-of-order; modify the path that handles claim_cursor_advances (the
vector built in the loop around cancel_job_tx and consumed by
advance_claim_cursors) to coalesce and sort advances by the lane-identifying
tuple (queue, priority, enqueue_shard) and then apply per-lane advances in
ascending only_if_current before calling advance_claim_cursors so updates for
the same lane are applied in increasing only_if_current order; update the logic
that builds/consumes claim_cursor_advances (the result.claim_cursor_advance from
cancel_job_tx and the advance_claim_cursors call) to perform this
group-by-and-sort step.
- Around line 3048-3059: The current advance_claim_cursors method swallows
failures from advance_claim_cursors_strict, which can leave per-lane claim
sequences permanently stale; update advance_claim_cursors to retry the strict
advance with a bounded retry/backoff (e.g., N attempts with small delays) or
enqueue a durable compensating repair task so transient pool/connection errors
won't be dropped; specifically modify advance_claim_cursors to call
advance_claim_cursors_strict in a retry loop (or schedule a follow-up repair
job) and ensure failures are surfaced/logged after final attempt so
queue_claimer_signal and queue_counts_exact don't observe permanent stale state.
---
Nitpick comments:
In `@docs/upgrade-0.5-to-0.6.md`:
- Around line 289-292: The SQL currently calls
<queue_storage_schema>.sequence_next_value(enqueues.seq_name) and
<queue_storage_schema>.sequence_next_value(claims.seq_name) twice (for
enqueue_cursor/claim_cursor and again when computing lag), which can produce
inconsistent per-row values; update the query to compute each
sequence_next_value once (assign to the aliases enqueue_cursor and claim_cursor
using the existing calls) and compute lag by subtracting those aliases
(enqueue_cursor - claim_cursor) so lag is self-consistent; look for uses of
sequence_next_value, enqueue_cursor, claim_cursor, and lag in the snippet and
change the lag expression to subtract the previously aliased values rather than
re-invoking sequence_next_value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 768e6d67-e941-4795-a108-4c5dfd035231
📒 Files selected for processing (18)
awa-model/migrations/v022_delete_compat_terminal_counter.sqlawa-model/migrations/v023_install_queue_storage_substrate.sqlawa-model/migrations/v027_sequence_lane_cursors.sqlawa-model/src/admin.rsawa-model/src/migrations.rsawa-model/src/queue_storage.rsawa-worker/src/client.rsawa-worker/src/maintenance.rsawa/tests/chaos_suite_test.rsawa/tests/postgres_failover_smoke_test.rsawa/tests/queue_storage_benchmark_test.rsawa/tests/queue_storage_runtime_test.rsawa/tests/telemetry_test.rsdocs/adr/008-copy-batch-ingestion.mddocs/adr/019-queue-storage-redesign.mddocs/architecture.mddocs/configuration.mddocs/upgrade-0.5-to-0.6.md
Summary
Pulls the #295 storage-shape work into v0.6:
queue_terminal_live_countsby deterministicjob_id % 256buckets so completion-heavy lanes do not hammer one counter row under pinned MVCCBenchmark evidence
Release-gate run:
custom-20260605T013419Z-0c45d0inhardbyte/postgresql-job-queue-benchmarking, summary committed atresults/2026-06-05-v06-gate/SUMMARY.md.Shape: 1 replica, 32 workers, fixed 800 jobs/s producer; warmup 5m, clean 20m, idle-in-tx 60m, recovery 10m; compared Awa branch vs pgque submodule
55ddc1d.Headline: both Awa and pgque completed the 60-minute pinned-MVCC phase without a sustained throughput cliff. Awa idle median completion was
798.57/sat798.59/smedian enqueue, with p95 depth20and median e2e p9530.02 ms. Pgque idle median completion was800.05/sat800.13/smedian enqueue, with p95 depth81and median e2e p95131.55 ms.Validation
cargo fmt --checkgit diff --checkCARGO_TARGET_DIR=target/codex-check cargo check -p awa-modelTEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test migration_test -- --nocapture(37 passed)TEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test queue_storage_runtime_test -- --nocapture(79 passed)Summary by CodeRabbit
New Features
Bug Fixes
Performance Improvements
Documentation & Tests