Skip to content

Move queue storage cursors off hot rows#321

Merged
hardbyte merged 10 commits into
mainfrom
issue-295-v06-sequence-cursors
Jun 6, 2026
Merged

Move queue storage cursors off hot rows#321
hardbyte merged 10 commits into
mainfrom
issue-295-v06-sequence-cursors

Conversation

@hardbyte
Copy link
Copy Markdown
Owner

@hardbyte hardbyte commented Jun 5, 2026

Summary

Pulls the #295 storage-shape work into v0.6:

  • moves queue-storage enqueue/claim lane cursors from hot MVCC-updated rows to PostgreSQL sequences while retaining the head tables as lane registries and lock targets
  • advances claim cursors post-commit so nontransactional sequence state cannot skip work on rollback
  • stripes queue_terminal_live_counts by deterministic job_id % 256 buckets so completion-heavy lanes do not hammer one counter row under pinned MVCC
  • refreshes the default/custom queue-storage substrates via migration v26 and updates docs/tests for the new cursor semantics

Benchmark evidence

Release-gate run: custom-20260605T013419Z-0c45d0 in hardbyte/postgresql-job-queue-benchmarking, summary committed at results/2026-06-05-v06-gate/SUMMARY.md.

Shape: 1 replica, 32 workers, fixed 800 jobs/s producer; warmup 5m, clean 20m, idle-in-tx 60m, recovery 10m; compared Awa branch vs pgque submodule 55ddc1d.

Headline: both Awa and pgque completed the 60-minute pinned-MVCC phase without a sustained throughput cliff. Awa idle median completion was 798.57/s at 798.59/s median enqueue, with p95 depth 20 and median e2e p95 30.02 ms. Pgque idle median completion was 800.05/s at 800.13/s median enqueue, with p95 depth 81 and median e2e p95 131.55 ms.

Validation

  • cargo fmt --check
  • git diff --check
  • CARGO_TARGET_DIR=target/codex-check cargo check -p awa-model
  • TEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test migration_test -- --nocapture (37 passed)
  • TEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test queue_storage_runtime_test -- --nocapture (79 passed)

Summary by CodeRabbit

  • New Features

    • Sequence-driven cursor behavior for more accurate queue selection/counting; per-lane claim-cursor advances and rebucketed terminal counters.
    • Schema migration to install the new queue-storage substrate and rebucket existing counters.
  • Bug Fixes

    • Fixed terminal-counter decrement to update the correct counter bucket.
  • Performance Improvements

    • Increased lease rotation interval from 50ms to 250ms.
  • Documentation & Tests

    • Updated ADRs/docs and added/updated tests for migration, counters, cursor behavior, and maintenance gating.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range.

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f4b6b4ee-ee19-4a9c-bb88-de0a9398f5f7

📥 Commits

Reviewing files that changed from the base of the PR and between 1e39cf9 and 8820b31.

📒 Files selected for processing (3)
  • awa-python/python/awa/_awa.pyi
  • awa-python/python/awa/client.py
  • awa-python/src/client.rs

📝 Walkthrough

Walkthrough

Implements sequence-backed per-lane cursor substrate and sequence_next_value usage, deterministic post-commit claim-cursor advancement, bucketed terminal live counters with rebucket migration, maintenance branch hysteresis/cooldown gating, and coordinated tests/docs updates.

Changes

Queue Storage Cursor and Counter Redesign

Layer / File(s) Summary
Sequence-backed lane cursor substrate & rebucket migration
awa-model/migrations/v023_install_queue_storage_substrate.sql, awa-model/migrations/v027_sequence_lane_cursors.sql
Adds seq_name columns and helpers, ensures per-lane sequences exist and are synced, adds counter_bucket to queue_terminal_live_counts, backfills/rebuilds bucketed counters, and provides v027 multi-schema rollout.
claim_ready_runtime sequence logic
awa-model/migrations/v023_install_queue_storage_substrate.sql
claim_ready_runtime() now derives lane claim_seq/next_seq via sequence_next_value(...), uses selected_with_spent and conditional advancement (set_sequence_next) only when safe, and avoids advancing when nothing was claimed.
Runtime: queue_storage core changes
awa-model/src/queue_storage.rs
Adds TERMINAL_COUNTER_BUCKETS and terminal_counter_bucket(job_id); reserves enqueue sequence via reserve_enqueue_seq; batches lane lane_seq allocation; computes, normalizes, and advances claim-cursor advances after commit; updates cancel/age paths to emit advances rather than in-transaction sequence moves; bucket-aware increment/decrement/rebuild of terminal counters.
Admin and tests: cursor contract alignment
awa-model/src/admin.rs, awa/tests/*, awa-python/tests/*
Replaces direct claim_seq column usage with sequence_next_value(claims.seq_name) in admin reads and many tests/benchmarks/smoke checks so selection/count predicates align with sequence-backed cursors.
Maintenance hysteresis and lease-rotate tuning
awa-worker/src/maintenance.rs, awa-worker/src/client.rs, docs/*
Implements try_begin(...) -> Option<BranchTimer> with cooldown and K-consecutive hysteresis gating for leader branches; increases default lease rotation from 50ms to 250ms and updates docs/ADR examples.
Tests and migration test
awa/tests/queue_storage_runtime_test.rs, awa/tests/migration_test.rs, awa-model/tests/queue_storage_copy_test.rs
Adds test ensuring claim runtime does not skip uncommitted enqueue reservations; adds test_v027_rebuckets_existing_terminal_live_counts; centralizes lane availability helper and updates tests for bucketed counters and sequence-backed cursor behavior.
Correctness traces and harness
correctness/storage/AwaSegmentedStorageTrace.tla, correctness/storage/AwaSegmentedStorageTraceLostClaimAdvance.cfg
Adds a TLA+ lost-claim-advance trace and TLC config to validate receipt-only claim commit without post-commit cursor advance; updates harness README and run scripts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

"🧑‍🌾
Sequences now track our cursors true,
Per lane and bucket, old counts made new.
Hysteresis hushes noisy ticks with care,
Post-commit advances happen where they dare—
Thump-thump, the queues hop forward, carrots share!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Move queue storage cursors off hot rows' clearly and specifically summarizes the main architectural change—moving enqueue/claim cursors from MVCC-updated hot rows to PostgreSQL sequences—which is the primary objective of this substantial PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-295-v06-sequence-cursors

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0adaf01532

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

-- receipts=FALSE writes directly into the partitioned leases table.
--------------------------------------------------------------------

v_advance_claim_condition := 'FALSE';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Advance claim cursors before releasing the lane lock

With more than one claimer on the same queue, setting this condition to FALSE means claim_ready_runtime commits the lease/receipt while the sequence cursor is still at the old lane_seq; the FOR UPDATE lock on queue_claim_heads is then released before Rust's post-commit advance_claim_cursors can run. A second claimer that wins that race reads the same sequence_next_value, selects the same ready_entries rows, and either hits duplicate-key errors in the lease/receipt tables or can duplicate the claim after ring rotation. The cursor needs to move while the lane lock is still held, or the post-commit path needs another mechanism that prevents another claim from observing the stale cursor.

Useful? React with 👍 / 👎.

Comment on lines +1337 to +1339
WHERE NOT EXISTS (
SELECT 1 FROM %1$I.queue_terminal_live_counts LIMIT 1
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rebucket existing terminal counters during migration

On upgraded databases where queue_terminal_live_counts already has rows, the new counter_bucket column defaults every preexisting aggregate to bucket 0 and this backfill is skipped entirely. Later decrements use the actual job_id bucket in decrement_live_terminal_counters_tx, so deleting/retrying terminal jobs whose bucket is not 0 will not reduce the old aggregate row, leaving queue_counts_exact terminal totals inflated. The migration should rebuild or redistribute existing counter rows instead of skipping whenever the table is non-empty.

Useful? React with 👍 / 👎.

hardbyte and others added 4 commits June 5, 2026 17:02
…ms lease default (#169)

Follow-up to #316 bench post-mortem. The K=3 hysteresis alone wasn't
enough — bench evidence showed rotate_lease still flapping at the 50ms
boundary because "delayed" samples hovered at 51ms and "recovered" at
45-49ms. K=3 accumulated quickly because every other tick crossed.
The branch also kept doing real (slow) work between the flip and the
recovery, contributing to the dead-tuple growth even with backoff
on prune.

Three coordinated changes in awa-worker:

1. Duration-margin + K-consecutive hysteresis. Samples must clearly
   exceed `tick_interval * 3/2` to count as overrun, or be clearly
   below `tick_interval * 7/10` to count as on-time. Anything in the
   deadband leaves both counters alone. Kills the boundary flap
   directly: 51ms-vs-49ms can no longer accumulate K observations.
   Integer ratios — no f64 in the hot path.

2. Per-branch cooldown. On flip-to-delayed, arm
   `cooldown_ticks_remaining = BRANCH_COOLDOWN_TICKS` (120 ticks =
   ~30s at the new 250ms lease_rotate cadence). While non-zero,
   `try_begin` returns None and the branch body is skipped — no
   ACCESS-EXCLUSIVE attempts, no count(*) scans, no contribution to
   the dead-tuple growth. Cooldown re-arms on every overrun
   observation while still delayed, so a chronically-failing branch
   stays quiet. Counters are frozen during cooldown (no body =
   no duration sample), so recovery requires K real on-time samples
   AFTER cooldown expires.

3. Default `lease_rotate_interval` 50ms → 250ms. The 50ms was
   over-spec'd given that prune backoff already throttles the heavy
   work — faster rotation just produced loop noise. 250ms still
   cycles the 8-slot lease ring in ~2s, which is plenty fast for
   reclamation. All other intervals unchanged.

API change: `MaintenanceBranchTracker::begin` is now `try_begin` and
returns `Option<BranchTimer>`. None signals "skip this tick's body";
all 13 select! arms updated to use `if let Some(timer) = ...`. The
existing `BranchTimer::finish()` semantics are preserved.

Tests:
- Updated 6 existing branch-tracker tests for the new API + duration
  thresholds.
- New: deadband_sample_does_not_advance_counters,
  k_consecutive_overruns_flips_and_arms_cooldown,
  cooldown_skips_body, cooldown_expires_then_body_runs,
  cooldown_rearms_on_continued_overrun,
  recovers_only_after_k_ontime_ticks_post_cooldown.
- All 22 maintenance unit tests pass; 39 worker lib tests pass.
Codex and CodeRabbit both independently flagged a P1 bug on PR #318:
without consuming the sample, the branch enters perpetual cooldown
once it flips delayed. Trace:

  - Body N runs slow, record_finish writes last_duration=Some(slow).
  - Tick N+1: try_begin reads but doesn't consume; consecutive_overrun
    saturates past K; flip + cooldown=120; return None.
  - Ticks N+2..N+121: cooldown gate decrements, return None. No body
    runs, no record_finish call, last_duration STAYS Some(slow).
  - Tick N+122 (cooldown drained): try_begin sees the same stale
    Some(slow), saturates overrun again, re-arms cooldown forever.

Worker has to restart to escape. The test helper hid this by writing
last_duration before every iteration — synthetic samples even on
ticks where the body was skipped.

Fix: `state.last_duration.take()` in try_begin's hysteresis check.
Each sample is evaluated exactly once. After flip, the consumed
sample leaves last_duration = None; cooldown drains without writing
anything; first post-cooldown try_begin sees None and skips the
eval, body runs, record_finish writes a fresh sample, next try_begin
evaluates it.

Test fixes:
  - `replay_ticks` no longer seeds last_duration on each iteration.
    It calls record_finish *only* when the body would have run,
    matching production: skipped ticks don't produce samples.
  - New `seed_last_duration` helper for tests that need an initial
    sample to drive the very first try_begin observation.
  - Tests updated for the new K+1 recovery cadence: the very first
    post-cooldown body produces a sample evaluated on the *next*
    try_begin, so K evaluable on-time samples require K+1 ticks
    post-cooldown.

All 22 maintenance tests pass with the corrected semantics. The
production-side fix is the take() — the test fixes are bookkeeping
that exposes (rather than hides) the new contract.
@hardbyte hardbyte force-pushed the issue-295-v06-sequence-cursors branch from 0adaf01 to 872149a Compare June 5, 2026 05:10
@hardbyte
Copy link
Copy Markdown
Owner Author

hardbyte commented Jun 5, 2026

Rebased after #320 merged and resolved the migration collision:

Validation after rebase:

  • cargo fmt --check
  • git diff --check origin/main...HEAD
  • CARGO_TARGET_DIR=target/codex-check cargo check -p awa-model
  • TEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test migration_test -- --nocapture - 37 passed
  • TEST_DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test CARGO_TARGET_DIR=target/codex-check cargo test -p awa --test queue_storage_runtime_test -- --nocapture - 79 passed

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
docs/upgrade-0.5-to-0.6.md (1)

289-292: ⚡ Quick win

Compute cursor values once to keep lag self-consistent.

On Line 289–Line 292, the snippet re-invokes sequence_next_value(...) when computing lag. If those reads are not guaranteed stable per row, displayed enqueue_cursor/claim_cursor can diverge from lag. Compute once, then subtract aliases.

Proposed docs snippet update
-  SELECT priority, enqueue_shard,
-         <queue_storage_schema>.sequence_next_value(enqueues.seq_name) AS enqueue_cursor,
-         <queue_storage_schema>.sequence_next_value(claims.seq_name) AS claim_cursor,
-         <queue_storage_schema>.sequence_next_value(enqueues.seq_name)
-           - <queue_storage_schema>.sequence_next_value(claims.seq_name) AS lag
+  SELECT priority, enqueue_shard, enqueue_cursor, claim_cursor,
+         enqueue_cursor - claim_cursor AS lag
   FROM <queue_storage_schema>.queue_claim_heads AS claims
   JOIN <queue_storage_schema>.queue_enqueue_heads AS enqueues
     USING (queue, priority, enqueue_shard)
+  CROSS JOIN LATERAL (
+    SELECT <queue_storage_schema>.sequence_next_value(enqueues.seq_name) AS enqueue_cursor,
+           <queue_storage_schema>.sequence_next_value(claims.seq_name) AS claim_cursor
+  ) cur
   WHERE queue = 'my_hot_queue'
   ORDER BY priority, enqueue_shard;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/upgrade-0.5-to-0.6.md` around lines 289 - 292, The SQL currently calls
<queue_storage_schema>.sequence_next_value(enqueues.seq_name) and
<queue_storage_schema>.sequence_next_value(claims.seq_name) twice (for
enqueue_cursor/claim_cursor and again when computing lag), which can produce
inconsistent per-row values; update the query to compute each
sequence_next_value once (assign to the aliases enqueue_cursor and claim_cursor
using the existing calls) and compute lag by subtracting those aliases
(enqueue_cursor - claim_cursor) so lag is self-consistent; look for uses of
sequence_next_value, enqueue_cursor, claim_cursor, and lag in the snippet and
change the lag expression to subtract the previously aliased values rather than
re-invoking sequence_next_value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awa-model/migrations/v023_install_queue_storage_substrate.sql`:
- Around line 1262-1303: The migration must rebucket or invalidate existing
aggregates for queue_terminal_live_counts: after adding counter_bucket and
before (or as part of) replacing queue_terminal_live_counts_pkey, update
existing rows so they won't silently drift — either set counter_bucket = (job_id
% 256) if a job_id column exists, or else set terminal_counter_trusted_at = NULL
for all rows and rebuild/aggregate counts under the new PK; ensure you perform
any necessary aggregation (group by ready_slot, queue, priority, enqueue_shard,
counter_bucket) to collapse old single-bucket rows into the new bucketed shape
and then add the new PRIMARY KEY (queue_terminal_live_counts_pkey) so runtime
key changes won’t break consistency.

In `@awa-model/src/queue_storage.rs`:
- Around line 6646-6657: The code collects claim_cursor_advances and calls
self.advance_claim_cursors(pool, &claim_cursor_advances) replaying them in
caller order which can break preconditions when multiple advances target the
same lane out-of-order; modify the path that handles claim_cursor_advances (the
vector built in the loop around cancel_job_tx and consumed by
advance_claim_cursors) to coalesce and sort advances by the lane-identifying
tuple (queue, priority, enqueue_shard) and then apply per-lane advances in
ascending only_if_current before calling advance_claim_cursors so updates for
the same lane are applied in increasing only_if_current order; update the logic
that builds/consumes claim_cursor_advances (the result.claim_cursor_advance from
cancel_job_tx and the advance_claim_cursors call) to perform this
group-by-and-sort step.
- Around line 3048-3059: The current advance_claim_cursors method swallows
failures from advance_claim_cursors_strict, which can leave per-lane claim
sequences permanently stale; update advance_claim_cursors to retry the strict
advance with a bounded retry/backoff (e.g., N attempts with small delays) or
enqueue a durable compensating repair task so transient pool/connection errors
won't be dropped; specifically modify advance_claim_cursors to call
advance_claim_cursors_strict in a retry loop (or schedule a follow-up repair
job) and ensure failures are surfaced/logged after final attempt so
queue_claimer_signal and queue_counts_exact don't observe permanent stale state.

---

Nitpick comments:
In `@docs/upgrade-0.5-to-0.6.md`:
- Around line 289-292: The SQL currently calls
<queue_storage_schema>.sequence_next_value(enqueues.seq_name) and
<queue_storage_schema>.sequence_next_value(claims.seq_name) twice (for
enqueue_cursor/claim_cursor and again when computing lag), which can produce
inconsistent per-row values; update the query to compute each
sequence_next_value once (assign to the aliases enqueue_cursor and claim_cursor
using the existing calls) and compute lag by subtracting those aliases
(enqueue_cursor - claim_cursor) so lag is self-consistent; look for uses of
sequence_next_value, enqueue_cursor, claim_cursor, and lag in the snippet and
change the lag expression to subtract the previously aliased values rather than
re-invoking sequence_next_value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 768e6d67-e941-4795-a108-4c5dfd035231

📥 Commits

Reviewing files that changed from the base of the PR and between 879dead and 872149a.

📒 Files selected for processing (18)
  • awa-model/migrations/v022_delete_compat_terminal_counter.sql
  • awa-model/migrations/v023_install_queue_storage_substrate.sql
  • awa-model/migrations/v027_sequence_lane_cursors.sql
  • awa-model/src/admin.rs
  • awa-model/src/migrations.rs
  • awa-model/src/queue_storage.rs
  • awa-worker/src/client.rs
  • awa-worker/src/maintenance.rs
  • awa/tests/chaos_suite_test.rs
  • awa/tests/postgres_failover_smoke_test.rs
  • awa/tests/queue_storage_benchmark_test.rs
  • awa/tests/queue_storage_runtime_test.rs
  • awa/tests/telemetry_test.rs
  • docs/adr/008-copy-batch-ingestion.md
  • docs/adr/019-queue-storage-redesign.md
  • docs/architecture.md
  • docs/configuration.md
  • docs/upgrade-0.5-to-0.6.md

Comment thread awa-model/migrations/v023_install_queue_storage_substrate.sql
Comment thread awa-model/src/queue_storage.rs
Comment thread awa-model/src/queue_storage.rs
@hardbyte hardbyte merged commit cb6c391 into main Jun 6, 2026
13 checks passed
@hardbyte hardbyte deleted the issue-295-v06-sequence-cursors branch June 6, 2026 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants