Claim hot-path slows ~33% with deadline rescue enabled at high concurrency #246

@hardbyte

Context

At 1×256 worker concurrency, with the depth-target producer driving awa.queue_lanes near saturation, enabling per-claim deadline rescue cuts steady-state throughput on the claim hot path by ~33% and raises end-to-end p99 latency by ~50%. The cost is spread across the whole claim path rather than concentrated in a single rescue-specific query, which suggests structural rather than additive overhead.

Reproduction shape: awa-bench long_horizon scenario, single replica, 256 workers, depth-target = 4000, JOB_WORK_MS=1, producer-mode=depth-target, 30 s warmup + 120 s clean. Two cells, identical except LEASE_DEADLINE_MS:

metric                         rescue OFF (LEASE_DEADLINE_MS=0)   rescue ON (library default)   Δ
completion_rate                5,419 jobs/s                       3,649 jobs/s                  −33 %
offered (depth-target burst)   3,905 jobs/s                       2,752 jobs/s                  −30 %
end_to_end_p99_ms              100 ms                             151 ms                        +51 %

What pg_stat_statements says

pg_stat_statements_reset() immediately before each cell, snapshot at +90 s into the clean phase. Top hot-path queries (mean_exec_time, ms):

query                                                                OFF mean   ON mean   Δ
UPDATE awa.queue_enqueue_heads SET next_seq = next_seq + $3 ...      14.8       18.0      +22 %
INSERT INTO awa.queue_enqueue_heads ON CONFLICT DO NOTHING           5.6        7.3       +30 %
SELECT ... FROM awa.claim_ready_runtime(...)                         1.28       1.60      +25 %
INSERT INTO awa.queue_lanes ON CONFLICT DO NOTHING                   2.9        3.3       +14 %
INSERT INTO awa.queue_claim_heads ON CONFLICT DO NOTHING             2.8        3.3       +18 %
UPDATE awa.queue_lanes ... deltas ...                                2.2        2.5       +13 %
UPDATE awa.queue_lanes SET available_count = ...                     1.7        2.0       +18 %

Two observations from this data:

  1. No new query in the top-50 with rescue ON. If the rescue scanner were a periodic standalone scan firing visibly, it should appear. It doesn't. Either it runs inside the regular claim path (e.g. force-close happens inline in claim_ready_runtime), or it is batched at a low enough cadence that, even at a 33 % throughput cost, it stays below the noise floor.
  2. The cost is uniform, not localized. Every hot query is 13–30 % slower with rescue ON. That's the fingerprint of contention on a shared resource (lock contention, buffer-cache pressure, index walk cost), not "the rescue path itself adds N ms per claim."
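A minimal sketch of the arithmetic behind both observations, using only the means from the table above. The helper names (`slowdown_pct`, `summarize`, `new_queries`) are illustrative and not part of awa or awa-bench. Deltas are recomputed from the rounded means, so an individual row can differ by a point from the Δ column.

```python
# Mean execution times (ms) copied from the table above: (rescue OFF, rescue ON).
MEANS_MS = {
    "UPDATE awa.queue_enqueue_heads SET next_seq": (14.8, 18.0),
    "INSERT INTO awa.queue_enqueue_heads": (5.6, 7.3),
    "SELECT FROM awa.claim_ready_runtime": (1.28, 1.60),
    "INSERT INTO awa.queue_lanes": (2.9, 3.3),
    "INSERT INTO awa.queue_claim_heads": (2.8, 3.3),
    "UPDATE awa.queue_lanes deltas": (2.2, 2.5),
    "UPDATE awa.queue_lanes SET available_count": (1.7, 2.0),
}


def slowdown_pct(off_ms, on_ms):
    """Relative change in mean execution time, in percent."""
    return (on_ms - off_ms) / off_ms * 100.0


def summarize(means):
    """Return per-query slowdowns plus the min, max and spread across queries."""
    deltas = {q: slowdown_pct(off, on) for q, (off, on) in means.items()}
    lo, hi = min(deltas.values()), max(deltas.values())
    return deltas, lo, hi, hi - lo


def new_queries(off_queries, on_queries):
    """Queries that appear only in the rescue-ON snapshot (observation 1)."""
    return sorted(set(on_queries) - set(off_queries))


deltas, lo, hi, spread = summarize(MEANS_MS)
for query, pct in deltas.items():
    print(f"{pct:+6.1f} %  {query}")
print(f"range {lo:.1f} % .. {hi:.1f} %, spread {spread:.1f} points")
```

Every query lands in the same band and none is an outlier, which is what "uniform, not localized" means here. Feeding the two top-50 query lists into `new_queries` is the check behind the first observation.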

The body of claim_ready_runtime documents the mechanism in a comment:

deadline_at is the per-claim deadline when the queue has a non-zero deadline_duration; the deadline-rescue path scans expired rows (anti-joined with closures and leases — same disambiguation as the heartbeat-rescue path) and force-closes them.

Hypothesis

The most consistent explanation for uniform per-query slowdown without a visible new query is:

awa.lease_claims working set inflates when rescue is on, and every other query that touches lease_claims (every claim, every completion) pays the cost in index lookup time / page fetches / lock contention.

That'd happen if claims awaiting force-close linger in lease_claims longer than they would when deadline_at IS NULL (where the table only ever sees a row for the duration of an in-flight job).

Equally consistent secondary hypothesis: the rescue scanner is doing a sequential scan or partial-index miss across lease_claims, holding short locks that serialize against the regular claim INSERT/UPDATE path.

Proposed experiments

In rough order of cheap → less cheap:

  1. pg_stat_user_tables.n_live_tup snapshot of awa.lease_claims at end-of-clean in both cells. If rescue ON shows materially more rows than rescue OFF at steady state, the working-set inflation hypothesis is strongly supported (n_live_tup is an estimate, so small differences are not meaningful). ~5 minutes of work.
  2. Add a partial index on awa.lease_claims (deadline_at) WHERE deadline_at IS NOT NULL and re-run rescue ON. If the per-claim latency tax drops, the rescue scanner is doing an avoidable seqscan / full-index walk.
  3. Tune the rescue scanner's batch size / wake interval (or, if it's currently inline in claim_ready_runtime, hoist it out to a background task with its own cadence). Look for the knee where the scanner reaps fast enough to keep lease_claims small but doesn't dominate the claim path.
  4. Flame graph of awa-bench at 1×256 with rescue on vs off. Would directly confirm whether the time goes to PG roundtrips, lock waits, or in-process work in the awa-worker claim loop.

The first two are non-invasive and should be enough to localize the cost. (3) and (4) are needed if the first two come back ambiguous.
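Experiment (1) could be scripted roughly as below. This is a sketch under assumptions: it uses psycopg 3 as the client (any driver would do), it sums over direct partitions via pg_inherits because awa.lease_claims is partitioned, and the `TableStats` / `inflation_ratio` names are made up for illustration. The scan counters are cumulative since the last stats reset, so they only mean something as a difference between a snapshot taken before the clean phase and one taken after it.

```python
from dataclasses import dataclass

# Sums stats for awa.lease_claims and its direct partitions (one level only).
LEASE_CLAIMS_STATS_SQL = """
SELECT coalesce(sum(s.n_live_tup), 0)::bigint AS live,
       coalesce(sum(s.n_dead_tup), 0)::bigint AS dead,
       coalesce(sum(s.seq_scan), 0)::bigint   AS seq_scans,
       coalesce(sum(s.idx_scan), 0)::bigint   AS idx_scans
FROM pg_stat_user_tables AS s
WHERE s.relid = 'awa.lease_claims'::regclass
   OR s.relid IN (SELECT inhrelid FROM pg_inherits
                  WHERE inhparent = 'awa.lease_claims'::regclass)
"""


@dataclass(frozen=True)
class TableStats:
    live: int
    dead: int
    seq_scans: int
    idx_scans: int


def fetch_stats(dsn):
    """Take one snapshot. Needs a live database and psycopg 3 installed."""
    import psycopg  # third-party; imported lazily so the helpers below work without it

    with psycopg.connect(dsn) as conn:
        row = conn.execute(LEASE_CLAIMS_STATS_SQL).fetchone()
    return TableStats(*(int(v) for v in row))


def inflation_ratio(off, on):
    """How many times larger the live row estimate is with rescue ON."""
    return on.live / max(off.live, 1)


def seq_scans_during(before, after):
    """Sequential scans started between two snapshots of the same cell."""
    return after.seq_scans - before.seq_scans
```

A ratio well above 1 at end-of-clean points at working-set inflation. A non-zero `seq_scans_during` in the rescue ON cell, with none in the OFF cell, would point at the secondary hypothesis and make experiment (2) the next step.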

Safety considerations

  • Don't disable deadline_at writes by default to fix this. Per-claim deadline rescue is the documented fallback path for stuck workers under correctness invariants the chaos suite relies on; the wrong fix would be making the bench look better at the cost of a real reliability mechanism.
  • The fix should not change the visible behavior of force-close. If a job's deadline expires, it must still be force-closed and re-eligible for claim. Both inline (current) and background (proposed) implementations need to preserve this.
  • The partial-index proposal needs a migration plan. awa.lease_claims is partitioned in the current schema; the partial index needs to be defined on each partition and on the partitioned parent.
  • If the working-set hypothesis is right, there's a knock-on effect for chaos scenarios. A larger steady-state lease_claims table also means slower force-close on a true wedge, because the rescue scan walks more rows. Worth measuring how long it takes a wedged claim to be force-closed at high concurrency before merging any tuning that lets the table grow.
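One possible shape for the partial-index migration is sketched below. It follows the usual PostgreSQL pattern for partitioned tables: create the index on the parent with ON ONLY, build each partition's index with CONCURRENTLY, then attach it. The index names and the partition names in the usage note are hypothetical, and the sketch assumes every partition lives in the same schema as the parent. The function only generates DDL; it does not run it.

```python
def partial_index_ddl(parent, partitions):
    """Build DDL for a partial index on deadline_at across a partitioned table.

    parent and partitions are schema-qualified names such as "awa.lease_claims".
    """
    schema, table = parent.split(".", 1)
    parent_idx = f"{table}_deadline_at_idx"
    predicate = "WHERE deadline_at IS NOT NULL"

    # ON ONLY creates the parent index without cascading to partitions;
    # it stays invalid until every partition index has been attached.
    stmts = [f"CREATE INDEX {parent_idx} ON ONLY {parent} (deadline_at) {predicate};"]

    for part in partitions:
        child_idx = f"{part.split('.', 1)[1]}_deadline_at_idx"
        # CONCURRENTLY avoids blocking writes, but cannot run inside a
        # transaction block, so each statement must be executed on its own.
        stmts.append(
            f"CREATE INDEX CONCURRENTLY {child_idx} ON {part} (deadline_at) {predicate};"
        )
        stmts.append(
            f"ALTER INDEX {schema}.{parent_idx} ATTACH PARTITION {schema}.{child_idx};"
        )
    return stmts
```

For example, `partial_index_ddl("awa.lease_claims", ["awa.lease_claims_p0", "awa.lease_claims_p1"])` yields five statements. The real partition list would come from pg_inherits rather than being hard-coded, and new partitions created after the migration pick up the index from the parent automatically.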

Reproducer

awa-bench adapter from the postgresql-job-queue-benchmarking repo at branch bench/2026-05-07-awa-alpha6-pgque-rc1:

docker compose up -d --wait
LEASE_DEADLINE_MS=0 \
  uv run bench run --systems awa --replicas 1 \
    --producer-rate 50000 --producer-mode depth-target --target-depth 4000 \
    --worker-count 256 \
    --phase warmup=warmup:30s --phase clean=clean:120s

# repeat without LEASE_DEADLINE_MS to use library default

Tested against awa 0.6.0-alpha.6 and 0.6.0-alpha.7; both reproduce the same shape.

pg_stat_statements snapshots are committed under results/2026-05-08-rescue-perf-probe/snapshots/ for direct inspection.
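To run both cells back to back with identical flags, a small driver along these lines may help. It is a sketch, not part of the benchmarking repo: the argument list is copied from the command above, and `cell_env`, `run_cell` and `run_both` are hypothetical helper names. The only difference between the cells is whether LEASE_DEADLINE_MS is set to 0 or left unset.

```python
import os
import subprocess

# Same invocation as the reproducer command above.
BENCH_ARGV = [
    "uv", "run", "bench", "run", "--systems", "awa", "--replicas", "1",
    "--producer-rate", "50000", "--producer-mode", "depth-target",
    "--target-depth", "4000", "--worker-count", "256",
    "--phase", "warmup=warmup:30s", "--phase", "clean=clean:120s",
]


def cell_env(rescue_on, base=None):
    """Environment for one cell; only LEASE_DEADLINE_MS differs between cells."""
    env = dict(os.environ if base is None else base)
    if rescue_on:
        # Unset, so the library default deadline applies.
        env.pop("LEASE_DEADLINE_MS", None)
    else:
        env["LEASE_DEADLINE_MS"] = "0"
    return env


def run_cell(rescue_on):
    """Run one benchmark cell and raise if the bench command fails."""
    subprocess.run(BENCH_ARGV, env=cell_env(rescue_on), check=True)


def run_both():
    """Rescue OFF first, then rescue ON, from the repo root after compose is up."""
    for rescue_on in (False, True):
        run_cell(rescue_on)
```

Call `run_both()` after `docker compose up -d --wait`. Popping the variable in the ON cell matters when the calling shell already exports LEASE_DEADLINE_MS, since a stale value would otherwise silently change the cell.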
