Skip to content

perf(queue_storage): live terminal counter for queue_counts_exact #290

@hardbyte

Description

@hardbyte

Context

PR #289 added QueueStorage::queue_counts_fast — an index-only depth probe for high-cadence pollers (admin dashboards, depth-target producers, soak observability). It is not a replacement for queue_counts_exact: it under-counts completed while the live done_entries segment hasn't rolled up, and the exact path is still what admin tooling reaches for when the live terminal segment matters.

queue_counts_exact itself stays slow at deep backlog. Two CTEs dominate:

  • live_terminal: count(*) FROM done_entries WHERE queue = $1. The per-slot done_entries child tables have a composite index leading with (queue, priority, enqueue_shard, lane_seq), so a queue-only filter can in principle index-only-scan — but that still has to count every matching terminal row, and append-heavy recent rows aren't reliably index-only until vacuum marks pages all-visible.
  • live_running's lease_claims × lease_claim_closures anti-join — separate problem, not in scope here.

A naive done_entries(queue) index would mostly duplicate the leading columns of the existing composite index. It might cut buffer reads, but it's still O(live terminal rows) per call.

The better shape is an exact rollup that makes queue_counts_exact O(queue slots × shards), not O(done rows).

Proposal

Add a live terminal counter alongside queue_terminal_rollups.pruned_completed_count, keyed naturally on the same dimensions the hot path already partitions on:

Counter Key shape
queue_terminal_live_counts.completed (ready_slot, queue, priority, enqueue_shard)

(Or (ready_slot, queue, priority) if enqueue_shard granularity isn't worth the extra rows.)

Write paths to update:

  • insert_done_rows_tx — group by (ready_slot, queue, priority, enqueue_shard) and INSERT … ON CONFLICT … DO UPDATE SET completed = completed + EXCLUDED.completed once per group per batch.
  • Terminal row removal: retry_job_tx (terminal → retryable), move_failed_to_dlq (terminal → DLQ), any other terminal-deleting path. Decrement by the moved count.

Prune path: when a ready_slot is pruned, fold its live counters into queue_terminal_rollups.pruned_completed_count for that (queue, priority) (sum or max, matching the existing aggregator), then DELETE FROM queue_terminal_live_counts WHERE ready_slot = $slot.

Exact query becomes:

exact_completed =
    sum(queue_terminal_rollups.pruned_completed_count) +
    sum(queue_terminal_live_counts.completed)
WHERE queue = ANY($1)

Both are bounded by num_ready_slots × num_priority × num_enqueue_shards. No done_entries scan.

Risks

  • Write-path contention: the new counter is hit by every insert_done_rows_tx batch. Keep it naturally sharded by (ready_slot, queue, priority, enqueue_shard) rather than a single hot (queue) row to avoid serialising completions.
  • WAL / throughput regression: each batch grows by N upserts (one per group). Benchmark before merging. Worst-case skew is single-queue single-priority single-shard workloads where every batch hits the same row — measure that explicitly.
  • Correctness on counted-removal paths: retry_job_tx and DLQ move have to decrement the correct (ready_slot, queue, priority, enqueue_shard) group, which means they need to read the row being moved before decrementing. Verify by hand against each terminal-removal call site.

Out of scope here

  • The lease_claims × lease_claim_closures anti-join cost in live_running. Still slow at high open-claim counts in receipt-plane modes. Worth a separate ticket.
  • Bench harness polling cadence (separate small change to switch the long-horizon depth poller off the 200 ms tight loop).
  • The "could we just add done_entries(queue) and move on?" branch — worth measuring before discarding, but the bet is that it cuts I/O but not the row count and so doesn't pay back the index maintenance cost. PR perf(queue_storage): add queue_counts_fast for high-cadence pollers #289 ships the depth-probe API regardless.

Empirical first step (optional)

Before the denormalisation lands, a narrow regression bench against the current composite index — count(*) FROM done_entries_$slot WHERE queue = $1 per slot, summed in the harness — would calibrate how much of the gap a queue-only index could close. If that's already within target latency for admin UI, the denormalisation may be overkill. The expectation is that it isn't.

Refs: #289, ADR-019, ADR-026 (narrow terminal history).

Acceptance criteria

  • queue_counts_exact.completed no longer scans done_entries — the
    query reads only queue_terminal_rollups.pruned_completed_count and
    the new queue_terminal_live_counts table, bounded by
    num_ready_slots × num_priority × num_enqueue_shards.
  • Every code path that adds or removes a terminal row keeps the
    counters correct: completion (insert_done_rows_tx), retry-from-
    terminal (retry_job_tx), DLQ move (move_failed_to_dlq and the
    DLQ-on-fail completion branches), discard, and prune (folds slot
    live counts into pruned_completed_count then clears).
  • Tests cover increment (completion), decrement (retry-from-terminal,
    DLQ move, discard), prune-fold (live + rollup converge), and
    recovery/rebuild from done_entries if such a path is added.
  • Benchmark includes the skewed worst case — one queue, one priority,
    one enqueue shard, high completion rate — so the write-path
    contention on the resulting hot counter row is measured before
    merging. Compare to the current queue_counts_exact cost at the
    same workload to confirm the trade is net positive at the relevant
    backlog sizes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureNew functionalityoperationalOperational tooling and configuration

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions