diff --git a/AGENTS.md b/AGENTS.md index ef9907e..783c248 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -110,6 +110,11 @@ style. Match the surrounding file. `SlotHeader`, `SchemaInfo`) is guarded by `MAGIC` + `VERSION`. Any layout change requires a `VERSION` bump and updating the `static_assert`s in `types.h`. +- **ARCHITECTURE.md ships with the change.** Any commit touching the + `types.h` encoding, a `Region.h` contract, or the publish / receive / + repair paths must include its ARCHITECTURE.md delta in the same commit. + The doc is the only map of the lock-free invariants -- a stale claim + there is a footgun armed for the next reader. ### 5.4 Comments diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 815c5dd..9a36d1a 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -127,9 +127,9 @@ The region header is self-describing and forward-compatible: Header (at offset 0) ┌───────────────────────────────────────────────────────────┐ │ magic (atomic) 0x4B49434B4D534721 ("KICKMSG!") │ -│ version 4 │ +│ version 7 │ │ channel_type PubSub | Broadcast │ -│ total_size total mmap size in bytes │ +│ total_size total mmap size in bytes │ │ sub_rings_offset byte offset to first subscriber ring │ │ pool_offset byte offset to slot pool │ │ max_subs max subscriber slots │ @@ -148,6 +148,8 @@ Header (at offset 0) │ schema_state Unset | Claiming | Set (atomic u32) │ │ schema_data SchemaInfo — 512 B, 8 cache lines │ │ free_top Treiber stack head (atomic u64) │ +│ steal_count stale entries stolen by repair (u64) │ +│ identity_hash logical-identity stamp (u64, 0=unset) │ └───────────────────────────────────────────────────────────┘ ``` @@ -204,10 +206,10 @@ The descriptor is published via a three-state atomic (`schema_state`): ``` try_claim_schema() memcpy schema_data - ┌───── CAS ─────────►┌─────────────────────────►┌────────┐ - │ Unset │ Claiming │ Set │ + ┌───── CAS ─────────►┌─────────────────────────►┌─────────┐ + │ Unset │ Claiming │ Set │ │ (state = 0) │ (state = 1) │(state=2)│ - └────────────────────┘ payload bytes written └────────┘ + └────────────────────┘ payload bytes written └─────────┘ between Claiming and Set ``` @@ -258,14 +260,16 @@ region. first, here's what they wrote, you decide." The library never throws on schema grounds. -Version bumped `4 → 5` for the cross-process runtime counters -added to `SubRingHeader` (`dropped_count`, `lost_count` — see the -Subscriber Ring section for their layout and semantics). The -fields fit inside the existing `write_pos` cache-line padding, so -`sizeof(SubRingHeader)` is unchanged, but the in-memory meaning of -the bytes following `has_waiter` changed — v4 and v5 cannot share -a region. Pre-v5 binaries are rejected at `open()` by the existing -version check. +Layout version history since the schema descriptor (v4): v5 added +the cross-process runtime counters to `SubRingHeader` +(`dropped_count`, `lost_count` -- see the Subscriber Ring section); +v6 added the ring owner identity (`owner_pid`, `owner_starttime`) +that `reclaim_dead_rings()` keys on; v7 changed the entry commit +protocol (position-tagged locks, skip markers -- see the +sequence-word encoding below) and added `steal_count` and +`identity_hash` to the header. Each bump changes the in-memory +meaning of shared bytes, so mixed-version binaries cannot share a +region; mismatches are rejected at `open()` by the version check. ## Treiber Free Stack @@ -310,7 +314,7 @@ contains a sequence number, slot index, and payload length -- all atomic. ``` Ring[0] ┌──────────────────────────────────────────────────────────┐ -│ state: Live in_flight: 0 write_pos: AtomicU64 = 42 │ +│ state: Live in_flight: 0 write_pos: AtomicU64 = 42 │ │ │ │ entries[0..7]: │ │ ┌─────┬───────────┬──────────┬─────────────┐ │ @@ -330,7 +334,7 @@ Ring[0] - **Capacity** must be a power of 2 (index masking: `pos & (cap - 1)`). - **state_flight** (atomic uint32): packed `[in_flight:30 | state:2]`. - State bits: `Free(0)`, `Live(1)`, `Draining(2)`. + State bits: `Free(0)`, `Live(1)`, `Draining(2)`, `Reclaiming(3)`. In_flight bits: number of publishers currently admitted to this ring. Packing into a single variable eliminates cross-variable ordering concerns: the publisher's CAS atomically checks state and increments @@ -353,7 +357,7 @@ Ring[0] per instance. - **Sequence number** is monotonically increasing (`pos + 1`), used as a seqlock for data consistency validation and as a commit barrier between - publishers (see Publish Flow below). + publishers. The word is tag-encoded (see the encoding table below). - Stale entries (sequence < subscriber's expected) are detected and reported as lost messages. @@ -363,6 +367,34 @@ cache line: the hot path already owns this line when incrementing coherency traffic. Different rings target different lines (128 B stride), so publishers/subscribers on distinct rings never contend. +### Entry sequence-word encoding + +Each entry's 64-bit `sequence` packs a 2-bit tag (bit 63 = lock bit, +bit 62 = repair bit) and a 62-bit value: + +``` +[tag:2 | value:62] + 00 committed word is pos + 1; slot_idx / payload_len valid + 01 skip marker word carries pos + 1; committed but EMPTY -- + metadata untrustworthy by design + 10 locked a publisher is mid-commit at pos + (position-tagged lock, unique per position) + 11 repair-locked a repairer owns the entry mid-steal +``` + +Lock values are unique -- a position is claimed by exactly one +publisher, which locks it at most once -- so an unchanged lock value +across a full timeout window proves a single holder spanned it: the +staleness proof every steal relies on. + +Stolen entries are committed as **skip markers**, not plain +sequences. The stolen-from publisher's late metadata stores can land +at any time after the steal, so nothing may ever trust `slot_idx` / +`payload_len` under tag 01: subscribers count a skip-marked position +as one lost message without reading its metadata, and the eviction +path never releases a slot through a skip-tagged entry. The victim's +stores are harmless by construction, not by timing. + ### Subscriber join and visibility window A subscriber joins by CAS-ing a Free ring to Live. The CAS expects @@ -380,23 +412,33 @@ of death. A slow-but-alive publisher could still execute its pending `fetch_sub`, causing the same underflow. **Publisher self-repair**: when a publisher times out on a stuck -entry, it heals the entry in place (3 stores, ~10 ns) so the next -publisher at that position succeeds without timeout. This handles -both Case A (LOCKED_SEQUENCE) and Case B (stale entry >1 wrap behind). +entry, it heals the entry in place (a CAS steal, then a skip-marker +commit) so the next publisher at that position succeeds without +timeout. This handles both Case A (stale position-tagged lock, only +when one lock value provably spanned the whole timeout) and Case B +(stale entry >1 wrap behind). The operator primitives `repair_locked_entries()` and `reset_retired_rings()` remain available for defense-in-depth and health monitoring, but most crash residue is now self-healed by publishers on the hot path. -**Ordering invariant**: the subscriber captures `write_pos` BEFORE -the CAS to Live, not after (`Subscriber.cc`: `write_pos.load(acquire)` -then `state_flight.compare_exchange_strong(Free, Live, acq_rel)`). -Once the ring is Live, publishers can immediately -`fetch_add(write_pos)`, racing with the subscriber's read. Capturing -first guarantees `start_pos_ <= any position a publisher can claim -after seeing Live`. Without this ordering, the subscriber's -`drain_unconsumed` window `[start_pos_, wp)` can miss entries -committed between the CAS and the read — a refcount leak. +**Ordering invariant**: the subscriber reads `write_pos` BEFORE the +CAS to Live (`Subscriber.cc`: `write_pos.load(acquire)` then +`state_flight.compare_exchange_strong(Free, Live, acq_rel)`), and +reads it AGAIN after winning the CAS. The two reads serve different +purposes: + +- The pre-CAS value becomes `start_pos_`, the drain floor. Once the + ring is Live, publishers can immediately `fetch_add(write_pos)`, + racing with the read; capturing first guarantees `start_pos_ <= any + position a publisher can claim after seeing Live`. Without that, + `drain_unconsumed`'s window `[start_pos_, wp)` could miss entries + committed between the CAS and the read -- a refcount leak. +- The post-CAS value seeds `read_pos_` (the larger of the two reads + wins). The pre-CAS value can be stale -- there is no happens-before + edge to the previous tenant's final position -- and consuming from + it would replay messages left over from the previous tenancy. + **Anyone editing the Subscriber constructor must preserve this order.** A newly joined subscriber may miss a small number of in-flight @@ -447,75 +489,108 @@ Publisher | a single LDADDAL on AArch64 (LSE). | |-- If ring full (wrap): - | wait_and_capture_slot() Spin-wait (check clock every 1024 + | wait_for_commit() Spin-wait (check clock every 1024 | | iterations) up to commit_timeout - | | (default 100ms). - | |-- Committed: Capture slot_idx (old_slot). - | | Release is DEFERRED until after lock - | | CAS succeeds (see below). - | '-- Timeout (crash): Previous writer crashed. old_slot = - | INVALID_SLOT. The pool slot referenced - | by the abandoned entry is leaked - | (recoverable by GC). The ring position - | is poisoned until repair_locked_entries(). + | | (default 10 ms). + | |-- Committed: Proceed to the lock CAS. The previous + | | occupant's slot is released AFTER the + | | lock succeeds, from a post-lock read + | | of slot_idx (see below). + | '-- Timeout (crash or Records whether ONE position-tagged + | stall): lock value spanned the whole wait — + | the proof self-repair needs before + | stealing the lock. | |-- Two-phase commit: | | | | Phase 1 - CAS lock: - | | CAS entry.sequence Atomically swap from prev_seq to - | | prev_seq -> LOCKED LOCKED_SEQUENCE (UINT64_MAX). - | | This exclusively owns the entry: - | | no other publisher can CAS from - | | LOCKED_SEQUENCE since they expect - | | prev_seq. - | | Retry up to 64 times If another publisher holds the lock - | | if expected == (entry is LOCKED_SEQUENCE), retry. - | | LOCKED_SEQUENCE The holder will release quickly - | | (just two relaxed stores + one - | | release store). - | | If expected is neither Entry was committed by another - | | prev_seq nor LOCKED publisher. excess++, give up - | | on this ring. + | | CAS entry.sequence CAS from the OBSERVED previous value + | | observed -> -- a plain commit at prev_seq OR a + | | seq_lock(pos) skip marker carrying prev_seq -- to + | | seq_lock(pos), the position-tagged + | | lock [tag:2|pos:62]. This exclusively + | | owns the entry: no other publisher + | | can CAS from a lock value since they + | | expect an unlocked word at prev_seq, + | | and every locker's value is unique + | | (one publisher per position, locks + | | at most once). Whether the + | | predecessor was a skip marker is + | | remembered for the slot release + | | below. + | | Retry up to 64 times If another publisher holds the lock, + | | if locked reload and retry. The holder will + | | release quickly (two relaxed stores + | | + one CAS commit). + | | If unlocked but not Entry was committed by another + | | at prev_seq writer. Lock failure path below. | | | | Lock failure: - | | state_flight.fetch_sub Release admission — this ring's - | | (IN_FLIGHT_ONE, rel.) in_flight was incremented during the - | | CAS admission step above. Must be - | | decremented on every exit path, - | | otherwise the subscriber destructor - | | spin-waits on in_flight forever. - | | Self-repair: If the entry is stuck (LOCKED or - | | if LOCKED or stale: >1 wrap stale), advance it so the - | | store INVALID_SLOT NEXT publisher at this position - | | store seq = expected succeeds without timeout. - | | excess++, continue Do NOT release old_slot — between - | | capture and now, the entry may have - | | been overwritten. old_slot could - | | belong to a newer generation. The + | | Self-repair: Steal a provably-stale entry (lock + | | CAS to seq_repair(pos) value stable across the whole wait, + | | store INVALID_SLOT or committed >1 wrap behind) and + | | commit seq_skip(pos) commit it as an EMPTY SKIP MARKER so + | | the NEXT publisher at this position + | | succeeds without timeout. The CAS + | | backs off if a live writer wins. + | | Do NOT release any slot -- the entry + | | was never ours. The stale occupant's | | unreleased ref is a bounded leak - | | (1 per drop), recoverable by GC. + | | (1 per steal), recoverable by GC. + | | abandon_delivery(): Count the drop, then release + | | dropped_count++ admission -- this ring's in_flight + | | state_flight.fetch_sub was incremented at CAS admission and + | | (IN_FLIGHT_ONE, rel.) must be decremented on every exit + | | Dekker wake path, otherwise the subscriber + | | (fence + has_waiter destructor spin-waits forever. + | | + futex_wake_all) Then the SAME wake as the success + | | path: write_pos already advanced + | | (the position may now carry a skip + | | marker), so without the wake a + | | parked subscriber sleeps out its + | | full timeout. + | | excess++, continue + | | + | | Lock success -- release previous occupant: + | | If NOT prev_was_skip After locking, we own the entry; the + | | and pos >= capacity: lock-CAS acquire pairs with the + | | read e.slot_idx previous committer's release, so even + | | if in bounds: a commit that landed after our wait + | | release_slot(it) timed out is seen and released here. + | | INVALID (drain marker) fails the + | | bound check: drain_unconsumed already + | | released this ring's reference -- + | | releasing again would double- + | | decrement. A SKIP predecessor is + | | never released: its metadata is + | | untrustworthy by design; the stolen + | | entry's ref is left for GC. Below + | | one wrap there is no predecessor + | | (a zero-initialized slot_idx would + | | read as valid slot 0). | | - | | Lock success — deferred release: - | | Re-read e.slot_idx After locking, we own the entry. - | | If slot_idx != INVALID: Release old_slot (this ring's - | | release_slot(old_slot) reference to the previous occupant). - | | If slot_idx == INVALID: drain_unconsumed already released - | | skip release this ring's reference. Releasing - | | again would double-decrement. - | | Why deferred? TOCTOU: between wait_and_capture_slot reading - | | slot_idx and the lock CAS, another publisher (or drain) can - | | modify the entry. Releasing before lock risks corrupting a - | | live slot's refcount. After lock, no concurrent modification. + | | Theft guard: If sequence != seq_lock(pos), a + | | reload entry.sequence repairer proved our lock stale (we + | | if not ours: drop stalled past commit_timeout) and owns + | | the entry. Storing data now would + | | tear the repaired entry. | | | | Write entry fields (relaxed, safe because we hold the lock): | | entry.slot_idx = 3 | | entry.payload_len = 128 | | - | | Phase 2 - commit: - | ' entry.sequence = 43 Release-store commits the entry. - | Subscribers and future publishers - | at this position will see all - | preceding stores. + | | Phase 2 - CAS commit: + | ' CAS entry.sequence CAS, not a blind store: fails only if + | seq_lock(pos) -> 43 a repairer stole the lock after the + | theft guard -- the entry is then a + | committed skip marker, and our late + | slot_idx/payload_len stores landed + | under tag 01, which nothing ever + | trusts: permanently harmless. We + | record a drop (abandon_delivery). + | Release on success: subscribers and + | future publishers at this position + | see all preceding stores. | |-- state_flight.fetch_sub Release admission — subscriber | (IN_FLIGHT_ONE, release) destructor can now observe @@ -546,12 +621,12 @@ Publisher Without the lock, two publishers that CAS `write_pos` to adjacent positions could interleave their `slot_idx` and `sequence` stores on -overlapping entries (after a ring wrap). The `LOCKED_SEQUENCE` sentinel +overlapping entries (after a ring wrap). The position-tagged lock prevents this: only one publisher at a time can write an entry's data -fields, and the final release-store of the real sequence makes the -entry visible atomically. +fields, and the final CAS commit of the real sequence makes the entry +visible atomically — or fails, detectably, if the lock was stolen. -Subscribers treat `LOCKED_SEQUENCE` the same as "not yet committed" +Subscribers treat any locked value the same as "not yet committed" and return `nullopt`, so the lock is invisible to them except as a brief delay. @@ -599,17 +674,25 @@ Subscriber X (read_pos_ = 41, local) 2. entry = entries[41 & mask] Read the ring entry. seq1 = entry.sequence (acquire) | - | Four outcomes: + | Five outcomes: | |-- seq1 == expected (42) Data ready -> proceed to read. | + |-- seq1 is a skip marker A repair stole this position from a + | carrying expected (42) stalled publisher. Committed but + | lost_++ EMPTY: the metadata is untrustworthy + | read_pos_++ by design and never read. Count one + | continue lost, retry next entry. + | |-- seq1 > expected (42) Subscriber fell behind. The entry - | (e.g. seq1 = 47) was overwritten while we weren't - | lost_ += (47 - 42) looking. Skip ahead, count as lost. - | read_pos_++ Continue loop -> retry next entry. - | continue + | lost_++ was overwritten while we weren't + | read_pos_++ looking. Count one lost, retry next + | continue entry. (Falling more than a full + | ring behind is detected earlier, + | against write_pos, and skipped in + | bulk.) | - |-- seq1 == LOCKED_SEQUENCE A publisher is mid-commit on this + |-- seq1 is a lock value A publisher is mid-commit on this | return nullopt entry. Come back later. | '-- seq1 < expected (42) Entry not yet committed. A publisher @@ -618,10 +701,10 @@ Subscriber X (read_pos_ = 41, local) sequence yet. Come back later. Not a deadlock: if the publisher crashed, the next publisher at this - position will eventually overwrite - the entry (after commit_timeout), - and the subscriber will then see - seq > expected (skip path above). + position eventually steals the entry + (after commit_timeout) and commits a + skip marker, which the skip path + above consumes. | v 3. Read slot_idx and payload_len from the entry. @@ -632,16 +715,16 @@ Subscriber X (read_pos_ = 41, local) | via CAS before reading data. This prevents the publisher | | from freeing the slot while the subscriber reads it. | | | - | CAS Slot.refcount: rc -> rc+1 Pin the slot (only if | - | (retry while rc > 0) rc > 0, i.e. slot alive) | - | (if rc == 0: slot freed between seq1 read and | - | between seq1 and now, now. Count as lost.) | - | skip as lost message) | + | CAS Slot.refcount: rc -> rc+1 Pin the slot (only if | + | (retry while rc > 0) rc > 0, i.e. slot alive) | + | (if rc == 0: slot freed between seq1 read and | + | between seq1 and now, now. Count as lost.) | + | skip as lost message) | | | - | seq2 = entry.sequence (acquire) Seqlock validation: if | + | seq2 = entry.sequence (acquire) Seqlock validation: if | | seq2 == seq1? the entry was overwritten | | -> yes: pin valid after we pinned, the | - | -> no: undo pin, count lost slot_idx may be stale. | + | -> no: undo pin, count lost slot_idx may be stale. | | | |---- Copy mode: try_receive() --------------------------------| | | @@ -651,7 +734,7 @@ Subscriber X (read_pos_ = 41, local) | read_pos_++ | | return SampleRef { recv_buf_, payload_len } | | | - | Note: SampleRef points into recv_buf_ (subscriber-local | + | Note: SampleRef points into recv_buf_ (subscriber-local | | buffer). Calling try_receive() again overwrites it. | | Copy data from SampleRef before the next call. | | | @@ -724,9 +807,13 @@ resource leaks. ``` Crash point Consequence ───────────────────────────────────────────────────────────────────── -After treiber_pop, before Pool slot leaked (popped but - refcount pre-set never published, refcount never - set). Bounded: 1 slot per crash. +After treiber_pop, before Pool slot orphaned (popped but + refcount pre-set never published: refcount 0, off + the free stack, in no ring). + Bounded: 1 slot per crash. + Recovered by reclaim_orphaned_ + slots() via its free-stack + membership walk. After refcount pre-set, during Refcount was set to max_subs but the ring-push loop (delivered only k out of N rings were visited @@ -735,34 +822,45 @@ After refcount pre-set, during Refcount was set to max_subs but reference (inline for non-Live, or via eviction/consumption for Live). Remaining (max_subs - k) references - are never released. The slot is - permanently leaked. + are never released by normal paths; + the slot is leaked until reclaim_ + orphaned_slots() reclaims it (once + the k ring entries have been + consumed or evicted). After CAS on write_pos, before Two sub-cases depending on whether sequence store (the dangerous the publisher reached the CAS lock: window) - Case A — crash after CAS lock - (entry stuck at LOCKED_SEQUENCE): - Next publisher at this position - waits commit_timeout and drops. - repair_locked_entries() advances - the entry to the expected sequence. - - Case B — crash before CAS lock + Case A -- crash after CAS lock + (entry stuck at a position-tagged + lock): next publisher at this + position waits commit_timeout, + observes one stable lock value across + the whole wait, steals the entry + (CAS) and commits it as a skip + marker. The external + repair_locked_entries() does the + same with a grace re-check. + + Case B -- crash before CAS lock (entry still at the previous cycle's committed sequence): Next publisher at this position also waits commit_timeout and drops. repair_locked_entries() detects the entry is more than one full wrap - behind and advances it. - - In both cases, the pool slot - referenced by the crashed entry - may be garbage — it is marked - INVALID_SLOT by the repair, and - recovered by reclaim_orphaned_slots. - Subscriber sees a gap (lost msg). + behind and skip-commits it. + + In both cases the position ends as + a skip marker: the subscriber + counts it as one lost message + without reading the metadata, and + any slot ref the stale entry held + is recovered by + reclaim_orphaned_slots. (The + repair's INVALID_SLOT store is a + diagnostic only -- nothing trusts + metadata under a skip tag.) After sequence store No issue. Entry is committed. Subscribers can read it normally. @@ -771,35 +869,52 @@ After sequence store No issue. Entry is committed. ### Timeout mechanism When a publisher wraps around to a ring entry that was previously -claimed but never committed, it calls `wait_and_capture_slot()`: +claimed but never committed, it calls `wait_for_commit()`: ``` -wait_and_capture_slot(entry, expected_seq, timeout): +wait_for_commit(entry, expected_seq, timeout): + first = entry.sequence (acquire) deadline = now() + timeout loop (check clock every 1024 iterations): seq = entry.sequence (acquire) - if seq >= expected_seq and seq != LOCKED_SEQUENCE: - return entry.slot_idx (committed, capture the old slot) + if seq is not locked and seq_pos(seq) >= expected_seq: + return {seq, stable_lock: false} (committed -- a plain + commit or skip marker) if now() >= deadline: - return INVALID_SLOT (timeout) + return {seq, stable_lock: first is locked and seq == first} ``` -The function skips entries in `LOCKED_SEQUENCE` state because another -publisher is mid-commit on that entry and will release shortly. - -On timeout (returns `INVALID_SLOT`), the publisher: -1. Skips `release_slot()` (the old `slot_idx` may be garbage) -2. Attempts the CAS lock — if it fails, the publisher **self-repairs** - the stuck entry in place (stores `INVALID_SLOT` + the expected - sequence) so the next publisher at this position succeeds without - paying the timeout. Self-repair handles both Case A (LOCKED) and - Case B (stale), costs ~10 ns on top of the already-spent timeout, - and is safe under live traffic (idempotent stores). -3. Drops delivery for this ring and moves to the next subscriber ring. - The ring resumes normal operation on the next wrap. +The function waits out locked entries because another publisher is +mid-commit and will release shortly. `stable_lock` records that ONE +position-tagged lock value was observed at both ends of the full +window: a position is claimed by exactly one publisher and locked at +most once, so this proves a single holder exceeded the commit budget. + +On timeout, the publisher: +1. Attempts the CAS lock -- if it fails, the publisher **self-repairs**: + it steals a provably-stale entry (stable lock, or committed >1 wrap + behind) with a CAS to `seq_repair(pos)` and commits it as a skip + marker (`seq_skip(pos)`: tag 01, word carrying pos + 1) so the next + publisher at this position succeeds without paying the timeout. The + CAS backs off if a live writer commits first. A stolen-from + publisher that was merely slow detects the theft at its own theft + guard or CAS commit and records a drop instead of corrupting the + entry. Its late metadata stores -- landing at any point after the + steal -- fall under the skip tag, which nothing ever trusts: + subscribers count the position lost without reading + slot_idx/payload_len, and the eviction path never releases a slot + through a skip-tagged entry. The victim's stores are permanently + harmless by construction, not by timing. +2. Drops delivery for this ring and moves to the next subscriber ring. + Every drop path ends in `abandon_delivery()`, which mirrors the + success path's Dekker wake (seq_cst fence, `has_waiter` check, + `futex_wake_all`): `write_pos` already advanced, so without the + wake a parked subscriber would sleep out its full timeout on a + position that will never plainly commit. The ring resumes normal + operation on the next wrap. The timeout is configurable per channel via `channel::Config::commit_timeout` -(default: 100 ms). The tradeoff: +(default: 10 ms). The tradeoff: - **Shorter timeout** → faster recovery after a crash, but higher risk of falsely evicting a slow-but-alive publisher under heavy scheduling @@ -812,13 +927,21 @@ The timeout is configurable per channel via `channel::Config::commit_timeout` The subscriber never deadlocks either. If a publisher crashes mid-commit: -1. The subscriber sees `seq < expected` or `seq == LOCKED_SEQUENCE` +1. The subscriber sees `seq < expected` or a locked value and returns `nullopt` (data not ready yet) -2. Eventually, another publisher wraps to the same position, - times out on `wait_and_capture_slot`, and overwrites the entry - via the two-phase commit with a higher sequence number -3. The subscriber then sees `seq > expected` (skip path), - counts the gap as lost messages, and resumes +2. Eventually another publisher wraps to the same position, times + out on `wait_for_commit`, and either commits its own payload via + the two-phase commit or steals the entry as a skip marker +3. The subscriber then sees `seq > expected` (skip-ahead path) or a + skip marker at exactly `expected` (one message lost, metadata + never read), counts the loss, and resumes + +While the head position is claimed-but-uncommitted, a blocking +`receive()` cannot park on the futex: it watches `write_pos`, and +the commit itself produces no futex edge. It polls instead -- a +bounded yield burst (64 spins), then 100 us naps until the entry +commits or the timeout expires -- so a crashed publisher costs naps, +not a hot spin for the whole timeout. ### Leak classes @@ -842,7 +965,8 @@ and full-window drain: ``` ~Subscriber(): - 1. state = Draining (seq_cst) — publishers see non-Live, skip this ring + 1. state = Draining — CAS, acq_rel, preserves in_flight; + publishers see non-Live, skip this ring 2. spin until in_flight == 0 — bounded by commit_timeout a) success: quiescence achieved b) timeout: publisher likely crashed — skip drain (see below) @@ -856,7 +980,9 @@ and full-window drain: entry.slot_idx = INVALID_SLOT (seq_cst) else: skip (evicted, uncommitted, or locked — falls into Class B) - 4. state = Free (seq_cst) — ring available for a new subscriber + 4. state = Free (release) — ring available for a new subscriber + (timeout path: CAS that preserves + the crashed publisher's in_flight) ``` The key invariant: `in_flight` is incremented by publishers BEFORE @@ -895,9 +1021,13 @@ Only Class B can leak slots. Each publisher crash leaks at most **2 pool slots**: - The slot the crashed publisher allocated (refcount stuck > 0 because - the remaining rings were never visited for inline release) -- The slot referenced by the abandoned ring entry (if one existed - at the wrapped position and its `slot_idx` could not be trusted) + the remaining rings were never visited for inline release; or + refcount 0 and off the free stack, for a crash before the pre-set) +- The slot referenced by the stolen ring entry, if one existed at the + wrapped position: the steal deliberately never releases it (the + stalled holder may already have batch-released this ring's ref, and + metadata under a skip tag is untrustworthy), so its ring ref leaks + until GC -- at most one per steal With a typical pool of 256+ slots, the system can tolerate dozens of crashes before running low. Class B leaks can be recovered by the @@ -906,10 +1036,11 @@ garbage collector (see below). ## Garbage Collection -Publisher crash leaks (Class B) leave pool slots with permanently -inflated refcounts. Since the crashed process is gone, no normal -code path will ever decrement them to zero. An explicit garbage -collection pass is needed for long-running systems. +Publisher crash leaks (Class B) leave pool slots with inflated +refcounts -- or, for a crash in the treiber_pop window, refcount 0 +while off the free stack. Since the crashed process is gone, no +normal code path will ever return them to the pool. An explicit +garbage collection pass is needed for long-running systems. ### Design principles @@ -928,31 +1059,69 @@ Recovery is split into two methods with different safety profiles: **`repair_locked_entries()`** — safe under live traffic. -Scans all ring entries. If `sequence == LOCKED_SEQUENCE` (publisher crashed -mid-commit), commits the entry with `slot_idx = INVALID_SLOT` and the -correct final sequence (`pos + 1`). This unblocks future publishers -wrapping to this position: they CAS `(pos + 1) → LOCKED`, which now -succeeds. Subscribers and evictions skip `INVALID_SLOT` entries. The -worst case under live traffic is a benign double-store if a slow (but -alive) publisher commits at the same time. +Scans all ring entries. Stale committed entries (>1 wrap behind, +including stale skip markers) are stolen immediately; locked entries +are collected and re-checked after a full `commit_timeout` grace +sleep -- the same position-tagged lock value at both instants proves +its unique holder exceeded the commit budget. Every steal takes +ownership with a CAS to `seq_repair(pos)` before touching the entry, +then commits the entry as a skip marker (`seq_skip(pos)`: tag 01, +word carrying pos + 1). The `INVALID_SLOT` / zero-length stores made +under the repair lock are diagnostics only -- nothing ever trusts +metadata under a skip tag. A live publisher that commits first wins +the CAS race and the repairer backs off; a stolen-from publisher that +was merely slow detects the theft (theft guard / CAS commit) and +records a drop. Subscribers count a skip-marked position as one lost +message without reading its metadata; evictions never release a slot +through one. + +Residuals, both bounded: +- The victim's late metadata stores -- landing at any point after the + steal -- fall under the skip tag and are ignored by construction, + so they are harmless; there is no torn-read window. +- The stolen entry's previous slot reference is never released by the + repair (the stalled holder may already have batch-released this + ring's ref; releasing again could double-free). At most one slot + ref leaks per steal, recovered by `reclaim_orphaned_slots()`. ``` repair_locked_entries(region): + candidates = [] for each ring i in [0, max_subs): for pos in [oldest_live, write_pos): - if entries[pos].sequence == LOCKED_SEQUENCE: - entries[pos].slot_idx = INVALID_SLOT - entries[pos].payload_len = 0 - entries[pos].sequence = pos + 1 // committed sequence + seq = entries[pos].sequence + if seq is locked: + candidates += {entry, pos, seq} // grace below + else if seq_pos(seq) + capacity < pos + 1: + steal(entry, pos, seq) // Case B, CAS-guarded + if candidates: sleep(commit_timeout) // grace + for {entry, pos, seq} in candidates: + if entry.sequence == seq: // same holder spanned it + steal(entry, pos, seq) + +steal(entry, pos, observed): // entry_steal_and_clear + CAS entry.sequence: observed -> seq_repair(pos) // live writer wins: back off + entry.slot_idx = INVALID_SLOT // diagnostics only + entry.payload_len = 0 + entry.sequence = seq_skip(pos) // release: committed, empty ``` -**`reclaim_orphaned_slots()`** — requires full quiescence. - -Scans all ring entries to build a set of referenced slot indices, then -reclaims any slot with refcount > 0 that is not in the referenced set. +**`reclaim_orphaned_slots()`** -- requires full quiescence. + +Builds the set of slot indices referenced by plain-committed ring +entries (locked and skip-marked entries are excluded -- skip metadata +is untrustworthy by design), walks the free stack to record +membership (exact under the quiescence contract, bounded by +`pool_size` against corrupt `next_free` cycles), then reclaims every +slot that is neither referenced nor on the stack -- regardless of +refcount. The membership walk is what recovers rc == 0 orphans (a +publisher crash between `treiber_pop` and the refcount pre-set, or a +reclaimer killed between its refcount store and its push) that a +refcount-only scan never could. NOT safe under live traffic. Requires: - All publishers quiesced (a publisher between refcount pre-set and - ring push has rc > 0 but no ring entry yet). + ring push has rc > 0 but no ring entry yet; one between treiber_pop + and the pre-set holds an off-stack rc == 0 slot). - No outstanding `SampleView` objects (a view holds a refcount pin without a ring entry reference; reclaiming it would free memory still being read). @@ -962,11 +1131,16 @@ reclaim_orphaned_slots(region): referenced = {} for each ring i in [0, max_subs): for pos in [oldest_live, write_pos): - if entries[pos].sequence >= pos + 1: + seq = entries[pos].sequence + if seq is plain-committed (tag 00) and seq >= pos + 1: referenced.insert(entries[pos].slot_idx) + on_stack = {} // exact under quiescence + for idx in chain from free_top, bounded by pool_size: + on_stack.insert(idx) + for idx in [0, pool_size): - if slot[idx].refcount > 0 and idx not in referenced: + if idx not in referenced and idx not in on_stack: slot[idx].refcount = 0 treiber_push(free_top, slot[idx], idx) ``` @@ -977,11 +1151,10 @@ reclaim_orphaned_slots(region): Crash scenario GC effect ────────────────────────────────────────────────────────────────────── After treiber_pop, before publish Slot has refcount 0, not in any - ring, not in free stack. GC cannot - distinguish it from a legitimately - free slot → NOT reclaimed (Class B - unrecoverable leak, bounded to 1 - slot per crash, see below). + ring, not in the free stack. The + free-stack membership walk proves + it off-stack, and it appears in no + committed entry -> reclaimed. After refcount pre-set, delivered Slot is in k rings but refcount to k of N rings is max_subs. The k ring references @@ -1009,14 +1182,17 @@ After write_pos CAS, before Entry is overwritten after // Draining ring with publishers finishing). Call twice with a gap // > commit_timeout; persistent counts indicate a real crash. auto report = region.diagnose(); -// report.locked_entries: entries stuck at LOCKED_SEQUENCE +// report.locked_entries: entries holding a position-tagged lock, or +// committed >1 wrap stale (incl. stale skip markers) // report.retired_rings: Free rings with stale in_flight > 0 // report.draining_rings: Draining rings with in_flight > 0 (usually transient) -// report.dead_rings: Live/Draining rings whose owner process is gone +// report.dead_rings: Live/Draining/Reclaiming rings whose owner is gone // report.live_rings: active subscriber rings +// report.schema_stuck: schema slot at Claiming (advisory; may be a live claim) -// Safe under live traffic — repairs poisoned ring entries. -// Can be called freely on a health-check timer. +// Safe under live traffic — commits poisoned ring entries as skip +// markers (CAS steal after a grace re-check; blocks one commit_timeout +// when locked entries exist). Can be called on a health-check timer. std::size_t repaired = region.repair_locked_entries(); // Reclaims rings whose owning subscriber process has died. Safe under @@ -1036,15 +1212,25 @@ std::size_t reclaimed = region.reclaim_orphaned_slots(); counts of locked entries and stuck rings. The supervisor calls this periodically; persistent nonzero counts signal recovery is needed. -**`repair_locked_entries()`** — commits locked entries with -`INVALID_SLOT`. Safe under live traffic (benign double-store if a -slow publisher commits at the same time). Can run on a timer. - -**`reclaim_dead_rings()`** — reclaims `Live`/`Draining` rings whose -owning subscriber process has died, identified by the `owner_pid` + -`owner_starttime` recorded on the ring at claim time and checked -against the OS (same liveness test as `Registry::sweep_stale`). Safe -under live traffic: a slow-but-alive subscriber is never touched. +**`repair_locked_entries()`** -- commits provably-stale entries as +skip markers. Safe under live traffic: steals are CAS-guarded and +gated on a commit_timeout grace re-check, so a live publisher always +wins and a slow one detects the theft. Can run on a timer. + +**`reclaim_dead_rings()`** -- reclaims `Live`/`Draining`/`Reclaiming` +rings whose owning subscriber process has died, identified by the +`owner_pid` + `owner_starttime` recorded on the ring at claim time +and checked against the OS (same liveness test as +`Registry::sweep_stale`). Safe under live traffic and against +concurrent reclaimers: a single-shot CAS moves the ring to +`Reclaiming`, owner death is re-verified under that exclusivity, then +the ring is freed -- or restored, if the check turned out to have +raced a fresh claim (the restore loop only acts while the state is +still `Reclaiming`, so it never stomps the owner's own teardown). A +slow-but-alive subscriber is never touched. +A reclaimer that crashes mid-pass leaves the ring at `Reclaiming`; +that residue is recovered by a later call (owner dead) or by the live +owner's own teardown. `in_flight` is preserved (a mid-commit publisher must still `fetch_sub`), so a reclaimed ring with leftover `in_flight` lands in the retired state for `reset_retired_rings()` to finish. @@ -1056,9 +1242,10 @@ deliberate post-crash action. For a dead *subscriber* prefer `reclaim_dead_rings()`, which is liveness-checked and cannot underflow `in_flight` against a slow publisher. -**`reclaim_orphaned_slots()`** — walks all rings to build a -referenced-slot set, then frees any unreferenced slot with -refcount > 0. NOT safe under live traffic — requires all publishers +**`reclaim_orphaned_slots()`** -- walks all rings to build a +referenced-slot set and the free stack for membership, then frees any +slot that is neither referenced nor on the stack, regardless of +refcount. NOT safe under live traffic -- requires all publishers quiesced and no outstanding `SampleView` objects. ### Recommended recovery sequence @@ -1081,13 +1268,12 @@ In most cases, steps 2–4 restore the channel to full operation. The recovery primitives close the common crash classes but have a few documented edges, all bounded: -- **`reclaim_orphaned_slots()` is not crash-safe mid-pass.** It resets a - slot's refcount to 0 and pushes it to the free stack as two separate - stores. If the *recovering* process is killed between them, the slot is - neither referenced nor free-listed (refcount 0), and a re-run skips it - (the `refcount > 0` guard no longer matches) — a permanent leak of that - one slot. Run recovery from a supervisor that is not itself under the - OOM/kill pressure that triggered recovery. +- **`reclaim_orphaned_slots()` mid-pass crash is self-healing.** It + resets a slot's refcount to 0 and pushes it to the free stack as two + separate stores. If the *recovering* process is killed between them, + the slot is left rc == 0, unreferenced, and off the stack -- exactly + the shape the next run's free-stack membership walk reclaims. No + permanent leak; just re-run the pass. - **`reset_retired_rings()` trusts the operator.** It zeroes `in_flight`; if called while a misclassified slow-but-alive publisher still owes a @@ -1128,7 +1314,11 @@ free_top (Treiber) 64-bit tagged pointer: 32-bit generation counter write_pos (rings) Monotonically increasing 64-bit counter. Only goes up, never revisits a value. -state (subscriber) One-way state machine: Free → Live → Draining → Free. +state (subscriber) One-way claim cycle: Free -> Live -> Draining -> + Free, plus a Reclaiming detour (Live/Draining -> + Reclaiming -> Free, or back) that + reclaim_dead_rings holds exclusively while + re-verifying owner death. Publishers only deliver to Live rings. The packed state_flight design (state + in_flight in a single uint32) eliminates cross-variable @@ -1279,6 +1469,17 @@ open_mailbox("peer", "reply") /{prefix}_peer_mbx_reply Mailbox paths include the owner's node name because they are personal reply channels -- the sender must know who to reply to. +Name components are sanitized into POSIX-shm-safe fragments, and that +mapping is many-to-one ("a:b" and "a b" both become "a_b"); the +"broadcast_" / "_mbx_" infixes are not escaped either, and macOS shm +names are truncated hashes. Two distinct logical channels can +therefore collide onto one shm name. To detect this, the Node stamps +`identity_hash` into the header at create time — an FNV-1a chain over +a kind tag ("topic" / "broadcast" / "mailbox"), the namespace, and the +RAW (pre-sanitization) logical coordinates. Opens verify it when both +sides are nonzero and fail with "Identity mismatch on existing region" +instead of silently sharing the region. Not part of `config_hash`. + ## Registry & Discovery @@ -1479,23 +1680,26 @@ competing for the same ring entry). In practice, the lock is held for two relaxed stores + one release store (~nanoseconds), so the 64-retry budget is generous. -### Unrecoverable slot leak (Class B) +### Slot leak window on publisher crash (Class B) If a publisher crashes between `treiber_pop` (slot allocated, refcount=0) and `refcount.store(max_subs)`, the slot has refcount=0 and is neither -in the free stack nor referenced by any ring entry. The GC cannot -distinguish it from a legitimately free slot and will not reclaim it. +in the free stack nor referenced by any ring entry. No normal code +path will ever touch it again -- but it is not lost: +`reclaim_orphaned_slots()` walks the free stack for membership and +reclaims any slot that is neither referenced nor on the stack, +regardless of refcount. A fresh `SharedRegion::create` (full +reinitialization) also recovers it. -This is a bounded leak: at most one slot per publisher crash in that -specific window (a few instructions wide). The slot is recovered on -the next `SharedRegion::create` (full reinitialization). +This is a bounded leak between GC passes: at most one slot per +publisher crash in that specific window (a few instructions wide). **Operational guidance:** if your deployment involves frequent publisher crashes (e.g. during development, or in a watchdog-restart architecture), size the pool with enough headroom to absorb the expected number of -orphans between region recreations. For a pool of 256 slots and a -crash rate of one per hour, the leak is negligible. If crashes are -frequent enough to matter, the region should be recreated. +orphans between `reclaim_orphaned_slots()` passes or region +recreations. For a pool of 256 slots and a crash rate of one per +hour, the leak is negligible. ### Pool and Ring Sizing diff --git a/CMakeLists.txt b/CMakeLists.txt index 80dcfab..f793c7e 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -11,12 +11,37 @@ option(BUILD_UNIT_TESTS "Build unit tests" OFF) option(BUILD_EXAMPLES "Build examples" OFF) option(BUILD_BENCHMARKS "Build benchmarks (requires Google Benchmark)" OFF) option(ENABLE_TSAN "Enable ThreadSanitizer" OFF) +option(ENABLE_ASAN "Enable AddressSanitizer" OFF) +option(ENABLE_UBSAN "Enable UndefinedBehaviorSanitizer" OFF) +option(ENABLE_WERROR "Treat warnings as errors (CI)" OFF) + +if(NOT MSVC) + add_compile_options(-Wall -Wextra) + if(ENABLE_WERROR) + add_compile_options(-Werror) + endif() +endif() + +# TSAN cannot share a process with ASAN/UBSAN runtimes; ASAN+UBSAN together is fine. +if(ENABLE_TSAN AND (ENABLE_ASAN OR ENABLE_UBSAN)) + message(FATAL_ERROR "ENABLE_TSAN cannot be combined with ENABLE_ASAN or ENABLE_UBSAN") +endif() if(ENABLE_TSAN) add_compile_options(-fsanitize=thread -g) add_link_options(-fsanitize=thread) endif() +if(ENABLE_ASAN) + add_compile_options(-fsanitize=address -g) + add_link_options(-fsanitize=address) +endif() + +if(ENABLE_UBSAN) + add_compile_options(-fsanitize=undefined -g) + add_link_options(-fsanitize=undefined) +endif() + # --- Platform-specific OS sources --- if(WIN32) set(OS_SOURCES diff --git a/README.md b/README.md index e05fdc9..6027d06 100644 --- a/README.md +++ b/README.md @@ -238,6 +238,16 @@ target_link_libraries(my_app PRIVATE kickmsg) | `BUILD_BENCHMARKS` | `OFF` | Build benchmarks (requires Google Benchmark) | | `ENABLE_TSAN` | `OFF` | Enable ThreadSanitizer | +## Security + +Shared-memory objects are created with mode `0600` (owner-only) on +Linux and macOS, so channel payloads are not readable by other users +on a multi-user host. To share channels across users, set the +`KICKMSG_SHM_MODE` environment variable to an octal mode (e.g. +`KICKMSG_SHM_MODE=0666`) in every process that *creates* regions — +openers are unaffected. The value is parsed once per process; an +invalid value falls back to `0600` with a warning on stderr. + ## Platform Support | Platform | SharedMemory | Futex | diff --git a/examples/hello_diagnose.cc b/examples/hello_diagnose.cc index 31c4889..cd29d35 100644 --- a/examples/hello_diagnose.cc +++ b/examples/hello_diagnose.cc @@ -66,8 +66,8 @@ int main() uint64_t wp = ring->write_pos.load(std::memory_order_acquire); ring->write_pos.store(wp + 1, std::memory_order_release); entries[wp & hdr->sub_ring_mask].sequence.store( - kickmsg::LOCKED_SEQUENCE, std::memory_order_release); - std::cout << " Injected: LOCKED_SEQUENCE at ring 0, pos " << wp << "\n"; + kickmsg::seq_lock(wp), std::memory_order_release); + std::cout << " Injected: stale lock at ring 0, pos " << wp << "\n"; } // Fault 2: Stuck ring (simulates subscriber teardown timeout after publisher crash) diff --git a/include/kickmsg/Hash.h b/include/kickmsg/Hash.h index ed8edfc..8a8a099 100644 --- a/include/kickmsg/Hash.h +++ b/include/kickmsg/Hash.h @@ -9,7 +9,7 @@ namespace kickmsg { - /// Optional hash helpers. Not used on any hot path — intended for + /// Optional hash helpers. Not used on any hot path -- intended for /// users filling SchemaInfo::identity without bringing their own /// hash implementation, and for building up descriptor fingerprints /// in user code more generally. @@ -32,13 +32,10 @@ namespace kickmsg /// 64-bit FNV-1a of a raw byte range. `seed` defaults to the /// standard offset basis for one-shot hashing; pass a previous /// hash value to chain additional bytes into it. - uint64_t fnv1a_64(void const* data, std::size_t len, - uint64_t seed = FNV1A_64_OFFSET_BASIS) noexcept; + uint64_t fnv1a_64(void const* data, std::size_t len, uint64_t seed = FNV1A_64_OFFSET_BASIS) noexcept; - /// 64-bit FNV-1a of a string. Thin wrapper around the raw-range - /// overload; preserved as a separate entry point because - /// descriptor-string hashing is by far the most common use. - uint64_t fnv1a_64(std::string_view s) noexcept; + /// 64-bit FNV-1a of a string. Thin wrapper around the raw-range version. + uint64_t fnv1a_64(std::string_view s, uint64_t seed = FNV1A_64_OFFSET_BASIS) noexcept; /// 64-bit FNV-1a of a trivially-copyable scalar or POD. Lets /// callers chain fields without spelling out `&v, sizeof(v)`: diff --git a/include/kickmsg/Naming.h b/include/kickmsg/Naming.h index 36596f5..05732a8 100644 --- a/include/kickmsg/Naming.h +++ b/include/kickmsg/Naming.h @@ -17,20 +17,20 @@ namespace kickmsg /// POSIX requires shm names to start with a single '/' and contain no /// further '/' characters; Linux additionally constrains the remainder /// to a single path component under /dev/shm. This helper produces the - /// "after the leading slash" fragment for one component — callers + /// "after the leading slash" fragment for one component -- callers /// assemble the final "/_" path themselves. /// /// Rules (human-readable, no hashing): - /// - strip leading '/' — lets callers pass ROS-style absolute names - /// like "/robot/arm" without producing "//…" or embedded slashes - /// - interior '/' becomes '.' — preserves hierarchy visually + /// - strip leading '/' -- lets callers pass ROS-style absolute names + /// like "/robot/arm" without producing "//..." or embedded slashes + /// - interior '/' becomes '.' -- preserves hierarchy visually /// ("robot/arm/joint1" -> "robot.arm.joint1") /// - POSIX "portable filename" chars [A-Za-z0-9._-] pass through - /// - everything else becomes '_' — deterministic, no collisions - /// between benign inputs, still eyeballable in `ls /dev/shm` + /// - everything else becomes '_' -- deterministic and still + /// eyeballable in `ls /dev/shm` /// /// Throws std::invalid_argument if the result would be empty (e.g. input - /// is "", "/", "///") — a blank component would produce ambiguous names + /// is "", "/", "///") -- a blank component would produce ambiguous names /// like "/prefix_" that silently collide across callers. \p what is /// interpolated into the exception message ("namespace", "topic", etc.). std::string sanitize_shm_component(std::string_view s, char const* what); @@ -44,7 +44,7 @@ namespace kickmsg /// macOS caveat: PSHMNAMLEN (31) leaves no room for a readable name, so /// the result is a hash and the suffix hash is truncated to fit. Two /// distinct (namespace, suffix) pairs can therefore collide onto the - /// same shm object — distinct topics would then share one region. + /// same shm object -- distinct topics would then share one region. /// Linux names are exact and never collide. Collisions are astronomically /// unlikely but not impossible; if it matters, keep names short enough to /// stay readable (Linux) or verify topology out of band. diff --git a/include/kickmsg/Node.h b/include/kickmsg/Node.h index 1c4820e..f91193f 100644 --- a/include/kickmsg/Node.h +++ b/include/kickmsg/Node.h @@ -24,7 +24,7 @@ namespace kickmsg /// Lifetime: the Node owns the underlying shared-memory mappings. All /// Publisher, Subscriber, and BroadcastHandle objects returned by this /// Node hold raw pointers into the mapped memory. They MUST NOT outlive - /// the Node that created them — destroying the Node unmaps the memory + /// the Node that created them -- destroying the Node unmaps the memory /// and silently invalidates all outstanding handles. class Node { @@ -33,7 +33,7 @@ namespace kickmsg // owner, tag) are sanitized into a POSIX-shm-compatible form: // leading '/' is stripped, interior '/' becomes '.', and any char // outside [A-Za-z0-9._-] becomes '_'. This lets callers pass - // ROS-style paths like "/robot/arm/joint1" directly — the region + // ROS-style paths like "/robot/arm/joint1" directly -- the region // ends up at "/_robot.arm.joint1" in /dev/shm, still // human-readable (no hashing). A component that sanitizes to the // empty string throws std::invalid_argument. @@ -61,7 +61,7 @@ namespace kickmsg // // The *_or_* variants relax that: either side may be the first to // materialize the region, mirroring join_broadcast()'s behavior. - // Useful when startup order is unknown — e.g. a listener service + // Useful when startup order is unknown -- e.g. a listener service // starting before its data source. // // These variants take `cfg` by reference with NO default: either @@ -77,12 +77,12 @@ namespace kickmsg // The freshly-opened SharedRegion from the second call is // discarded. Two Publisher handles on the same topic from the // same Node are NOT designed for concurrent use from separate - // threads — use one handle per thread instead. + // threads -- use one handle per thread instead. // // NOTE on cfg.schema: only the *creator* of the region bakes // cfg.schema into the header. On the open branch (region already // exists) cfg.schema is IGNORED and the existing region's schema - // is preserved — schema is orthogonal to channel geometry and + // is preserved -- schema is orthogonal to channel geometry and // never part of the create_or_open config-mismatch check. Use // try_claim_topic_schema() afterwards to publish a descriptor // regardless of which side ended up creating the region. @@ -114,7 +114,7 @@ namespace kickmsg // Thin wrappers that call SharedMemory::unlink() with the same name // formatting used by advertise / subscribe / join_broadcast / // create_mailbox. Safe to call whether or not this node currently - // holds a region for that name — unlink is a filesystem-level + // holds a region for that name -- unlink is a filesystem-level // operation on the SHM entry, independent of in-process handles. void unlink_topic(char const* topic) const; @@ -124,7 +124,7 @@ namespace kickmsg // --- Optional payload schema descriptor --- // - // The library never interprets schema bytes — these accessors just + // The library never interprets schema bytes -- these accessors just // forward to the SharedRegion backing the topic. Mismatch policy // is entirely the caller's. @@ -146,6 +146,12 @@ namespace kickmsg std::string make_broadcast_name(char const* channel) const; std::string make_mailbox_name(char const* owner, char const* tag) const; + // Identity hashes over the raw (pre-sanitization) coordinates plus + // a kind tag; shm-name collisions are then detected at open. + uint64_t make_topic_identity(char const* topic) const; + uint64_t make_broadcast_identity(char const* channel) const; + uint64_t make_mailbox_identity(char const* owner, char const* tag) const; + // unordered_map guarantees reference stability for elements // (only iterators are invalidated on rehash), so pointers // returned here remain valid across subsequent insertions. @@ -180,7 +186,7 @@ namespace kickmsg // humanoid robot can easily hold 100-300 topics (joints × (meas, // target) + cameras + IMUs + force sensors + hands), so O(N) // linear search over a vector/deque starts to matter. The - // duplication with SharedRegion::name() costs ~30 B per entry — + // duplication with SharedRegion::name() costs ~30 B per entry -- // negligible at any scale we care about. unordered_map also // guarantees reference stability for elements (the mmap addresses // used by Publisher/Subscriber don't move on rehash). diff --git a/include/kickmsg/Publisher.h b/include/kickmsg/Publisher.h index 9556e9d..89e6637 100644 --- a/include/kickmsg/Publisher.h +++ b/include/kickmsg/Publisher.h @@ -61,20 +61,37 @@ namespace kickmsg Allocation allocate(); /// Commit the currently reserved slot, recording `len` as the - /// payload size. No bounds check: caller guarantees - /// `len <= max_size` from the preceding allocate(). + /// payload size. + /// + /// Returns the number of rings delivered to. 0 means no pending + /// allocation, oversized `len` (the pending slot is recycled), or + /// zero live subscribers -- indistinguishable by design. std::size_t publish(std::size_t len); /// Allocate, copy, and publish in one call. - /// Returns bytes written on success, -EMSGSIZE if too large, -EAGAIN if pool exhausted. + /// Returns bytes written on success (NOT a delivery count: a + /// successful send may have reached zero subscribers), -EMSGSIZE + /// if too large, -EAGAIN if pool exhausted. int32_t send(void const* data, std::size_t len); /// Number of per-ring delivery drops (CAS lock contention or pool exhaustion). uint64_t dropped() const { return dropped_; } private: - static uint32_t wait_and_capture_slot(Entry& e, uint64_t expected_seq, - microseconds timeout); + /// Result of waiting for the previous wrap's occupant to commit. + /// stable_lock: one lock value spanned the whole timeout window, + /// proving its (unique) holder stale -- the steal precondition. + struct CommitWait + { + uint64_t last_seq; + bool stable_lock; + }; + + static CommitWait wait_for_commit(Entry& e, uint64_t expected_seq, + microseconds timeout); + void self_repair(Entry& e, uint64_t pos, uint64_t capacity, + CommitWait const& wait); + void abandon_delivery(SubRingHeader* ring); void release_slot(uint32_t idx); void release_pending(); diff --git a/include/kickmsg/Region.h b/include/kickmsg/Region.h index 06abd2b..6643681 100644 --- a/include/kickmsg/Region.h +++ b/include/kickmsg/Region.h @@ -10,11 +10,11 @@ namespace kickmsg { /// Runtime snapshot of a single subscriber ring. /// Values are relaxed/acquire-loaded, so the snapshot is internally - /// consistent per-ring but may race mildly across rings — fine for a + /// consistent per-ring but may race mildly across rings -- fine for a /// diagnostic view; not intended as a strongly-consistent read. struct RingStats { - uint32_t state; ///< ring::State as a raw int (0=Free, 1=Live, 2=Draining) + uint32_t state; ///< ring::State as a raw int (0=Free, 1=Live, 2=Draining, 3=Reclaiming) uint32_t in_flight; ///< Publishers currently admitted to this ring uint64_t write_pos; ///< Monotonic claim counter (rough throughput proxy) uint64_t dropped_count; ///< Cumulative publisher drops on this ring @@ -30,8 +30,9 @@ namespace kickmsg uint64_t total_writes; ///< Max of write_pos across all rings: publish events observed by the channel, monotonic across subscriber churn uint64_t total_drops; ///< Sum of dropped_count across all rings uint64_t total_losses; ///< Sum of lost_count across all rings + uint64_t total_steals; ///< Stale entries stolen (self-repair + repair_locked_entries); each may orphan one slot ref until reclaim_orphaned_slots() uint64_t live_rings; ///< Number of rings currently Live - uint64_t pool_free; ///< Approximate free-slot count (walks Treiber stack — racy under churn) + uint64_t pool_free; ///< Approximate free-slot count (walks Treiber stack -- racy under churn) uint64_t pool_size; ///< Total pool capacity (static) }; @@ -66,7 +67,7 @@ namespace kickmsg // Hand-written move ops so the moved-from object's base_/size_ // are reset to a default-constructed state. A defaulted move - // would leave them aliasing the destination's live memory — + // would leave them aliasing the destination's live memory -- // base() on the moved-from object would silently return a // dangling-looking-live pointer instead of nullptr. SharedRegion(SharedRegion&& other) noexcept @@ -95,14 +96,22 @@ namespace kickmsg ~SharedRegion() = default; + /// Create a fresh region under `name`, replacing any existing + /// object: peers keep their old (orphaned) mapping, never a + /// truncated one. Single-creator only; concurrent creators must + /// use create_or_open(). static SharedRegion create(char const* name, channel::Type type, channel::Config const& cfg, char const* creator_name = ""); - static SharedRegion open(char const* name); + /// Open an existing region. When `expected_identity` and the + /// stamped identity_hash are both nonzero they must match: a + /// mismatch (two logical channels colliding on one shm name) + /// throws instead of silently sharing the region. + static SharedRegion open(char const* name, uint64_t expected_identity = 0); /// Create the region if it doesn't exist, otherwise open the - /// existing one. On the open branch, cfg.schema is IGNORED — + /// existing one. On the open branch, cfg.schema is IGNORED -- /// schema is orthogonal to channel geometry and doesn't /// participate in the config-hash mismatch check. Use /// try_claim_schema() afterwards to publish a descriptor @@ -167,8 +176,8 @@ namespace kickmsg /// Atomically publish a schema descriptor to the region. /// - /// Returns true if this call claimed the slot (Unset → Claiming → - /// Set), false if some other claimant got there first — in which + /// Returns true if this call claimed the slot (Unset -> Claiming -> + /// Set), false if some other claimant got there first -- in which /// case the caller should read back with schema() and apply its /// own mismatch policy. When another claim is mid-write, this /// call briefly yields until the state settles or a small bounded @@ -182,8 +191,8 @@ namespace kickmsg bool try_claim_schema(SchemaInfo const& info); /// Recover a schema slot wedged in the Claiming state by a - /// crashed claimant (CAS'd Unset → Claiming then died before the - /// release-store of Set). Atomically CASes Claiming → Unset so a + /// crashed claimant (CAS'd Unset -> Claiming then died before the + /// release-store of Set). Atomically CASes Claiming -> Unset so a /// new claim can proceed; returns true if the reset actually /// happened, false if the state was not Claiming. /// @@ -191,7 +200,7 @@ namespace kickmsg /// crashed claimant is gone: a slow-but-alive writer could still /// be mid-memcpy into schema_data and would then release-store /// Set, racing a new claim into torn bytes. Mirrors the safety - /// contract of reset_retired_rings() — a deliberate post-crash + /// contract of reset_retired_rings() -- a deliberate post-crash /// action, not a routine maintenance call. bool reset_schema_claim(); @@ -217,22 +226,21 @@ namespace kickmsg /// confirmed gone. struct HealthReport { - uint32_t locked_entries; ///< Entries stuck at LOCKED_SEQUENCE + uint32_t locked_entries; ///< Entries holding a position-tagged lock, or committed >1 wrap stale uint32_t retired_rings; ///< Free rings with stale in_flight > 0 uint32_t draining_rings; ///< Draining rings with in_flight > 0 - uint32_t dead_rings; ///< Live/Draining rings whose owner process is gone + uint32_t dead_rings; ///< Live/Draining/Reclaiming rings whose owner process is gone uint32_t live_rings; ///< Active subscriber rings bool schema_stuck; ///< schema_state at Claiming (advisory; may be a live claim) }; HealthReport diagnose(); - /// Repair ring entries stuck at LOCKED_SEQUENCE (publisher crashed - /// mid-commit). Commits the entry with INVALID_SLOT so future - /// publishers can wrap past it. - /// - /// Safe to call under live traffic: the worst outcome is a benign - /// double-store if a slow (but alive) publisher commits at the same - /// time. Can be called freely on a health-check timer. + /// Repair entries left mid-commit by a crashed publisher, committing + /// them as skip markers so future publishers wrap past. Safe under + /// live traffic: locks are stolen only after a grace re-check proves + /// the holder stale (one commit_timeout, same value), every steal is + /// CAS-owned, and a merely-slow publisher detects the theft as a + /// drop. Blocks one commit_timeout when locked entries exist. /// Returns the number of entries repaired. std::size_t repair_locked_entries(); @@ -246,22 +254,26 @@ namespace kickmsg /// Returns the number of rings reset. std::size_t reset_retired_rings(); - /// Reclaim rings whose owning subscriber process has died (a Live or - /// Draining ring left behind by a crash). The owner pid + start time - /// recorded at claim time are checked against the OS; only rings with - /// a provably-dead owner are reclaimed, so this is safe under live - /// traffic -- a slow-but-alive subscriber is never touched. in_flight - /// is preserved (a mid-commit publisher must still fetch_sub), so a - /// reclaimed ring may land in the retired state for reset_retired_rings() - /// to finish; committed slot refs are recovered by - /// reclaim_orphaned_slots(). Returns the number of rings reclaimed. + /// Reclaim rings whose owner process is provably dead (pid + start + /// time checked against the OS); a slow-but-alive subscriber is + /// never touched. Two-phase like Registry::sweep_stale: single-shot + /// CAS to Reclaiming, death re-verified under that exclusivity, then + /// freed or restored -- safe for concurrent reclaimers; in_flight + /// churn just defers a ring to the next call. in_flight itself is + /// preserved (a mid-commit publisher must still fetch_sub), so a + /// reclaimed ring may land retired for reset_retired_rings(); + /// slot refs are recovered by reclaim_orphaned_slots(). + /// Returns the number of rings reclaimed. /// - /// Residual: a subscriber that crashes in the few instructions between - /// winning the claim CAS and recording its pid leaves owner_pid == 0, - /// which this cannot attribute and so will not reclaim. + /// Residuals: a subscriber that crashes in the few instructions + /// between winning the claim CAS and recording its pid leaves + /// owner_pid == 0, which this cannot attribute and so will not + /// reclaim. A reclaimer that crashes mid-pass leaves the ring + /// Reclaiming; a later call recovers it (dead owner) or the owner's + /// own teardown does (live owner). std::size_t reclaim_dead_rings(); - /// Runtime counter snapshot — safe under live traffic. + /// Runtime counter snapshot -- safe under live traffic. /// /// Reads the cross-process per-ring counters (`write_pos`, /// `dropped_count`, `lost_count`) plus ring state and an approximate @@ -276,7 +288,7 @@ namespace kickmsg /// forever under racing pushes/pops. RegionStats stats() const; - /// Static header snapshot — geometry + creator metadata. All + /// Static header snapshot -- geometry + creator metadata. All /// fields are written once at creation, so this is a plain copy. RegionInfo info() const; diff --git a/include/kickmsg/Subscriber.h b/include/kickmsg/Subscriber.h index 5db3cf0..0243891 100644 --- a/include/kickmsg/Subscriber.h +++ b/include/kickmsg/Subscriber.h @@ -168,6 +168,7 @@ namespace kickmsg uint64_t lost() const { return lost_; } uint64_t drain_timeouts() const { return drain_timeouts_; } + uint32_t ring_index() const { return ring_idx_; } private: void release_ring(); diff --git a/include/kickmsg/os/SharedMemory.h b/include/kickmsg/os/SharedMemory.h index 33caa47..554dae9 100644 --- a/include/kickmsg/os/SharedMemory.h +++ b/include/kickmsg/os/SharedMemory.h @@ -7,6 +7,8 @@ #ifdef _WIN32 #define WIN32_LEAN_AND_MEAN #include +#else + #include #endif namespace kickmsg @@ -17,6 +19,10 @@ namespace kickmsg #else using os_shm_handle = int; constexpr os_shm_handle INVALID_SHM_HANDLE = -1; + + /// Creation mode for shm objects: 0600 unless overridden via + /// KICKMSG_SHM_MODE (octal). + mode_t kickmsg_shm_mode(); #endif /// RAII wrapper around a named shared-memory region. @@ -31,7 +37,10 @@ namespace kickmsg SharedMemory& operator=(SharedMemory&& other) noexcept; ~SharedMemory(); - /// Create a new shared-memory region. Truncates to \p size bytes. + /// Create a new region of \p size bytes. An existing object under + /// this name is unlinked and replaced: peers keep their old mapping, + /// never a truncated one. Single-creator only; concurrent creators + /// must go through try_create. void create(std::string const& name, std::size_t size); /// Attempt to create a new shared-memory region. diff --git a/include/kickmsg/types.h b/include/kickmsg/types.h index 19c39f8..a13148f 100644 --- a/include/kickmsg/types.h +++ b/include/kickmsg/types.h @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include @@ -20,14 +21,38 @@ namespace kickmsg "Kickmsg requires lock-free 32-bit atomics."); constexpr uint64_t MAGIC = 0x4B49434B4D534721ULL; // "KICKMSG!" - constexpr uint32_t VERSION = 6; + constexpr uint32_t VERSION = 7; constexpr uint32_t INVALID_SLOT = UINT32_MAX; - constexpr uint64_t LOCKED_SEQUENCE = UINT64_MAX; constexpr std::size_t CACHE_LINE = 64; + // ---- Entry sequence-word encoding ---- + // + // [tag:2 | pos:62] 00 -> committed (word is pos + 1) + // 01 -> skip marker (word carries pos + 1; metadata untrustworthy) + // 10 -> locked by the publisher at `pos` + // 11 -> stolen by a repairer at `pos` + // + // Lock values are unique (one publisher per position, locks once), so an + // unchanged lock across an interval proves one holder spanned it -- the + // staleness proof repair relies on. Stolen entries commit the skip tag: + // the stolen-from publisher's plain metadata stores can land at any later + // time, so nothing may ever trust slot_idx/payload_len under it. + constexpr uint64_t SEQ_LOCK_BIT = 1ULL << 63; + constexpr uint64_t SEQ_REPAIR_BIT = 1ULL << 62; + + constexpr bool seq_is_locked(uint64_t seq) { return (seq & SEQ_LOCK_BIT) != 0; } + constexpr uint64_t seq_lock(uint64_t pos) { return SEQ_LOCK_BIT | pos; } + constexpr uint64_t seq_repair(uint64_t pos) { return SEQ_LOCK_BIT | SEQ_REPAIR_BIT | pos; } + constexpr bool seq_is_skip(uint64_t seq) + { + return (seq & (SEQ_LOCK_BIT | SEQ_REPAIR_BIT)) == SEQ_REPAIR_BIT; + } + constexpr uint64_t seq_skip(uint64_t pos) { return SEQ_REPAIR_BIT | (pos + 1); } + constexpr uint64_t seq_pos(uint64_t seq) { return seq & (SEQ_REPAIR_BIT - 1); } + // A healthy commit (memcpy + atomic release-store) finishes in a few // microseconds; even under moderate CAS contention it stays well under - // a millisecond. 10 ms is therefore ~1000× a normal commit — enough + // a millisecond. 10 ms is therefore ~1000× a normal commit -- enough // to absorb routine preemption without falsely evicting a live // publisher, while still recovering from a real crash fast enough to // avoid stalling subscribers. Applications running under severe @@ -40,7 +65,7 @@ namespace kickmsg /// /// The library never interprets any byte of this structure: it stores it /// in the shared-memory header so that multiple processes (possibly built - /// at different times, from different sources) can agree — or disagree — + /// at different times, from different sources) can agree -- or disagree -- /// on the payload format carried by the channel. /// /// Policy (which fields to fill, how to compute the hashes, what counts as @@ -67,7 +92,7 @@ namespace kickmsg uint32_t identity_algo; ///< User tag: 0 = unspecified uint32_t layout_algo; ///< User tag: 0 = unspecified uint32_t flags; ///< Reserved bit flags (0 for now) - uint8_t reserved[240]; ///< Future fields — zero on write + uint8_t reserved[240]; ///< Future fields -- zero on write }; static_assert(sizeof(SchemaInfo) == 512, "SchemaInfo layout is part of the shared-memory ABI"); @@ -89,7 +114,7 @@ namespace kickmsg /// Bitmask describing how two SchemaInfo values differ. /// /// Returned by diff(). Zero (Equal) means all checked fields match. - /// The library only compares fields with current semantic meaning — + /// The library only compares fields with current semantic meaning -- /// `flags` and `reserved[]` are deliberately excluded so that /// forward-compatible additions (a new flag bit, a new field carved /// from reserved) do NOT retroactively break existing comparisons. @@ -142,6 +167,11 @@ namespace kickmsg /// enforced by the library. Users read it back via /// SharedRegion::schema() and apply their own mismatch policy. std::optional schema; + + /// Optional logical-identity fingerprint, verified at open when + /// both sides are nonzero (shm-name collision detection). NOT + /// part of config_hash. + uint64_t identity = 0; }; } @@ -157,8 +187,8 @@ namespace kickmsg /// Layout version changes require a VERSION bump. struct Header { - std::atomic magic; ///< MAGIC sentinel — written last (release) during init, polled (acquire) by create_or_open - uint32_t version; ///< Layout version — rejects mismatched builds + std::atomic magic; ///< MAGIC sentinel -- written last (release) during init, polled (acquire) by create_or_open + uint32_t version; ///< Layout version -- rejects mismatched builds channel::Type channel_type; ///< PubSub or Broadcast uint64_t total_size; ///< Total shared-memory region size in bytes @@ -175,7 +205,7 @@ namespace kickmsg uint64_t sub_ring_stride; ///< Bytes between consecutive subscriber rings (aligned) uint64_t commit_timeout_us; ///< Max wait for a previous writer to commit (crash detection) - uint64_t config_hash; ///< FNV-1a of config fields — detects parameter mismatches on open + uint64_t config_hash; ///< FNV-1a of config fields -- detects parameter mismatches on open uint64_t creator_pid; ///< PID of the process that created the region (debug) uint64_t created_at_ns; ///< Creation timestamp in nanoseconds since epoch (debug) @@ -183,8 +213,8 @@ namespace kickmsg uint16_t creator_name_len; ///< Length of creator name string // creator_name bytes follow immediately after sizeof(Header) - /// Payload schema descriptor — opt-in, off the hot path. - /// Published via a tiny state machine (Unset → Claiming → Set): + /// Payload schema descriptor -- opt-in, off the hot path. + /// Published via a tiny state machine (Unset -> Claiming -> Set): /// writers update schema_data while schema_state == Claiming, then /// release-store Set. Readers acquire-load schema_state and only /// read schema_data if the state is Set. @@ -192,6 +222,8 @@ namespace kickmsg alignas(CACHE_LINE) SchemaInfo schema_data; alignas(CACHE_LINE) std::atomic free_top; ///< Treiber free-stack head (tagged: gen|idx) + std::atomic steal_count; ///< Entries stolen from a stale holder (each may orphan one slot ref until GC) + uint64_t identity_hash; ///< Logical-identity fingerprint, written once pre-MAGIC (0 = unstamped); detects shm-name collisions at open }; // The creator_name tail bytes are written at offset sizeof(Header) in the @@ -205,6 +237,18 @@ namespace kickmsg "Header size must be cache-line multiple to isolate atomic fields " "from the creator_name tail written at offset sizeof(Header)"); + // The magic/version prefix is the cross-build handshake: a build that + // opens a region stamped by a different VERSION must still be able to + // read these two fields at their fixed offsets to reject it. They are + // therefore frozen for ALL future versions -- any edit that moves them + // silently defeats the version-mismatch guard. + static_assert(std::is_standard_layout
::value, + "Header is placed in shared memory via reinterpret_cast"); + static_assert(offsetof(Header, magic) == 0, + "magic offset is a permanent ABI contract across all versions"); + static_assert(offsetof(Header, version) == 8, + "version offset is a permanent ABI contract across all versions"); + /// Ring entry: one per position in a subscriber ring. /// Packed to guarantee binary layout across compilers. struct Entry @@ -213,16 +257,19 @@ namespace kickmsg std::atomic slot_idx; ///< Index into the slot pool (INVALID_SLOT if released by drain) std::atomic payload_len; ///< Actual payload bytes written to the slot }; + static_assert(sizeof(Entry) == 16 and std::is_standard_layout::value, + "Entry layout drives cross-process ring-stride math"); /// Ring state machine for subscriber lifecycle. - /// Free → Live (subscriber joins) → Draining (subscriber leaving) → Free + /// Free -> Live (subscriber joins) -> Draining (subscriber leaving) -> Free namespace ring { enum State : uint32_t { - Free = 0, ///< No subscriber — available for claim - Live = 1, ///< Subscriber owns ring, publishers may deliver - Draining = 2, ///< Subscriber tearing down — no new delivery, drain in progress + Free = 0, ///< No subscriber -- available for claim + Live = 1, ///< Subscriber owns ring, publishers may deliver + Draining = 2, ///< Subscriber tearing down -- no new delivery, drain in progress + Reclaiming = 3, ///< reclaim_dead_rings() holds the ring exclusively while re-verifying owner death }; /// Packed [in_flight:30 | state:2] in a single uint32_t. @@ -260,7 +307,7 @@ namespace kickmsg // cold (written once on claim, read only by reclaim_dead_rings), so sharing // the line with the hot state_flight costs nothing in steady state. static_assert(sizeof(SubRingHeader) == 2 * CACHE_LINE, - "SubRingHeader must stay 2 cache lines — expanding it past the " + "SubRingHeader must stay 2 cache lines -- expanding it past the " "write_pos line padding requires reconsidering ring-stride math in Region.cc"); /// Slot header: prepended to each payload buffer in the pool. @@ -270,6 +317,10 @@ namespace kickmsg std::atomic refcount; ///< Number of ring references + SampleView pins std::atomic next_free; ///< Next slot index in the Treiber free-stack chain }; + static_assert(sizeof(SlotHeader) == 8 and std::is_standard_layout::value, + "SlotHeader layout drives cross-process slot-stride math"); + static_assert(std::is_standard_layout::value, + "SubRingHeader is placed in shared memory via reinterpret_cast"); // ---- constexpr helpers (stay in header) ---- @@ -303,6 +354,12 @@ namespace kickmsg uint64_t compute_config_hash(channel::Type type, channel::Config const& cfg); + /// Take ownership of a ring entry observed at `observed` (a stale lock + /// or a >1-wrap-stale committed value) and commit it as an empty skip + /// marker at position `pos`. Returns false without touching the entry + /// if it changed first -- a live writer beat us; never steal then. + bool entry_steal_and_clear(Entry& e, uint64_t pos, uint64_t observed); + void treiber_push(std::atomic& top, SlotHeader* slot, uint32_t slot_idx); void treiber_push(std::atomic& top, void* pool_base, std::size_t slot_stride, uint32_t slot_idx); uint32_t treiber_pop(std::atomic& top, void* base, Header const* h); diff --git a/py_bindings/src/kickmsg_py.cc b/py_bindings/src/kickmsg_py.cc index 1754438..b3c270a 100644 --- a/py_bindings/src/kickmsg_py.cc +++ b/py_bindings/src/kickmsg_py.cc @@ -245,6 +245,10 @@ namespace kickmsg { c.commit_timeout = us; }, "Commit timeout as a timedelta (microsecond resolution).") .def_rw("schema", &channel::Config::schema) + .def_rw("identity", &channel::Config::identity, + "Optional logical-identity fingerprint stamped into the " + "region header at create time and verified at open when " + "both sides are nonzero. Not part of the config hash.") .def("__repr__", [](channel::Config const& c) { return std::string{"Config(max_subscribers="} + @@ -366,9 +370,10 @@ namespace kickmsg char const* state_name = "?"; switch (r.state) { - case ring::Free: state_name = "Free"; break; - case ring::Live: state_name = "Live"; break; - case ring::Draining: state_name = "Draining"; break; + case ring::Free: { state_name = "Free"; break; } + case ring::Live: { state_name = "Live"; break; } + case ring::Draining: { state_name = "Draining"; break; } + case ring::Reclaiming: { state_name = "Reclaiming"; break; } } return std::string{"RingStats(state="} + state_name + ", in_flight=" + std::to_string(r.in_flight) + @@ -382,6 +387,7 @@ namespace kickmsg .def_ro("total_writes", &RegionStats::total_writes) .def_ro("total_drops", &RegionStats::total_drops) .def_ro("total_losses", &RegionStats::total_losses) + .def_ro("total_steals", &RegionStats::total_steals) .def_ro("live_rings", &RegionStats::live_rings) .def_ro("pool_free", &RegionStats::pool_free) .def_ro("pool_size", &RegionStats::pool_size) @@ -392,6 +398,7 @@ namespace kickmsg ", total_writes=" + std::to_string(s.total_writes) + ", total_drops=" + std::to_string(s.total_drops) + ", total_losses=" + std::to_string(s.total_losses) + + ", total_steals=" + std::to_string(s.total_steals) + ", pool_free=" + std::to_string(s.pool_free) + "/" + std::to_string(s.pool_size) + ")"; }); @@ -430,6 +437,7 @@ namespace kickmsg "name"_a, "type"_a, "cfg"_a, "creator"_a = std::string{""}, nb::rv_policy::move) .def_static("open", &SharedRegion::open, "name"_a, + "expected_identity"_a = uint64_t{0}, nb::rv_policy::move) .def_static("create_or_open", [](char const* name, channel::Type type, @@ -492,9 +500,9 @@ namespace kickmsg char const* role_name = "?"; switch (p.role) { - case registry::Publisher: role_name = "Publisher"; break; - case registry::Subscriber: role_name = "Subscriber"; break; - case registry::Both: role_name = "Both"; break; + case registry::Publisher: { role_name = "Publisher"; break; } + case registry::Subscriber: { role_name = "Subscriber"; break; } + case registry::Both: { role_name = "Both"; break; } } return std::string{"Participant(topic='"} + p.topic_name + "', node='" + p.node_name + diff --git a/scripts/lib/build_options.sh b/scripts/lib/build_options.sh index 7c386e8..cf7911f 100644 --- a/scripts/lib/build_options.sh +++ b/scripts/lib/build_options.sh @@ -6,14 +6,14 @@ # bash 4+ feature, and macOS ships 3.2 for GPL-licensing reasons. Each # option occupies the same index across all OPT_* arrays. -OPT_NAMES=( unit_tests benchmarks examples tsan ) -OPT_DEFAULTS=( OFF OFF OFF OFF ) -OPT_DESCRIPTIONS=( "Build unit tests" "Build benchmarks (requires Google Benchmark)" "Build examples" "Enable ThreadSanitizer" ) -OPT_CMAKE_FLAGS=( BUILD_UNIT_TESTS BUILD_BENCHMARKS BUILD_EXAMPLES ENABLE_TSAN ) +OPT_NAMES=( unit_tests benchmarks examples tsan asan ubsan ) +OPT_DEFAULTS=( OFF OFF OFF OFF OFF OFF ) +OPT_DESCRIPTIONS=( "Build unit tests" "Build benchmarks (requires Google Benchmark)" "Build examples" "Enable ThreadSanitizer" "Enable AddressSanitizer" "Enable UndefinedBehaviorSanitizer" ) +OPT_CMAKE_FLAGS=( BUILD_UNIT_TESTS BUILD_BENCHMARKS BUILD_EXAMPLES ENABLE_TSAN ENABLE_ASAN ENABLE_UBSAN ) # Conan flags: only options that actually gate a Conan-provided dep. # Empty slot = option has no Conan side effect. -OPT_CONAN_FLAGS=( unit_tests benchmarks "" "" ) +OPT_CONAN_FLAGS=( unit_tests benchmarks "" "" "" "" ) # CONFIG_VALUES is a parallel array keyed by the same index as OPT_NAMES. # Callers never index it directly — use config_get / config_set. diff --git a/src/Hash.cc b/src/Hash.cc index 0ebba02..f3cfd78 100644 --- a/src/Hash.cc +++ b/src/Hash.cc @@ -21,9 +21,9 @@ namespace kickmsg::hash return h; } - uint64_t fnv1a_64(std::string_view s) noexcept + uint64_t fnv1a_64(std::string_view s, uint64_t seed) noexcept { - return fnv1a_64(s.data(), s.size()); + return fnv1a_64(s.data(), s.size(), seed); } std::array identity_from_fnv1a(std::string_view descriptor) noexcept diff --git a/src/Node.cc b/src/Node.cc index 6b9da80..824060f 100644 --- a/src/Node.cc +++ b/src/Node.cc @@ -1,7 +1,9 @@ #include "kickmsg/Node.h" #include +#include +#include "kickmsg/Hash.h" #include "kickmsg/Naming.h" namespace kickmsg @@ -55,6 +57,14 @@ namespace kickmsg out += tag; return out; } + + /// Chain one identity component: bytes, then length as a separator + /// so ("ab","c") and ("a","bc") never hash alike. + uint64_t identity_chain(std::string_view s, uint64_t h) + { + h = hash::fnv1a_64(s, h); + return hash::fnv1a_64(s.size(), h); + } } void Node::touch_registry(std::string const& shm_name, @@ -141,20 +151,20 @@ namespace kickmsg { auto shm_name = make_topic_name(topic); auto topic_path = with_leading_slash(topic); - // Guard the create: without it, a second advertise() of the same - // topic re-evaluates SharedRegion::create() (shm_open O_TRUNC + - // memset) on the already-mapped live segment before emplace - // discards the duplicate — wiping the header/rings/pool under any - // existing Publisher and remote peers. + // Guard the create: a second advertise() would re-run create() + // (unlink + fresh object), orphaning the live segment for existing + // Publishers and remote peers. if (auto* r = find_region(shm_name)) { touch_registry(shm_name, topic_path, channel::PubSub, registry::Pubsub, registry::Publisher); return Publisher(*r); } + channel::Config stamped_cfg = cfg; + stamped_cfg.identity = make_topic_identity(topic); auto [it, _] = regions_.emplace( shm_name, - SharedRegion::create(shm_name.c_str(), channel::PubSub, cfg, name_.c_str())); + SharedRegion::create(shm_name.c_str(), channel::PubSub, stamped_cfg, name_.c_str())); touch_registry(shm_name, topic_path, channel::PubSub, registry::Pubsub, registry::Publisher); return Publisher(it->second); @@ -171,7 +181,8 @@ namespace kickmsg return Subscriber(*r); } auto [it, _] = regions_.emplace( - shm_name, SharedRegion::open(shm_name.c_str())); + shm_name, + SharedRegion::open(shm_name.c_str(), make_topic_identity(topic))); touch_registry(shm_name, topic_path, channel::PubSub, registry::Pubsub, registry::Subscriber); return Subscriber(it->second); @@ -200,16 +211,20 @@ namespace kickmsg Publisher Node::advertise_or_join(char const* topic, channel::Config const& cfg) { + channel::Config stamped_cfg = cfg; + stamped_cfg.identity = make_topic_identity(topic); return create_or_open_handle( make_topic_name(topic), with_leading_slash(topic), - channel::PubSub, registry::Pubsub, registry::Publisher, cfg); + channel::PubSub, registry::Pubsub, registry::Publisher, stamped_cfg); } Subscriber Node::subscribe_or_create(char const* topic, channel::Config const& cfg) { + channel::Config stamped_cfg = cfg; + stamped_cfg.identity = make_topic_identity(topic); return create_or_open_handle( make_topic_name(topic), with_leading_slash(topic), - channel::PubSub, registry::Pubsub, registry::Subscriber, cfg); + channel::PubSub, registry::Pubsub, registry::Subscriber, stamped_cfg); } BroadcastHandle Node::join_broadcast(char const* channel, channel::Config const& cfg) @@ -222,10 +237,12 @@ namespace kickmsg registry::Broadcast, registry::Both); return BroadcastHandle{Publisher{*r}, Subscriber{*r}}; } + channel::Config stamped_cfg = cfg; + stamped_cfg.identity = make_broadcast_identity(channel); auto [it, _] = regions_.emplace( shm_name, SharedRegion::create_or_open( - shm_name.c_str(), channel::Broadcast, cfg, name_.c_str())); + shm_name.c_str(), channel::Broadcast, stamped_cfg, name_.c_str())); touch_registry(shm_name, topic_path, channel::Broadcast, registry::Broadcast, registry::Both); return BroadcastHandle{Publisher{it->second}, Subscriber{it->second}}; @@ -235,13 +252,11 @@ namespace kickmsg { channel::Config mbx_cfg = cfg; mbx_cfg.max_subscribers = 1; + mbx_cfg.identity = make_mailbox_identity(name_.c_str(), tag); auto shm_name = make_mailbox_name(name_.c_str(), tag); auto topic_path = mailbox_topic(name_.c_str(), tag); - // Guard the create (see advertise): a second create_mailbox() of - // the same tag must not O_TRUNC+memset the live segment. Returning - // a handle to the existing region makes the duplicate claim fail - // loudly in the Subscriber ctor (the single ring is already Live) - // instead of silently wiping the mailbox. + // Guard the create (see advertise); the duplicate claim then fails + // loudly in the Subscriber ctor instead of splitting the mailbox. if (auto* r = find_region(shm_name)) { touch_registry(shm_name, topic_path, channel::PubSub, @@ -251,7 +266,7 @@ namespace kickmsg auto [it, _] = regions_.emplace( shm_name, SharedRegion::create(shm_name.c_str(), channel::PubSub, mbx_cfg, name_.c_str())); - // Mailbox owner is the one who receives — Subscriber role. + // Mailbox owner is the one who receives -- Subscriber role. touch_registry(shm_name, topic_path, channel::PubSub, registry::Mailbox, registry::Subscriber); return Subscriber(it->second); @@ -268,7 +283,9 @@ namespace kickmsg return Publisher(*r); } auto [it, _] = regions_.emplace( - shm_name, SharedRegion::open(shm_name.c_str())); + shm_name, + SharedRegion::open(shm_name.c_str(), + make_mailbox_identity(owner_node, tag))); // Mailbox sender is the Publisher side. touch_registry(shm_name, topic_path, channel::PubSub, registry::Mailbox, registry::Publisher); @@ -280,6 +297,7 @@ namespace kickmsg { channel::Config mbx_cfg = cfg; mbx_cfg.max_subscribers = 1; + mbx_cfg.identity = make_mailbox_identity(name_.c_str(), tag); return create_or_open_handle( make_mailbox_name(name_.c_str(), tag), mailbox_topic(name_.c_str(), tag), @@ -291,6 +309,7 @@ namespace kickmsg { channel::Config mbx_cfg = cfg; mbx_cfg.max_subscribers = 1; + mbx_cfg.identity = make_mailbox_identity(owner_node, tag); return create_or_open_handle( make_mailbox_name(owner_node, tag), mailbox_topic(owner_node, tag), @@ -360,6 +379,31 @@ namespace kickmsg + sanitize_shm_component(tag, "mailbox tag")); } + // Raw per-call components disambiguate colliding sanitized names; the + // leading kind tag keeps the three channel kinds in disjoint domains. + + uint64_t Node::make_topic_identity(char const* topic) const + { + uint64_t h = identity_chain("topic", hash::FNV1A_64_OFFSET_BASIS); + h = identity_chain(namespace_, h); + return identity_chain(topic, h); + } + + uint64_t Node::make_broadcast_identity(char const* channel) const + { + uint64_t h = identity_chain("broadcast", hash::FNV1A_64_OFFSET_BASIS); + h = identity_chain(namespace_, h); + return identity_chain(channel, h); + } + + uint64_t Node::make_mailbox_identity(char const* owner, char const* tag) const + { + uint64_t h = identity_chain("mailbox", hash::FNV1A_64_OFFSET_BASIS); + h = identity_chain(namespace_, h); + h = identity_chain(owner, h); + return identity_chain(tag, h); + } + SharedRegion* Node::find_region(std::string const& shm_name) { auto it = regions_.find(shm_name); diff --git a/src/Publisher.cc b/src/Publisher.cc index 068f873..c51c87e 100644 --- a/src/Publisher.cc +++ b/src/Publisher.cc @@ -39,6 +39,15 @@ namespace kickmsg std::size_t Publisher::publish(std::size_t len) { + // Oversized len would otherwise be truncated by the uint32_t store + // into payload_len -- possibly to a small VALID length, bypassing + // the subscriber's bound check and delivering a silently wrong + // length. Recycle the pending slot and report zero deliveries. + if (len > header_->slot_data_size) + { + release_pending(); + return 0; + } if (pending_slot_ == INVALID_SLOT) { return 0; @@ -94,7 +103,7 @@ namespace kickmsg admitted = true; break; } - // CAS failed — old was updated. Re-check state. + // CAS failed -- old was updated. Re-check state. } if (not admitted) @@ -113,96 +122,100 @@ namespace kickmsg auto* entries = ring_entries(ring); auto& e = entries[idx]; - // If the ring has wrapped, the entry we are about to overwrite - // may still reference a live slot. Capture its slot_idx for - // release AFTER we successfully lock the entry. - // - // We defer the release because of a TOCTOU race: between - // wait_and_capture_slot reading the slot_idx and our lock CAS, - // another publisher can overwrite the entry (evict + commit). - // If our CAS then fails, releasing the captured slot_idx would - // corrupt a live entry's refcount. Deferring until after lock - // success guarantees we own the entry. - uint32_t old_slot = INVALID_SLOT; + uint64_t prev_seq = 0; if (pos >= capacity) { - uint64_t expected_seq = pos - capacity + 1; - old_slot = wait_and_capture_slot(e, expected_seq, commit_timeout_); + prev_seq = pos - capacity + 1; } - // Two-phase commit: CAS sequence to LOCKED_SEQUENCE to exclusively - // own the entry, write data, then release-store the final sequence. - // This prevents concurrent publishers from interleaving slot_idx writes. - uint64_t prev_seq = 0; + // Wait for the previous wrap's occupant; also records whether one + // lock value spanned the whole timeout (self_repair's steal proof). + CommitWait wait{0, false}; if (pos >= capacity) { - prev_seq = pos - capacity + 1; + wait = wait_for_commit(e, prev_seq, commit_timeout_); + } + + // Two-phase commit: CAS to our lock, write data, CAS-commit. + // A repairer's theft makes both CASes fail instead of being + // blind-stored over. + uint64_t const lock_val = seq_lock(pos); + uint64_t observed = 0; + if (pos >= capacity) + { + observed = wait.last_seq; } + bool prev_was_skip = false; bool locked = false; for (int attempt = 0; attempt < 64; ++attempt) { - uint64_t expected = prev_seq; - // Acquire on success: we need to see the previous writer's stores. - if (e.sequence.compare_exchange_weak(expected, LOCKED_SEQUENCE, - std::memory_order_acquire, std::memory_order_relaxed)) + if (not seq_is_locked(observed) and seq_pos(observed) == prev_seq) { - locked = true; - break; + // CAS from the exact observed value (plain or skip-tagged). + uint64_t expected = observed; + // Acquire on success: we need to see the previous writer's stores. + if (e.sequence.compare_exchange_weak(expected, lock_val, + std::memory_order_acquire, std::memory_order_relaxed)) + { + prev_was_skip = seq_is_skip(observed); + locked = true; + break; + } + observed = expected; + continue; } - if (expected != LOCKED_SEQUENCE) + if (not seq_is_locked(observed)) { - // Entry was committed by another publisher, not just locked. - break; + break; // committed elsewhere: stale residue, can't lock } - // Another publisher holds the lock; it will release quickly. + observed = e.sequence.load(std::memory_order_relaxed); } if (not locked) { - // Self-repair: if the entry is stuck (LOCKED_SEQUENCE from - // a crashed publisher, or stale from a publisher that - // crashed before the CAS lock), advance it so the NEXT - // publisher at this position succeeds without timeout. - // Cost: three stores (~10 ns) after an already-expensive - // timeout (~10 ms). Always the right thing — leaving the - // entry stuck just punishes the next publisher for the - // same crash. - uint64_t seq = e.sequence.load(std::memory_order_acquire); - uint64_t expected = pos + 1; - if (seq == LOCKED_SEQUENCE or seq + capacity < expected) - { - e.slot_idx.store(INVALID_SLOT, std::memory_order_relaxed); - e.payload_len.store(0, std::memory_order_relaxed); - e.sequence.store(expected, std::memory_order_release); - } - - ++dropped_; - ring->dropped_count.fetch_add(1, std::memory_order_relaxed); + // Heal a provably-stale entry so the next publisher here + // does not pay the timeout again. + self_repair(e, pos, capacity, wait); + abandon_delivery(ring); ++excess; - ring->state_flight.fetch_sub(ring::IN_FLIGHT_ONE, - std::memory_order_release); continue; } - // Lock succeeded: we exclusively own this entry. Release the - // previous occupant's slot reference for this ring, unless - // drain_unconsumed already released it (marked by INVALID_SLOT). - if (old_slot != INVALID_SLOT) + // Release the previous occupant's slot from the post-lock read + // (sees even a commit that landed after our wait timed out; a + // drain's INVALID marker fails the bound check). Never for a + // skip predecessor (untrustworthy metadata), never below one + // wrap (zero-init slot_idx would read as valid slot 0). + if (pos >= capacity and not prev_was_skip) { - uint32_t current_slot = e.slot_idx.load(std::memory_order_acquire); - if (current_slot != INVALID_SLOT) + uint32_t prev_slot = e.slot_idx.load(std::memory_order_acquire); + if (prev_slot < header_->pool_size) { - release_slot(old_slot); + release_slot(prev_slot); } } - // We exclusively own this entry. No other publisher can CAS from - // LOCKED_SEQUENCE since they expect prev_seq. + // Theft guard: a repairer may have stolen our lock during a + // stall; storing data now would tear the repaired entry. + if (e.sequence.load(std::memory_order_acquire) != lock_val) + { + abandon_delivery(ring); + ++excess; + continue; + } + e.slot_idx.store(slot_idx, std::memory_order_relaxed); e.payload_len.store(static_cast(len), std::memory_order_relaxed); - // Release-store commits the entry: subscribers and future publishers - // at this position will see all preceding stores. - e.sequence.store(pos + 1, std::memory_order_release); + // CAS-commit: fails only on theft after the guard above. + // Release on success publishes the data stores. + uint64_t expected_lock = lock_val; + if (not e.sequence.compare_exchange_strong(expected_lock, pos + 1, + std::memory_order_release, std::memory_order_relaxed)) + { + abandon_delivery(ring); + ++excess; + continue; + } // Release admission. ring->state_flight.fetch_sub(ring::IN_FLIGHT_ONE, @@ -257,28 +270,73 @@ namespace kickmsg return static_cast(len); } - uint32_t Publisher::wait_and_capture_slot(Entry& e, uint64_t expected_seq, - microseconds timeout) + Publisher::CommitWait Publisher::wait_for_commit(Entry& e, uint64_t expected_seq, + microseconds timeout) { constexpr int CHECK_INTERVAL = 1024; nanoseconds start = kickmsg::monotonic_ns(); + uint64_t first = e.sequence.load(std::memory_order_acquire); + uint64_t seq = first; int i = 0; while (true) { - uint64_t seq = e.sequence.load(std::memory_order_acquire); - if (seq >= expected_seq and seq != LOCKED_SEQUENCE) + if (not seq_is_locked(seq) and seq_pos(seq) >= expected_seq) { - return e.slot_idx.load(std::memory_order_acquire); + return CommitWait{seq, false}; } ++i; if ((i & (CHECK_INTERVAL - 1)) == 0) { if (kickmsg::elapsed_time(start) >= timeout) { - return INVALID_SLOT; + // Same lock value at both ends proves one holder spanned + // the window (steal precondition). + bool stable = seq_is_locked(first) and seq == first; + return CommitWait{seq, stable}; } } + seq = e.sequence.load(std::memory_order_acquire); + } + } + + void Publisher::abandon_delivery(SubRingHeader* ring) + { + ++dropped_; + ring->dropped_count.fetch_add(1, std::memory_order_relaxed); + ring->state_flight.fetch_sub(ring::IN_FLIGHT_ONE, + std::memory_order_release); + // write_pos already advanced and the position may now carry a skip + // marker: without the wake a parked subscriber sleeps its timeout. + std::atomic_thread_fence(std::memory_order_seq_cst); + if (ring->has_waiter.load(std::memory_order_relaxed)) + { + futex_wake_all(ring->write_pos); + } + } + + void Publisher::self_repair(Entry& e, uint64_t pos, uint64_t capacity, + CommitWait const& wait) + { + uint64_t seq = e.sequence.load(std::memory_order_acquire); + uint64_t done = pos + 1; + + if (seq_is_locked(seq)) + { + // A lock that appeared mid-wait may be a healthy commit -- only + // steal one proven stable across the full wait. + if (not wait.stable_lock or seq != wait.last_seq) + { + return; + } + } + else if (seq_pos(seq) + capacity >= done) + { + return; // at most one wrap behind: normal contention residue + } + if (entry_steal_and_clear(e, pos, seq)) + { + header_->steal_count.fetch_add(1, std::memory_order_relaxed); } } diff --git a/src/Region.cc b/src/Region.cc index dd367a5..5b65b7d 100644 --- a/src/Region.cc +++ b/src/Region.cc @@ -113,19 +113,21 @@ namespace kickmsg // Optional payload schema: publish directly before the magic store. // No claim state machine needed at creation because (a) we are the - // only writer — no concurrent claimant can race — and (b) the + // only writer -- no concurrent claimant can race -- and (b) the // release-store of MAGIC below carries all preceding writes, // including the memcpy into schema_data and this relaxed store of // schema_state, across to any reader that acquire-loads MAGIC. // The relaxed is therefore correct; do NOT "fix" it to release in - // isolation — MAGIC is the sole publication fence for this region. + // isolation -- MAGIC is the sole publication fence for this region. if (cfg.schema.has_value()) { std::memcpy(&h->schema_data, &*cfg.schema, sizeof(SchemaInfo)); h->schema_state.store(schema::Set, std::memory_order_relaxed); } - h->free_top = tagged_pack(0, INVALID_SLOT); + h->free_top = tagged_pack(0, INVALID_SLOT); + h->steal_count = 0; + h->identity_hash = cfg.identity; for (uint32_t i = 0; i < cfg.pool_size; ++i) { @@ -170,7 +172,7 @@ namespace kickmsg "Header geometry: total_size smaller than Header"); } - // No zero counts or strides — divide-by-zero protection for + // No zero counts or strides -- divide-by-zero protection for // the bound checks below depends on these, and stamp_new_region // never produces a zero here. if (h->max_subs == 0 or h->pool_size == 0 @@ -374,7 +376,7 @@ namespace kickmsg return region; } - SharedRegion SharedRegion::open(char const* name) + SharedRegion SharedRegion::open(char const* name, uint64_t expected_identity) { SharedRegion region; region.name_ = name; @@ -382,6 +384,13 @@ namespace kickmsg region.base_ = region.shm_.address(); region.size_ = region.shm_.size(); validate_opened(region.base_, region.size_); + uint64_t stamped = region.header()->identity_hash; + if (expected_identity != 0 and stamped != 0 and stamped != expected_identity) + { + throw std::runtime_error( + std::string{"Identity mismatch on existing region (shm name collision): "} + + name); + } return region; } @@ -393,7 +402,7 @@ namespace kickmsg RegionLayout layout = compute_layout(cfg, creator_name); // Try to be the creator. On success, try_create leaves the - // SharedMemory fully mapped — we stamp the header directly rather + // SharedMemory fully mapped -- we stamp the header directly rather // than closing and re-entering SharedMemory::create, which would // require either O_TRUNC (rejected on Darwin) or shm_unlink + // recreate (introduces a tiny race window where a concurrent @@ -429,22 +438,29 @@ namespace kickmsg throw std::runtime_error( std::string{"Config mismatch on existing region: "} + name); } + if (cfg.identity != 0 and h->identity_hash != 0 + and h->identity_hash != cfg.identity) + { + throw std::runtime_error( + std::string{"Identity mismatch on existing region " + "(shm name collision): "} + name); + } SharedRegion region; region.name_ = name; region.shm_ = std::move(shm); region.base_ = region.shm_.address(); region.size_ = region.shm_.size(); // config_hash covers the cfg fields but NOT total_size, - // offsets, or strides — validate the geometry like + // offsets, or strides -- validate the geometry like // open()/attach_open() so a corrupt or partially-stamped // creator can't hand us junk that later pointer math trusts. validate_opened(region.base_, region.size_); return region; } - // SHM exists but magic/version not ready yet — creator + // SHM exists but magic/version not ready yet -- creator // is still mid-init. Close and retry. } - // try_open returned false (ENOENT) or magic not ready → retry. + // try_open returned false (ENOENT) or magic not ready -> retry. kickmsg::sleep(10ms); } @@ -455,8 +471,8 @@ namespace kickmsg void SharedRegion::unlink() { // Release the OS-level name backing this region. Existing - // mappings — this process and every peer that already opened - // the region — keep working until their last reference drops; + // mappings -- this process and every peer that already opened + // the region -- keep working until their last reference drops; // only the region's discoverability by name is affected. Any // holder, creator or opener, may call this. Future open-by- // name behaviour is OS-dependent and intentionally left to the @@ -478,7 +494,7 @@ namespace kickmsg // Schema slot wedged at Claiming: crashed claimant that CAS'd but // never reached Set. Mirrors the operator-surface pattern of - // retired_rings/locked_entries — reset_schema_claim() recovers it. + // retired_rings/locked_entries -- reset_schema_claim() recovers it. report.schema_stuck = (h->schema_state.load(std::memory_order_acquire) == schema::Claiming); @@ -500,9 +516,9 @@ namespace kickmsg uint64_t seq = e.sequence.load(std::memory_order_acquire); // Case A: explicitly locked, never committed. - // Case B: more than one full wrap behind — stale from a + // Case B: more than one full wrap behind -- stale from a // publisher that crashed before the CAS lock. - if (seq == LOCKED_SEQUENCE or seq + cap < pos + 1) + if (seq_is_locked(seq) or seq_pos(seq) + cap < pos + 1) { ++report.locked_entries; } @@ -525,10 +541,13 @@ namespace kickmsg ++report.draining_rings; } - // A Live or Draining ring whose owner process is gone is an - // orphan no other count surfaces (a dead Live ring otherwise - // reads as healthy). reclaim_dead_rings() recovers it. - if ((state == ring::Live or state == ring::Draining) + // A Live/Draining/Reclaiming ring whose owner process is gone + // is an orphan no other count surfaces (a dead Live ring + // otherwise reads as healthy; Reclaiming is the residue of a + // reclaimer that crashed mid-pass). reclaim_dead_rings() + // recovers all three. + if ((state == ring::Live or state == ring::Draining + or state == ring::Reclaiming) and ring_owner_dead(ring)) { ++report.dead_rings; @@ -544,6 +563,14 @@ namespace kickmsg auto* h = header(); std::size_t repaired = 0; + struct LockedCandidate + { + Entry* entry; + uint64_t pos; + uint64_t seq; + }; + std::vector candidates; + for (uint64_t i = 0; i < h->max_subs; ++i) { auto* ring = sub_ring_at(b, h, static_cast(i)); @@ -562,41 +589,47 @@ namespace kickmsg uint64_t seq = e.sequence.load(std::memory_order_acquire); uint64_t expected = pos + 1; - if (seq == LOCKED_SEQUENCE) + if (seq_is_locked(seq)) { - // Case A: publisher CAS'd Unset → LOCKED_SEQUENCE then - // crashed before the release-store of (pos + 1). The - // crashed publisher may have written garbage into - // slot_idx/payload_len. Mark the entry as having no - // valid slot so subscribers skip it and future evictions - // don't release a stale index. - e.slot_idx.store(INVALID_SLOT, std::memory_order_relaxed); - e.payload_len.store(0, std::memory_order_relaxed); - e.sequence.store(expected, std::memory_order_release); - ++repaired; + // Case A: may be a healthy in-flight commit -- never + // steal on first sight, defer to the grace pass. + candidates.push_back({&e, pos, seq}); } - else if (seq + cap < expected) + else if (seq_pos(seq) + cap < expected) { - // Case B: publisher claimed write_pos (fetch_add) then - // crashed before the CAS lock. The entry still carries - // its committed sequence from a previous wrap — more - // than one full ring revolution behind. No live - // publisher can be mid-commit for longer than one wrap - // (the commit path is a few instructions between - // fetch_add and the CAS), so > 1 wrap behind is - // definitively stale. - // - // slot_idx may reference a still-live slot from the - // previous cycle; mark INVALID_SLOT to avoid a double - // release. - e.slot_idx.store(INVALID_SLOT, std::memory_order_relaxed); - e.payload_len.store(0, std::memory_order_relaxed); - e.sequence.store(expected, std::memory_order_release); - ++repaired; + // Case B: committed >1 wrap behind (claimant crashed + // before its lock CAS). + if (entry_steal_and_clear(e, pos, seq)) + { + h->steal_count.fetch_add(1, std::memory_order_relaxed); + ++repaired; + } } } } + if (candidates.empty()) + { + return repaired; + } + + // Grace pass: an unchanged lock value across a full commit_timeout + // proves its (unique) holder exceeded the commit budget. + kickmsg::sleep(microseconds{h->commit_timeout_us}); + + for (auto const& c : candidates) + { + if (c.entry->sequence.load(std::memory_order_acquire) != c.seq) + { + continue; + } + if (entry_steal_and_clear(*c.entry, c.pos, c.seq)) + { + h->steal_count.fetch_add(1, std::memory_order_relaxed); + ++repaired; + } + } + return repaired; } @@ -637,7 +670,10 @@ namespace kickmsg uint32_t packed = ring->state_flight.load(std::memory_order_acquire); ring::State state = ring::get_state(packed); - if (state != ring::Live and state != ring::Draining) + // Reclaiming residue (reclaimer crashed mid-pass) is only + // recoverable here. + if (state != ring::Live and state != ring::Draining + and state != ring::Reclaiming) { continue; } @@ -646,27 +682,58 @@ namespace kickmsg continue; } - // Owner is gone. Flip the state bits to Free but PRESERVE - // in_flight: a publisher mid-commit to this ring will still - // fetch_sub, and zeroing in_flight here would underflow it into - // the state bits. A leftover Free | in_flight>0 is then a job - // for reset_retired_rings() once the publisher is confirmed gone; - // committed entries' slot refs are recovered by - // reclaim_orphaned_slots(). Clear owner last (Free + stale owner - // is harmless; the next claimer overwrites it). - uint32_t old = packed; - while (true) - { - uint32_t desired = (old & ~ring::STATE_MASK) | ring::Free; - if (ring->state_flight.compare_exchange_weak(old, desired, - std::memory_order_release, std::memory_order_acquire)) + // Two-phase, mirroring Registry::sweep_stale: a naive CAS retry + // is value-ABA-prone (ring freed and re-claimed between checks + // would be stomped). Single-shot CAS to Reclaiming, re-verify + // death under that exclusivity; in_flight churn just defers the + // ring to the next pass. + uint32_t fresh = ring->state_flight.load(std::memory_order_acquire); + if (ring::get_state(fresh) != state) + { + continue; + } + uint32_t claim = (fresh & ~ring::STATE_MASK) | ring::Reclaiming; + if (not ring->state_flight.compare_exchange_strong(fresh, claim, + std::memory_order_acq_rel, std::memory_order_relaxed)) + { + continue; + } + + if (ring_owner_dead(ring)) + { + // Free but PRESERVE in_flight (zeroing would underflow into + // the state bits on a late fetch_sub); owner cleared only + // after Free, or a crash here leaves an unattributable ring. + uint32_t old = claim; + while (ring::get_state(old) == ring::Reclaiming) { - break; + uint32_t desired = (old & ~ring::STATE_MASK) | ring::Free; + if (ring->state_flight.compare_exchange_weak(old, desired, + std::memory_order_release, std::memory_order_acquire)) + { + ring->owner_pid.store(0, std::memory_order_relaxed); + ring->owner_starttime.store(0, std::memory_order_relaxed); + ++reclaimed; + break; + } + } + } + else + { + // Displaced a live owner (freed + re-claimed between checks): + // restore it, but only while still Reclaiming -- the owner's + // own teardown may have moved the state. + uint32_t old = claim; + while (ring::get_state(old) == ring::Reclaiming) + { + uint32_t desired = (old & ~ring::STATE_MASK) | state; + if (ring->state_flight.compare_exchange_weak(old, desired, + std::memory_order_release, std::memory_order_acquire)) + { + break; + } } } - ring->owner_pid.store(0, std::memory_order_relaxed); - ring->owner_starttime.store(0, std::memory_order_relaxed); - ++reclaimed; } return reclaimed; @@ -678,7 +745,7 @@ namespace kickmsg uint32_t state = h->schema_state.load(std::memory_order_acquire); if (state != schema::Set) { - // Unset or a claim is mid-write — no stable payload to return. + // Unset or a claim is mid-write -- no stable payload to return. return std::nullopt; } SchemaInfo out; @@ -714,8 +781,8 @@ namespace kickmsg // Someone else won the claim. If they're mid-write, wait briefly // for the state to settle at Set so a follow-up schema() read is - // meaningful — but bound the wait: a claimant that crashed between - // CAS→Claiming and store→Set leaves the slot wedged. Operators + // meaningful -- but bound the wait: a claimant that crashed between + // CAS->Claiming and store->Set leaves the slot wedged. Operators // recover such a wedge with reset_schema_claim(), and diagnose() // surfaces it via HealthReport::schema_stuck. // @@ -741,7 +808,7 @@ namespace kickmsg // confirming the original claimant is gone; otherwise a slow-but- // alive writer could finish its memcpy into schema_data and then // release-store Set, while a new claimant is concurrently using - // the slot — producing torn bytes. + // the slot -- producing torn bytes. uint32_t expected = schema::Claiming; return header()->schema_state.compare_exchange_strong( expected, schema::Unset, @@ -755,13 +822,14 @@ namespace kickmsg auto const* h = header(); RegionStats out{}; - out.pool_size = h->pool_size; + out.pool_size = h->pool_size; + out.total_steals = h->steal_count.load(std::memory_order_relaxed); out.rings.reserve(h->max_subs); for (uint64_t i = 0; i < h->max_subs; ++i) { // sub_ring_at needs a non-const base*/header*, but the operation - // is read-only — const_cast is safe here. + // is read-only -- const_cast is safe here. auto* ring = sub_ring_at(const_cast(b), h, static_cast(i)); uint32_t packed = ring->state_flight.load(std::memory_order_acquire); @@ -796,7 +864,7 @@ namespace kickmsg // bounded by pool_size so a concurrent push/pop storm can't fool us // into an unbounded loop. Under churn we can undercount (a slot // being popped mid-walk) or overcount (a slot's next_free pointing - // to a just-pushed node we've already counted) — acceptable for a + // to a just-pushed node we've already counted) -- acceptable for a // diagnostic view. uint64_t top = h->free_top.load(std::memory_order_acquire); uint32_t idx = tagged_idx(top); @@ -862,10 +930,12 @@ namespace kickmsg auto& e = entries[pos & h->sub_ring_mask]; uint64_t seq = e.sequence.load(std::memory_order_acquire); - // Skip uncommitted and locked entries. - if (seq >= pos + 1 and seq != LOCKED_SEQUENCE) + // Skip uncommitted, locked, and skip-marker entries (the + // latter carry untrustworthy metadata by design). + if (not seq_is_locked(seq) and not seq_is_skip(seq) + and seq >= pos + 1) { - uint32_t idx = e.slot_idx; + uint32_t idx = e.slot_idx.load(std::memory_order_acquire); if (idx < h->pool_size) { referenced[idx] = true; @@ -874,23 +944,34 @@ namespace kickmsg } } - // Reclaim unreferenced slots with refcount > 0. + // Free-stack membership (exact under the quiescence contract) + // recovers rc == 0 orphans a refcount-only scan never could; + // bounded against corrupt next_free cycles. + std::vector on_stack(h->pool_size, false); + uint64_t walked = 0; + uint32_t idx32 = tagged_idx(h->free_top.load(std::memory_order_acquire)); + while (idx32 != INVALID_SLOT and idx32 < h->pool_size + and walked < h->pool_size) + { + on_stack[idx32] = true; + idx32 = slot_at(b, h, idx32)->next_free.load(std::memory_order_relaxed); + ++walked; + } + + // Reclaim slots that are neither ring-referenced nor on the free + // stack, regardless of refcount. std::size_t reclaimed = 0; for (uint64_t idx = 0; idx < h->pool_size; ++idx) { - if (referenced[idx]) + if (referenced[idx] or on_stack[idx]) { continue; } - auto* slot = slot_at(b, h, static_cast(idx)); - uint32_t rc = slot->refcount.load(std::memory_order_acquire); - if (rc > 0) - { - slot->refcount.store(0, std::memory_order_release); - treiber_push(h->free_top, slot, static_cast(idx)); - ++reclaimed; - } + auto* slot = slot_at(b, h, static_cast(idx)); + slot->refcount.store(0, std::memory_order_release); + treiber_push(h->free_top, slot, static_cast(idx)); + ++reclaimed; } return reclaimed; diff --git a/src/Registry.cc b/src/Registry.cc index 4cfcb79..20e065f 100644 --- a/src/Registry.cc +++ b/src/Registry.cc @@ -58,7 +58,7 @@ namespace kickmsg h->version = registry::VERSION; h->capacity = capacity; - // MAGIC published last — readers spin on it with acquire. + // MAGIC published last -- readers spin on it with acquire. h->magic.store(registry::MAGIC, std::memory_order_release); } @@ -212,7 +212,7 @@ namespace kickmsg { return slot; } - // Registry full — sweep dead-pid residue and retry. Bounded to + // Registry full -- sweep dead-pid residue and retry. Bounded to // avoid livelock when many registrants race on a full registry. for (int attempt = 0; attempt < 3; ++attempt) { @@ -283,7 +283,11 @@ namespace kickmsg es[i].node_name, ::strnlen(es[i].node_name, sizeof(es[i].node_name))); - // Seqlock recheck: reject if state changed or generation bumped. + // Seqlock recheck. The fence is load-bearing: an acquire load + // only orders later accesses, so without it the relaxed field + // reads could be satisfied after g2/s2 (cf. read_seqretry's + // smp_rmb). + std::atomic_thread_fence(std::memory_order_acquire); uint32_t g2 = es[i].generation.load(std::memory_order_acquire); uint32_t s2 = es[i].state.load(std::memory_order_acquire); if (s2 != registry::Active or g1 != g2) @@ -392,7 +396,7 @@ namespace kickmsg uint64_t pid = es[i].pid.load(std::memory_order_acquire); if (pid == 0) { - // Registrant hasn't stored its pid yet — reclaiming here + // Registrant hasn't stored its pid yet -- reclaiming here // would race with its pending field writes. continue; } @@ -404,7 +408,7 @@ namespace kickmsg } // Phase 1: CAS to Reclaiming to block concurrent registrants - // and close the ABA window on a direct state→Free CAS. + // and close the ABA window on a direct state->Free CAS. uint32_t expected = s; if (not es[i].state.compare_exchange_strong( expected, registry::Reclaiming, @@ -434,7 +438,7 @@ namespace kickmsg continue; } - // Phase 2: finalize. Fields are left untouched — same + // Phase 2: finalize. Fields are left untouched -- same // reasoning as deregister(). es[i].generation.fetch_add(1, std::memory_order_relaxed); es[i].state.store(registry::Free, std::memory_order_release); diff --git a/src/Subscriber.cc b/src/Subscriber.cc index 493564b..5c92c97 100644 --- a/src/Subscriber.cc +++ b/src/Subscriber.cc @@ -46,8 +46,16 @@ namespace kickmsg std::memory_order_relaxed); ring->owner_pid.store(pid, std::memory_order_release); ring_idx_ = i; + // Pre-CAS wp can be stale (no HB edge to the previous + // tenant): keep it as the drain floor, but consume from the + // freshest value or we'd replay the previous tenancy. + uint64_t wp2 = ring->write_pos.load(std::memory_order_acquire); start_pos_ = wp; - read_pos_ = start_pos_; + read_pos_ = wp; + if (wp2 > wp) + { + read_pos_ = wp2; + } break; } } @@ -67,7 +75,7 @@ namespace kickmsg auto* ring = sub_ring_at(base_, header_, ring_idx_); - // Transition Live → Draining, preserving in_flight count. + // Transition Live -> Draining, preserving in_flight count. uint32_t old = ring->state_flight.load(std::memory_order_acquire); while (true) { @@ -102,7 +110,7 @@ namespace kickmsg if (quiesced) { drain_unconsumed(ring); - // in_flight == 0 — safe to store directly. + // in_flight == 0 -- safe to store directly. ring->state_flight.store(ring::make_packed(ring::Free), std::memory_order_release); } @@ -207,10 +215,18 @@ namespace kickmsg uint64_t seq1 = e.sequence.load(std::memory_order_acquire); if (seq1 != read_pos_ + 1) { - if (seq1 == LOCKED_SEQUENCE or seq1 < read_pos_ + 1) + if (seq_is_skip(seq1) and seq_pos(seq1) == read_pos_ + 1) + { + // Skip marker: metadata untrustworthy by design. + ++lost_; + ring->lost_count.fetch_add(1, std::memory_order_relaxed); + ++read_pos_; + continue; + } + if (seq_is_locked(seq1) or seq_pos(seq1) < read_pos_ + 1) { - // Publisher is mid-commit (LOCKED_SEQUENCE) or has not - // committed yet. Come back later. + // Publisher is mid-commit (position-tagged lock) or has + // not committed yet. Come back later. return std::nullopt; } // Entry was overwritten (seq > expected): advance and retry. @@ -303,6 +319,7 @@ namespace kickmsg auto* ring = sub_ring_at(base_, header_, ring_idx_); nanoseconds start = kickmsg::monotonic_ns(); + int idle_spins = 0; while (true) { auto sample = try_receive(); @@ -316,15 +333,12 @@ namespace kickmsg { return std::nullopt; } + nanoseconds remaining = timeout - elapsed; uint64_t cur = ring->write_pos.load(std::memory_order_relaxed); if (cur <= read_pos_) { - nanoseconds remaining = timeout - elapsed; - if (remaining <= 0ns) - { - return std::nullopt; - } + idle_spins = 0; ring->has_waiter.store(1, std::memory_order_relaxed); // Pairs with the publisher's seq_cst fence: orders this store // before futex_wait's kernel read of write_pos so a concurrent @@ -333,6 +347,26 @@ namespace kickmsg futex_wait(ring->write_pos, cur, remaining); ring->has_waiter.store(0, std::memory_order_relaxed); } + else + { + // Head claimed but uncommitted: no futex edge fires for the + // commit itself, so poll -- bounded, or a crashed publisher + // turns this into a hot spin for the whole timeout. + ++idle_spins; + if (idle_spins <= 64) + { + kickmsg::yield(); + } + else + { + nanoseconds nap = 100us; + if (remaining < nap) + { + nap = remaining; + } + kickmsg::sleep(nap); + } + } } } @@ -370,7 +404,14 @@ namespace kickmsg uint64_t seq1 = e.sequence.load(std::memory_order_acquire); if (seq1 != read_pos_ + 1) { - if (seq1 == LOCKED_SEQUENCE or seq1 < read_pos_ + 1) + if (seq_is_skip(seq1) and seq_pos(seq1) == read_pos_ + 1) + { + ++lost_; + ring->lost_count.fetch_add(1, std::memory_order_relaxed); + ++read_pos_; + continue; + } + if (seq_is_locked(seq1) or seq_pos(seq1) < read_pos_ + 1) { return std::nullopt; } @@ -447,6 +488,7 @@ namespace kickmsg auto* ring = sub_ring_at(base_, header_, ring_idx_); nanoseconds start = kickmsg::monotonic_ns(); + int idle_spins = 0; while (true) { auto sample = try_receive_view(); @@ -460,15 +502,12 @@ namespace kickmsg { return std::nullopt; } + nanoseconds remaining = timeout - elapsed; uint64_t cur = ring->write_pos.load(std::memory_order_relaxed); if (cur <= read_pos_) { - nanoseconds remaining = timeout - elapsed; - if (remaining <= 0ns) - { - return std::nullopt; - } + idle_spins = 0; ring->has_waiter.store(1, std::memory_order_relaxed); // Pairs with the publisher's seq_cst fence: orders this store // before futex_wait's kernel read of write_pos so a concurrent @@ -477,6 +516,26 @@ namespace kickmsg futex_wait(ring->write_pos, cur, remaining); ring->has_waiter.store(0, std::memory_order_relaxed); } + else + { + // Head claimed but uncommitted: no futex edge fires for the + // commit itself, so poll -- bounded, or a crashed publisher + // turns this into a hot spin for the whole timeout. + ++idle_spins; + if (idle_spins <= 64) + { + kickmsg::yield(); + } + else + { + nanoseconds nap = 100us; + if (remaining < nap) + { + nap = remaining; + } + kickmsg::sleep(nap); + } + } } } diff --git a/src/os/darwin/SharedMemory.cc b/src/os/darwin/SharedMemory.cc index f9dddc7..7edd047 100644 --- a/src/os/darwin/SharedMemory.cc +++ b/src/os/darwin/SharedMemory.cc @@ -1,5 +1,3 @@ -// macOS-specific SharedMemory::create(). Other methods live in -// src/os/posix/SharedMemory.cc. #include "kickmsg/os/SharedMemory.h" #include @@ -11,20 +9,13 @@ namespace kickmsg { void SharedMemory::create(std::string const& name, std::size_t size) { - // Darwin's shm_open(O_CREAT|O_TRUNC) returns EINVAL on an existing - // object, and ftruncate can only be called once per object. Unlink - // first, then exclusive-create, to sidestep both. - // - // NOTE: the unlink + exclusive-create sequence is NOT safe under - // concurrent callers (two processes could both unlink, then both - // exclusive-create different objects with the same name). This - // function is called only from SharedRegion::create(), whose - // documented contract is "caller intends exclusive ownership" - // (publisher-arrives-first ordering). The race-prone multi- - // creator path uses try_create (in posix/SharedMemory.cc), which - // is a pure O_EXCL probe with no unlink and is safe. + // Unlink + exclusive-create: Darwin returns EINVAL on + // O_CREAT|O_TRUNC of an existing object and allows only one + // ftruncate per object. Single-creator only (two concurrent + // callers could both unlink) -- concurrent creators go through + // try_create. ::shm_unlink(name.c_str()); - fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_EXCL, 0666); + fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_EXCL, kickmsg_shm_mode()); if (fd_ < 0) { throw std::system_error(errno, std::system_category(), diff --git a/src/os/darwin/Time.cc b/src/os/darwin/Time.cc index 324c1ab..780f471 100644 --- a/src/os/darwin/Time.cc +++ b/src/os/darwin/Time.cc @@ -1,5 +1,3 @@ -// macOS-specific sleep(). clock_nanosleep is unavailable; nanosleep is the -// POSIX fallback. Other Time entry points live in src/os/posix/Time.cc. #include "kickmsg/os/Time.h" #include diff --git a/src/os/linux/Futex.cc b/src/os/linux/Futex.cc index f03195f..de99cd4 100644 --- a/src/os/linux/Futex.cc +++ b/src/os/linux/Futex.cc @@ -9,6 +9,12 @@ namespace kickmsg { + // The futex word is the LOW 32 bits of the 64-bit counter; on a + // big-endian target &word addresses the HIGH half and the kernel-side + // value check silently breaks (lost wakeups until timeout). + static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, + "futex word aliasing assumes the low half of write_pos at offset 0"); + bool futex_wait(std::atomic& word, uint64_t expected, nanoseconds timeout) { auto* addr = reinterpret_cast(&word); diff --git a/src/os/linux/SharedMemory.cc b/src/os/linux/SharedMemory.cc index 4832bec..b82754e 100644 --- a/src/os/linux/SharedMemory.cc +++ b/src/os/linux/SharedMemory.cc @@ -1,5 +1,3 @@ -// Linux-specific SharedMemory::create(). Other methods live in -// src/os/posix/SharedMemory.cc. #include "kickmsg/os/SharedMemory.h" #include @@ -12,7 +10,12 @@ namespace kickmsg { void SharedMemory::create(std::string const& name, std::size_t size) { - fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0666); + // Unlink + exclusive-create, never O_TRUNC: truncation zeroes an + // object live peers still have mapped; orphaning leaves them intact. + // The sequence is single-creator only (two concurrent callers could + // both unlink) -- concurrent creators go through try_create. + ::shm_unlink(name.c_str()); + fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_EXCL, kickmsg_shm_mode()); if (fd_ < 0) { throw std::system_error(errno, std::system_category(), diff --git a/src/os/linux/Time.cc b/src/os/linux/Time.cc index 4e2ed4c..e80f501 100644 --- a/src/os/linux/Time.cc +++ b/src/os/linux/Time.cc @@ -1,5 +1,3 @@ -// Linux-specific sleep(). Other Time entry points live in -// src/os/posix/Time.cc. #include "kickmsg/os/Time.h" #include diff --git a/src/os/posix/SharedMemory.cc b/src/os/posix/SharedMemory.cc index d442380..6129003 100644 --- a/src/os/posix/SharedMemory.cc +++ b/src/os/posix/SharedMemory.cc @@ -3,6 +3,8 @@ #include "kickmsg/os/SharedMemory.h" #include +#include +#include #include #include #include @@ -12,6 +14,32 @@ namespace kickmsg { + mode_t kickmsg_shm_mode() + { + static mode_t const mode = []() -> mode_t + { + constexpr mode_t DEFAULT_MODE = 0600; + char const* env = std::getenv("KICKMSG_SHM_MODE"); + if (env == nullptr or *env == '\0') + { + return DEFAULT_MODE; + } + errno = 0; + char* end = nullptr; + unsigned long v = std::strtoul(env, &end, 8); + if (errno != 0 or end == env or *end != '\0' or v > 07777) + { + std::fprintf(stderr, + "kickmsg: invalid KICKMSG_SHM_MODE '%s' " + "(expected octal permission bits, e.g. \"0666\"); using 0600\n", + env); + return DEFAULT_MODE; + } + return static_cast(v); + }(); + return mode; + } + namespace { [[noreturn]] void throw_system_error(char const* context, @@ -83,9 +111,9 @@ namespace kickmsg { // Keep the fd and do the full setup (ftruncate + mmap) inline. // SharedRegion::create_or_open consumes the resulting mapping - // directly — there's no reason to close here and re-enter create(), + // directly -- there's no reason to close here and re-enter create(), // and the old round-trip pattern caused a subtle race on Darwin. - fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_EXCL, 0666); + fd_ = ::shm_open(name.c_str(), O_RDWR | O_CREAT | O_EXCL, kickmsg_shm_mode()); if (fd_ < 0) { if (errno == EEXIST) @@ -95,7 +123,17 @@ namespace kickmsg } throw_system_error("SharedMemory: shm_open(try_create)", name); } - map(size); + try + { + map(size); + } + catch (...) + { + // O_EXCL made the name ours; leaving a zero-size object would + // wedge it (EEXIST on create, not-ready on open, forever). + ::shm_unlink(name.c_str()); + throw; + } return true; } diff --git a/src/os/windows/Futex.cc b/src/os/windows/Futex.cc index 8f2e35e..66c050f 100644 --- a/src/os/windows/Futex.cc +++ b/src/os/windows/Futex.cc @@ -5,6 +5,16 @@ namespace kickmsg { + // The watched word is the LOW 32 bits of the 64-bit counter; on a + // big-endian target &word addresses the HIGH half and the value check + // silently breaks (lost wakeups until timeout). MSVC has no + // __BYTE_ORDER__, but every Windows target (x86, x64, ARM64) is + // little-endian. +#if defined(__BYTE_ORDER__) + static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, + "WaitOnAddress word aliasing assumes the low half of write_pos at offset 0"); +#endif + bool futex_wait(std::atomic& word, uint64_t expected, nanoseconds timeout) { auto* addr = reinterpret_cast(&word); diff --git a/src/types.cc b/src/types.cc index b932c64..f6dbed8 100644 --- a/src/types.cc +++ b/src/types.cc @@ -38,7 +38,7 @@ namespace kickmsg } // FNV-1a over the config fields, detecting parameter mismatches at - // open time. Field order is part of the on-disk hash — do NOT + // open time. Field order is part of the on-disk hash -- do NOT // reorder without bumping VERSION. uint64_t compute_config_hash(channel::Type type, channel::Config const& cfg) { @@ -66,12 +66,34 @@ namespace kickmsg if (std::strncmp(a.name, b.name, sizeof(a.name)) != 0) d |= Name; if (a.identity_algo != b.identity_algo) d |= IdentityAlgo; if (a.layout_algo != b.layout_algo) d |= LayoutAlgo; - // Intentionally NOT compared: flags, reserved[] — see Diff + // Intentionally NOT compared: flags, reserved[] -- see Diff // doc in types.h for rationale (forward compatibility). return d; } } + bool entry_steal_and_clear(Entry& e, uint64_t pos, uint64_t observed) + { + // CAS-own before touching metadata; a live writer that moves first wins. + uint64_t expected = observed; + if (not e.sequence.compare_exchange_strong(expected, seq_repair(pos), + std::memory_order_acquire, std::memory_order_relaxed)) + { + return false; + } + + // No slot release here: the holder may have batch-released this + // ring's ref already (excess path) -- releasing again could + // double-free. The leak is bounded and GC-recoverable. + e.slot_idx.store(INVALID_SLOT, std::memory_order_relaxed); + e.payload_len.store(0, std::memory_order_relaxed); + // Skip tag, not a plain sequence: the holder's late metadata stores + // must never be trusted (see types.h). The INVALID stores are + // diagnostics only. + e.sequence.store(seq_skip(pos), std::memory_order_release); + return true; + } + void treiber_push(std::atomic& top, SlotHeader* slot, uint32_t slot_idx) { uint64_t old_top = top.load(std::memory_order_relaxed); diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt index a01d654..71a4ad3 100644 --- a/tests/CMakeLists.txt +++ b/tests/CMakeLists.txt @@ -24,6 +24,7 @@ add_executable(kickmsg_stress_test stress/pool_exhaustion.cc stress/live_repair.cc stress/edge_cases.cc + stress/big_payload.cc ) target_link_libraries(kickmsg_stress_test PRIVATE kickmsg argparse::argparse) set_target_properties(kickmsg_stress_test @@ -38,7 +39,7 @@ set_target_properties(kickmsg_stress_test add_test(NAME stress COMMAND kickmsg_stress_test WORKING_DIRECTORY ${CMAKE_BINARY_DIR}) set_tests_properties(stress PROPERTIES TIMEOUT 300) -# Crash test uses fork/waitpid/SIGKILL — POSIX only +# Crash test uses fork/waitpid/SIGKILL -- POSIX only if(NOT WIN32) add_executable(kickmsg_crash_test crash_test.cc) target_link_libraries(kickmsg_crash_test PRIVATE kickmsg) @@ -48,6 +49,26 @@ if(NOT WIN32) add_test(NAME crash COMMAND kickmsg_crash_test WORKING_DIRECTORY ${CMAKE_BINARY_DIR}) set_tests_properties(crash PROPERTIES TIMEOUT 300) + # Stall/repair test SIGSTOPs a forked publisher and steals its stale + # entry locks from the parent -- POSIX only, like the crash test. + add_executable(kickmsg_stall_repair_test stall_repair_test.cc) + target_link_libraries(kickmsg_stall_repair_test PRIVATE kickmsg) + set_target_properties(kickmsg_stall_repair_test + PROPERTIES + RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) + add_test(NAME stall_repair COMMAND kickmsg_stall_repair_test WORKING_DIRECTORY ${CMAKE_BINARY_DIR}) + set_tests_properties(stall_repair PROPERTIES TIMEOUT 300) + + # Multi-process steady-state MPMC: forked publishers/subscribers over a + # parent-created region, exact conservation oracles -- POSIX only. + add_executable(kickmsg_mp_stress_test mp_stress_test.cc) + target_link_libraries(kickmsg_mp_stress_test PRIVATE kickmsg) + set_target_properties(kickmsg_mp_stress_test + PROPERTIES + RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) + add_test(NAME mp_stress COMMAND kickmsg_mp_stress_test WORKING_DIRECTORY ${CMAKE_BINARY_DIR}) + set_tests_properties(mp_stress PROPERTIES TIMEOUT 300) + add_executable(kickmsg_registry_stress_test registry_stress_test.cc) target_link_libraries(kickmsg_registry_stress_test PRIVATE kickmsg) set_target_properties(kickmsg_registry_stress_test diff --git a/tests/README.md b/tests/README.md index 49685ec..45c0f04 100644 --- a/tests/README.md +++ b/tests/README.md @@ -17,6 +17,8 @@ these tests exercise. | `kickmsg_unit` | Unit + ABI/layout asserts, schema protocol, geometry validation, recovery primitives. Deterministic. | | `kickmsg_stress_test` | Lock-free contention: MPMC publish/receive, Treiber free-stack, pool exhaustion, churn, fairness, zero-copy. | | `kickmsg_crash_test` | Crash recovery: fork a participant, SIGKILL mid-operation, verify the channel self-heals (POSIX only). | +| `kickmsg_stall_repair_test` | False-positive death: SIGSTOP a live publisher so its entry lock looks stale, steal it with `repair_locked_entries()`, SIGCONT -- every steal must become a clean drop (POSIX only). | +| `kickmsg_mp_stress_test` | Multi-process steady-state MPMC: 4 forked publisher processes x 2 forked subscriber processes over one region (POSIX only). | | `kickmsg_registry_stress_test` | Registry under concurrent register/deregister + dead-PID sweep (POSIX only). | ## Prerequisites @@ -74,8 +76,13 @@ A clean run ends with: Runs: 110 Scenarios passed: 1650 Scenarios failed: 0 Total reorders: 0 VERDICT: ALL CLEAN ``` -`fail` and `reorders` must both be 0. A nonzero exit code (and -`VERDICT: FAILURES DETECTED`) means a real problem -- capture the log. +`fail` and `reorders` must both be 0 -- the verdict enforces both, and +also counts a nonzero binary exit as a failure even when the Summary +line itself was clean (teardown crash, exit-time sanitizer report, +watchdog kill). Each run is wrapped in `timeout` (`RUN_TIMEOUT_SECS`, +default 900 s) so a deadlock becomes a recorded failure instead of an +eternal hang. `VERDICT: FAILURES DETECTED` means a real problem -- +evidence is preserved under the failure dir with the epoch in the name. ### 4. Race detection -- highest value per hour A clean Release soak only fails if a race *manifests* as corruption; @@ -84,9 +91,88 @@ endurance is worth more than a long Release soak for finding ordering bugs. ```bash scripts/configure.sh build_tsan --with=unit_tests --with=tsan scripts/setup_build.sh build_tsan && cmake --build build_tsan -j -TSAN_OPTIONS="suppressions=$PWD/tests/tsan.supp" \ +TSAN_OPTIONS="suppressions=$PWD/tests/tsan.supp:halt_on_error=1:exitcode=66" \ tests/endurance.sh build_tsan/kickmsg_stress_test 14400 ``` +`halt_on_error=1` makes TSAN stop at the first race with the report intact; +without it TSAN reports and *continues*, and the run still exits cleanly. The +endurance harness also greps each run for `ThreadSanitizer`/`runtime error:` +and counts it as a failure (`san=` column), so a race is caught either way. + +For memory/UB bugs the same recipe works with ASAN+UBSAN: build with +`scripts/configure.sh build_asan --with=unit_tests --with=asan --with=ubsan`, +then soak `build_asan/kickmsg_stress_test` -- the rig (rung 5) picks it up +automatically as its `asan-150` profile. + +Rungs 3-4 above are the **hands-on** soaks: pick one binary and a duration and +hammer it (e.g. crash recovery for an hour before a PR, or TSAN stress for an +afternoon). The rig below is the **unattended** counterpart -- it just cycles +all of them for you. It is built *on top of* `endurance.sh`, not a replacement; +keep using the direct form for targeted runs. + +### 5. Unattended long-horizon rig -- cycles all profiles +`tests/soak_all.sh` loops weighted profiles until a wall-clock deadline, each a +time-sliced `endurance.sh` pass: **TSAN takes half the slices** (rarest, +highest-value signal), crash fuzz a quarter, plain stress the rest as periodic +sanity at oversub 150/200. It records a per-slice verdict, persists failing +runs, and survives any single slice failing. It self-detaches (survives +logout), runs under `caffeinate` on macOS, and prints a watch/stop dashboard. +```bash +# build the plain + crash binaries (build/) and optionally build_tsan/ +tests/soak_all.sh 604800 1800 # 1 week, 30-min slices; returns at once +# absent build_tsan/ is skipped, not fatal. SOAK_FOREGROUND=1 to run inline. +``` +Each launch isolates its run under `soak_logs/run_/` (also +`soak_logs/latest`) with the master log, per-slice logs, and `fails/`. Check +progress at any time with: +```bash +tests/soak_status.sh # state, elapsed/remaining, per-profile tally +``` +The launch dashboard also prints the exact `tail`/stop commands. Interrupting a +run (Ctrl-C or `kill`) unlinks the active segment, so it leaves no stale +`/dev/shm` behind (`kill -9` is the exception, covered by unlink-before-create). + +Slice length (arg 2) is operational, not a coverage knob: a profile's total +iterations over the run depend only on its *share* of the cycle, not on how +that time is chunked. Shorter slices just give finer failure-attribution +checkpoints. To shift priority between profiles, edit the weighting in the +`PROFILES` list -- don't lengthen slices. + +The **crash test fuzzes its kill timing** (seeded random per round) so a long +soak explores new crash windows instead of re-hitting a fixed schedule: +```bash +build/kickmsg_crash_test # seed from clock -- logged at startup +KICKMSG_CRASH_SEED=12345 build/kickmsg_crash_test # replay an exact schedule +KICKMSG_CRASH_ROUNDS=200 build/kickmsg_crash_test # more kills per invocation +``` + +The **stall/repair test fuzzes a *false-positive* death**: the publisher is +not killed but SIGSTOPped at random instants, so it sometimes freezes while +holding an entry lock that a tight `commit_timeout` (2 ms) makes "provably +stale". The parent then runs `repair_locked_entries()` during the stall and +SIGCONTs the publisher, which resumes against a stolen lock. The theft-safe +protocol (position-tagged locks, CAS commit, theft guard) must turn every +steal into a clean drop -- the test fails on any torn payload, per-publisher +sequence rewind, or refcount/pool corruption, and bounds reclaimed orphan +slots by the number of steals. The run prints `Steals observed: N` (a 0 is a +WARN, not a failure -- the window is sampled probabilistically): +```bash +build/kickmsg_stall_repair_test # seed from clock -- logged at startup +KICKMSG_STALL_SEED=12345 build/kickmsg_stall_repair_test # replay an exact schedule +build/kickmsg_stall_repair_test --rounds 1000 # more stalls per invocation +``` + +The **multi-process MPMC test** is the cross-process counterpart of the +in-process stress suite: the parent creates the region, forks 4 publisher and +2 subscriber *processes*, and holds the publishers on a pipe gate until every +subscriber ring is Live -- so per-subscriber accounting is exact +(received + lost == total sent). It fails on any corruption, per-publisher +sequence reorder, conservation mismatch, signal-killed child, or pool/refcount +damage. The rig runs it as the `mp` profile: +```bash +build/kickmsg_mp_stress_test # ~2-5 s +KICKMSG_MP_SEED=12345 build/kickmsg_mp_stress_test # replay publisher start jitter +``` ## Contention scales to your machine @@ -107,6 +193,8 @@ bound it explicitly. ## Detached long soak (survives logout; keeps the machine awake) +For a *single-profile* `endurance.sh` run (rungs 3-4); the rig in rung 5 +self-detaches and handles `caffeinate` on its own. ```bash # macOS: caffeinate prevents idle sleep mid-soak nohup caffeinate -i tests/endurance.sh build/kickmsg_stress_test 43200 > soak.log 2>&1 & diff --git a/tests/crash_test.cc b/tests/crash_test.cc index b4fccd3..c4849ca 100644 --- a/tests/crash_test.cc +++ b/tests/crash_test.cc @@ -1,16 +1,16 @@ /// @file crash_test.cc /// @brief Multi-process crash recovery test for kickmsg. /// -/// Phase 1 — publisher crash rounds (the original scenario): fork a child +/// Phase 1 -- publisher crash rounds (the original scenario): fork a child /// publisher, kill it mid-commit with SIGKILL, verify the channel /// recovers via diagnose() + repair_locked_entries() + /// reset_retired_rings() + reclaim_orphaned_slots(). A subscriber /// running throughout validates that no corruption occurs. /// -/// Phase 2 — subscriber crash: a subscriber that SIGKILLs itself while +/// Phase 2 -- subscriber crash: a subscriber that SIGKILLs itself while /// holding live SampleView pins. Exercises reclaim_orphaned_slots(). /// -/// Phase 3 — multi-publisher simultaneous crash: four publishers all +/// Phase 3 -- multi-publisher simultaneous crash: four publishers all /// killed at once. Exercises the repair sequence when Case-A (locked /// sequences) and Case-B (retired rings) residue coexist at multiple /// ring positions. @@ -24,6 +24,7 @@ #include #include +#include "shm_cleanup.h" #include "kickmsg/os/Time.h" #include "kickmsg/Publisher.h" #include "kickmsg/Subscriber.h" @@ -45,6 +46,48 @@ static uint32_t compute_checksum(CrashPayload const& p) return p.magic ^ p.seq ^ 0xBAADF00D; } +// --- Seeded kill-timing fuzzer --------------------------------------------- +// Each kill fires at a random instant so a long soak explores new crash +// windows instead of re-hitting a fixed schedule. The seed is logged at +// startup; set KICKMSG_CRASH_SEED to replay a specific run. +namespace +{ + uint64_t g_rng_state = 0; + + uint64_t next_rand() // splitmix64 + { + g_rng_state += 0x9E3779B97F4A7C15ull; + uint64_t z = g_rng_state; + z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull; + z = (z ^ (z >> 27)) * 0x94D049BB133111EBull; + return z ^ (z >> 31); + } + + // Sleep a random duration in [lo_us, hi_us] inclusive (microsecond grain). + void sleep_rand(uint64_t lo_us, uint64_t hi_us) + { + uint64_t span = hi_us - lo_us + 1; + kickmsg::sleep(microseconds{static_cast(lo_us + next_rand() % span)}); + } + + uint64_t seed_fuzzer() + { + uint64_t seed; + char const* env = std::getenv("KICKMSG_CRASH_SEED"); + if (env != nullptr) + { + seed = std::strtoull(env, nullptr, 0); + } + else + { + seed = static_cast(monotonic_ns().count()) + ^ (static_cast(::getpid()) << 32); + } + g_rng_state = seed; + return seed; + } +} + /// Child publisher: publishes as fast as possible using allocate() + publish() /// to maximize the window where a kill can orphan a slot. static void child_publisher_main(int /*round*/) @@ -114,8 +157,8 @@ static void child_subscriber_main(int result_fd, int signal_fd) ssize_t written = write(result_fd, &result, sizeof(result)); close(result_fd); close(signal_fd); - // Partial write → parent would read zero-initialized bytes and - // interpret that as "no corruption" — a silent false pass. Exit + // Partial write -> parent would read zero-initialized bytes and + // interpret that as "no corruption" -- a silent false pass. Exit // non-zero so waitpid surfaces the anomaly. if (written != sizeof(result)) { @@ -128,6 +171,7 @@ struct RoundResult bool recovered_entries; bool recovered_rings; bool recovered_slots; + bool repair_ok; bool subscriber_ok; }; @@ -156,8 +200,8 @@ static RoundResult run_one_round(int round) _exit(0); // never reached } - // Let publisher run for 20-50ms - kickmsg::sleep(milliseconds{20 + (round % 30)}); + // Kill at a fuzzed instant in the publisher's lifecycle (0.2-50ms). + sleep_rand(200, 50000); // Kill publisher mid-flight kill(pub_pid, SIGKILL); @@ -186,7 +230,8 @@ static RoundResult run_one_round(int round) // Verify clean after repair auto post = region.diagnose(); - if (post.locked_entries > 0 or post.retired_rings > 0) + result.repair_ok = (post.locked_entries == 0 and post.retired_rings == 0); + if (not result.repair_ok) { std::fprintf(stderr, " [FAIL] Round %d: repair incomplete " "(locked=%u, retired=%u)\n", @@ -220,7 +265,7 @@ static RoundResult run_one_round(int round) return result; } -/// Phase 2: subscriber SIGKILLed while holding SampleView pins — +/// Phase 2: subscriber SIGKILLed while holding SampleView pins -- /// `reclaim_orphaned_slots` must release them. static bool test_subscriber_crash() { @@ -244,7 +289,7 @@ static bool test_subscriber_crash() { auto r = kickmsg::SharedRegion::open(SUB_SHM); kickmsg::Subscriber sub(r); - // receive_view() returns SampleView which pins the slot — exactly + // receive_view() returns SampleView which pins the slot -- exactly // what we want to orphan on SIGKILL. std::vector pins; while (true) @@ -335,7 +380,7 @@ static bool test_subscriber_crash() return clean and writable; } -/// Phase 3: four publishers SIGKILLed simultaneously — repair sequence +/// Phase 3: four publishers SIGKILLed simultaneously -- repair sequence /// must handle Case-A + Case-B residue coexisting at multiple positions. static bool test_multi_publisher_crash() { @@ -382,7 +427,7 @@ static bool test_multi_publisher_crash() } } - kickmsg::sleep(30ms); + sleep_rand(2000, 50000); for (int i = 0; i < N_PUBS; ++i) { @@ -452,7 +497,17 @@ static bool test_multi_publisher_crash() int main() { - std::printf("=== Kickmsg Multi-Process Crash Test ===\n\n"); + std::printf("=== Kickmsg Multi-Process Crash Test ===\n"); + // Clean up segments if interrupted (Ctrl-C / kill) before the test's own + // unlink runs. Installed before forking so children inherit it too. + kickmsg_test::register_cleanup_shm("/kickmsg_crash_test"); + kickmsg_test::register_cleanup_shm("/kickmsg_crash_test_sub"); + kickmsg_test::register_cleanup_shm("/kickmsg_crash_test_multi"); + kickmsg_test::install_signal_cleanup(); + + uint64_t const seed = seed_fuzzer(); + std::printf("crash fuzz seed=%llu (set KICKMSG_CRASH_SEED to replay)\n\n", + static_cast(seed)); kickmsg::SharedMemory::unlink(SHM_NAME); @@ -490,7 +545,15 @@ int main() // Let subscriber attach kickmsg::sleep(50ms); - constexpr int NUM_ROUNDS = 10; + int NUM_ROUNDS = 30; + if (char const* r = std::getenv("KICKMSG_CRASH_ROUNDS")) + { + int v = std::atoi(r); + if (v > 0) + { + NUM_ROUNDS = v; + } + } int any_recovery = 0; bool all_ok = true; @@ -501,6 +564,10 @@ int main() { ++any_recovery; } + if (not result.repair_ok) + { + all_ok = false; + } } // Signal subscriber to exit @@ -508,7 +575,7 @@ int main() // Read subscriber results. A short read means the child either // crashed before writing its result struct or we lost bytes on the - // pipe — either way, we can't trust a zero-initialised sub_result + // pipe -- either way, we can't trust a zero-initialised sub_result // and should fail rather than silently pass the corruption check. struct { uint64_t recv; uint64_t corrupt; } sub_result{}; ssize_t got = read(result_pipe[0], &sub_result, sizeof(sub_result)); diff --git a/tests/endurance.sh b/tests/endurance.sh index 43af316..9b318bb 100755 --- a/tests/endurance.sh +++ b/tests/endurance.sh @@ -12,17 +12,36 @@ set -euo pipefail BINARY="$1" DURATION_SECS="${2:-3600}" +case "$DURATION_SECS" in + *[!0-9]*) echo "endurance.sh: duration must be numeric, got '$DURATION_SECS'" >&2; exit 2;; +esac shift # drop binary if [ "$#" -gt 0 ]; then shift # drop duration, if it was given fi EXTRA_ARGS=("$@") # anything left is forwarded to the binary +# Watchdog: a deadlock/livelock in the binary must become a recorded +# failure, not an eternal hang the soak never notices. RC 124/137 from +# timeout(1) goes through the normal nonzero-RC tally below. macOS has +# no stock timeout; fall back to gtimeout (coreutils) or, failing that, +# run unguarded and say so once. +RUN_TIMEOUT_SECS="${RUN_TIMEOUT_SECS:-900}" +TIMEOUT_CMD=() +if command -v timeout >/dev/null 2>&1; then + TIMEOUT_CMD=(timeout -k 30 "$RUN_TIMEOUT_SECS") +elif command -v gtimeout >/dev/null 2>&1; then + TIMEOUT_CMD=(gtimeout -k 30 "$RUN_TIMEOUT_SECS") +else + echo "WARN: no timeout(1)/gtimeout(1) -- a hung run will hang the soak" >&2 +fi + END_TIME=$(($(date +%s) + DURATION_SECS)) PASS=0 FAIL=0 RUNS=0 REORDERS=0 +SAN=0 echo "=== kickmsg endurance test ===" echo "Binary: $BINARY ${EXTRA_ARGS[*]+"${EXTRA_ARGS[*]}"}" echo "Duration: ${DURATION_SECS}s" @@ -30,7 +49,7 @@ echo "Start: $(date)" echo "" while [ "$(date +%s)" -lt "$END_TIME" ]; do RC=0 - OUTPUT=$("$BINARY" ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"} 2>&1) || RC=$? + OUTPUT=$(${TIMEOUT_CMD[@]+"${TIMEOUT_CMD[@]}"} "$BINARY" ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"} 2>&1) || RC=$? RUNS=$((RUNS + 1)) if [ "$RUNS" -eq 1 ]; then echo "$OUTPUT" | grep -iE "harness built|contention:" || true @@ -38,8 +57,10 @@ while [ "$(date +%s)" -lt "$END_TIME" ]; do fi SUMMARY=$(echo "$OUTPUT" | grep "Summary:" | tail -1 || true) if [ -n "$SUMMARY" ]; then - RUN_PASS=$(echo "$SUMMARY" | grep -oE '[0-9]+ passed' | grep -oE '[0-9]+') - RUN_FAIL=$(echo "$SUMMARY" | grep -oE '[0-9]+ failed' | grep -oE '[0-9]+') + # Guarded: a garbled/interleaved Summary line (possible under heavy + # sanitizer contention) must not let set -e kill the whole soak. + RUN_PASS=$(echo "$SUMMARY" | grep -oE '[0-9]+ passed' | grep -oE '[0-9]+' || true) + RUN_FAIL=$(echo "$SUMMARY" | grep -oE '[0-9]+ failed' | grep -oE '[0-9]+' || true) else # No summary line (e.g. crash test): tally by exit code. if [ "$RC" -eq 0 ]; then @@ -53,15 +74,34 @@ while [ "$(date +%s)" -lt "$END_TIME" ]; do RUN_PASS=${RUN_PASS:-0} RUN_FAIL=${RUN_FAIL:-0} RUN_REORDER=$(echo "$OUTPUT" | { grep -c "REORDER" || true; }) + # Sanitizer reports (TSAN/ASAN/UBSAN) go to stderr and do NOT bump the + # suite's "failed" count -- detect them explicitly or they get swallowed. + RUN_SANITIZER=$(echo "$OUTPUT" | { grep -c -E "ThreadSanitizer|AddressSanitizer|LeakSanitizer|MemorySanitizer|runtime error:" || true; }) + if [ "$RUN_SANITIZER" -gt 0 ] && [ "$RUN_FAIL" -eq 0 ]; then + RUN_FAIL=1 + fi + # A clean Summary line followed by a nonzero exit (teardown segfault, + # exit-time sanitizer report, timeout kill) is still a failure -- the + # Summary-based tally above must not outvote the exit code. + if [ "$RC" -ne 0 ] && [ "$RUN_FAIL" -eq 0 ]; then + RUN_FAIL=1 + fi PASS=$((PASS + RUN_PASS)) FAIL=$((FAIL + RUN_FAIL)) REORDERS=$((REORDERS + RUN_REORDER)) + SAN=$((SAN + RUN_SANITIZER)) ELAPSED=$(($(date +%s) - END_TIME + DURATION_SECS)) - printf "\r[%ds/%ds] runs=%d pass=%d fail=%d reorders=%d" \ - "$ELAPSED" "$DURATION_SECS" "$RUNS" "$PASS" "$FAIL" "$REORDERS" - if [ "$RUN_FAIL" -gt 0 ]; then + printf "\r[%ds/%ds] runs=%d pass=%d fail=%d reorders=%d san=%d" \ + "$ELAPSED" "$DURATION_SECS" "$RUNS" "$PASS" "$FAIL" "$REORDERS" "$SAN" + if [ "$RUN_FAIL" -gt 0 ] || [ "$RC" -ne 0 ]; then + # Persist the full run output -- the evidence is otherwise lost. + FAILDIR="${FAILDIR:-endurance_fails}" + mkdir -p "$FAILDIR" + # Epoch in the name: RUNS restarts every slice, and clobbering an + # earlier slice's evidence destroys exactly what a soak exists to keep. + printf '%s\n' "$OUTPUT" > "$FAILDIR/run_${RUNS}_rc${RC}_$(date +%s).log" echo "" - echo "$OUTPUT" | grep -E "REORDER|FAIL|WARN" || true + echo "$OUTPUT" | grep -E "REORDER|FAIL|WARN|ThreadSanitizer|AddressSanitizer|LeakSanitizer|runtime error:" || true fi done echo "" @@ -72,8 +112,9 @@ echo "Runs: $RUNS" echo "Scenarios passed: $PASS" echo "Scenarios failed: $FAIL" echo "Total reorders: $REORDERS" +echo "Sanitizer hits: $SAN" echo "End: $(date)" -if [ "$FAIL" -eq 0 ]; then +if [ "$FAIL" -eq 0 ] && [ "$REORDERS" -eq 0 ]; then echo "VERDICT: ALL CLEAN" else echo "VERDICT: FAILURES DETECTED" diff --git a/tests/mp_stress_test.cc b/tests/mp_stress_test.cc new file mode 100644 index 0000000..c3b5ff4 --- /dev/null +++ b/tests/mp_stress_test.cc @@ -0,0 +1,621 @@ +/// @file mp_stress_test.cc +/// @brief Multi-process steady-state MPMC soak. +/// +/// The parent creates the region, then forks 4 publisher processes and +/// 2 subscriber processes. Each subscriber signals readiness over a pipe +/// AFTER constructing its kickmsg::Subscriber; the parent releases the +/// publishers (go-pipe EOF) only once every ring is Live, so per-subscriber +/// accounting is exact: received + lost (+ corrupt/bad/reorder) == total +/// sent, no late-attach slack. +/// +/// Oracles per subscriber: no corruption (magic + checksum), per-publisher +/// strictly increasing seq, exact conservation, received > 0. Parent-side: +/// every child exits 0 (signal-death is a failure), then the same +/// structural checks as stall_repair_test (free-stack walk, refcounts, +/// rings Free) plus the no-crash GC oracle (repair fixes nothing, +/// reclaims bounded by observed publisher drops). + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "shm_cleanup.h" +#include "kickmsg/os/Time.h" +#include "kickmsg/Publisher.h" +#include "kickmsg/Subscriber.h" + +using namespace kickmsg; + +static constexpr char const* SHM_NAME = "/kickmsg_mp_stress"; + +static constexpr int NUM_PUBS = 4; +static constexpr int NUM_SUBS = 2; +static constexpr uint32_t MSGS_PER_PUB = 100000; +static constexpr uint64_t TOTAL_SENT = static_cast(NUM_PUBS) * MSGS_PER_PUB; + +struct MpPayload +{ + static constexpr uint32_t MAGIC = 0x4D505353; // 'MPSS' + uint32_t magic; + uint32_t pub_id; + uint32_t seq; + uint32_t checksum; +}; + +static uint32_t compute_checksum(MpPayload const& p) +{ + return p.magic ^ p.pub_id ^ p.seq ^ 0xDEADBEEF; +} + +/// Per-subscriber tallies shipped to the parent over a pipe. +struct SubReport +{ + uint64_t received; + uint64_t lost; + uint64_t corrupted; + uint64_t bad_pub_id; + uint64_t reorders; +}; + +// --- Seeded PRNG (publisher start jitter; KICKMSG_MP_SEED replays a run) ---- +namespace +{ + uint64_t g_rng_state = 0; + + uint64_t next_rand() // splitmix64 + { + g_rng_state += 0x9E3779B97F4A7C15ull; + uint64_t z = g_rng_state; + z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull; + z = (z ^ (z >> 27)) * 0x94D049BB133111EBull; + return z ^ (z >> 31); + } + + uint64_t seed_fuzzer() + { + uint64_t seed; + char const* env = std::getenv("KICKMSG_MP_SEED"); + if (env != nullptr) + { + seed = std::strtoull(env, nullptr, 0); + } + else + { + seed = static_cast(monotonic_ns().count()) + ^ (static_cast(::getpid()) << 32); + } + g_rng_state = seed; + return seed; + } +} + +// --- Pipe helpers (EINTR + short read/write) --------------------------------- + +static bool read_all(int fd, void* buf, std::size_t len) +{ + auto* p = static_cast(buf); + while (len > 0) + { + ssize_t n = ::read(fd, p, len); + if (n < 0) + { + if (errno == EINTR) + { + continue; + } + return false; + } + if (n == 0) // peer closed before sending everything + { + return false; + } + p += n; + len -= static_cast(n); + } + return true; +} + +static bool write_all(int fd, void const* buf, std::size_t len) +{ + auto const* p = static_cast(buf); + while (len > 0) + { + ssize_t n = ::write(fd, p, len); + if (n < 0) + { + if (errno == EINTR) + { + continue; + } + return false; + } + p += n; + len -= static_cast(n); + } + return true; +} + +/// Aborts on fork failure: an unchecked -1 return would make kill(-1, ...) +/// wipe the entire process group. +static pid_t checked_fork(char const* site) +{ + pid_t p = fork(); + if (p < 0) + { + std::fprintf(stderr, "fork() failed at %s: errno=%d\n", site, errno); + std::_Exit(2); + } + return p; +} + +// --- Children ----------------------------------------------------------------- + +/// Publisher child: opens the region, blocks on the go-pipe (EOF = release), +/// then publishes MSGS_PER_PUB checksummed payloads. +static int child_publisher_main(int pub_id, int go_rfd, uint64_t jitter_us) +{ + auto region = kickmsg::SharedRegion::open(SHM_NAME); + kickmsg::Publisher pub(region); + + // Wait for the parent's release: read returns 0 once every write end of + // the go pipe is closed, so all publishers unblock together. + char b; + while (::read(go_rfd, &b, 1) < 0 and errno == EINTR) + { + } + ::close(go_rfd); + + kickmsg::sleep(microseconds{static_cast(jitter_us)}); + + for (uint32_t seq = 0; seq < MSGS_PER_PUB; ++seq) + { + MpPayload msg; + msg.magic = MpPayload::MAGIC; + msg.pub_id = static_cast(pub_id); + msg.seq = seq; + msg.checksum = compute_checksum(msg); + + int32_t rc; + while ((rc = pub.send(&msg, sizeof(msg))) < 0) + { + if (rc != -EAGAIN) + { + std::fprintf(stderr, " [FATAL] publisher %d: send() returned %d\n", pub_id, rc); + return 3; + } + kickmsg::yield(); + } + } + return 0; +} + +/// Subscriber child: constructs its Subscriber, signals readiness, then +/// consumes until every published position is accounted for (received or +/// lost or in an error bucket), and ships the tallies to the parent. +static int child_subscriber_main(int sub_id, int ready_wfd, int report_wfd) +{ + auto region = kickmsg::SharedRegion::open(SHM_NAME); + kickmsg::Subscriber sub(region); + + char const ready = 1; + if (not write_all(ready_wfd, &ready, 1)) + { + return 4; + } + ::close(ready_wfd); + + SubReport rep{}; + uint32_t last_seq[NUM_PUBS]; + bool have_last[NUM_PUBS] = {}; + + // Conservation doubles as the stop condition: with the ring Live before + // the first send, every position ends up in exactly one bucket, so the + // accounted total reaches TOTAL_SENT iff nothing vanished. The deadline + // turns a conservation bug into a parent-side oracle failure instead of + // a hang. + auto const deadline = monotonic_ns() + seconds{60}; + + while (true) + { + uint64_t accounted = rep.received + rep.corrupted + rep.bad_pub_id + + rep.reorders + sub.lost(); + if (accounted >= TOTAL_SENT or monotonic_ns() >= deadline) + { + break; + } + + auto sample = sub.receive(100ms); + if (not sample) + { + continue; + } + + if (sample->len() != sizeof(MpPayload)) + { + ++rep.corrupted; + continue; + } + + MpPayload msg; + std::memcpy(&msg, sample->data(), sizeof(msg)); + if (msg.magic != MpPayload::MAGIC or msg.checksum != compute_checksum(msg)) + { + ++rep.corrupted; + continue; + } + if (msg.pub_id >= static_cast(NUM_PUBS)) + { + ++rep.bad_pub_id; + continue; + } + + if (have_last[msg.pub_id] and msg.seq <= last_seq[msg.pub_id]) + { + if (rep.reorders < 10) + { + std::fprintf(stderr, " [REORDER] sub%d: pub %u seq %u after %u @pos %" PRIu64 "\n", + sub_id, msg.pub_id, msg.seq, last_seq[msg.pub_id], + sample->ring_pos()); + } + ++rep.reorders; + continue; + } + last_seq[msg.pub_id] = msg.seq; + have_last[msg.pub_id] = true; + ++rep.received; + } + + rep.lost = sub.lost(); + if (not write_all(report_wfd, &rep, sizeof(rep))) + { + return 5; + } + ::close(report_wfd); + return 0; +} + +// --- Final structural checks (mirrors stall_repair_test) ----------------------- + +static bool verify_rings_free(kickmsg::SharedRegion& region) +{ + auto* hdr = region.header(); + bool ok = true; + for (uint32_t i = 0; i < hdr->max_subs; ++i) + { + auto* ring = kickmsg::sub_ring_at(region.base(), hdr, i); + uint32_t packed = ring->state_flight.load(std::memory_order_acquire); + if (kickmsg::ring::get_state(packed) != kickmsg::ring::Free) + { + std::fprintf(stderr, " [FAIL] ring %u not Free after teardown (state=%u)\n", + i, kickmsg::ring::get_state(packed)); + ok = false; + } + if (kickmsg::ring::get_in_flight(packed) != 0) + { + std::fprintf(stderr, " [FAIL] ring %u has in_flight=%u after teardown\n", + i, kickmsg::ring::get_in_flight(packed)); + ok = false; + } + } + return ok; +} + +/// Every slot must either be a member of the free stack or have refcount 0 +/// (a free-stack walk also catches out-of-range and duplicate links, i.e. +/// refcount-driven double-pushes). +static bool verify_slots(kickmsg::SharedRegion& region) +{ + auto* base = region.base(); + auto* hdr = region.header(); + + std::vector in_free(hdr->pool_size, false); + uint32_t top = kickmsg::tagged_idx(hdr->free_top.load(std::memory_order_acquire)); + + while (top != kickmsg::INVALID_SLOT) + { + if (top >= hdr->pool_size) + { + std::fprintf(stderr, " [FAIL] free stack contains out-of-range index %u\n", top); + return false; + } + if (in_free[top]) + { + std::fprintf(stderr, " [FAIL] free stack contains duplicate slot %u " + "(refcount double-release)\n", top); + return false; + } + in_free[top] = true; + + auto* slot = kickmsg::slot_at(base, hdr, top); + top = slot->next_free; + } + + bool ok = true; + for (uint32_t i = 0; i < hdr->pool_size; ++i) + { + if (in_free[i]) + { + continue; + } + auto* slot = kickmsg::slot_at(base, hdr, i); + uint32_t rc = slot->refcount; + if (rc != 0) + { + std::fprintf(stderr, " [FAIL] slot %u not in free stack and refcount=%u\n", i, rc); + ok = false; + } + } + return ok; +} + +// --- Parent ------------------------------------------------------------------- + +/// Reaps one child; returns false (and reports) on nonzero exit or +/// signal-death. +static bool reap_child(pid_t pid, char const* role, int id) +{ + int st = 0; + pid_t w = waitpid(pid, &st, 0); + if (w != pid) + { + std::fprintf(stderr, " [FAIL] waitpid(%s %d) returned %d (errno=%d)\n", + role, id, static_cast(w), errno); + return false; + } + if (WIFSIGNALED(st)) + { + std::fprintf(stderr, " [FAIL] %s %d killed by signal %d\n", role, id, WTERMSIG(st)); + return false; + } + if (not WIFEXITED(st) or WEXITSTATUS(st) != 0) + { + std::fprintf(stderr, " [FAIL] %s %d exited with status 0x%x\n", role, id, st); + return false; + } + return true; +} + +int main() +{ + std::printf("=== Kickmsg Multi-Process MPMC Stress Test ===\n"); + kickmsg_test::register_cleanup_shm("/kickmsg_mp_stress"); + kickmsg_test::install_signal_cleanup(); + + uint64_t const seed = seed_fuzzer(); + std::printf("mp seed=%llu (set KICKMSG_MP_SEED to replay): %d pubs x %u msgs, %d subs\n", + static_cast(seed), NUM_PUBS, MSGS_PER_PUB, NUM_SUBS); + + kickmsg::SharedMemory::unlink(SHM_NAME); + + kickmsg::channel::Config cfg; + cfg.max_subscribers = 4; + cfg.sub_ring_capacity = 64; + cfg.pool_size = 256; + cfg.max_payload_size = 64; + + auto region = kickmsg::SharedRegion::create( + SHM_NAME, kickmsg::channel::PubSub, cfg, "mp_stress_test"); + + // Draw publisher start jitters from the parent's PRNG (before forking) + // so a seed replays the exact schedule. + uint64_t jitter_us[NUM_PUBS]; + for (int i = 0; i < NUM_PUBS; ++i) + { + jitter_us[i] = next_rand() % 2000; + } + + int ready_pipe[2]; + int go_pipe[2]; + int report_pipe[NUM_SUBS][2]; + bool pipes_ok = (pipe(ready_pipe) == 0) and (pipe(go_pipe) == 0); + for (int i = 0; i < NUM_SUBS; ++i) + { + pipes_ok = pipes_ok and (pipe(report_pipe[i]) == 0); + } + if (not pipes_ok) + { + std::fprintf(stderr, "pipe() failed: errno=%d\n", errno); + return 2; + } + + // Flush before forking: children inherit the stdio buffer and would + // replay the banner on their own exit. + std::fflush(nullptr); + + // Fork subscribers first: they must claim their rings (and say so) + // before any publisher is released. + pid_t sub_pid[NUM_SUBS]; + for (int i = 0; i < NUM_SUBS; ++i) + { + sub_pid[i] = checked_fork("subscriber"); + if (sub_pid[i] == 0) + { + ::close(ready_pipe[0]); + // Close both go-pipe ends: a surviving write-end copy here would + // defeat the EOF release below. + ::close(go_pipe[0]); + ::close(go_pipe[1]); + for (int j = 0; j < NUM_SUBS; ++j) + { + ::close(report_pipe[j][0]); + if (j != i) + { + ::close(report_pipe[j][1]); + } + } + int rc = child_subscriber_main(i, ready_pipe[1], report_pipe[i][1]); + std::fflush(nullptr); + _exit(rc); + } + } + + pid_t pub_pid[NUM_PUBS]; + for (int i = 0; i < NUM_PUBS; ++i) + { + pub_pid[i] = checked_fork("publisher"); + if (pub_pid[i] == 0) + { + ::close(ready_pipe[0]); + ::close(ready_pipe[1]); + ::close(go_pipe[1]); + for (int j = 0; j < NUM_SUBS; ++j) + { + ::close(report_pipe[j][0]); + ::close(report_pipe[j][1]); + } + int rc = child_publisher_main(i, go_pipe[0], jitter_us[i]); + std::fflush(nullptr); + _exit(rc); + } + } + + ::close(ready_pipe[1]); + ::close(go_pipe[0]); + for (int i = 0; i < NUM_SUBS; ++i) + { + ::close(report_pipe[i][1]); + } + + // Readiness gate: one byte per subscriber, then release the publishers + // by closing the last write end of the go pipe. + char ready_buf[NUM_SUBS]; + if (not read_all(ready_pipe[0], ready_buf, sizeof(ready_buf))) + { + std::fprintf(stderr, " [FAIL] subscribers died before signaling readiness\n"); + return 2; + } + ::close(ready_pipe[0]); + std::printf("subscribers ready, releasing publishers\n"); + std::fflush(stdout); + + nanoseconds const t0 = kickmsg::monotonic_ns(); + ::close(go_pipe[1]); + + bool all_ok = true; + + for (int i = 0; i < NUM_PUBS; ++i) + { + all_ok &= reap_child(pub_pid[i], "publisher", i); + } + std::printf("publishers done (%lld ms), draining subscribers\n", + static_cast( + duration_cast(kickmsg::elapsed_time(t0)).count())); + std::fflush(stdout); + + SubReport reports[NUM_SUBS] = {}; + for (int i = 0; i < NUM_SUBS; ++i) + { + if (not read_all(report_pipe[i][0], &reports[i], sizeof(reports[i]))) + { + std::fprintf(stderr, " [FAIL] subscriber %d report read failed (child died?)\n", i); + all_ok = false; + } + ::close(report_pipe[i][0]); + all_ok &= reap_child(sub_pid[i], "subscriber", i); + } + + std::printf(" %-6s %10s %10s %10s %10s %10s\n", + "sub", "received", "lost", "corrupt", "bad_pid", "reorder"); + for (int i = 0; i < NUM_SUBS; ++i) + { + SubReport const& r = reports[i]; + std::printf(" sub%-3d %10" PRIu64 " %10" PRIu64 " %10" PRIu64 + " %10" PRIu64 " %10" PRIu64 "\n", + i, r.received, r.lost, r.corrupted, r.bad_pub_id, r.reorders); + + if (r.corrupted > 0) + { + std::fprintf(stderr, " [FAIL] sub%d: %" PRIu64 " corrupted messages\n", i, r.corrupted); + all_ok = false; + } + if (r.bad_pub_id > 0) + { + std::fprintf(stderr, " [FAIL] sub%d: %" PRIu64 " bad publisher IDs\n", i, r.bad_pub_id); + all_ok = false; + } + if (r.reorders > 0) + { + std::fprintf(stderr, " [FAIL] sub%d: %" PRIu64 " reordered messages\n", i, r.reorders); + all_ok = false; + } + if (r.received == 0) + { + std::fprintf(stderr, " [FAIL] sub%d: received 0 messages\n", i); + all_ok = false; + } + uint64_t accounted = r.received + r.lost + r.corrupted + r.bad_pub_id + r.reorders; + if (accounted != TOTAL_SENT) + { + std::fprintf(stderr, " [FAIL] sub%d: accounted %" PRIu64 " != total sent %" PRIu64 + " (messages vanished or duplicated)\n", i, accounted, TOTAL_SENT); + all_ok = false; + } + } + + // No-crash GC oracle: every child exited cleanly, so repair must find + // nothing and reclaims are bounded by observed publisher drops (a + // descheduled publisher whose lock got stolen books a drop; the steal + // leaks one slot ref that only reclaim recovers). + uint64_t dropped = 0; + for (uint32_t i = 0; i < cfg.max_subscribers; ++i) + { + auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), i); + dropped += ring->dropped_count.load(std::memory_order_acquire); + } + std::size_t repaired = region.repair_locked_entries(); + if (repaired != 0) + { + std::fprintf(stderr, " [FAIL] repair_locked_entries fixed %zu entries on a no-crash run\n", + repaired); + all_ok = false; + } + std::size_t reclaimed = region.reclaim_orphaned_slots(); + if (reclaimed > dropped) + { + std::fprintf(stderr, " [FAIL] reclaimed %zu slots but only %" PRIu64 + " publisher drops can account for steal residue\n", reclaimed, dropped); + all_ok = false; + } + else if (reclaimed != 0) + { + std::printf(" [NOTE] %zu slot(s) reclaimed, within the %" PRIu64 "-drop steal budget\n", + reclaimed, dropped); + } + + if (not verify_rings_free(region)) + { + all_ok = false; + } + if (not verify_slots(region)) + { + all_ok = false; + } + + kickmsg::SharedMemory::unlink(SHM_NAME); + + int pass = 0; + int fail = 0; + if (all_ok) + { + std::printf("\n [PASS]\n"); + pass = 1; + } + else + { + std::printf("\n [FAIL]\n"); + fail = 1; + } + std::printf("=== Summary: %d passed, %d failed ===\n", pass, fail); + + if (all_ok) + { + return 0; + } + return 1; +} diff --git a/tests/shm_cleanup.h b/tests/shm_cleanup.h new file mode 100644 index 0000000..fb472c7 --- /dev/null +++ b/tests/shm_cleanup.h @@ -0,0 +1,61 @@ +#ifndef KICKMSG_TESTS_SHM_CLEANUP_H +#define KICKMSG_TESTS_SHM_CLEANUP_H + +// Best-effort shm cleanup for the test binaries: on SIGINT/SIGTERM, unlink the +// registered segments so an interrupted run leaves no stale /dev/shm entry for +// the next run to trip over. Names must be string literals (static storage); +// the handler only calls shm_unlink + _exit, both async-signal-safe. +// SIGKILL (-9) cannot be caught -- those leftovers are handled by each +// scenario's unlink-before-create instead. + +#ifndef _WIN32 +#include +#include +#include +#include + +namespace kickmsg_test +{ + inline constexpr int MAX_CLEANUP = 32; + // volatile: the elements are read from a signal handler. + inline char const* volatile g_cleanup_names[MAX_CLEANUP] = {}; + inline volatile sig_atomic_t g_cleanup_count = 0; + + inline void shm_cleanup_handler(int sig) + { + for (sig_atomic_t i = 0; i < g_cleanup_count; ++i) + { + ::shm_unlink(g_cleanup_names[i]); + } + ::_exit(128 + sig); + } + + inline void register_cleanup_shm(char const* name) + { + if (g_cleanup_count < MAX_CLEANUP) + { + g_cleanup_names[g_cleanup_count] = name; + g_cleanup_count = g_cleanup_count + 1; + } + } + + inline void install_signal_cleanup() + { + struct sigaction sa; + std::memset(&sa, 0, sizeof(sa)); + sa.sa_handler = shm_cleanup_handler; + sigemptyset(&sa.sa_mask); + sa.sa_flags = 0; + ::sigaction(SIGINT, &sa, nullptr); + ::sigaction(SIGTERM, &sa, nullptr); + } +} +#else +namespace kickmsg_test +{ + inline void register_cleanup_shm(char const*) {} + inline void install_signal_cleanup() {} +} +#endif + +#endif diff --git a/tests/soak_all.sh b/tests/soak_all.sh new file mode 100755 index 0000000..8ae47a8 --- /dev/null +++ b/tests/soak_all.sh @@ -0,0 +1,240 @@ +#!/bin/bash +# Cycles weighted test profiles until a deadline, each a time-sliced endurance +# pass; aggregates results, persists failing runs, survives any slice failing. +# Self-detaches (survives logout), runs under caffeinate on macOS, and prints +# a watch/stop dashboard before returning. +# +# Usage: soak_all.sh [total_secs] [slice_secs] +# total_secs : wall-clock budget (default 604800 = 1 week) +# slice_secs : seconds per profile slice (default 1800 = 30 min) +# SOAK_FOREGROUND=1 run inline instead of detaching +# PLAIN_BIN / CRASH_BIN / TSAN_BIN override binaries (TSAN optional) +# +# NOT 'set -e': a failing slice records and continues. +set -uo pipefail + +TOTAL_SECS="${1:-604800}" +SLICE_SECS="${2:-1800}" +ROOT="$(cd "$(dirname "$0")/.." && pwd)" +SELF="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")" +BASE_LOGDIR="${SOAK_LOGDIR:-$ROOT/soak_logs}" + +# --- Launcher: detach + caffeinate, print dashboard, return ----------------- +if [ -z "${KICKMSG_SOAK_RUNDIR:-}" ] && [ -z "${SOAK_FOREGROUND:-}" ]; then + RUNDIR="$BASE_LOGDIR/run_$(date +%Y%m%d_%H%M%S)" + mkdir -p "$RUNDIR" + ln -sfn "$RUNDIR" "$BASE_LOGDIR/latest" 2>/dev/null || true + export KICKMSG_SOAK_RUNDIR="$RUNDIR" + caf="" + caf_label="off (non-macOS)" + if [ "$(uname)" = "Darwin" ] && command -v caffeinate >/dev/null 2>&1; then + caf="caffeinate -i" + caf_label="on (idle-sleep blocked)" + fi + nohup $caf "$SELF" "$TOTAL_SECS" "$SLICE_SECS" "$RUNDIR/stdout.log" 2>&1 & + pid=$! + echo "$pid" > "$RUNDIR/soak.pid" + M="$RUNDIR/soak_all.log" + cat < "$LOGDIR/soak.pid" + +PLAIN="${PLAIN_BIN:-$ROOT/build/kickmsg_stress_test}" +CRASH="${CRASH_BIN:-$ROOT/build/kickmsg_crash_test}" +STALL="${STALL_BIN:-$ROOT/build/kickmsg_stall_repair_test}" +MP="${MP_BIN:-$ROOT/build/kickmsg_mp_stress_test}" +TSAN="${TSAN_BIN:-$ROOT/build_tsan/kickmsg_stress_test}" +ASAN="${ASAN_BIN:-$ROOT/build_asan/kickmsg_stress_test}" +TSAN_SUPP="$ROOT/tests/tsan.supp" +ENDURANCE="$ROOT/tests/endurance.sh" + +# Weighted cycle: repeats encode priority. TSAN is the rarest, highest-value +# signal, so it takes half the slices; crash fuzz a quarter; plain stress the +# rest as periodic sanity at two contention levels. Oversub is kept <=200 on +# purpose -- a single oversub-300 run can take tens of minutes and blow past a +# slice boundary (endurance.sh only checks the clock between runs). +# A profile whose binary is absent is skipped, not fatal. +PROFILES=( + "tsan-150|$TSAN|--oversub 150" + "crash|$CRASH|" + "tsan-150|$TSAN|--oversub 150" + "stress-150|$PLAIN|--oversub 150" + "stall|$STALL|" + "mp|$MP|" + "tsan-150|$TSAN|--oversub 150" + "crash|$CRASH|" + "tsan-150|$TSAN|--oversub 150" + "asan-150|$ASAN|--oversub 150" + "stress-200|$PLAIN|--oversub 200" +) + +TOTAL_SLICES=0 +TOTAL_PASS=0 +TOTAL_FAIL=0 +TOTAL_SAN=0 +HARNESS_FAIL=0 +START=$(date +%s) +DEADLINE=$((START + TOTAL_SECS)) + +log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$MASTER"; } + +summary() { + echo "" | tee -a "$MASTER" + log "=== SOAK SUMMARY ===" + log "slices=$TOTAL_SLICES pass_scenarios=$TOTAL_PASS fail_scenarios=$TOTAL_FAIL san_hits=$TOTAL_SAN harness_fails=$HARNESS_FAIL" + # Per-profile breakdown via awk over the recorded slice lines (no + # associative arrays -- this must run on macOS bash 3.2). + echo "--- per-profile ---" | tee -a "$MASTER" + grep -hE "^SLICE " "$MASTER" 2>/dev/null \ + | awk '{p=$0; sub(/.*profile=/,"",p); sub(/ .*/,"",p); + f=$0; sub(/.*fail=/,"",f); sub(/ .*/,"",f); + n[p]++; ff[p]+=f} + END{for(k in n) printf " %-12s slices=%d fail=%d\n", k, n[k], ff[k]}' \ + | tee -a "$MASTER" + if [ "$TOTAL_FAIL" -gt 0 ] || [ "$HARNESS_FAIL" -gt 0 ]; then + log "VERDICT: FAILURES DETECTED -- evidence under $LOGDIR/fails/" + elif [ "$TOTAL_SLICES" -eq 0 ] || [ "$TOTAL_PASS" -eq 0 ]; then + # Zero work done must never read as a clean week: a stale build + # dir or wrong cwd would otherwise soak nothing and report green. + log "VERDICT: NO WORK DONE -- check binaries and run dir" + else + log "VERDICT: ALL CLEAN" + fi +} + +on_signal() { + summary + rc=130 + if [ "$TOTAL_FAIL" -gt 0 ] || [ "$HARNESS_FAIL" -gt 0 ]; then + rc=1 + fi + exit "$rc" +} +trap on_signal INT TERM + +log "soak start: total=${TOTAL_SECS}s slice=${SLICE_SECS}s deadline_epoch=$DEADLINE" +log "binaries: plain=$([ -x "$PLAIN" ] && echo yes || echo NO) crash=$([ -x "$CRASH" ] && echo yes || echo NO) stall=$([ -x "$STALL" ] && echo yes || echo NO) mp=$([ -x "$MP" ] && echo yes || echo NO) tsan=$([ -x "$TSAN" ] && echo yes || echo NO) asan=$([ -x "$ASAN" ] && echo yes || echo NO)" + +if [ ! -x "$PLAIN" ] && [ ! -x "$CRASH" ] && [ ! -x "$TSAN" ]; then + # Without this guard the while loop below spins at full speed for the + # whole budget, appending a skip line per iteration and ending ALL CLEAN. + log "FATAL: no profile binary is executable -- nothing to soak" + summary + exit 2 +fi + +i=0 +while [ "$(date +%s)" -lt "$DEADLINE" ]; do + n=${#PROFILES[@]} + idx=$((i % n)) + i=$((i + 1)) + entry="${PROFILES[$idx]}" + label="${entry%%|*}" + rest="${entry#*|}" + bin="${rest%%|*}" + extra="${rest#*|}" + if [ "$extra" = "$bin" ]; then + extra="" + fi + + if [ ! -x "$bin" ]; then + log "skip profile=$label (binary absent: $bin)" + continue + fi + + now=$(date +%s) + remain=$((DEADLINE - now)) + if [ "$remain" -lt 5 ]; then + break + fi + this="$SLICE_SECS" + if [ "$remain" -lt "$this" ]; then + this="$remain" + fi + + TOTAL_SLICES=$((TOTAL_SLICES + 1)) + slog="$SLICEDIR/$(printf '%04d' "$TOTAL_SLICES")_${label}.log" + log "slice $TOTAL_SLICES profile=$label dur=${this}s -> $slog" + + # Per-slice evidence dir (slice number in the path: evidence from one + # slice must never overwrite another's); sanitizer options only for + # TSAN slices. + export FAILDIR="$LOGDIR/fails/${label}_s$(printf '%04d' "$TOTAL_SLICES")" + # abort_on_error=1 on top of halt_on_error: the report is followed by + # abort() instead of exit(), so a core dump is captured when ulimits allow. + case "$label" in + tsan*) + supp="" + if [ -f "$TSAN_SUPP" ]; then + supp="suppressions=$TSAN_SUPP:" + fi + export TSAN_OPTIONS="${supp}halt_on_error=1:abort_on_error=1:exitcode=66" + unset ASAN_OPTIONS UBSAN_OPTIONS 2>/dev/null || true + ;; + asan*) + export ASAN_OPTIONS="halt_on_error=1:abort_on_error=1:exitcode=66" + export UBSAN_OPTIONS="halt_on_error=1:abort_on_error=1" + unset TSAN_OPTIONS 2>/dev/null || true + ;; + *) + unset TSAN_OPTIONS ASAN_OPTIONS UBSAN_OPTIONS 2>/dev/null || true + ;; + esac + + rc=0 + "$ENDURANCE" "$bin" "$this" $extra > "$slog" 2>&1 || rc=$? + + p=$(grep -E "Scenarios passed:" "$slog" | grep -oE '[0-9]+' | tail -1 || true) + f=$(grep -E "Scenarios failed:" "$slog" | grep -oE '[0-9]+' | tail -1 || true) + sa=$(grep -E "Sanitizer hits:" "$slog" | grep -oE '[0-9]+' | tail -1 || true) + v=$(grep -E "VERDICT:" "$slog" | tail -1 || true) + p="${p:-0}" + f="${f:-0}" + sa="${sa:-0}" + # No VERDICT line means endurance.sh itself died (bash error, OOM kill, + # stray pkill) before finishing the slice -- a harness failure, distinct + # from a test failure, and never silently counted as a clean slice. + hfail=0 + if [ -z "$v" ]; then + hfail=1 + HARNESS_FAIL=$((HARNESS_FAIL + 1)) + mkdir -p "$LOGDIR/fails" + cp "$slog" "$LOGDIR/fails/harness_slice_$(printf '%04d' "$TOTAL_SLICES").log" 2>/dev/null || true + fi + TOTAL_PASS=$((TOTAL_PASS + p)) + TOTAL_FAIL=$((TOTAL_FAIL + f)) + TOTAL_SAN=$((TOTAL_SAN + sa)) + # Machine-greppable record for the per-profile breakdown above. + echo "SLICE $TOTAL_SLICES profile=$label pass=$p fail=$f san=$sa rc=$rc hfail=$hfail" >> "$MASTER" + log " done profile=$label pass=$p fail=$f san=$sa rc=$rc ${v:-(no verdict line -- HARNESS FAILURE)}" +done + +summary +if [ "$TOTAL_FAIL" -gt 0 ] || [ "$HARNESS_FAIL" -gt 0 ]; then + exit 1 +fi +if [ "$TOTAL_SLICES" -eq 0 ] || [ "$TOTAL_PASS" -eq 0 ]; then + exit 2 +fi +exit 0 diff --git a/tests/soak_status.sh b/tests/soak_status.sh new file mode 100755 index 0000000..97677e2 --- /dev/null +++ b/tests/soak_status.sh @@ -0,0 +1,81 @@ +#!/bin/bash +# Compact live summary of a soak_all.sh run. +# Usage: soak_status.sh [run_dir] (defaults to soak_logs/latest) +set -uo pipefail +ROOT="$(cd "$(dirname "$0")/.." && pwd)" +RUNDIR="${1:-${SOAK_LOGDIR:-$ROOT/soak_logs}/latest}" +M="$RUNDIR/soak_all.log" +if [ ! -f "$M" ]; then + echo "no soak log at $M -- run tests/soak_all.sh first" + exit 1 +fi + +fmt() { # seconds -> "Hh MMm SSs" + local s="$1" + printf '%dh %02dm %02ds' "$((s / 3600))" "$(((s % 3600) / 60))" "$((s % 60))" +} + +pid="" +[ -f "$RUNDIR/soak.pid" ] && pid="$(cat "$RUNDIR/soak.pid" 2>/dev/null)" +if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then + state="RUNNING (pid $pid)" +else + # A dead worker is only "finished" if it got far enough to write a + # VERDICT; anything else died mid-run and must not look like success. + verdict="$(grep -E 'VERDICT:' "$M" 2>/dev/null | tail -1 | sed -E 's/.*(VERDICT:.*)/\1/' || true)" + if [ -n "$verdict" ]; then + state="finished ($verdict)" + else + state="DIED EARLY -- no verdict; check $RUNDIR/stdout.log" + fi +fi + +start_line="$(grep -m1 'soak start' "$M" 2>/dev/null || true)" +total="$(echo "$start_line" | grep -oE 'total=[0-9]+' | grep -oE '[0-9]+' || true)" +deadline="$(echo "$start_line" | grep -oE 'deadline_epoch=[0-9]+' | grep -oE '[0-9]+' || true)" +total="${total:-0}" +deadline="${deadline:-0}" +now="$(date +%s)" +elapsed=0 +remain=0 +if [ "$deadline" -gt 0 ]; then + elapsed=$((now - (deadline - total))) + remain=$((deadline - now)) + [ "$remain" -lt 0 ] && remain=0 +fi + +# grep -c always prints a count but exits 1 when zero -- don't add `|| echo 0` +# or it double-prints. set -e is off, so a non-zero exit here is harmless. +slices="$(grep -c '^SLICE ' "$M" 2>/dev/null)" +slices="${slices:-0}" +pass="$(grep '^SLICE ' "$M" 2>/dev/null | grep -oE 'pass=[0-9]+' | grep -oE '[0-9]+' | awk '{s+=$1} END{print s+0}')" +fail="$(grep '^SLICE ' "$M" 2>/dev/null | grep -oE 'fail=[0-9]+' | grep -oE '[0-9]+' | awk '{s+=$1} END{print s+0}')" +san="$(grep '^SLICE ' "$M" 2>/dev/null | grep -oE ' san=[0-9]+' | grep -oE '[0-9]+' | awk '{s+=$1} END{print s+0}')" +hfail="$(grep '^SLICE ' "$M" 2>/dev/null | grep -oE 'hfail=[0-9]+' | grep -oE '[0-9]+' | awk '{s+=$1} END{print s+0}')" +nfail="$(find "$RUNDIR/fails" -type f 2>/dev/null | wc -l | tr -d ' ')" + +echo "=== kickmsg soak status ===" +echo " state : $state" +echo " run dir : $RUNDIR" +echo " elapsed : $(fmt "$elapsed") of $(fmt "$total") (remaining $(fmt "$remain"))" +echo " slices : $slices done pass_scenarios=$pass fail_scenarios=$fail san_hits=$san harness_fails=$hfail fail_logs=$nfail" +echo " per-profile:" +grep '^SLICE ' "$M" 2>/dev/null \ + | awk '{p=$0; sub(/.*profile=/,"",p); sub(/ .*/,"",p); + f=$0; sub(/.*fail=/,"",f); sub(/ .*/,"",f); + n[p]++; ff[p]+=f} + END{for(k in n) printf " %-12s slices=%d fail=%d\n", k, n[k], ff[k]}' + +# Current/last slice and its latest live tally (\r-separated -> last field). +cur="$(grep -E ' slice [0-9]+ profile=' "$M" 2>/dev/null | tail -1 || true)" +if [ -n "$cur" ]; then + label="$(echo "$cur" | sed -E 's/.* profile=([^ ]+).*/\1/')" + slog="$(echo "$cur" | sed -E 's/.*-> //')" + tally="" + [ -f "$slog" ] && tally="$(tr '\r' '\n' < "$slog" | grep -E '^\[[0-9]+s/' | tail -1 || true)" + echo " current : $label ${tally:-(starting)}" +fi + +if [ "$nfail" -gt 0 ] || [ "$fail" -gt 0 ] || [ "$hfail" -gt 0 ]; then + echo " !! failures recorded -- see $RUNDIR/fails/" +fi diff --git a/tests/stall_repair_test.cc b/tests/stall_repair_test.cc new file mode 100644 index 0000000..14f39f4 --- /dev/null +++ b/tests/stall_repair_test.cc @@ -0,0 +1,490 @@ +/// @file stall_repair_test.cc +/// @brief False-positive-death fuzz test for the theft-safe commit protocol. +/// +/// A child publisher is SIGSTOPped at random instants, so it sometimes +/// freezes while holding a position-tagged entry lock. With a tight +/// commit_timeout the stall makes the lock "provably stale": an external +/// repairer (the parent) runs repair_locked_entries() during the stall and +/// steals the entry. The publisher is then SIGCONTed and resumes. +/// +/// The theft guard + CAS commit in Publisher::publish() must turn every +/// such steal into a clean publisher drop: +/// - never a torn payload (magic/checksum validated on every sample), +/// - never a per-publisher sequence rewind (seq strictly increasing), +/// - never refcount corruption (structural pool checks at the end). + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "shm_cleanup.h" +#include "kickmsg/os/Time.h" +#include "kickmsg/Publisher.h" +#include "kickmsg/Subscriber.h" + +using namespace kickmsg; + +static constexpr char const* SHM_NAME = "/kickmsg_stall_repair"; +static constexpr microseconds COMMIT_TIMEOUT = 2ms; + +struct StallPayload +{ + static constexpr uint32_t MAGIC = 0x57A11CAF; + uint32_t magic; + uint32_t pub_id; + uint32_t seq; + uint32_t checksum; +}; + +static uint32_t compute_checksum(StallPayload const& p) +{ + return p.magic ^ p.pub_id ^ p.seq ^ 0xDEADBEEF; +} + +// --- Seeded stall-timing fuzzer --------------------------------------------- +// Each SIGSTOP fires at a random instant so a long soak explores new stall +// windows instead of re-hitting a fixed schedule. The seed is logged at +// startup; set KICKMSG_STALL_SEED to replay a specific run. +namespace +{ + uint64_t g_rng_state = 0; + + uint64_t next_rand() // splitmix64 + { + g_rng_state += 0x9E3779B97F4A7C15ull; + uint64_t z = g_rng_state; + z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull; + z = (z ^ (z >> 27)) * 0x94D049BB133111EBull; + return z ^ (z >> 31); + } + + // Sleep a random duration in [lo_us, hi_us] inclusive (microsecond grain). + void sleep_rand(uint64_t lo_us, uint64_t hi_us) + { + uint64_t span = hi_us - lo_us + 1; + kickmsg::sleep(microseconds{static_cast(lo_us + next_rand() % span)}); + } + + uint64_t seed_fuzzer() + { + uint64_t seed; + char const* env = std::getenv("KICKMSG_STALL_SEED"); + if (env != nullptr) + { + seed = std::strtoull(env, nullptr, 0); + } + else + { + seed = static_cast(monotonic_ns().count()) + ^ (static_cast(::getpid()) << 32); + } + g_rng_state = seed; + return seed; + } + + volatile sig_atomic_t g_child_stop = 0; + + void child_stop_handler(int) + { + g_child_stop = 1; + } +} + +/// Aborts on fork failure: an unchecked -1 return would make kill(-1, ...) +/// wipe the entire process group. +static pid_t checked_fork(char const* site) +{ + pid_t p = fork(); + if (p < 0) + { + std::fprintf(stderr, "fork() failed at %s: errno=%d\n", site, errno); + std::_Exit(2); + } + return p; +} + +/// Child publisher: publishes checksummed payloads with a strictly +/// increasing seq in a tight loop until SIGTERM flips the stop flag. +static void child_publisher_main() +{ + // Replace the inherited shm-cleanup SIGTERM handler: the parent still + // uses the segment, so the child must convert SIGTERM into a clean loop + // exit instead of unlinking the region out from under it. + struct sigaction sa; + std::memset(&sa, 0, sizeof(sa)); + sa.sa_handler = child_stop_handler; + sigemptyset(&sa.sa_mask); + sa.sa_flags = 0; + ::sigaction(SIGTERM, &sa, nullptr); + + auto region = kickmsg::SharedRegion::open(SHM_NAME); + kickmsg::Publisher pub(region); + + uint32_t seq = 0; + while (g_child_stop == 0) + { + auto a = pub.allocate(); + if (a.data == nullptr) + { + kickmsg::yield(); + continue; + } + + StallPayload msg; + msg.magic = StallPayload::MAGIC; + msg.pub_id = 1; + msg.seq = seq; + msg.checksum = compute_checksum(msg); + std::memcpy(a.data, &msg, sizeof(msg)); + + pub.publish(sizeof(msg)); + // A dropped publish (theft detected) leaves a gap -- gaps are + // legitimate; the subscriber only rejects a seq going backward. + ++seq; + } +} + +// --- Subscriber thread ------------------------------------------------------- + +struct SubStats +{ + std::atomic received{0}; + std::atomic corrupted{0}; + std::atomic rewinds{0}; +}; + +static std::atomic g_sub_stop{false}; + +static void subscriber_main(kickmsg::SharedRegion& region, SubStats& stats) +{ + kickmsg::Subscriber sub(region); + + uint32_t last_seq = 0; + bool have_last = false; + + while (true) + { + auto sample = sub.receive(50ms); + if (not sample) + { + if (not g_sub_stop.load(std::memory_order_acquire)) + { + continue; + } + // try_receive can return null after exhausting its retry budget + // on a run of evicted/skip entries while messages remain; null + // with no lost() progress is the only true ring-empty signal. + uint64_t lost_before = sub.lost(); + sample = sub.try_receive(); + if (not sample) + { + if (sub.lost() == lost_before) + { + break; + } + continue; + } + } + + if (sample->len() != sizeof(StallPayload)) + { + ++stats.corrupted; + continue; + } + + StallPayload msg; + std::memcpy(&msg, sample->data(), sizeof(msg)); + if (msg.magic != StallPayload::MAGIC + or msg.pub_id != 1 + or msg.checksum != compute_checksum(msg)) + { + ++stats.corrupted; + continue; + } + + if (have_last and msg.seq <= last_seq) + { + // Cap the per-sample log: one corruption event can cascade into + // thousands of rewinds and flood the soak evidence dir. + if (stats.rewinds.load(std::memory_order_relaxed) < 10) + { + std::fprintf(stderr, " [REWIND] seq %u after %u @pos %" PRIu64 "\n", + msg.seq, last_seq, sample->ring_pos()); + } + else if (stats.rewinds.load(std::memory_order_relaxed) == 10) + { + std::fprintf(stderr, " [REWIND] ... further rewinds suppressed\n"); + } + ++stats.rewinds; + continue; + } + last_seq = msg.seq; + have_last = true; + ++stats.received; + } +} + +// --- Final structural checks -------------------------------------------------- + +static bool verify_rings_free(kickmsg::SharedRegion& region) +{ + auto* hdr = region.header(); + bool ok = true; + for (uint32_t i = 0; i < hdr->max_subs; ++i) + { + auto* ring = kickmsg::sub_ring_at(region.base(), hdr, i); + uint32_t packed = ring->state_flight.load(std::memory_order_acquire); + if (kickmsg::ring::get_state(packed) != kickmsg::ring::Free) + { + std::fprintf(stderr, " [FAIL] ring %u not Free after teardown (state=%u)\n", + i, kickmsg::ring::get_state(packed)); + ok = false; + } + if (kickmsg::ring::get_in_flight(packed) != 0) + { + std::fprintf(stderr, " [FAIL] ring %u has in_flight=%u after teardown\n", + i, kickmsg::ring::get_in_flight(packed)); + ok = false; + } + } + return ok; +} + +/// Every slot must either be a member of the free stack or have refcount 0 +/// (a free-stack walk also catches out-of-range and duplicate links, i.e. +/// refcount-driven double-pushes). +static bool verify_slots(kickmsg::SharedRegion& region) +{ + auto* base = region.base(); + auto* hdr = region.header(); + + std::vector in_free(hdr->pool_size, false); + uint32_t top = kickmsg::tagged_idx(hdr->free_top.load(std::memory_order_acquire)); + + while (top != kickmsg::INVALID_SLOT) + { + if (top >= hdr->pool_size) + { + std::fprintf(stderr, " [FAIL] free stack contains out-of-range index %u\n", top); + return false; + } + if (in_free[top]) + { + std::fprintf(stderr, " [FAIL] free stack contains duplicate slot %u " + "(refcount double-release)\n", top); + return false; + } + in_free[top] = true; + + auto* slot = kickmsg::slot_at(base, hdr, top); + top = slot->next_free; + } + + bool ok = true; + for (uint32_t i = 0; i < hdr->pool_size; ++i) + { + if (in_free[i]) + { + continue; + } + auto* slot = kickmsg::slot_at(base, hdr, i); + uint32_t rc = slot->refcount; + if (rc != 0) + { + std::fprintf(stderr, " [FAIL] slot %u not in free stack and refcount=%u\n", i, rc); + ok = false; + } + } + return ok; +} + +int main(int argc, char** argv) +{ + // ~8 ms per round (pre-stall jitter + staleness wait + grace pass) puts + // the default run at roughly 3 s of wall time. + int rounds = 400; + for (int i = 1; i < argc; ++i) + { + if (std::strcmp(argv[i], "--rounds") == 0 and i + 1 < argc) + { + int v = std::atoi(argv[i + 1]); + if (v > 0) + { + rounds = v; + } + ++i; + } + } + + std::printf("=== Kickmsg Stall/Repair False-Positive-Death Test ===\n"); + kickmsg_test::register_cleanup_shm("/kickmsg_stall_repair"); + kickmsg_test::install_signal_cleanup(); + + uint64_t const seed = seed_fuzzer(); + std::printf("stall fuzz seed=%llu (set KICKMSG_STALL_SEED to replay), rounds=%d\n", + static_cast(seed), rounds); + + kickmsg::SharedMemory::unlink(SHM_NAME); + + kickmsg::channel::Config cfg; + cfg.max_subscribers = 2; + cfg.sub_ring_capacity = 8; + cfg.pool_size = 32; + cfg.max_payload_size = 64; + cfg.commit_timeout = COMMIT_TIMEOUT; // tight: stalls are frequently declared stale + + auto region = kickmsg::SharedRegion::create( + SHM_NAME, kickmsg::channel::PubSub, cfg, "stall_repair_test"); + + // Fork the publisher BEFORE spawning any thread (fork in a multithreaded + // process would leave the child with a possibly-inconsistent heap). + pid_t pub_pid = checked_fork("publisher"); + if (pub_pid == 0) + { + child_publisher_main(); + _exit(0); + } + + SubStats stats; + std::thread sub_thread(subscriber_main, std::ref(region), std::ref(stats)); + + // Let the subscriber attach and the publisher spin up. + kickmsg::sleep(30ms); + + bool all_ok = true; + bool child_alive = true; + uint64_t steals = 0; + + for (int round = 0; round < rounds; ++round) + { + sleep_rand(0, 3000); + + kill(pub_pid, SIGSTOP); + int st = 0; + pid_t w = waitpid(pub_pid, &st, WUNTRACED); + if (w != pub_pid or not WIFSTOPPED(st)) + { + std::fprintf(stderr, " [FAIL] round %d: publisher gone before SIGSTOP " + "(w=%d, status=0x%x)\n", round, static_cast(w), st); + all_ok = false; + child_alive = false; + break; + } + + // 2x commit_timeout + jitter: any lock the stopped child holds is + // now provably stale to the repairer's grace pass. + sleep_rand(4000, 6000); + + steals += region.repair_locked_entries(); + + kill(pub_pid, SIGCONT); + + if ((round + 1) % 50 == 0) + { + std::printf(" round %d/%d: steals so far %" PRIu64 ", received %" PRIu64 "\n", + round + 1, rounds, steals, + stats.received.load(std::memory_order_relaxed)); + } + } + + // Stop the child gracefully and fail on signal-death. + if (child_alive) + { + kill(pub_pid, SIGTERM); + int st = 0; + waitpid(pub_pid, &st, 0); + if (not WIFEXITED(st) or WEXITSTATUS(st) != 0) + { + std::fprintf(stderr, " [FAIL] publisher did not exit cleanly (status=0x%x)\n", st); + all_ok = false; + } + } + + // Subscriber drains the residue, then exits. + g_sub_stop.store(true, std::memory_order_release); + sub_thread.join(); + + uint64_t const received = stats.received.load(std::memory_order_relaxed); + uint64_t const corrupted = stats.corrupted.load(std::memory_order_relaxed); + uint64_t const rewinds = stats.rewinds.load(std::memory_order_relaxed); + + std::printf(" Subscriber: received %" PRIu64 ", corrupted %" PRIu64 + ", rewinds %" PRIu64 "\n", received, corrupted, rewinds); + if (corrupted > 0) + { + std::fprintf(stderr, " [FAIL] %" PRIu64 " corrupted samples (torn payload)\n", corrupted); + all_ok = false; + } + if (rewinds > 0) + { + std::fprintf(stderr, " [FAIL] %" PRIu64 " sequence rewinds\n", rewinds); + all_ok = false; + } + if (received == 0) + { + std::fprintf(stderr, " [FAIL] subscriber received nothing\n"); + all_ok = false; + } + + // Final GC + structural checks (publisher dead, subscriber destroyed). + steals += region.repair_locked_entries(); + std::size_t const reclaimed = region.reclaim_orphaned_slots(); + + // Each steal can orphan at most one slot ref (entry_steal_and_clear + // deliberately leaks the displaced reference); more reclaims than + // steals means the normal path leaked. + std::printf(" Reclaimed slots: %zu (steal budget %" PRIu64 ")\n", reclaimed, steals); + if (reclaimed > steals) + { + std::fprintf(stderr, " [FAIL] reclaimed %zu slots but only %" PRIu64 + " steals can account for orphan residue\n", reclaimed, steals); + all_ok = false; + } + + if (not verify_rings_free(region)) + { + all_ok = false; + } + if (not verify_slots(region)) + { + all_ok = false; + } + + std::printf("Steals observed: %" PRIu64 "\n", steals); + if (steals == 0) + { + std::printf(" [WARN] no steal happened -- the dangerous window was never " + "sampled this run (probabilistic; not a failure)\n"); + } + + kickmsg::SharedMemory::unlink(SHM_NAME); + + int pass = 0; + int fail = 0; + if (all_ok) + { + std::printf("\n [PASS]\n"); + pass = 1; + } + else + { + std::printf("\n [FAIL]\n"); + fail = 1; + } + std::printf("=== Summary: %d passed, %d failed ===\n", pass, fail); + + if (all_ok) + { + return 0; + } + return 1; +} diff --git a/tests/stress/SCENARIOS.md b/tests/stress/SCENARIOS.md index e811a8a..48bb15a 100644 --- a/tests/stress/SCENARIOS.md +++ b/tests/stress/SCENARIOS.md @@ -4,6 +4,23 @@ All scenarios run as part of `kickmsg_stress_test`. Use `endurance.sh` for extended runs. TSAN builds scale message counts by 100x to keep runtime manageable. +Common oracles for the no-crash scenarios (everything except `gc_recovery` +and `live_repair`, which inject damage by design): + +- **Readiness barrier**: subscribers signal after constructing their + `Subscriber`; publishers wait for all of them before the first send. Every + ring is Live for the whole run, which makes per-subscriber accounting exact. +- **Exact conservation**: `received + lost (+ corrupt/bad/reorder) == + total_sent` per subscriber -- messages can neither vanish nor duplicate. +- **GC-zero**: `repair_locked_entries()` must fix 0 entries, and + `reclaim_orphaned_slots()` must reclaim no more slots than the publishers' + ring `dropped_count` total. Drops happen when a publisher descheduled past + `commit_timeout` has its entry lock stolen by a peer (`self_repair`); each + steal deliberately leaks one slot ref that only the GC recovers, so + reclaims within the drop budget are stall residue, not a leak. Any repair, + or any reclaim beyond that budget, is a normal-path bug that the GC would + otherwise silently mask before the structural verifies. + ## Treiber stack (`treiber.cc`) **What**: 8 threads × 100K pop/push cycles on the lock-free free-stack. @@ -21,7 +38,7 @@ linked structure, or slot duplication. **What**: 4 subscriber threads repeatedly join and leave (5 rounds each) while a publisher sends continuously. -**Why**: Exercises the full ring lifecycle: Free → Live → Draining → Free. +**Why**: Exercises the full ring lifecycle: Free -> Live -> Draining -> Free. Tests drain_unconsumed correctness, in_flight quiescence spin, and ring reuse. **Config**: max_subs=4, ring=32, pool=128, 10K messages. @@ -31,7 +48,7 @@ or refcount leak. ## GC recovery (`gc_recovery.cc`) -**What**: Manually poisons a ring entry with LOCKED_SEQUENCE and orphans a slot +**What**: Manually poisons a ring entry with a stale position-tagged lock and orphans a slot with refcount > 0. Calls `repair_locked_entries()` and `reclaim_orphaned_slots()` and verifies they fix both issues. @@ -46,7 +63,7 @@ orphaned slot, or corrupts the free-stack. **What**: 1 publisher × 16 subscribers, 100K messages. Measures the receive distribution spread (min vs max across subscribers). -**Why**: Verifies that per-subscriber rings provide equal service — a slow +**Why**: Verifies that per-subscriber rings provide equal service -- a slow subscriber should not starve fast ones. **Config**: ring=256, pool=512 (large enough for no eviction pressure). @@ -91,12 +108,12 @@ batch excess, or double-push corrupting the free-stack. ## Live repair (`live_repair.cc`) **What**: 4 publishers + 4 subscribers running for 2 seconds. A background -injector periodically poisons ring entries with LOCKED_SEQUENCE. A background +injector periodically poisons ring entries with stale position-tagged locks. A background healer calls `diagnose()` + `repair_locked_entries()`. **Why**: Validates the claim that `repair_locked_entries()` is safe under live traffic. The repair does a plain store to `sequence` while publishers may be -CAS-ing the same entry — this test verifies the "benign double-store" argument. +CAS-ing the same entry -- this test verifies the "benign double-store" argument. **Failure means**: Data corruption caused by repair racing with a live publisher, or repair failing to unblock a poisoned entry. @@ -107,12 +124,34 @@ or repair failing to unblock a poisoned entry. messages. **Why**: Every publish wraps and evicts the previous entry. Hammers -`wait_and_capture_slot` on every message. Tests the wrap + two-phase commit +`wait_for_commit` on every message. Tests the wrap + two-phase commit hot path with zero buffering. -**Failure means**: Eviction race, wait_and_capture_slot timeout under normal +**Failure means**: Eviction race, wait_for_commit timeout under normal load, or refcount corruption from immediate reuse. +## Big payload (`big_payload.cc`) + +**What**: 4 publishers × 50K messages of 8 KB through a small pool (16) and +tiny rings (8), so eviction constantly races readers. One subscriber consumes +by copy (`try_receive`), one zero-copy (`try_receive_view`). Each payload is a +header (magic, pub_id, seq, byte_count, FNV-1a checksum) followed by a +deterministic byte pattern over the full 8 KB; readers re-derive the pattern +and checksum for every byte of every sample. The zero-copy reader validates +through the `SampleView` twice -- a second pass failing after a clean first +pass proves the slot was overwritten while pinned. + +**Why**: The small `Payload` used elsewhere fits in one cache line, so a torn +read there is nearly impossible to observe. 8 KB spans many lines and takes +long enough to copy/validate that an eviction racing the pin/seqlock window +has a real chance of being caught. + +**Config**: pool=16, ring=8, max_subs=2, payload=8192 B, 4 pubs × 50K msgs. + +**Failure means**: Torn read (pin or seqlock violation), pin not held for the +view's lifetime, checksum/pattern corruption, conservation miss, or refcount +leak. + ## Subscriber saturation (`edge_cases.cc`) **What**: max_subs=4, attempt to create 6 subscribers. Verify 5th and 6th diff --git a/tests/stress/big_payload.cc b/tests/stress/big_payload.cc new file mode 100644 index 0000000..fa389ec --- /dev/null +++ b/tests/stress/big_payload.cc @@ -0,0 +1,354 @@ +#include "common.h" + +#include "kickmsg/Hash.h" + +// Large-payload torn-read hunt: 8 KB messages through a small pool (16) and +// tiny rings (8) so eviction constantly races readers. Every byte of every +// sample is validated against a deterministic pattern plus an FNV-1a checksum +// over the whole body; any mismatch is a torn read. The zero-copy reader +// validates the SampleView twice -- a second pass that fails after a clean +// first pass proves the slot was overwritten while pinned. + +namespace +{ + constexpr std::size_t BIG_PAYLOAD_SIZE = 8192; + + struct BigHeader + { + static constexpr uint32_t MAGIC = 0xB16FEED5; + uint32_t magic; + uint32_t pub_id; + uint32_t seq; + uint32_t byte_count; + uint64_t checksum; + }; + + constexpr std::size_t BIG_BODY_SIZE = BIG_PAYLOAD_SIZE - sizeof(BigHeader); + + constexpr int NUM_PUBS = 4; + constexpr int NUM_SUBS = 2; + + inline uint8_t pattern_byte(uint32_t pub_id, uint32_t seq, std::size_t i) + { + return static_cast(seq * 31u + pub_id * 131u + i); + } + + struct BigSubStats + { + uint64_t received = 0; + uint64_t lost = 0; + uint64_t corrupted = 0; + uint64_t torn_view = 0; + uint64_t reordered = 0; + }; + + // Full validation of one sample: header sanity, every body byte against + // the pattern, and the FNV-1a checksum over the whole body. + bool validate_big(uint8_t const* data, std::size_t len, int sub_id, char const* pass_label) + { + if (len != BIG_PAYLOAD_SIZE) + { + std::fprintf(stderr, " [FAIL] sub%d (%s): wrong-length sample (%zu bytes)\n", + sub_id, pass_label, len); + return false; + } + + BigHeader hdr; + std::memcpy(&hdr, data, sizeof(hdr)); + + if (hdr.magic != BigHeader::MAGIC + or hdr.pub_id >= static_cast(NUM_PUBS) + or hdr.byte_count != BIG_BODY_SIZE) + { + std::fprintf(stderr, " [FAIL] sub%d (%s): bad header (magic=%08x pub=%u bytes=%u)\n", + sub_id, pass_label, hdr.magic, hdr.pub_id, hdr.byte_count); + return false; + } + + uint8_t const* body = data + sizeof(BigHeader); + for (std::size_t i = 0; i < BIG_BODY_SIZE; ++i) + { + if (body[i] != pattern_byte(hdr.pub_id, hdr.seq, i)) + { + std::fprintf(stderr, " [FAIL] sub%d (%s): torn body at byte %zu " + "(pub %u seq %u: got %02x, want %02x)\n", + sub_id, pass_label, i, hdr.pub_id, hdr.seq, + body[i], pattern_byte(hdr.pub_id, hdr.seq, i)); + return false; + } + } + + uint64_t sum = kickmsg::hash::fnv1a_64(body, BIG_BODY_SIZE); + if (sum != hdr.checksum) + { + std::fprintf(stderr, " [FAIL] sub%d (%s): checksum mismatch " + "(pub %u seq %u: got %016" PRIx64 ", want %016" PRIx64 ")\n", + sub_id, pass_label, hdr.pub_id, hdr.seq, sum, hdr.checksum); + return false; + } + return true; + } + + void check_reorder(uint8_t const* data, std::vector& last_seq, BigSubStats& stats, + int sub_id) + { + BigHeader hdr; + std::memcpy(&hdr, data, sizeof(hdr)); + auto& prev = last_seq[hdr.pub_id]; + if (prev != UINT32_MAX and hdr.seq <= prev) + { + std::fprintf(stderr, " [FAIL] sub%d: pub %u seq %u after seq %u (reorder)\n", + sub_id, hdr.pub_id, hdr.seq, prev); + ++stats.reordered; + return; + } + prev = hdr.seq; + ++stats.received; + } +} + +bool run_big_payload() +{ + uint32_t const NUM_MSGS = 50000 / TSAN_SCALE; + + std::printf("--- Big payload: %d pubs x %u msgs of %zu B, pool=16, ring=8, " + "1 copy + 1 zerocopy sub ---\n", + NUM_PUBS, NUM_MSGS, BIG_PAYLOAD_SIZE); + + g_all_publishers_done = false; + g_subscribers_ready = 0; + g_subscribers_expected = NUM_SUBS; + + kickmsg::channel::Config cfg; + cfg.max_subscribers = NUM_SUBS; + cfg.sub_ring_capacity = 8; + cfg.pool_size = 16; + cfg.max_payload_size = BIG_PAYLOAD_SIZE; + + char const* shm_name = "/kickmsg_big_payload"; + kickmsg::SharedMemory::unlink(shm_name); + auto region = kickmsg::SharedRegion::create( + shm_name, kickmsg::channel::PubSub, cfg, "big_payload"); + + std::vector sub_stats(NUM_SUBS); + + auto pub_worker = [&](int pub_id) + { + kickmsg::Publisher pub{region}; + + wait_subscribers_ready(); + + std::vector buf(BIG_PAYLOAD_SIZE); + + for (uint32_t i = 0; i < NUM_MSGS; ++i) + { + uint8_t* body = buf.data() + sizeof(BigHeader); + for (std::size_t b = 0; b < BIG_BODY_SIZE; ++b) + { + body[b] = pattern_byte(static_cast(pub_id), i, b); + } + + BigHeader hdr; + hdr.magic = BigHeader::MAGIC; + hdr.pub_id = static_cast(pub_id); + hdr.seq = i; + hdr.byte_count = static_cast(BIG_BODY_SIZE); + hdr.checksum = kickmsg::hash::fnv1a_64(body, BIG_BODY_SIZE); + std::memcpy(buf.data(), &hdr, sizeof(hdr)); + + int32_t rc; + while ((rc = pub.send(buf.data(), buf.size())) < 0) + { + if (rc != -EAGAIN) + { + std::fprintf(stderr, " [FATAL] publisher %d: send() returned %d\n", + pub_id, rc); + std::abort(); + } + kickmsg::yield(); + } + } + }; + + // Copy reader: try_receive path (memcpy out of the slot under the pin). + auto copy_sub = [&](int sub_id) + { + kickmsg::Subscriber sub{region}; + g_subscribers_ready.fetch_add(1, std::memory_order_release); + + auto& stats = sub_stats[static_cast(sub_id)]; + std::vector last_seq(NUM_PUBS, UINT32_MAX); + auto const timeout = milliseconds{500}; + + while (true) + { + auto sample = sub.receive(timeout); + if (not sample) + { + if (g_all_publishers_done) + { + // try_receive can return null after exhausting its retry + // budget on a run of evicted entries; only "null with no + // lost() progress" proves the ring is empty. + uint64_t lost_before = sub.lost(); + sample = sub.try_receive(); + if (not sample) + { + if (sub.lost() == lost_before) + { + break; + } + continue; + } + } + else + { + continue; + } + } + + auto const* data = static_cast(sample->data()); + if (not validate_big(data, sample->len(), sub_id, "copy")) + { + ++stats.corrupted; + continue; + } + check_reorder(data, last_seq, stats, sub_id); + } + + stats.lost = sub.lost(); + }; + + // Zero-copy reader: validates THROUGH the SampleView twice. The pin must + // keep the slot immutable for the view's whole lifetime, so a second pass + // failing after a clean first pass is a pin violation, not a torn commit. + auto view_sub = [&](int sub_id) + { + kickmsg::Subscriber sub{region}; + g_subscribers_ready.fetch_add(1, std::memory_order_release); + + auto& stats = sub_stats[static_cast(sub_id)]; + std::vector last_seq(NUM_PUBS, UINT32_MAX); + auto const timeout = milliseconds{500}; + + while (true) + { + auto view = sub.receive_view(timeout); + if (not view) + { + if (g_all_publishers_done) + { + uint64_t lost_before = sub.lost(); + view = sub.try_receive_view(); + if (not view) + { + if (sub.lost() == lost_before) + { + break; + } + continue; + } + } + else + { + continue; + } + } + + auto const* data = static_cast(view->data()); + if (not validate_big(data, view->len(), sub_id, "view pass 1")) + { + ++stats.corrupted; + continue; + } + if (not validate_big(data, view->len(), sub_id, "view pass 2")) + { + std::fprintf(stderr, " [FAIL] sub%d: view changed between passes " + "(slot overwritten while pinned)\n", sub_id); + ++stats.torn_view; + continue; + } + check_reorder(data, last_seq, stats, sub_id); + } + + stats.lost = sub.lost(); + }; + + nanoseconds t0 = kickmsg::monotonic_ns(); + + std::thread copy_thread(copy_sub, 0); + std::thread view_thread(view_sub, 1); + + std::vector pub_threads; + for (int i = 0; i < NUM_PUBS; ++i) + { + pub_threads.emplace_back(pub_worker, i); + } + + for (auto& t : pub_threads) + { + t.join(); + } + g_all_publishers_done.store(true, std::memory_order_release); + + copy_thread.join(); + view_thread.join(); + + nanoseconds t1 = kickmsg::monotonic_ns(); + int64_t elapsed_ms = std::chrono::duration_cast(t1 - t0).count(); + + uint64_t const total_sent = static_cast(NUM_PUBS) * NUM_MSGS; + std::printf(" Elapsed: %" PRId64 " ms, total published: %" PRIu64 "\n", + elapsed_ms, total_sent); + + bool ok = true; + + for (int i = 0; i < NUM_SUBS; ++i) + { + auto const& s = sub_stats[static_cast(i)]; + std::printf(" sub%d: received=%" PRIu64 " lost=%" PRIu64 " corrupted=%" PRIu64 + " torn_view=%" PRIu64 " reordered=%" PRIu64 "\n", + i, s.received, s.lost, s.corrupted, s.torn_view, s.reordered); + + if (s.corrupted > 0 or s.torn_view > 0 or s.reordered > 0) + { + std::fprintf(stderr, " [FAIL] sub%d: corrupted=%" PRIu64 " torn_view=%" PRIu64 + " reordered=%" PRIu64 "\n", + i, s.corrupted, s.torn_view, s.reordered); + ok = false; + } + if (s.received == 0) + { + std::fprintf(stderr, " [FAIL] sub%d: received 0 messages!\n", i); + ok = false; + } + + // Exact conservation: the readiness barrier means every ring is Live + // before the first send, and both readers drain to completion. + uint64_t accounted = s.received + s.lost + s.corrupted + s.torn_view + s.reordered; + if (accounted != total_sent) + { + std::fprintf(stderr, " [FAIL] sub%d: received+lost+corrupted+torn+reorder (%" PRIu64 + ") != total_sent (%" PRIu64 ")!\n", + i, accounted, total_sent); + ok = false; + } + } + + ok &= verify_gc_zero(region, cfg); + ok &= verify_refcounts_zero(region, cfg); + ok &= verify_pool_free(region, cfg); + ok &= verify_rings_inactive(region, cfg); + + region.unlink(); + + if (ok) + { + std::printf(" [PASS]\n\n"); + } + else + { + std::printf(" [FAIL]\n\n"); + } + return ok; +} diff --git a/tests/stress/churn.cc b/tests/stress/churn.cc index 68ad3db..f19d493 100644 --- a/tests/stress/churn.cc +++ b/tests/stress/churn.cc @@ -86,17 +86,7 @@ bool run_subscriber_churn() // 1. start_pos_ is captured BEFORE CAS to Live (no missed entries) // 2. drain_unconsumed releases all entries in [start_pos_, wp) // A refcount leak here indicates a gap in the drain window. - std::size_t repaired = region.repair_locked_entries(); - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (repaired > 0) - { - std::printf(" GC repaired %zu locked entries\n", repaired); - } - if (reclaimed > 0) - { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); - } - + ok &= verify_gc_zero(region, cfg); ok &= verify_refcounts_zero(region, cfg); ok &= verify_pool_free(region, cfg); diff --git a/tests/stress/common.h b/tests/stress/common.h index ef8927e..8211ba9 100644 --- a/tests/stress/common.h +++ b/tests/stress/common.h @@ -89,10 +89,26 @@ static constexpr std::size_t TRACE_SIZE = 16; inline std::atomic g_all_publishers_done{false}; +// Readiness barrier: subscribers signal after construction, publishers wait. +// Every ring is then Live for the whole run, making per-subscriber +// accounting exact. Scenarios must reset both before spawning threads. +inline std::atomic g_subscribers_ready{0}; +inline int g_subscribers_expected{0}; + +inline void wait_subscribers_ready() +{ + while (g_subscribers_ready.load(std::memory_order_acquire) < g_subscribers_expected) + { + kickmsg::yield(); + } +} + inline void publisher_thread(kickmsg::SharedRegion& region, int pub_id, uint32_t count) { kickmsg::Publisher pub{region}; + wait_subscribers_ready(); + for (uint32_t i = 0; i < count; ++i) { Payload msg; @@ -186,6 +202,7 @@ inline SubResult subscriber_thread_copy(kickmsg::SharedRegion& region, int sub_i int num_pubs, uint32_t /*msgs_per_pub*/) { kickmsg::Subscriber sub{region}; + g_subscribers_ready.fetch_add(1, std::memory_order_release); SubResult result{}; result.sub_id = sub_id; @@ -204,10 +221,17 @@ inline SubResult subscriber_thread_copy(kickmsg::SharedRegion& region, int sub_i { if (g_all_publishers_done) { + // Null can mean retry budget burned on evicted entries; + // only null with no lost() progress proves ring-empty. + uint64_t lost_before = sub.lost(); sample = sub.try_receive(); if (not sample) { - break; + if (sub.lost() == lost_before) + { + break; + } + continue; } } else @@ -236,6 +260,7 @@ inline SubResult subscriber_thread_zerocopy(kickmsg::SharedRegion& region, int s int num_pubs, uint32_t /*msgs_per_pub*/) { kickmsg::Subscriber sub{region}; + g_subscribers_ready.fetch_add(1, std::memory_order_release); SubResult result{}; result.sub_id = sub_id; @@ -254,10 +279,16 @@ inline SubResult subscriber_thread_zerocopy(kickmsg::SharedRegion& region, int s { if (g_all_publishers_done) { + // Same caveat as the copy path. + uint64_t lost_before = sub.lost(); view = sub.try_receive_view(); if (not view) { - break; + if (sub.lost() == lost_before) + { + break; + } + continue; } } else @@ -282,6 +313,47 @@ inline SubResult subscriber_thread_zerocopy(kickmsg::SharedRegion& region, int s return result; } +// No-crash oracle: GC work not explained by observed publisher drops means +// the normal path leaked and GC masked it -- fail. Each stall-steal leaks +// at most one slot ref and books at least one drop, so reclaimed <= drops +// is the exact budget. +inline bool verify_gc_zero(kickmsg::SharedRegion& region, kickmsg::channel::Config const& cfg) +{ + bool ok = true; + + uint64_t dropped = 0; + for (uint32_t i = 0; i < cfg.max_subscribers; ++i) + { + auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), i); + dropped += ring->dropped_count.load(std::memory_order_acquire); + } + + std::size_t repaired = region.repair_locked_entries(); + if (repaired != 0) + { + std::fprintf(stderr, " [FAIL] repair_locked_entries fixed %zu entries on a no-crash run\n", + repaired); + ok = false; + } + + std::size_t reclaimed = region.reclaim_orphaned_slots(); + if (reclaimed > dropped) + { + std::fprintf(stderr, " [FAIL] reclaim_orphaned_slots recovered %zu slots on a no-crash run " + "(only %" PRIu64 " publisher drops can account for steal residue)\n", + reclaimed, dropped); + ok = false; + } + else if (reclaimed != 0) + { + std::printf(" [NOTE] %zu slot(s) reclaimed, within the %" PRIu64 + "-drop steal budget (stalled-publisher steal residue)\n", + reclaimed, dropped); + } + + return ok; +} + inline bool verify_pool_free(kickmsg::SharedRegion& region, kickmsg::channel::Config const& cfg) { auto* base = region.base(); diff --git a/tests/stress/edge_cases.cc b/tests/stress/edge_cases.cc index 8d4c0b8..4b592e3 100644 --- a/tests/stress/edge_cases.cc +++ b/tests/stress/edge_cases.cc @@ -8,6 +8,9 @@ bool run_single_slot_ring() constexpr int NUM_PUBS = 4; constexpr int NUM_SUBS = 4; + + g_subscribers_ready = 0; + g_subscribers_expected = NUM_SUBS; uint32_t const NUM_MSGS = 10000 / TSAN_SCALE; kickmsg::channel::Config cfg; @@ -28,15 +31,13 @@ bool run_single_slot_ring() for (int i = 0; i < NUM_SUBS; ++i) { - sub_threads.emplace_back([®ion, i, &sub_results, NUM_MSGS]() + sub_threads.emplace_back([®ion, i, &sub_results]() { sub_results[static_cast(i)] = subscriber_thread_copy(region, i, NUM_PUBS, NUM_MSGS); }); } - kickmsg::sleep(10ms); - std::vector pub_threads; for (int i = 0; i < NUM_PUBS; ++i) { @@ -91,17 +92,7 @@ bool run_single_slot_ring() } } - std::size_t repaired = region.repair_locked_entries(); - if (repaired > 0) - { - std::printf(" GC repaired %zu locked entries\n", repaired); - } - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (reclaimed > 0) - { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); - } - + ok &= verify_gc_zero(region, cfg); ok &= verify_refcounts_zero(region, cfg); ok &= verify_pool_free(region, cfg); ok &= verify_rings_inactive(region, cfg); @@ -255,17 +246,7 @@ bool run_subscriber_saturation() // Destroy all subscribers before verification subs.clear(); - std::size_t repaired = region.repair_locked_entries(); - if (repaired > 0) - { - std::printf(" GC repaired %zu locked entries\n", repaired); - } - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (reclaimed > 0) - { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); - } - + ok &= verify_gc_zero(region, cfg); ok &= verify_refcounts_zero(region, cfg); ok &= verify_pool_free(region, cfg); ok &= verify_rings_inactive(region, cfg); diff --git a/tests/stress/fairness.cc b/tests/stress/fairness.cc index 3e2b98f..755b4dc 100644 --- a/tests/stress/fairness.cc +++ b/tests/stress/fairness.cc @@ -24,20 +24,21 @@ bool run_fairness_test() auto region = kickmsg::SharedRegion::create( shm_name, kickmsg::channel::PubSub, cfg, "fairness"); + g_subscribers_ready = 0; + g_subscribers_expected = NUM_SUBS; + std::vector results(NUM_SUBS); std::vector sub_threads; for (int i = 0; i < NUM_SUBS; ++i) { - sub_threads.emplace_back([®ion, i, &results, NUM_MSGS]() + sub_threads.emplace_back([®ion, i, &results]() { results[static_cast(i)] = subscriber_thread_copy(region, i, 1, NUM_MSGS); }); } - kickmsg::sleep(10ms); - std::thread pub_thread(publisher_thread, std::ref(region), 0, NUM_MSGS); pub_thread.join(); g_all_publishers_done.store(true, std::memory_order_release); @@ -63,6 +64,17 @@ bool run_fairness_test() r.sub_id, r.corrupted, r.bad_pub_id, r.reordered); ok = false; } + + // Exact conservation: the readiness barrier means every ring is Live + // before the first send, and subscribers drain to completion. + uint64_t accounted = r.received + r.lost + r.corrupted + r.bad_pub_id + r.reordered; + if (accounted != NUM_MSGS) + { + std::fprintf(stderr, " [FAIL] sub%d: received+lost+corrupt+bad_pid+reorder (%" PRIu64 + ") != total_sent (%u)!\n", + r.sub_id, accounted, NUM_MSGS); + ok = false; + } } std::printf(" Received range: [%" PRIu64 ", %" PRIu64 "] (spread: %" PRIu64 ")\n", @@ -74,17 +86,7 @@ bool run_fairness_test() ok = false; } - std::size_t repaired = region.repair_locked_entries(); - if (repaired > 0) - { - std::printf(" GC repaired %zu locked entries\n", repaired); - } - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (reclaimed > 0) - { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); - } - + ok &= verify_gc_zero(region, cfg); ok &= verify_refcounts_zero(region, cfg); ok &= verify_pool_free(region, cfg); ok &= verify_rings_inactive(region, cfg); diff --git a/tests/stress/gc_recovery.cc b/tests/stress/gc_recovery.cc index dba7b44..1059562 100644 --- a/tests/stress/gc_recovery.cc +++ b/tests/stress/gc_recovery.cc @@ -21,7 +21,7 @@ bool run_gc_recovery() bool ok = true; - // Simulate a publisher crash mid-commit: poison one entry with LOCKED_SEQUENCE. + // Simulate a publisher crash mid-commit: poison one entry with a stale lock. // To have a valid ring entry, we need an active subscriber and a committed entry first. { kickmsg::Subscriber sub(region); @@ -38,7 +38,7 @@ bool run_gc_recovery() uint64_t wp = ring->write_pos.load(std::memory_order_acquire); if (wp > 0) { - entries[(wp - 1) & h->sub_ring_mask].sequence = kickmsg::LOCKED_SEQUENCE; + entries[(wp - 1) & h->sub_ring_mask].sequence = kickmsg::seq_lock(wp - 1); } } diff --git a/tests/stress/live_repair.cc b/tests/stress/live_repair.cc index f393d2a..b62c7a9 100644 --- a/tests/stress/live_repair.cc +++ b/tests/stress/live_repair.cc @@ -57,9 +57,11 @@ bool run_live_repair() // Subscriber threads: receive continuously until done std::vector sub_results(NUM_SUBS); + std::vector sub_rings(NUM_SUBS, UINT32_MAX); auto sub_worker = [&](int sub_id) { kickmsg::Subscriber sub{region}; + sub_rings[static_cast(sub_id)] = sub.ring_index(); SubResult result{}; result.sub_id = sub_id; @@ -105,7 +107,8 @@ bool run_live_repair() sub_results[static_cast(sub_id)] = result; }; - // Injector thread: every 10ms, poison ring 0's next entry with LOCKED_SEQUENCE + // Injector thread: every 10ms, poison ring 0's next entry with a + // stale position-tagged lock (as a crashed publisher would leave it) auto injector = [&]() { while (not stop) @@ -114,14 +117,32 @@ bool run_live_repair() auto* ring = kickmsg::sub_ring_at(base, hdr, 0); auto* entries = kickmsg::ring_entries(ring); - uint64_t wp = ring->write_pos.load(std::memory_order_acquire); - // Advance write_pos by 1 and poison the new entry - ring->write_pos.store(wp + 1, std::memory_order_release); - entries[(wp) & hdr->sub_ring_mask].sequence.store( - kickmsg::LOCKED_SEQUENCE, std::memory_order_release); - - inject_count.fetch_add(1, std::memory_order_relaxed); + // fetch_add, NOT load+store: real publishers fetch_add this + // counter concurrently and a lost increment would hand two + // publishers the same position (harness-induced corruption). + uint64_t wp = ring->write_pos.fetch_add(1, std::memory_order_acq_rel); + auto& e = entries[(wp) & hdr->sub_ring_mask]; + + // CAS like a real publisher, never a blind store: a descheduled + // injector's late store would land over an already-repaired + // entry below every scan window -- an unstealable lock no real + // crash can produce. If the entry moved on first, skip the + // injection (the claimed position heals as Case-B residue). + uint64_t prev = 0; + if (wp >= hdr->sub_ring_capacity) + { + prev = wp - hdr->sub_ring_capacity + 1; + } + uint64_t observed = e.sequence.load(std::memory_order_acquire); + if (not kickmsg::seq_is_locked(observed) + and kickmsg::seq_pos(observed) == prev + and e.sequence.compare_exchange_strong(observed, + kickmsg::seq_lock(wp), + std::memory_order_acq_rel, std::memory_order_relaxed)) + { + inject_count.fetch_add(1, std::memory_order_relaxed); + } } }; @@ -187,6 +208,36 @@ bool run_live_repair() bool ok = true; + if (heal_count.load() == 0) + { + // Publishers' self-repair can win every race against the healer + // (it steals ~10 ms after the poison; the healer needs cadence + + // grace) -- effectiveness is asserted below via total_steals. + std::printf(" [NOTE] healer repaired nothing (publishers self-healed first)\n"); + } + + // The injector poisons ring 0: if repair never unblocked it, its + // subscriber starves. Receiving anything proves traffic flowed past + // the injected locks. + bool ring0_seen = false; + for (int i = 0; i < NUM_SUBS; ++i) + { + if (sub_rings[static_cast(i)] == 0) + { + ring0_seen = true; + if (sub_results[static_cast(i)].received == 0) + { + std::fprintf(stderr, " [FAIL] ring-0 subscriber (sub%d) received nothing\n", i); + ok = false; + } + } + } + if (not ring0_seen) + { + std::fprintf(stderr, " [FAIL] no subscriber claimed ring 0\n"); + ok = false; + } + // Check for corruption and reorders in subscriber results for (auto const& r : sub_results) { @@ -210,6 +261,24 @@ bool run_live_repair() { std::printf(" GC final repaired %zu locked entries\n", repaired); } + + // Effectiveness, timing-independent: every injected lock must have been + // stolen by SOMEONE (publisher self-repair, the healer, or the final + // pass above), and no poison may survive. + uint64_t steals = region.stats().total_steals; + if (steals < inject_count.load()) + { + std::fprintf(stderr, " [FAIL] %" PRIu64 " injections but only %" PRIu64 " steals\n", + inject_count.load(), steals); + ok = false; + } + auto post = region.diagnose(); + if (post.locked_entries > 0) + { + std::fprintf(stderr, " [FAIL] %u locked entries survived repair\n", + post.locked_entries); + ok = false; + } std::size_t reclaimed = region.reclaim_orphaned_slots(); if (reclaimed > 0) { diff --git a/tests/stress/main.cc b/tests/stress/main.cc index ed9679e..7ffe298 100644 --- a/tests/stress/main.cc +++ b/tests/stress/main.cc @@ -1,4 +1,5 @@ #include "common.h" +#include "../shm_cleanup.h" #include "kickmsg/version.h" #include @@ -13,6 +14,7 @@ bool run_pool_exhaustion(); bool run_live_repair(); bool run_single_slot_ring(); bool run_subscriber_saturation(); +bool run_big_payload(); int main(int argc, char** argv) { @@ -56,6 +58,18 @@ int main(int argc, char** argv) std::thread::hardware_concurrency(), static_cast(contention_count())); + // Unlink the scenario segments if interrupted (Ctrl-C / kill), so a killed + // run leaves a clean /dev/shm. Keep in sync with the scenarios' shm_name. + for (char const* name : {"/kickmsg_treiber_stress", "/kickmsg_churn_test", + "/kickmsg_gc_test", "/kickmsg_fairness_test", + "/kickmsg_stress_test", "/kickmsg_pool_exhaustion", + "/kickmsg_live_repair", "/kickmsg_single_slot_ring", + "/kickmsg_sub_saturation", "/kickmsg_big_payload"}) + { + kickmsg_test::register_cleanup_shm(name); + } + kickmsg_test::install_signal_cleanup(); + TestRunner runner; runner.run("treiber_stress", run_treiber_stress); @@ -67,6 +81,7 @@ int main(int argc, char** argv) runner.run("live_repair", run_live_repair); runner.run("single_slot_ring", run_single_slot_ring); runner.run("subscriber_saturation", run_subscriber_saturation); + runner.run("big_payload", run_big_payload); return runner.summary(); } diff --git a/tests/stress/mpmc.cc b/tests/stress/mpmc.cc index 5a3e465..912df4d 100644 --- a/tests/stress/mpmc.cc +++ b/tests/stress/mpmc.cc @@ -13,6 +13,8 @@ bool run_stress_test(TestConfig const& tc) tc.pool_size, tc.ring_capacity); g_all_publishers_done = false; + g_subscribers_ready = 0; + g_subscribers_expected = tc.num_subscribers; kickmsg::channel::Config cfg; cfg.max_subscribers = tc.max_subs; @@ -44,8 +46,6 @@ bool run_stress_test(TestConfig const& tc) }); } - kickmsg::sleep(10ms); - std::vector pub_threads; for (int i = 0; i < tc.num_publishers; ++i) { @@ -112,31 +112,22 @@ bool run_stress_test(TestConfig const& tc) std::fprintf(stderr, " [FAIL] sub%d: received 0 messages!\n", r.sub_id); all_ok = false; } - // Completeness check: received + lost should account for all messages - // this subscriber saw. The lost counter tracks ring-level losses; - // received + lost may be less than total_sent (subscriber starts from - // write_pos at construction, missing earlier messages), but should - // never exceed it. - if (r.received + r.lost > total_sent) + // Exact conservation: the readiness barrier guarantees every ring is + // Live before the first send, so each subscriber's ring sees every + // message. Every ring position is consumed (received / corrupted / + // bad_pub_id / reordered, exactly one bucket each) or counted in + // lost; any other total means messages vanished or were duplicated. + uint64_t accounted = r.received + r.lost + r.corrupted + r.bad_pub_id + r.reordered; + if (accounted != total_sent) { - std::fprintf(stderr, " [FAIL] sub%d: received+lost (%" PRIu64 - ") > total_sent (%" PRIu64 ")!\n", - r.sub_id, r.received + r.lost, total_sent); + std::fprintf(stderr, " [FAIL] sub%d: received+lost+corrupt+bad_pid+reorder (%" PRIu64 + ") != total_sent (%" PRIu64 ")!\n", + r.sub_id, accounted, total_sent); all_ok = false; } } - std::size_t repaired = region.repair_locked_entries(); - if (repaired > 0) - { - std::printf(" GC repaired %zu locked entries\n", repaired); - } - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (reclaimed > 0) - { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); - } - + all_ok &= verify_gc_zero(region, cfg); all_ok &= verify_refcounts_zero(region, cfg); all_ok &= verify_pool_free(region, cfg); all_ok &= verify_rings_inactive(region, cfg); diff --git a/tests/stress/pool_exhaustion.cc b/tests/stress/pool_exhaustion.cc index d22dd45..ec30f80 100644 --- a/tests/stress/pool_exhaustion.cc +++ b/tests/stress/pool_exhaustion.cc @@ -21,13 +21,27 @@ bool run_pool_exhaustion() constexpr int NUM_SUBS = 4; uint32_t const NUM_MSGS = 10000 / TSAN_SCALE; + g_subscribers_ready = 0; + g_subscribers_expected = NUM_SUBS; + std::atomic eagain_count{0}; + std::atomic corruption_count{0}; + + struct SlowSubStats + { + uint64_t received = 0; + uint64_t lost = 0; + uint64_t corrupted = 0; + }; + std::vector sub_stats(NUM_SUBS); // Publishers that track EAGAIN auto pub_worker = [&](int pub_id) { kickmsg::Publisher pub{region}; + wait_subscribers_ready(); + for (uint32_t i = 0; i < NUM_MSGS; ++i) { Payload msg; @@ -51,11 +65,14 @@ bool run_pool_exhaustion() } }; - // Slow subscribers: 1us sleep between receives + // Slow subscribers: 1us sleep between receives. Slow but complete: the + // loop still drains every remaining message after publishers finish. auto slow_sub = [&](int sub_id) { kickmsg::Subscriber sub{region}; + g_subscribers_ready.fetch_add(1, std::memory_order_release); + auto& stats = sub_stats[static_cast(sub_id)]; auto const timeout = milliseconds{500}; while (true) @@ -65,10 +82,18 @@ bool run_pool_exhaustion() { if (g_all_publishers_done) { + // try_receive can return null after exhausting its retry + // budget on a run of evicted entries; only "null with no + // lost() progress" proves the ring is empty. + uint64_t lost_before = sub.lost(); sample = sub.try_receive(); if (not sample) { - break; + if (sub.lost() == lost_before) + { + break; + } + continue; } } else @@ -80,16 +105,29 @@ bool run_pool_exhaustion() // Deliberately slow consumer kickmsg::sleep(1us); - if (sample->len() == sizeof(Payload)) + if (sample->len() != sizeof(Payload)) { - Payload msg; - std::memcpy(&msg, sample->data(), sizeof(msg)); - if (msg.magic != Payload::MAGIC or msg.checksum != compute_checksum(msg)) - { - std::fprintf(stderr, " [FAIL] sub%d: corruption detected\n", sub_id); - } + std::fprintf(stderr, " [FAIL] sub%d: wrong-length sample (%zu bytes)\n", + sub_id, sample->len()); + ++stats.corrupted; + corruption_count.fetch_add(1, std::memory_order_relaxed); + continue; + } + + Payload msg; + std::memcpy(&msg, sample->data(), sizeof(msg)); + if (msg.magic != Payload::MAGIC or msg.checksum != compute_checksum(msg)) + { + std::fprintf(stderr, " [FAIL] sub%d: corruption detected\n", sub_id); + ++stats.corrupted; + corruption_count.fetch_add(1, std::memory_order_relaxed); + continue; } + + ++stats.received; } + + stats.lost = sub.lost(); }; std::vector sub_threads; @@ -98,8 +136,6 @@ bool run_pool_exhaustion() sub_threads.emplace_back(slow_sub, i); } - kickmsg::sleep(10ms); - std::vector pub_threads; for (int i = 0; i < NUM_PUBS; ++i) { @@ -121,17 +157,40 @@ bool run_pool_exhaustion() bool ok = true; - std::size_t repaired = region.repair_locked_entries(); - if (repaired > 0) + if (corruption_count.load() != 0) { - std::printf(" GC repaired %zu locked entries\n", repaired); + std::fprintf(stderr, " [FAIL] %" PRIu64 " corrupted/wrong-length samples!\n", + corruption_count.load()); + ok = false; } - std::size_t reclaimed = region.reclaim_orphaned_slots(); - if (reclaimed > 0) + + uint64_t const total_sent = static_cast(NUM_PUBS) * NUM_MSGS; + for (int i = 0; i < NUM_SUBS; ++i) { - std::printf(" GC reclaimed %zu orphaned slots\n", reclaimed); + auto const& s = sub_stats[static_cast(i)]; + std::printf(" sub%d: received=%" PRIu64 " lost=%" PRIu64 " corrupted=%" PRIu64 "\n", + i, s.received, s.lost, s.corrupted); + + if (s.received == 0) + { + std::fprintf(stderr, " [FAIL] sub%d: received 0 messages!\n", i); + ok = false; + } + + // Exact conservation: the readiness barrier means every ring is Live + // before the first send, and the slow consumers still drain to + // completion, so every ring position lands in exactly one bucket. + uint64_t accounted = s.received + s.lost + s.corrupted; + if (accounted != total_sent) + { + std::fprintf(stderr, " [FAIL] sub%d: received+lost+corrupted (%" PRIu64 + ") != total_sent (%" PRIu64 ")!\n", + i, accounted, total_sent); + ok = false; + } } + ok &= verify_gc_zero(region, cfg); ok &= verify_pool_free(region, cfg); ok &= verify_refcounts_zero(region, cfg); ok &= verify_rings_inactive(region, cfg); diff --git a/tests/unit/node-t.cc b/tests/unit/node-t.cc index 848c9c8..374f4df 100644 --- a/tests/unit/node-t.cc +++ b/tests/unit/node-t.cc @@ -70,7 +70,7 @@ TEST_F(NodeTest, AdvertiseTwiceDoesNotWipeLiveRegion) kickmsg::Node sub_node("subnode", "test"); auto sub = sub_node.subscribe("dup"); // claims a ring; region Live - // Second advertise on the same node — must reuse, not wipe. + // Second advertise on the same node -- must reuse, not wipe. auto pub2 = node.advertise("dup", small_cfg()); uint32_t val = 7; @@ -237,7 +237,7 @@ TEST_F(NodeTest, UnlinkTopicRemovesShm) // unlink, then verify a fresh open() fails. On Linux this exercises // the unlink path (without it, the /dev/shm entry would persist); on // Windows the last-handle-close already removed the mapping and - // unlink is a harmless no-op — both produce the same post-condition. + // unlink is a harmless no-op -- both produce the same post-condition. track("/test_ephemeral"); { @@ -278,7 +278,7 @@ TEST_F(NodeTest, UnlinkBroadcastAndMailboxUseRightNames) (void)mbx; } { - // Unlink from a different node — unlink_mailbox takes an explicit + // Unlink from a different node -- unlink_mailbox takes an explicit // owner, which is also what a cleanup tool in a separate process // would have to supply. kickmsg::Node cleanup("cleanup", "test"); @@ -353,6 +353,43 @@ TEST_F(NodeTest, RosStyleTopicNamesAreSanitizedIntoShmPath) EXPECT_EQ(pub_node.name(), "drv"); } +TEST_F(NodeTest, ShmNameCollisionDetectedByIdentityStamp) +{ + // Sanitization is many-to-one: "a:b" and "a b" both become "a_b", so + // two distinct logical topics land on the same shm name "/test_a_b". + // The identity hash stamped at create (over the RAW topic string) must + // reject the open instead of silently sharing the region. + track("/test_a_b"); + + kickmsg::Node pub_node("pubnode", "test"); + auto pub = pub_node.advertise("a:b", small_cfg()); + + kickmsg::Node other("othernode", "test"); + try + { + other.subscribe("a b"); + FAIL() << "expected identity mismatch on colliding topic name"; + } + catch (std::runtime_error const& e) + { + EXPECT_NE(std::string(e.what()).find("Identity mismatch"), + std::string::npos) << e.what(); + } + + // create_or_open path: same config (config_hash matches), colliding + // logical name -- identity check must still fire. + EXPECT_THROW(other.subscribe_or_create("a b", small_cfg()), + std::runtime_error); + + // The genuine topic still opens fine. + kickmsg::Node reader("readernode", "test"); + auto sub = reader.subscribe("a:b"); + uint32_t val = 9; + ASSERT_GE(pub.send(&val, sizeof(val)), 0); + auto got = sub.try_receive(); + ASSERT_TRUE(got.has_value()); +} + TEST_F(NodeTest, EmptyTopicNameThrows) { kickmsg::Node node("n", "test"); diff --git a/tests/unit/publisher-t.cc b/tests/unit/publisher-t.cc index d9dd7b3..fd9320e 100644 --- a/tests/unit/publisher-t.cc +++ b/tests/unit/publisher-t.cc @@ -96,6 +96,29 @@ TEST_F(PublisherTest, SendReturnsEmsgsize) EXPECT_EQ(pub.send(buf, cfg.max_payload_size + 1), -EMSGSIZE); } +TEST_F(PublisherTest, PublishOversizedLenReturnsZeroAndRecyclesSlot) +{ + kickmsg::channel::Config cfg; + cfg.max_subscribers = 1; + cfg.sub_ring_capacity = 4; + cfg.pool_size = 1; + cfg.max_payload_size = 8; + + auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); + kickmsg::Subscriber sub(region); + kickmsg::Publisher pub(region); + + auto a = pub.allocate(); + ASSERT_NE(a.data, nullptr); + EXPECT_EQ(pub.publish(cfg.max_payload_size + 1), 0u); + + // The oversized publish must have recycled the pending slot: with + // pool_size == 1, a leak would make this allocate() fail. + auto b = pub.allocate(); + ASSERT_NE(b.data, nullptr); + EXPECT_EQ(pub.publish(sizeof(uint32_t)), 1u); +} + TEST_F(PublisherTest, SendReturnsEagainOnPoolExhaustion) { kickmsg::channel::Config cfg; @@ -191,12 +214,12 @@ TEST_F(PublisherTest, MultipleSubscribersEachReceive) TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) { - // Simulate Case A: a publisher CAS-locked an entry (LOCKED_SEQUENCE) - // then crashed before committing. The next publisher that wraps to - // this position should: - // 1. time out in wait_and_capture_slot - // 2. fail the CAS lock (entry is LOCKED_SEQUENCE, not prev_seq) - // 3. self-repair the entry (advance seq to expected) + // Simulate Case A: a publisher CAS-locked an entry (position-tagged + // lock) then crashed before committing. The next publisher that wraps + // to this position should: + // 1. time out in wait_for_commit (observing one stable lock value) + // 2. fail the CAS lock (entry is locked, not prev_seq) + // 3. self-repair the entry (steal + advance seq to expected) // 4. drop delivery for this wrap // After that, the NEXT publisher at this position should succeed // without any timeout. @@ -206,7 +229,7 @@ TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) cfg.sub_ring_capacity = 4; // capacity = 4 cfg.pool_size = 16; cfg.max_payload_size = 8; - cfg.commit_timeout = std::chrono::microseconds{1000}; // 1 ms — fast test + cfg.commit_timeout = std::chrono::microseconds{1000}; // 1 ms -- fast test auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); @@ -230,20 +253,20 @@ TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) // Advance write_pos past pos=4 so the ring has wrapped. ring->write_pos.store(5, std::memory_order_release); // Lock entry at idx=0 as if a publisher crashed mid-commit at pos=4. - entries[0].sequence.store(kickmsg::LOCKED_SEQUENCE, + entries[0].sequence.store(kickmsg::seq_lock(4), std::memory_order_release); - // Next publish: pos=5 → idx=1 (clean entry, succeeds). + // Next publish: pos=5 -> idx=1 (clean entry, succeeds). uint32_t val = 100; ASSERT_GE(pub.send(&val, sizeof(val)), 0); - // pos=6 → idx=2 (clean), pos=7 → idx=3 (clean). + // pos=6 -> idx=2 (clean), pos=7 -> idx=3 (clean). val = 101; ASSERT_GE(pub.send(&val, sizeof(val)), 0); val = 102; ASSERT_GE(pub.send(&val, sizeof(val)), 0); - // pos=8 → idx=0 → POISONED. Publisher times out + drops + self-repairs. + // pos=8 -> idx=0 -> POISONED. Publisher times out + drops + self-repairs. auto before = std::chrono::steady_clock::now(); val = 103; pub.send(&val, sizeof(val)); // may drop, but should NOT hang forever @@ -256,10 +279,10 @@ TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) << "Publisher should have waited ~1 ms for the stuck entry"; // After self-repair, entry at idx=0 should be advanced. - // The entry was stuck at LOCKED_SEQUENCE for pos=4. The publisher at - // pos=8 expects seq=5 and should have repaired it to 9 (pos=8 + 1). + // The entry was stuck locked at pos=4. The publisher at pos=8 expects + // seq=5 and should have repaired it to 9 (pos=8 + 1). uint64_t seq0 = entries[0].sequence.load(std::memory_order_acquire); - EXPECT_NE(seq0, kickmsg::LOCKED_SEQUENCE) + EXPECT_FALSE(kickmsg::seq_is_locked(seq0)) << "Self-repair should have advanced the stuck entry"; // Now the NEXT publish at idx=0 (pos=12) should succeed WITHOUT timeout. @@ -270,7 +293,7 @@ TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) ASSERT_GE(pub.send(&val, sizeof(val)), 0); } - // pos=12 → idx=0. If self-repair worked, this should be fast. + // pos=12 -> idx=0. If self-repair worked, this should be fast. before = std::chrono::steady_clock::now(); val = 203; ASSERT_GE(pub.send(&val, sizeof(val)), 0); @@ -280,7 +303,7 @@ TEST_F(PublisherTest, SelfRepairCaseA_LockedSequence) after - before).count(); EXPECT_LT(elapsed_us, 500) << "After self-repair, the next publisher at this position should " - "succeed instantly — not wait for another timeout"; + "succeed instantly -- not wait for another timeout"; } TEST_F(PublisherTest, SelfRepairCaseB_StaleEntry) @@ -322,9 +345,9 @@ TEST_F(PublisherTest, SelfRepairCaseB_StaleEntry) // Entry idx=0 still has seq=1. A publisher at pos=8 (idx=0) expects // seq=5. seq=1 + cap=4 = 5 which is NOT < 5, so it needs to be - // strictly more than one wrap. At pos=12, expected=9. 1+4=5 < 9 → stale. + // strictly more than one wrap. At pos=12, expected=9. 1+4=5 < 9 -> stale. - // Publish at pos=12 → idx=0: should timeout + self-repair + drop. + // Publish at pos=12 -> idx=0: should timeout + self-repair + drop. uint32_t val = 999; pub.send(&val, sizeof(val)); // drops at idx=0 but self-repairs @@ -344,7 +367,7 @@ TEST_F(PublisherTest, SelfRepairCaseB_StaleEntry) auto before = std::chrono::steady_clock::now(); val = 303; - pub.send(&val, sizeof(val)); // pos=16 → idx=0, should be fast + pub.send(&val, sizeof(val)); // pos=16 -> idx=0, should be fast auto after = std::chrono::steady_clock::now(); auto elapsed_us = std::chrono::duration_cast( diff --git a/tests/unit/region-t.cc b/tests/unit/region-t.cc index 718ea62..760a4a4 100644 --- a/tests/unit/region-t.cc +++ b/tests/unit/region-t.cc @@ -343,31 +343,31 @@ TEST_F(RegionTest, RepairLockedEntryUnblocksPublishing) // Advance write_pos to simulate that a publisher claimed pos=1 ring->write_pos.store(2, std::memory_order_release); - auto& e1 = entries[1]; // pos=1 → idx=1 - e1.sequence.store(kickmsg::LOCKED_SEQUENCE, std::memory_order_release); + auto& e1 = entries[1]; // pos=1 -> idx=1 + e1.sequence.store(kickmsg::seq_lock(1), std::memory_order_release); // Repair should fix the locked entry std::size_t repaired = region.repair_locked_entries(); EXPECT_EQ(repaired, 1u); - // The repaired entry should have seq = pos + 1 = 2 + // The repaired entry carries the skip marker for pos = 1 uint64_t seq = e1.sequence.load(std::memory_order_acquire); - EXPECT_EQ(seq, 2u); + EXPECT_EQ(seq, kickmsg::seq_skip(1)); // The repaired entry should have INVALID_SLOT uint32_t slot_idx = e1.slot_idx.load(std::memory_order_acquire); EXPECT_EQ(slot_idx, kickmsg::INVALID_SLOT); // Now publish enough to wrap around: pos 2, 3, 4, 5 - // pos=4 wraps to idx=0 and expects prev_seq=1 (pos 0's committed seq) — OK - // pos=5 wraps to idx=1 and expects prev_seq=2 (the repaired seq) — this + // pos=4 wraps to idx=0 and expects prev_seq=1 (pos 0's committed seq) -- OK + // pos=5 wraps to idx=1 and expects prev_seq=2 (the repaired seq) -- this // would fail with the old code that stored prev_seq instead of pos+1 for (int i = 0; i < 4; ++i) { val = static_cast(200 + i); ASSERT_GE(pub.send(&val, sizeof(val)), 0) << "Publishing failed at iteration " << i - << " — repaired entry likely blocked the ring"; + << " -- repaired entry likely blocked the ring"; } // Subscriber should receive the new messages (some may be lost due to wrapping) @@ -417,9 +417,9 @@ TEST_F(RegionTest, RepairStaleEntryFromCrashedPublisherBeforeCasLock) // becomes > 1 wrap stale. // write_pos after the 4 real publishes is 4. Set it to 4 + 2*cap = 12. ring->write_pos.store(12, std::memory_order_release); - // Don't touch entries — they keep their old sequences. Entry idx=0 + // Don't touch entries -- they keep their old sequences. Entry idx=0 // has seq=1, but expected seq at pos=8 (the slot in the scan window) - // is 9. (pos=8 maps to idx=0 because 8 & 3 = 0.) 1 + 4 < 9 → stale. + // is 9. (pos=8 maps to idx=0 because 8 & 3 = 0.) 1 + 4 < 9 -> stale. auto report = region.diagnose(); EXPECT_GT(report.locked_entries, 0u) @@ -429,11 +429,11 @@ TEST_F(RegionTest, RepairStaleEntryFromCrashedPublisherBeforeCasLock) EXPECT_GT(repaired, 0u) << "repair_locked_entries() should advance the stale entry"; - // After repair, the entry at idx=0 should have seq = expected. - // The expected pos for idx=0 in the window [12-4, 12) = [8, 12) is pos=8. + // After repair, the entry at idx=0 carries the skip marker for pos=8 + // (the slot in the scan window [12-4, 12) that maps to idx=0). uint64_t seq0 = entries[0].sequence.load(std::memory_order_acquire); - EXPECT_EQ(seq0, 9u) // pos=8 → expected = 8 + 1 = 9 - << "Stale entry should be advanced to pos + 1"; + EXPECT_EQ(seq0, kickmsg::seq_skip(8)) + << "Stale entry should be advanced to the pos=8 skip marker"; // Publishing should now succeed past the repaired slot. for (int i = 0; i < 8; ++i) @@ -441,7 +441,7 @@ TEST_F(RegionTest, RepairStaleEntryFromCrashedPublisherBeforeCasLock) uint32_t val = static_cast(100 + i); ASSERT_GE(pub.send(&val, sizeof(val)), 0) << "Publishing failed at iteration " << i - << " — repaired entry may still be stuck"; + << " -- repaired entry may still be stuck"; } } @@ -464,15 +464,16 @@ TEST_F(RegionTest, RepairLockedEntryAtPositionZero) auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), 0); auto* entries = kickmsg::ring_entries(ring); ring->write_pos.store(1, std::memory_order_release); - entries[0].sequence.store(kickmsg::LOCKED_SEQUENCE, std::memory_order_release); + entries[0].sequence.store(kickmsg::seq_lock(0), std::memory_order_release); std::size_t repaired = region.repair_locked_entries(); EXPECT_EQ(repaired, 1u); - EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), 1u); + EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), + kickmsg::seq_skip(0)); EXPECT_EQ(entries[0].slot_idx.load(std::memory_order_acquire), kickmsg::INVALID_SLOT); // Publishing should work: pos=1,2,3 use fresh indices, pos=4 wraps to idx=0 - // and expects prev_seq=1 — matches the repaired value + // and expects prev_seq=1 -- matches the repaired value kickmsg::Publisher pub(region); for (int i = 0; i < 5; ++i) { @@ -522,7 +523,7 @@ TEST_F(RegionTest, DiagnoseDetectsLockedEntries) auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), 0); auto* entries = kickmsg::ring_entries(ring); ring->write_pos.store(2, std::memory_order_release); - entries[1].sequence.store(kickmsg::LOCKED_SEQUENCE, std::memory_order_release); + entries[1].sequence.store(kickmsg::seq_lock(1), std::memory_order_release); auto report = region.diagnose(); EXPECT_EQ(report.locked_entries, 1u); @@ -584,13 +585,13 @@ TEST_F(RegionTest, ResetRetiredRingsLeavesDrainingUntouched) auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); - // Ring 0: retired (Free | in_flight=1) — should be reset + // Ring 0: retired (Free | in_flight=1) -- should be reset auto* ring0 = kickmsg::sub_ring_at(region.base(), region.header(), 0); ring0->state_flight.store( kickmsg::ring::make_packed(kickmsg::ring::Free, 1), std::memory_order_release); - // Ring 1: draining (Draining | in_flight=1) — must NOT be touched + // Ring 1: draining (Draining | in_flight=1) -- must NOT be touched auto* ring1 = kickmsg::sub_ring_at(region.base(), region.header(), 1); ring1->state_flight.store( kickmsg::ring::make_packed(kickmsg::ring::Draining, 1), @@ -794,7 +795,7 @@ TEST_F(RegionTest, SchemaClaimRejectsSecondClaimant) auto region = kickmsg::SharedRegion::create( SHM_NAME, kickmsg::channel::PubSub, cfg); - // Second process tries to claim a *different* schema — library just + // Second process tries to claim a *different* schema -- library just // reports "not the claimant", it never throws. User picks the policy. auto other = make_schema("other/Pose", 2, 0x00, 0x00); EXPECT_FALSE(region.try_claim_schema(other)); @@ -898,7 +899,7 @@ TEST_F(RegionTest, SchemaReaderDuringClaimingReturnsNullopt) TEST_F(RegionTest, SchemaResetRecoversWedgedClaimingState) { - // Crash scenario: a claimant CAS'd Unset → Claiming and died before + // Crash scenario: a claimant CAS'd Unset -> Claiming and died before // the release-store of Set. Every try_claim_schema() caller will // observe Claiming and return false after bounded yields. // reset_schema_claim() is the operator-driven recovery. @@ -918,7 +919,7 @@ TEST_F(RegionTest, SchemaResetRecoversWedgedClaimingState) // Operator confirms the original claimant is gone, resets the slot. EXPECT_TRUE(region.reset_schema_claim()); - // Second call is a no-op — state is already Unset. + // Second call is a no-op -- state is already Unset. EXPECT_FALSE(region.reset_schema_claim()); // Subsequent claim now succeeds. @@ -968,7 +969,7 @@ TEST_F(RegionTest, SchemaCreateOrOpenIgnoresOpenerSchemaWhenCreatorHadNone) auto opened = kickmsg::SharedRegion::create_or_open( SHM_NAME, kickmsg::channel::PubSub, opener_cfg, "opener"); - // Opener's cfg.schema was discarded — slot is still Unset. + // Opener's cfg.schema was discarded -- slot is still Unset. EXPECT_FALSE(opened.schema().has_value()); } @@ -989,7 +990,7 @@ TEST_F(RegionTest, SchemaCrossHandleObservesClaim) EXPECT_FALSE(r2.schema().has_value()); // r1 claims. r2 must observe Set without any further fence on its - // side — acquire-load in schema() synchronizes with r1's release-store. + // side -- acquire-load in schema() synchronizes with r1's release-store. ASSERT_TRUE(r1.try_claim_schema(make_schema("xhandle/Type", 5, 0x9A, 0xBC))); auto got_r2 = r2.schema(); @@ -1067,11 +1068,11 @@ TEST_F(RegionTest, SchemaDoesNotAffectConfigHash) auto other_cfg = default_cfg(); other_cfg.schema = make_schema("opener/Type", 2, 0xCC, 0xDD); - // Must succeed — geometry matches, schema differs but is ignored on open. + // Must succeed -- geometry matches, schema differs but is ignored on open. auto opened = kickmsg::SharedRegion::create_or_open( SHM_NAME, kickmsg::channel::PubSub, other_cfg, "opener"); - // Opener observes the creator's schema, not its own — library doesn't + // Opener observes the creator's schema, not its own -- library doesn't // overwrite or enforce anything. auto got = opened.schema(); ASSERT_TRUE(got.has_value()); @@ -1079,7 +1080,7 @@ TEST_F(RegionTest, SchemaDoesNotAffectConfigHash) } // ----------------------------------------------------------------------------- -// stats() — cross-process counter snapshot +// stats() -- cross-process counter snapshot // ----------------------------------------------------------------------------- TEST_F(RegionTest, StatsOnFreshRegionReportsZeros) @@ -1150,7 +1151,7 @@ TEST_F(RegionTest, StatsLostCountMatchesSubscriberLostOnOverflow) kickmsg::Subscriber sub(region); kickmsg::Publisher pub(region); - // Publish more than the ring can hold without draining — forces the + // Publish more than the ring can hold without draining -- forces the // subscriber's drain-ahead path to bump lost_count on its next read. uint32_t payload = 0; std::size_t const to_publish = cfg.sub_ring_capacity * 3; @@ -1167,7 +1168,7 @@ TEST_F(RegionTest, StatsLostCountMatchesSubscriberLostOnOverflow) EXPECT_GT(sub.lost(), 0u); auto s = region.stats(); - // Exactly one ring is Live — its lost_count equals the subscriber's. + // Exactly one ring is Live -- its lost_count equals the subscriber's. uint64_t ring_lost = 0; for (auto const& r : s.rings) { @@ -1196,7 +1197,7 @@ TEST_F(RegionTest, StatsPoolFreeTracksAllocations) } // ----------------------------------------------------------------------------- -// attach_create / attach_open — caller-provided memory +// attach_create / attach_open -- caller-provided memory // ----------------------------------------------------------------------------- class InjectedRegionTest : public ::testing::Test @@ -1299,7 +1300,7 @@ TEST_F(InjectedRegionTest, AttachCreateRejectsMisalignedAddress) { auto cfg = default_cfg(); auto buf = make_buffer(cfg, "x"); - auto* bad = static_cast(buf.get()) + 1; // off by one — not aligned + auto* bad = static_cast(buf.get()) + 1; // off by one -- not aligned EXPECT_THROW( kickmsg::SharedRegion::attach_create( @@ -1336,7 +1337,7 @@ TEST_F(InjectedRegionTest, UnlinkOnInjectedRegionIsNoOp) auto region = kickmsg::SharedRegion::attach_create( buf.get(), buf.size, kickmsg::channel::PubSub, cfg, "x", "should-not-be-unlinked"); - // Must not call shm_unlink on the label, which would fail if it tried — + // Must not call shm_unlink on the label, which would fail if it tried -- // the label is not a path. Just checks that the call returns cleanly. EXPECT_NO_THROW(region.unlink()); // And the region remains usable after a no-op unlink. @@ -1346,7 +1347,7 @@ TEST_F(InjectedRegionTest, UnlinkOnInjectedRegionIsNoOp) TEST_F(InjectedRegionTest, AttachOpenRejectsBufferSmallerThanHeader) { // A buffer smaller than sizeof(Header) must be rejected BEFORE any - // dereference of magic/version/total_size — otherwise the load is + // dereference of magic/version/total_size -- otherwise the load is // an out-of-bounds read on hostile or accidentally-small input. alignas(kickmsg::CACHE_LINE) std::byte tiny[kickmsg::CACHE_LINE]{}; static_assert(sizeof(tiny) < sizeof(kickmsg::Header)); @@ -1369,7 +1370,7 @@ TEST_F(InjectedRegionTest, MoveLeavesSourceWithNullBase) auto dst = std::move(src); EXPECT_EQ(dst.base(), live_base); // After move, the source must NOT still alias the destination's - // live memory — otherwise base()/header() on the moved-from object + // live memory -- otherwise base()/header() on the moved-from object // returns a dangling-looking-live pointer instead of nullptr. EXPECT_EQ(src.base(), nullptr); } @@ -1519,3 +1520,136 @@ TEST_F(CorruptedHeaderTest, RejectsCreatorNameLenPastTail) EXPECT_THROW(kickmsg::SharedRegion::attach_open(buf.get(), buf.size), std::runtime_error); } + +// --------------------------------------------------------------------------- +// Repair theft-safety: a slow-but-alive publisher whose lock is stolen must +// be detected at its commit CAS (never blind-stored over), a healthy commit +// must survive the grace pass, and a steal must back off if the entry +// changes first. +// --------------------------------------------------------------------------- + +#include "kickmsg/os/Time.h" + +TEST_F(RegionTest, RepairStealsProvenStaleLockAndResumedCommitFails) +{ + kickmsg::channel::Config cfg; + cfg.max_subscribers = 1; + cfg.sub_ring_capacity = 4; + cfg.pool_size = 8; + cfg.max_payload_size = 8; + cfg.commit_timeout = std::chrono::microseconds{1000}; + + auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); + kickmsg::Subscriber sub(region); + kickmsg::Publisher pub(region); + + for (int i = 0; i < 4; ++i) + { + uint32_t val = static_cast(i); + ASSERT_GE(pub.send(&val, sizeof(val)), 0); + ASSERT_TRUE(sub.try_receive().has_value()); + } + + auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), 0); + auto* entries = kickmsg::ring_entries(ring); + + // Stalled holder at pos=4 (idx 0): claimed the position, locked the + // entry, then was descheduled past commit_timeout. + ring->write_pos.store(5, std::memory_order_release); + uint64_t expected = 1; + ASSERT_TRUE(entries[0].sequence.compare_exchange_strong( + expected, kickmsg::seq_lock(4), + std::memory_order_acquire, std::memory_order_relaxed)); + + // The lock value is stable across the grace re-check -> repair steals. + EXPECT_EQ(region.repair_locked_entries(), 1u); + EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), + kickmsg::seq_skip(4)); + EXPECT_EQ(entries[0].slot_idx.load(std::memory_order_acquire), + kickmsg::INVALID_SLOT); + EXPECT_EQ(entries[0].payload_len.load(std::memory_order_acquire), 0u); + + // The holder resumes and commits: the CAS from its own lock value must + // fail and leave the repaired entry untouched. The old blind-store + // protocol re-stamped the same sequence here -- the torn-entry / + // sequence-rewind corruption this protocol exists to prevent. + uint64_t lock_val = kickmsg::seq_lock(4); + EXPECT_FALSE(entries[0].sequence.compare_exchange_strong( + lock_val, 5u, std::memory_order_release, std::memory_order_relaxed)); + EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), + kickmsg::seq_skip(4)); + EXPECT_EQ(entries[0].slot_idx.load(std::memory_order_acquire), + kickmsg::INVALID_SLOT); +} + +TEST_F(RegionTest, RepairGraceSparesInFlightCommit) +{ + kickmsg::channel::Config cfg; + cfg.max_subscribers = 1; + cfg.sub_ring_capacity = 4; + cfg.pool_size = 8; + cfg.max_payload_size = 8; + cfg.commit_timeout = std::chrono::microseconds{50000}; + + auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); + kickmsg::Subscriber sub(region); + kickmsg::Publisher pub(region); + + for (int i = 0; i < 4; ++i) + { + uint32_t val = static_cast(i); + ASSERT_GE(pub.send(&val, sizeof(val)), 0); + ASSERT_TRUE(sub.try_receive().has_value()); + } + + auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), 0); + auto* entries = kickmsg::ring_entries(ring); + + // Healthy holder mid-commit at pos=4. + ring->write_pos.store(5, std::memory_order_release); + uint64_t expected = 1; + ASSERT_TRUE(entries[0].sequence.compare_exchange_strong( + expected, kickmsg::seq_lock(4), + std::memory_order_acquire, std::memory_order_relaxed)); + + std::atomic repaired{SIZE_MAX}; + std::thread repairer([&] { repaired = region.repair_locked_entries(); }); + + // Commit while the repairer sits in its 50 ms grace sleep: the re-check + // sees the value changed and must NOT steal. + kickmsg::sleep(std::chrono::microseconds{10000}); + entries[0].slot_idx.store(kickmsg::INVALID_SLOT, std::memory_order_relaxed); + entries[0].payload_len.store(0, std::memory_order_relaxed); + uint64_t lock_val = kickmsg::seq_lock(4); + EXPECT_TRUE(entries[0].sequence.compare_exchange_strong( + lock_val, 5u, std::memory_order_release, std::memory_order_relaxed)); + + repairer.join(); + EXPECT_EQ(repaired.load(), 0u); + EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), 5u); +} + +TEST_F(RegionTest, StealBacksOffWhenEntryChangesFirst) +{ + kickmsg::channel::Config cfg; + cfg.max_subscribers = 1; + cfg.sub_ring_capacity = 4; + cfg.pool_size = 8; + cfg.max_payload_size = 8; + + auto region = kickmsg::SharedRegion::create(SHM_NAME, kickmsg::channel::PubSub, cfg); + + auto* ring = kickmsg::sub_ring_at(region.base(), region.header(), 0); + auto* entries = kickmsg::ring_entries(ring); + + // A repairer observed the lock, but the holder committed first. + entries[0].slot_idx.store(3, std::memory_order_relaxed); + entries[0].payload_len.store(7, std::memory_order_relaxed); + entries[0].sequence.store(5, std::memory_order_release); + + EXPECT_FALSE(kickmsg::entry_steal_and_clear(entries[0], 4, + kickmsg::seq_lock(4))); + EXPECT_EQ(entries[0].sequence.load(std::memory_order_acquire), 5u); + EXPECT_EQ(entries[0].slot_idx.load(std::memory_order_acquire), 3u); + EXPECT_EQ(entries[0].payload_len.load(std::memory_order_acquire), 7u); +}