Skip to content

Fix cross-thread data races where event-loop coroutines complete off the loop thread#54

Open
battlesnake wants to merge 6 commits into
masterfrom
fix/event-loop-task-pool-thread-race
Open

Fix cross-thread data races where event-loop coroutines complete off the loop thread#54
battlesnake wants to merge 6 commits into
masterfrom
fix/event-loop-task-pool-thread-race

Conversation

@battlesnake

@battlesnake battlesnake commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

rpp::event_loop is meant to resume every coroutine on its own thread. Two related cases broke
that and tripped TSAN — both with the same root cause: completion was observed off the loop
instead of routed back to it.

The two races

  1. event_task awaiting a task — if the task's body finished on a pool thread (because it
    co_awaited an rpp::cfuture), task_final_awaiter did a symmetric transfer that dragged the
    event_task onto the pool thread, racing the loop's done() check.
  2. run_until_done(task<T>&) / (deferred<T>&) — these drove a top-level task by polling
    handle.done() in a loop. If the task finished on a pool thread, that poll raced the pool
    thread on the frame, and the result could be read with no happens-before.

Repro for (1), now a regression test:

auto pool_completing_task = [&]() -> rpp::task<int> {
    rpp::cfuture<int> fut = rpp::async_task([] { rpp::sleep_ms(1); return 42; });
    int r = co_await fut;   // cfuture resumes the task body on a POOL thread
    co_return r;            // co_return -> task_final_awaiter runs on the pool thread
};
rpp::event_task et = [&]() -> rpp::event_task {
    int r = co_await pool_completing_task();  // before fix: continuation drifts off-loop
}();
loop->run_until_done(et);

The fix — make completion loop-affine

  • task_final_awaiter posts the awaiter's continuation back to the owning loop (via a
    thread-local loop handle) instead of resuming inline on whatever thread finished the task. A
    small lock-free CAS handshake on promise.continuation coordinates the two sides — the loop
    thread registering the awaiter vs. the pool thread completing — with no lost wakeups or
    double-resumes. Standalone use (no loop) is unchanged. The captured loop handle is a plain
    member published through that CAS, so it stays correct on weak-memory hardware (ARM).
  • run_until_done(task/deferred) now awaits the task from a tiny internal event_task and
    reuses the (now loop-affine) event_task driver, so the result is produced and the frame
    destroyed on the loop thread. The old polling loop is deleted.

No public API change — the new pieces live in rpp::detail.

Verification (gcc + clang, TSAN)

  • test_event_loop 49/49, including 5 new regression tests (event_task await, plus
    run_until_done for task / deferred / void / exception propagation). The key ones stay clean
    30–50× under test_until_failure.
  • Full suite green except the pre-existing basic_duration_coro timing assertion (fails on
    master too — TSAN overhead).
  • clang-tidy clean. The 2 pre-existing test_event_loop warnings (lines 89/149) are a
    different mechanism (background_awaiter_fut) and are left untouched.

Notes

  • README.md line references refreshed mechanically (update_doc_linerefs.py); no API or
    display-text changes.
  • 6 commits, kept separate to show the progression (race fix → memory-ordering fix → simplify →
    demote-to-plain → docs → the second race). Squash-merge if you'd rather have one.

battlesnake and others added 6 commits June 20, 2026 14:40
… via cfuture

Root cause: task_final_awaiter did inline symmetric transfer to the event_task
continuation on whatever thread the inner task completed on (e.g. a pool thread
resuming a co_await cfuture). The event_task body then ran off the loop thread,
racing the loop thread's event_task.done() check in run_until_done().

Fix (Option A - loop-affine by default):

- task.h: add inline thread_local tl_loop_post_fn/tl_loop_ctx in rpp::detail.
  task_base::await_suspend captures them into the promise atomics (continuation,
  resume_via, resume_ctx) using a CAS handshake with task_final_awaiter so that
  both the "task finishes first" and "awaiter sets up first" races are safe.
  task_final_awaiter reads them with acquire semantics and posts the continuation
  back to the owner loop instead of doing symmetric transfer.

- event_loop.cpp: constructor registers loop_post_resume on the owner thread's
  TLS (covering eagerly-started event_tasks before the first process_event call).
  process_event saves/restores TLS around each resume so nested loops are correct.
  loop_post_resume is a static C-function trampoline that calls post_resume().

- tests/test_event_loop.cpp: add TestCase(tsan_task_await_resumes_on_loop_thread)
  that co_awaits a cfuture-completing task<int> from an event_task and asserts
  the continuation runs on the loop thread; passes TSAN clean 5/5 runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The CAS failure in task_final_awaiter used memory_order_relaxed, so it did
not synchronize with the release CAS in task_base::await_suspend. On
weak-memory architectures (ARM) this means resume_via/resume_ctx stores
(written before the await_suspend release CAS) were not guaranteed visible
to task_final_awaiter after the relaxed failure, breaking the loop-post-back
mechanism.

Fix: use memory_order_acquire on the CAS failure so it pairs with
await_suspend's release and sees all prior relaxed stores. The subsequent
resume_via.load can then be relaxed (HB already established).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge tl_loop_post_fn + tl_loop_ctx into a single loop_ctx struct
  (one TLS slot; cleaner save/restore in process_event)
- Hoist resume_via/ctx stores before the lazy branch in await_suspend
  (was duplicated in two places, now one store unconditionally)
- await_suspend CAS-failure path: use thread-locals directly instead of
  re-reading from atomics we just wrote on the same thread
- Trim WHAT comments per style guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adversarial review (5-lens propose + 3-skeptic verify) of the CAS handshake
confirmed resume_via/resume_ctx carried zero synchronization: they were two
relaxed atomics per promise specialization, written on the loop thread before
the release CAS on `continuation` and read on the completing thread only after
the acquire-on-failure CAS. That is the textbook "publish a plain payload via a
release flag" idiom, so they can be a single plain detail::loop_ctx member —
the `continuation` atomic provides all the happens-before. Removes 4 atomic
objects + 4 relaxed memory-order tokens, and makes the contract clearer (the
relaxed atomics implied an ordering they did not carry).

The four `continuation` memory orders and the done-sentinel are unchanged: the
demotion's safety rests entirely on the acquire-on-failure CAS staying acquire.

Also: move loop_post_resume above its first use, drop its forward declaration;
tighten the ordering WHY-comments to name both publication sites (the lazy
first-await release store is a second one the old wording omitted).

Verified on the worktree (not master) with gcc and clang TSAN:
test_event_loop 45/45 incl. tsan_task_await_resumes_on_loop_thread (30/30 under
test_until_failure), full suite green except the pre-existing basic_duration_coro
timing assertion, clang-tidy clean, only the 2 pre-existing showcase warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mechanical line-number sync (update_doc_linerefs.py); no display-text or API
changes. Reconciles refs that drifted as this branch reworked task.h /
event_loop handshake (and some pre-existing master drift in sprint.h/tests.h).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… thread

Same root cause as this PR's main fix, in a second spot. run_until_done() drove a
top-level task by polling handle.done() in a loop; if the task body finished on a
pool thread (e.g. it awaited a cfuture directly), that poll raced the pool thread
and could read the result before it was ready.

Now run_until_done() awaits the task from a small internal event_task, so
completion is posted back to the loop and the result is read on the loop thread —
reusing the path the event_task driver already uses. The old polling loop is gone.

Tests cover the task, deferred, void and exception cases; all clean under TSAN on
gcc and clang. (README line refs refreshed mechanically.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@battlesnake battlesnake changed the title Fix cross-thread data race: event-loop coroutine drifts to a pool thread when awaiting a task that completes off-loop Fix cross-thread data races where event-loop coroutines complete off the loop thread Jun 20, 2026
battlesnake added a commit that referenced this pull request Jun 21, 2026
…drift)

#54 made task co_await loop-affine (task_final_awaiter posts the continuation
back to the owner loop). cfuture coroutines (initial/final_suspend = never, no
task_final_awaiter) were left out: cfuture::await_suspend spawns a pool worker
that waits on the future and called cont.resume() on that worker, so an
event-loop-bound coroutine that co_awaits a cfuture drifts onto a pool thread
and then races the loop thread on its own coroutine frame. A chain of cfuture
coroutines (each co_awaiting the next) drifts across pool threads with no
happens-before -> TSAN data race on the frame.

Fix (mirror #54's loop-affine-by-default):
- future_types.h: move the loop-context TLS (loop_post_fn / loop_ctx / tl_loop)
  here from task.h so both the task and the cfuture awaiters can capture it.
- future.h: cfuture<T>::await_suspend and cfuture<void>::await_suspend capture
  tl_loop and, on the pool worker, post the continuation back to the owner loop
  (falling back to cont.resume() for standalone, no-loop use).
- event_loop.cpp: ~event_loop clears tl_loop if it still points at this loop.
  The constructor advertises the loop on the owner thread's TLS; without the
  matching clear a later task/cfuture await on that thread would capture a
  pointer to the destroyed loop (use-after-free) - a pre-existing latent bug
  the cfuture change above would otherwise widen.

Test: tsan_cfuture_await_resumes_on_loop_thread - an event_task co_awaits a
pool-completing cfuture directly and asserts the continuation runs on the loop
thread (RED before the fix, GREEN after). The existing run_until_done task test
is updated: with the loop-affine cfuture await the body now stays on the loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant