Skip to content

future: make cfuture co_await loop-affine too (stacked on #54)#55

Draft
battlesnake wants to merge 1 commit into
fix/event-loop-task-pool-thread-racefrom
fix/cfuture-coroutine-pool-drift
Draft

future: make cfuture co_await loop-affine too (stacked on #54)#55
battlesnake wants to merge 1 commit into
fix/event-loop-task-pool-thread-racefrom
fix/cfuture-coroutine-pool-drift

Conversation

@battlesnake

@battlesnake battlesnake commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #54 (base branch: fix/event-loop-task-pool-thread-race).

Draft: needs rework. This regresses a downstream radio-config handshake (see comment below) and is not ready to land.

Problem

#54 made task co_await loop-affine: task_final_awaiter posts the continuation back to the owner loop instead of resuming on the completing pool thread. cfuture coroutines were left out - they use initial_suspend/final_suspend = suspend_never and have no task_final_awaiter, and cfuture::await_suspend spawns a pool worker that calls cont.resume() on that worker. So an event-loop-bound coroutine that co_awaits a cfuture drifts onto a pool thread, and a chain of cfuture coroutines hops across pool threads with no happens-before edge, racing each other's frames under TSAN.

Approach (mirror #54's loop-affine-by-default)

  • future_types.h: move the loop-context TLS (loop_post_fn / loop_ctx / tl_loop) here from task.h so both the task and cfuture awaiters share it.
  • future.h: cfuture<T>/cfuture<void>::await_suspend capture tl_loop and post the continuation back to the owner loop (falling back to cont.resume() with no loop context).
  • event_loop.cpp: ~event_loop clears tl_loop if it still points at this loop, so a later await on that thread cannot capture a pointer to the destroyed loop.

Test

tsan_cfuture_await_resumes_on_loop_thread: an event_task co_awaits a pool-completing cfuture and asserts the continuation runs on the loop thread.

Category LOC
prod ~28
tests ~35
docs README linerefs

…drift)

#54 made task co_await loop-affine (task_final_awaiter posts the continuation
back to the owner loop). cfuture coroutines (initial/final_suspend = never, no
task_final_awaiter) were left out: cfuture::await_suspend spawns a pool worker
that waits on the future and called cont.resume() on that worker, so an
event-loop-bound coroutine that co_awaits a cfuture drifts onto a pool thread
and then races the loop thread on its own coroutine frame. A chain of cfuture
coroutines (each co_awaiting the next) drifts across pool threads with no
happens-before -> TSAN data race on the frame.

Fix (mirror #54's loop-affine-by-default):
- future_types.h: move the loop-context TLS (loop_post_fn / loop_ctx / tl_loop)
  here from task.h so both the task and the cfuture awaiters can capture it.
- future.h: cfuture<T>::await_suspend and cfuture<void>::await_suspend capture
  tl_loop and, on the pool worker, post the continuation back to the owner loop
  (falling back to cont.resume() for standalone, no-loop use).
- event_loop.cpp: ~event_loop clears tl_loop if it still points at this loop.
  The constructor advertises the loop on the owner thread's TLS; without the
  matching clear a later task/cfuture await on that thread would capture a
  pointer to the destroyed loop (use-after-free) - a pre-existing latent bug
  the cfuture change above would otherwise widen.

Test: tsan_cfuture_await_resumes_on_loop_thread - an event_task co_awaits a
pool-completing cfuture directly and asserts the continuation runs on the loop
thread (RED before the fix, GREEN after). The existing run_until_done task test
is updated: with the loop-affine cfuture await the body now stays on the loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@battlesnake battlesnake force-pushed the fix/cfuture-coroutine-pool-drift branch from 97ec896 to 0bfa8d2 Compare June 21, 2026 01:26
@battlesnake

Copy link
Copy Markdown
Collaborator Author

Heads up before this merges: I pinned a downstream consumer (krattlink) to this commit to verify it end to end, and it regressed a radio-config handshake test.

RadioConfigPersistenceF.sync_after_write_happy_path went from 102ms passing (with just #54) to a 13s "timeout waiting for remote APPLIED" with this PR, deterministically across the asan, clang-tidy and windows jobs. Making cfuture co_await loop-affine changes the apply-chain timing so the cross-component (AIR<->GND) handshake stalls, most likely because the downstream test pump model does not drive both event loops enough once the apply chain becomes loop-gated.

I've reverted the krattlink pin back to #54 for now (which is functionally correct, 102ms pass). This PR needs rework before it can land: keep cfuture awaits loop-affine without stalling a cross-loop handshake. Worth investigating whether the loop post needs a wakeup/notify, or whether the downstream test simply needs to pump both loops. Converting to draft until then.

@battlesnake battlesnake marked this pull request as draft June 21, 2026 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant