Skip to content

fix(python): avoid async callback GIL deadlock via unbounded work channel#2005

Open
chaliy wants to merge 7 commits into
mainfrom
2026-06-09-propose-fix-for-gil-deadlock-issue
Open

fix(python): avoid async callback GIL deadlock via unbounded work channel#2005
chaliy wants to merge 7 commits into
mainfrom
2026-06-09-propose-fix-for-gil-deadlock-issue

Conversation

@chaliy

@chaliy chaliy commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What

Switches the private-loop async callback work dispatch from a zero-capacity rendezvous channel (sync_channel(0)) to an unbounded channel(), so send() can never block while the caller holds the GIL.

Rebased on main (includes #2007/#2008 fixes). Updates TM-PY-030 threat model entry and corrects the inline code comment.

Why

The sync_channel(0) rendezvous was the root cause of the TM-PY-030 variant (1) GIL deadlock: on first call, the worker must attach to Python to create its asyncio loop before it reaches recv(), so dispatcher and worker waited on each other with the GIL held.

#2007 mitigated this by detaching the GIL around the send. This PR eliminates the root cause entirely: an unbounded channel makes send() non-blocking regardless of worker state. Only the result wait (result_rx.recv()) needs the GIL detached, which is preserved.

Changes

  • crates/bashkit-python/src/lib.rs: SyncSender/sync_channel(0)Sender/channel() for the work dispatch channel; update TM-PY-030 inline comment to reflect unbounded semantics.
  • crates/bashkit-python/tests/test_async_callbacks.py: add subprocess-based regression test test_async_callback_execute_sync_first_private_loop_call_does_not_deadlock that reliably detects the deadlock (runs in a fresh process with 3 s timeout).
  • specs/threat-model.md: update TM-PY-030 mitigation description for variant (1) and add regression test reference.

Security

The unbounded channel cannot cause unbounded queuing: result_rx.recv() blocks until the worker finishes, so the caller cannot issue a second send before the first result returns — queue depth ≤ 1 in practice.

Copilot AI review requested due to automatic review settings June 9, 2026 21:56
@chaliy chaliy added the aardvark label Jun 9, 2026
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 9, 2026

Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
bashkit 4199313 Commit Preview URL Jun 10 2026, 09:44 AM

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a deterministic Python GIL deadlock in the private-loop async callback path by ensuring the caller’s work dispatch cannot block while still holding the GIL, and adds a regression test that reliably detects the deadlock via a subprocess.

Changes:

  • Replace the private-loop worker work channel from a zero-capacity rendezvous channel to an unbounded std::sync::mpsc::Sender so send() cannot block during worker startup while the caller holds the GIL.
  • Keep the result handoff on a zero-capacity channel with the caller waiting while detached from the GIL.
  • Add a subprocess-based pytest regression test to ensure the first execute_sync() call with an async callback on the private loop completes (no deadlock).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/bashkit-python/src/lib.rs Switch private-loop work dispatch to an unbounded channel and update comments explaining the deadlock avoidance.
crates/bashkit-python/tests/test_async_callbacks.py Add a subprocess regression test to catch the GIL deadlock deterministically.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/bashkit-python/src/lib.rs
Comment thread crates/bashkit-python/tests/test_async_callbacks.py Outdated
chaliy added 2 commits June 10, 2026 09:08
The previous comment described a capacity-0 rendezvous channel and
explained why GIL-detach was needed for the send. With the unbounded
channel the send is non-blocking; GIL release is only required for the
result rendezvous. Update the threat-model comment to match.
@chaliy chaliy force-pushed the 2026-06-09-propose-fix-for-gil-deadlock-issue branch from 3ff6055 to 13fc9d3 Compare June 10, 2026 09:11
Variant (1) is now mitigated by using an unbounded channel() so send()
cannot block, rather than solely relying on GIL detach around a rendezvous.
Add regression test reference for the first-call deadlock.
@chaliy chaliy changed the title fix(python): avoid async callback GIL deadlock fix(python): avoid async callback GIL deadlock via unbounded work channel Jun 10, 2026
chaliy added 4 commits June 10, 2026 09:18
- Propagate sys.path via PYTHONPATH to subprocess so it finds the
  compiled native extension in CI (maturin PYTHONPATH-only setup)
- Increase subprocess timeout from 3s to 10s to avoid flakiness on
  loaded CI runners during extension startup
- Add comment to unbounded channel explaining natural queue-depth bound
  (each caller blocks on result_rx.recv(), so queue depth ≤ concurrent calls)
… test

The previous PYTHONPATH propagation (sys.path join) injected source-tree
paths before site-packages, causing the subprocess to find bashkit/__init__.py
from the source directory but fail to find _bashkit.so (which is only in
site-packages after `pip install`).

Use importlib.util.find_spec('bashkit._bashkit') to locate the .so in the
same installation the test process uses, then prepend its parent directory
to PYTHONPATH. Works for both pip-installed (site-packages) and maturin-
develop (source tree) setups.
…nd_spec

find_spec("bashkit._bashkit") returns None in CI because pytest already
imports bashkit from the source tree; the submodule lookup resolves in that
context, finds no .so, and returns None — leaving PYTHONPATH unset and the
subprocess still hitting the source tree.

Switch to scanning sys.path with glob.glob for _bashkit*.so / *.pyd: this
is independent of Python import state and correctly finds site-packages (pip
install) or the source tree (maturin develop) in all environments.
pytest runs from crates/bashkit-python/, so the subprocess inherits that
CWD. Python always puts '' (= CWD) as sys.path[0], placing the source tree
before any PYTHONPATH entries — including the site-packages path we derive
from the glob scan. Result: subprocess finds bashkit/__init__.py from source
but not _bashkit.so, causing ModuleNotFoundError.

Pass cwd='/' so '' resolves to a neutral directory with no bashkit/ subdir.
The PYTHONPATH (site-packages or maturin-develop source tree, found via glob)
is then the first real bashkit candidate and the .so is always found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants