fix(threading): catch std::system_error in ThreadPool::worker_thread cv.wait #656

Open
rmcgrorty wants to merge 1 commit into cactus-compute:main from rmcgrorty:fix/threadpool-cv-wait-exception-handler
@rmcgrorty

Summary

CactusThreading::ThreadPool::worker_thread() waits on its condition variable (work_available.wait(), "cv.wait" below) without a try/catch. On macOS, libc++'s condition_variable::wait translates non-zero pthread_cond_wait return codes into std::system_error. When the destructor races a parked worker (the destructor sets stop = true and calls notify_all(), followed by the worker's read of the lambda predicate), the wait can throw, and the uncaught exception propagates to std::terminate.

This patch wraps the wait in a try/catch. On caught std::system_error, the worker re-checks the predicate. If shutting down, it returns cleanly; otherwise it loops and re-waits. The catch is narrow — it does not mask other exceptions, and it does not change normal-path semantics.
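
To illustrate what "narrow" means here, the sketch below isolates the catch clause from the pool. It is not code from this patch: guarded_wait, WaitOutcome, and the helper functions are hypothetical names introduced only for the example. It shows that a handler for const std::system_error& intercepts that type while a plain std::runtime_error still propagates to the caller.

```cpp
#include <cassert>
#include <stdexcept>
#include <system_error>

// Hypothetical outcome type, used only in this illustration.
enum class WaitOutcome { Completed, SystemErrorCaught };

// Runs wait_fn and handles only std::system_error (and types derived from it).
// Any other exception type propagates to the caller unchanged.
template <typename WaitFn>
WaitOutcome guarded_wait(WaitFn&& wait_fn) {
    try {
        wait_fn();
        return WaitOutcome::Completed;
    } catch (const std::system_error&) {
        return WaitOutcome::SystemErrorCaught;
    }
}

// True if a std::system_error thrown by the callable is absorbed by guarded_wait.
inline bool system_error_is_caught() {
    auto failing_wait = [] {
        throw std::system_error(std::make_error_code(std::errc::invalid_argument),
                                "simulated wait failure");
    };
    return guarded_wait(failing_wait) == WaitOutcome::SystemErrorCaught;
}

// True if an unrelated std::runtime_error escapes guarded_wait.
inline bool runtime_error_escapes() {
    try {
        guarded_wait([] { throw std::runtime_error("unrelated failure"); });
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Usage: both properties hold at the same time.
inline void check_narrow_catch() {
    assert(system_error_is_caught());
    assert(runtime_error_escapes());
}
```

std::system_error derives from std::runtime_error, so the handler catches the derived type only; a catch (const std::runtime_error&) or catch (...) would be the broader alternatives that this patch avoids.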

Motivation

We hit this in production six times across 2026-04-30 → 2026-05-19, all with identical stack signatures:

std::terminate
  -> std::__1::condition_variable::wait(...)
  -> CactusThreading::ThreadPool::worker_thread()

The crash is reliably triggerable on macOS during STT workloads where multiple Transcriber instances share the process-global static ThreadPool (e.g. multi-channel streaming sessions where N transcribers are constructed and dropped together). Shutdown sequencing of the dropping transcribers races the workers parked on cv.wait.

We tried three workarounds on our side first:

Each layer added Rust-side complexity without closing the race, because the underlying defect — uncaught exception inside a third-party C++ thread — can't be cleanly compensated for from the FFI consumer side. Catching the exception in worker_thread itself is the right place.

Patch shape

void worker_thread() {
    while (true) {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(mutex);
            try {
                work_available.wait(lock, [this] {
                    return stop || !tasks.empty();
                });
            } catch (const std::system_error&) {
                // libc++ on macOS can throw std::system_error from
                // condition_variable::wait if the underlying pthread
                // primitive returns a non-zero status during shutdown
                // races. Treat it as a possible shutdown signal: re-check the
                // predicate, exit if stopping, otherwise loop and wait again.
                if (stop && tasks.empty()) {
                    return;
                }
                continue;
            }

            if (stop && tasks.empty()) {
                return;
            }

            task = std::move(tasks.front());
            tasks.pop_front();
        }

        task();

        if (pending_tasks.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            std::lock_guard<std::mutex> lock(mutex);
            work_done.notify_one();
        }
    }
}

Test plan

  • Builds clean against the existing cactus-sys consumer.
  • Cactus Whisper Small streaming session start/stop cycle no longer crashes (was reliably reproducing pre-patch).
  • Multi-day production acceptance window pending (target: zero new crashes with this signature in 14+ days).
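
A start/stop cycle like the one in the second item could be approximated outside the app with a small stress loop. The sketch below is an assumption-laden stand-in, not the Cactus test: MiniPool and run_start_stop_cycles are hypothetical names, and MiniPool only mirrors the worker loop from "Patch shape" rather than the real ThreadPool. On platforms where the wait never throws, it exercises only the normal path, so a passing run shows the patched loop still drains and joins correctly; it does not prove the race is closed.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <system_error>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pool; it mirrors the patched worker loop only.
class MiniPool {
public:
    explicit MiniPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_thread(); });
    }
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;  // set under the mutex so parked workers observe it
        }
        work_available_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        work_available_.notify_one();
    }

private:
    void worker_thread() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                try {
                    work_available_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                } catch (const std::system_error&) {
                    if (stop_ && tasks_.empty()) return;
                    continue;  // releases the lock, then re-enters the loop
                }
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable work_available_;
    std::deque<std::function<void()>> tasks_;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};

// Builds and destroys a pool `cycles` times and returns how many tasks ran.
inline int run_start_stop_cycles(int cycles, int tasks_per_cycle) {
    assert(cycles >= 0 && tasks_per_cycle >= 0);
    std::atomic<int> ran{0};
    for (int c = 0; c < cycles; ++c) {
        MiniPool pool(4);
        for (int t = 0; t < tasks_per_cycle; ++t)
            pool.enqueue([&ran] { ran.fetch_add(1, std::memory_order_relaxed); });
    }  // each destructor lets workers drain the queue, then joins them
    return ran.load();
}
```

Because workers exit only when stop_ is set and the queue is empty, every enqueued task runs before the destructor returns, so run_start_stop_cycles(cycles, k) is expected to return cycles * k. Running such a loop under ThreadSanitizer on macOS would be one way to look for the shutdown race more directly.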

🤖 Generated with Claude Code

…cv.wait

On macOS, libc++ translates non-zero pthread_cond_wait return codes into
std::system_error. During ThreadPool destruction, a parked worker thread
on cv.wait can race with the destructor's notify_all and observe such an
error, which then propagates uncaught and triggers std::terminate.

Symptom in the wild (M3M0 desktop app, 6 crashes 2026-04-30 → 2026-05-19,
identical signature):
  std::terminate -> std::condition_variable::wait
                 -> CactusThreading::ThreadPool::worker_thread

The shared, process-global static ThreadPool plus per-session Model
init/destroy cycles trigger this reliably under STT workloads, especially
multi-channel sessions where multiple Transcribers share the pool.

Wrap the wait in try/catch. On caught std::system_error, re-check the
predicate. Exit cleanly if shutting down, otherwise loop and re-wait.
This is a narrow catch that does not mask other exceptions and does not
change normal-path semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>