fix(threading): catch std::system_error in ThreadPool::worker_thread cv.wait by rmcgrorty · Pull Request #656 · cactus-compute/cactus

rmcgrorty · 2026-05-20T12:52:54Z

Summary

CactusThreading::ThreadPool::worker_thread() calls cv.wait() without a try/catch. On macOS, libc++'s condition_variable::wait translates non-zero pthread_cond_wait return codes into std::system_error. When the destructor races a parked worker (stop = true; notify_all() followed by the worker's lambda predicate read), the wait can throw, and the uncaught exception propagates to std::terminate.

This patch wraps the wait in a try/catch. On caught std::system_error, the worker re-checks the predicate. If shutting down, it returns cleanly; otherwise it loops and re-waits. The catch is narrow — it does not mask other exceptions, and it does not change normal-path semantics.
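To make the "narrow catch" claim concrete, here is a minimal sketch of how a handler for std::system_error behaves next to an unrelated exception. The helper names (absorbed_system_error, other_exception_escapes) are hypothetical and exist only for illustration; they are not part of the Cactus codebase.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <system_error>

// Hypothetical helper: runs `wait_step` under the same narrow handler the
// patch uses and reports whether a std::system_error was absorbed.
bool absorbed_system_error(const std::function<void()>& wait_step) {
    try {
        wait_step();
    } catch (const std::system_error&) {
        return true;  // only std::system_error (and types derived from it) land here
    }
    return false;
}

// Returns true if an unrelated std::runtime_error escapes the narrow handler.
// std::system_error derives from std::runtime_error, not the other way round,
// so a plain runtime_error is not caught by the handler above.
bool other_exception_escapes() {
    try {
        absorbed_system_error([] { throw std::runtime_error("unrelated failure"); });
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

A handler written as catch (...) or catch (const std::exception&) would instead swallow the unrelated failure, which is the masking behavior the patch avoids.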

Motivation

We hit this in production six times across 2026-04-30 → 2026-05-19, all with identical stack signatures:

std::terminate
  -> std::__1::condition_variable::wait(...)
  -> CactusThreading::ThreadPool::worker_thread()

The crash is reliably triggerable on macOS during STT workloads where multiple Transcriber instances share the process-global static ThreadPool (e.g. multi-channel streaming sessions where N transcribers are constructed and dropped together). Shutdown sequencing of the dropping transcribers races the workers parked on cv.wait.

We tried three workarounds on our side first:

Fix #1: synchronous Drop join on the Rust wrapper (~4× crash reduction, didn't eliminate)
Fix #3: long-lived Arc<Model> to avoid per-session cactus_init/cactus_destroy (narrowed further, didn't eliminate)
Fix #2 (deferred): connection serialization

Each layer added Rust-side complexity without closing the race, because the underlying defect — uncaught exception inside a third-party C++ thread — can't be cleanly compensated for from the FFI consumer side. Catching the exception in worker_thread itself is the right place.

Patch shape

void worker_thread() {
    while (true) {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(mutex);
            try {
                work_available.wait(lock, [this] {
                    return stop || !tasks.empty();
                });
            } catch (const std::system_error&) {
                // libc++ on macOS can throw std::system_error from
                // condition_variable::wait if the underlying pthread
                // primitive returns a non-zero status during shutdown
                // races. Treat as a shutdown signal: re-check predicate,
                // exit if stopping, otherwise loop.
                if (stop && tasks.empty()) {
                    return;
                }
                continue;
            }

            if (stop && tasks.empty()) {
                return;
            }

            task = std::move(tasks.front());
            tasks.pop_front();
        }

        task();

        if (pending_tasks.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            std::lock_guard<std::mutex> lock(mutex);
            work_done.notify_one();
        }
    }
}
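The control flow of the patch can be exercised in isolation with a small stand-in pool. The sketch below is an assumption-laden miniature, not the Cactus ThreadPool: SketchPool, wait_for_work, run_sketch and the injected failure counter are all hypothetical, and the injected std::system_error only stands in for whatever the platform wait might throw. It says nothing about when libc++ actually throws.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <system_error>
#include <thread>
#include <utility>
#include <vector>

struct SketchResult { int tasks_run; int caught; };

// Hypothetical miniature pool used only to exercise the catch-and-recheck loop.
class SketchPool {
public:
    SketchPool(std::size_t workers, int injected_failures)
        : failures_left_(injected_failures) {
        for (std::size_t i = 0; i < workers; ++i)
            workers_.emplace_back([this] { worker_thread(); });
    }
    ~SketchPool() { shutdown(); }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        work_available_.notify_one();
    }

    // Signals stop and joins the workers; queued tasks are drained first.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        work_available_.notify_all();
        for (auto& w : workers_)
            if (w.joinable()) w.join();
    }

    int caught() const { return caught_.load(); }

private:
    // Called with the lock held. The first `injected_failures` calls throw
    // instead of waiting (assumption: simulates a failing wait).
    void wait_for_work(std::unique_lock<std::mutex>& lock) {
        if (failures_left_ > 0) {
            --failures_left_;
            throw std::system_error(
                std::make_error_code(std::errc::invalid_argument),
                "injected wait failure");
        }
        work_available_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
    }

    void worker_thread() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                try {
                    wait_for_work(lock);
                } catch (const std::system_error&) {
                    ++caught_;
                    if (stop_ && tasks_.empty()) return;  // shutting down
                    continue;                             // otherwise re-wait
                }
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable work_available_;
    std::deque<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::atomic<int> caught_{0};
    int failures_left_;   // guarded by mutex_
    bool stop_ = false;   // guarded by mutex_
};

// Runs 8 tasks on 2 workers with 3 injected wait failures.
SketchResult run_sketch() {
    std::atomic<int> done{0};
    SketchPool pool(2, 3);
    for (int i = 0; i < 8; ++i) pool.submit([&done] { ++done; });
    pool.shutdown();
    return {done.load(), pool.caught()};
}
```

In this sketch every submitted task still runs and each injected failure is absorbed exactly once. One design consequence worth noting: if a wait failed on every call rather than transiently, the continue branch would spin without blocking, so a retry bound or backoff may be worth considering.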

Test plan

Builds clean against the existing cactus-sys consumer.
Cactus Whisper Small streaming session start/stop cycle no longer crashes (was reliably reproducing pre-patch).
Multi-day production acceptance window pending (target: zero new crashes with this signature in 14+ days).
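The start/stop cycle in the test plan could also be approximated by a standalone churn harness. The following is a rough sketch under stated assumptions: churn is a hypothetical function, it uses bare std::thread and std::condition_variable rather than the Cactus ThreadPool, and it is not claimed to reproduce the macOS crash on its own. It only shows the shape of a regression test that repeatedly parks workers and tears them down immediately.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical harness: each cycle parks `workers` threads on a condition
// variable, then immediately signals stop and joins them. Returns the total
// number of workers that exited through the normal path.
int churn(int cycles, int workers) {
    int clean_exits = 0;
    for (int c = 0; c < cycles; ++c) {
        std::mutex m;
        std::condition_variable cv;
        bool stop = false;  // guarded by m
        int exited = 0;     // guarded by m
        std::vector<std::thread> pool;
        for (int w = 0; w < workers; ++w) {
            pool.emplace_back([&] {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return stop; });
                ++exited;  // lock is held again after wait returns
            });
        }
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();  // races the workers that are still parking
        for (auto& t : pool) t.join();
        clean_exits += exited;  // safe: all workers have been joined
    }
    return clean_exits;
}
```

Calling churn(50, 4) should report 200 clean exits when no worker terminates abnormally; a crash with the signature above would abort the process before the function returns.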

🤖 Generated with Claude Code

…cv.wait

On macOS, libc++ translates non-zero pthread_cond_wait return codes into std::system_error. During ThreadPool destruction, a parked worker thread on cv.wait can race with the destructor's notify_all and observe such an error, which then propagates uncaught and triggers std::terminate.

Symptom in the wild (M3M0 desktop app, 6 crashes 2026-04-30 → 2026-05-19, identical signature):

std::terminate -> std::condition_variable::wait -> CactusThreading::ThreadPool::worker_thread

The shared, process-global static ThreadPool plus per-session Model init/destroy cycles trigger this reliably under STT workloads, especially multi-channel sessions where multiple Transcribers share the pool.

Wrap the wait in try/catch. On caught std::system_error, re-check the predicate. Exit cleanly if shutting down, otherwise loop and re-wait. This is a narrow catch that does not mask other exceptions and does not change normal-path semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmcgrorty wants to merge 1 commit into cactus-compute:main from rmcgrorty:fix/threadpool-cv-wait-exception-handler