Skip to content

fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481

Open
bmehta001 wants to merge 3 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-curl-async-self-join
Open

fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481
bmehta001 wants to merge 3 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-curl-async-self-join

Conversation

@bmehta001

Copy link
Copy Markdown
Contributor

Problem

modules-repo (cpp_client_telemetry_modules) CI on Linux + macOS crashes during the first ECSClientFuncTests test with:

terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
Aborted (core dumped)

Affects every PR that gets past compile (e.g. modules #277).

Root cause

std::future<long> & SendAsync(std::function<void(CurlHttpOperation &)> callback = nullptr) {
    result = std::async(std::launch::async, [this, callback] {
        long result = Send();
        if (callback) callback(*this);   // <-- callback often holds a strong ref to *this
        return result;
    });
    return result;
}

The callback fired from the async thread is the last owner of the CurlHttpOperation. When it releases its shared_ptr, ~CurlHttpOperation runs on that same async thread.

~CurlHttpOperation then destroys its std::future<long> result member. libstdc++'s ~future -> ~_Async_state_impl calls _M_complete_async -> _M_join via std::call_once. Joining the async thread from the async thread is a self-join — call_once throws std::system_error("Resource deadlock avoided"). Because the throw escapes a noexcept destructor, terminate() aborts the process.

A try/catch around the future or even around an explicit result.wait() cannot rescue this: std::future's destructor itself is noexcept, so the throw lands at the noexcept boundary before any user catch can engage.

GDB confirms (modules CI reproduction, Ubuntu 24.04, libstdc++ 13):

#11 ... __cxa_throw ()                                                                                                     
#12 ... std::__throw_system_error(int) ()
#14 std::__future_base::_Async_state_impl<...CurlHttpOperation::SendAsync...>::~_Async_state_impl() at /usr/include/c++/13/future:1765
#15 ... std::_Destroy<...> ...                                                                                             
#17 ... std::_Sp_counted_ptr_inplace<_Async_state_impl<...>>::_M_dispose()
#22 std::__basic_future<long>::~__basic_future ()
#23 std::future<long>::~future ()
#24 Microsoft::Applications::Events::CurlHttpOperation::~CurlHttpOperation () at HttpClient_Curl.hpp

Fix

Move the future onto a detached helper thread before it can destruct on the async thread. The helper is by definition not the async thread (we'd only be on the async thread if its work has already completed, which is the only way ~CurlHttpOperation could run there), so the implicit join completes in microseconds.

On the common path — destruction from the caller thread — the helper thread spawn is also nearly free: the implicit join sees the work is done and returns immediately.

if (result.valid()) {
    std::thread([f = std::move(result)]() mutable {
        // f goes out of scope here on a fresh thread (!= the async thread),
        // so the implicit ~future join cannot self-deadlock.
    }).detach();
}

Verification

Verified on Linux GCC 13 with sister-main + modules-#277 linked (the exact configuration that crashed):

  • Before: ECSClientFuncTests.GetConfigs aborts at first test in suite.
  • After: All 25 ECSClientFuncTests pass; all 113 FuncTests pass ([ PASSED ] 113 tests).

Why this hasn't been seen on sister CI

The crashing path is ECSClientFuncTests, which is in the modules repo (lib/modules/exp/tests/functests/). Sister CI doesn't build lib/modules, so the crash is invisible there. It's only exposed when sister is linked with modules (i.e. the modules-repo CI configuration). That's why master modules CI has been red on Linux/macOS for ~50 consecutive runs — the redefine bug (#1473, merged) was masking this one. Once #1473 cleared, this surfaced as the next gate.

Risk: low. One short-lived thread per ~CurlHttpOperation invocation when result is still valid. Pure additive — no API change, no behavior change for already-joined futures.

bmehta001 and others added 2 commits June 10, 2026 13:48
Under -Werror on Linux/macOS, the modules-repo CI (build-posix-latest-exp)
has been failing for ~2 weeks with:

  config-default.h:36: error: 'HAVE_MAT_LIVEEVENTINSPECTOR' macro redefined
                              [-Werror,-Wmacro-redefined]
  config-default.h:37: error: 'HAVE_MAT_PRIVACYGUARD' macro redefined

tests/functests/CMakeLists.txt and tests/unittests/CMakeLists.txt add
-DHAVE_MAT_LIVEEVENTINSPECTOR / -DHAVE_MAT_PRIVACYGUARD on the command
line when BUILD_LIVEEVENTINSPECTOR / BUILD_PRIVACYGUARD (default YES) and
the respective module dir exists. The three config-default headers then
redefined them unconditionally, which is fatal under -Werror (added by
microsoft#1415).

Wrapping the two defines in #ifndef in all three config-default*.h
headers preserves all existing behavior:
- Without command-line -D: macros get defined here as before.
- With command-line -D: header skips the redefinition, no warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…destruction

The modules-repo CI test ECSClientFuncTests.GetConfigs (and every
test in the ECSClientFuncTests suite) crashed on Linux/macOS with:

  terminate called after throwing an instance of 'std::system_error'
    what():  Resource deadlock avoided
  Aborted (core dumped)

Root cause
==========
SendAsync() runs Send() + the user callback on a std::async worker
thread. The callback owns a strong ref to CurlHttpOperation, so when
it releases the last ref the ~CurlHttpOperation destructor runs on
the async thread itself.

libstdc++'s std::future<>::~future implicitly calls
_Async_state_impl::~_Async_state_impl, which calls _M_complete_async
-> _M_join via std::call_once. On the async thread that's a self-join;
call_once throws std::system_error(EDEADLK). Because the throw escapes
a noexcept destructor, terminate() aborts the process. A try/catch
around the future cannot rescue this — destructors of std::future are
noexcept.

Fix
===
Move the future onto a detached helper thread before its destructor
runs. The helper is by definition NOT the async thread (we'd only be
on the async thread if its work already finished), so the implicit
join completes immediately. On the common path (destruction from the
caller thread) it costs one short-lived thread spawn that exits in
microseconds.

Verified locally with sister + modules linked: all 113 FuncTests pass,
including all 25 ECSClientFuncTests (which include the formerly-fatal
GetConfigs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 requested a review from a team as a code owner June 11, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant