fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481
Open
bmehta001 wants to merge 3 commits into
Open
fix: prevent EDEADLK self-join in ~CurlHttpOperation on async-thread destruction#1481bmehta001 wants to merge 3 commits into
bmehta001 wants to merge 3 commits into
Conversation
Under -Werror on Linux/macOS, the modules-repo CI (build-posix-latest-exp)
has been failing for ~2 weeks with:
config-default.h:36: error: 'HAVE_MAT_LIVEEVENTINSPECTOR' macro redefined
[-Werror,-Wmacro-redefined]
config-default.h:37: error: 'HAVE_MAT_PRIVACYGUARD' macro redefined
tests/functests/CMakeLists.txt and tests/unittests/CMakeLists.txt add
-DHAVE_MAT_LIVEEVENTINSPECTOR / -DHAVE_MAT_PRIVACYGUARD on the command
line when BUILD_LIVEEVENTINSPECTOR / BUILD_PRIVACYGUARD (default YES) and
the respective module dir exists. The three config-default headers then
redefined them unconditionally, which is fatal under -Werror (added by
microsoft#1415).
Wrapping the two defines in #ifndef in all three config-default*.h
headers preserves all existing behavior:
- Without command-line -D: macros get defined here as before.
- With command-line -D: header skips the redefinition, no warning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…destruction
The modules-repo CI test ECSClientFuncTests.GetConfigs (and every
test in the ECSClientFuncTests suite) crashed on Linux/macOS with:
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided
Aborted (core dumped)
Root cause
==========
SendAsync() runs Send() + the user callback on a std::async worker
thread. The callback owns a strong ref to CurlHttpOperation, so when
it releases the last ref the ~CurlHttpOperation destructor runs on
the async thread itself.
libstdc++'s std::future<>::~future implicitly calls
_Async_state_impl::~_Async_state_impl, which calls _M_complete_async
-> _M_join via std::call_once. On the async thread that's a self-join;
call_once throws std::system_error(EDEADLK). Because the throw escapes
a noexcept destructor, terminate() aborts the process. A try/catch
around the future cannot rescue this — destructors of std::future are
noexcept.
Fix
===
Move the future onto a detached helper thread before its destructor
runs. The helper is by definition NOT the async thread (we'd only be
on the async thread if its work already finished), so the implicit
join completes immediately. On the common path (destruction from the
caller thread) it costs one short-lived thread spawn that exits in
microseconds.
Verified locally with sister + modules linked: all 113 FuncTests pass,
including all 25 ECSClientFuncTests (which include the formerly-fatal
GetConfigs).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
modules-repo(cpp_client_telemetry_modules) CI on Linux + macOS crashes during the firstECSClientFuncTeststest with:Affects every PR that gets past compile (e.g. modules #277).
Root cause
The callback fired from the async thread is the last owner of the
CurlHttpOperation. When it releases itsshared_ptr,~CurlHttpOperationruns on that same async thread.~CurlHttpOperationthen destroys itsstd::future<long> resultmember. libstdc++'s~future->~_Async_state_implcalls_M_complete_async->_M_joinviastd::call_once. Joining the async thread from the async thread is a self-join —call_oncethrowsstd::system_error("Resource deadlock avoided"). Because the throw escapes a noexcept destructor,terminate()aborts the process.A
try/catcharound the future or even around an explicitresult.wait()cannot rescue this:std::future's destructor itself isnoexcept, so the throw lands at the noexcept boundary before any usercatchcan engage.GDB confirms (modules CI reproduction, Ubuntu 24.04, libstdc++ 13):
Fix
Move the future onto a detached helper thread before it can destruct on the async thread. The helper is by definition not the async thread (we'd only be on the async thread if its work has already completed, which is the only way
~CurlHttpOperationcould run there), so the implicit join completes in microseconds.On the common path — destruction from the caller thread — the helper thread spawn is also nearly free: the implicit join sees the work is done and returns immediately.
Verification
Verified on Linux GCC 13 with sister-main + modules-#277 linked (the exact configuration that crashed):
ECSClientFuncTests.GetConfigsaborts at first test in suite.ECSClientFuncTestspass; all 113FuncTestspass ([ PASSED ] 113 tests).Why this hasn't been seen on sister CI
The crashing path is
ECSClientFuncTests, which is in the modules repo (lib/modules/exp/tests/functests/). Sister CI doesn't buildlib/modules, so the crash is invisible there. It's only exposed when sister is linked with modules (i.e. the modules-repo CI configuration). That's why master modules CI has been red on Linux/macOS for ~50 consecutive runs — the redefine bug (#1473, merged) was masking this one. Once #1473 cleared, this surfaced as the next gate.Risk: low. One short-lived thread per
~CurlHttpOperationinvocation whenresultis still valid. Pure additive — no API change, no behavior change for already-joined futures.