Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335
Open
exzile wants to merge 2 commits into
Open
Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335exzile wants to merge 2 commits into
exzile wants to merge 2 commits into
Conversation
In the async inference path (modelInferAsync, used by KServe gRPC and the C-API async endpoint), the ExecutingStreamIdGuard that holds an infer-request pool slot is std::move-d into the OV completion-callback lambda before inferRequest.start_async() is called. The slot is returned to OVInferRequestsQueue only when that lambda is destroyed, which normally happens when the async completion callback fires. If start_async() throws, the function returns immediately and the completion callback never fires, so the lambda (sole owner of the guard) is never destroyed and the infer-request slot is never returned to the pool. Under long-running concurrent load, repeated start_async failures slowly drain the pool until no idle infer request is available; requests then block and gRPC surfaces ResourceExhausted even though CPU/GPU/RAM are not saturated (matches the symptom class in openvinotoolkit#2871). Clear the callback in both start_async() catch blocks so the captured guard is destroyed and the slot is returned, mirroring the existing "set empty callback to release captures" pattern used after normal completion. The synchronous infer() path is unaffected (it uses a stack-allocated guard with RAII release on all paths). Builds clean (//src:ovms //src:ovms_test); CAPIInference.AsyncErrorHandling passes. The change is confined to the start_async error path. Signed-off-by: exzile <joeypongallo@gmail.com>
The previous commit cleared the infer-request callback on the start_async() error path to release the leaked pool slot. That introduced a latent crash: clearing the callback destroys the completion lambda while OpenVINO holds the InferRequest's internal lock, and that lambda owns an OutputKeeper whose destructor calls request.set_tensor() back into the same InferRequest. For models that support output-tensor reset (the common case) this reenters the locked request and throws "resource deadlock would occur" out of a noexcept destructor, terminating the server - on the very error path the fix targets. Fix it without reintroducing the leak: - OutputKeeper gains cancel(), which disables the destructor's tensor restore. - modelInferAsync captures the OutputKeeper into the callback by copy and keeps the original handle in function scope, so clearing the callback no longer destroys the OutputKeeper under the lock. - The error path now cancels the OutputKeeper before clearing the callback (inference never ran, so there is nothing to restore) and the retained handle's later destruction is a no-op. Both catch blocks share a helper, releaseInferRequestSlotAfterStartAsyncFailure(). Add a regression test (CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlot) that forces start_async() to throw via a small virtual seam (ModelInstance::startAsyncInference) on a single-stream pool and asserts the slot is returned afterwards. The test reproduces both the original leak and the deadlock crash, and passes only with this fix. Existing async tests (AsyncWithCallbackDummy, AsyncErrorHandling) and the full CAPIInference suite (17/17) still pass. Signed-off-by: exzile <joeypongallo@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🛠 Summary
Fixes an infer-request pool-slot leak on the async inference error path that can, over long-running concurrent load, drain the infer-request pool and surface as gRPC
ResourceExhausteddespite low CPU/GPU/RAM. Relates to #2871 (a plausible contributor — see "Scope" below).This PR has two commits:
start_async()throws.Root cause (leak)
In
modelInferAsync(src/inference_executor.hpp, used by the KServe gRPC path and the C-API async endpoint):ExecutingStreamIdGuard— which holds an infer-request pool slot and returns it toOVInferRequestsQueueon destruction — isstd::move-d into the OV completion-callback lambda passed toinferRequest.set_callback(...). After the move, that lambda is the sole owner of the guard, so the slot is returned only when the lambda is destroyed (normally when the async completion callback fires).inferRequest.start_async()is then called inside atry. If it throws, the function returns immediately. The async op never started, the completion callback never fires, the lambda is never destroyed, and the slot is never returned to the pool.Under sustained load, each
start_async()failure permanently loses one slot until the pool drains and gRPC reportsResourceExhausteddespite low hardware utilization.The subtlety (deadlock) — why the naive fix is unsafe
The obvious fix is to clear the callback in the
catchso the captured guard is destroyed and the slot is returned:inferRequest.set_callback([](std::exception_ptr) {}); // unsafe on its ownThis introduces a crash. Clearing the callback destroys the completion lambda synchronously while OpenVINO holds the InferRequest's internal lock. That lambda also owns an
OutputKeeper, whose destructor callsrequest.set_tensor(...)back into the same InferRequest. For models that support output-tensor reset (the common case), this reenters the locked request and throwsresource deadlock would occurout of anoexceptdestructor →std::terminate— crashing the server on the very error path the fix targets.(The normal completion path doesn't hit this: there,
set_callbackis called from inside the executing callback, so OpenVINO defers the lambda's destruction until after it returns — i.e. outside the lock.)Fix
Release the slot without destroying the
OutputKeeperunder the lock:OutputKeeper::cancel()disables the destructor's tensor-restore.modelInferAsynccaptures theOutputKeeperinto the callback by copy and retains the original handle in function scope, so clearing the callback no longer triggers~OutputKeeperunder the lock.catchblocks sharereleaseInferRequestSlotAfterStartAsyncFailure().The synchronous
infer()path is unaffected (stack-allocated guard, RAII release).Regression test
CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlotforcesstart_async()to throw — via a thin virtual seamModelInstance::startAsyncInference()(production forwards unconditionally tostart_async) — on a single-stream pool (setNireq(1)), then asserts the slot is returned viaOVInferRequestsQueue::tryToGetIdleStream().This test reproduces both failure modes: without the release it observes the leaked slot (nullopt), and with the naive release it crashes via the deadlock described above. It passes only with the corrected fix.
Validation
//src:ovms //src:ovms_test,Build completed successfully).CAPIInferencegtest suite passes (17/17), includingAsyncWithCallbackDummy(async success),AsyncErrorHandling(async error via callback), and the new test — confirming the capture-by-copy change does not regress the normal path.Scope / honesty note
This is a genuine, gRPC-reachable resource leak, but it only triggers when
start_async()actually throws (transient OV/device errors), so I can't claim it is definitively the cause of #2871 (reporter's logs are limited, and the synchronous predict path is already leak-safe). It is a correct, low-risk hardening regardless — and the regression test demonstrates the error path is now both leak-free and crash-free.