Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871) by exzile · Pull Request #4335 · openvinotoolkit/model_server

exzile · 2026-06-27T23:40:48Z

🛠 Summary

Fixes an infer-request pool-slot leak on the async inference error path that can, over long-running concurrent load, drain the infer-request pool and surface as gRPC ResourceExhausted despite low CPU/GPU/RAM. Relates to #2871 (a plausible contributor — see "Scope" below).

This PR has two commits:

Release the slot when start_async() throws.
Fix a deadlock/crash that commit 1 introduced, and add a regression test that reproduces both the leak and the crash.

Root cause (leak)

In modelInferAsync (src/inference_executor.hpp, used by the KServe gRPC path and the C-API async endpoint):

An ExecutingStreamIdGuard — which holds an infer-request pool slot and returns it to OVInferRequestsQueue on destruction — is std::move-d into the OV completion-callback lambda passed to inferRequest.set_callback(...). After the move, that lambda is the sole owner of the guard, so the slot is returned only when the lambda is destroyed (normally when the async completion callback fires).
inferRequest.start_async() is then called inside a try. If it throws, the function returns immediately. The async op never started, the completion callback never fires, the lambda is never destroyed, and the slot is never returned to the pool.

Under sustained load, each start_async() failure permanently loses one slot until the pool drains and gRPC reports ResourceExhausted despite low hardware utilization.

The subtlety (deadlock) — why the naive fix is unsafe

The obvious fix is to clear the callback in the catch so the captured guard is destroyed and the slot is returned:

inferRequest.set_callback([](std::exception_ptr) {});  // unsafe on its own

This introduces a crash. Clearing the callback destroys the completion lambda synchronously while OpenVINO holds the InferRequest's internal lock. That lambda also owns an OutputKeeper, whose destructor calls request.set_tensor(...) back into the same InferRequest. For models that support output-tensor reset (the common case), this reenters the locked request and throws resource deadlock would occur out of a noexcept destructor → std::terminate — crashing the server on the very error path the fix targets.

(The normal completion path doesn't hit this: there, set_callback is called from inside the executing callback, so OpenVINO defers the lambda's destruction until after it returns — i.e. outside the lock.)

Fix

Release the slot without destroying the OutputKeeper under the lock:

OutputKeeper::cancel() disables the destructor's tensor-restore.
modelInferAsync captures the OutputKeeper into the callback by copy and retains the original handle in function scope, so clearing the callback no longer triggers ~OutputKeeper under the lock.
The error path cancels the keeper (inference never ran, so there is nothing to restore), then clears the callback to return the slot. Both catch blocks share releaseInferRequestSlotAfterStartAsyncFailure().

The synchronous infer() path is unaffected (stack-allocated guard, RAII release).

Regression test

CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlot forces start_async() to throw — via a thin virtual seam ModelInstance::startAsyncInference() (production forwards unconditionally to start_async) — on a single-stream pool (setNireq(1)), then asserts the slot is returned via OVInferRequestsQueue::tryToGetIdleStream().

This test reproduces both failure modes: without the release it observes the leaked slot (nullopt), and with the naive release it crashes via the deadlock described above. It passes only with the corrected fix.

Validation

Builds clean from source on Windows (//src:ovms //src:ovms_test, Build completed successfully).
Full CAPIInference gtest suite passes (17/17), including AsyncWithCallbackDummy (async success), AsyncErrorHandling (async error via callback), and the new test — confirming the capture-by-copy change does not regress the normal path.

Scope / honesty note

This is a genuine, gRPC-reachable resource leak, but it only triggers when start_async() actually throws (transient OV/device errors), so I can't claim it is definitively the cause of #2871 (reporter's logs are limited, and the synchronous predict path is already leak-safe). It is a correct, low-risk hardening regardless — and the regression test demonstrates the error path is now both leak-free and crash-free.

In the async inference path (modelInferAsync, used by KServe gRPC and the C-API async endpoint), the ExecutingStreamIdGuard that holds an infer-request pool slot is std::move-d into the OV completion-callback lambda before inferRequest.start_async() is called. The slot is returned to OVInferRequestsQueue only when that lambda is destroyed, which normally happens when the async completion callback fires. If start_async() throws, the function returns immediately and the completion callback never fires, so the lambda (sole owner of the guard) is never destroyed and the infer-request slot is never returned to the pool. Under long-running concurrent load, repeated start_async failures slowly drain the pool until no idle infer request is available; requests then block and gRPC surfaces ResourceExhausted even though CPU/GPU/RAM are not saturated (matches the symptom class in openvinotoolkit#2871). Clear the callback in both start_async() catch blocks so the captured guard is destroyed and the slot is returned, mirroring the existing "set empty callback to release captures" pattern used after normal completion. The synchronous infer() path is unaffected (it uses a stack-allocated guard with RAII release on all paths). Builds clean (//src:ovms //src:ovms_test); CAPIInference.AsyncErrorHandling passes. The change is confined to the start_async error path. Signed-off-by: exzile <joeypongallo@gmail.com>

The previous commit cleared the infer-request callback on the start_async() error path to release the leaked pool slot. That introduced a latent crash: clearing the callback destroys the completion lambda while OpenVINO holds the InferRequest's internal lock, and that lambda owns an OutputKeeper whose destructor calls request.set_tensor() back into the same InferRequest. For models that support output-tensor reset (the common case) this reenters the locked request and throws "resource deadlock would occur" out of a noexcept destructor, terminating the server - on the very error path the fix targets. Fix it without reintroducing the leak: - OutputKeeper gains cancel(), which disables the destructor's tensor restore. - modelInferAsync captures the OutputKeeper into the callback by copy and keeps the original handle in function scope, so clearing the callback no longer destroys the OutputKeeper under the lock. - The error path now cancels the OutputKeeper before clearing the callback (inference never ran, so there is nothing to restore) and the retained handle's later destruction is a no-op. Both catch blocks share a helper, releaseInferRequestSlotAfterStartAsyncFailure(). Add a regression test (CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlot) that forces start_async() to throw via a small virtual seam (ModelInstance::startAsyncInference) on a single-stream pool and asserts the slot is returned afterwards. The test reproduces both the original leak and the deadlock crash, and passes only with this fix. Existing async tests (AsyncWithCallbackDummy, AsyncErrorHandling) and the full CAPIInference suite (17/17) still pass. Signed-off-by: exzile <joeypongallo@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

exzile mentioned this pull request Jun 27, 2026

Got an error from OVMS: rpc error: code = ResourceExhausted desc = #2871

Open

exzile marked this pull request as ready for review June 27, 2026 23:45

exzile changed the title ~~Release infer-request slot when start_async() throws (async path leak; relates to #2871)~~ Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871) Jun 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335

Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335
exzile wants to merge 2 commits into
openvinotoolkit:mainfrom
exzile:fix/async-infer-stream-leak-2871

exzile commented Jun 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

exzile commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛠 Summary

Root cause (leak)

The subtlety (deadlock) — why the naive fix is unsafe

Fix

Regression test

Validation

Scope / honesty note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

exzile commented Jun 27, 2026 •

edited

Loading