Skip to content

Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335

Open
exzile wants to merge 2 commits into
openvinotoolkit:mainfrom
exzile:fix/async-infer-stream-leak-2871
Open

Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871)#4335
exzile wants to merge 2 commits into
openvinotoolkit:mainfrom
exzile:fix/async-infer-stream-leak-2871

Conversation

@exzile

@exzile exzile commented Jun 27, 2026

Copy link
Copy Markdown

🛠 Summary

Fixes an infer-request pool-slot leak on the async inference error path that can, over long-running concurrent load, drain the infer-request pool and surface as gRPC ResourceExhausted despite low CPU/GPU/RAM. Relates to #2871 (a plausible contributor — see "Scope" below).

This PR has two commits:

  1. Release the slot when start_async() throws.
  2. Fix a deadlock/crash that commit 1 introduced, and add a regression test that reproduces both the leak and the crash.

Root cause (leak)

In modelInferAsync (src/inference_executor.hpp, used by the KServe gRPC path and the C-API async endpoint):

  1. An ExecutingStreamIdGuard — which holds an infer-request pool slot and returns it to OVInferRequestsQueue on destruction — is std::move-d into the OV completion-callback lambda passed to inferRequest.set_callback(...). After the move, that lambda is the sole owner of the guard, so the slot is returned only when the lambda is destroyed (normally when the async completion callback fires).
  2. inferRequest.start_async() is then called inside a try. If it throws, the function returns immediately. The async op never started, the completion callback never fires, the lambda is never destroyed, and the slot is never returned to the pool.

Under sustained load, each start_async() failure permanently loses one slot until the pool drains and gRPC reports ResourceExhausted despite low hardware utilization.

The subtlety (deadlock) — why the naive fix is unsafe

The obvious fix is to clear the callback in the catch so the captured guard is destroyed and the slot is returned:

inferRequest.set_callback([](std::exception_ptr) {});  // unsafe on its own

This introduces a crash. Clearing the callback destroys the completion lambda synchronously while OpenVINO holds the InferRequest's internal lock. That lambda also owns an OutputKeeper, whose destructor calls request.set_tensor(...) back into the same InferRequest. For models that support output-tensor reset (the common case), this reenters the locked request and throws resource deadlock would occur out of a noexcept destructor → std::terminate — crashing the server on the very error path the fix targets.

(The normal completion path doesn't hit this: there, set_callback is called from inside the executing callback, so OpenVINO defers the lambda's destruction until after it returns — i.e. outside the lock.)

Fix

Release the slot without destroying the OutputKeeper under the lock:

  • OutputKeeper::cancel() disables the destructor's tensor-restore.
  • modelInferAsync captures the OutputKeeper into the callback by copy and retains the original handle in function scope, so clearing the callback no longer triggers ~OutputKeeper under the lock.
  • The error path cancels the keeper (inference never ran, so there is nothing to restore), then clears the callback to return the slot. Both catch blocks share releaseInferRequestSlotAfterStartAsyncFailure().

The synchronous infer() path is unaffected (stack-allocated guard, RAII release).

Regression test

CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlot forces start_async() to throw — via a thin virtual seam ModelInstance::startAsyncInference() (production forwards unconditionally to start_async) — on a single-stream pool (setNireq(1)), then asserts the slot is returned via OVInferRequestsQueue::tryToGetIdleStream().

This test reproduces both failure modes: without the release it observes the leaked slot (nullopt), and with the naive release it crashes via the deadlock described above. It passes only with the corrected fix.

Validation

  • Builds clean from source on Windows (//src:ovms //src:ovms_test, Build completed successfully).
  • Full CAPIInference gtest suite passes (17/17), including AsyncWithCallbackDummy (async success), AsyncErrorHandling (async error via callback), and the new test — confirming the capture-by-copy change does not regress the normal path.

Scope / honesty note

This is a genuine, gRPC-reachable resource leak, but it only triggers when start_async() actually throws (transient OV/device errors), so I can't claim it is definitively the cause of #2871 (reporter's logs are limited, and the synchronous predict path is already leak-safe). It is a correct, low-risk hardening regardless — and the regression test demonstrates the error path is now both leak-free and crash-free.

In the async inference path (modelInferAsync, used by KServe gRPC and the
C-API async endpoint), the ExecutingStreamIdGuard that holds an
infer-request pool slot is std::move-d into the OV completion-callback
lambda before inferRequest.start_async() is called. The slot is returned
to OVInferRequestsQueue only when that lambda is destroyed, which normally
happens when the async completion callback fires.

If start_async() throws, the function returns immediately and the
completion callback never fires, so the lambda (sole owner of the guard)
is never destroyed and the infer-request slot is never returned to the
pool. Under long-running concurrent load, repeated start_async failures
slowly drain the pool until no idle infer request is available; requests
then block and gRPC surfaces ResourceExhausted even though CPU/GPU/RAM are
not saturated (matches the symptom class in openvinotoolkit#2871).

Clear the callback in both start_async() catch blocks so the captured
guard is destroyed and the slot is returned, mirroring the existing
"set empty callback to release captures" pattern used after normal
completion. The synchronous infer() path is unaffected (it uses a
stack-allocated guard with RAII release on all paths).

Builds clean (//src:ovms //src:ovms_test); CAPIInference.AsyncErrorHandling
passes. The change is confined to the start_async error path.

Signed-off-by: exzile <joeypongallo@gmail.com>
@exzile exzile marked this pull request as ready for review June 27, 2026 23:45
The previous commit cleared the infer-request callback on the start_async()
error path to release the leaked pool slot. That introduced a latent crash:
clearing the callback destroys the completion lambda while OpenVINO holds the
InferRequest's internal lock, and that lambda owns an OutputKeeper whose
destructor calls request.set_tensor() back into the same InferRequest. For
models that support output-tensor reset (the common case) this reenters the
locked request and throws "resource deadlock would occur" out of a noexcept
destructor, terminating the server - on the very error path the fix targets.

Fix it without reintroducing the leak:
- OutputKeeper gains cancel(), which disables the destructor's tensor restore.
- modelInferAsync captures the OutputKeeper into the callback by copy and keeps
  the original handle in function scope, so clearing the callback no longer
  destroys the OutputKeeper under the lock.
- The error path now cancels the OutputKeeper before clearing the callback
  (inference never ran, so there is nothing to restore) and the retained
  handle's later destruction is a no-op. Both catch blocks share a helper,
  releaseInferRequestSlotAfterStartAsyncFailure().

Add a regression test (CAPIInference.AsyncStartAsyncThrowReleasesInferRequestSlot)
that forces start_async() to throw via a small virtual seam
(ModelInstance::startAsyncInference) on a single-stream pool and asserts the
slot is returned afterwards. The test reproduces both the original leak and the
deadlock crash, and passes only with this fix. Existing async tests
(AsyncWithCallbackDummy, AsyncErrorHandling) and the full CAPIInference suite
(17/17) still pass.

Signed-off-by: exzile <joeypongallo@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@exzile exzile changed the title Release infer-request slot when start_async() throws (async path leak; relates to #2871) Fix infer-request slot leak and deadlock on async start_async() error path (relates to #2871) Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant