Possible HTTP/1.1 per-host permit leak on request-timeout / connect-success race

Related to #2176.

I think there is a residual race in the same request-timeout / connect-success path fixed in #2176, but with a different and more severe symptom: a **permanently leaked per-host connection permit** for HTTP/1.1. We have now reproduced the resulting per-host pool exhaustion deterministically under load across a fleet, so I'm fairly confident this is a real accounting defect rather than a one-off.

## Version / scope

Observed with:

- `org.asynchttpclient:async-http-client:3.0.10`
- HTTP/1.1 over the Netty transport
- finite `maxConnectionsPerHost` (we run `5000`, and a `7000` regional override — both reproduce)
- request timeouts enabled

I also checked current `main` and the relevant ordering appears unchanged.

**HTTP/2 is not affected**: it releases the per-host permit immediately after ALPN and never enters the window described below. The leak is specific to the HTTP/1.1 future→channel permit hand-off.

## Permit ownership model

A per-host permit (`PerHostConnectionSemaphore`, `:32,45-64`) is taken per newly opened channel and, at any instant, is owned by exactly one of two things:

| Owner | From → until | Wiring |
|---|---|---|
| the **request future** | acquire → connect completes | `acquirePartitionLockLazily()` stores the token in `partitionKeyLock` (`NettyResponseFuture.java:506-512`, field `:109`, updater `:87`). |
| the **channel's `closeFuture`** | connect success → channel close | `onSuccess` *moves* the token off the future with `takePartitionKeyLock()` (`:89`) and rebinds it to the socket via `attachSemaphoreToChannelClose` → `channel.closeFuture().addListener(... releaseChannelLock(token))` (`NettyConnectListener.java:232,242-246`). |

The transfer is exactly-once: `takePartitionKeyLock()` is a `getAndSet(null)` (`NettyResponseFuture.java:162-169`), so whoever calls it first wins the token and everyone else gets `null`. The bug is in the **hand-off between these two owners**.
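
To make the exactly-once claim concrete, here is a minimal sketch of the hand-off primitive. `TokenSlot` is a hypothetical stand-in for the future's `partitionKeyLock` field, not AHC code; it only assumes the `getAndSet(null)` behavior described above.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the future's partitionKeyLock field; not AHC code.
final class TokenSlot {
    private final AtomicReference<Object> token = new AtomicReference<>();

    void store(Object t) {
        token.set(t);
    }

    // Mirrors the described getAndSet(null): the first caller gets the token,
    // every later caller gets null.
    Object take() {
        return token.getAndSet(null);
    }

    public static void main(String[] args) {
        TokenSlot slot = new TokenSlot();
        Object permit = new Object();
        slot.store(permit);
        Object first = slot.take();   // e.g. onSuccess
        Object second = slot.take();  // e.g. a racing abort
        System.out.println("first taker got the token: " + (first == permit));
        System.out.println("second taker got the token: " + (second != null));
    }
}
```

The primitive itself is sound: exactly one caller ever holds the token. The problem described below is what the winner does with it afterwards.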

## Summary

Under a timeout storm against a peer that accepts TCP/TLS connections but does not respond or close the socket, AHC leaks a per-host semaphore permit permanently.

The suspected race, inside `onSuccess` **after connect completes but before the channel is attached to the future** (`future.channel()` is still `null` in this window):

1. A new HTTP/1.1 connection is opened, so a per-host permit is acquired and stored on `NettyResponseFuture.partitionKeyLock` (`:506-512`).
2. `NettyConnectListener.onSuccess(...)` starts running on the event loop.
3. `onSuccess(...)` calls `future.takePartitionKeyLock()` (`:89` → `NettyResponseFuture.java:162-169`), moving the permit token off the future.
4. Before `future.attachChannel(channel, false)` runs (`NettyConnectListener.java:82-83`), the request timeout fires on the `HashedWheelTimer`.
5. `TimeoutTimerTask.expire(...)` calls
   `requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, TimeoutException)`.
6. `future.channel()` is still `null`, because `attachChannel(...)` has not run yet.
7. `NettyRequestSender.abort(...)` skips the close because of the `if (channel != null && channel.isActive())` guard (`NettyRequestSender.java:662-667`). It then tries to reclaim the token via `releasePartitionKeyLock()`, marks the future done, and fires `onThrowable` (recorded as a `timeout`).
8. But `onSuccess(...)` already won the token at step 3, so the abort's `getAndSet(null)` returns `null` and **releases nothing**.
9. `onSuccess(...)` then checks `futureIsAlreadyCompleted()` = `future.isDone()` (`:93,58-67`). Crucially, `terminateAndExit()` **releases the lock at `:255` and only sets `isDone` afterward at `:259`** (`NettyResponseFuture.java:149-158,254-260`). So `onSuccess` typically reads `isDone == false`, proceeds, and binds the token to `channel.closeFuture()` (`attachSemaphoreToChannelClose`, `:232,242-246`), writing the request on a socket whose future is already dead.

The `isDone`-after-release ordering is why this is **the expected outcome under the interleaving, not a coincidence** — `onSuccess` wins the token *and* fails to observe the abort.

Final state: the request future is done/aborted, but a live HTTP/1.1 channel holds a per-host permit, and the only path that could return it (the `closeFuture` listener) is compromised — see "Terminal state" below.
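
The nine steps above can be replayed on a single thread as a toy model. Everything here is hypothetical (`HandOffRaceModel` and its methods are not AHC code); it only assumes the ordering described above: take via `getAndSet(null)`, release before `isDone` is set, and an `isDone` check before binding to the channel.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the interleaving; all names are hypothetical, none of this is AHC code.
final class HandOffRaceModel {
    final Semaphore perHost = new Semaphore(1);
    final AtomicReference<Object> tokenOnFuture = new AtomicReference<>();
    final AtomicBoolean done = new AtomicBoolean(false);
    Object tokenOnChannel; // non-null once ownership moved to the channel-close path

    void acquire() { // step 1
        if (!perHost.tryAcquire()) throw new IllegalStateException("too many connections");
        tokenOnFuture.set(new Object());
    }

    Object onSuccessTake() { // step 3
        return tokenOnFuture.getAndSet(null);
    }

    boolean abortRelease() { // steps 7-8: the release attempt comes first
        Object t = tokenOnFuture.getAndSet(null);
        if (t != null) perHost.release();
        return t != null;
    }

    void abortMarkDone() { // isDone becomes visible only after the release attempt
        done.set(true);
    }

    boolean onSuccessBindIfNotDone(Object token) { // step 9
        if (done.get()) return false;
        tokenOnChannel = token; // from here on, only a channel close returns the permit
        return true;
    }

    static HandOffRaceModel replay() {
        HandOffRaceModel m = new HandOffRaceModel();
        m.acquire();
        Object token = m.onSuccessTake();
        m.abortRelease();                // finds null, releases nothing
        m.onSuccessBindIfNotDone(token); // isDone is not yet visible
        m.abortMarkDone();
        return m;
    }

    public static void main(String[] args) {
        HandOffRaceModel m = replay();
        System.out.println("future done: " + m.done.get());
        System.out.println("permits available: " + m.perHost.availablePermits());
        System.out.println("token bound to channel: " + (m.tokenOnChannel != null));
    }
}
```

In this replay the future ends up done, the semaphore has no permits available, and the token is bound to a channel that no request owns.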

## Why nothing else reclaims the orphaned permit

Once the token is bound to the channel-close path, every safety net is gone:

1. **`closeChannel()` does not release the semaphore.** `ChannelManager.closeChannel()` (`:438-443`) only does `setDiscard` / `removeAll` / `silentlyCloseChannel`. The permit comes back **solely** through the `closeFuture` listener.
2. **The idle/TTL reaper only scans the idle pool.** `DefaultChannelPool.IdleChannelDetector.run()` (`:343`) iterates only `partitions` (`:56`), the map populated by `offer()` (`:110-140`). Both `isIdleTimeoutExpired` and `isTtlExpired` (`:287-305`) therefore apply **only to channels returned to the pool**. An orphaned active channel was never `offer()`ed, so neither `pooledConnectionIdleTimeout` nor `connectionTtl` can ever evict it.

So an orphaned active channel is invisible to every automatic mechanism. Only resetting the `Semaphore` (new AHC client / JVM restart) returns the slot.
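
A minimal sketch of why an idle-only reaper cannot see the orphan. `IdleOnlyReaperModel` is a hypothetical model, not `DefaultChannelPool`; it assumes only what is stated above, namely that the reaper iterates the collection that `offer()` populates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of an idle-only reaper; hypothetical names, not DefaultChannelPool itself.
final class IdleOnlyReaperModel {
    static final class Conn {
        final String id;
        final long lastUsedMillis;

        Conn(String id, long lastUsedMillis) {
            this.id = id;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    private final Deque<Conn> idlePool = new ArrayDeque<>(); // only offer() adds entries

    void offer(Conn c) {
        idlePool.add(c);
    }

    // Evicts pooled connections idle for longer than maxIdleMillis and returns their ids.
    List<String> reap(long nowMillis, long maxIdleMillis) {
        List<String> evicted = new ArrayList<>();
        idlePool.removeIf(c -> {
            boolean expired = nowMillis - c.lastUsedMillis > maxIdleMillis;
            if (expired) evicted.add(c.id);
            return expired;
        });
        return evicted;
    }

    static List<String> demo() {
        IdleOnlyReaperModel pool = new IdleOnlyReaperModel();
        Conn pooled = new Conn("pooled", 0L);
        Conn orphan = new Conn("orphan", 0L); // active, never offered to the pool
        pool.offer(pooled);
        return pool.reap(60_000L, 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Both connections are equally stale, but only the one that was offered to the pool is evicted.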

## Terminal state observed in production (sharpens the mechanism)

My first model was "the socket stays established forever, so `closeFuture` never fires." The live evidence **refines that**, and it's important for whoever fixes this:

At plateau, the JVM holds the full per-host permit count (`Too many connections: 7000`) while having **zero sockets to the peer in *any* TCP state**: not 7000 orphaned `ESTABLISHED`, not `CLOSE_WAIT`, none. So the sockets demonstrably **closed**, which means `closeFuture` must have completed for them, yet the permits were **not** returned.

Since a permit returns only via the `closeFuture` listener (and nothing else, per the section above), the leak is **not** "a socket that never closes." It is that **the release bound to channel-close does not execute for these channels**. There are two candidate mechanisms, both inside the race window (I have not yet re-confirmed which one applies against the 3.0.10 source):

1. `attachSemaphoreToChannelClose` (`:232,242-246`) was **never reached / never attached** for the orphaned channel under the interleaving, so when the socket later closed there was no listener to release the permit: the token is held by *neither* the future *nor* a `closeFuture`. This produces a permanent leak that does **not** require the socket to stay open, matching the zero-sockets steady state exactly.
2. The listener was attached, but `releaseChannelLock(token)` no-oped because the token had already been `getAndSet(null)` by the racing abort path.

Net effect: the trigger is the connect/timeout TOCTOU (steps above), but the *terminal* state is "permit pinned with no socket at all, because the release path tied to channel-close was bypassed." This is also why `readTimeout` does not help even though it *does* close stalled sockets — the close happens, but the permit-release bound to that close never runs. I'd suggest reviewing `attachSemaphoreToChannelClose` / `releaseChannelLock` ordering **relative to `abort()`**, not just timeout handling.
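
Candidate 1 can be illustrated with a small model in which a `CompletableFuture` stands in for Netty's `closeFuture`. `CloseListenerModel` is hypothetical and is not AHC or Netty code; it only shows that a close without an attached release listener leaves the permit held even though the socket is gone.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// Toy model: a CompletableFuture stands in for channel.closeFuture(); not AHC or Netty code.
final class CloseListenerModel {
    // Returns the permits available after one channel has opened and then closed.
    static int permitsAfterClose(boolean listenerAttached) {
        Semaphore perHost = new Semaphore(1);
        if (!perHost.tryAcquire()) throw new IllegalStateException("too many connections");
        CompletableFuture<Void> closeFuture = new CompletableFuture<>();
        if (listenerAttached) {
            // The only release path: runs when the close future completes.
            closeFuture.thenRun(perHost::release);
        }
        closeFuture.complete(null); // the socket closes
        return perHost.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("listener attached: " + permitsAfterClose(true));
        System.out.println("listener missing:  " + permitsAfterClose(false));
    }
}
```

With the listener attached the permit comes back on close; without it the semaphore stays at zero while no socket exists, which matches the zero-sockets steady state.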

## Why #2176 may not fully cover this

#2176 correctly handles the case where `timeoutsHolder == null` or `future.isDone()` is observed by `NettyConnectListener.onSuccess(...)`.

This case is narrower:

- timeout abort sees `future.channel() == null`, so it cannot close the concrete channel;
- `onSuccess(...)` has already taken the partition lock from the future;
- because of the release-then-set-`isDone` ordering (`:255` before `:259`), `onSuccess(...)` usually does **not** observe `future.isDone()` before attaching the permit to `channel.closeFuture()`;
- after that, the only release path is channel close — which (per "Terminal state") is itself bypassed.

So #2176 fixes an NPE / early-exit problem in this race family, but it leaves a permit-accounting window.

## Source map (pinned to 3.0.10 sources jar)

| File | Line(s) | Role |
|---|---|---|
| `PerHostConnectionSemaphore.java` | `32,45-64` | per-host `Semaphore`; `acquireChannelLock` / `releaseChannelLock`. |
| `NettyResponseFuture.java` | `87,109` | `partitionKeyLock` token field + atomic updater. |
| `NettyResponseFuture.java` | `506-512` | `acquirePartitionLockLazily()` — acquire + store token on future. |
| `NettyResponseFuture.java` | `162-169` | `takePartitionKeyLock()` — `getAndSet(null)`, exactly-once transfer. |
| `NettyResponseFuture.java` | `149-158,254-260` | `releasePartitionKeyLock()` / `terminateAndExit()` — releases at `:255`, sets `isDone` at `:259`. |
| `NettyConnectListener.java` | `89` | `onSuccess` takes the token (race point A). |
| `NettyConnectListener.java` | `93,58-67` | `futureIsAlreadyCompleted()` = `isDone()` check (race point B). |
| `NettyConnectListener.java` | `232,242-246` | `attachSemaphoreToChannelClose` — binds release to `channel.closeFuture()`. |
| `NettyConnectListener.java` | `82-83` | `writeRequest` → `future.attachChannel()` — closes the window. |
| `NettyConnectListener.java` | `271-293` | `onFailure` — no direct release (relies on `future.abort`). |
| `NettyRequestSender.java` | `662-667` | `abort(...)` only closes the channel `if (channel != null && channel.isActive())`. |
| `TimeoutTimerTask.java` | `expire()` | request-timeout abort passes `future.channel()` — `null` inside the window. |
| `ChannelManager.java` | `438-443` | `closeChannel()` does **not** release the semaphore. |
| `DefaultChannelPool.java` | `56,110-140,287-305,343-372` | idle/TTL reaper scans only `partitions` (idle pool); orphaned active channels are invisible. |

## Production evidence

After a burst of request timeouts to one host, we saw a sustained, flat `TooManyConnectionsPerHostException` plateau for that host — no gradual decay, cleared only on JVM/AHC-client recycle. Concrete signals:

- **Rate:** ~166 timeouts/s to the peer preceding the pin; once pinned, the instance rejects new acquisitions to that peer at a steady ~50–54/s.
- **Permit vs. inventory mismatch (the key signal):** the exception reports `Too many connections: 7000`, while at the same moment `getClientStats()` shows ~268 total / 14 active / 254 idle, and `ss` shows ~445 `ESTAB` for the whole JVM. The `Semaphore` count and the live channel/socket inventory have **diverged** by more than an order of magnitude (7000 permits vs. ~268 channels), i.e. the rejected count is internal permit state, not live TCP connections. `getClientStats()` is therefore useful corroboration but **not** a sufficient source of truth for the permit count.
- **Zero sockets to the peer while pinned:** resolving the peer's current IPs and diffing against the JVM's sockets found **zero** matching sockets while the app kept throwing `TooManyConnectionsPerHostException` for it (see "Terminal state").
- **Thread dump** confirms live traffic blocked exactly at the permit boundary:
  ```text
  java.util.concurrent.Semaphore.tryAcquire
  org.asynchttpclient.netty.channel.CombinedConnectionSemaphore.acquireChannelLock
  org.asynchttpclient.netty.request.NettyResponseFuture.acquirePartitionLockLazily
  org.asynchttpclient.netty.request.NettyRequestSender.sendRequestWithNewChannel
  org.asynchttpclient.DefaultAsyncHttpClient.execute
  ```
- **Reproduced at scale, independently:** during one degraded-peer event, **240 of 560** instances were simultaneously pinned on the *same* peer, each at ~50–54/s, on independent JVMs. The per-instance distribution is **sharply bimodal: either 0 or ~54/s, almost nothing between**. That is what a one-permit-at-a-time leak against a black-hole peer predicts: an instance serves normally until its `Semaphore` hits the cap, then *every* subsequent acquisition to that peer fails. The zero-sockets-while-pinned signature reproduced on multiple independent instances. Hundreds of JVMs draining to the identical cap at the identical rate under one shared degraded peer is strong evidence that this is deterministic under the trigger, not a per-instance anomaly.
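
The permit vs. inventory mismatch can be expressed as simple arithmetic. The helper below is a hypothetical diagnostic, not an AHC API; it assumes the semaphore has zero permits available while the exception is being thrown, and it uses the JVM-wide channel count as an upper bound on channels to the pinned peer.

```java
// Hypothetical diagnostic helper; the inputs come from your own metrics, not from an AHC API.
final class PermitDivergence {
    // Lower bound on permits that are held but not backed by any live channel.
    static int unbackedPermitsAtLeast(int cap, int availablePermits, int liveChannelsUpperBound) {
        int held = cap - availablePermits;
        return Math.max(0, held - liveChannelsUpperBound);
    }

    public static void main(String[] args) {
        // Figures from the evidence above: cap 7000, ~268 channels in total for the whole client.
        System.out.println(unbackedPermitsAtLeast(7000, 0, 268)); // prints 6732
    }
}
```

Because ~268 counts channels to all hosts, 6732 is only a lower bound; the zero-sockets observation suggests that all 7000 permits were unbacked.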

## Mitigations that do NOT help (so they don't get suggested first)

| Lever | Helps? | Why |
|---|---|---|
| `connectionTtl` | No | idle-pool only; orphaned active channel was never `offer()`ed. |
| `pooledConnectionIdleTimeout` | No | same — idle-pool only. |
| `readTimeout` | No | closes the socket, but the permit-release bound to that close never runs (see "Terminal state"). |
| `maxConnectionsPerHost` | bounds blast radius only | caps how much of the global pool one host can pin; does not stop it pinning. |

The only ways to recover are resetting the `Semaphore` (new AHC client / JVM restart) or not feeding the degraded peer at all (upstream circuit breaker).

## Expected behavior

A request timeout racing with connect success should not be able to leave an HTTP/1.1 per-host permit owned only by a live (or since-closed) channel detached from an active request. At minimum, one of:

- the concrete channel is attached/published before a timeout abort can observe `future.channel() == null`;
- the timeout/connect race handling closes the just-connected channel when the future was already aborted;
- the permit is released if the future completed before ownership was fully transferred to the channel;
- `attachSemaphoreToChannelClose` / `releaseChannelLock` are made robust to the token having already been taken by a racing abort (so a later channel-close still reconciles the permit).
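
One possible shape for the first two options is "publish, then re-check": the connect side binds the release and publishes the channel before re-checking `done`, and the abort side sets `done` before looking for a published channel. The sketch below is a toy model under those assumptions. `SafeHandOffModel` and its methods are hypothetical, a `CompletableFuture` stands in for `closeFuture`, and this is not a patch against AHC.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of a publish-then-recheck hand-off; hypothetical names, not a patch against AHC.
final class SafeHandOffModel {
    final Semaphore perHost = new Semaphore(1);
    final AtomicReference<Object> tokenOnFuture = new AtomicReference<>();
    final AtomicReference<CompletableFuture<Void>> channel = new AtomicReference<>();
    final AtomicBoolean done = new AtomicBoolean(false);

    void acquire() {
        if (!perHost.tryAcquire()) throw new IllegalStateException("too many connections");
        tokenOnFuture.set(new Object());
    }

    // Connect side, phase 1: take the token off the future.
    Object take() {
        return tokenOnFuture.getAndSet(null);
    }

    // Connect side, phase 2: bind the release to channel close, then publish the channel.
    void publish(Object token, CompletableFuture<Void> closeFuture) {
        if (token != null) closeFuture.thenRun(perHost::release);
        channel.set(closeFuture);
    }

    // Connect side, phase 3: re-check after publishing and close if an abort already happened.
    void recheck() {
        CompletableFuture<Void> ch = channel.get();
        if (done.get() && ch != null) ch.complete(null);
    }

    // Timeout side: mark done first, release a token still on the future, close a published channel.
    void abort() {
        done.set(true);
        if (tokenOnFuture.getAndSet(null) != null) perHost.release();
        CompletableFuture<Void> ch = channel.get();
        if (ch != null) ch.complete(null);
    }

    // Runs abort() before connect-side step abortAt (0 = take, 1 = publish, 2 = recheck, 3 = after all).
    static int permitsAfter(int abortAt) {
        SafeHandOffModel m = new SafeHandOffModel();
        CompletableFuture<Void> closeFuture = new CompletableFuture<>();
        m.acquire();
        Object token = null;
        for (int step = 0; step <= 3; step++) {
            if (step == abortAt) m.abort();
            if (step == 0) token = m.take();
            if (step == 1) m.publish(token, closeFuture);
            if (step == 2) m.recheck();
        }
        return m.perHost.availablePermits();
    }

    public static void main(String[] args) {
        for (int i = 0; i <= 3; i++) {
            System.out.println("abort before step " + i + " -> permits available: " + permitsAfter(i));
        }
    }
}
```

In this model the permit is returned wherever the abort lands relative to the three connect-side steps, including the problematic position between `take()` and `publish()`. Whether the same ordering is practical inside `NettyConnectListener` is a question for the maintainers.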

## Question

Is this interleaving possible in 3.0.10 / current `main`? If so, should AHC close the channel and/or release the per-host permit when a request timeout wins before `future.attachChannel(...)` publishes the channel? And is the permit-release path tied to channel-close expected to be safe against the racing abort having already taken the token?

