Skip to content

fix(routing): feed least_busy in-flight counts on /v1/messages and /v1/responses#716

Merged
jarvis9443 merged 2 commits into
mainfrom
fix/least-busy-in-flight-messages-responses
Jul 3, 2026
Merged

fix(routing): feed least_busy in-flight counts on /v1/messages and /v1/responses#716
jarvis9443 merged 2 commits into
mainfrom
fix/least-busy-in-flight-messages-responses

Conversation

@jarvis9443

@jarvis9443 jarvis9443 commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Problem

least_busy only worked on /v1/chat/completions. begin_in_flight had exactly two call sites, both in chat.rs (#684 wired chat only and left the passthrough endpoints as a follow-up that never landed). Traffic on /v1/messages (Claude Code / Anthropic SDK) or /v1/responses (Codex) never raised the in-flight count, so every target tied at 0 and the stable sort preserved declaration order — the strategy silently degraded to failover, with the first target absorbing all concurrency. Same silent-degradation class as AISIX-Cloud#954. The inverse also held: load arriving via the passthrough endpoints was invisible to chat-side least_busy decisions.

Fix

Mirror the two existing chat.rs patterns across the four passthrough dispatch functions (messages: anthropic_passthrough_dispatch + cross_provider_dispatch; responses: responses_to_target + responses_cross_provider_to_target):

  • non-streaming / error paths hold an InFlightGuard for the attempt scope (acquired right before the upstream send, mirroring chat.rs:1784);
  • streaming branches move the guard into the end-of-stream closure next to drop(stream_hold), so the count stays raised for the stream's full lifetime (mirroring chat.rs:1228).

/v1/messages/count_tokens stays deliberately excluded, matching the least_latency EWMA scope.

Tests

New least-busy-passthrough-endpoints-e2e.test.ts: occupy the declared-first slow target with an un-awaited request, assert the next request diverts to the idle target; attribution via per-upstream receivedRequests counters. Fails on the pre-fix binary for both endpoints (messages: diverted request served by the busy target; responses: slow upstream took 2 requests) and passes after the fix. Existing least-busy/least-latency/model-group/responses e2e suites pass unchanged (8/8), cargo test -p aisix-proxy 582 passed, fmt + clippy clean.

Fixes #715

Summary by CodeRabbit

  • Bug Fixes
    • Improved least_busy routing to keep upstreams marked as busy for the full duration of streaming passthrough, preventing premature “idle” reassignment.
    • Applied the fix consistently to both /v1/messages and /v1/responses streaming paths.
  • Tests
    • Added end-to-end coverage for least_busy passthrough endpoints using slow vs. fast upstreams, asserting diverted requests reach the idle target and that per-upstream request counts update as expected.

…1/responses

begin_in_flight was only called from chat.rs (#684 left the passthrough
endpoints as a follow-up), so a least_busy group serving Anthropic-SDK
or Codex traffic saw all-zero counts and the stable sort silently
degraded to declaration order — failover in disguise, same
silent-degradation class as AISIX-Cloud#954.

Mirror both chat.rs patterns across the four dispatch functions
(messages: anthropic_passthrough_dispatch + cross_provider_dispatch;
responses: responses_to_target + responses_cross_provider_to_target):
non-streaming / error paths hold the guard for the attempt scope, the
streaming branches move it into the end-of-stream closure next to
stream_hold so the count stays raised for the stream's full lifetime.
count_tokens stays deliberately excluded, matching least_latency.

The new e2e (occupy the declared-first slow target, assert the next
request diverts to the idle one, attribution via per-upstream request
counters) fails on the pre-fix binary for both endpoints and passes
after.

Fixes #715
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: de088ab5-f2ae-4ab3-8c4a-a4008edff9a1

📥 Commits

Reviewing files that changed from the base of the PR and between 0e8c5c7 and e9bb5c0.

📒 Files selected for processing (1)
  • tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts

📝 Walkthrough

Walkthrough

Extends least_busy in-flight accounting to streaming passthrough paths in messages.rs and responses.rs, and adds E2E coverage that verifies /v1/messages and /v1/responses route away from an occupied upstream target.

Changes

Least-busy in-flight tracking fix

Layer / File(s) Summary
Messages.rs in-flight guard wiring
crates/aisix-proxy/src/messages.rs
Adds begin_in_flight guard creation in anthropic_passthrough_dispatch and cross_provider_dispatch, with explicit drop(in_flight) in the end-of-stream closures to extend in-flight tracking across the full SSE lifetime.
Responses.rs in-flight guard wiring
crates/aisix-proxy/src/responses.rs
Mirrors the same begin_in_flight/drop(in_flight) pattern in the OpenAI verbatim and cross-provider Responses streaming dispatch paths.
E2E test suite for least_busy routing
tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts
Adds a new test suite with helper functions for provider/model setup and request posting, plus two tests confirming requests divert to an idle target while another remains in-flight on /v1/messages and /v1/responses.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related PRs

  • api7/aisix#613: Both PRs touch /v1/responses streaming end-of-stream handling in responses.rs.
  • api7/aisix#627: Both PRs modify the /v1/responses cross-provider path in responses.rs.
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
E2e Test Quality Review ⚠️ Warning Good E2E shape, but it only proves non-streaming /v1/messages and /v1/responses; the PR also changes stream-lifetime in_flight handling, which remains untested. Add at least one streaming least_busy passthrough case per endpoint, keeping the first stream open and asserting the second request diverts to the idle target.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the routing fix for least_busy in-flight counts on /v1/messages and /v1/responses.
Linked Issues check ✅ Passed The code and test changes match the linked issue: in-flight tracking is added on messages/responses streaming and non-streaming paths, with coverage for both endpoints.
Out of Scope Changes check ✅ Passed The only changes are the expected routing fixes and a focused e2e test, with no obvious unrelated additions.
Security Check ✅ Passed Only routing/in-flight accounting and E2E test changes; no new secret logging, auth bypass, storage, TLS, or ownership issues found in touched code.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/least-busy-in-flight-messages-responses

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts (1)

137-239: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add stream:true cases for the stream-lifetime guard paths.

These tests only exercise non-streaming requests, but the PR also changes streaming closures to hold in_flight until end-of-stream. Add streaming variants for /v1/messages and /v1/responses so a regression that drops the guard at handler return is caught. As per coding guidelines, “Tests must cover boundary cases … combination scenarios, and extreme cases.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts` around
lines 137 - 239, Add streaming coverage for the least_busy passthrough guard
paths in the existing `/v1/messages` and `/v1/responses` e2e tests. The current
cases only validate non-streaming requests, so extend the same test block around
`postMessages`, `postResponses`, and the `least_busy` virtual models to send
`stream:true` requests and assert routing still diverts away from the in-flight
upstream until the stream ends. Keep the existing slow/fast upstream setup and
request counter checks so a regression that releases `in_flight` at handler
return is caught for both stream-lifetime paths.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts`:
- Around line 174-180: The least-busy passthrough test is using a fixed sleep
after starting the occupy request, which can race before the upstream has
actually incremented in_flight. Update the affected test cases in
least-busy-passthrough-endpoints-e2e.test.ts to wait by polling the upstream
request count/state until the occupy request is observed, then send the
diversion request. Use the existing postMessages flow and the relevant
upstream/request-count helpers in the test to make the wait deterministic
instead of time-based.

---

Nitpick comments:
In `@tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts`:
- Around line 137-239: Add streaming coverage for the least_busy passthrough
guard paths in the existing `/v1/messages` and `/v1/responses` e2e tests. The
current cases only validate non-streaming requests, so extend the same test
block around `postMessages`, `postResponses`, and the `least_busy` virtual
models to send `stream:true` requests and assert routing still diverts away from
the in-flight upstream until the stream ends. Keep the existing slow/fast
upstream setup and request counter checks so a regression that releases
`in_flight` at handler return is caught for both stream-lifetime paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e1c3165-6e0c-408e-945d-33aff46898d4

📥 Commits

Reviewing files that changed from the base of the PR and between 13f09ff and 0e8c5c7.

📒 Files selected for processing (3)
  • crates/aisix-proxy/src/messages.rs
  • crates/aisix-proxy/src/responses.rs
  • tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts

Comment thread tests/e2e/src/cases/least-busy-passthrough-endpoints-e2e.test.ts
CodeRabbit: a fixed 200ms delay can race on slow CI — poll the slow
upstream's request count instead. Arrival implies the in-flight guard
is already held (acquired before the upstream send).
@jarvis9443 jarvis9443 merged commit 8bbc6a7 into main Jul 3, 2026
12 checks passed
@jarvis9443 jarvis9443 deleted the fix/least-busy-in-flight-messages-responses branch July 3, 2026 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(routing): least_busy silently degrades to failover on /v1/messages and /v1/responses — in_flight is only fed by chat

1 participant