
Add configurable Kubernetes API client QPS and burst rate limits #1281

Open
mcornea wants to merge 1 commit into aws:main from mcornea:configurable-kube-api-qps

@mcornea mcornea commented Jun 16, 2026


Description of changes

Add KUBE_API_QPS and KUBE_API_BURST environment variables (and corresponding --kube-api-qps / --kube-api-burst CLI flags) to configure the Kubernetes API client rate limits.

The client created via rest.InClusterConfig() uses client-go defaults of QPS=5 and Burst=10. When WORKERS is set to 50 or higher for correlated spot interruption scenarios, all workers share this single rate limiter, causing severe client-side throttling (8-10 second waits per API call). This prevents NTH from processing all interruption events within the 2-minute AWS spot interruption notice window.

Changes:

  • pkg/config/config.go: Add KUBE_API_QPS (default 5) and KUBE_API_BURST (default 10) configuration parameters
  • cmd/node-termination-handler.go: Apply QPS/Burst to rest.Config before creating the clientset

Defaults preserve backward compatibility.
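As a rough illustration of how the two settings could be resolved with backward-compatible defaults, here is a minimal, standard-library-only sketch. The `clientLimits` struct is a stand-in for the `QPS` (float32) and `Burst` (int) fields of client-go's `rest.Config`, and `limitsFromEnv` is a hypothetical helper; neither is taken from this PR's diff, and parsing QPS as a float is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// clientLimits stands in for the QPS and Burst fields of rest.Config so the
// sketch runs without client-go. The type name is hypothetical.
type clientLimits struct {
	QPS   float32
	Burst int
}

// limitsFromEnv reads KUBE_API_QPS and KUBE_API_BURST through the supplied
// lookup function and falls back to the defaults (5 and 10) when a variable
// is unset or empty. Hypothetical helper, not the code in this PR.
func limitsFromEnv(lookup func(string) (string, bool)) (clientLimits, error) {
	limits := clientLimits{QPS: 5, Burst: 10}

	if v, ok := lookup("KUBE_API_QPS"); ok && v != "" {
		q, err := strconv.ParseFloat(v, 32)
		if err != nil {
			return limits, fmt.Errorf("invalid KUBE_API_QPS %q: %v", v, err)
		}
		limits.QPS = float32(q)
	}

	if v, ok := lookup("KUBE_API_BURST"); ok && v != "" {
		b, err := strconv.Atoi(v)
		if err != nil {
			return limits, fmt.Errorf("invalid KUBE_API_BURST %q: %v", v, err)
		}
		limits.Burst = b
	}

	return limits, nil
}

func main() {
	limits, err := limitsFromEnv(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In the real handler these values would be copied onto the rest.Config
	// returned by rest.InClusterConfig() before the clientset is created.
	fmt.Printf("QPS=%v Burst=%d\n", limits.QPS, limits.Burst)
}
```

Passing the lookup function as a parameter keeps the helper easy to exercise with a map in tests instead of mutating the process environment.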

Fixes: #1280

How you tested your changes

Environment (Linux): ROSA HCP on AWS (us-east-2), 50 spot nodes (c5a.xlarge, 3 AZs)
Kubernetes Version: v1.35.5 (OpenShift 4.22)

Triggered 50 concurrent spot interruptions via AWS FIS (aws:ec2:send-spot-instance-interruptions with durationBeforeInterruption=PT2M):

| Configuration | Nodes tainted | Cordon P99 | Throttle events |
| --- | --- | --- | --- |
| WORKERS=50, QPS=5 (default) | 39/50 (78%) | 154s | 10 events, 8-10s each |
| WORKERS=50, QPS=100, Burst=200 | 50/50 (100%) | 52s | 0 |

NTH logs with default QPS show continuous client-side throttling:

INF Waited for 9.994s due to client-side throttling, not priority and fairness, request: GET:...
WRN all workers busy, waiting

With QPS=100 and Burst=200, there were zero throttling events and all 50 nodes were processed within the 2-minute window.

The Kubernetes API client created via rest.InClusterConfig() uses
client-go defaults of QPS=5 and Burst=10. When WORKERS is set to
50 or higher for correlated spot interruption scenarios, all workers
share this single rate limiter, causing severe client-side throttling
(8-10 second waits per API call).

Add KUBE_API_QPS and KUBE_API_BURST environment variables (and
corresponding CLI flags) to allow configuring the client rate limits.
Defaults preserve backward compatibility (QPS=5, Burst=10).

Testing with 50 concurrent spot interruptions shows:
- Default QPS=5: 78% taint success, 154s P99 cordon latency
- QPS=100, Burst=200: 100% taint success, 52s P99 cordon latency

Fixes: aws#1280

Signed-off-by: Marius Cornea <mcornea@redhat.com>
@mcornea mcornea requested a review from a team as a code owner June 16, 2026 07:57

Development

Successfully merging this pull request may close these issues.

Kubernetes API client QPS/Burst rate limits are hardcoded, causing throttling at high WORKERS concurrency
