Summary
Two CI/CD issues discovered during the PR #7938 parakeet batch deploy (workflow run 27520907697). Both will recur on future deploys until fixed.
Bug 1: GKE auth failure in gcp_parakeet.yml — blocks parakeet CI deploys
Observed: gcp_parakeet.yml Helm step fails with:
Error: Kubernetes cluster unreachable: Get "https://34.136.198.95/version":
getting credentials: exec: executable gke-gcloud-auth-plugin failed with exit code 1
All 4 runs of this workflow have failed with the same error. Other GKE workflows (pusher, diarizer, listen, backend) using the same auth pattern succeed consistently.
Root cause: The parakeet workflow has two differentiators that create conflicting gcloud config state:
- Uses the ubuntu-latest-m runner (a larger runner with a different pre-installed gcloud version)
- Runs gcloud auth configure-docker plus a full Docker build/push BEFORE the apt-installed gke-gcloud-auth-plugin step; this sequence creates a gcloud config state that interferes with the plugin's credential lookup
In other workflows, CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE (set by google-github-actions/auth@v2 via GITHUB_ENV) propagates correctly to the apt-installed plugin. In parakeet, something in the configure-docker + Docker build sequence on the -m runner breaks this propagation.
Scope: gcp_parakeet.yml only (4/4 failures). Other workflows not affected.
Fix options:
- Replace the manual apt-install + get-credentials pattern with the google-github-actions/get-gke-credentials@v2 action (handles the auth plugin lifecycle automatically and avoids the config state conflict)
- Add an explicit gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS after the apt install
- Investigate the exact interaction between configure-docker, the -m runner gcloud version, and the apt plugin version
Option 1 is recommended — it sidesteps the root cause entirely.
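A minimal sketch of Option 1, assuming the workflow keeps authenticating with google-github-actions/auth@v2. The secret name, cluster name, location, and Helm release name are placeholders rather than values taken from gcp_parakeet.yml, the use of credentials_json is an assumption about how the workflow authenticates, and the existing Docker build/push steps are omitted:

```yaml
# Sketch only: values marked "placeholder" are not from the real workflow.
steps:
  - name: Authenticate to Google Cloud
    uses: google-github-actions/auth@v2
    with:
      credentials_json: ${{ secrets.GCP_CREDENTIALS }}  # placeholder secret name

  # Replaces the apt install of gke-gcloud-auth-plugin and the manual
  # "gcloud container clusters get-credentials" call.
  - name: Get GKE credentials
    uses: google-github-actions/get-gke-credentials@v2
    with:
      cluster_name: CLUSTER_NAME   # placeholder
      location: CLUSTER_LOCATION   # placeholder (zone or region)

  - name: Deploy with Helm
    # "parakeet" as the release name is a placeholder.
    run: helm upgrade --install parakeet backend/charts/parakeet
```

The action writes a kubeconfig for later steps, so kubectl and Helm should not depend on the gcloud config state left behind by configure-docker. That is worth confirming with a run on the -m runner before removing the old steps.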
Bug 2: GPU pod rolling update deadlocks on single-GPU nodes
Observed: New parakeet pod stuck Unschedulable with 1 Insufficient nvidia.com/gpu. Rolling update tries to create the new pod before killing the old one, but there is no second GPU available.
Root cause: The parakeet Helm chart templates/deployment.yaml has no strategy field. Kubernetes defaults to RollingUpdate with maxSurge=25% (rounds up to 1 for a single replica) and maxUnavailable=25% (rounds down to 0 for a single replica). This means: create the new pod first, then kill the old one, which deadlocks when only 1 GPU exists.
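For reference, omitting the strategy field is equivalent to writing the Kubernetes defaults out explicitly. The fragment below is illustrative, not copied from the chart, and shows how the percentages resolve for a single replica:

```yaml
# Effective Deployment spec when no strategy is set (Kubernetes defaults).
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # rounds up: 1 extra pod may be created
      maxUnavailable: 25%  # rounds down: 0 pods may be taken offline
```

With 0 pods allowed to be unavailable, the old pod keeps the only GPU until the new pod is Ready, and the new pod can never become Ready without that GPU.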
Scope: 3 GPU services are affected (all request nvidia.com/gpu: 1, all single-replica):
- backend/charts/parakeet/: no strategy defined in deployment template or values
- backend/charts/diarizer/: no strategy defined in deployment template or values
- backend/charts/vad/: no strategy defined in deployment template or values
Non-GPU charts (backend-listen, pusher) already have a strategy block with maxUnavailable:1/maxSurge:2 — not affected.
Runtime fix applied: mon manually set maxUnavailable=1, maxSurge=0 during the deploy. This is NOT committed to the chart — next deploy will hit the same deadlock.
Fix options:
- Add a strategy conditional to the parakeet/diarizer/vad templates/deployment.yaml and set maxUnavailable: 1, maxSurge: 0 in all GPU values files (kill-first rolling update)
- Set strategy: { type: Recreate } in all GPU Helm charts (simplest; acceptable for single-replica GPU services)
- Scale GPU node pools to 2+ nodes (expensive, only justified when serving live traffic)
Option 1 is recommended — it preserves rolling update semantics while preventing the deadlock, and matches the pattern already used by backend-listen and pusher.
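Option 1 could look roughly like the following. The values key name strategy and the template excerpt are assumptions modeled on the description above, not copied from the backend-listen or pusher charts:

```yaml
# values.yaml for each GPU chart (parakeet, diarizer, vad): kill-first rolling update.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1   # the old pod may be terminated first, freeing the GPU
    maxSurge: 0         # never schedule a pod beyond the replica count

# templates/deployment.yaml (excerpt, shown as comments because it is a
# Helm template rather than plain YAML). Renders the block only when set:
#
# spec:
#   {{- with .Values.strategy }}
#   strategy:
#     {{- toYaml . | nindent 4 }}
#   {{- end }}
```

For a single-replica service this still means a short gap while the old pod is gone and the new one starts, much as with Recreate. The difference is that the chart keeps RollingUpdate semantics if the replica count is raised later.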
Impact
by AI for @beastoin