ci: parakeet GKE auth failure + GPU rolling update deadlock on single-GPU nodes #7959

@beastoin

Summary

Two CI/CD issues discovered during the PR #7938 parakeet batch deploy (workflow run 27520907697). Both will recur on future deploys until fixed.

Bug 1: GKE auth failure in gcp_parakeet.yml — blocks parakeet CI deploys

Observed: gcp_parakeet.yml Helm step fails with:

Error: Kubernetes cluster unreachable: Get "https://34.136.198.95/version":
getting credentials: exec: executable gke-gcloud-auth-plugin failed with exit code 1

All 4 runs of this workflow have failed with the same error. Other GKE workflows (pusher, diarizer, listen, backend) using the same auth pattern succeed consistently.

Root cause (suspected, not yet confirmed): The parakeet workflow differs from the working GKE workflows in two ways that appear to leave gcloud in a conflicting config state:

  1. It uses the ubuntu-latest-m runner (a larger runner with a different pre-installed gcloud version).
  2. It runs gcloud auth configure-docker plus a full Docker build/push BEFORE the step that apt-installs gke-gcloud-auth-plugin. This ordering seems to leave gcloud config in a state that interferes with the plugin's credential lookup.

In other workflows, CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE (set by google-github-actions/auth@v2 via GITHUB_ENV) propagates correctly to the apt-installed plugin. In parakeet, that propagation appears to break somewhere in the configure-docker + Docker build sequence on the -m runner; the exact mechanism has not been identified.
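One way to narrow this down is a temporary diagnostic step placed right before the Helm step, and the same step in a workflow that succeeds, so the two outputs can be compared. This is only a sketch: the step name and placement are assumptions, and it prints whether the credential variables are set rather than their values.

```yaml
  # Temporary diagnostic step (hypothetical name); remove once the cause is known.
  - name: Debug gcloud auth state
    run: |
      # Tool versions on this runner
      gcloud version
      gke-gcloud-auth-plugin --version || true
      # Prints "yes" only if the variable is set and non-empty; never prints the value
      echo "CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE set: ${CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE:+yes}"
      echo "GOOGLE_APPLICATION_CREDENTIALS set: ${GOOGLE_APPLICATION_CREDENTIALS:+yes}"
      # Active account and config as gcloud sees them at this point
      gcloud auth list
      gcloud config list
```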

Scope: gcp_parakeet.yml only (4/4 failures). Other workflows not affected.

Fix options:

  1. Replace the manual apt-install + get-credentials pattern with google-github-actions/get-gke-credentials@v2 action (handles auth plugin lifecycle automatically, avoids the config state conflict)
  2. Add explicit gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS after apt install
  3. Investigate the exact interaction between configure-docker, the -m runner gcloud version, and the apt plugin version

Option 1 is recommended — it sidesteps the root cause entirely.
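A minimal sketch of what Option 1 could look like in gcp_parakeet.yml. The cluster name, location, and secret name are placeholders, not values taken from the real workflow, and the auth inputs should mirror whatever the existing workflows already pass to google-github-actions/auth@v2.

```yaml
steps:
  - uses: actions/checkout@v4

  - name: Authenticate to Google Cloud
    uses: google-github-actions/auth@v2
    with:
      credentials_json: ${{ secrets.GCP_CREDENTIALS }}  # hypothetical secret name

  # ... gcloud auth configure-docker and Docker build/push stay as they are ...

  # Replaces the apt install of gke-gcloud-auth-plugin and the manual
  # `gcloud container clusters get-credentials` call.
  - name: Get GKE credentials
    uses: google-github-actions/get-gke-credentials@v2
    with:
      cluster_name: my-gpu-cluster  # placeholder
      location: us-central1         # placeholder

  # Quick sanity check before the Helm step
  - name: Verify cluster access
    run: kubectl get nodes
```

The action writes a kubeconfig and exports KUBECONFIG for later steps, so the Helm step should not need further changes.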

Bug 2: GPU pod rolling update deadlocks on single-GPU nodes

Observed: New parakeet pod stuck Unschedulable with 1 Insufficient nvidia.com/gpu. Rolling update tries to create the new pod before killing the old one, but there is no second GPU available.

Root cause: The parakeet Helm chart templates/deployment.yaml has no strategy field, so Kubernetes defaults to RollingUpdate with maxSurge=25% and maxUnavailable=25%. For a 1-replica Deployment, maxSurge rounds up to 1 and maxUnavailable rounds down to 0. This means the new pod must be created and become ready before the old one is killed, which deadlocks when only 1 GPU exists.
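For reference, this is the strategy Kubernetes effectively applies when the field is omitted, written out explicitly. The comments show how the percentages resolve for a single replica.

```yaml
# Equivalent of omitting spec.strategy on a Deployment with replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # rounded up:   ceil(0.25 * 1)  = 1 extra pod allowed
    maxUnavailable: 25%  # rounded down: floor(0.25 * 1) = 0 pods may be unavailable
```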

Scope: 3 GPU services are affected (all request nvidia.com/gpu: 1, all single-replica):

  • backend/charts/parakeet/ — no strategy defined in deployment template or values
  • backend/charts/diarizer/ — no strategy defined in deployment template or values
  • backend/charts/vad/ — no strategy defined in deployment template or values

Non-GPU charts (backend-listen, pusher) already have a strategy block with maxUnavailable:1/maxSurge:2 — not affected.

Runtime fix applied: mon manually set maxUnavailable=1, maxSurge=0 during the deploy. This is NOT committed to the chart — next deploy will hit the same deadlock.

Fix options:

  1. Add strategy conditional to parakeet/diarizer/vad templates/deployment.yaml + set maxUnavailable: 1, maxSurge: 0 in all GPU values files (kill-first rolling update)
  2. Set strategy: { type: Recreate } in all GPU Helm charts (simplest — acceptable for single-replica GPU services)
  3. Scale GPU node pools to 2+ nodes (expensive, only justified when serving live traffic)

Option 1 is recommended — it preserves rolling update semantics while preventing the deadlock, and matches the pattern already used by backend-listen and pusher.
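A sketch of Option 1 follows, assuming the GPU charts expose a top-level `strategy` key in their values files. The key name and the `replicaCount` value are assumptions, and the existing blocks in backend-listen and pusher should be checked for the exact convention. First, the excerpt for templates/deployment.yaml:

```yaml
# templates/deployment.yaml (excerpt)
spec:
  replicas: {{ .Values.replicaCount }}  # assumed value name
  {{- with .Values.strategy }}
  strategy:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```

And the matching entry in each GPU values file:

```yaml
# values file for parakeet / diarizer / vad
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1  # allow the old pod to be taken down first
    maxSurge: 0        # never request a second GPU during the rollout
```

With a single replica this still means a short window with no ready pod during each deploy, much like Option 2 (`type: Recreate`).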

Impact


by AI for @beastoin

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
    p1: Priority: Critical (score 22-29)
