ci: parakeet GKE auth failure + GPU rolling update deadlock on single-GPU nodes

## Summary

Two CI/CD issues discovered during the PR #7938 parakeet batch deploy (workflow run [27520907697](https://github.com/BasedHardware/omi/actions/runs/27520907697)). Both will recur on future deploys until fixed.

## Bug 1: GKE auth failure in gcp_parakeet.yml — blocks parakeet CI deploys

**Observed:** `gcp_parakeet.yml` Helm step fails with:
```
Error: Kubernetes cluster unreachable: Get "https://34.136.198.95/version":
getting credentials: exec: executable gke-gcloud-auth-plugin failed with exit code 1
```

All 4 runs of this workflow have failed with the same error. Other GKE workflows (pusher, diarizer, listen, backend) using the same auth pattern succeed consistently.

**Root cause:** The parakeet workflow has two differentiators that create conflicting gcloud config state:
1. Uses `ubuntu-latest-m` runner (larger runner with different pre-installed gcloud version)
2. Runs `gcloud auth configure-docker` + full Docker build/push BEFORE the apt-installed `gke-gcloud-auth-plugin` step — this sequence creates a gcloud config state that interferes with the plugin's credential lookup

In other workflows, `CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE` (set by `google-github-actions/auth@v2` via `GITHUB_ENV`) propagates correctly to the apt-installed plugin. In parakeet, something in the configure-docker + Docker build sequence on the `-m` runner breaks this propagation.

**Scope:** `gcp_parakeet.yml` only (4/4 failures). Other workflows not affected.

**Fix options:**
1. Replace the manual apt-install + get-credentials pattern with `google-github-actions/get-gke-credentials@v2` action (handles auth plugin lifecycle automatically, avoids the config state conflict)
2. Add explicit `gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS` after apt install
3. Investigate the exact interaction between configure-docker, the `-m` runner gcloud version, and the apt plugin version

Option 1 is recommended — it sidesteps the root cause entirely.

## Bug 2: GPU pod rolling update deadlocks on single-GPU nodes

**Observed:** New parakeet pod stuck `Unschedulable` with `1 Insufficient nvidia.com/gpu`. Rolling update tries to create the new pod before killing the old one, but there is no second GPU available.

**Root cause:** The parakeet Helm chart `templates/deployment.yaml` has no `strategy` field. Kubernetes defaults to `RollingUpdate` with `maxSurge=25%` (rounds to 1 for 1-replica) and `maxUnavailable=0`. This means: create new pod first, then kill old — which deadlocks when only 1 GPU exists.

**Scope:** 3 GPU services are affected (all request `nvidia.com/gpu: 1`, all single-replica):
- `backend/charts/parakeet/` — no strategy defined in deployment template or values
- `backend/charts/diarizer/` — no strategy defined in deployment template or values
- `backend/charts/vad/` — no strategy defined in deployment template or values

Non-GPU charts (backend-listen, pusher) already have a strategy block with `maxUnavailable:1`/`maxSurge:2` — not affected.

**Runtime fix applied:** mon manually set `maxUnavailable=1, maxSurge=0` during the deploy. This is NOT committed to the chart — next deploy will hit the same deadlock.

**Fix options:**
1. Add strategy conditional to parakeet/diarizer/vad `templates/deployment.yaml` + set `maxUnavailable: 1, maxSurge: 0` in all GPU values files (kill-first rolling update)
2. Set `strategy: { type: Recreate }` in all GPU Helm charts (simplest — acceptable for single-replica GPU services)
3. Scale GPU node pools to 2+ nodes (expensive, only justified when serving live traffic)

Option 1 is recommended — it preserves rolling update semantics while preventing the deadlock, and matches the pattern already used by backend-listen and pusher.

## Impact

- Bug 1 blocks all CI-triggered parakeet deploys (4/4 failures)
- Bug 2 blocks all GPU service updates (parakeet, diarizer, vad) unless operator manually overrides the strategy
- Both were worked around during #7938 deploy (mon did manual helm upgrade for Bug 1, set maxUnavailable=1 for Bug 2) but the workarounds are not committed

---
_by AI for @beastoin_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: parakeet GKE auth failure + GPU rolling update deadlock on single-GPU nodes #7959

Summary

Bug 1: GKE auth failure in gcp_parakeet.yml — blocks parakeet CI deploys

Bug 2: GPU pod rolling update deadlocks on single-GPU nodes

Impact

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ci: parakeet GKE auth failure + GPU rolling update deadlock on single-GPU nodes #7959

Description

Summary

Bug 1: GKE auth failure in gcp_parakeet.yml — blocks parakeet CI deploys

Bug 2: GPU pod rolling update deadlocks on single-GPU nodes

Impact

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions