ci(longhaul): add image build, deploy, monitor + auto-upgrade workflows (PR #348 split 3/4)#413

Draft
WentingWu666666 wants to merge 1 commit into documentdb:main from WentingWu666666:developer/wentingwu/longhaul-cicd

Conversation

WentingWu666666 (Collaborator) commented Jun 29, 2026

Summary

Part 3 of 4 of the PR #348 split (long-haul test infrastructure).
Builds on PR #405 (long-haul driver core, merged).

Adds the GitHub Actions plumbing + Kubernetes manifests to build, deploy,
monitor, and auto-upgrade the long-haul test driver on the dedicated
AKS cluster.

Note: The plan originally split this work into two PRs (PR-3 workflows + PR-4 upgrade). Because the upgrade logic is contained entirely in one file (longhaul-monitor.yaml) and all the Go-side dependencies (operations/upgrade.go, K8sClusterClient.UpgradeDocumentDB, schemaVersion: "auto" in setup.yaml) already shipped in PR #405, splitting added review overhead with no benefit. The two are consolidated into this PR, so the long-haul story now lands in 4 PRs instead of 5.

What this PR adds

Workflows (.github/workflows/)

| Workflow | Trigger | Purpose |
| --- | --- | --- |
| longhaul-image-build.yml | push to main (paths: test/longhaul/**), workflow_dispatch | Build test/longhaul/Dockerfile, push to GHCR with :sha-<short> (immutable) + :main tags |
| longhaul-deploy.yml | workflow_run on successful build (auto-deploys :sha-<short>), workflow_dispatch (manual rollback) | kubectl apply Deployment manifest, set image, wait for rollout |
| longhaul-monitor.yaml | hourly cron + workflow_dispatch | (1) Poll Deployment health, report ConfigMap freshness (≤2h), test result ≠ FAIL. (2) Auto-upgrade operator via Helm if a newer GHCR tag is available, with post-upgrade verification + CRD re-apply. (3) Publish the latest DocumentDB GHCR tag into the longhaul-versions ConfigMap; the driver performs the in-band DocumentDB upgrade as a load-aware operation so continuous writers/verifiers can catch any data-integrity regressions. |
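
To make the monitor's three health conditions concrete, here is a minimal Python sketch of the decision logic only. It is not the workflow's actual implementation: the `Snapshot` type, its field names, and the "PASS" result string are hypothetical, and the real workflow would gather these values from the cluster. Only the 2-hour freshness bound and the FAIL result come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_REPORT_AGE = timedelta(hours=2)  # freshness bound stated in the table


@dataclass
class Snapshot:
    """Hypothetical view of cluster state gathered by one monitor run."""
    ready_replicas: int
    desired_replicas: int
    report_time: datetime  # when the report ConfigMap was last written
    result: str            # last test result; only "FAIL" is known from the PR


def health_problems(snap: Snapshot, now: datetime) -> list[str]:
    """Return the failed checks; an empty list means the driver is healthy."""
    problems = []
    if snap.ready_replicas < snap.desired_replicas:
        problems.append("deployment not ready")
    # A report exactly 2h old still counts as fresh (the bound is <= 2h).
    if now - snap.report_time > MAX_REPORT_AGE:
        problems.append("report stale")
    if snap.result == "FAIL":
        problems.append("test result is FAIL")
    return problems


now = datetime(2026, 6, 30, 12, 0, tzinfo=timezone.utc)
stale = Snapshot(1, 1, now - timedelta(hours=3), "PASS")  # "PASS" is assumed
print(health_problems(stale, now))  # prints ['report stale']
```

Returning every failed check, rather than stopping at the first, would let a single hourly run report all problems at once.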

Manifests (test/longhaul/deploy/)

  • deployment.yaml — single-replica Deployment + tunable ConfigMap; image fields templated (__OWNER__/__IMAGE_TAG__) for the deploy workflow.
  • rbac.yaml — namespace-scoped ServiceAccount/Role/RoleBinding (pods, documentdb.io/dbs, configmaps) + ClusterRole for metrics.k8s.io.
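
The placeholder substitution that the deploy workflow performs on deployment.yaml can be sketched as follows. This is an illustration under stated assumptions, not the workflow's code: the 7-character short SHA length, the helper names, and the image path in the usage line are all hypothetical. Only the `__OWNER__` and `__IMAGE_TAG__` placeholders and the `sha-<short>` tag shape come from this PR.

```python
import re

SHORT_SHA_LEN = 7  # assumption: the PR does not state how long <short> is


def image_tag(commit_sha: str) -> str:
    """Build the immutable tag (sha-<short>) from a full 40-char commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError(f"not a full git SHA: {commit_sha!r}")
    return f"sha-{commit_sha[:SHORT_SHA_LEN]}"


def render_manifest(template: str, owner: str, tag: str) -> str:
    """Fill the __OWNER__ and __IMAGE_TAG__ placeholders in a manifest."""
    # Container image repository names must be lowercase, so normalize owner.
    rendered = template.replace("__OWNER__", owner.lower())
    rendered = rendered.replace("__IMAGE_TAG__", tag)
    # Fail loudly if any other __PLACEHOLDER__ was left unfilled.
    leftover = re.findall(r"__[A-Z]+(?:_[A-Z]+)*__", rendered)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return rendered


# Hypothetical image path; the real one lives in deployment.yaml.
template = "image: ghcr.io/__OWNER__/longhaul:__IMAGE_TAG__"
tag = image_tag("0123456789abcdef0123456789abcdef01234567")
print(render_manifest(template, "documentdb", tag))
```

Checking for leftover placeholders before applying the manifest is a defensive choice: an unrendered `__IMAGE_TAG__` would otherwise surface only later, as an image pull failure on the cluster.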

Setup required before the workflows can run

Cluster admin one-time bootstrap (the deployer ServiceAccount is namespace-scoped by design):

  1. kubectl apply -f test/longhaul/deploy/setup.yaml (already on main via PR #405, "test(longhaul): add long-haul test driver core"; creates the namespace, DocumentDB CR, and credentials placeholder)
  2. kubectl apply -f test/longhaul/deploy/rbac.yaml
  3. Create longhaul-documentdb-credentials secret with key uri (see test/longhaul/README.md)
  4. Issue a long-lived kubeconfig for the longhaul-test ServiceAccount, store as repo secret LONGHAUL_KUBECONFIG

Why draft

Want to dry-run the workflows end-to-end on the long-haul AKS cluster before un-drafting.

Test plan

  • python -c "import yaml; yaml.safe_load(...)" parses all three workflow YAMLs.
  • Manual dry-run on cluster (pending LONGHAUL_KUBECONFIG secret provisioning).

documentdb-triage-tool (Bot) added the CI/CD and enhancement labels Jun 29, 2026
@documentdb-triage-tool

🤖 Auto-triaged by documentdb-triage-tool.

Applied: CI/CD, enhancement
Project fields suggested: Component ci · Priority P2 · Effort L · Status In Progress
Confidence: 0.30 (deterministic)

Reasoning

component from path globs (ci); effort from diff stats (672+0 LOC, 5 files); LLM failed: Invalid response body while trying to fetch https://api.anthropic.com/v1/messages: Premature close

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

Introduces the GitHub Actions plumbing for the long-haul test driver
that landed in PR documentdb#405. Three workflows + two deploy manifests:

* .github/workflows/longhaul-image-build.yml
    Build/push test/longhaul image to GHCR on main push and on demand.
    Tags every run as :sha-<short> (immutable) plus :main.
* .github/workflows/longhaul-deploy.yml
    Roll an image onto the long-haul AKS cluster. Auto-triggered after
    a successful image build (pins to :sha-<short>) and via
    workflow_dispatch for rollbacks. Uses a namespace-scoped kubeconfig
    in the LONGHAUL_KUBECONFIG secret.
* .github/workflows/longhaul-monitor.yaml
    Hourly health poll: Deployment ready, report ConfigMap fresh
    (<=2h), test result != FAIL. Auto-upgrade and DocumentDB version
    publishing are intentionally left out and will land in a separate
    upgrade PR.
* test/longhaul/deploy/deployment.yaml
    Single-replica Deployment + ConfigMap. Image fields templated
    (__OWNER__/__IMAGE_TAG__) for the deploy workflow to substitute.
    Aligned with the post-PR-405 env-var surface
    (LONGHAUL_DOCUMENTDB_URI, no NUM_VERIFIERS) and the credential
    secret name documented in test/longhaul/README.md
    (longhaul-documentdb-credentials with a uri key).
* test/longhaul/deploy/rbac.yaml
    Namespace-scoped ServiceAccount/Role/RoleBinding (pods, dbs,
    configmaps) plus a ClusterRole for metrics.k8s.io.

Splits PR documentdb#348 part 3 of 5. Operator/DocumentDB auto-upgrade plus
post-upgrade verification follow in PR-4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
WentingWu666666 force-pushed the developer/wentingwu/longhaul-cicd branch from 8b658e6 to db58044 on June 30, 2026 17:10
WentingWu666666 changed the title from "ci(longhaul): add image build, deploy, and monitor workflows (PR #348 split 3/5)" to "ci(longhaul): add image build, deploy, monitor + auto-upgrade workflows (PR #348 split 3/4)" on Jun 30, 2026