ci(longhaul): add image build, deploy, monitor + auto-upgrade workflows (PR #348 split 3/4) #413
Draft
WentingWu666666 wants to merge 1 commit into
🤖 Auto-triaged by documentdb-triage-tool. Applied: … Reasoning: component from path globs (ci); effort from diff stats (672+0 LOC, 5 files); LLM failed: Invalid response body while trying to fetch https://api.anthropic.com/v1/messages: Premature close. If a label is wrong, remove it manually and ping
Introduces the GitHub Actions plumbing for the long-haul test driver that landed in PR documentdb#405. Three workflows + two deploy manifests:

* .github/workflows/longhaul-image-build.yml
  Build/push test/longhaul image to GHCR on main push and on demand. Tags every run as :sha-<short> (immutable) plus :main.
* .github/workflows/longhaul-deploy.yml
  Roll an image onto the long-haul AKS cluster. Auto-triggered after a successful image build (pins to :sha-<short>) and via workflow_dispatch for rollbacks. Uses a namespace-scoped kubeconfig in the LONGHAUL_KUBECONFIG secret.
* .github/workflows/longhaul-monitor.yaml
  Hourly health poll: Deployment ready, report ConfigMap fresh (<=2h), test result != FAIL. Auto-upgrade and DocumentDB version publishing are intentionally left out and will land in a separate upgrade PR.
* test/longhaul/deploy/deployment.yaml
  Single-replica Deployment + ConfigMap. Image fields templated (__OWNER__/__IMAGE_TAG__) for the deploy workflow to substitute. Aligned with the post-PR-405 env-var surface (LONGHAUL_DOCUMENTDB_URI, no NUM_VERIFIERS) and the credential secret name documented in test/longhaul/README.md (longhaul-documentdb-credentials with a uri key).
* test/longhaul/deploy/rbac.yaml
  Namespace-scoped ServiceAccount/Role/RoleBinding (pods, dbs, configmaps) plus a ClusterRole for metrics.k8s.io.

Splits PR documentdb#348 part 3 of 5. Operator/DocumentDB auto-upgrade plus post-upgrade verification follow in PR-4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
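The three rules listed in the commit message (two GHCR tags per build, placeholder substitution in `deployment.yaml`, and the monitor's health verdict) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the workflows' actual steps: the 7-character short SHA, lowercasing the owner for GHCR, the `longhaul` image name used in the example, and all function names are hypothetical.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone

SHORT_SHA_LEN = 7  # assumption: the PR only says ":sha-<short>"
MAX_REPORT_AGE = timedelta(hours=2)  # "report ConfigMap fresh (<=2h)"
PLACEHOLDER = re.compile(r"__[A-Z_]+__")


def image_refs(owner: str, image: str, commit_sha: str) -> list[str]:
    """GHCR refs pushed per build: immutable :sha-<short> plus the moving :main tag."""
    repo = f"ghcr.io/{owner.lower()}/{image}"  # image repository names must be lowercase
    return [f"{repo}:sha-{commit_sha[:SHORT_SHA_LEN]}", f"{repo}:main"]


def render_manifest(template: str, owner: str, image_tag: str) -> str:
    """Substitute __OWNER__/__IMAGE_TAG__ and refuse to emit a half-rendered manifest."""
    rendered = template.replace("__OWNER__", owner.lower()).replace("__IMAGE_TAG__", image_tag)
    leftover = PLACEHOLDER.findall(rendered)
    if leftover:
        raise ValueError(f"unsubstituted placeholders: {leftover}")
    return rendered


def health_failures(
    ready_replicas: int | None,
    desired_replicas: int,
    report_updated_at: datetime,
    result: str,
    now: datetime,
) -> list[str]:
    """Return the reasons the long-haul run is unhealthy; an empty list means healthy."""
    failures = []
    # Kubernetes omits status.readyReplicas when it is zero, so treat None as 0.
    ready = ready_replicas or 0
    if ready < desired_replicas:
        failures.append(f"deployment not ready ({ready}/{desired_replicas})")
    # Fresh means at most 2h old; exactly 2h still passes.
    if now - report_updated_at > MAX_REPORT_AGE:
        failures.append("report ConfigMap older than 2h")
    if result == "FAIL":
        failures.append("test result is FAIL")
    return failures
```

In this sketch a `workflow_dispatch` rollback is simply `render_manifest` called with an older `sha-<short>` tag; pinning to the immutable tag (rather than `:main`) is what makes the rollback reproducible.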
Updated from 8b658e6 to db58044
Summary
Part 3 of 4 of the PR #348 split (long-haul test infrastructure).
Builds on PR #405 (long-haul driver core, merged).
Adds the GitHub Actions plumbing + Kubernetes manifests to build, deploy,
monitor, and auto-upgrade the long-haul test driver on the dedicated
AKS cluster.
What this PR adds
Workflows (`.github/workflows/`)

| Workflow | Trigger | What it does |
|---|---|---|
| `longhaul-image-build.yml` | push to `main` (paths: `test/longhaul/**`), `workflow_dispatch` | Build `test/longhaul/Dockerfile`, push to GHCR with `:sha-<short>` (immutable) + `:main` tags |
| `longhaul-deploy.yml` | after a successful image build (pins `:sha-<short>`), `workflow_dispatch` (manual rollback) | `kubectl apply` Deployment manifest, set image, wait for rollout |
| `longhaul-monitor.yaml` | … | … `longhaul-versions` ConfigMap; the driver performs the in-band DocumentDB upgrade as a load-aware operation so continuous writers/verifiers can catch any data-integrity regressions. |

Manifests (`test/longhaul/deploy/`)

- `deployment.yaml`: single-replica Deployment + tunable ConfigMap; image fields templated (`__OWNER__`/`__IMAGE_TAG__`) for the deploy workflow.
- `rbac.yaml`: namespace-scoped ServiceAccount/Role/RoleBinding (pods, `documentdb.io/dbs`, configmaps) + ClusterRole for `metrics.k8s.io`.

Setup required before the workflows can run
Cluster admin one-time bootstrap (the deployer ServiceAccount is namespace-scoped by design):

1. `kubectl apply -f test/longhaul/deploy/setup.yaml` (already on main via PR #405 "test(longhaul): add long-haul test driver core"; namespace + DocumentDB CR + credentials placeholder)
2. `kubectl apply -f test/longhaul/deploy/rbac.yaml`
3. Populate the `longhaul-documentdb-credentials` secret with key `uri` (see `test/longhaul/README.md`)
4. Create a kubeconfig for the `longhaul-test` ServiceAccount and store it as the repo secret `LONGHAUL_KUBECONFIG`

Why draft
Want to dry-run the workflows end-to-end on the long-haul AKS cluster before un-drafting.
Test plan
- `python -c "import yaml; yaml.safe_load(...)"` parses all three workflow YAMLs.
- … `LONGHAUL_KUBECONFIG` secret provisioning).
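The `yaml.safe_load` parse check in the test plan could be stretched into a small structural check. One pitfall when doing this with PyYAML: it follows YAML 1.1, so the bare `on:` key of a GitHub Actions workflow loads as the boolean `True`, not the string `"on"`. The sketch below is an assumption, not part of this PR; the script name `check_workflows.py` and the specific checks are hypothetical.

```python
from __future__ import annotations

import sys


def workflow_problems(doc: object) -> list[str]:
    """Structural checks on one workflow already loaded with yaml.safe_load."""
    if not isinstance(doc, dict):
        return ["top level is not a mapping"]
    problems = []
    # PyYAML follows YAML 1.1, where a bare `on` key resolves to the boolean True.
    if "on" not in doc and True not in doc:
        problems.append("missing trigger ('on') section")
    jobs = doc.get("jobs")
    if not isinstance(jobs, dict) or not jobs:
        problems.append("missing or empty 'jobs' mapping")
    return problems


def main(paths: list[str]) -> int:
    import yaml  # PyYAML; imported here so workflow_problems() is usable without it

    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            doc = yaml.safe_load(fh)
        for problem in workflow_problems(doc):
            print(f"{path}: {problem}")
            failed = True
    return 1 if failed else 0


# Only run the file checks when paths are passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

Usage, assuming the script is saved at the repo root: `python check_workflows.py .github/workflows/longhaul-*.y*ml` (the glob matches both the `.yml` and `.yaml` files).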