CORS-4470: add GCP Workload Identity Federation support#10610
CORS-4470: add GCP Workload Identity Federation support#10610rochacbruno wants to merge 5 commits into
Conversation
Add support for short-lived token authentication via GCP Workload
Identity Federation (WIF), replacing static service account keys.
Two modes are supported:
- BYO: customer provides existing poolID and providerID in the
install-config. The installer describes the provider via the GCP
IAM API to extract the OIDC issuer URL, validates it, and generates
external_account credential manifests.
- Installer-provisioned: customer sets workloadIdentityFederation: {}
with empty fields. The installer creates a WIF pool, GCS bucket for
OIDC discovery, OIDC provider, and service account bindings using
deterministic names derived from infraID.
CredentialsMode auto-defaults to Manual when WIF is configured.
Destroy flow cleans up installer-provisioned WIF resources but leaves
BYO resources untouched.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@rochacbruno: This pull request references CORS-4470 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
WalkthroughThis PR adds complete Workload Identity Federation (WIF) support to OpenShift's GCP installer. It defines new configuration types, validates WIF provider setup, generates external-account credentials during installation, provisions WIF infrastructure (pool, OIDC issuer, service account bindings), and cleans up resources on cluster destruction. ChangesGCP Workload Identity Federation
🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
pkg/asset/manifests/openshift.go (1)
390-396: GCP project-number extraction is consistent, but strengthen validation instead of changing fields.
- In the vendored
cloudresourcemanager/v3API,Projectdoes not expose aProjectNumberfield;Project.Nameis documented asprojects/{project_number}, soproject.Name[9:]matches the contract.- Still consider parsing more defensively: require the
"projects/"prefix and validate the remainder is non-empty digits before using it in the WIFaudience.projectNumber := "" if len(project.Name) > 9 { projectNumber = project.Name[9:] } if projectNumber == "" { return nil, fmt.Errorf("unexpected project name format: %s", project.Name) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/asset/manifests/openshift.go` around lines 390 - 396, The current extraction of projectNumber from project.Name (projectNumber := ""; if len(project.Name) > 9 { projectNumber = project.Name[9:] } ...) assumes the format but lacks strict validation; update the logic that reads project.Name to first verify it has the "projects/" prefix, then extract the suffix and validate that the remainder is non-empty and consists only of digits before using it to build the WIF audience or returning it; reference the project.Name and projectNumber variables and the WIF audience construction so you replace the naive slice-based extraction with prefix-checking and digit-validation and return the same error if validation fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/asset/installconfig/gcp/client.go`:
- Around line 805-808: The IAM service used for WIF lookup is created via
GetIAMService without applying the endpoint override, causing mismatch with
GetServiceAccount which uses c.endpointName; update the call site in the
function that calls GetIAMService (where IAM service is stored in iamSvc) to
construct the IAM client with the same endpoint override as GetServiceAccount by
passing c.endpointName (or the equivalent endpoint option) into GetIAMService
(or use a variant that accepts client options), so the IAM service honors
c.endpointName for PSC/custom-endpoint environments.
In `@pkg/asset/installconfig/gcp/validation.go`:
- Around line 753-757: The current GetWIFProvider error handling treats all
failures as field.Invalid; change it to return field.NotFound when the lookup
indicates the provider is missing and field.InternalError for other
API/transport failures: call client.GetWIFProvider (using ic.GCP.ProjectID,
wif.PoolID, wif.ProviderID) and inspect the returned error (e.g., using the API
client's NotFound check or error type), then append either
field.NotFound(fldPath, wif.ProviderID, "...") for missing provider or
field.InternalError(fldPath, err.Error()) (or equivalent) for other errors
instead of always using field.Invalid.
In `@pkg/destroy/gcp/wif.go`:
- Around line 11-146: Add unit tests covering the new WIF destroy paths: write
tests for listWIFProviders, deleteWIFProvider, destroyWIFProviders and
listWIFPools, deleteWIFPool, destroyWIFPools that validate gating behavior (when
o.wifEnabled/o.wifBYO), correct list filtering (skip DELETED and match expected
pool prefix), operation/error handling (simulate iamSvc errors and isNoOp
cases), and pending-item lifecycle
(insertPendingItems/getPendingItems/deletePendingItems interaction and
suppression via errorTracker). Use a fake/mocked iamSvc responses and context
timeouts to assert logging/return values and that delete flows call the right
methods and mark items pending/cleared; include tests for error paths returning
wrapped errors and for no-op errors being treated as success.
- Around line 47-56: The code currently removes pending items and logs deletion
immediately after calling
o.iamSvc.Projects.Locations.WorkloadIdentityPools.Providers.Delete (and
similarly in the other block at lines ~116-125) even when the returned op is not
Done; change the flow to wait for the long‑running operation to complete (poll
or use the operation Get/Wait until op.Done is true or returns a non-retryable
error) before calling o.deletePendingItems(item.typeName, []cloudResource{item})
and o.Logger.Infof; preserve existing error handling with isNoOp and ensure you
handle op == nil cases, updating both the provider Delete call and the
corresponding delete path at the other block so resources are only marked
deleted after op.Done is observed.
In `@pkg/infrastructure/gcp/clusterapi/wif.go`:
- Around line 103-105: The OIDC provider is being configured with
AllowedAudiences derived from projectID while BuildAudienceURI and the WIF
credential flow use projectNumber, causing an audience mismatch; update the code
paths that construct AllowedAudiences (the code that calls createOIDCProvider
and the provider configuration in the functions that set AllowedAudiences) to
use projectNumber (or the same BuildAudienceURI result) instead of projectID so
the provider trusts the exact audience URI produced by BuildAudienceURI;
specifically, change uses of projectID when building AllowedAudiences to use
projectNumber (or call BuildAudienceURI and use that string) in the
functions/createOIDCProvider call sites referenced by providerName, poolName,
issuerURL, and infraID so the provider and WIF tokens share the same audience.
- Around line 179-180: The Workload Identity Federation create calls (e.g.,
iamSvc.Projects.Locations.WorkloadIdentityPools.Create and
WorkloadIdentityPoolProviders.Create where op, err := ...Do()) must be made
idempotent: on create failure detect "AlreadyExists"/HTTP 409 (googleapi.Error
with Code 409) and instead call the corresponding Get (or Do/Get method) to
reconcile the existing resource and continue rather than bubbling the error;
apply the same pattern to the subsequent create calls and any binding/role setup
around lines referenced (192-196, 241-242) so transient failures don’t leave the
installer stuck—i.e. try Get first or catch AlreadyExists, load the existing
resource, validate/merge needed fields, and proceed with the remaining steps.
- Around line 209-217: The discovery document is uploaded but the advertised
JWKS is not, so call GenerateJWKS and upload its output to the bucket at the
path referenced by generateOIDCDiscoveryDoc's jwks_uri (the code expects
"<issuer>/keys.json" not under .well-known); in createOIDCBucket (and the other
similar upload blocks around the other OIDC-bucket creation sites) add logic to
call GenerateJWKS(ctx), create a writer for "keys.json" (set ContentType
"application/json"), write the JWKS bytes, close the writer, and return wrapped
errors on write/close failures so Google can fetch the keyset for token
verification.
---
Nitpick comments:
In `@pkg/asset/manifests/openshift.go`:
- Around line 390-396: The current extraction of projectNumber from project.Name
(projectNumber := ""; if len(project.Name) > 9 { projectNumber =
project.Name[9:] } ...) assumes the format but lacks strict validation; update
the logic that reads project.Name to first verify it has the "projects/" prefix,
then extract the suffix and validate that the remainder is non-empty and
consists only of digits before using it to build the WIF audience or returning
it; reference the project.Name and projectNumber variables and the WIF audience
construction so you replace the naive slice-based extraction with
prefix-checking and digit-validation and return the same error if validation
fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0289485a-9446-4b05-991c-74503fbfa1f5
⛔ Files ignored due to path filters (2)
data/data/install.openshift.io_installconfigs.yamlis excluded by!data/data/install.openshift.io_installconfigs.yamlpkg/types/gcp/zz_generated.deepcopy.gois excluded by!**/zz_generated*
📒 Files selected for processing (18)
data/data/manifests/openshift/cloud-creds-secret.yaml.templatepkg/asset/cluster/gcp/gcp.gopkg/asset/installconfig/azure/mock/azureclient_generated.gopkg/asset/installconfig/gcp/client.gopkg/asset/installconfig/gcp/mock/gcpclient_generated.gopkg/asset/installconfig/gcp/validation.gopkg/asset/manifests/openshift.gopkg/asset/manifests/template.gopkg/destroy/gcp/gcp.gopkg/destroy/gcp/wif.gopkg/infrastructure/gcp/clusterapi/clusterapi.gopkg/infrastructure/gcp/clusterapi/wif.gopkg/infrastructure/gcp/clusterapi/wif_test.gopkg/types/defaults/installconfig.gopkg/types/gcp/metadata.gopkg/types/gcp/platform.gopkg/types/gcp/validation/platform.gopkg/types/gcp/validation/platform_test.go
Use fmt.Errorf with %w in destroy/gcp/wif.go instead of deprecated errors.Wrapf from github.com/pkg/errors. Remove the unrelated Azure mock cosmetic diff that was pulled in by hack/go-genmock.sh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upload JWKS (keys.json) to the OIDC GCS bucket so GCP can validate projected service account tokens. The bound SA signing key must be provided by the user via bound-service-account-signing-key.key in the asset directory. Thread the BoundSASigningKey asset through PreProvisionInput so the GCP WIF provisioning flow can extract the public key and generate the JSON Web Key Set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix audience mismatch: use projectNumber instead of projectID when building the OIDC provider AllowedAudiences, matching the audience URI in the generated external_account credentials - Add PSC endpoint override to GetWIFProvider for custom-endpoint environments - Wait for WIF delete LROs before marking resources as deleted in the destroy flow, preventing premature cleanup - Make WIF pool and provider creation idempotent by handling 409 AlreadyExists errors - Use field.NotFound for missing WIF providers and field.InternalError for API failures instead of generic field.Invalid Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Handle error return from generateOIDCDiscoveryDoc (errcheck) - Annotate external_account type string as not a hardcoded credential (gosec G101) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@rochacbruno: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| workerIgnAsset := &machine.Worker{} | ||
| tfvarsAsset := &tfvars.TerraformVariables{} | ||
| rootCA := &tls.RootCA{} | ||
| boundSASigningKey := &tls.BoundSASigningKey{} |
There was a problem hiding this comment.
There are 3 altearnatives to this:
- Generate the key pair inside the installer
Instead of requiring the user to provide bound-service-account-signing-key.key, make the BoundSASigningKey asset's Generate() method actually produce an RSA key pair when WIF is enabled (currently it's a no-op). The private key would flow into bootstrap ignition so kube-apiserver uses it for SA token signing, and the public key would be uploaded as JWKS to the OIDC bucket.
The challenge: BoundSASigningKey.Generate() currently has no access to the install config (its Dependencies() returns nil). We'd need to add InstallConfig as a dependency and conditionally generate only when WIF is active. This changes the asset DAG, but it's the cleanest long-term approach since the user doesn't need to know about signing keys at all.
- Defer JWKS upload to post-bootstrap
Create the WIF pool, bucket, and OIDC provider during PreProvision but leave the JWKS empty. After bootstrap, when the cluster has generated its own SA signing key, upload the JWKS from a PostProvision hook, or let the cloud-credential-operator handle it.
The trade-off: WIF token validation won't work until after bootstrap completes and the JWKS is uploaded. That's probably fine since WIF credentials aren't needed during bootstrap itself - they're consumed by operators that start after the control plane is up.
- Use inline JWKS on the OIDC provider instead of the bucket
The WorkloadIdentityPoolProvider has a JwksJson field. When set, GCP validates tokens against that inline JWKS instead of fetching from jwks_uri. This eliminates the need for the bucket to serve keys.json at all, though we'd still need the public key from somewhere.
This could combine with option 2 - create the provider without JwksJson initially, then patch it post-bootstrap once the signing key exists.
Summary
workloadIdentityFederationfield to the GCP platform install-config with optionalpoolIDandproviderIDfieldsexternal_accountcredential manifests for STS token exchange instead of static service account keysCredentialsModetoManualwhen WIF is configuredFixes: https://issues.redhat.com/browse/CORS-4470
Generated with Claude Code
Summary by CodeRabbit
New Features
Behavioral
Tests