
Add opt-in preflight admission readiness check before kubeadm upgrade apply #13139

@felipe88alves

Description

What would you like to be added

I'd like to add an opt-in preflight check that verifies the API server's admission pipeline is ready before kubeadm upgrade apply runs on the first control plane node.

We run kubespray across multiple production clusters and have been hitting intermittent [ERROR CreateJob] failures during upgrades. Our clusters use ValidatingAdmissionPolicy with a CRD-based paramKind, and after the API server restarts mid-upgrade, the VAP param controller's RESTMapper takes ~30 seconds to refresh its CRD discovery cache (kubernetes/kubernetes#124237). During that window, kubeadm upgrade apply's preflight health-check Job gets rejected by the stale admission controller.

Here's the relevant error from a real upgrade run (sanitized):

TASK [kubernetes/control-plane : Kubeadm | Upgrade first control plane node]
...
[ERROR CreateJob]: could not create Job "upgrade-health-check-xxxxx" in the namespace
"kube-system": jobs.batch "upgrade-health-check-xxxxx" is forbidden:
ValidatingAdmissionPolicy '<vap-policy>' denied request: failed to configure policy:
failed to find resource referenced by paramKind: '<crd-group/version, Kind=ParamCRD>'

The VAP param controller can't resolve the CRD paramKind reference because its RESTMapper cache is stale after the API server restart. The Job itself is fine; the admission pipeline is temporarily broken.

The existing check-api.yml only checks /healthz, which returns 200 while the admission pipeline is still broken. A potential fix would be to add an opt-in task between check-api.yml and kubeadm upgrade apply that dry-runs a batch/v1 Job (the same resource kind kubeadm creates for its preflight) through the server-side admission pipeline, retrying until admission controllers are ready. This would follow existing patterns like the configurable health-check approach from #12448/PR #12452.
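A rough sketch of what that opt-in task could look like, to make the proposal concrete. The task name, the gating variable (`upgrade_admission_readiness_check`), the probe Job name, and the retry/delay values below are all illustrative, not existing kubespray names; `{{ bin_dir }}` is assumed to resolve to the kubectl install path as elsewhere in kubespray:

```yaml
# Illustrative only: variable and task names are hypothetical, not kubespray's.
- name: Kubeadm | Wait for admission pipeline readiness (opt-in)
  command: >-
    {{ bin_dir }}/kubectl create job admission-readiness-probe
    --namespace=kube-system
    --image=registry.k8s.io/pause:3.9
    --dry-run=server
  register: admission_probe
  until: admission_probe.rc == 0
  retries: 12
  delay: 5
  changed_when: false
  when: upgrade_admission_readiness_check | default(false)
```

Because the Job is created with `--dry-run=server`, it passes through the full server-side admission chain (including VAP param resolution) but is never persisted, so the probe leaves nothing behind in the cluster.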

I have a working implementation on our fork and would be happy to submit a PR.

Why is this needed

I've been hitting this failure intermittently across our production clusters during Kubernetes upgrades with kubespray. Our clusters use ValidatingAdmissionPolicy with CRD-based paramKind references, and the upgrade fails on the first control plane node, where kubeadm upgrade apply is called.

The root cause is a timing issue in the upgrade flow:

  1. Tasks in kubeadm-setup.yml (audit policy, admission control configs, tracing) and kubeadm-fix-apiserver.yml (component kubeconfigs) notify the Restart apiserver / restart kubelet handlers
  2. The handlers fire and restart the API server
  3. The VAP param controller's RESTMapper cache is invalidated and takes ~30 seconds to refresh (kubernetes/kubernetes#124237 - still open, no fix merged; related: kubernetes/kubernetes#122658)
  4. check-api.yml passes because /healthz returns 200 while the admission pipeline is still stale
  5. kubeadm upgrade apply creates its preflight health-check Job, which gets rejected with HTTP 422
  6. Upgrade fails with [ERROR CreateJob]
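In shell terms, steps 4 and 5 mean the readiness signal has to be stronger than /healthz: a server-side dry-run of a batch/v1 Job exercises the real admission pipeline and fails exactly while the RESTMapper cache is stale. A minimal sketch of the retry probe (function name, retry counts, and the probe Job name are illustrative):

```shell
#!/usr/bin/env bash
# Illustrative helper; names and defaults are not from kubespray.
# Retries a probe command until it succeeds or retries are exhausted.
retry_admission_probe() {
  local retries="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= retries; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# --dry-run=server sends the Job through the full admission chain (including
# VAP param resolution) without persisting it, so this fails during the
# RESTMapper staleness window and succeeds once the cache refreshes.
# Guarded so the sketch is a no-op without kubectl and a reachable cluster.
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  retry_admission_probe 12 5 \
    kubectl create job admission-readiness-probe \
      --namespace=kube-system \
      --image=registry.k8s.io/pause:3.9 \
      --dry-run=server
fi
```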

In this case, we traced the failure to the VAP param controller failing to resolve a custom CRD paramKind after the restart. API server audit logs showed the first 422 error ~30 seconds before kubeadm upgrade apply ran, consistent with the RESTMapper refresh window.

This is different from #12006/#12009, which fixed secondary control plane nodes by switching to kubeadm upgrade node (PR #12015); our issue is on the first control plane node, where kubeadm upgrade apply is the correct command. It's also different from #12448 (PR #12452), which made the /healthz retry count configurable. Here the /healthz check passes fine; it's the admission pipeline that's broken.

I think anyone running VAP with CRD paramKind references will eventually hit this during upgrades. The preflight check adds a direct probe of the admission pipeline to catch the staleness window. It should be deprecated once kubernetes/kubernetes#124237 is fixed in all supported Kubernetes versions.

Metadata


Labels

kind/feature: Categorizes issue or PR as related to a new feature.
