Skip to content

Add another wait for install and improve wipe step#80803

Open
sgoveas wants to merge 1 commit into
openshift:mainfrom
sgoveas:fixes
Open

Add another wait for install and improve wipe step#80803
sgoveas wants to merge 1 commit into
openshift:mainfrom
sgoveas:fixes

Conversation

@sgoveas

@sgoveas sgoveas commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

This pull request improves OpenShift CI’s baremetal lab reliability by making IPI installation waits more resilient (with better diagnostics on timeouts) and by strengthening the post-test wipe/power-off cleanup to provide clearer per-host failure reporting.

Installation Step Enhancements (baremetal-lab-ipi-install)

  • Timeout increase: The step timeout is extended from 3h30m0s to 4h30m0s.
  • More robust “wait for install-complete” handling: The installer wait is updated to:
    • Run the first oinst wait-for install-complete in the background, capture its PID, and wait on that PID.
    • If the first wait times out, enter a NO_END_TIME=true flow:
      • Verify whether the Kubernetes API is reachable.
      • Collect cluster-operator readiness counts (ready vs total) and list not-ready operators to determine whether the cluster appears to be progressing.
      • If progressing, perform a second oinst wait-for install-complete and, on failure, emit detailed diagnostics by dumping nodes and cluster operators before exiting.
      • If the API is unreachable or the cluster is stalled, exit with an error.
    • When using the extended (NO_END_TIME) path, update CLUSTER_INSTALL_END_TIME.
    • Keep existing node readiness waiting behavior, but ensure it runs after the extended timeout/diagnostics logic when ADDITIONAL_WORKERS_DAY2 is false.

Wipe Step Improvements (baremetal-lab-post-wipe)

  • Per-host PDU failure tracking:
    • The wipe step now records PDU-related failures into separate temp files keyed by the BMC forwarded port (e.g., /tmp/failed.<port>) instead of using a single shared failure file.
    • At the start of the wipe, it removes any existing /tmp/failed and /tmp/failed.* files.
    • After all hosts are processed, it concatenates /tmp/failed.* into a consolidated /tmp/failed file for subsequent failure checking.
  • Function signature change for correct host scoping:
    • reset_pdu is updated to accept a host argument and pass it through to pdu_reboot.sh, enabling per-host failure recording.
    • reset_host now calls reset_pdu with both the PDU URI and host, and writes failures to /tmp/failed.${bmc_forwarded_port}.
  • Logging fix:
    • Corrected the “Resetting BMC…” log line to print the computed host number.

Overall, these changes improve visibility and fault isolation during baremetal installs and post-test cleanup, especially when timeouts or PDU/BMC connectivity issues occur.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 75363806-538c-483d-8075-67cf261935e9

📥 Commits

Reviewing files that changed from the base of the PR and between 3563834 and 4d0e381.

📒 Files selected for processing (3)
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.sh
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-ref.yaml
  • ci-operator/step-registry/baremetal/lab/post/wipe/baremetal-lab-post-wipe-commands.sh
🚧 Files skipped from review as they are similar to previous changes (3)
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-ref.yaml
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.sh
  • ci-operator/step-registry/baremetal/lab/post/wipe/baremetal-lab-post-wipe-commands.sh

Walkthrough

The IPI install script gains a two-stage wait-for-install flow: after the first oinst wait-for install-complete times out, it checks Kubernetes API reachability and cluster operator readiness counts before running a final wait attempt with node and operator diagnostics. The step timeout is extended to 4h30m. The post-wipe script updates reset_pdu to forward a host argument to pdu_reboot.sh, switches failure tracking from a shared file to per-port temp files, and consolidates them after all hosts are processed.

Changes

IPI Install Extended Retry with Diagnostics

Layer / File(s) Summary
Timeout update and linting directive
baremetal-lab-ipi-install-ref.yaml, baremetal-lab-ipi-install-commands.sh
Step timeout extends from 3h30m0s to 4h30m0s. Adds # shellcheck source=/dev/null directive for linting context.
PID-based install wait with cluster operator diagnostics
baremetal-lab-ipi-install-commands.sh
Captures install_pid and waits on it; on timeout sets NO_END_TIME=true, checks Kubernetes API responsiveness, computes ready/not-ready cluster operator counts and lists not-ready operators, then runs a final oinst wait-for install-complete whose failure triggers node and cluster-operator diagnostics dump before exit. Updates CLUSTER_INSTALL_END_TIME when extended timeout is in effect.
Node readiness wait sequencing
baremetal-lab-ipi-install-commands.sh
Moves wait_for_nodes_readiness invocation to occur after the extended install-complete timeout and diagnostics logic.

Post-Wipe Per-Host PDU Failure Tracking

Layer / File(s) Summary
reset_pdu host argument and call sites
baremetal-lab-post-wipe-commands.sh
reset_pdu function now reads a host argument ($2) and forwards it to pdu_reboot.sh. bmc_reset log message corrected to reference the computed host variable. Both the unreachable-PDU and routine-reset paths in reset_host call reset_pdu with host.
Per-host temp file scheme and consolidation
baremetal-lab-post-wipe-commands.sh
Failure recording within reset_host writes to per-port temp files (/tmp/failed.${bmc_forwarded_port}) instead of the shared /tmp/failed file. Wipe preamble removes any existing /tmp/failed* files before host processing. After all hosts are processed, concatenates /tmp/failed.* files into /tmp/failed for the final power-off failure check.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error baremetal-lab-post-wipe-commands.sh logs internal infrastructure data: BMC addresses and port numbers are written to /tmp/failed.* files and output to logs via 'cat /tmp/failed' on line 220, exposi... Redact bmc_address and bmc_forwarded_port before logging to /tmp/failed, or log only a generic host identifier instead of actual network infrastructure details.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the two main changes: adding enhanced wait logic for install completion and improving the wipe step with better PDU failure handling and host tracking.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains only shell scripts and YAML config files; no Ginkgo test files to check. Custom check not applicable.
Test Structure And Quality ✅ Passed PR modifies shell scripts and YAML config files only. No Ginkgo test code present; check is not applicable.
Microshift Test Compatibility ✅ Passed PR contains only shell scripts and YAML configuration changes; no Ginkgo e2e tests are added, so MicroShift compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are present in this PR. Modified files are CI infrastructure configuration (bash scripts and YAML configs), not test code.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only CI infrastructure scripts (baremetal lab provisioning/cleanup) and step configuration—no deployment manifests, operator code, or scheduling constraints that assume HA topology are...
Ote Binary Stdout Contract ✅ Passed PR modifies only bash scripts and YAML config files, not Go code. OTE Binary Stdout Contract applies to Go binaries only, not shell/YAML infrastructure code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR contains no Ginkgo e2e tests. Changes are limited to shell scripts and YAML config files for CI baremetal lab infrastructure; check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptographic usage found. The PR modifies CI/CD automation scripts (baremetal installation, infrastructure cleanup, and configuration YAML). No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, cu...
Container-Privileges ✅ Passed The PR modifies CI operator step registry definitions and bash scripts—not Kubernetes manifests or container specs. No container privilege escalation settings (privileged: true, hostPID, hostNetwor...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jadhaj and pamoedom June 19, 2026 23:06
@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sgoveas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.sh`:
- Line 301: The echo statement at line 301 contains a misleading message that
says "Second install-complete attempt timed out" but this is actually checking
the first wait failure for the initial `oinst create cluster` process, not the
second attempt. Update the message to accurately reflect that this is the first
install-complete wait failure, and clarify that it is checking the state after
the initial cluster creation command failed, not a retry attempt. The actual
second attempt occurs later at the subsequent `oinst wait-for install-complete`
command around line 328.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 06a17404-74ad-42f3-aa32-4c7361f6e561

📥 Commits

Reviewing files that changed from the base of the PR and between f71d69d and 3563834.

📒 Files selected for processing (3)
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.sh
  • ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-ref.yaml
  • ci-operator/step-registry/baremetal/lab/post/wipe/baremetal-lab-post-wipe-commands.sh

Internal checks are timing out recently. Adding another wait may help
the install complete after running for a while. This change also checks
the state of the cluster before waiting for a third time to avoid
unnecessary waiting if the cluster install has definetly failed.
Increase the timeout to 4hours 30mins to accomodate the new added wait

Update wipe script to make sure it runs clean without issues
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@sgoveas: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-4.21-nightly-node-density-heavy-baremetal-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-4.20-nightly-control-plane-baremetal-multi-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-4.22-nightly-node-density-heavy-baremetal-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-main-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-5.1-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-5.0-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.23-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.22-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.21-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.20-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.19-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.18-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.17-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.16-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.15-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-openshift-tests-private-release-4.14-debug-disasterrecovery-metal-upi openshift/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-main-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-5.1-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-5.0-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-4.23-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-4.22-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-4.21-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-priv-openshift-tests-private-release-4.20-debug-disasterrecovery-metal-upi openshift-priv/openshift-tests-private presubmit Registry content changed
pull-ci-openshift-eng-agent-qe-infra-extended-support-4.14-amd64-nightly-abi-bm-sno-agent-ipv4-static-connected-fips openshift-eng/agent-qe-infra presubmit Registry content changed
pull-ci-openshift-eng-agent-qe-infra-extended-support-4.12-amd64-nightly-abi-bm-sno-agent-ipv4-static-connected-fips openshift-eng/agent-qe-infra presubmit Registry content changed

A total of 654 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@sgoveas: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant