Add another wait for install and improve wipe step#80803
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
WalkthroughThe IPI install script gains a two-stage wait-for-install flow: after the first ChangesIPI Install Extended Retry with Diagnostics
Post-Wipe Per-Host PDU Failure Tracking
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sgoveas The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.sh`:
- Line 301: The echo statement at line 301 contains a misleading message that
says "Second install-complete attempt timed out" but this is actually checking
the first wait failure for the initial `oinst create cluster` process, not the
second attempt. Update the message to accurately reflect that this is the first
install-complete wait failure, and clarify that it is checking the state after
the initial cluster creation command failed, not a retry attempt. The actual
second attempt occurs later at the subsequent `oinst wait-for install-complete`
command around line 328.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 06a17404-74ad-42f3-aa32-4c7361f6e561
📒 Files selected for processing (3)
ci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-commands.shci-operator/step-registry/baremetal/lab/ipi/install/baremetal-lab-ipi-install-ref.yamlci-operator/step-registry/baremetal/lab/post/wipe/baremetal-lab-post-wipe-commands.sh
Internal checks are timing out recently. Adding another wait may help the install complete after running for a while. This change also checks the state of the cluster before waiting for a third time to avoid unnecessary waiting if the cluster install has definetly failed. Increase the timeout to 4hours 30mins to accomodate the new added wait Update wipe script to make sure it runs clean without issues
|
[REHEARSALNOTIFIER]
A total of 654 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
@sgoveas: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit
This pull request improves OpenShift CI’s baremetal lab reliability by making IPI installation waits more resilient (with better diagnostics on timeouts) and by strengthening the post-test wipe/power-off cleanup to provide clearer per-host failure reporting.
Installation Step Enhancements (
baremetal-lab-ipi-install)oinst wait-for install-completein the background, capture its PID, and wait on that PID.NO_END_TIME=trueflow:oinst wait-for install-completeand, on failure, emit detailed diagnostics by dumping nodes and cluster operators before exiting.NO_END_TIME) path, updateCLUSTER_INSTALL_END_TIME.ADDITIONAL_WORKERS_DAY2isfalse.Wipe Step Improvements (
baremetal-lab-post-wipe)/tmp/failed.<port>) instead of using a single shared failure file./tmp/failedand/tmp/failed.*files./tmp/failed.*into a consolidated/tmp/failedfile for subsequent failure checking.reset_pduis updated to accept a host argument and pass it through topdu_reboot.sh, enabling per-host failure recording.reset_hostnow callsreset_pduwith both the PDU URI and host, and writes failures to/tmp/failed.${bmc_forwarded_port}.Overall, these changes improve visibility and fault isolation during baremetal installs and post-test cleanup, especially when timeouts or PDU/BMC connectivity issues occur.