Migrating Ipfailover test cases to images repo by melvinjoseph86 · Pull Request #243 · openshift/images

melvinjoseph86 · 2026-06-25T18:57:56Z

Summary by CodeRabbit

New Features
- Added a new ipfailover test extension with end-to-end coverage for deployment, failover behavior, VIP changes, and configuration validation.
- Built and packaged the test extension so it can be run from a generated binary/artifact.
Documentation
- Added setup and run instructions for the ipfailover test extension, including suite groupings and execution constraints.
Chores
- Updated ignore rules to keep generated test binaries out of source control.

openshift-ci · 2026-06-25T18:58:12Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-06-25T19:01:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: melvinjoseph86
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

melvinjoseph86 · 2026-06-25T19:02:41Z

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release/pull/81103

openshift-ci · 2026-06-25T19:02:49Z

@melvinjoseph86: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message.


unable to get additional pr info from string: openshift/release/pull/81103: string: openshift/release/pull/81103 doesn't match expected format: org/repo#number

Please contact an administrator to resolve this issue.

melvinjoseph86 · 2026-06-25T19:04:27Z

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

openshift-ci · 2026-06-25T19:04:36Z

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

melvinjoseph86 · 2026-06-25T19:18:22Z

/payload-job-with-prs periodic-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

openshift-ci · 2026-06-25T19:18:36Z

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

redhat-chai-bot · 2026-06-25T19:46:36Z

Investigation: Deserialization Error Root Cause

I investigated the Deserialization Error: invalid character 'I' looking for beginning of value failure reported in #wg-openshift-tests-extension. Here are the findings:

Root Cause: `klog.SetOutput(g.GinkgoWriter)` timing bug

The init() function in test/e2e/ipfailover.go captures a reference to g.GinkgoWriter at init-time, but at that point GinkgoWriter is still initialized to os.Stdout (set in Ginkgo's own init() in core_dsl.go). Later, OTE's configureGinkgo() replaces the GinkgoWriter variable with a new Writer pointing to os.Stderr, but klog still holds the old Writer object that writes to stdout.

This means all klog INFO lines (I0625...) go to stdout for the entire process lifetime, contaminating the JSON output stream.

The same timing issue affects rest.SetDefaultWarningHandler(rest.NewWarningWriter(g.GinkgoWriter, ...)).

Recommended fix for this PR

Replace the GinkgoWriter references in init() with direct os.Stderr:

func init() {
    // Write directly to stderr — no timing dependency on GinkgoWriter
    klog.SetOutput(os.Stderr)

    // Redirect REST client warnings directly to stderr
    rest.SetDefaultWarningHandler(
        rest.NewWarningWriter(os.Stderr, rest.WarningWriterOptions{}),
    )

    exutil.WithCleanup(func() {})
}

The flag.Set("ginkgo.output-interceptor-mode", "none") line can also be removed — OTE uses NoopOutputInterceptor{} unconditionally, so the flag has no effect.

Framework-side fix

A structural fix is also being prepared for openshift-eng/openshift-tests-extension to make newTestResultFromOutput() resilient to non-JSON stdout content (similar to how origin's extractJSON() works). This will prevent future occurrences regardless of the contamination source.

Investigation triggered from #wg-openshift-tests-extension discussion

coderabbitai · 2026-06-29T18:11:14Z

Walkthrough

Adds a new openshift-tests-extension (OTE) for ipfailover keepalived tests, migrated from openshift-tests-private. Includes a Go module, Makefile, multi-stage Dockerfile build stage, a test entrypoint with three suite registrations, six e2e test cases with utility helpers, an OpenShift deployment template, and a README.

Changes

ipfailover OTE test extension

Layer / File(s)	Summary
Build tooling and Dockerfile packaging `ipfailover/keepalived/tests/Makefile`, `ipfailover/keepalived/Dockerfile`, `.gitignore`	Makefile defines `build`/`clean`/`help` targets producing `bin/ipfailover-tests-ext`. Dockerfile adds a `test-extension-builder` stage that compiles and archives the binary into `ipfailover-test-extension.tar.gz`, then copies it into `/usr/bin/` in the final image. `.gitignore` excludes the built binary path.
Go module and dependencies `ipfailover/keepalived/tests/go.mod`	Declares module `github.com/openshift/ipfailover-tests-extension` with Go 1.25.0, direct deps on Ginkgo/Gomega, openshift-tests-extension, origin, Cobra, and Kubernetes libs, plus a `replace` block redirecting `k8s.io/*` and OpenTelemetry packages to OpenShift forks at pinned commits.
Test entrypoint and suite registration `ipfailover/keepalived/tests/cmd/main.go`	`main()` initializes flags, builds and filters Ginkgo specs to local `e2e/` sources, applies platform labels, sets blocking lifecycle, and registers specs into an `ipfailover` extension with Cobra commands. `registerSuites` registers serial-conformance, disruptive, and all suites.
Deployment template testdata `ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml`	OpenShift `Template` defining an ipfailover `Deployment` with privileged/hostNetwork settings, liveness probe for `keepalived`, env var wiring, volume mounts, and parameters for name, image, VIPs, interface, monitor port, and VRRP offset.
e2e utility helpers `ipfailover/keepalived/tests/e2e/util.go`	Adds `ipfailoverDescription` struct, node/pod discovery, pod readiness polling, MASTER/BACKUP convergence detection, VIP ownership lookup, image pull spec resolution, template processing with retry, IP stack classification, unicast peer configuration, namespace privilege setup, and exec/grep helpers.
e2e test cases `ipfailover/keepalived/tests/e2e/ipfailover.go`	Six Ginkgo tests under `[OTP][sig-network-edge] Network_Edge`: basic MASTER/backup deploy, VIP failover on pod deletion, full env-var config verification, large VIP range configuration, preemption strategy validation, and VRRP ID exclusion check. Suite init redirects klog/REST warnings and sets up platform/network skip logic in `BeforeEach`.
README `ipfailover/keepalived/tests/README.md`	Documents the OTE extension: suite groupings, concurrency requirements, run commands, expected durations, and directory layout.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (4 errors, 5 warnings)

Check name	Status	Explanation	Resolution
Stable And Deterministic Test Names	❌ Error	The new Ginkgo titles embed author names and ticket IDs (e.g. 'Author:mjoseph-...-41027...'), making them brittle and overly specific.	Rename specs to short static descriptions (e.g. 'should fail over to standby when master fails') and keep author/ticket info outside the title.
Ote Binary Stdout Contract	❌ Error	init() in e2e/ipfailover.go wires klog/rest warnings to ginkgo.GinkgoWriter before OTE rebinds it; Ginkgo starts that writer on os.Stdout, so JSON stdout can be contaminated.	Route those logs to os.Stderr (or defer the wiring until after OTE rebinds GinkgoWriter) so the main binary never emits non-JSON stdout.
Container-Privileges	❌ Error	New ipfailover template sets `hostNetwork: true` and `privileged: true` at pod/container scope.	Remove or narrowly justify these privileges; add the missing hardening settings if the workload still requires host access.
No-Sensitive-Data-In-Logs	❌ Error	New e2e helpers log full node/pod details and pod logs via e2e.Logf, which can expose internal hostnames and cluster data.	Redact or remove the full dumps; log only counts/selected fields, and guard any verbose pod-log/describe output behind a debug flag.
Docstring Coverage	⚠️ Warning	Docstring coverage is 4.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	The suite lacks teardown hooks, some It blocks mix multiple failover behaviors, and many assertions are bare `Expect(err).NotTo(HaveOccurred())`.	Split broad It blocks, add cleanup/DeferCleanup for created resources, and add descriptive failure messages to all error assertions.
Microshift Test Compatibility	⚠️ Warning	FAIL: the new Ginkgo suite lacks MicroShift skips and uses template.openshift.io/v1 plus a 2-worker/hostNetwork/privileged Deployment, which MicroShift doesn't support.	Add a [Skipped:MicroShift] or runtime MicroShift guard, and remove/replace the Template+HA assumptions (or skip on MicroShift) before running there.
Single Node Openshift (Sno) Test Compatibility	⚠️ Warning	The new e2e suite requires 2 worker nodes and exercises failover between pods, with no SingleReplicaTopology/SNO guard present.	Add an explicit SNO skip/label (or remove the 2-worker and cross-node failover assumptions) so the suite won’t run on Single Node OpenShift.
Ipv6 And Disconnected Network Test Compatibility	⚠️ Warning	New e2e tests hardcode IPv4-only values and dot-octet parsing (e.g. 192.168.123.123, 192.168.254.1-255, strings.Split on ".").	Replace fixed IPv4 literals with family-aware values, avoid octet parsing/IPv4-only host formatting, and use cluster IP-family detection or explicit IPv6 skips.

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: migrating ipfailover test cases into the images repo.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Topology-Aware Scheduling Compatibility	✅ Passed	The new worker-only 2-replica manifest is gated by platform and a ≥2 Linux worker-node check before creation, and no anti-affinity/spread/PDB or control-plane selectors were added.
No-Weak-Crypto	✅ Passed	Changed ipfailover files contain no weak-crypto APIs, custom crypto, or secret/token comparisons; only unrelated secret-pull usage appears.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 14

🧹 Nitpick comments (1)

ipfailover/keepalived/Dockerfile (1)
5-5: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Copy only the inputs this builder stage actually needs.

COPY . . pulls the whole repository into the build context even though this stage only builds ipfailover/keepalived/tests. That widens the accidental secret/file exposure surface and invalidates the cache on unrelated changes.
♻️ Proposed fix
 WORKDIR /go/src/github.com/openshift/ipfailover
-COPY . .
+COPY ipfailover/keepalived/tests ./ipfailover/keepalived/tests
 RUN cd ipfailover/keepalived/tests && \
     make build && \
As per path instructions, COPY specific files, not entire context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` at line 5, The builder stage in the
keepalived Dockerfile is over-copying the entire repository with COPY . ., even
though it only needs the inputs required to build ipfailover/keepalived/tests.
Replace the broad copy with explicit copies of the minimal needed files and
directories so the stage stays narrowly scoped, avoids exposing unrelated files,
and preserves cache reuse; use the Dockerfile stage and the tests build target
as the main references when locating the change.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ipfailover/keepalived/tests/cmd/main.go`:
- Around line 25-26: Move the framework.AfterReadingAllFlags call so it runs
after root.Execute() has parsed flags, because invoking it before parsing leaves
framework.TestContext at default values and ignores kube/e2e overrides. Update
the startup flow in main so util.InitStandardFlags still happens first, then
flags are parsed via root.Execute, and only then
framework.AfterReadingAllFlags(&framework.TestContext) is called.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go`:
- Around line 168-171: After deleting the current primary, do not immediately
recompute MASTER/BACKUP from the pod list; in the ipfailover e2e flow around
getPodListByLabel, ensureIpfailoverMasterBackup, waitForPrimaryPod, and
futurePrimaryPod[0], add a replica-aware wait until the replacement pod set is
fully available and stable before reading pod names again. Apply the same fix in
both affected flows so the recalculation happens only after the rollout
completes and the pod list reflects the final state.
- Around line 75-84: The IPv6 single-stack check in ipfailover e2e is too narrow
because it only inspects the first clusterNetwork CIDR and can misclassify
IPv6-first dual-stack clusters. Update the logic in the test around the current
oc-based network lookup to reuse checkIPStackType(oc) or otherwise evaluate the
full status.clusterNetwork CIDR list before calling g.Skip, so only true IPv6
single-stack clusters are skipped.
- Around line 69-73: The remote-worker skip check in the ipfailover e2e test is
using the wrong node selector, so it can miss remote-worker clusters. Update the
logic in the test block around the remote-worker gate to query
`node.openshift.io/remote-worker` directly, matching the approach used by
`exactNodeDetails()`, and then keep the skip decision based on that targeted
result instead of searching generic `nodes -l kubernetes.io/hostname` output.

In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml`:
- Around line 28-75: The openshift-ipfailover container definition in the
ipfailover manifest is missing an explicit resource envelope. Add cpu and memory
requests and limits to the openshift-ipfailover container alongside its existing
ports, securityContext, and env settings so it conforms to the resource policy
for every container.
- Around line 14-27: The ipfailover test template hard-codes a 2-replica
deployment with a worker-only node selector, which assumes a standard
multi-worker topology. Update the ipfailover.yaml template to derive replicas
from the number of schedulable nodes and make the e2e test setup topology-aware
before rendering it, so it can handle SingleReplica, DualReplica, and
arbiter-style clusters. Use the existing ipfailover template fields and related
test setup logic to cap replica count to available schedulable nodes and avoid
assuming node-role.kubernetes.io/worker is always present.
- Around line 23-44: The ipfailover test pod spec is effectively fully
host-privileged and needs to be tightened in the openshift-ipfailover workload
definition. Update the container and pod security settings to follow the
security guidance by adding non-root and no-escalation constraints, making the
root filesystem read-only, dropping all capabilities unless explicitly required,
and disabling service account token automount unless the test truly needs it.
Keep host access only if required for the test, but scope it to the minimum
necessary privileges and document any exception in the ipfailover template
rather than leaving the default manifest overly permissive.
- Around line 69-75: The keepalived test pod spec currently defines only
livenessProbe, which lets the pod report Ready too early and makes the e2e flow
race-prone. Add a readinessProbe alongside the existing livenessProbe in the
ipfailover manifest so readiness only flips after keepalived is actually
prepared to serve, keeping ensurePodWithLabelReady in sync with the downstream
assertions in ipfailover.go.
- Around line 22-25: Remove the invalid top-level privileged field from the pod
spec in the ipfailover test manifest. The fix is to delete
spec.template.spec.privileged from the ipfailover.yaml fixture and keep the
existing containers[].securityContext.privileged setting as the only privileged
declaration. Use the ipfailover pod template structure under spec.template.spec
to locate the change.

In `@ipfailover/keepalived/tests/e2e/util.go`:
- Around line 319-331: The helper waitForPodWithLabelReady currently only checks
Ready status, so it can succeed with a single surviving pod during rollout.
Update waitForPodWithLabelReady to also verify the number of pods returned by
the label selector in addition to readiness, and keep polling until both the
expected replica count and Ready condition are met. Use the existing wait.Poll
logic and the kubectl get call in waitForPodWithLabelReady to enforce the count
before returning success.
- Around line 29-31: The worker-counting logic in the e2e helper is parsing `oc
get nodes` table text with `strings.Count`, which double-counts `Ready` inside
`NotReady` and can underreport healthy nodes. Update the node readiness check in
the helper that builds `linuxWorkerDetails` and `nodeCount` to use structured
node data from the API or JSON output, inspecting node conditions and
unschedulable state instead of matching substrings in human-readable output.
- Around line 240-249: The unicast peer list in the Nutanix/none branch is built
from the first node address, but .status.addresses[0] is not guaranteed to be an
InternalIP. Update the logic in the keepalived e2e util path that calls
oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node
addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS,
and keep the existing validation that enough worker IPs were found.
- Around line 160-176: The getVipOwnerPod helper is treating the literal
“command terminated with exit code 1” as a VIP miss, which can mask real oc exec
failures and return the wrong master or an empty result. Update the logic around
oc.AsAdmin().WithoutNamespace().Run("exec").Args(...).Output() to inspect err
first, only consider exit code 1 from the command as “VIP not present,” and
explicitly retry or fail for any other error conditions; keep the selection
logic in getVipOwnerPod and its primaryNode assignment tied to successful
command results.

In `@ipfailover/keepalived/tests/go.mod`:
- Line 3: The module version in ipfailover/keepalived/tests/go.mod is newer than
the Go toolchain used by the ipfailover/keepalived/Dockerfile builder stage, so
the build may auto-download a different compiler. Update the builder image in
Dockerfile to a Go 1.25-based image, or reduce the go directive in go.mod to
match the supported compiler version. Make sure the versions stay aligned across
go.mod and the Dockerfile so the tests build with the image’s local toolchain.

---

Nitpick comments:
In `@ipfailover/keepalived/Dockerfile`:
- Line 5: The builder stage in the keepalived Dockerfile is over-copying the
entire repository with COPY . ., even though it only needs the inputs required
to build ipfailover/keepalived/tests. Replace the broad copy with explicit
copies of the minimal needed files and directories so the stage stays narrowly
scoped, avoids exposing unrelated files, and preserves cache reuse; use the
Dockerfile stage and the tests build target as the main references when locating
the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

coderabbitai · 2026-06-29T18:11:17Z

+	util.InitStandardFlags()
+	framework.AfterReadingAllFlags(&framework.TestContext)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -C3 'InitStandardFlags\(|AfterReadingAllFlags\(' ipfailover/keepalived/tests/cmd/main.go rg -n -C3 --iglob 'main.go' 'InitStandardFlags\(|AfterReadingAllFlags\(' .

Repository: openshift/images

Length of output: 972

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== main.go ==" cat -n ipfailover/keepalived/tests/cmd/main.go | sed -n '1,180p' echo echo "== other AfterReadingAllFlags usages ==" rg -n -C2 'AfterReadingAllFlags\(' . || true echo echo "== root.Execute call site ==" rg -n -C3 'root\.Execute\(\)|Execute\(\)' ipfailover/keepalived/tests/cmd/main.go

Repository: openshift/images

Length of output: 6376

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== framework docs references ==" rg -n -C2 'AfterReadingAllFlags' . || true echo echo "== related entrypoints ==" rg -n -C2 'util.InitStandardFlags\(|logs.InitLogs\(|framework.AfterReadingAllFlags\(' . --glob 'main.go' || true

Repository: openshift/images

Length of output: 2173

Move framework.AfterReadingAllFlags after flag parsing. Calling it before root.Execute() leaves framework.TestContext on defaults, so kube/e2e flag overrides are ignored.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/cmd/main.go` around lines 25 - 26, Move the framework.AfterReadingAllFlags call so it runs after root.Execute() has parsed flags, because invoking it before parsing leaves framework.TestContext at default values and ignores kube/e2e overrides. Update the startup flow in main so util.InitStandardFlags still happens first, then flags are parsed via root.Execute, and only then framework.AfterReadingAllFlags(&framework.TestContext) is called.

coderabbitai · 2026-06-29T18:11:17Z

+		g.By("check the cluster has remote worker profile")
+		remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output()
+		if strings.Contains(remoteWorkerDetails, "remote-worker") {
+			g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")
+		}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The remote-worker gate is checking the wrong selector.

-l kubernetes.io/hostname returns every node, then this relies on "remote-worker" appearing in the rendered table output. That misses clusters where the node name does not contain that substring. Query node.openshift.io/remote-worker directly, like exactNodeDetails() already does.

Proposed fix

- remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output() - if strings.Contains(remoteWorkerDetails, "remote-worker") { + remoteWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node.openshift.io/remote-worker").Output() + o.Expect(err).NotTo(o.HaveOccurred()) + if !strings.Contains(remoteWorkerDetails, "No resources found") { g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile") }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

g.By("check the cluster has remote worker profile")

remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output()

if strings.Contains(remoteWorkerDetails, "remote-worker") {

g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")

}

g.By("check the cluster has remote worker profile")

remoteWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node.openshift.io/remote-worker").Output()

o.Expect(err).NotTo(o.HaveOccurred())

if !strings.Contains(remoteWorkerDetails, "No resources found") {

g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 69 - 73, The remote-worker skip check in the ipfailover e2e test is using the wrong node selector, so it can miss remote-worker clusters. Update the logic in the test block around the remote-worker gate to query `node.openshift.io/remote-worker` directly, matching the approach used by `exactNodeDetails()`, and then keep the skip decision based on that targeted result instead of searching generic `nodes -l kubernetes.io/hostname` output.

coderabbitai · 2026-06-29T18:11:18Z

+		g.By("check whether the cluster is not ipv6 single stack")
+		// Get IP stack type using oc command instead of compat_otp
+		networkOutput, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("network", "cluster", "-o=jsonpath={.status.clusterNetworkMTU}").Output()
+		if err == nil && networkOutput != "" {
+			// Simple check - if we can get network info, check for IPv6
+			cidrOutput, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("network", "cluster", "-o=jsonpath={.status.clusterNetwork[0].cidr}").Output()
+			if strings.Contains(cidrOutput, ":") && !strings.Contains(cidrOutput, ".") {
+				g.Skip("Skip as ipfailover currently doesn't support ipv6 single stack")
+			}
+		}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This skips IPv6-first dual-stack clusters as if they were single-stack.

The check only inspects status.clusterNetwork[0].cidr. If the first CIDR is IPv6 and the second is IPv4, the suite is skipped even though dual-stack is supported by the rest of this package. Reuse checkIPStackType(oc) or inspect the full CIDR list.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 75 - 84, The IPv6 single-stack check in ipfailover e2e is too narrow because it only inspects the first clusterNetwork CIDR and can misclassify IPv6-first dual-stack clusters. Update the logic in the test around the current oc-based network lookup to reuse checkIPStackType(oc) or otherwise evaluate the full status.clusterNetwork CIDR list before calling g.Skip, so only true IPv6 single-stack clusters are skipped.

coderabbitai · 2026-06-29T18:11:18Z

+		g.By("7. Verify the virtual IP is floated onto the new MASTER node")
+		newPodName1 := getPodListByLabel(oc, ns, "ipfailover=hello-openshift")
+		newMasterNode, _ := ensureIpfailoverMasterBackup(oc, ns, newPodName1)
+		waitForPrimaryPod(oc, ns, newMasterNode, virtualIP)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wait for the replacement pod set before recalculating MASTER/BACKUP.

Both flows delete the current primary and then immediately read pod names again. There is a normal window where only the surviving replica exists, so ensureIpfailoverMasterBackup() / futurePrimaryPod[0] can race the rollout and fail nondeterministically. Add a replica-aware wait between the delete and the next pod-list read.

Also applies to: 346-353

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 168 - 171, After deleting the current primary, do not immediately recompute MASTER/BACKUP from the pod list; in the ipfailover e2e flow around getPodListByLabel, ensureIpfailoverMasterBackup, waitForPrimaryPod, and futurePrimaryPod[0], add a replica-aware wait until the replacement pod set is fully available and stable before reading pod names again. Apply the same fix in both affected flows so the recalculation happens only after the rollout completes and the pod list reflects the final state.

coderabbitai · 2026-06-29T18:11:18Z

+    replicas: 2
+    selector:
+      matchLabels:
+        ipfailover: hello-openshift
+    template:
+      metadata:
+        labels:
+          ipfailover: hello-openshift
+      spec:
+        serviceAccountName: ipfailover
+        privileged: true
+        hostNetwork: true
+        nodeSelector:
+          node-role.kubernetes.io/worker: ""


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Avoid hard-coding a 2-replica, worker-only topology.

replicas: 2 plus node-role.kubernetes.io/worker assumes at least two schedulable workers exist. This template will stall on SingleReplica/DualReplica/arbiter-style topologies instead of adapting or being explicitly gated. Cap replicas to the schedulable node count and make the surrounding test logic topology-aware before rendering this template. As per coding guidelines, "check whether they introduce scheduling constraints assuming standard HA topology with 3+ control-plane nodes and dedicated workers" and "Cap replica counts to schedulable nodes."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml` around lines 14 - 27, The ipfailover test template hard-codes a 2-replica deployment with a worker-only node selector, which assumes a standard multi-worker topology. Update the ipfailover.yaml template to derive replicas from the number of schedulable nodes and make the e2e test setup topology-aware before rendering it, so it can handle SingleReplica, DualReplica, and arbiter-style clusters. Use the existing ipfailover template fields and related test setup logic to cap replica count to available schedulable nodes and avoid assuming node-role.kubernetes.io/worker is always present.

Source: Coding guidelines

coderabbitai · 2026-06-29T18:11:18Z

+	linuxWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node-role.kubernetes.io/worker=,kubernetes.io/os=linux").Output()
+	o.Expect(err).NotTo(o.HaveOccurred())
+	nodeCount := int(strings.Count(linuxWorkerDetails, "Ready")) - (int(strings.Count(linuxWorkerDetails, "SchedulingDisabled")) + int(strings.Count(linuxWorkerDetails, "NotReady")))


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop counting worker readiness from table text.

strings.Count(..., "Ready") also matches the Ready inside NotReady, so a NotReady,SchedulingDisabled worker is subtracted twice. With two healthy workers plus one cordoned unhealthy worker, this helper returns 1 instead of 2 and the suite is skipped even though the precondition is met. Query structured node conditions / unschedulable state instead of parsing oc get nodes text.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/util.go` around lines 29 - 31, The worker-counting logic in the e2e helper is parsing `oc get nodes` table text with `strings.Count`, which double-counts `Ready` inside `NotReady` and can underreport healthy nodes. Update the node readiness check in the helper that builds `linuxWorkerDetails` and `nodeCount` to use structured node data from the API or JSON output, inspecting node conditions and unschedulable state instead of matching substrings in human-readable output.

coderabbitai · 2026-06-29T18:11:18Z

+func getVipOwnerPod(oc *exutil.CLI, ns string, podname []string, vip string) string {
+	cmd := fmt.Sprintf("ip address |grep %s", vip)
+	var primaryNode string
+	for i := 0; i < len(podname); i++ {
+		output, err := oc.AsAdmin().WithoutNamespace().Run("exec").Args("-n", ns, podname[i], "--", "bash", "-c", cmd).Output()
+		if len(podname) == 1 && output == "command terminated with exit code 1" {
+			e2e.Failf("The given pod is not master")
+		}
+		if output == "command terminated with exit code 1" {
+			e2e.Logf("This Pod %v does not have the VIP", podname[i])
+		} else if strings.Contains(output, vip) {
+			e2e.Logf("The pod owning the VIP is %v", podname[i])
+			primaryNode = podname[i]
+			break
+		} else {
+			o.Expect(err).NotTo(o.HaveOccurred())
+		}


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Differentiate a grep miss from an oc exec failure.

Matching the literal output "command terminated with exit code 1" makes pod restarts, not-ready containers, and transport failures look the same as “VIP not present”. That can select the wrong master or return an empty pod name. Check err, only treat exit code 1 as a miss, and retry/fail on everything else. As per path instructions, **/*.go: Never ignore error returns.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/util.go` around lines 160 - 176, The getVipOwnerPod helper is treating the literal “command terminated with exit code 1” as a VIP miss, which can mask real oc exec failures and return the wrong master or an empty result. Update the logic around oc.AsAdmin().WithoutNamespace().Run("exec").Args(...).Output() to inspect err first, only consider exit code 1 from the command as “VIP not present,” and explicitly retry or fail for any other error conditions; keep the selection logic in getVipOwnerPod and its primaryNode assignment tied to successful command results.

Source: Path instructions

coderabbitai · 2026-06-29T18:11:18Z

+	if platformtype == "nutanix" || platformtype == "none" {
+		getPodListByLabel(oc, oc.Namespace(), "ipfailover=hello-openshift")
+		workerIPAddress, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node-role.kubernetes.io/worker=", "-ojsonpath={.items[*].status.addresses[0].address}").Output()
+		o.Expect(err).NotTo(o.HaveOccurred())
+		modifiedIPList := strings.Split(workerIPAddress, " ")
+		if len(modifiedIPList) < 2 {
+			e2e.Failf("There is not enough IP addresses to add as unicast peer")
+		}
+		ipList := strings.Join(modifiedIPList, ",")
+		cmd := fmt.Sprintf("OPENSHIFT_HA_UNICAST_PEERS=%v", ipList)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t assume .status.addresses[0] is the worker IP.

Kubernetes does not guarantee address order. On many clusters index 0 is Hostname or ExternalIP, so OPENSHIFT_HA_UNICAST_PEERS can be populated with unusable values and the Nutanix/none path never converges. Filter node addresses by type=="InternalIP" before building the peer list.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/util.go` around lines 240 - 249, The unicast peer list in the Nutanix/none branch is built from the first node address, but .status.addresses[0] is not guaranteed to be an InternalIP. Update the logic in the keepalived e2e util path that calls oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS, and keep the existing validation that enough worker IPs were found.

coderabbitai · 2026-06-29T18:11:18Z

+func waitForPodWithLabelReady(oc *exutil.CLI, ns, label string) error {
+	return wait.Poll(5*time.Second, 3*time.Minute, func() (bool, error) {
+		status, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("pod", "-n", ns, "-l", label, `-ojsonpath={.items[*].status.conditions[?(@.type=="Ready")].status}`).Output()
+		e2e.Logf("the Ready status of pod is %v", status)
+		if err != nil || status == "" {
+			e2e.Logf("failed to get pod status: %v, retrying...", err)
+			return false, nil
+		}
+		if strings.Contains(status, "False") {
+			e2e.Logf("the pod Ready status not met; wanted True but got %v, retrying...", status)
+			return false, nil
+		}
+		return true, nil


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

This readiness gate can pass before the replacement replica exists.

It only checks the Ready condition of the pods that already exist; it never verifies how many labeled pods are present. During a rollout, one surviving Ready pod is enough for this to return success, and callers then immediately assume a two-pod MASTER/BACKUP set. Make the helper validate the expected replica count too.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/e2e/util.go` around lines 319 - 331, The helper waitForPodWithLabelReady currently only checks Ready status, so it can succeed with a single surviving pod during rollout. Update waitForPodWithLabelReady to also verify the number of pods returned by the label selector in addition to readiness, and keep polling until both the expected replica count and Ready condition are met. Use the existing wait.Poll logic and the kubectl get call in waitForPodWithLabelReady to enforce the count before returning success.

coderabbitai · 2026-06-29T18:11:18Z

@@ -0,0 +1,350 @@
+module github.com/openshift/ipfailover-tests-extension
+
+go 1.25.0


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== go.mod ==" sed -n '1,6p' ipfailover/keepalived/tests/go.mod echo echo "== Makefile build recipe ==" sed -n '1,10p' ipfailover/keepalived/tests/Makefile echo echo "== Docker builder stage ==" sed -n '1,8p' ipfailover/keepalived/Dockerfile

Repository: openshift/images

Length of output: 910

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== other Go version declarations ==" rg -n --glob 'go.mod' --glob 'Dockerfile*' --glob 'Makefile*' 'go 1\.25|golang-1\.22|GOTOOLCHAIN=auto|GOTOOLCHAIN' . echo echo "== files in keepalived tests ==" git ls-files 'ipfailover/keepalived/tests/*' echo echo "== nearby Dockerfile references ==" rg -n 'ipfailover/keepalived/tests|test-extension-builder|golang-' ipfailover/keepalived -S

Repository: openshift/images

Length of output: 50374

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Dockerfile (full, line-numbered) ==" cat -n ipfailover/keepalived/Dockerfile echo echo "== tests go.mod ==" cat -n ipfailover/keepalived/tests/go.mod echo echo "== tests Makefile ==" cat -n ipfailover/keepalived/tests/Makefile

Repository: openshift/images

Length of output: 26386

Align the Go 1.25 module with the Go 1.22 builder image
ipfailover/keepalived/tests/go.mod:3 now requires Go 1.25.0, but ipfailover/keepalived/Dockerfile:2 still builds in ...golang-1.22.... With GOTOOLCHAIN=auto, this will pull a newer toolchain during the build instead of using the image’s compiler, making CI and offline builds dependent on network access. Bump the builder image to Go 1.25 or lower the module directive to the supported version.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ipfailover/keepalived/tests/go.mod` at line 3, The module version in ipfailover/keepalived/tests/go.mod is newer than the Go toolchain used by the ipfailover/keepalived/Dockerfile builder stage, so the build may auto-download a different compiler. Update the builder image in Dockerfile to a Go 1.25-based image, or reduce the go directive in go.mod to match the supported compiler version. Make sure the versions stay aligned across go.mod and the Dockerfile so the tests build with the image’s local toolchain.

melvinjoseph86 · 2026-06-29T18:23:12Z

/payload-with-prs openshift/release#81231

melvinjoseph86 · 2026-06-29T18:28:37Z

/test-with-release openshift/release#81231

melvinjoseph86 · 2026-06-29T18:32:49Z

/payload-job-with-prs openshift/release#81231 e2e-openstack-ipi-tests-ext

openshift-ci · 2026-06-29T18:32:52Z

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026

openshift-ci Bot mentioned this pull request Jun 25, 2026

Adding three job profiles for migrating ipfailover test cases openshift/release#81103

Closed

redhat-chai-bot mentioned this pull request Jun 25, 2026

Fix deserialization errors from non-JSON stdout contamination openshift-eng/openshift-tests-extension#74

Merged

Migrating Ipfailover test case to images repo

791cb2c

melvinjoseph86 force-pushed the ipfailover_migration_e2e branch from 36578d9 to 791cb2c Compare June 29, 2026 17:54

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

		util.InitStandardFlags()
		framework.AfterReadingAllFlags(&framework.TestContext)

		@@ -0,0 +1,350 @@
		module github.com/openshift/ipfailover-tests-extension

		go 1.25.0

Uh oh!

Conversation

melvinjoseph86 commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Jun 25, 2026

Uh oh!

openshift-ci Bot commented Jun 25, 2026

Uh oh!

melvinjoseph86 commented Jun 25, 2026

Uh oh!

openshift-ci Bot commented Jun 25, 2026

Uh oh!

melvinjoseph86 commented Jun 25, 2026

Uh oh!

openshift-ci Bot commented Jun 25, 2026

Uh oh!

melvinjoseph86 commented Jun 25, 2026

Uh oh!

openshift-ci Bot commented Jun 25, 2026

Uh oh!

redhat-chai-bot commented Jun 25, 2026

Investigation: Deserialization Error Root Cause

Root Cause: klog.SetOutput(g.GinkgoWriter) timing bug

Recommended fix for this PR

Framework-side fix

Uh oh!

coderabbitai Bot commented Jun 29, 2026

Walkthrough

Changes

Estimated code review effort

Pre-merge checks failed

❌ Failed checks (4 errors, 5 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

melvinjoseph86 commented Jun 29, 2026

Uh oh!

melvinjoseph86 commented Jun 29, 2026

Uh oh!

melvinjoseph86 commented Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

melvinjoseph86 commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Root Cause: `klog.SetOutput(g.GinkgoWriter)` timing bug