Skip to content

Migrating Ipfailover test cases to images repo#243

Draft
melvinjoseph86 wants to merge 1 commit into
openshift:masterfrom
melvinjoseph86:ipfailover_migration_e2e
Draft

Migrating Ipfailover test cases to images repo#243
melvinjoseph86 wants to merge 1 commit into
openshift:masterfrom
melvinjoseph86:ipfailover_migration_e2e

Conversation

@melvinjoseph86

@melvinjoseph86 melvinjoseph86 commented Jun 25, 2026

Copy link
Copy Markdown

Summary by CodeRabbit

  • New Features

    • Added a new ipfailover test extension with end-to-end coverage for deployment, failover behavior, VIP changes, and configuration validation.
    • Built and packaged the test extension so it can be run from a generated binary/artifact.
  • Documentation

    • Added setup and run instructions for the ipfailover test extension, including suite groupings and execution constraints.
  • Chores

    • Updated ignore rules to keep generated test binaries out of source control.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: melvinjoseph86
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release/pull/81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message. unable to get additional pr info from string: openshift/release/pull/81103: string: openshift/release/pull/81103 doesn't match expected format: org/repo#number

Please contact an administrator to resolve this issue.

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs periodic-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@redhat-chai-bot

Copy link
Copy Markdown

Investigation: Deserialization Error Root Cause

I investigated the Deserialization Error: invalid character 'I' looking for beginning of value failure reported in #wg-openshift-tests-extension. Here are the findings:

Root Cause: klog.SetOutput(g.GinkgoWriter) timing bug

The init() function in test/e2e/ipfailover.go captures a reference to g.GinkgoWriter at init-time, but at that point GinkgoWriter is still initialized to os.Stdout (set in Ginkgo's own init() in core_dsl.go). Later, OTE's configureGinkgo() replaces the GinkgoWriter variable with a new Writer pointing to os.Stderr, but klog still holds the old Writer object that writes to stdout.

This means all klog INFO lines (I0625...) go to stdout for the entire process lifetime, contaminating the JSON output stream.

The same timing issue affects rest.SetDefaultWarningHandler(rest.NewWarningWriter(g.GinkgoWriter, ...)).

Recommended fix for this PR

Replace the GinkgoWriter references in init() with direct os.Stderr:

func init() {
    // Write directly to stderr — no timing dependency on GinkgoWriter
    klog.SetOutput(os.Stderr)

    // Redirect REST client warnings directly to stderr
    rest.SetDefaultWarningHandler(
        rest.NewWarningWriter(os.Stderr, rest.WarningWriterOptions{}),
    )

    exutil.WithCleanup(func() {})
}

The flag.Set("ginkgo.output-interceptor-mode", "none") line can also be removed — OTE uses NoopOutputInterceptor{} unconditionally, so the flag has no effect.

Framework-side fix

A structural fix is also being prepared for openshift-eng/openshift-tests-extension to make newTestResultFromOutput() resilient to non-JSON stdout content (similar to how origin's extractJSON() works). This will prevent future occurrences regardless of the contamination source.

Investigation triggered from #wg-openshift-tests-extension discussion

@melvinjoseph86 melvinjoseph86 force-pushed the ipfailover_migration_e2e branch from 36578d9 to 791cb2c Compare June 29, 2026 17:54
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Walkthrough

Adds a new openshift-tests-extension (OTE) for ipfailover keepalived tests, migrated from openshift-tests-private. Includes a Go module, Makefile, multi-stage Dockerfile build stage, a test entrypoint with three suite registrations, six e2e test cases with utility helpers, an OpenShift deployment template, and a README.

Changes

ipfailover OTE test extension

Layer / File(s) Summary
Build tooling and Dockerfile packaging
ipfailover/keepalived/tests/Makefile, ipfailover/keepalived/Dockerfile, .gitignore
Makefile defines build/clean/help targets producing bin/ipfailover-tests-ext. Dockerfile adds a test-extension-builder stage that compiles and archives the binary into ipfailover-test-extension.tar.gz, then copies it into /usr/bin/ in the final image. .gitignore excludes the built binary path.
Go module and dependencies
ipfailover/keepalived/tests/go.mod
Declares module github.com/openshift/ipfailover-tests-extension with Go 1.25.0, direct deps on Ginkgo/Gomega, openshift-tests-extension, origin, Cobra, and Kubernetes libs, plus a replace block redirecting k8s.io/* and OpenTelemetry packages to OpenShift forks at pinned commits.
Test entrypoint and suite registration
ipfailover/keepalived/tests/cmd/main.go
main() initializes flags, builds and filters Ginkgo specs to local e2e/ sources, applies platform labels, sets blocking lifecycle, and registers specs into an ipfailover extension with Cobra commands. registerSuites registers serial-conformance, disruptive, and all suites.
Deployment template testdata
ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml
OpenShift Template defining an ipfailover Deployment with privileged/hostNetwork settings, liveness probe for keepalived, env var wiring, volume mounts, and parameters for name, image, VIPs, interface, monitor port, and VRRP offset.
e2e utility helpers
ipfailover/keepalived/tests/e2e/util.go
Adds ipfailoverDescription struct, node/pod discovery, pod readiness polling, MASTER/BACKUP convergence detection, VIP ownership lookup, image pull spec resolution, template processing with retry, IP stack classification, unicast peer configuration, namespace privilege setup, and exec/grep helpers.
e2e test cases
ipfailover/keepalived/tests/e2e/ipfailover.go
Six Ginkgo tests under [OTP][sig-network-edge] Network_Edge: basic MASTER/backup deploy, VIP failover on pod deletion, full env-var config verification, large VIP range configuration, preemption strategy validation, and VRRP ID exclusion check. Suite init redirects klog/REST warnings and sets up platform/network skip logic in BeforeEach.
README
ipfailover/keepalived/tests/README.md
Documents the OTE extension: suite groupings, concurrency requirements, run commands, expected durations, and directory layout.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (4 errors, 5 warnings)

Check name Status Explanation Resolution
Stable And Deterministic Test Names ❌ Error The new Ginkgo titles embed author names and ticket IDs (e.g. 'Author:mjoseph-...-41027...'), making them brittle and overly specific. Rename specs to short static descriptions (e.g. 'should fail over to standby when master fails') and keep author/ticket info outside the title.
Ote Binary Stdout Contract ❌ Error init() in e2e/ipfailover.go wires klog/rest warnings to ginkgo.GinkgoWriter before OTE rebinds it; Ginkgo starts that writer on os.Stdout, so JSON stdout can be contaminated. Route those logs to os.Stderr (or defer the wiring until after OTE rebinds GinkgoWriter) so the main binary never emits non-JSON stdout.
Container-Privileges ❌ Error New ipfailover template sets hostNetwork: true and privileged: true at pod/container scope. Remove or narrowly justify these privileges; add the missing hardening settings if the workload still requires host access.
No-Sensitive-Data-In-Logs ❌ Error New e2e helpers log full node/pod details and pod logs via e2e.Logf, which can expose internal hostnames and cluster data. Redact or remove the full dumps; log only counts/selected fields, and guard any verbose pod-log/describe output behind a debug flag.
Docstring Coverage ⚠️ Warning Docstring coverage is 4.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning The suite lacks teardown hooks, some It blocks mix multiple failover behaviors, and many assertions are bare Expect(err).NotTo(HaveOccurred()). Split broad It blocks, add cleanup/DeferCleanup for created resources, and add descriptive failure messages to all error assertions.
Microshift Test Compatibility ⚠️ Warning FAIL: the new Ginkgo suite lacks MicroShift skips and uses template.openshift.io/v1 plus a 2-worker/hostNetwork/privileged Deployment, which MicroShift doesn't support. Add a [Skipped:MicroShift] or runtime MicroShift guard, and remove/replace the Template+HA assumptions (or skip on MicroShift) before running there.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning The new e2e suite requires 2 worker nodes and exercises failover between pods, with no SingleReplicaTopology/SNO guard present. Add an explicit SNO skip/label (or remove the 2-worker and cross-node failover assumptions) so the suite won’t run on Single Node OpenShift.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning New e2e tests hardcode IPv4-only values and dot-octet parsing (e.g. 192.168.123.123, 192.168.254.1-255, strings.Split on "."). Replace fixed IPv4 literals with family-aware values, avoid octet parsing/IPv4-only host formatting, and use cluster IP-family detection or explicit IPv6 skips.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: migrating ipfailover test cases into the images repo.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Topology-Aware Scheduling Compatibility ✅ Passed The new worker-only 2-replica manifest is gated by platform and a ≥2 Linux worker-node check before creation, and no anti-affinity/spread/PDB or control-plane selectors were added.
No-Weak-Crypto ✅ Passed Changed ipfailover files contain no weak-crypto APIs, custom crypto, or secret/token comparisons; only unrelated secret-pull usage appears.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Nitpick comments (1)
ipfailover/keepalived/Dockerfile (1)

5-5: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Copy only the inputs this builder stage actually needs.

COPY . . pulls the whole repository into the build context even though this stage only builds ipfailover/keepalived/tests. That widens the accidental secret/file exposure surface and invalidates the cache on unrelated changes.

♻️ Proposed fix
 WORKDIR /go/src/github.com/openshift/ipfailover
-COPY . .
+COPY ipfailover/keepalived/tests ./ipfailover/keepalived/tests
 RUN cd ipfailover/keepalived/tests && \
     make build && \

As per path instructions, COPY specific files, not entire context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` at line 5, The builder stage in the
keepalived Dockerfile is over-copying the entire repository with COPY . ., even
though it only needs the inputs required to build ipfailover/keepalived/tests.
Replace the broad copy with explicit copies of the minimal needed files and
directories so the stage stays narrowly scoped, avoids exposing unrelated files,
and preserves cache reuse; use the Dockerfile stage and the tests build target
as the main references when locating the change.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ipfailover/keepalived/tests/cmd/main.go`:
- Around line 25-26: Move the framework.AfterReadingAllFlags call so it runs
after root.Execute() has parsed flags, because invoking it before parsing leaves
framework.TestContext at default values and ignores kube/e2e overrides. Update
the startup flow in main so util.InitStandardFlags still happens first, then
flags are parsed via root.Execute, and only then
framework.AfterReadingAllFlags(&framework.TestContext) is called.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go`:
- Around line 168-171: After deleting the current primary, do not immediately
recompute MASTER/BACKUP from the pod list; in the ipfailover e2e flow around
getPodListByLabel, ensureIpfailoverMasterBackup, waitForPrimaryPod, and
futurePrimaryPod[0], add a replica-aware wait until the replacement pod set is
fully available and stable before reading pod names again. Apply the same fix in
both affected flows so the recalculation happens only after the rollout
completes and the pod list reflects the final state.
- Around line 75-84: The IPv6 single-stack check in ipfailover e2e is too narrow
because it only inspects the first clusterNetwork CIDR and can misclassify
IPv6-first dual-stack clusters. Update the logic in the test around the current
oc-based network lookup to reuse checkIPStackType(oc) or otherwise evaluate the
full status.clusterNetwork CIDR list before calling g.Skip, so only true IPv6
single-stack clusters are skipped.
- Around line 69-73: The remote-worker skip check in the ipfailover e2e test is
using the wrong node selector, so it can miss remote-worker clusters. Update the
logic in the test block around the remote-worker gate to query
`node.openshift.io/remote-worker` directly, matching the approach used by
`exactNodeDetails()`, and then keep the skip decision based on that targeted
result instead of searching generic `nodes -l kubernetes.io/hostname` output.

In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml`:
- Around line 28-75: The openshift-ipfailover container definition in the
ipfailover manifest is missing an explicit resource envelope. Add cpu and memory
requests and limits to the openshift-ipfailover container alongside its existing
ports, securityContext, and env settings so it conforms to the resource policy
for every container.
- Around line 14-27: The ipfailover test template hard-codes a 2-replica
deployment with a worker-only node selector, which assumes a standard
multi-worker topology. Update the ipfailover.yaml template to derive replicas
from the number of schedulable nodes and make the e2e test setup topology-aware
before rendering it, so it can handle SingleReplica, DualReplica, and
arbiter-style clusters. Use the existing ipfailover template fields and related
test setup logic to cap replica count to available schedulable nodes and avoid
assuming node-role.kubernetes.io/worker is always present.
- Around line 23-44: The ipfailover test pod spec is effectively fully
host-privileged and needs to be tightened in the openshift-ipfailover workload
definition. Update the container and pod security settings to follow the
security guidance by adding non-root and no-escalation constraints, making the
root filesystem read-only, dropping all capabilities unless explicitly required,
and disabling service account token automount unless the test truly needs it.
Keep host access only if required for the test, but scope it to the minimum
necessary privileges and document any exception in the ipfailover template
rather than leaving the default manifest overly permissive.
- Around line 69-75: The keepalived test pod spec currently defines only
livenessProbe, which lets the pod report Ready too early and makes the e2e flow
race-prone. Add a readinessProbe alongside the existing livenessProbe in the
ipfailover manifest so readiness only flips after keepalived is actually
prepared to serve, keeping ensurePodWithLabelReady in sync with the downstream
assertions in ipfailover.go.
- Around line 22-25: Remove the invalid top-level privileged field from the pod
spec in the ipfailover test manifest. The fix is to delete
spec.template.spec.privileged from the ipfailover.yaml fixture and keep the
existing containers[].securityContext.privileged setting as the only privileged
declaration. Use the ipfailover pod template structure under spec.template.spec
to locate the change.

In `@ipfailover/keepalived/tests/e2e/util.go`:
- Around line 319-331: The helper waitForPodWithLabelReady currently only checks
Ready status, so it can succeed with a single surviving pod during rollout.
Update waitForPodWithLabelReady to also verify the number of pods returned by
the label selector in addition to readiness, and keep polling until both the
expected replica count and Ready condition are met. Use the existing wait.Poll
logic and the kubectl get call in waitForPodWithLabelReady to enforce the count
before returning success.
- Around line 29-31: The worker-counting logic in the e2e helper is parsing `oc
get nodes` table text with `strings.Count`, which double-counts `Ready` inside
`NotReady` and can underreport healthy nodes. Update the node readiness check in
the helper that builds `linuxWorkerDetails` and `nodeCount` to use structured
node data from the API or JSON output, inspecting node conditions and
unschedulable state instead of matching substrings in human-readable output.
- Around line 240-249: The unicast peer list in the Nutanix/none branch is built
from the first node address, but .status.addresses[0] is not guaranteed to be an
InternalIP. Update the logic in the keepalived e2e util path that calls
oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node
addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS,
and keep the existing validation that enough worker IPs were found.
- Around line 160-176: The getVipOwnerPod helper is treating the literal
“command terminated with exit code 1” as a VIP miss, which can mask real oc exec
failures and return the wrong master or an empty result. Update the logic around
oc.AsAdmin().WithoutNamespace().Run("exec").Args(...).Output() to inspect err
first, only consider exit code 1 from the command as “VIP not present,” and
explicitly retry or fail for any other error conditions; keep the selection
logic in getVipOwnerPod and its primaryNode assignment tied to successful
command results.

In `@ipfailover/keepalived/tests/go.mod`:
- Line 3: The module version in ipfailover/keepalived/tests/go.mod is newer than
the Go toolchain used by the ipfailover/keepalived/Dockerfile builder stage, so
the build may auto-download a different compiler. Update the builder image in
Dockerfile to a Go 1.25-based image, or reduce the go directive in go.mod to
match the supported compiler version. Make sure the versions stay aligned across
go.mod and the Dockerfile so the tests build with the image’s local toolchain.

---

Nitpick comments:
In `@ipfailover/keepalived/Dockerfile`:
- Line 5: The builder stage in the keepalived Dockerfile is over-copying the
entire repository with COPY . ., even though it only needs the inputs required
to build ipfailover/keepalived/tests. Replace the broad copy with explicit
copies of the minimal needed files and directories so the stage stays narrowly
scoped, avoids exposing unrelated files, and preserves cache reuse; use the
Dockerfile stage and the tests build target as the main references when locating
the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment on lines +25 to +26
util.InitStandardFlags()
framework.AfterReadingAllFlags(&framework.TestContext)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C3 'InitStandardFlags\(|AfterReadingAllFlags\(' ipfailover/keepalived/tests/cmd/main.go
rg -n -C3 --iglob 'main.go' 'InitStandardFlags\(|AfterReadingAllFlags\(' .

Repository: openshift/images

Length of output: 972


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== main.go =="
cat -n ipfailover/keepalived/tests/cmd/main.go | sed -n '1,180p'

echo
echo "== other AfterReadingAllFlags usages =="
rg -n -C2 'AfterReadingAllFlags\(' . || true

echo
echo "== root.Execute call site =="
rg -n -C3 'root\.Execute\(\)|Execute\(\)' ipfailover/keepalived/tests/cmd/main.go

Repository: openshift/images

Length of output: 6376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== framework docs references =="
rg -n -C2 'AfterReadingAllFlags' . || true

echo
echo "== related entrypoints =="
rg -n -C2 'util.InitStandardFlags\(|logs.InitLogs\(|framework.AfterReadingAllFlags\(' . --glob 'main.go' || true

Repository: openshift/images

Length of output: 2173


Move framework.AfterReadingAllFlags after flag parsing. Calling it before root.Execute() leaves framework.TestContext on defaults, so kube/e2e flag overrides are ignored.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/cmd/main.go` around lines 25 - 26, Move the
framework.AfterReadingAllFlags call so it runs after root.Execute() has parsed
flags, because invoking it before parsing leaves framework.TestContext at
default values and ignores kube/e2e overrides. Update the startup flow in main
so util.InitStandardFlags still happens first, then flags are parsed via
root.Execute, and only then
framework.AfterReadingAllFlags(&framework.TestContext) is called.

Comment on lines +69 to +73
g.By("check the cluster has remote worker profile")
remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output()
if strings.Contains(remoteWorkerDetails, "remote-worker") {
g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The remote-worker gate is checking the wrong selector.

-l kubernetes.io/hostname returns every node, then this relies on "remote-worker" appearing in the rendered table output. That misses clusters where the node name does not contain that substring. Query node.openshift.io/remote-worker directly, like exactNodeDetails() already does.

Proposed fix
-		remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output()
-		if strings.Contains(remoteWorkerDetails, "remote-worker") {
+		remoteWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node.openshift.io/remote-worker").Output()
+		o.Expect(err).NotTo(o.HaveOccurred())
+		if !strings.Contains(remoteWorkerDetails, "No resources found") {
 			g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
g.By("check the cluster has remote worker profile")
remoteWorkerDetails, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "kubernetes.io/hostname").Output()
if strings.Contains(remoteWorkerDetails, "remote-worker") {
g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")
}
g.By("check the cluster has remote worker profile")
remoteWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node.openshift.io/remote-worker").Output()
o.Expect(err).NotTo(o.HaveOccurred())
if !strings.Contains(remoteWorkerDetails, "No resources found") {
g.Skip("Skip as ipfailover currently doesn't support on remote-worker profile")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 69 - 73, The
remote-worker skip check in the ipfailover e2e test is using the wrong node
selector, so it can miss remote-worker clusters. Update the logic in the test
block around the remote-worker gate to query `node.openshift.io/remote-worker`
directly, matching the approach used by `exactNodeDetails()`, and then keep the
skip decision based on that targeted result instead of searching generic `nodes
-l kubernetes.io/hostname` output.

Comment on lines +75 to +84
g.By("check whether the cluster is not ipv6 single stack")
// Get IP stack type using oc command instead of compat_otp
networkOutput, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("network", "cluster", "-o=jsonpath={.status.clusterNetworkMTU}").Output()
if err == nil && networkOutput != "" {
// Simple check - if we can get network info, check for IPv6
cidrOutput, _ := oc.AsAdmin().WithoutNamespace().Run("get").Args("network", "cluster", "-o=jsonpath={.status.clusterNetwork[0].cidr}").Output()
if strings.Contains(cidrOutput, ":") && !strings.Contains(cidrOutput, ".") {
g.Skip("Skip as ipfailover currently doesn't support ipv6 single stack")
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This skips IPv6-first dual-stack clusters as if they were single-stack.

The check only inspects status.clusterNetwork[0].cidr. If the first CIDR is IPv6 and the second is IPv4, the suite is skipped even though dual-stack is supported by the rest of this package. Reuse checkIPStackType(oc) or inspect the full CIDR list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 75 - 84, The IPv6
single-stack check in ipfailover e2e is too narrow because it only inspects the
first clusterNetwork CIDR and can misclassify IPv6-first dual-stack clusters.
Update the logic in the test around the current oc-based network lookup to reuse
checkIPStackType(oc) or otherwise evaluate the full status.clusterNetwork CIDR
list before calling g.Skip, so only true IPv6 single-stack clusters are skipped.

Comment on lines +168 to +171
g.By("7. Verify the virtual IP is floated onto the new MASTER node")
newPodName1 := getPodListByLabel(oc, ns, "ipfailover=hello-openshift")
newMasterNode, _ := ensureIpfailoverMasterBackup(oc, ns, newPodName1)
waitForPrimaryPod(oc, ns, newMasterNode, virtualIP)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wait for the replacement pod set before recalculating MASTER/BACKUP.

Both flows delete the current primary and then immediately read pod names again. There is a normal window where only the surviving replica exists, so ensureIpfailoverMasterBackup() / futurePrimaryPod[0] can race the rollout and fail nondeterministically. Add a replica-aware wait between the delete and the next pod-list read.

Also applies to: 346-353

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 168 - 171, After
deleting the current primary, do not immediately recompute MASTER/BACKUP from
the pod list; in the ipfailover e2e flow around getPodListByLabel,
ensureIpfailoverMasterBackup, waitForPrimaryPod, and futurePrimaryPod[0], add a
replica-aware wait until the replacement pod set is fully available and stable
before reading pod names again. Apply the same fix in both affected flows so the
recalculation happens only after the rollout completes and the pod list reflects
the final state.

Comment on lines +14 to +27
replicas: 2
selector:
matchLabels:
ipfailover: hello-openshift
template:
metadata:
labels:
ipfailover: hello-openshift
spec:
serviceAccountName: ipfailover
privileged: true
hostNetwork: true
nodeSelector:
node-role.kubernetes.io/worker: ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Avoid hard-coding a 2-replica, worker-only topology.

replicas: 2 plus node-role.kubernetes.io/worker assumes at least two schedulable workers exist. This template will stall on SingleReplica/DualReplica/arbiter-style topologies instead of adapting or being explicitly gated. Cap replicas to the schedulable node count and make the surrounding test logic topology-aware before rendering this template. As per coding guidelines, "check whether they introduce scheduling constraints assuming standard HA topology with 3+ control-plane nodes and dedicated workers" and "Cap replica counts to schedulable nodes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml` around lines
14 - 27, The ipfailover test template hard-codes a 2-replica deployment with a
worker-only node selector, which assumes a standard multi-worker topology.
Update the ipfailover.yaml template to derive replicas from the number of
schedulable nodes and make the e2e test setup topology-aware before rendering
it, so it can handle SingleReplica, DualReplica, and arbiter-style clusters. Use
the existing ipfailover template fields and related test setup logic to cap
replica count to available schedulable nodes and avoid assuming
node-role.kubernetes.io/worker is always present.

Source: Coding guidelines

Comment on lines +29 to +31
linuxWorkerDetails, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node-role.kubernetes.io/worker=,kubernetes.io/os=linux").Output()
o.Expect(err).NotTo(o.HaveOccurred())
nodeCount := int(strings.Count(linuxWorkerDetails, "Ready")) - (int(strings.Count(linuxWorkerDetails, "SchedulingDisabled")) + int(strings.Count(linuxWorkerDetails, "NotReady")))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop counting worker readiness from table text.

strings.Count(..., "Ready") also matches the Ready inside NotReady, so a NotReady,SchedulingDisabled worker is subtracted twice. With two healthy workers plus one cordoned unhealthy worker, this helper returns 1 instead of 2 and the suite is skipped even though the precondition is met. Query structured node conditions / unschedulable state instead of parsing oc get nodes text.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 29 - 31, The
worker-counting logic in the e2e helper is parsing `oc get nodes` table text
with `strings.Count`, which double-counts `Ready` inside `NotReady` and can
underreport healthy nodes. Update the node readiness check in the helper that
builds `linuxWorkerDetails` and `nodeCount` to use structured node data from the
API or JSON output, inspecting node conditions and unschedulable state instead
of matching substrings in human-readable output.

Comment on lines +160 to +176
func getVipOwnerPod(oc *exutil.CLI, ns string, podname []string, vip string) string {
cmd := fmt.Sprintf("ip address |grep %s", vip)
var primaryNode string
for i := 0; i < len(podname); i++ {
output, err := oc.AsAdmin().WithoutNamespace().Run("exec").Args("-n", ns, podname[i], "--", "bash", "-c", cmd).Output()
if len(podname) == 1 && output == "command terminated with exit code 1" {
e2e.Failf("The given pod is not master")
}
if output == "command terminated with exit code 1" {
e2e.Logf("This Pod %v does not have the VIP", podname[i])
} else if strings.Contains(output, vip) {
e2e.Logf("The pod owning the VIP is %v", podname[i])
primaryNode = podname[i]
break
} else {
o.Expect(err).NotTo(o.HaveOccurred())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Differentiate a grep miss from an oc exec failure.

Matching the literal output "command terminated with exit code 1" makes pod restarts, not-ready containers, and transport failures look the same as “VIP not present”. That can select the wrong master or return an empty pod name. Check err, only treat exit code 1 as a miss, and retry/fail on everything else. As per path instructions, **/*.go: Never ignore error returns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 160 - 176, The
getVipOwnerPod helper is treating the literal “command terminated with exit code
1” as a VIP miss, which can mask real oc exec failures and return the wrong
master or an empty result. Update the logic around
oc.AsAdmin().WithoutNamespace().Run("exec").Args(...).Output() to inspect err
first, only consider exit code 1 from the command as “VIP not present,” and
explicitly retry or fail for any other error conditions; keep the selection
logic in getVipOwnerPod and its primaryNode assignment tied to successful
command results.

Source: Path instructions

Comment on lines +240 to +249
if platformtype == "nutanix" || platformtype == "none" {
getPodListByLabel(oc, oc.Namespace(), "ipfailover=hello-openshift")
workerIPAddress, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node-role.kubernetes.io/worker=", "-ojsonpath={.items[*].status.addresses[0].address}").Output()
o.Expect(err).NotTo(o.HaveOccurred())
modifiedIPList := strings.Split(workerIPAddress, " ")
if len(modifiedIPList) < 2 {
e2e.Failf("There is not enough IP addresses to add as unicast peer")
}
ipList := strings.Join(modifiedIPList, ",")
cmd := fmt.Sprintf("OPENSHIFT_HA_UNICAST_PEERS=%v", ipList)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t assume .status.addresses[0] is the worker IP.

Kubernetes does not guarantee address order. On many clusters index 0 is Hostname or ExternalIP, so OPENSHIFT_HA_UNICAST_PEERS can be populated with unusable values and the Nutanix/none path never converges. Filter node addresses by type=="InternalIP" before building the peer list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 240 - 249, The unicast
peer list in the Nutanix/none branch is built from the first node address, but
.status.addresses[0] is not guaranteed to be an InternalIP. Update the logic in
the keepalived e2e util path that calls
oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node
addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS,
and keep the existing validation that enough worker IPs were found.

Comment on lines +319 to +331
func waitForPodWithLabelReady(oc *exutil.CLI, ns, label string) error {
return wait.Poll(5*time.Second, 3*time.Minute, func() (bool, error) {
status, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("pod", "-n", ns, "-l", label, `-ojsonpath={.items[*].status.conditions[?(@.type=="Ready")].status}`).Output()
e2e.Logf("the Ready status of pod is %v", status)
if err != nil || status == "" {
e2e.Logf("failed to get pod status: %v, retrying...", err)
return false, nil
}
if strings.Contains(status, "False") {
e2e.Logf("the pod Ready status not met; wanted True but got %v, retrying...", status)
return false, nil
}
return true, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

This readiness gate can pass before the replacement replica exists.

It only checks the Ready condition of the pods that already exist; it never verifies how many labeled pods are present. During a rollout, one surviving Ready pod is enough for this to return success, and callers then immediately assume a two-pod MASTER/BACKUP set. Make the helper validate the expected replica count too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 319 - 331, The helper
waitForPodWithLabelReady currently only checks Ready status, so it can succeed
with a single surviving pod during rollout. Update waitForPodWithLabelReady to
also verify the number of pods returned by the label selector in addition to
readiness, and keep polling until both the expected replica count and Ready
condition are met. Use the existing wait.Poll logic and the kubectl get call in
waitForPodWithLabelReady to enforce the count before returning success.

@@ -0,0 +1,350 @@
module github.com/openshift/ipfailover-tests-extension

go 1.25.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== go.mod =="
sed -n '1,6p' ipfailover/keepalived/tests/go.mod

echo
echo "== Makefile build recipe =="
sed -n '1,10p' ipfailover/keepalived/tests/Makefile

echo
echo "== Docker builder stage =="
sed -n '1,8p' ipfailover/keepalived/Dockerfile

Repository: openshift/images

Length of output: 910


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== other Go version declarations =="
rg -n --glob 'go.mod' --glob 'Dockerfile*' --glob 'Makefile*' 'go 1\.25|golang-1\.22|GOTOOLCHAIN=auto|GOTOOLCHAIN' .

echo
echo "== files in keepalived tests =="
git ls-files 'ipfailover/keepalived/tests/*'

echo
echo "== nearby Dockerfile references =="
rg -n 'ipfailover/keepalived/tests|test-extension-builder|golang-' ipfailover/keepalived -S

Repository: openshift/images

Length of output: 50374


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Dockerfile (full, line-numbered) =="
cat -n ipfailover/keepalived/Dockerfile

echo
echo "== tests go.mod =="
cat -n ipfailover/keepalived/tests/go.mod

echo
echo "== tests Makefile =="
cat -n ipfailover/keepalived/tests/Makefile

Repository: openshift/images

Length of output: 26386


Align the Go 1.25 module with the Go 1.22 builder image
ipfailover/keepalived/tests/go.mod:3 now requires Go 1.25.0, but ipfailover/keepalived/Dockerfile:2 still builds in ...golang-1.22.... With GOTOOLCHAIN=auto, this will pull a newer toolchain during the build instead of using the image’s compiler, making CI and offline builds dependent on network access. Bump the builder image to Go 1.25 or lower the module directive to the supported version.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/go.mod` at line 3, The module version in
ipfailover/keepalived/tests/go.mod is newer than the Go toolchain used by the
ipfailover/keepalived/Dockerfile builder stage, so the build may auto-download a
different compiler. Update the builder image in Dockerfile to a Go 1.25-based
image, or reduce the go directive in go.mod to match the supported compiler
version. Make sure the versions stay aligned across go.mod and the Dockerfile so
the tests build with the image’s local toolchain.

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-with-prs openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

/test-with-release openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs openshift/release#81231 e2e-openstack-ipi-tests-ext

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants