Commit 1529b59
Add NVIDIA DRA E2E tests for OpenShift
Add extended tests for NVIDIA Dynamic Resource Allocation (DRA) on OpenShift clusters with GPU nodes. Tests validate DRA driver lifecycle, GPU allocation via ResourceClaims, and pod GPU accessibility.

Test coverage includes:
- Single and multi-GPU allocation
- ResourceClaim lifecycle and cleanup
- Claim sharing behavior
- ResourceClaimTemplate usage
- CDI device injection validation

The test framework automatically:
- Validates GPU Operator installation
- Installs DRA driver if not present (via Helm)
- Configures required SCC permissions and node labels
- Waits for all components to be ready

Documentation includes prerequisites, quick start guide, test scenarios, manual installation steps, and troubleshooting.

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
1 parent e368054 commit 1529b59

8 files changed: 2077 additions & 0 deletions

test/extended/include.go

Lines changed: 1 addition & 0 deletions
```diff
@@ -39,6 +39,7 @@ import (
 	_ "github.com/openshift/origin/test/extended/machines"
 	_ "github.com/openshift/origin/test/extended/networking"
 	_ "github.com/openshift/origin/test/extended/node"
+	_ "github.com/openshift/origin/test/extended/node/dra/nvidia"
 	_ "github.com/openshift/origin/test/extended/node/node_e2e"
 	_ "github.com/openshift/origin/test/extended/node_tuning"
 	_ "github.com/openshift/origin/test/extended/oauth"
```
test/extended/node/dra/OWNERS

Lines changed: 17 additions & 0 deletions
```yaml
approvers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

reviewers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

labels:
- sig/scheduling
- area/dra
```
Lines changed: 251 additions & 0 deletions
# NVIDIA DRA Extended Tests for OpenShift

This directory contains extended tests for NVIDIA Dynamic Resource Allocation (DRA) functionality on OpenShift clusters with GPU nodes.

## Overview

These tests validate:
- NVIDIA DRA driver installation and lifecycle
- Single GPU allocation via ResourceClaims
- Multi-GPU workload allocation
- Pod lifecycle and resource cleanup
- GPU device accessibility in pods

## Prerequisites

1. **OpenShift 4.21+** with GPU-enabled worker nodes
2. **Node Feature Discovery Operator** pre-installed
3. **NVIDIA GPU Operator** pre-installed with **CDI enabled** (`cdi.enabled=true`)
4. **Helm 3** installed and available in PATH
5. **Cluster-admin** access

For installation instructions, see the [NVIDIA GPU Operator on OpenShift Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).

**Note**: The test framework automatically installs the DRA driver if not already present.
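
If the GPU Operator is already installed but CDI is off, it can usually be switched on via the operator's ClusterPolicy; a minimal sketch, assuming the default `gpu-cluster-policy` name from the NVIDIA OpenShift guide:

```bash
# Enable CDI on the existing (cluster-scoped) ClusterPolicy
# "gpu-cluster-policy" is the default name; adjust if yours differs
oc patch clusterpolicy gpu-cluster-policy \
  --type merge -p '{"spec": {"cdi": {"enabled": true}}}'
```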

## Quick Start

### Running Tests

```bash
# 1. Build the test binary
# Note: your origin checkout should match the cluster version to avoid version mismatch errors
make WHAT=cmd/openshift-tests

# 2. Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# 3. Run all NVIDIA DRA tests
./openshift-tests run --dry-run all 2>&1 | \
  grep "\[Feature:NVIDIA-DRA\]" | \
  ./openshift-tests run -f -

# OR run a specific test
./openshift-tests run-test \
  '[sig-scheduling][Feature:NVIDIA-DRA][Serial] Basic GPU Allocation should allocate single GPU to pod via DRA'

# OR list all available tests without running them
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]"
```

**What the tests do automatically:**
- Verify the GPU Operator is installed (the test fails if it is not present)
- Install the DRA driver if not already present (version 25.12.0 by default)
- Configure the necessary SCC permissions and node labels
- Wait for all components to be ready before running tests

To use a different DRA driver version: `export NVIDIA_DRA_DRIVER_VERSION=25.8.1`
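
To see beforehand what the framework will find, a quick manual check (assuming the GPU Operator's default `nvidia-gpu-operator` namespace) might look like:

```bash
# Is the GPU Operator present? (the tests fail fast if not)
oc get pods -n nvidia-gpu-operator

# Is the DRA driver Helm release already installed?
# If not, the test framework installs it automatically.
helm list -n nvidia-dra-driver-gpu
```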

### Running Individual Tests

```bash
# Set your kubeconfig first
export KUBECONFIG=/path/to/kubeconfig

# Discover and run all NVIDIA DRA tests sequentially
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]" | while read -r test; do
  echo "Running: $test"
  ./openshift-tests run-test "$test"
done
```

## Test Scenarios

### 1. Single GPU Allocation ✅
- Creates DeviceClass with CEL selector
- Creates ResourceClaim requesting exactly 1 GPU
- Schedules pod with ResourceClaim
- Validates GPU accessibility via nvidia-smi
- Validates CDI device injection (see the sketch below)

**Expected**: PASSED
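
A minimal sketch of the objects this scenario describes, assuming the `resource.k8s.io/v1` API of Kubernetes 1.34; the class, claim, pod, and image names are hypothetical, and the actual manifests the test creates may differ:

```bash
oc apply -f - <<'EOF'
# DeviceClass selecting NVIDIA GPUs via a CEL expression
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
---
# ResourceClaim requesting exactly one device from that class
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        allocationMode: ExactCount
        count: 1
---
# Pod consuming the claim; with CDI injection working,
# nvidia-smi -L should list the allocated GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # example image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
EOF
```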

### 2. Resource Cleanup ✅
- Creates pod with GPU ResourceClaim
- Deletes pod
- Verifies ResourceClaim persists after pod deletion (see below)
- Validates resource lifecycle management

**Expected**: PASSED
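
With the hypothetical names from the single-GPU sketch above, the check amounts to something like:

```bash
# Deleting the pod does not garbage-collect a manually created claim
oc delete pod gpu-pod --wait

# The claim should still exist; its allocation is released for reuse
oc get resourceclaim single-gpu-claim
```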

### 3. Multi-GPU Workloads ⚠️
- Creates ResourceClaim requesting exactly 2 GPUs (see the example below)
- Schedules pod requiring multiple GPUs
- Validates all GPUs are accessible

**Expected**: SKIPPED if the cluster has fewer than 2 GPUs (expected behavior)
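
The request differs from the single-GPU sketch only in the device count, e.g. (name hypothetical):

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: dual-gpu-claim
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.example.com
        allocationMode: ExactCount
        count: 2   # the pod schedules only where two GPUs are available
EOF
```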

### 4. Claim Sharing 🔄
- Creates a single ResourceClaim
- Creates two pods referencing the same ResourceClaim (see the fragment below)
- Tests whether the NVIDIA DRA driver supports claim sharing
- Validates behavior when multiple pods attempt to use the same claim

**Expected**: Behavior depends on driver support for claim sharing. The test verifies that:
- If sharing is NOT supported: the second pod remains Pending, the first pod continues to work
- If sharing IS supported: both pods run and have GPU access
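
Sharing is expressed entirely in the pod specs: both pods name the same claim. A fragment for the second pod, reusing the hypothetical names from the single-GPU sketch:

```bash
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod-2
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim   # same claim as the first pod
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # example image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
EOF
```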

### 5. ResourceClaimTemplate 📋
- Creates a ResourceClaimTemplate (sketched below)
- Creates pod with ResourceClaimTemplate reference
- Validates that a ResourceClaim is automatically created from the template
- Verifies GPU access in pod
- Validates automatic cleanup of the template-generated claim when the pod is deleted

**Expected**: PASSED
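
A sketch of the template flow under the same `resource.k8s.io/v1` assumption (names hypothetical); each consuming pod gets its own generated claim, which is owned by, and deleted with, the pod:

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: templated-gpu-pod
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-claim-template
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # example image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
EOF

# A generated claim should appear while the pod exists,
# and disappear once the pod is deleted
oc get resourceclaims
```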

## Manual DRA Driver Installation

The tests automatically install the DRA driver if needed. This section is for manually installing or debugging the DRA driver outside of the test framework.

### Step 1: Add NVIDIA Helm Repository

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```

### Step 2: Label GPU Nodes

```bash
# Label all GPU nodes for DRA kubelet plugin scheduling
for node in $(oc get nodes -l nvidia.com/gpu.present=true -o name); do
  oc label $node nvidia.com/dra-kubelet-plugin=true --overwrite
done

# Verify labels
oc get nodes -l nvidia.com/dra-kubelet-plugin=true
```

This label ensures the DRA kubelet plugin only runs on GPU nodes and works around NVIDIA Driver Manager eviction issues.

### Step 3: Install DRA Driver

```bash
# Create namespace
oc create namespace nvidia-dra-driver-gpu

# Grant SCC permissions
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-controller \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-kubeletplugin \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z compute-domain-daemon-service-account \
  -n nvidia-dra-driver-gpu

# Install via Helm (pinned to the version used by the tests)
# The version can be overridden via the NVIDIA_DRA_DRIVER_VERSION environment variable
NVIDIA_DRA_DRIVER_VERSION=${NVIDIA_DRA_DRIVER_VERSION:-25.12.0}
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --version ${NVIDIA_DRA_DRIVER_VERSION} \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  --set image.pullPolicy=IfNotPresent \
  --set-string 'kubeletPlugin.nodeSelector.nvidia\.com/dra-kubelet-plugin=true' \
  --set controller.tolerations[0].key=node-role.kubernetes.io/master \
  --set controller.tolerations[0].operator=Exists \
  --set controller.tolerations[0].effect=NoSchedule \
  --set controller.tolerations[1].key=node-role.kubernetes.io/control-plane \
  --set controller.tolerations[1].operator=Exists \
  --set controller.tolerations[1].effect=NoSchedule \
  --wait --timeout 5m
```

### Step 4: Verify Installation

```bash
# Check DRA driver pods
oc get pods -n nvidia-dra-driver-gpu
# Expected: all pods should be Running

# Verify ResourceSlices are published
oc get resourceslices
# Should show at least 2 slices per GPU node
```
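
To see which node, driver, and pool each slice belongs to, a custom-columns query over the `resource.k8s.io/v1` ResourceSlice fields can help:

```bash
oc get resourceslices \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,DRIVER:.spec.driver,POOL:.spec.pool.name'
```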

### Uninstalling the DRA Driver

```bash
# Uninstall the DRA driver
helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu --wait --timeout 5m
oc delete namespace nvidia-dra-driver-gpu

# Remove SCC permissions
oc delete clusterrolebinding \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-controller \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-kubeletplugin \
  nvidia-dra-privileged-compute-domain-daemon-service-account
```

**Note**: The GPU Operator is NOT removed, as it is cluster infrastructure. ResourceSlices are cleaned up automatically.

## Troubleshooting

### GPU Operator not found

**Cause**: GPU Operator not installed on the cluster.

**Solution**: Install the GPU Operator following the [official guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).

### Version mismatch error

**Cause**: The local origin checkout doesn't match the cluster's release commit.

**Solution**: Ensure your origin checkout matches the cluster version, or use the `./openshift-tests run-test` command, which bypasses version checks.

### DRA driver kubelet plugin stuck at Init:0/1

**Cause**: Wrong `nvidiaDriverRoot` setting.

**Solution**: Ensure `nvidiaDriverRoot=/run/nvidia/driver` (not `/`). The tests configure this automatically.
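
For a manual install, the setting can typically be corrected in place with a Helm upgrade that keeps the other values:

```bash
# Reuse the existing release values, overriding only the driver root
helm upgrade nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  -n nvidia-dra-driver-gpu --reuse-values \
  --set nvidiaDriverRoot=/run/nvidia/driver
```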

### ResourceSlices not appearing

**Cause**: DRA driver not fully initialized or missing SCC permissions.

**Solution**:
```bash
# Check DRA driver logs
oc logs -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu --all-containers

# Verify SCC permissions
oc describe scc privileged | grep nvidia-dra-driver-gpu

# Restart the DRA driver if needed
oc delete pod -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu
```

## References

- **NVIDIA GPU Operator**: https://github.com/NVIDIA/gpu-operator
- **NVIDIA DRA Driver**: https://github.com/NVIDIA/k8s-dra-driver-gpu
- **Kubernetes DRA Documentation**: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- **OpenShift Extended Tests**: https://github.com/openshift/origin/tree/master/test/extended

---

**Tested On**: OpenShift 4.21.0, Kubernetes 1.34.2, Tesla T4
