
Commit 7dbd2a2

Merge pull request #30758 from sairameshv/nvidia_dra_ocp

OCPNODE-4043: Add DRA e2e tests to run on NVIDIA GPU

2 parents b5be119 + 1529b59 · commit 7dbd2a2

8 files changed
Lines changed: 2077 additions & 0 deletions


test/extended/include.go

Lines changed: 1 addition & 0 deletions

```diff
@@ -39,6 +39,7 @@ import (
 	_ "github.com/openshift/origin/test/extended/machines"
 	_ "github.com/openshift/origin/test/extended/networking"
 	_ "github.com/openshift/origin/test/extended/node"
+	_ "github.com/openshift/origin/test/extended/node/dra/nvidia"
 	_ "github.com/openshift/origin/test/extended/node/node_e2e"
 	_ "github.com/openshift/origin/test/extended/node_tuning"
 	_ "github.com/openshift/origin/test/extended/oauth"
```

test/extended/node/dra/OWNERS

Lines changed: 17 additions & 0 deletions

```yaml
approvers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

reviewers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

labels:
- sig/scheduling
- area/dra
```
Lines changed: 251 additions & 0 deletions

# NVIDIA DRA Extended Tests for OpenShift

This directory contains extended tests for NVIDIA Dynamic Resource Allocation (DRA) functionality on OpenShift clusters with GPU nodes.

## Overview

These tests validate:

- NVIDIA DRA driver installation and lifecycle
- Single-GPU allocation via ResourceClaims
- Multi-GPU workload allocation
- Pod lifecycle and resource cleanup
- GPU device accessibility in pods

## Prerequisites

1. **OpenShift 4.21+** with GPU-enabled worker nodes
2. **Node Feature Discovery Operator** pre-installed
3. **NVIDIA GPU Operator** pre-installed with **CDI enabled** (`cdi.enabled=true`)
4. **Helm 3** installed and available in PATH
5. **Cluster-admin** access

For installation instructions, see the [NVIDIA GPU Operator on OpenShift Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).

**Note**: The test framework automatically installs the DRA driver if it is not already present.

## Quick Start

### Running Tests

```bash
# 1. Build the test binary
# Note: your origin checkout should match the cluster version to avoid version mismatch errors
make WHAT=cmd/openshift-tests

# 2. Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# 3. Run all NVIDIA DRA tests
./openshift-tests run --dry-run all 2>&1 | \
  grep "\[Feature:NVIDIA-DRA\]" | \
  ./openshift-tests run -f -

# OR run a specific test
./openshift-tests run-test \
  '[sig-scheduling][Feature:NVIDIA-DRA][Serial] Basic GPU Allocation should allocate single GPU to pod via DRA'

# OR list all available tests without running them
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]"
```

**What the tests do automatically:**

- Verify that the GPU Operator is installed (the test fails if it is not present)
- Install the DRA driver if it is not already present (version 25.12.0 by default)
- Configure the necessary SCC permissions and node labels
- Wait for all components to be ready before running tests

To use a different DRA driver version: `export NVIDIA_DRA_DRIVER_VERSION=25.8.1`

### Running Individual Tests

```bash
# Set your kubeconfig first
export KUBECONFIG=/path/to/kubeconfig

# Discover and run all NVIDIA DRA tests sequentially
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]" | while read -r test; do
  echo "Running: $test"
  ./openshift-tests run-test "$test"
done
```

## Test Scenarios

### 1. Single GPU Allocation ✅

- Creates a DeviceClass with a CEL selector
- Creates a ResourceClaim requesting exactly 1 GPU
- Schedules a pod with the ResourceClaim
- Validates GPU accessibility via nvidia-smi
- Validates CDI device injection

**Expected**: PASSED
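
For illustration, a minimal sketch of the objects this scenario exercises (resource.k8s.io/v1, Kubernetes 1.34+). All object names and the image are hypothetical rather than the test's actual fixtures; `gpu.nvidia.com` is the NVIDIA DRA driver's name as reported in its ResourceSlices:

```bash
# Hypothetical DeviceClass + ResourceClaim + pod; not the test's exact fixtures.
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"   # match devices published by the NVIDIA DRA driver
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # illustrative; nvidia-smi is injected via CDI
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu          # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu
EOF
```

If allocation succeeds, `oc logs gpu-smoke` should show the allocated GPU in the nvidia-smi table.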

### 2. Resource Cleanup ✅

- Creates a pod with a GPU ResourceClaim
- Deletes the pod
- Verifies that the ResourceClaim persists after pod deletion
- Validates resource lifecycle management

**Expected**: PASSED
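
The same behavior can be checked by hand, reusing the hypothetical names from the first sketch:

```bash
# Delete the pod; a standalone (non-template) claim should survive it
oc delete pod gpu-smoke
oc get resourceclaim single-gpu   # still present after the pod is gone
```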

### 3. Multi-GPU Workloads ⚠️

- Creates a ResourceClaim requesting exactly 2 GPUs
- Schedules a pod requiring multiple GPUs
- Validates that all GPUs are accessible

**Expected**: SKIPPED when the cluster has fewer than 2 GPUs
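
The multi-GPU claim differs from the single-GPU one only in its request; a hypothetical sketch:

```bash
# Hypothetical two-GPU ResourceClaim; pods referencing it stay Pending
# until a single node can satisfy both devices.
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: dual-gpu
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.example.com
        allocationMode: ExactCount
        count: 2
EOF
```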

### 4. Claim Sharing 🔄

- Creates a single ResourceClaim
- Creates two pods referencing the same ResourceClaim
- Tests whether the NVIDIA DRA driver supports claim sharing
- Validates the behavior when multiple pods attempt to use the same claim

**Expected**: Behavior depends on driver support for claim sharing. The test verifies that:

- If sharing is NOT supported: the second pod remains Pending while the first pod continues to work
- If sharing IS supported: both pods run and have GPU access
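
A hypothetical sketch of the sharing setup, reusing the `single-gpu` claim from the first example (the pods sleep so both are alive at once):

```bash
# Two pods referencing the same claim; names are illustrative.
for name in sharer-a sharer-b; do
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9
    command: ["sh", "-c", "nvidia-smi && sleep 300"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu   # both pods point at the same claim
EOF
done
```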

### 5. ResourceClaimTemplate 📋

- Creates a ResourceClaimTemplate
- Creates a pod referencing the ResourceClaimTemplate
- Validates that a ResourceClaim is automatically created from the template
- Verifies GPU access in the pod
- Validates automatic cleanup of the template-generated claim when the pod is deleted

**Expected**: PASSED
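
A minimal, hypothetical sketch of the template flow; unlike a standalone ResourceClaim, the generated claim is owned by the pod and garbage-collected with it:

```bash
# Hypothetical ResourceClaimTemplate and a pod consuming it.
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: templated-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-template   # a per-pod claim is generated and deleted with the pod
EOF
```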

## Manual DRA Driver Installation

The tests install the DRA driver automatically when needed. This section is for installing or debugging the DRA driver manually, outside of the test framework.

### Step 1: Add the NVIDIA Helm Repository

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```

### Step 2: Label GPU Nodes

```bash
# Label all GPU nodes for DRA kubelet plugin scheduling
for node in $(oc get nodes -l nvidia.com/gpu.present=true -o name); do
  oc label $node nvidia.com/dra-kubelet-plugin=true --overwrite
done

# Verify the labels
oc get nodes -l nvidia.com/dra-kubelet-plugin=true
```

This label ensures that the DRA kubelet plugin runs only on GPU nodes, and works around NVIDIA Driver Manager eviction issues.

### Step 3: Install the DRA Driver

```bash
# Create the namespace
oc create namespace nvidia-dra-driver-gpu

# Grant SCC permissions
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-controller \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-kubeletplugin \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z compute-domain-daemon-service-account \
  -n nvidia-dra-driver-gpu

# Install via Helm (pinned to the version used by the tests)
# The version can be overridden via the NVIDIA_DRA_DRIVER_VERSION environment variable
NVIDIA_DRA_DRIVER_VERSION=${NVIDIA_DRA_DRIVER_VERSION:-25.12.0}
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --version ${NVIDIA_DRA_DRIVER_VERSION} \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  --set image.pullPolicy=IfNotPresent \
  --set-string kubeletPlugin.nodeSelector.nvidia\.com/dra-kubelet-plugin=true \
  --set controller.tolerations[0].key=node-role.kubernetes.io/master \
  --set controller.tolerations[0].operator=Exists \
  --set controller.tolerations[0].effect=NoSchedule \
  --set controller.tolerations[1].key=node-role.kubernetes.io/control-plane \
  --set controller.tolerations[1].operator=Exists \
  --set controller.tolerations[1].effect=NoSchedule \
  --wait --timeout 5m
```

### Step 4: Verify the Installation

```bash
# Check the DRA driver pods
oc get pods -n nvidia-dra-driver-gpu
# Expected: all pods Running

# Verify that ResourceSlices are published
oc get resourceslices
# Should show at least 2 slices per GPU node
```

### Uninstalling the DRA Driver

```bash
# Uninstall the DRA driver
helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu --wait --timeout 5m
oc delete namespace nvidia-dra-driver-gpu

# Remove the SCC permissions
oc delete clusterrolebinding \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-controller \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-kubeletplugin \
  nvidia-dra-privileged-compute-domain-daemon-service-account
```

**Note**: The GPU Operator is NOT removed, as it is cluster infrastructure. ResourceSlices are cleaned up automatically.
## Troubleshooting

### GPU Operator not found

**Cause**: The GPU Operator is not installed on the cluster.

**Solution**: Install the GPU Operator following the [official guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).

### Version mismatch error

**Cause**: The local origin checkout does not match the cluster's release commit.

**Solution**: Ensure that your origin checkout matches the cluster version, or use the `./openshift-tests run-test` command, which bypasses version checks.

### DRA driver kubelet plugin stuck at Init:0/1

**Cause**: Wrong `nvidiaDriverRoot` setting.

**Solution**: Ensure `nvidiaDriverRoot=/run/nvidia/driver` (not `/`). This is configured automatically by the tests.
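
On a manual install, one way to confirm the rendered value (assuming the Helm release name from Step 3):

```bash
# Inspect the deployed chart values; nvidiaDriverRoot should be /run/nvidia/driver
helm get values nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu --all | grep -i nvidiaDriverRoot
```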

### ResourceSlices not appearing

**Cause**: The DRA driver is not fully initialized, or SCC permissions are missing.

**Solution**:
```bash
# Check the DRA driver logs
oc logs -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu --all-containers

# Verify SCC permissions
oc describe scc privileged | grep nvidia-dra-driver-gpu

# Restart the DRA driver if needed
oc delete pod -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu
```

## References

- **NVIDIA GPU Operator**: https://github.com/NVIDIA/gpu-operator
- **NVIDIA DRA Driver**: https://github.com/NVIDIA/k8s-dra-driver-gpu
- **Kubernetes DRA Documentation**: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- **OpenShift Extended Tests**: https://github.com/openshift/origin/tree/master/test/extended

---

**Tested On**: OpenShift 4.21.0, Kubernetes 1.34.2, Tesla T4
