# NVIDIA DRA Extended Tests for OpenShift

This directory contains extended tests for NVIDIA Dynamic Resource Allocation (DRA) functionality on OpenShift clusters with GPU nodes.

## Overview

These tests validate:
- NVIDIA DRA driver installation and lifecycle
- Single GPU allocation via ResourceClaims
- Multi-GPU workload allocation
- Pod lifecycle and resource cleanup
- GPU device accessibility in pods

## Prerequisites

1. **OpenShift 4.21+** with GPU-enabled worker nodes
2. **Node Feature Discovery Operator** pre-installed
3. **NVIDIA GPU Operator** pre-installed with **CDI enabled** (`cdi.enabled=true`)
4. **Helm 3** installed and available in `PATH`
5. **Cluster-admin** access

For installation instructions, see the [NVIDIA GPU Operator on OpenShift Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).
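
To confirm CDI is enabled up front, one quick check is the GPU Operator's ClusterPolicy. The resource name below, `gpu-cluster-policy`, is the usual default but may differ on your cluster:

```bash
# Should print "true"; adjust the ClusterPolicy name if yours differs
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.cdi.enabled}{"\n"}'
```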

**Note**: The test framework automatically installs the DRA driver if not already present.

## Quick Start

### Running Tests

```bash
# 1. Build test binary
# Note: Your origin checkout should match the cluster version to avoid version mismatch errors
make WHAT=cmd/openshift-tests

# 2. Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# 3. Run all NVIDIA DRA tests
./openshift-tests run --dry-run all 2>&1 | \
  grep "\[Feature:NVIDIA-DRA\]" | \
  ./openshift-tests run -f -

# OR run a specific test
./openshift-tests run-test \
  '[sig-scheduling][Feature:NVIDIA-DRA][Serial] Basic GPU Allocation should allocate single GPU to pod via DRA'

# OR list all available tests without running them
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]"
```

**What the tests do automatically:**
- Verify GPU Operator is installed (test fails if not present)
- Install DRA driver if not already present (version 25.12.0 by default)
- Configure necessary SCC permissions and node labels
- Wait for all components to be ready before running tests

To use a different DRA driver version: `export NVIDIA_DRA_DRIVER_VERSION=25.8.1`

### Running Individual Tests

```bash
# Set your kubeconfig first
export KUBECONFIG=/path/to/kubeconfig

# Discover and run all NVIDIA DRA tests sequentially
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:NVIDIA-DRA\]" | while read -r test; do
  echo "Running: $test"
  ./openshift-tests run-test "$test"
done
```

## Test Scenarios

### 1. Single GPU Allocation ✅
- Creates DeviceClass with CEL selector
- Creates ResourceClaim requesting exactly 1 GPU
- Schedules pod with ResourceClaim
- Validates GPU accessibility via nvidia-smi
- Validates CDI device injection

**Expected**: PASSED
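
For orientation, here is a minimal sketch of the objects this scenario wires together. The names are illustrative, not the ones the tests create; `resource.k8s.io/v1` matches the Kubernetes 1.34 API noted at the bottom of this document, and the CEL expression assumes the NVIDIA driver name `gpu.nvidia.com`:

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: single-gpu.example.com        # illustrative name
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.nvidia.com'
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu-claim              # illustrative name
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: single-gpu.example.com
        count: 1                      # exactly one GPU
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # any image works; nvidia-smi is injected via CDI
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                     # references the pod-level resourceClaims entry
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim
EOF
```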

### 2. Resource Cleanup ✅
- Creates pod with GPU ResourceClaim
- Deletes pod
- Verifies ResourceClaim persists after pod deletion
- Validates resource lifecycle management

**Expected**: PASSED
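
Continuing the illustrative sketch above, the persistence check reduces to:

```bash
oc delete pod gpu-pod
# A directly created claim survives pod deletion (unlike template-generated claims)
oc get resourceclaim single-gpu-claim
```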

### 3. Multi-GPU Workloads ⚠️
- Creates ResourceClaim requesting exactly 2 GPUs
- Schedules pod requiring multiple GPUs
- Validates all GPUs are accessible

**Expected**: SKIPPED on clusters with fewer than 2 GPUs; the skip is expected behavior, not a failure
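
To check how many GPUs each node exposes, and therefore whether this scenario will run or skip, the `nvidia.com/gpu.count` label published by GPU Feature Discovery is a quick proxy:

```bash
# Shows the GPU count label as an extra column per GPU node
oc get nodes -l nvidia.com/gpu.present=true -L nvidia.com/gpu.count
```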

### 4. Claim Sharing 🔄
- Creates a single ResourceClaim
- Creates two pods referencing the same ResourceClaim
- Tests whether NVIDIA DRA driver supports claim sharing
- Validates behavior when multiple pods attempt to use the same claim

**Expected**: Behavior depends on driver support for claim sharing. Test verifies that:
- If sharing is NOT supported: Second pod remains Pending, first pod continues to work
- If sharing IS supported: Both pods run and have GPU access
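
Reusing the illustrative names from the first sketch, the sharing setup is just two pods whose `resourceClaims` entries reference the same pre-created claim:

```bash
# Both pods point at the same claim (illustrative names, not the test's own)
for name in sharer-a sharer-b; do
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim
EOF
done
```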

### 5. ResourceClaimTemplate 📋
- Creates a ResourceClaimTemplate
- Creates pod with ResourceClaimTemplate reference
- Validates that ResourceClaim is automatically created from template
- Verifies GPU access in pod
- Validates automatic cleanup of template-generated claim when pod is deleted

**Expected**: PASSED
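
A minimal sketch of the template flow, again with illustrative names and the `resource.k8s.io/v1` API from Kubernetes 1.34:

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:                               # template wraps a ResourceClaim spec
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: single-gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: templated-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-claim-template   # claim is generated per pod and deleted with it
EOF
```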

## Manual DRA Driver Installation

The tests install the DRA driver automatically when needed. This section covers installing or debugging the DRA driver manually, outside of the test framework.

### Step 1: Add NVIDIA Helm Repository

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```

### Step 2: Label GPU Nodes

```bash
# Label all GPU nodes for DRA kubelet plugin scheduling
for node in $(oc get nodes -l nvidia.com/gpu.present=true -o name); do
  oc label "$node" nvidia.com/dra-kubelet-plugin=true --overwrite
done

# Verify labels
oc get nodes -l nvidia.com/dra-kubelet-plugin=true
```

This label ensures the DRA kubelet plugin runs only on GPU nodes, and it works around NVIDIA Driver Manager eviction issues.

### Step 3: Install DRA Driver

```bash
# Create namespace
oc create namespace nvidia-dra-driver-gpu

# Grant SCC permissions
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-controller \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z nvidia-dra-driver-gpu-service-account-kubeletplugin \
  -n nvidia-dra-driver-gpu
oc adm policy add-scc-to-user privileged \
  -z compute-domain-daemon-service-account \
  -n nvidia-dra-driver-gpu

# Install via Helm (pinned to the version used by the tests)
# Version can be overridden via the NVIDIA_DRA_DRIVER_VERSION environment variable
# Note: the nodeSelector key is single-quoted so the shell preserves the
# backslash and Helm treats the dot in "nvidia.com" as literal
NVIDIA_DRA_DRIVER_VERSION=${NVIDIA_DRA_DRIVER_VERSION:-25.12.0}
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --version "${NVIDIA_DRA_DRIVER_VERSION}" \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  --set image.pullPolicy=IfNotPresent \
  --set-string 'kubeletPlugin.nodeSelector.nvidia\.com/dra-kubelet-plugin=true' \
  --set controller.tolerations[0].key=node-role.kubernetes.io/master \
  --set controller.tolerations[0].operator=Exists \
  --set controller.tolerations[0].effect=NoSchedule \
  --set controller.tolerations[1].key=node-role.kubernetes.io/control-plane \
  --set controller.tolerations[1].operator=Exists \
  --set controller.tolerations[1].effect=NoSchedule \
  --wait --timeout 5m
```

### Step 4: Verify Installation

```bash
# Check DRA driver pods
oc get pods -n nvidia-dra-driver-gpu
# Expected: All pods should be Running

# Verify ResourceSlices are published
oc get resourceslices
# Should show at least 2 slices per GPU node
```
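
To inspect what a slice actually advertises (requires `jq`; the exact field layout varies with driver and API version):

```bash
# Dump the first advertised device from the first slice
oc get resourceslices -o json | jq '.items[0].spec.devices[0]'
```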

### Uninstalling DRA Driver

```bash
# Uninstall DRA driver
helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu --wait --timeout 5m
oc delete namespace nvidia-dra-driver-gpu

# Remove SCC permissions; if you granted them manually with
# `oc adm policy add-scc-to-user`, use `oc adm policy remove-scc-from-user`
# with the same arguments instead of deleting these bindings
oc delete clusterrolebinding \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-controller \
  nvidia-dra-privileged-nvidia-dra-driver-gpu-service-account-kubeletplugin \
  nvidia-dra-privileged-compute-domain-daemon-service-account
```

**Note**: The GPU Operator is NOT removed, as it is cluster infrastructure. ResourceSlices are cleaned up automatically.

## Troubleshooting

### GPU Operator not found

**Cause**: GPU Operator is not installed on the cluster.

**Solution**: Install the GPU Operator following the [official guide](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-ocp.html).

### Version mismatch error

**Cause**: The local origin checkout does not match the cluster's release commit.

**Solution**: Ensure your origin checkout matches the cluster version, or use the `./openshift-tests run-test` command, which bypasses version checks.

### DRA driver kubelet plugin stuck at Init:0/1

**Cause**: Wrong `nvidiaDriverRoot` setting.

**Solution**: Ensure `nvidiaDriverRoot=/run/nvidia/driver` (not `/`). The tests configure this automatically.
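
One way to correct the value in place, assuming the release name and namespace used in the installation steps above:

```bash
# Re-render the chart with the corrected driver root, keeping all other values
helm upgrade nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  -n nvidia-dra-driver-gpu \
  --reuse-values \
  --set nvidiaDriverRoot=/run/nvidia/driver
```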

### ResourceSlices not appearing

**Cause**: DRA driver not fully initialized or missing SCC permissions.

**Solution**:
```bash
# Check DRA driver logs
oc logs -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu --all-containers

# Verify SCC permissions
oc describe scc privileged | grep nvidia-dra-driver-gpu

# Restart DRA driver if needed
oc delete pod -n nvidia-dra-driver-gpu -l app.kubernetes.io/name=nvidia-dra-driver-gpu
```

## References

- **NVIDIA GPU Operator**: https://github.com/NVIDIA/gpu-operator
- **NVIDIA DRA Driver**: https://github.com/NVIDIA/k8s-dra-driver-gpu
- **Kubernetes DRA Documentation**: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- **OpenShift Extended Tests**: https://github.com/openshift/origin/tree/master/test/extended

---

**Tested On**: OpenShift 4.21.0, Kubernetes 1.34.2, Tesla T4