Commit 0ac7498

cpcloud and claude committed
test(cufile): prime libcufile before parameter-set tests to avoid SIGFPE
Under pytest-randomly, the cuFile test module fatally crashes with SIGFPE in CUFileDrv::ReadVersionInfo (unguarded div %rcx with rcx=0) inside libcufile.so cuFileDriverOpen+0xe. The crash is deterministic given specific test orderings and was reproducible with seed 2758108007.

Root cause is a libcufile 1.17.1 bug. Calling cuFileSetParameterSizeT (or other pre-open configuration APIs) BEFORE the first cuFileDriverOpen leaves an internal version list uninitialized; the next driver_open then divides by its zero length.

Minimal repro:

    pytest tests/test_cufile.py::test_set_get_parameter_size_t \
           tests/test_cufile.py::test_buf_register_invalid_flags

Fix: add a module-scope autouse _cufile_driver_prewarm fixture that performs one driver_open/driver_close before any test in the module runs. That single cycle initializes libcufile's version list; both test regimes (driver-open tests via the function-scope `driver` fixture, and driver-closed parameter-set tests) then work under any ordering.

Also swap test_set_parameter_posix_pool_slab_array's inline driver_open/close for the `driver` fixture. pytest fixture ordering guarantees driver_config (which calls set_parameter_posix_pool_slab_array while closed) runs before `driver` opens, matching the previous manual ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
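The ordering dependence described in the commit message can be illustrated with a small pure-Python model. This is only a sketch: `FakeCuFile`, its `_versions` attribute, and `run_order` are hypothetical names invented for illustration, the internal behavior is assumed from the description above rather than taken from libcufile, and a Python `ZeroDivisionError` stands in for the native SIGFPE.

```python
class FakeCuFile:
    """Hypothetical stand-in for libcufile state; NOT the real library or its API."""

    def __init__(self):
        self._versions = None  # modeled "version list"; None means never touched
        self._is_open = False

    def set_parameter(self, name, value):
        # Parameter-set calls are rejected while the driver is open.
        if self._is_open:
            raise RuntimeError("DRIVER_ALREADY_OPEN (5026)")
        # Assumed bug model: configuring before the first open leaves the
        # version list present but empty.
        if self._versions is None:
            self._versions = []

    def driver_open(self):
        # A normal first open populates the version list.
        if self._versions is None:
            self._versions = [1, 17, 1]
        # Modeled unguarded division by the list length.
        _ = 100 // len(self._versions)
        self._is_open = True

    def driver_close(self):
        self._is_open = False


def run_order(prewarm):
    """Run a parameter-set call followed by a driver open, optionally prewarmed."""
    lib = FakeCuFile()
    if prewarm:
        # Equivalent of the module-scope prewarm: one open/close cycle up front.
        lib.driver_open()
        lib.driver_close()
    lib.set_parameter("posix_pool_slab", 4096)  # driver-closed test runs first
    try:
        lib.driver_open()  # a driver-open test runs next
    except ZeroDivisionError:
        return "crash"
    lib.driver_close()
    return "ok"


if __name__ == "__main__":
    print("without prewarm:", run_order(prewarm=False))
    print("with prewarm:   ", run_order(prewarm=True))
```

In this model the unprimed ordering fails on the second call, while a single open/close cycle up front makes the same ordering succeed, which is the property the prewarm fixture relies on.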
1 parent a8805e5 commit 0ac7498

1 file changed: +48 -8 lines changed

cuda_bindings/tests/test_cufile.py

Lines changed: 48 additions & 8 deletions
@@ -140,6 +140,52 @@ def ctx():
     cuda.cuDevicePrimaryCtxRelease(device)
 
 
+@pytest.fixture(scope="module", autouse=True)
+def _cufile_driver_prewarm():
+    """Prime libcufile with one driver_open/close cycle before any test runs.
+
+    The cuFile test module mixes two incompatible regimes:
+
+    - Driver-open tests (buf_register_*, cufile_read_write, batch_io, stats,
+      etc.) need cuFileDriverOpen; they use the function-scope `driver`
+      fixture to open/close per test.
+    - Driver-closed tests (test_set_get_parameter_*, test_set_parameter_posix_*)
+      must run with the driver CLOSED — libcufile rejects parameter-set calls
+      when the driver is open (DRIVER_ALREADY_OPEN, 5026).
+
+    Workaround for NVIDIA libcufile 1.17.1 bug: calling cuFileSetParameterSizeT
+    (or similar pre-open configuration APIs) BEFORE the first cuFileDriverOpen
+    leaves an internal version list uninitialized such that a later
+    cuFileDriverOpen SIGFPEs in CUFileDrv::ReadVersionInfo (div-by-zero).
+    Under random ordering, a driver-closed test can run before any
+    driver-open test, poisoning libcufile and tearing down pytest with a fatal
+    signal on the next driver_open.
+
+    One open/close cycle up front primes libcufile's version list. After that,
+    both regimes work: the per-test `driver` fixture can open/close freely,
+    and parameter-set tests run against the (now properly initialized) closed
+    driver.
+
+    Note: per-test driver_open/close is not ideal on throughput grounds, but
+    it is forced by the libcufile API — parameter-set tests cannot coexist
+    with a session-wide open driver.
+    """
+    (err,) = cuda.cuInit(0)
+    assert err == cuda.CUresult.CUDA_SUCCESS
+    err, device = cuda.cuDeviceGet(0)
+    assert err == cuda.CUresult.CUDA_SUCCESS
+    err, dctx = cuda.cuDevicePrimaryCtxRetain(device)
+    assert err == cuda.CUresult.CUDA_SUCCESS
+    (err,) = cuda.cuCtxSetCurrent(dctx)
+    assert err == cuda.CUresult.CUDA_SUCCESS
+    try:
+        cufile.driver_open()
+        cufile.driver_close()
+    finally:
+        cuda.cuDevicePrimaryCtxRelease(device)
+    yield
+
+
 @pytest.fixture
 def driver(ctx):
     cufile.driver_open()
@@ -1896,8 +1942,7 @@ def driver_config(slab_sizes, slab_counts):
 @pytest.mark.skipif(
     cufileVersionLessThan(1150), reason="cuFile parameter APIs require cuFile library version 13.0 or later"
 )
-@pytest.mark.usefixtures("ctx")
-def test_set_parameter_posix_pool_slab_array(slab_sizes, slab_counts, driver_config):
+def test_set_parameter_posix_pool_slab_array(slab_sizes, slab_counts, driver_config, driver):
     """Test cuFile POSIX pool slab array configuration."""
     # After setting parameters, retrieve them back to verify
     n_slab_sizes = len(slab_sizes)
@@ -1907,12 +1952,7 @@ def test_set_parameter_posix_pool_slab_array(slab_sizes, slab_counts, driver_con
     retrieved_sizes_addr = ctypes.addressof(retrieved_sizes)
     retrieved_counts_addr = ctypes.addressof(retrieved_counts)
 
-    # Open cuFile driver AFTER setting parameters
-    cufile.driver_open()
-    try:
-        cufile.get_parameter_posix_pool_slab_array(retrieved_sizes_addr, retrieved_counts_addr, n_slab_sizes)
-    finally:
-        cufile.driver_close()
+    cufile.get_parameter_posix_pool_slab_array(retrieved_sizes_addr, retrieved_counts_addr, n_slab_sizes)
 
     # Verify they match what we set
     assert list(retrieved_sizes) == slab_sizes
