Allow for CUDA driver minor version compatibility #226

Open
ocaisa wants to merge 5 commits into EESSI:main from ocaisa:cuda_error_to_warning

Member

ocaisa commented May 8, 2026

It works, but is a little too chatty:

{EESSI/2023.06} [aocais00@lrdn3368 software-layer-scripts]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0        /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0         /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

Member Author

ocaisa commented May 8, 2026

Ok, I think we are ready for showtime here:

[aocais00@login07 software-layer-scripts]$ git pull
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 325 bytes | 40.00 KiB/s, done.
From https://github.com/ocaisa/software-layer-scripts
   288ce97..4b8b034  cuda_error_to_warning -> origin/cuda_error_to_warning
Updating 288ce97..4b8b034
Fast-forward
 create_lmodsitepackage.py | 1 +
 1 file changed, 1 insertion(+)
[aocais00@login07 software-layer-scripts]$ rm -r test
[aocais00@login07 software-layer-scripts]$ mkdir test
[aocais00@login07 software-layer-scripts]$ ./create_lmodsitepackage.py test
test/.lmod/SitePackage.lua
[aocais00@login07 software-layer-scripts]$ srun -p boost_usr_prod --cpus-per-task=10 -N 1 --ntasks-per-node=1 --gres=gpu:1 -J test_eessi --account EUHPC_D30_076 -t 0:10:00 --pty /bin/bash
[aocais00@lrdn0052 software-layer-scripts]$ clush -w "$SLURM_JOB_NODELIST" /usr/local/bin/afuse_cvmfs2_helper
[aocais00@lrdn0052 software-layer-scripts]$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
Modules purged before initialising EESSI
Module for EESSI/2023.06 loaded successfully
EESSI has selected x86_64/intel/icelake as the compatible CPU target for EESSI/2023.06
EESSI has selected accel/nvidia/cc80 as the compatible accelerator target for EESSI/2023.06
(for debug information when loading the EESSI module, set the environment variable EESSI_MODULE_DEBUG_INIT)
{EESSI/2023.06} [aocais00@lrdn0052 software-layer-scripts]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
Lmod has detected the following error:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. Please update your CUDA driver libraries and then let EESSI know about the update.
For more information on how to do this, see https://www.eessi.io/docs/site_specific_config/gpu/.

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0        /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

{EESSI/2023.06} [aocais00@lrdn0052 software-layer-scripts]$ export LMOD_PACKAGE_PATH=$PWD/test/.lmod
{EESSI/2023.06} [aocais00@lrdn0052 software-layer-scripts]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0        /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
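
The two runs above differ only in how a driver that is older than the module's CUDA version is treated: an error before, a warning after. A minimal Python sketch of that decision follows. It is not the Lua code from this PR; the function names are hypothetical, and the rule that a driver from an older major release still produces an error is an assumption based on NVIDIA's description of minor version compatibility applying within a major release.

```python
def parse_version(text):
    """Return (major, minor) from a version string such as '12.2' or '12.4.0'."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def classify(driver_version, required_version):
    """Classify a driver/module CUDA version pair (hypothetical helper).

    'ok'      : the driver is at least as new as the module's CUDA version
    'warning' : same major release, older minor (minor version compatibility)
    'error'   : the driver belongs to an older major release (assumed rule)
    """
    driver = parse_version(driver_version)
    required = parse_version(required_version)
    if driver >= required:
        return "ok"
    if driver[0] == required[0]:
        return "warning"
    return "error"


print(classify("12.2", "12.4.0"))  # the case shown in the terminal output above
```

With the versions from the session above (driver 12.2, module CUDA 12.4.0) this sketch returns "warning".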

Comment thread create_lmodsitepackage.py
Comment on lines -164 to -165
local eessi_version = os.getenv('EESSI_VERSION') or ""
local eessi_eprefix = os.getenv("EESSI_EPREFIX") or ""
Member Author

The test below already checks for nil, so these defaults are no longer needed.

Comment thread create_lmodsitepackage.py
-- even if it fails to set EESSI_CUDA_DRIVER_VERSION
-- Essentially, we handle that case here by raising an error, which can be suppressed
if not cudaVersion or cudaVersion == "" then
-- Having EESSICUDAVERSION set means we have an NVIDIA accelerator
Member Author

Using this with an eye to the future, when we will have to deal with something similar for AMD.

Comment thread create_lmodsitepackage.py Outdated
warn = warn .. "https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .\\n"
if not suppress_warn or suppress_warn == 1 then
    LmodWarning("\\nYour driver CUDA version is ", cudaVersion, " ", warn)
    setenv(suppress_var, "1")
Member Author

Because of the way this is set up, this warning will only get printed once per session (the variable never gets unset, even after a module purge).
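
The once-per-session behaviour can be modelled outside Lmod with a plain dictionary standing in for the shell environment. This is a sketch only: the variable name `EESSI_CUDA_COMPAT_WARNED` is hypothetical (the hunk above only shows `suppress_var`), and `warn_once` is an illustrative helper, not part of the PR.

```python
SUPPRESS_VAR = "EESSI_CUDA_COMPAT_WARNED"  # hypothetical name, not taken from the PR


def warn_once(env, message, emit=print):
    """Emit `message` unless the suppression variable is already set in `env`.

    Returns True if the warning was emitted, False if it was suppressed.
    """
    if env.get(SUPPRESS_VAR) == "1":
        return False
    emit(message)
    # Nothing ever removes this entry again, so later loads stay silent.
    env[SUPPRESS_VAR] = "1"
    return True


env = {}
warn_once(env, "Your driver CUDA version is 12.2 ...")  # emitted
warn_once(env, "Your driver CUDA version is 12.2 ...")  # suppressed
```

Because the entry is only ever added, the second and every later call is silent, which mirrors the behaviour described in the comment above.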

Member Author

I tried to work around this, but it is quite complicated: in unload mode, if the variable is set to the name of the module being unloaded, I want it unset.
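
The intended behaviour described here can be written down as a small state model. The sketch below is an assumption about the intent, not code from this PR and not a description of how Lmod evaluates hooks; `on_module_event`, the `mode` strings, and the variable name are all hypothetical.

```python
SUPPRESS_VAR = "EESSI_CUDA_COMPAT_WARNED_BY"  # hypothetical name


def on_module_event(env, mode, module_name, message, emit=print):
    """Model of 'warn once, and re-arm when the warning module is unloaded'.

    On load: the first module to hit the check emits the warning and records
    its own name. On unload: only that same module clears the variable.
    """
    if mode == "load":
        if SUPPRESS_VAR not in env:
            emit(message)
            env[SUPPRESS_VAR] = module_name
    elif mode == "unload":
        if env.get(SUPPRESS_VAR) == module_name:
            del env[SUPPRESS_VAR]


env = {}
msg = "You will be in minor version compatibility mode"
on_module_event(env, "load", "UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0", msg)    # warns
on_module_event(env, "load", "NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0", msg)        # silent
on_module_event(env, "unload", "NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0", msg)      # keeps the variable
on_module_event(env, "unload", "UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0", msg)  # clears it
```

Recording the module name rather than a fixed "1" is what lets the unload step tell whether it is the module that originally emitted the warning.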
