Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
77 commits
Select commit Hold shift + click to select a range
55f26f2
Switched to new GPU Codegen.
philip-paul-mueller May 6, 2026
574ed87
Was this the error.
philip-paul-mueller May 6, 2026
7a96e6b
This should be the thing.
philip-paul-mueller May 6, 2026
48a8f1c
Let's try this fix.
philip-paul-mueller May 7, 2026
6d52e24
Let's try this fix.
philip-paul-mueller May 7, 2026
246cc24
Merge remote-tracking branch 'origin/main' into dace_new_gpu_codegen
philip-paul-mueller May 7, 2026
cc69d9a
Updated DaCe dependency.
philip-paul-mueller May 11, 2026
1d69a81
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller May 26, 2026
038d47a
Updated the DaCe version.
philip-paul-mueller May 26, 2026
815006e
Updated the DaCe version again.
philip-paul-mueller May 26, 2026
446e5cf
This should handle the issue.
philip-paul-mueller May 27, 2026
21e39f7
Not always a transient.
philip-paul-mueller May 27, 2026
e599d19
Realized that I have not yet used the new GPU code generator.
philip-paul-mueller May 27, 2026
597d4dc
Updated DaCe
philip-paul-mueller May 27, 2026
9d3138d
Fixed some double references in subsets.
philip-paul-mueller May 27, 2026
3261016
Fixed some more double references in subsets.
edopao May 27, 2026
2cf71af
Merge branch 'main' into dace_new_gpu_codegen
edopao May 27, 2026
ad06ab1
Merge branch 'main' into dace_new_gpu_codegen
edopao May 27, 2026
670d254
Removed dublicate Memlet.
philip-paul-mueller May 28, 2026
b4357bf
Merge branch 'main' into dace_new_gpu_codegen
edopao May 28, 2026
77d999d
Merge branch 'main' into dace_new_gpu_codegen
edopao May 28, 2026
28cd530
Merge branch 'main' into dace_new_gpu_codegen
edopao May 29, 2026
8d039a2
Let's test this fix.
philip-paul-mueller Jun 2, 2026
55f04b1
Made the text a bit better.
philip-paul-mueller Jun 2, 2026
bd61271
Also fixed that problem.
philip-paul-mueller Jun 2, 2026
84d4ac2
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 2, 2026
e9d68aa
This fixes the issue.
philip-paul-mueller Jun 2, 2026
9a85292
Adapted the tests.
philip-paul-mueller Jun 2, 2026
6985e62
Potential fix for pull request finding
philip-paul-mueller Jun 2, 2026
cbd4298
Applied fix to make the unit tests pass.
philip-paul-mueller Jun 2, 2026
5e6a606
Made it a bit better.
philip-paul-mueller Jun 2, 2026
3db04cc
Merge branch 'fixed_dace_gpu_symbol_issue' into dace_new_gpu_codegen
philip-paul-mueller Jun 2, 2026
162eca9
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 3, 2026
01a789e
Used the newest DaCe version.
philip-paul-mueller Jun 3, 2026
a11cbb3
Updated DaCe.
philip-paul-mueller Jun 3, 2026
768c937
Updated DaCe.
philip-paul-mueller Jun 3, 2026
350c791
Updated the CodeGen.
philip-paul-mueller Jun 4, 2026
23ccced
Used Yakup's way to ensure that the new code gen is used, not sure if…
philip-paul-mueller Jun 4, 2026
8abcc3c
Disable the generation of some functions that we do not need.
philip-paul-mueller Jun 4, 2026
8827204
Skipped some more tests.
philip-paul-mueller Jun 4, 2026
4573cd2
Updated DaCe and removed the fix for the expansion bug as it should b…
philip-paul-mueller Jun 4, 2026
3f9cef3
Updated DaCe, let's ask CI.
philip-paul-mueller Jun 4, 2026
895ef4d
Let's reduce the things we actually test.
philip-paul-mueller Jun 4, 2026
6418859
Let's artificially restrict the numbers of tests we do. let's see wha…
philip-paul-mueller Jun 4, 2026
d5732e4
Let's even more restrict it.
philip-paul-mueller Jun 4, 2026
7d60d00
Even more restrictive
philip-paul-mueller Jun 4, 2026
446e905
Updated DaCe
philip-paul-mueller Jun 5, 2026
1cec818
Updated DaCe.
philip-paul-mueller Jun 5, 2026
d34c3c4
Again increased what should be done.
philip-paul-mueller Jun 5, 2026
207374b
Updated DaCe, try it again.
philip-paul-mueller Jun 8, 2026
e304a4d
Updated DaCe.
philip-paul-mueller Jun 8, 2026
4041540
Again an update for DaCe.
philip-paul-mueller Jun 8, 2026
fb11d90
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 8, 2026
7c7b39d
CI looks okay, but some jobs were killed, increase time limit to see …
philip-paul-mueller Jun 8, 2026
64e7316
Disabled the memory pool for non CUDA GPU backends.
philip-paul-mueller Jun 8, 2026
a32ff5a
Restored the original CI configuration.
philip-paul-mueller Jun 9, 2026
f4a077b
Undo the merge with `main` to see if that is the problem as it fails.
philip-paul-mueller Jun 9, 2026
0120fb7
Disabled cartesian in GPU tests.
philip-paul-mueller Jun 9, 2026
d756308
The CUDA CI passes but there are some tests failing on AMD.
philip-paul-mueller Jun 9, 2026
79bad8b
Fix filtering of nodes
edopao Jun 9, 2026
48b9652
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 9, 2026
f074f21
Merge remote-tracking branch 'gt4py/dace_fix_gpu_tx_markers' into dac…
philip-paul-mueller Jun 9, 2026
4298d40
Run into timeout.
philip-paul-mueller Jun 9, 2026
e9107ba
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 9, 2026
475808e
Merge remote-tracking branch 'gt4py/main' into dace_new_gpu_codegen
philip-paul-mueller Jun 10, 2026
3ec3768
Updated DaCe to the newest version.
philip-paul-mueller Jun 10, 2026
bdca888
Updated DaCe
philip-paul-mueller Jun 10, 2026
ad13299
Updated DaCe.
philip-paul-mueller Jun 10, 2026
6624360
Updated.
philip-paul-mueller Jun 10, 2026
55d6958
Fixed it now.
philip-paul-mueller Jun 10, 2026
b9d8eb2
Updated DaCe.
philip-paul-mueller Jun 10, 2026
ae0918a
Updated DaCe.
philip-paul-mueller Jun 11, 2026
30c542a
Prevented the creation of sync tasklet.
philip-paul-mueller Jun 15, 2026
4681ee8
Updated DaCe.
philip-paul-mueller Jun 15, 2026
e5a5c49
Updated version of DaCe.
philip-paul-mueller Jun 15, 2026
7835d00
Updated dace.
philip-paul-mueller Jun 15, 2026
f039347
Updated DacCe.
philip-paul-mueller Jun 15, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 12 additions & 5 deletions ci/cscs-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ stages:
DOCKER_BUILD_ARGS: '["BASE_IMAGE", "CACHE_DIR", "EXTRA_APTGET", "EXTRA_UV_ENV_VARS", "EXTRA_UV_PIP_ARGS", "EXTRA_UV_SYNC_ARGS", "PY_VERSION", "UV_VERSION", "WORKDIR_PATH" ]'
PERSIST_IMAGE_NAME: ${CSCS_REGISTRY_PATH}/public/${ARCH}/base/gt4py-ci-${PY_VERSION} # The $DOCKER_TAG tag is added in the before_script of .dynamic-image-name
WATCH_FILECHANGES: 'ci/Dockerfile ci/cscs-ci.yml uv.lock'
DACE_compiler_cuda_implementation: experimental
parallel:
matrix:
- PY_VERSION: *test_python_versions
Expand All @@ -57,6 +58,7 @@ stages:
# jfrog.svc.cscs.ch/dockerhub/nvidia is the cached version of docker.io/nvidia
BASE_IMAGE: jfrog.svc.cscs.ch/dockerhub/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
EXTRA_UV_SYNC_ARGS: "--extra cuda12"
DACE_compiler_cuda_implementation: experimental

.build_extra_rocm:
variables:
Expand All @@ -66,6 +68,7 @@ stages:
EXTRA_UV_ENV_VARS: "CUPY_INSTALL_USE_HIP=1 HCC_AMDGPU_TARGET=gfx942 ROCM_HOME=/opt/rocm"
KUBERNETES_MEMORY_REQUEST: "64Gi"
KUBERNETES_MEMORY_LIMIT: "64Gi"
DACE_compiler_cuda_implementation: experimental

build_cscs_gh200:
extends:
Expand All @@ -89,13 +92,15 @@ build_cscs_amd_rocm:
TEST_VARIANTS: 'cpu' # Extended jobs should redefine which variants (cpu, cuda12, rocm6) to test
USE_MPI: 0 # TODO(havogt): to workaround the libfabric hook injecting incompatible libraries
SLURM_JOB_NUM_NODES: 1
SLURM_TIMELIMIT: 5
SLURM_TIMELIMIT: 10
DACE_compiler_cuda_implementation: experimental
parallel:
matrix:
- SUBPACKAGE: [cartesian]
VARIANT: ['internal', 'dace']
SUBVARIANT: ['cuda12', 'rocm7', 'cpu']
PY_VERSION: *test_python_versions
# TODO(phimuell): `cartesian` does not work with the new code generator, no idea why.
#- SUBPACKAGE: [cartesian]
# VARIANT: ['internal', 'dace']
# SUBVARIANT: ['cuda12', 'rocm7', 'cpu']
# PY_VERSION: *test_python_versions
- SUBPACKAGE: eve
PY_VERSION: *test_python_versions
- SUBPACKAGE: next
Expand Down Expand Up @@ -139,6 +144,7 @@ test_cscs_gh200:
GT4PY_BUILD_JOBS: 8
# Limit test parallelism to avoid "OSError: too many open files" in the gt4py build stage.
PYTEST_XDIST_AUTO_NUM_WORKERS: 32
DACE_compiler_cuda_implementation: experimental
rules:
- *exclude_variants_rules
- if: $SUBPACKAGE == 'next' && $VARIANT == 'dace' && $DETAIL == 'nomesh'
Expand Down Expand Up @@ -167,6 +173,7 @@ test_cscs_amd_rocm:
CMAKE_PREFIX_PATH: /opt/rocm # for next
CUDA_HOME: /opt/rocm # for cartesian
SLURM_TIMELIMIT: 20 # relaxed relative to gh200 as there is no pressure on the queue
DACE_compiler_cuda_implementation: experimental
rules:
- *exclude_variants_rules
- if: $SUBPACKAGE == 'cartesian' && $VARIANT == 'internal' && $SUBVARIANT == 'cpu'
Expand Down
5 changes: 4 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,7 @@ dependencies = [
'click>=8.0.0',
'cmake>=3.22',
'cytoolz>=1.0.1',
'dace>=2.0.0a3',
'dace==2.3.24',
'deepdiff>=8.1.0',
'devtools>=0.6',
'factory-boy>=3.3.3',
Expand Down Expand Up @@ -478,6 +478,9 @@ url = 'https://gridtools.github.io/pypi/'
# dace = {index = "gridtools"}
[tool.uv.sources]
atlas4py = {index = "test.pypi"}
dace = [
{git = "https://github.com/philip-paul-mueller/dace", branch = "phimuell__new-gpu-codegen-dev"}
]

# -- versioningit --
[tool.versioningit]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1057,6 +1057,16 @@ def gt_gpu_apply_mempool(sdfg: dace.SDFG) -> None:
Args:
sdfg: The SDFG that should be processed.
"""

# TODO(phimuell): Reverse once the new codegen has caught up.
gpu_backend = dace.Config.get("compiler.cuda.backend")
if gpu_backend != "cuda":
warnings.warn(
f"GPU Memory-Pool is only implemented for `CUDA` and not for `{gpu_backend}`.",
stacklevel=0,
)
return

for _, _, desc in sdfg.arrays_recursive():
if (
isinstance(desc, dace.data.Array)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,15 @@ def set_dace_config(
# This setting allows to throw an exception if any implicit Copy-Map slips thorugh.
dace.Config.set("compiler.cuda.allow_implicit_memlet_to_map", value=False)

# Use the new GPU code generator
# NOTE: In the CI file we export the variable to force the experimental code gen to be used.
dace.Config.set("compiler.cuda.implementation", value="experimental")

# Skip GPU Sync at the end.
# NOTE: That this will most likely break the UNIT tests, but should not be a problem
# for the blueline.
dace.Config.set("compiler", "cuda", "synchronize_on_exit", value=False)

# In some stencils, for example `apply_diffusion_to_w`, the cuda codegen messes
# up with the cuda streams, i.e. it allocates N streams but uses N+1. The first
# idea was to use just one stream. However, even in that case the generator
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,10 @@ def make_sdfg_call_async(sdfg: dace.SDFG, gpu: bool) -> None:
Todo: Revisit this function once DaCe changes its behaviour in this regard.
"""

# TODO(phimuell, edopao): Revisit this function after we understand the new
# code generator better.
return

# This is only a problem on GPU.
# TODO(phimuell): Figuring out what about OpenMP.
if not gpu:
Expand Down Expand Up @@ -282,6 +286,10 @@ def make_sdfg_call_sync(sdfg: dace.SDFG, gpu: bool) -> None:
work that runs on the GPU. Furthermore, all work is scheduled on the default stream.
"""

# TODO(phimuell, edopao): Revisit this function after we understand the new
# code generator better.
return

if not gpu:
# This is only a problem on GPU. Dace uses OpenMP on CPU and
# the OpenMP parallel region creates a synchronization point.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -201,6 +201,7 @@ def _check_cpu_sdfg_call(sdfg: dace.SDFG) -> None:
assert not _are_streams_synchronized(sdfg)


@pytest.mark.skip("To revisit after switch to new code gen.")
@pytest.mark.parametrize(
"make_async_sdfg_call",
[False, True],
Expand Down Expand Up @@ -242,6 +243,7 @@ def test_generate_sdfg_async_call(make_async_sdfg_call: bool, device_type: core_
_check_sdfg_without_async_call(sdfg)


@pytest.mark.skip("To revisit after switch to new code gen.")
def test_generate_sdfg_async_call_no_map(device_type: core_defs.DeviceType):
"""Verify that the flag `async_sdfg_call=True` has no effect on an SDFG that does not contain any GPU map."""

Expand Down Expand Up @@ -367,6 +369,7 @@ def _make_multi_state_sdfg_3(
return sdfg, first_state, second_state


@pytest.mark.skip("To revisit after switch to new code gen.")
@pytest.mark.parametrize(
"multi_state_config",
[
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,6 @@ def test_set_gpu_properties(method: int):
sdfg = dace.SDFG(gtx_transformations_utils.unique_name("gpu_properties_test"))
state = sdfg.add_state(is_start_block=True)

map_entries: dict[int, dace_nodes.MapEntry] = {}
for dim in [1, 2, 3, 4]:
shape = (10,) * dim
sdfg.add_array(
Expand All @@ -168,15 +167,15 @@ def test_set_gpu_properties(method: int):
sdfg.add_array(
f"B_{dim}", shape=shape, dtype=dace.float64, storage=dace.StorageType.GPU_Global
)
_, me, _ = state.add_mapped_tasklet(
state.add_mapped_tasklet(
f"map_{dim}",
map_ranges={f"__i{i}": f"0:{s}" for i, s in enumerate(shape)},
inputs={"__in": dace.Memlet(f"A_{dim}[{','.join(f'__i{i}' for i in range(dim))}]")},
code="__out = math.cos(__in)",
outputs={"__out": dace.Memlet(f"B_{dim}[{','.join(f'__i{i}' for i in range(dim))}]")},
external_edges=True,
)
map_entries[dim] = me
del state
sdfg.validate()

if method == 0:
Expand Down Expand Up @@ -204,6 +203,11 @@ def test_set_gpu_properties(method: int):
else:
raise ValueError(f"Unknown method {method}")

# Because of the inplace reconstruction all references to graph objects are destroyed.
map_entries: dict[int, dace_nodes.MapEntry] = {}
for node in sdfg.states()[0].nodes():
if isinstance(node, dace_nodes.MapEntry):
map_entries[int(node.label[4])] = node
map1, map2, map3, map4 = (map_entries[d].map for d in [1, 2, 3, 4])

# It takes the normal block size and does not regulate anything.
Expand Down Expand Up @@ -259,6 +263,7 @@ def test_set_gpu_properties_1D():
map_entries[dim] = me
sdfg.validate()

# `get_set_gpu_blocksize()` is non destructive, so `map_entries` are still pointing into the SDFG.
sdfg.apply_gpu_transformations()
gtx_dace_fieldview_gpu_utils.gt_set_gpu_blocksize(
sdfg=sdfg,
Expand Down Expand Up @@ -323,6 +328,7 @@ def test_set_gpu_properties_2D_3D():
map_entries[dim] = me
sdfg.validate()

# `get_set_gpu_blocksize()` is non destructive, so `map_entries` are still pointing into the SDFG.
sdfg.apply_gpu_transformations()
gtx_dace_fieldview_gpu_utils.gt_set_gpu_blocksize(
sdfg=sdfg,
Expand Down
7 changes: 3 additions & 4 deletions uv.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading