What
This proposal introduces a performance optimization for the phase_encode_kernel that uses LLVM with Just-In-Time (JIT) compilation and the CUDA Driver API. Initial measurements suggest the refactoring can yield a speedup of at least 5x over the current implementation; the measurement below shows roughly 15x for kernel execution time.
| Implementation | Kernel execution time | Gain |
| --- | --- | --- |
| CUDA C Runtime API | 0.302787 ms | 1x |
| LLVM + JIT | 0.0195 ms | ~15x |
Why
The goal of this optimization is to make fuller use of the available GPU hardware. Compiling the phase_encode_kernel at runtime and launching it through the CUDA Driver API should reduce compute time and improve resource utilization for workloads that rely on this kernel.
How
- LLVM + JIT Refactoring: Refactor the existing phase_encode_kernel function to generate LLVM Intermediate Representation (IR) dynamically, allowing for runtime optimization tailored to specific data shapes and parameters.
- Execution Pipeline: Finalize the execution pipeline so that the JIT-compiled LLVM kernel binds cleanly to the CUDA driver, including proper memory mapping.
- Tentative plan:
  - Step 1: Bridge LLVM JIT with Rust Core
  - Step 2: Load the PTX with the CUDA Driver API
  - Step 3: Wire into the Mahout Pipeline
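Because the kernel would be specialized for specific data shapes and parameters, the Rust side probably needs to avoid recompiling for a shape it has already seen. The following is a minimal sketch of one way to do that, not the proposed implementation: a cache keyed by the specialization parameters. The names `KernelKey`, `JitCache`, `num_qubits`, `batch_size`, and `fake_compile` are all hypothetical, and `fake_compile` only returns a placeholder string where the real LLVM pipeline would return PTX.

```rust
use std::collections::HashMap;

/// Hypothetical specialization key: the parameters a kernel is compiled for.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct KernelKey {
    num_qubits: u32,
    batch_size: u32,
}

/// Caches one compiled artifact (PTX text in this sketch) per key.
struct JitCache<F: Fn(KernelKey) -> String> {
    compile: F,
    compiled: HashMap<KernelKey, String>,
    compile_count: usize,
}

impl<F: Fn(KernelKey) -> String> JitCache<F> {
    fn new(compile: F) -> Self {
        Self { compile, compiled: HashMap::new(), compile_count: 0 }
    }

    /// Returns the cached artifact for `key`, compiling it on first use only.
    fn get_or_compile(&mut self, key: KernelKey) -> &str {
        if !self.compiled.contains_key(&key) {
            let ptx = (self.compile)(key);
            self.compiled.insert(key, ptx);
            self.compile_count += 1;
        }
        &self.compiled[&key]
    }
}

/// Stand-in for the real LLVM JIT step (assumption): emits a placeholder
/// comment string, not valid PTX.
fn fake_compile(key: KernelKey) -> String {
    format!(
        "// placeholder for phase_encode_kernel, num_qubits={}, batch_size={}",
        key.num_qubits, key.batch_size
    )
}

fn main() {
    let mut cache = JitCache::new(fake_compile);
    let key = KernelKey { num_qubits: 4, batch_size: 128 };

    cache.get_or_compile(key); // compiles
    cache.get_or_compile(key); // cache hit, no recompilation
    cache.get_or_compile(KernelKey { num_qubits: 5, batch_size: 128 }); // new shape

    println!("compilations: {}", cache.compile_count); // prints "compilations: 2"
}
```

In a real pipeline the cached value would more likely be a loaded module or function handle from the CUDA Driver API than PTX text, so that module loading is also paid once per shape.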
Q
- The phase-encode kernel itself is refactored in C++ using LLVM and JIT, so it is expected to be stored as a `.cpp` file. Should it live alongside the original `.cu` file, or in a separate folder?
- The CUDA Driver API code will be added. Should it live with the original kernel, or in a separate `.rs` file?
- If Mahout accepts this merge, should the kernel be renamed, or should a feature like `define` be used to switch between the two implementations?
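For the last question, one option on the Rust side is a compile-time switch. The sketch below assumes a hypothetical Cargo feature named `llvm-jit`; the feature name, the `KernelBackend` enum, and `selected_backend` are illustrative only and do not exist in Mahout today.

```rust
/// Which implementation of the phase-encode kernel a build dispatches to.
#[derive(Debug, PartialEq, Eq)]
enum KernelBackend {
    CudaRuntime,
    LlvmJit,
}

/// Chooses the backend at compile time from the hypothetical `llvm-jit`
/// Cargo feature. `cfg!` expands to a boolean constant, so both branches
/// are still type-checked in every build.
fn selected_backend() -> KernelBackend {
    if cfg!(feature = "llvm-jit") {
        KernelBackend::LlvmJit
    } else {
        KernelBackend::CudaRuntime
    }
}

fn main() {
    println!("phase_encode backend: {:?}", selected_backend());
}
```

With this approach the existing kernel keeps its name and remains the default, and the JIT path is opted into with `cargo build --features llvm-jit`. Renaming the kernel instead would let both paths be called from the same build, which may be more convenient for benchmarking them side by side.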