Python and NVIDIA CUDA have long been friends. Over the last year, NVIDIA teams have been working to improve the Pythonista's experience: a top-to-bottom update to the CUDA platform, the software fueling the GenAI movement (think Llama 3, GPT, and NeMo). These improvements will let Python teams use new CUDA features on day one. While the work is still in progress, much of it is already available on GitHub.

The CUDA software stack starts with the driver. Its Python bindings are hosted in the cuda-python project, with source available on GitHub. These bindings are generated from the CUDA C++ source code and offer a low-level set of Python functions and Cython definitions.
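
As a quick sketch of what the low-level bindings look like in practice (the status-code-first tuple convention below is how cuda-python reports errors; checking it is the caller's job):

from cuda import cuda

# Initialize the driver; the bindings return a status code instead of raising.
err, = cuda.cuInit(0)
assert err == cuda.CUresult.CUDA_SUCCESS

# Query results come back alongside the status code.
err, count = cuda.cuDeviceGetCount()
assert err == cuda.CUresult.CUDA_SUCCESS
print(f"Found {count} CUDA device(s)")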

The average Python project will almost certainly wrap these bindings in its own abstraction layer; after all, they provide no Pythonic conveniences such as exception handling or JIT compilation integrations. To address this, the CUDA driver bindings are moving into a cuda.bindings module, and a new object layer is being introduced.
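
To make concrete the gap the object layer fills, here is the kind of boilerplate projects write around the raw bindings today; check_driver is a hypothetical helper, not part of cuda-python:

from cuda import cuda

def check_driver(result):
    # Hypothetical helper: turn the bindings' status-code tuples into exceptions.
    err, *values = result
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver error: {err}")
    return values[0] if len(values) == 1 else values

check_driver(cuda.cuInit(0))
device = check_driver(cuda.cuDeviceGet(0))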

Having seen the numerous ways projects choose a default stream, and the incompatibilities between kernel generators, the team is releasing a new module called cuda.core. Last week, cuda.core 0.2.0 was released with objects managing the Device, Kernel, Event, Stream, and LaunchConfig. These give the Pythonista a natural way of defining CUDA programs without shelling out to a C++ compiler or relying on the embedded languages often found in such tools.
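
A minimal sketch of those objects on their own (cuda.core currently ships under an experimental namespace; Stream.record() returning an Event follows the cuda.core documentation):

from cuda.core.experimental import Device

dev = Device()           # device ordinal 0 by default
dev.set_current()        # make it current on this thread
s = dev.create_stream()  # a stream owned by this device

e = s.record()           # record an Event on the stream
e.sync()                 # block until the event has been reached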

Additionally, the system lets users control kernel compilation through objects managing a program and a linker.
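
As a sketch of that control (option names follow the cuda.core documentation; the trivial kernel is only for illustration), the same source can be compiled to PTX instead of a fully linked cubin:

from cuda.core.experimental import Device, Program, ProgramOptions

dev = Device()
dev.set_current()

code = 'extern "C" __global__ void noop() {}'

# Pin the language standard and target architecture at compile time.
arch = "sm_" + "".join(str(i) for i in dev.compute_capability)
prog = Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch=arch))
ptx = prog.compile("ptx")  # or "cubin" for a linked, loadable binary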

The new API is described in the documentation. The following code gives a basic SAXPY example using a C++-defined kernel:

import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

# compute out = a * x + y
code = """
template<typename T>
__global__ void saxpy(const T a, const T* x, const T* y, T* out, size_t N) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // grid-stride loop: each thread covers elements tid, tid + stride, ...
    for (size_t i = tid; i < N; i += gridDim.x * blockDim.x) {
        out[i] = a * x[i] + y[i];
    }
}
"""

# Prepare Device and compile the kernel Program
dev = Device()
dev.set_current()
s = dev.create_stream()
arch = "sm_" + "".join(str(i) for i in dev.compute_capability)
prog = Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch=arch))
mod = prog.compile("cubin", name_expressions=("saxpy<float>", "saxpy<double>"))
ker = mod.get_kernel("saxpy<float>")
dtype = cp.float32

# prepare input/output
size = cp.uint64(64)
a = dtype(10)
rng = cp.random.default_rng()
x = rng.random(size, dtype=dtype)
y = rng.random(size, dtype=dtype)
out = cp.empty_like(x)
dev.sync()  # cupy runs on a different stream from s, so sync before accessing

# prepare launch and launch
block = 32
grid = int((size + block - 1) // block)
config = LaunchConfig(grid=grid, block=block)
ker_args = (a, x.data.ptr, y.data.ptr, out.data.ptr, size)
launch(s, config, ker, *ker_args)  # as of cuda.core 0.2.0, the stream comes first
s.sync()

# check result
assert cp.allclose(out, a * x + y)

This example shows a basic use case combining the different objects: we take a C++ kernel and programmatically compile it for the device. To keep things simple, we use CuPy to generate random input arrays, launch the kernel, and check the result for accuracy.

Managing the device and compiling kernel functions is a step forward in helping Python teams build interoperable code. Up next is support for cooperative threads and parallel algorithms.

Stay tuned as the team shares more news on the road toward a Python-native CUDA at NVIDIA GTC 2025.
