Kernel Rules and Error Handling

This episode covers the rules for writing kernels and device functions, what language features can and cannot be used in device code, and how to handle errors in CUDA programs.

Objectives

  • Understand the distinction between __global__, __device__, and __host__ functions

  • Know the rules and restrictions for kernels and device functions

  • Know what language features are available in device code

  • Implement proper error handling for CUDA API calls and kernel launches

Instructor note

  • 30 min teaching

  • 15 min exercises

Kernels and device functions revisited

CUDA provides three function qualifiers that control where a function can be called from and where it executes:

| Qualifier | Called from | Executes on | Notes |
| --- | --- | --- | --- |
| __global__ | Host (or device with dynamic parallelism) | Device | Kernel; requires execution configuration <<<>>> |
| __device__ | Device | Device | Device function; can be called from kernels |
| __host__ | Host | Host | Regular host function (default if no qualifier) |
| __host__ __device__ | Both | Both | Compiled for both host and device |

In Fortran, the equivalents are attributes(global), attributes(device), attributes(host), and attributes(host, device).
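As an illustration, the qualifiers might be combined as follows (a minimal sketch; the function names are hypothetical):

```cuda
// Device function: callable only from device code.
__device__ double square(double x) { return x * x; }

// Compiled for both execution spaces; callable from host and device.
__host__ __device__ double twice(double x) { return 2.0 * x; }

// Kernel: launched from the host, runs on the device.
__global__ void compute(double* out, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) out[i] = twice(square(out[i]));
}

int main() {
    double* d;
    cudaMalloc(&d, 256 * sizeof(double));
    compute<<<1, 256>>>(d, 256);  // execution configuration is required
    cudaDeviceSynchronize();
    cudaFree(d);
}
```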

Key rules for kernels and device functions:

  • Kernels (__global__) cannot have return values and need an execution configuration to be called.

  • Functions with __host__ __device__ are compiled for both execution spaces. The compiler generates code for both host and device. Exception: constexpr/consteval functions may, with restrictions, be called across execution spaces.

  • Only device functions may be called from kernels or other device functions.

  • Pointers or passed-by-reference variables must point to device-accessible memory (device memory, managed memory, or mapped pinned memory). Otherwise, expect invalid memory accesses at run time, typically reported as an “unspecified launch failure”.

  • Variables passed by value to kernels are stored in constant memory and limited to 4 KB in total (raised to 32,764 bytes on Volta and newer GPUs with CUDA 12.1 or later). They are copied automatically — no explicit allocation or transfer is needed.

  • Variables declared in a kernel are private to each thread and stored in registers (or local memory if the compiler chooses).
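A minimal sketch illustrating these rules (the struct and kernel names are illustrative):

```cuda
// Small POD struct: passed by value, placed in constant memory automatically.
struct Params { double scale; int n; };

__global__ void scale_kernel(double* data /* must be a device-accessible pointer */,
                             Params p /* copied automatically, no cudaMemcpy needed */) {
    // Per-thread local variable, held in registers (or local memory).
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < p.n) data[i] *= p.scale;
}
```

Note that `data` must point to device, managed, or mapped pinned memory; passing a plain host pointer compiles but fails at run time.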

Argument passing

Argument passing works as usual in each language:

  • C/C++: Arguments are passed by value by default. Use pointers for pass-by-reference.

  • Fortran: Arguments are passed by reference by default. Use the value attribute for pass-by-value.

Rules for kernels (C/C++)

  • Kernels (__global__ functions) do not support recursion.

  • Non-trivially copyable or non-trivially-destructible argument types are allowed when launched from host code, but the processing does not follow the standard C++ model (see the programming guide for details).

  • Kernels cannot have a variable number of arguments.

  • Kernel parameters cannot be pass-by-reference (use pointers instead).
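For example, a kernel cannot take a reference parameter; the usual workaround is a pointer (an illustrative sketch):

```cuda
// Not allowed: __global__ void bad_kernel(double& result);

// Allowed: write the result through a device pointer instead.
__global__ void first_element(const double* in, double* result) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *result = in[0];
    }
}
```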

Rules for kernels (Fortran)

  • Return values of functions or intent(out)/intent(inout) variables cannot have the value attribute.

  • A subroutine or function with the device or global attribute may not be recursive, pure, or elemental.

  • Device and global subroutines/functions must appear within a Fortran module, and device functions may only be called from device subprograms in the same module.

  • They may not contain another subprogram.

  • They may not be contained in another subroutine or function.

What can be used in kernels

Available since early CUDA

  • All mathematical operators

  • Control flow constructs (if, for, while, switch/case, goto)

  • Single-precision transcendental mathematical functions (see section 17.1 in the Programming Guide)

  • Calls to device functions/subroutines

  • Pointers (C/C++), structs (C) and classes (C++), statically-sized arrays (C/C++, Fortran)

  • Assumed-shape arrays as kernel arguments (Fortran)

  • CUDA-specific built-in functions

  • C++ templates and function overloading
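As a quick example of the last point, kernels can be templated and instantiated per type (a hedged sketch with hypothetical names):

```cuda
template <typename T>
__global__ void fill(T* data, T value, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] = value;
}

// Instantiated like any other template at the launch site:
//   fill<<<blocks, threads>>>(d_floats, 1.0f, n);  // fill<float>
//   fill<<<blocks, threads>>>(d_ints, 42, n);      // fill<int>
```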

Available on newer devices (CC ≥ 2.0, CUDA ≥ 4)

  • Double-precision floating point and DP mathematical functions

  • __float128 mathematical functions (CC 10+, if host compiler supports __float128 or _Float128, no dedicated hardware)

  • Recursive calls in device functions (C/C++) — kernels themselves still may not be recursive

  • Allocatable arrays, pointers

  • Function pointers

  • C++ polymorphic classes

  • printf (C), limited print support (Fortran)

  • malloc, new (no Fortran equivalent yet)

  • R-value references

  • Static variables declared in device functions
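Several of these newer features can be sketched in one device-code example (illustrative only; device-side printf output appears at the next synchronization point, and device malloc draws from a limited heap):

```cuda
#include <cstdio>

// Recursion is allowed in __device__ functions on CC >= 2.0 (mind the stack).
__device__ unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ull : n * factorial(n - 1);
}

__global__ void feature_demo() {
    // Device-side printf.
    printf("thread %d: 5! = %llu\n", threadIdx.x, factorial(5));

    // Device-side heap allocation; always check for NULL.
    int* tmp = static_cast<int*>(malloc(4 * sizeof(int)));
    if (tmp != nullptr) free(tmp);
}
```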

What cannot be used in kernels

  • C99/C++ dynamically-sized []-arrays (VLAs)

  • System calls, file I/O, general memory management (fopen, fprintf, system, etc.)

  • Run Time Type Information (RTTI), e.g., no dynamic_cast

  • Arguments with virtual base classes

  • long double floats

  • Allocatable arrays, pointers, value with intent(out) or intent(inout) (Fortran)

  • Recursive, pure, elemental function calls, optional function arguments (Fortran)

  • save attribute (Fortran)

Error handling

C++ exception handling is not supported in device code. Instead, CUDA uses error codes:

  • All runtime API functions return a cudaError_t error code.

  • If no error occurred, the returned value is cudaSuccess.

  • cudaGetErrorString(error) returns a human-readable description of the error.

  • cudaGetLastError() returns the last error and resets it to cudaSuccess.

  • cudaPeekAtLastError() returns the last error without resetting it.

__host__ __device__ const char* cudaGetErrorName(cudaError_t error)
__host__ __device__ cudaError_t cudaGetLastError(void)
__host__ __device__ cudaError_t cudaPeekAtLastError(void)
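For example, the status of an individual API call can be inspected directly (a minimal sketch):

```cuda
double* ptr = nullptr;
size_t size = 1024 * sizeof(double);

cudaError_t err = cudaMalloc(&ptr, size);
if (err != cudaSuccess) {
    // cudaGetErrorName gives the enum name, cudaGetErrorString a description.
    fprintf(stderr, "cudaMalloc failed: %s (%s)\n",
            cudaGetErrorName(err), cudaGetErrorString(err));
}
```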

Important caveats

  • Asynchronous calls return before the device work finishes, so they can only report errors detectable at launch time, not errors that occur during execution.

  • Kernel launches using the <<<>>> syntax do not return any error code at all.

  • Errors from asynchronous functions or kernel launches typically surface on the next blocking call (e.g., cudaDeviceSynchronize or cudaMemcpy). This means cudaMemcpy may report an “unspecified launch failure” caused by a prior kernel.

  • A common error is “unspecified launch failure”, which usually means the kernel crashed — most commonly due to invalid memory access (array bounds violation, accessing host memory from device code, etc.).

Error-checking macro

In practice, wrapping CUDA API calls in an error-checking macro is the most convenient approach. Here is a simple macro used in this course:

cuda_utils.h
#include <cstdio>   // fprintf
#include <cstdlib>  // exit

#ifndef NDEBUG
#define cudaVerify(x) do {                                    \
    cudaError_t cudaVerifyResult_ = (x);                      \
    if (cudaVerifyResult_ != cudaSuccess) {                   \
        fprintf(stderr,                                       \
                "%s:%i: error: cuda function call failed:\n"  \
                "  %s;\nmessage: %s\n",                       \
                __FILE__, __LINE__,                           \
                #x, cudaGetErrorString(cudaVerifyResult_));   \
        exit(1);                                              \
    }                                                         \
} while (0)
#else
#define cudaVerify(x) (x)  // in release builds, still evaluate the call
#endif

Usage:

#include "cuda_utils.h"

cudaVerify(cudaMalloc(&ptr, size));
cudaVerify(cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice));

// For kernel launches, use cudaGetLastError after the launch:
myKernel<<<grid, block>>>(args);
cudaVerify(cudaGetLastError());
cudaVerify(cudaDeviceSynchronize());

The same approach works in Fortran using a preprocessor.

Exercise: Device function call

Device Function Call (C/C++)

The program below computes the element-wise square root of a vector on both CPU and GPU. The function sqrt_of is currently only callable from host code.

Tasks:

  1. Make sqrt_of callable from both host and device code.

  2. Define a launch configuration for the kernel.

#include <cstdio>
#include <cstdlib>

const int vec_size = 1'000'000;
__managed__ double vector[vec_size];
__managed__ double root_dev;

// TODO: Make this function callable in host and device code.
double sqrt_of(double x, double precision = 1e-8) {
    double approximation = x / 2;
    double prev_val;
    double error = 1.0;
    if (x == 0.0) return 0.0;
    if (x < 0) return -1.0;
    while (error > precision) {
        prev_val = approximation;
        approximation = 0.5 * (approximation + x / approximation);
        error = (prev_val * approximation - x) * (prev_val * approximation - x)
                / (2 * prev_val * approximation * approximation);
    }
    return approximation;
}

__global__ void sqrt_kernel(double* vector, int vec_size) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < vec_size) {
        vector[index] = sqrt_of(vector[index]);
    }
}

int main() {
    // TODO: Define a launch configuration
    const ___ num_threads = ____;
    const ___ num_blocks  = ____;
    // ...
    sqrt_kernel<<<num_blocks, num_threads>>>(vector, vec_size);
    cudaDeviceSynchronize();
}

Keypoints

  • __global__ marks a kernel (called from host, runs on device), __device__ marks a device function, __host__ __device__ compiles for both

  • Kernels have restrictions: no return values, no recursion, no variable arguments, no pass-by-reference parameters

  • Most standard mathematical operations work on the GPU; system calls and I/O do not

  • Always check CUDA error codes — use a macro like cudaVerify to wrap API calls

  • Kernel launch errors surface asynchronously on the next blocking call