Kernel Rules and Error Handling

This episode covers the rules for writing kernels and device functions, what language features can and cannot be used in device code, and how to handle errors in CUDA programs.

Objectives

  • Understand the distinction between __global__, __device__, and __host__ functions

  • Know the rules and restrictions for kernels and device functions

  • Know what language features are available in device code

  • Implement proper error handling for CUDA API calls and kernel launches

Instructor note

  • 30 min teaching

  • 15 min exercises

Kernels and device functions revisited

CUDA provides three function qualifiers that control where a function can be called from and where it executes:

| Qualifier | Called from | Executes on | Notes |
| --- | --- | --- | --- |
| __global__ | Host (or device with dynamic parallelism) | Device | Kernel; requires execution configuration <<<>>> |
| __device__ | Device | Device | Device function; can be called from kernels |
| __host__ | Host | Host | Regular host function (default if no qualifier) |
| __host__ __device__ | Both | Both | Compiled for both host and device |

In Fortran, the equivalents are attributes(global), attributes(device), attributes(host), and attributes(host, device).
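As an illustration, the qualifiers might be combined as follows (a minimal sketch; the function names are hypothetical):

```cuda
// Device function: callable only from device code.
__device__ double square(double x) { return x * x; }

// Compiled for both execution spaces; callable from host and device.
__host__ __device__ double twice(double x) { return 2.0 * x; }

// Kernel: launched from the host, runs on the device.
__global__ void compute(double* out, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) out[i] = twice(square(out[i]));
}

int main() {
    double* d;
    cudaMalloc(&d, 256 * sizeof(double));
    compute<<<1, 256>>>(d, 256);  // execution configuration is required
    cudaDeviceSynchronize();
    cudaFree(d);
}
```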

Key rules for kernels and device functions:

  • Kernels (__global__) cannot have return values and need an execution configuration to be called.

  • Functions with __host__ __device__ are compiled for both execution spaces. The compiler generates code for both host and device. Exception: constexpr/consteval functions may, with restrictions, be called across execution spaces.

  • Only device functions may be called from kernels or other device functions.

  • Pointers or passed-by-reference variables must point to device-accessible memory (device memory, managed memory, or mapped pinned memory). Otherwise, expect invalid memory accesses at run time, typically reported as an “unspecified launch failure”.

  • Variables passed by value to kernels are stored in constant memory and limited to 4 KB in total (raised to 32,764 bytes on Volta and newer GPUs with CUDA 12.1 or later). They are copied automatically — no explicit allocation or transfer is needed.

  • Variables declared in a kernel are private to each thread and stored in registers (or local memory if the compiler chooses).
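A minimal sketch illustrating these rules (the struct and kernel names are illustrative):

```cuda
// Small POD struct: passed by value, placed in constant memory automatically.
struct Params { double scale; int n; };

__global__ void scale_kernel(double* data /* must be a device-accessible pointer */,
                             Params p /* copied automatically, no cudaMemcpy needed */) {
    // Per-thread local variable, held in registers (or local memory).
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < p.n) data[i] *= p.scale;
}
```

Note that `data` must point to device, managed, or mapped pinned memory; passing a plain host pointer compiles but fails at run time.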

Argument passing

Argument passing works as usual in each language:

  • C/C++: Arguments are passed by value by default. Use pointers for pass-by-reference.

  • Fortran: Arguments are passed by reference by default. Use the value attribute for pass-by-value.

Rules for kernels (C/C++)

  • Kernels (__global__ functions) do not support recursion.

  • Non-trivially copyable or non-trivially-destructible argument types are allowed when launched from host code, but the processing does not follow the standard C++ model (see the programming guide for details).

  • Kernels cannot have a variable number of arguments.

  • Kernel parameters cannot be pass-by-reference (use pointers instead).
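For example, a kernel cannot take a reference parameter; the usual workaround is a pointer (an illustrative sketch):

```cuda
// Not allowed: __global__ void bad_kernel(double& result);

// Allowed: write the result through a device pointer instead.
__global__ void first_element(const double* in, double* result) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *result = in[0];
    }
}
```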

Rules for kernels (Fortran)

  • Return values of functions or intent(out)/intent(inout) variables cannot have the value attribute.

  • A subroutine or function with the device or global attribute may not be recursive, pure, or elemental.

  • Device and global subroutines/functions must appear within a Fortran module, and device functions may only be called from device subprograms in the same module.

  • They may not contain another subprogram.

  • They may not be contained in another subroutine or function.

What can be used in kernels

Available since early CUDA

  • All mathematical operators

  • Control flow constructs (if, for, while, switch/case, goto)

  • Single-precision transcendental mathematical functions (see section 17.1 in the Programming Guide)

  • Calls to device functions/subroutines

  • Pointers (C/C++), structs (C) and classes (C++), statically-sized arrays (C/C++, Fortran)

  • Assumed-shape arrays as kernel arguments (Fortran)

  • CUDA-specific built-in functions

  • C++ templates and function overloading
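As a quick example of the last point, kernels can be templated and instantiated per type (a hedged sketch with hypothetical names):

```cuda
template <typename T>
__global__ void fill(T* data, T value, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] = value;
}

// Instantiated like any other template at the launch site:
//   fill<<<blocks, threads>>>(d_floats, 1.0f, n);  // fill<float>
//   fill<<<blocks, threads>>>(d_ints, 42, n);      // fill<int>
```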

Available on newer devices (CC ≥ 2.0, CUDA ≥ 4)

  • Double-precision floating point and DP mathematical functions

  • __float128 mathematical functions (CC 10+, if host compiler supports __float128 or _Float128, no dedicated hardware)

  • Recursive calls in device functions (C/C++) — kernels themselves still may not be recursive

  • Allocatable arrays, pointers

  • Function pointers

  • C++ polymorphic classes

  • printf (C), limited print support (Fortran)

  • malloc, new (no Fortran equivalent yet)

  • R-value references

  • Static variables declared in device functions
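Several of these newer features can be sketched in one device-code example (illustrative only; device-side printf output appears at the next synchronization point, and device malloc draws from a limited heap):

```cuda
#include <cstdio>

// Recursion is allowed in __device__ functions on CC >= 2.0 (mind the stack).
__device__ unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ull : n * factorial(n - 1);
}

__global__ void feature_demo() {
    // Device-side printf.
    printf("thread %d: 5! = %llu\n", threadIdx.x, factorial(5));

    // Device-side heap allocation; always check for NULL.
    int* tmp = static_cast<int*>(malloc(4 * sizeof(int)));
    if (tmp != nullptr) free(tmp);
}
```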

What cannot be used in kernels

  • C99/C++ dynamically-sized []-arrays (VLAs)

  • System calls, file I/O, general memory management (fopen, fprintf, system, etc.)

  • Run Time Type Information (RTTI), e.g., no dynamic_cast

  • Arguments with virtual base classes

  • long double floats

  • Allocatable arrays, pointers, value with intent(out) or intent(inout) (Fortran)

  • Recursive, pure, elemental function calls, optional function arguments (Fortran)

  • save attribute (Fortran)

Error handling

C++ exception handling is not supported in device code. Instead, CUDA uses error codes:

  • All runtime API functions return a cudaError_t error code.

  • If no error occurred, the returned value is cudaSuccess.

  • cudaGetErrorString(error) returns a human-readable description of the error.

  • cudaGetLastError() returns the last error and resets it to cudaSuccess.

  • cudaPeekAtLastError() returns the last error without resetting it.

__host__ __device__ const char* cudaGetErrorName(cudaError_t error)
__host__ __device__ cudaError_t cudaGetLastError(void)
__host__ __device__ cudaError_t cudaPeekAtLastError(void)
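For example, the status of an individual API call can be inspected directly (a minimal sketch):

```cuda
double* ptr = nullptr;
size_t size = 1024 * sizeof(double);

cudaError_t err = cudaMalloc(&ptr, size);
if (err != cudaSuccess) {
    // cudaGetErrorName gives the enum name, cudaGetErrorString a description.
    fprintf(stderr, "cudaMalloc failed: %s (%s)\n",
            cudaGetErrorName(err), cudaGetErrorString(err));
}
```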

Important caveats

  • Asynchronous calls return before the device work finishes, so they can only report errors detectable at launch time, not errors that occur during execution.

  • Kernel launches using the <<<>>> syntax do not return any error code at all.

  • Errors from asynchronous functions or kernel launches typically surface on the next blocking call (e.g., cudaDeviceSynchronize or cudaMemcpy). This means cudaMemcpy may report an “unspecified launch failure” caused by a prior kernel.

  • A common error is “unspecified launch failure”, which usually means the kernel crashed — most commonly due to invalid memory access (array bounds violation, accessing host memory from device code, etc.).

Error-checking macro

In practice, wrapping CUDA API calls in an error-checking macro is the most convenient approach. Here is a simple macro used in this course:

cuda_utils.h
#include <cstdio>   // fprintf
#include <cstdlib>  // exit

#ifndef NDEBUG
#define cudaVerify(x) do {                                    \
    cudaError_t cudaVerifyResult_ = (x);                      \
    if (cudaVerifyResult_ != cudaSuccess) {                   \
        fprintf(stderr,                                       \
                "%s:%i: error: cuda function call failed:\n"  \
                "  %s;\nmessage: %s\n",                       \
                __FILE__, __LINE__,                           \
                #x, cudaGetErrorString(cudaVerifyResult_));   \
        exit(1);                                              \
    }                                                         \
} while (0)
#else
#define cudaVerify(x) (x)  // in release builds, still evaluate the call
#endif

Usage:

#include "cuda_utils.h"

cudaVerify(cudaMalloc(&ptr, size));
cudaVerify(cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice));

// For kernel launches, use cudaGetLastError after the launch:
myKernel<<<grid, block>>>(args);
cudaVerify(cudaGetLastError());
cudaVerify(cudaDeviceSynchronize());

The same approach works in Fortran using a preprocessor.

Exercise: Device function call

Device Function Call (C/C++)

The program below computes the element-wise square root of a vector on both CPU and GPU. The function sqrt_of is currently only callable from host code.

Tasks:

  1. Make sqrt_of callable from both host and device code.

  2. Define a launch configuration for the kernel.

#include <cstdio>
#include <cstdlib>

const int vec_size = 1'000'000;
__managed__ double vector[vec_size];
__managed__ double root_dev;

// TODO: Make this function callable in host and device code.
double sqrt_of(double x, double precision = 1e-8) {
    double approximation = x / 2;
    double prev_val;
    double error = 1.0;
    if (x == 0.0) return 0.0;
    if (x < 0) return -1.0;
    while (error > precision) {
        prev_val = approximation;
        approximation = 0.5 * (approximation + x / approximation);
        error = (prev_val * approximation - x) * (prev_val * approximation - x)
                / (2 * prev_val * approximation * approximation);
    }
    return approximation;
}

__global__ void sqrt_kernel(double* vector, int vec_size) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < vec_size) {
        vector[index] = sqrt_of(vector[index]);
    }
}

int main() {
    // TODO: Define a launch configuration
    const ___ num_threads = ____;
    const ___ num_blocks  = ____;
    // ...
    sqrt_kernel<<<num_blocks, num_threads>>>(vector, vec_size);
    cudaDeviceSynchronize();
}

Keypoints

  • __global__ marks a kernel (called from host, runs on device), __device__ marks a device function, __host__ __device__ compiles for both

  • Kernels have restrictions: no return values, no recursion, no variable arguments, no pass-by-reference parameters

  • Most standard mathematical operations work on the GPU; system calls and I/O do not

  • Always check CUDA error codes — use a macro like cudaVerify to wrap API calls

  • Kernel launch errors surface asynchronously on the next blocking call