Kernel Rules and Error Handling¶
This episode covers the rules for writing kernels and device functions, what language features can and cannot be used in device code, and how to handle errors in CUDA programs.
Objectives

- Understand the distinction between `__global__`, `__device__`, and `__host__` functions
- Know the rules and restrictions for kernels and device functions
- Know what language features are available in device code
- Implement proper error handling for CUDA API calls and kernel launches

Instructor note

- 30 min teaching
- 15 min exercises
Kernels and device functions revisited¶
CUDA provides three function qualifiers that control where a function can be called from and where it executes:
| Qualifier | Called from | Executes on | Notes |
|---|---|---|---|
| `__global__` | Host (or device with dynamic parallelism) | Device | Kernel; requires execution configuration |
| `__device__` | Device | Device | Device function; can be called from kernels |
| `__host__` | Host | Host | Regular host function (default if no qualifier) |
| `__host__ __device__` | Both | Both | Compiled for both host and device |
In Fortran, the equivalents are `attributes(global)`, `attributes(device)`, `attributes(host)`, and `attributes(host, device)`.
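As a minimal sketch of how the qualifiers fit together in one CUDA C++ file (the function names here are invented for illustration):

```cuda
#include <cstdio>

// __device__: callable only from device code.
__device__ float square(float x) { return x * x; }

// __host__ __device__: one definition, compiled for both execution spaces.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: a kernel; launched from the host, runs on the device.
__global__ void demo_kernel(float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = twice(square(out[i]));
}

// No qualifier (implicitly __host__): a regular host function.
int main() {
    float* data;
    cudaMallocManaged(&data, 4 * sizeof(float));
    for (int i = 0; i < 4; ++i) data[i] = float(i);
    demo_kernel<<<1, 4>>>(data, 4);  // execution configuration required
    cudaDeviceSynchronize();
    printf("%g\n", data[3]);         // twice(square(3))
    cudaFree(data);
}
```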
Key rules for kernels and device functions:
- Kernels (`__global__`) cannot have return values and require an execution configuration to be called.
- Functions marked `__host__ __device__` are compiled for both execution spaces; the compiler generates code for both host and device. Exception: `constexpr`/`consteval` functions may, with restrictions, be called across execution spaces.
- Only device functions may be called from kernels or other device functions.
- Pointers and passed-by-reference variables must point to device-accessible memory (device memory, managed memory, or mapped pinned memory). Otherwise, expect segmentation faults.
- Variables passed by value to kernels are stored in constant memory and limited to 4 KB (or 32,764 bytes starting with Volta and CUDA 12.1). They are copied automatically; no explicit allocation or transfer is needed.
- Variables declared inside a kernel are private to each thread and stored in registers (or local memory if the compiler so chooses).
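To make the pass-by-value rule concrete, here is a hedged sketch (struct and kernel names are invented): the small struct is passed by value and lands in constant memory at launch, so it needs no explicit transfer, while the array must be a pointer to device-accessible memory.

```cuda
struct Params {      // small POD struct, far below the by-value size limit
    float scale;
    float offset;
};

// `p` is copied by value into constant memory when the kernel launches;
// `data` must point to device-accessible memory (device, managed, or pinned).
__global__ void transform(float* data, int n, Params p) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = p.scale * data[i] + p.offset;
}
```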
Argument passing¶
Argument passing works as usual in each language:
- C/C++: arguments are passed by value by default. Use pointers for pass-by-reference.
- Fortran: arguments are passed by reference by default. Use the `value` attribute for pass-by-value.
Rules for kernels (C/C++)¶
- Kernels (`__global__` functions) do not support recursion.
- Non-trivially-copyable or non-trivially-destructible argument types are allowed when launched from host code, but their processing does not follow the standard C++ model (see the programming guide for details).
- Kernels cannot have a variable number of arguments.
- Kernel parameters cannot be passed by reference (use pointers instead).
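Since kernel parameters cannot be references, the usual workaround is a pointer into device-accessible memory. A sketch with hypothetical names (note that `atomicAdd` on `double` requires compute capability 6.0 or higher):

```cuda
// __global__ void accumulate(double& sum, ...)  // not allowed: reference parameter

// Pointer version: each thread writes its contribution through the pointer.
__global__ void accumulate(double* sum, const double* data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) atomicAdd(sum, data[i]);
}
```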
Rules for kernels (Fortran)¶
- Return values of functions and `intent(out)`/`intent(inout)` variables cannot have the `value` attribute.
- A subroutine or function with the `device` or `global` attribute may not be recursive, pure, or elemental.
- Device and global subroutines/functions must appear within a Fortran module, and device functions may only be called from device subprograms in the same module.
- They may not contain another subprogram.
- They may not be contained in another subroutine or function.
What can be used in kernels¶
Available since early CUDA¶
- All mathematical operators
- Control flow constructs (`if`, `for`, `while`, `case`, `goto`)
- Single-precision transcendental mathematical functions (see section 17.1 in the Programming Guide)
- Calls to device functions/subroutines
- Pointers (C), structs/classes (C), statically sized arrays (C, Fortran)
- Assumed-shape arrays as kernel arguments (Fortran)
- CUDA-specific built-in functions
- C++ templates and function overloading
Available on newer devices (CC ≥ 2.0, CUDA ≥ 4)¶
- Double-precision floating point and double-precision mathematical functions
- `__float128` mathematical functions (CC 10+, if the host compiler supports `__float128` or `_Float128`; no dedicated hardware)
- Recursive function calls (C)
- Allocatable arrays, pointers
- Function pointers
- C++ polymorphic classes
- `printf` (C), limited `print` support (Fortran)
- `malloc`, `new` (no Fortran equivalent yet)
- R-value references
- Static variables declared in device functions
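Several of these features can be combined in one kernel. A small sketch, assuming a device of CC ≥ 2.0 (function names invented):

```cuda
#include <cstdio>

// C++ templates and overloading work in device code.
template <typename T>
__device__ T clamp01(T x) { return x < T(0) ? T(0) : (x > T(1) ? T(1) : x); }

// Recursion is allowed in device functions (mind the per-thread stack).
__device__ int factorial(int n) { return n <= 1 ? 1 : n * factorial(n - 1); }

__global__ void feature_demo() {
    // Device-side printf; output is flushed at synchronization points.
    printf("thread %d: clamp01(1.5)=%g, 4!=%d\n",
           threadIdx.x, clamp01(1.5), factorial(4));
}
```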
What cannot be used in kernels¶
- C99/C++ dynamically sized `[]` arrays (VLAs)
- System calls, file I/O, general memory management (`fopen`, `fprintf`, `system`, etc.)
- Run-time type information (RTTI), e.g. no `dynamic_cast`
- Arguments with virtual base classes
- `long double` floats
- Allocatable arrays, pointers, `value` with `intent(out)` or `intent(inout)` (Fortran)
- Recursive, pure, or elemental function calls; optional function arguments (Fortran)
- `save` attribute (Fortran)
Error handling¶
C++ exception handling is not supported in device code. Instead, CUDA uses error codes:
- All runtime API functions return a `cudaError_t` error code.
- If no error occurred, the returned value is `cudaSuccess`.
- `cudaGetErrorString(error)` returns a human-readable description of the error.
- `cudaGetLastError()` returns the last error and resets it to `cudaSuccess`.
- `cudaPeekAtLastError()` returns the last error without resetting it.
__host__ __device__ const char* cudaGetErrorName(cudaError_t error)
__host__ __device__ cudaError_t cudaGetLastError(void)
__host__ __device__ cudaError_t cudaPeekAtLastError(void)
function cudaGetErrorString(errcode)
integer, intent(in) :: errcode
character*(*) :: cudaGetErrorString
integer function cudaGetLastError()
integer function cudaPeekAtLastError()
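Written out by hand, the basic checking pattern looks like this (a sketch checking a single `cudaMalloc` call):

```cuda
#include <cstdio>
#include <cstdlib>

int main() {
    double* d_vec;
    cudaError_t err = cudaMalloc(&d_vec, 1000 * sizeof(double));
    if (err != cudaSuccess) {
        // cudaGetErrorName gives the enum name, cudaGetErrorString a description.
        fprintf(stderr, "cudaMalloc failed: %s (%s)\n",
                cudaGetErrorString(err), cudaGetErrorName(err));
        exit(EXIT_FAILURE);
    }
    cudaFree(d_vec);
}
```

Doing this after every call is tedious, which is why the macro below is used in practice.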
Important caveats¶
- Asynchronous calls cannot return errors that occur while the code runs (except for launch errors).
- Kernel launches using the `<<<>>>` syntax do not return any error code at all.
- Errors from asynchronous functions or kernel launches typically surface on the next blocking call (e.g., `cudaDeviceSynchronize` or `cudaMemcpy`). This means `cudaMemcpy` may report an "unspecified launch failure" caused by a prior kernel.
- A common error is "unspecified launch failure", which usually means the kernel crashed, most commonly due to an invalid memory access (array bounds violation, accessing host memory from device code, etc.).
Error-checking macro¶
In practice, wrapping CUDA API calls in an error-checking macro is the most convenient approach. Here is a simple macro used in this course:
#ifndef NDEBUG
#define cudaVerify(x) do { \
    cudaError_t __cu_result = x; \
    if (__cu_result != cudaSuccess) { \
        fprintf(stderr, \
                "%s:%i: error: cuda function call failed:\n" \
                "  %s;\nmessage: %s\n", \
                __FILE__, __LINE__, \
                #x, cudaGetErrorString(__cu_result)); \
        exit(1); \
    } \
} while(0)
#else
#define cudaVerify(x) x  /* checks disabled in release builds */
#endif
Usage:
#include "cuda_utils.h"
cudaVerify(cudaMalloc(&ptr, size));
cudaVerify(cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice));
// For kernel launches, use cudaGetLastError after the launch:
myKernel<<<grid, block>>>(args);
cudaVerify(cudaGetLastError());
cudaVerify(cudaDeviceSynchronize());
The same approach works in Fortran using a preprocessor.
Exercise: Device function call¶
Device Function Call (C/C++)
The program below computes the element-wise square root of a vector on both CPU and GPU.
The function sqrt_of is currently only callable from host code.
Tasks:
- Make `sqrt_of` callable from both host and device code.
- Define a launch configuration for the kernel.
#include <cstdio>
#include <cstdlib>
const int vec_size = 1'000'000;
__managed__ double vector[vec_size];
__managed__ double root_dev;
// TODO: Make this function callable in host and device code.
double sqrt_of(double x, double precision = 1e-8) {
double approximation = x / 2;
double prev_val;
double error = 1.0;
if (x == 0.0) return 0.0;
if (x < 0) return -1.0;
while (error > precision) {
prev_val = approximation;
approximation = 0.5 * (approximation + x / approximation);
error = (prev_val * approximation - x) * (prev_val * approximation - x)
/ (2 * prev_val * approximation * approximation);
}
return approximation;
}
__global__ void sqrt_kernel(double* vector, int vec_size) {
int index = threadIdx.x + blockDim.x * blockIdx.x;
if (index < vec_size) {
vector[index] = sqrt_of(vector[index]);
}
}
int main() {
// TODO: Define a launch configuration
const ___ num_threads = ____;
const ___ num_blocks = ____;
// ...
sqrt_kernel<<<num_blocks, num_threads>>>(vector, vec_size);
cudaDeviceSynchronize();
}
Solution
Add `__host__ __device__` to the function definition (and to any forward declaration, if present):
__host__ __device__ double sqrt_of(double x, double precision = 1e-8) {
// ... (function body unchanged)
}
const int num_threads = 256;
const int num_blocks = (vec_size + num_threads - 1) / num_threads;
Keypoints
- `__global__` marks a kernel (called from host, runs on device), `__device__` marks a device function, `__host__ __device__` compiles for both
- Kernels have restrictions: no return values, no recursion, no variable arguments, no pass-by-reference parameters
- Most standard mathematical operations work on the GPU; system calls and I/O do not
- Always check CUDA error codes; use a macro like `cudaVerify` to wrap API calls
- Kernel launch errors surface asynchronously on the next blocking call