CUDA Programming Model (Ch. 1–5) + PyTorch Internals

Contents

  1. Ch. 1–3: Why GPUs? The CUDA origin story
  2. Ch. 4: Kernels and the thread hierarchy
  3. Ch. 5: Memory hierarchy
  4. Ch. 5: Heterogeneous & async programming
  5. PyTorch: Tensors, strides, and storage
  6. PyTorch: The dispatch chain
  7. PyTorch: Autograd internals
  8. PyTorch: Writing a kernel
Part 1 — CUDA C++ Programming Guide, Ch. 1–5

Ch. 1–3: Why GPUs? The CUDA Origin Story

Modern CPUs are designed for low-latency serial execution. They devote most of their transistor budget to cache hierarchies, branch predictors, and out-of-order execution logic — all aimed at making a single thread run as fast as possible. GPUs make the opposite tradeoff: thousands of smaller, simpler cores optimized for throughput over latency.

CUDA (Compute Unified Device Architecture) was introduced by NVIDIA in 2006 to expose this parallelism to general-purpose workloads beyond graphics. The programming model extends C++ with a small set of keywords and a new execution model built around the idea that you have many thousands of identical threads running simultaneously.

The key insight of the scalable programming model: a CUDA program compiled once runs correctly (though slower) on a GPU with fewer multiprocessors, and automatically scales up on a more powerful one. You don't target a specific number of cores — the hardware schedules your blocks across whatever SMs are available.

Ch. 4: Kernels and the Thread Hierarchy

Kernels

A kernel is a C++ function that runs on the GPU. It's declared with the __global__ qualifier. You invoke it from host (CPU) code using the <<<...>>> syntax, specifying how many threads to launch:

__global__ void vecAdd(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Launch 1024 threads in blocks of 256
vecAdd<<<4, 256>>>(dA, dB, dC, 1024);

Each thread has its own index (threadIdx) and knows which block it belongs to (blockIdx). This gives it a unique global identity, used to partition the work.
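The index arithmetic is easy to sanity-check on the CPU. A minimal Python sketch simulating the vecAdd launch above (4 blocks of 256 threads, n = 1024):

```python
# Simulate the CUDA index calculation: i = blockIdx.x * blockDim.x + threadIdx.x
GRID_DIM, BLOCK_DIM, N = 4, 256, 1024

covered = []
for block_idx in range(GRID_DIM):
    for thread_idx in range(BLOCK_DIM):
        i = block_idx * BLOCK_DIM + thread_idx  # unique global index
        if i < N:                               # same bounds guard as the kernel
            covered.append(i)

# Every element is handled exactly once, with no overlap between blocks
assert sorted(covered) == list(range(N))
```

The `if (i < n)` guard matters whenever n is not a multiple of the block size; the usual grid size is then computed as (n + blockDim - 1) / blockDim.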

Thread Hierarchy: threads → blocks → grid

Blocks can be 1D, 2D, or 3D (useful for images or volumetric data). Starting with compute capability 9.0 (Hopper), Thread Block Clusters add an optional grouping above blocks, allowing cooperative execution across multiple SMs.

Warp: The hardware schedules threads in groups of 32 called warps. All threads in a warp execute the same instruction (SIMT — Single Instruction, Multiple Threads). Divergent branches (if-else where threads go different ways) serialize, reducing throughput. Minimizing warp divergence is one of the first GPU optimization rules.

Ch. 5: Memory Hierarchy

Memory is the central bottleneck on GPUs. Understanding the hierarchy is essential for writing fast kernels.

Registers: per-thread, up to ~255 registers per thread. Fastest storage. Spills to local memory if exceeded.
Shared memory: per-block, 48–228 KB per SM depending on architecture. Low latency (~5 cycles). A programmable L1 cache; critical for tiling.
L1 cache: per-SM. Shares its SRAM with shared memory (configurable split).
L2 cache: device-wide, several MB, ~100-cycle latency. CUDA 11.1+ allows pinning residency for streaming workloads.
Global memory: HBM2/HBM3, up to 80 GB on H100, ~600–800 cycles. The main data store; coalescing accesses is critical.

Coalescing

When threads in a warp access consecutive memory addresses, the hardware can satisfy the whole warp in a single transaction. This is called coalesced access. Strided or random accesses require multiple transactions, multiplying memory bandwidth consumption. Always arrange data so neighboring threads access neighboring addresses.
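The effect is easy to model. A small Python sketch counting how many memory segments a 32-thread warp touches when each thread loads a 4-byte float at a given stride (the 128-byte segment size is an illustrative assumption; actual transaction sizes vary by architecture):

```python
def segments_touched(stride_elems, n_threads=32, elem_bytes=4, seg_bytes=128):
    """Count distinct memory segments a warp's loads fall into."""
    addresses = [t * stride_elems * elem_bytes for t in range(n_threads)]
    return len({addr // seg_bytes for addr in addresses})

print(segments_touched(1))   # coalesced: whole warp served by one segment
print(segments_touched(2))   # stride-2: two segments, half the bandwidth wasted
print(segments_touched(32))  # stride-32: one segment per thread, worst case
```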

Shared Memory Tiling Pattern

The canonical technique for kernels with heavy data reuse (matrix multiply being the textbook example) is to tile the computation: load a tile of data from global memory into shared memory, synchronize, compute, and repeat. Each global element is then read once per tile rather than once per use, replacing many slow global reads with fast shared memory reads.

#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tile_A[TILE][TILE];
    __shared__ float tile_B[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float sum = 0.0f;

    for (int phase = 0; phase < N / TILE; ++phase) {
        // load phase: each thread fetches one element of each tile
        tile_A[ty][tx] = A[row * N + (phase * TILE + tx)];
        tile_B[ty][tx] = B[(phase * TILE + ty) * N + col];
        __syncthreads();

        // compute phase: partial dot product out of shared memory
        for (int k = 0; k < TILE; ++k)
            sum += tile_A[ty][k] * tile_B[k][tx];
        __syncthreads();  // don't overwrite tiles while others still read them
    }
    C[row * N + col] = sum;  // assumes N is a multiple of TILE
}
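The same decomposition can be verified on the CPU. A NumPy sketch (assuming N divisible by the tile size) showing that accumulating per-tile products reproduces the full matrix product:

```python
import numpy as np

def tiled_matmul(A, B, tile):
    """Accumulate tile-by-tile, mirroring the kernel's phase loop."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for i in range(0, N, tile):          # block row
        for j in range(0, N, tile):      # block column
            for p in range(0, N, tile):  # the "phase" loop
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(tiled_matmul(A, B, tile=4), A @ B)
```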

Ch. 5: Heterogeneous & Async Programming

The CPU–GPU Relationship

CUDA treats the CPU as the host and the GPU as the device. They have separate memory spaces (unless using unified memory). The typical flow: allocate device memory (cudaMalloc), copy data from host to device (cudaMemcpy H2D), launch kernels, copy results back (cudaMemcpy D2H), free.

Kernel launches are asynchronous by default — the host thread returns immediately and can queue multiple kernels or H2D/D2H transfers. Streams are ordered queues of GPU work; operations in the same stream execute in order, while operations in different streams can overlap.

Asynchronous SIMT and Compute Capability

The asynchronous programming model deepens with newer hardware. cuda::barrier and cuda::pipeline (CUDA 11+) allow finer-grained producer-consumer pipelining within a block — threads can issue async copies with memcpy_async into shared memory while other threads compute on previously loaded data. This double-buffering pattern is fundamental in tensor core matmul implementations. Compute capability is a versioned number (e.g., 8.0 for Ampere, 9.0 for Hopper) that gates which hardware features are available.

Part 2 — PyTorch Internals (ezyang, 2019)

PyTorch: Tensors, Strides, and Storage

The tensor is the central data structure in PyTorch. Most users think of it as an n-dimensional array with a dtype, a device, and a shape. But the actual implementation is richer.

Storage and the Tensor–Storage Split

Physically, data lives in a Storage object: a flat, typed, device-resident buffer. A Tensor is a view on top of a Storage — it records sizes, strides, and an offset, but shares the underlying bytes. This is what enables zero-copy slicing and transposing.

x = torch.zeros(4, 4)
row = x[1, :]  # new Tensor, same Storage, offset=4, strides=(1,)
col = x[:, 0]  # new Tensor, same Storage, offset=0, strides=(4,)

Strides: the key abstraction

To find element tensor[i, j] in physical memory: offset + i * stride[0] + j * stride[1]. For a contiguous row-major (M, N) tensor, strides are (N, 1). A transpose yields shape (N, M) with strides (1, N) — no data copy required, just a metadata change. Many PyTorch "views" (slice, permute, and reshape when the layout allows it) are purely stride arithmetic.
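This arithmetic is simple enough to sketch without torch. A pure-Python model of a strided view over a flat storage (illustrative only, not PyTorch's actual code):

```python
storage = list(range(12))  # flat "Storage" for a 3x4 tensor

def elem(offset, strides, i, j):
    """offset + i * stride[0] + j * stride[1], read from flat storage."""
    return storage[offset + i * strides[0] + j * strides[1]]

# contiguous (3, 4): strides (4, 1)
assert elem(0, (4, 1), 1, 2) == 6
# transpose = swap the strides, no copy: shape (4, 3), strides (1, 4)
assert elem(0, (1, 4), 2, 1) == elem(0, (4, 1), 1, 2)  # t[2,1] == x[1,2]
```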

Contiguous check: tensor.is_contiguous() returns true when strides follow the standard row-major layout. Calling .contiguous() on a non-contiguous tensor triggers a copy. Many CUDA kernels require contiguous input — this is a common source of hidden copies in production code.

The three extension axes

Every tensor is characterized by three orthogonal parameters that define a Cartesian product of possible tensor types: the device (CPU, CUDA, XLA, ...), the layout (strided, sparse, ...), and the dtype (float32, int64, ...).

PyTorch: The Dispatch Chain

When you call torch.add(a, b), what actually happens? The call passes through four stages, the last two being the actual dispatch levels:

  1. Python → C++ boundary. Autogenerated binding code (THPVariable_add) uses PythonArgParser to extract C++ tensors from Python arguments, releases the GIL, and calls into C++.
  2. Variable (autograd) dispatch. The Variable layer unwraps tensors, records the operation in the autograd graph if requires_grad=True, then calls the underlying implementation.
  3. Device/layout dispatch. A virtual call on a "Type" object routes to the right backend — CPUFloatType, CUDAHalfType, etc. This is a dynamic dispatch because CPU and CUDA kernels live in separate shared libraries.
  4. dtype dispatch. Inside the backend kernel, AT_DISPATCH_ALL_TYPES (a switch macro) selects the concrete scalar-type specialization and jumps into the numeric computation.

All code up to step 4 is autogenerated from operator schemas defined in native_functions.yaml. This is why you can add a new operator, write a single kernel, and automatically get Python bindings, gradient support, and dtype dispatch — without touching binding or dispatch code manually.
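To make the two dispatch levels concrete, here is a hypothetical mini-dispatcher in Python (the names register and add here are inventions for illustration, not PyTorch API):

```python
kernels = {}  # (device, dtype) -> concrete kernel, like the backend/dtype tables

def register(device, dtype):
    """Register a concrete kernel, much as codegen wires up native_functions.yaml."""
    def deco(fn):
        kernels[(device, dtype)] = fn
        return fn
    return deco

@register("cpu", "float32")
def add_cpu_f32(a, b):
    return [x + y for x, y in zip(a, b)]

def add(device, dtype, a, b):
    # device/layout dispatch and dtype dispatch, collapsed into one table lookup
    return kernels[(device, dtype)](a, b)

print(add("cpu", "float32", [1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```

Registering a new (device, dtype) pair is just another table entry — the caller never changes, which is the point of routing through a dispatcher.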

Key directories in the PyTorch codebase

  torch/: the Python modules you actually import
  torch/csrc/: the C++ side of the Python bindings, plus the autograd engine and JIT
  aten/: "A Tensor Library", where operator implementations live
  c10/: core abstractions (TensorImpl, dispatch machinery) shared across server and mobile builds

PyTorch: Autograd Internals

PyTorch implements reverse-mode automatic differentiation. The forward pass builds a dynamic computation graph (a DAG of Function nodes). Each node stores the inputs needed to compute its backward pass. When you call loss.backward(), the autograd engine traverses this graph in reverse, accumulating gradients.

The variable wrapping adds an AutogradMeta struct to each tensor that stores: the accumulated gradient (.grad), the grad_fn pointing at the graph node that produced the tensor, and a gradient accumulator for leaf tensors.

Technically, the gradients computed are not pure gradients but vector-Jacobian products: the upstream gradient vector multiplied through each node's Jacobian. For scalar losses this simplifies to standard gradients. For vector-valued outputs you must supply the vector yourself, via backward(gradient=...) or torch.autograd.grad with explicit grad_outputs.

In-place operations and autograd: in-place ops (x += 1) that modify a tensor needed for the backward pass cause a RuntimeError when backward() runs. This is not a limitation but a correctness guard: the saved value the backward formula depends on no longer exists. If you're hitting this in production, check for unexpected in-place ops in your model.
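A toy sketch of reverse-mode accumulation (nothing like PyTorch's C++ engine, but the same graph-walking idea): each node stores its inputs and a local vector-Jacobian product per input, and backward() pushes the upstream gradient through them.

```python
class Var:
    """Toy autograd variable: a value plus the graph edge back to its inputs."""
    def __init__(self, value, parents=(), vjps=()):
        self.value = value
        self.parents = parents  # input Vars
        self.vjps = vjps        # one vector-Jacobian product per input
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream   # accumulate, like PyTorch's .grad
        for parent, vjp in zip(self.parents, self.vjps):
            parent.backward(vjp(upstream))

def mul(a, b):
    return Var(a.value * b.value, (a, b),
               (lambda g: g * b.value, lambda g: g * a.value))

def add(a, b):
    return Var(a.value + b.value, (a, b),
               (lambda g: g, lambda g: g))

x, y = Var(3.0), Var(4.0)
z = add(mul(x, y), x)  # z = x*y + x
z.backward()
assert x.grad == 5.0   # dz/dx = y + 1
assert y.grad == 3.0   # dz/dy = x
```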

PyTorch: What Writing a Kernel Actually Looks Like

A kernel in ATen native consists of:

  1. Schema in native_functions.yaml: declares signature, dispatch keys, and whether it has an out= variant or an in-place variant.
  2. Error checking using TORCH_CHECK(cond, "message") or higher-level checkDim() helpers.
  3. Output allocation: create the result tensor via at::empty or at::zeros.
  4. dtype dispatch via AT_DISPATCH_ALL_TYPES(input.scalar_type(), "kernel_name", [&] { ... }) — the lambda is instantiated once per supported dtype.
  5. Parallelism: at::parallel_for for CPU; CUDA kernels are implicitly parallel.
  6. Data access: use TensorAccessor (bounds-checked, index-calculated) or raw pointers via data_ptr<scalar_t>().

Most production kernels also define a derivatives.yaml entry for autograd, which tells the codegen what backward formula to use — often composing existing ops, occasionally a hand-written _backward function.
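As a rough sketch of step 1 (simplified; real entries in recent PyTorch use structured kernels and more fields, and "myop" is a hypothetical name), a native_functions.yaml entry pairs a schema string with per-backend kernel names:

```yaml
- func: myop(Tensor self, Tensor other) -> Tensor
  dispatch:
    CPU: myop_cpu
    CUDA: myop_cuda
```

The codegen reads this entry and emits the Python binding, the dispatch stubs, and the registration glue, so the author only writes myop_cpu and myop_cuda.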

Sources: NVIDIA CUDA C++ Programming Guide (chapters 1–5), docs.nvidia.com. PyTorch internals overview by Edward Yang, blog.ezyang.com (May 2019).