Ch. 1–3: Why GPUs? The CUDA Origin Story
Modern CPUs are designed for low-latency serial execution. They devote most of their transistor budget to cache hierarchies, branch predictors, and out-of-order execution logic — all aimed at making a single thread run as fast as possible. GPUs make the opposite tradeoff: thousands of smaller, simpler cores optimized for throughput over latency.
CUDA (Compute Unified Device Architecture) was introduced by NVIDIA in 2006 to expose this parallelism to general-purpose workloads beyond graphics. The programming model extends C++ with a small set of keywords and a new execution model built around the idea that you have many thousands of identical threads running simultaneously.
The key insight of the scalable programming model: the same program runs correctly on a GPU with few multiprocessors (just more slowly) and automatically scales up on a more powerful GPU. You don't target a specific number of cores — the hardware schedules your blocks across whatever SMs are available.
Ch. 4: Kernels and the Thread Hierarchy
Kernels
A kernel is a C++ function that runs on the GPU. It's declared with the __global__ qualifier. You invoke it from host (CPU) code using the <<<...>>> syntax, specifying how many threads to launch:
__global__ void vecAdd(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Launch 1024 threads: 4 blocks of 256 threads each
vecAdd<<<4, 256>>>(dA, dB, dC, 1024);
Each thread has its own index (threadIdx) and knows which block it belongs to (blockIdx). This gives it a unique global identity, used to partition the work.
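The index arithmetic above is easy to sanity-check with a pure-Python model of the launch (plain Python standing in for the GPU scheduler; the shape 4 × 256 matches the example launch):

```python
# Pure-Python model of the vecAdd launch: 4 blocks of 256 threads.
GRID_DIM, BLOCK_DIM, N = 4, 256, 1024

indices = []
for blockIdx in range(GRID_DIM):
    for threadIdx in range(BLOCK_DIM):
        i = blockIdx * BLOCK_DIM + threadIdx  # same formula as the kernel
        if i < N:                             # same bounds guard as the kernel
            indices.append(i)

# Every element 0..N-1 is covered exactly once: no gaps, no overlap.
assert sorted(indices) == list(range(N))
```

The bounds guard matters whenever N is not a multiple of the block size: the last block launches its full complement of threads, and the extras must do nothing.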
Thread Hierarchy: threads → blocks → grid
- Thread: the fundamental execution unit. Runs one instance of the kernel.
- Block: a group of threads (up to 1024) that can cooperate via shared memory and synchronize with __syncthreads(). A block executes on a single Streaming Multiprocessor (SM).
- Grid: the collection of all blocks for a kernel launch. Blocks within a grid are independent — no synchronization is possible between blocks within a single launch.
Blocks can be 1D, 2D, or 3D (useful for images or volumetric data). Starting with compute capability 9.0 (Hopper), Thread Block Clusters add an optional grouping above blocks, allowing cooperative execution across multiple SMs.
Ch. 5: Memory Hierarchy
Memory is the central bottleneck on GPUs. Understanding the hierarchy is essential for writing fast kernels.
Coalescing
When threads in a warp access consecutive memory addresses, the hardware can satisfy the whole warp in a single transaction. This is called coalesced access. Strided or random accesses require multiple transactions, multiplying memory bandwidth consumption. Always arrange data so neighboring threads access neighboring addresses.
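The cost of breaking this rule can be estimated with a few lines of plain Python: count how many distinct memory segments a warp's loads touch under different strides (128-byte segments and 32-thread warps are the usual figures, assumed here rather than taken from this text):

```python
def segments_touched(stride_elems, elem_size=4, warp=32, segment=128):
    """Number of distinct memory segments a warp's loads hit."""
    addrs = [t * stride_elems * elem_size for t in range(warp)]
    return len({a // segment for a in addrs})

print(segments_touched(1))   # unit stride: 32 floats fit in 1 segment
print(segments_touched(32))  # stride-32: every thread hits its own segment
```

Unit-stride access needs one transaction per warp; a stride of 32 floats needs 32 — the same data volume delivered, but 32× the bandwidth consumed.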
Shared Memory Tiling Pattern
The canonical technique for raising arithmetic intensity (matrix multiply being the textbook example) is to tile the computation: load a tile of data from global memory into shared memory, synchronize, compute, and repeat. This replaces many slow global reads with fast shared-memory reads, turning a memory-bound kernel into a compute-bound one.
__shared__ float tile_A[TILE][TILE];
__shared__ float tile_B[TILE][TILE];

// inside the loop over phases along the K dimension:
// load phase: each thread fetches one element of each tile
tile_A[ty][tx] = A[row * N + (phase * TILE + tx)];
tile_B[ty][tx] = B[(phase * TILE + ty) * N + col];
__syncthreads();  // wait until the whole tile is loaded

// compute phase: partial dot product out of shared memory
for (int k = 0; k < TILE; ++k)
    sum += tile_A[ty][k] * tile_B[k][tx];
__syncthreads();  // don't overwrite tiles another thread still needs
Ch. 5: Heterogeneous & Async Programming
The CPU–GPU Relationship
CUDA treats the CPU as the host and the GPU as the device. They have separate memory spaces (unless using unified memory). The typical flow: allocate device memory (cudaMalloc), copy data from host to device (cudaMemcpy H2D), launch kernels, copy results back (cudaMemcpy D2H), free.
Kernel launches are asynchronous by default — the host thread returns immediately and can queue multiple kernels or H2D/D2H transfers. Streams are ordered queues of GPU work; operations in the same stream execute in order, while operations in different streams can overlap.
Asynchronous SIMT and Compute Capability
The asynchronous programming model deepens with newer hardware. cuda::barrier and cuda::pipeline (CUDA 11+) allow finer-grained producer-consumer pipelining within a block — threads can issue async copies with memcpy_async into shared memory while other threads compute on previously loaded data. This double-buffering pattern is fundamental in tensor core matmul implementations. Compute capability is a versioned number (e.g., 8.0 for Ampere, 9.0 for Hopper) that gates which hardware features are available.
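The double-buffering pattern can be shown structurally in a serial Python sketch — buffer rotation only, with no real concurrency or CUDA API; the tile data is made up:

```python
# Structural sketch of double buffering: two buffers alternate between
# "being filled" and "being computed on". On a GPU the fill would be an
# async copy overlapping the compute; here it is serialized for clarity.
def process_tiles(tiles):
    buffers = [None, None]
    buffers[0] = tiles[0]                  # prologue: stage the first tile
    results = []
    for i, _ in enumerate(tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = tiles[i + 1]    # stands in for the async copy
        results.append(sum(buffers[cur]))  # compute on the current tile
    return results

print(process_tiles([[1, 2], [3, 4], [5, 6]]))  # → [3, 7, 11]
```

The point is the index discipline: the buffer being computed on is never the one being filled, so (on real hardware) the copy and the compute can proceed concurrently.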
PyTorch: Tensors, Strides, and Storage
The tensor is the central data structure in PyTorch. Most users think of it as an n-dimensional array with a dtype, a device, and a shape. But the actual implementation is richer.
Storage and the Tensor–Storage Split
Physically, data lives in a Storage object: a flat, typed, device-resident buffer. A Tensor is a view on top of a Storage — it records sizes, strides, and an offset, but shares the underlying bytes. This is what enables zero-copy slicing and transposing.
x = torch.zeros(4, 4)
row = x[1, :]  # new Tensor, same Storage, offset=4, strides=(1,)
col = x[:, 0]  # new Tensor, same Storage, offset=0, strides=(4,)
Strides: the key abstraction
To find element tensor[i, j] in physical memory: offset + i * stride[0] + j * stride[1]. For a contiguous row-major (M, N) tensor, strides are (N, 1). A transpose flips them to (1, M) — no data copy required, just a metadata change. Many PyTorch "views" (slice, reshape, permute) are purely stride arithmetic.
tensor.is_contiguous() returns true when strides follow the standard row-major layout. Calling .contiguous() on a non-contiguous tensor triggers a copy. Many CUDA kernels require contiguous input — this is a common source of hidden copies in production code.
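The stride formula can be reproduced in a few lines of plain Python — a sketch of the arithmetic only, not PyTorch's actual implementation:

```python
def flat_index(offset, strides, idx):
    """Map a logical index to a position in flat storage."""
    return offset + sum(i * s for i, s in zip(idx, strides))

# A contiguous 3x4 row-major tensor: strides (4, 1).
storage = list(range(12))
strides = (4, 1)
assert storage[flat_index(0, strides, (1, 2))] == 6  # element [1][2]

# "Transpose" by swapping strides -- no data moved.
t_strides = (1, 4)
assert storage[flat_index(0, t_strides, (2, 1))] == 6  # transposed [2][1]
```

Swapping the stride tuple is the entire transpose: transposed element [2][1] resolves to the same storage slot as original element [1][2].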
The three extension axes
Every tensor is characterized by three orthogonal parameters that define a Cartesian product of possible tensor types:
- device: where the storage lives (cpu, cuda, xla, hip). Each device has its own allocator.
- layout: how logical indices map to physical memory (strided, sparse_coo, sparse_csr, MKL-DNN blocked, etc.)
- dtype: the scalar type (float32, bfloat16, int8, quantized types, etc.)
PyTorch: The Dispatch Chain
When you call torch.add(a, b), what actually happens? The call passes through a chain of dispatch steps:

1. Python → C++ boundary. Autogenerated binding code (THPVariable_add) uses PythonArgParser to extract C++ tensors from Python arguments, releases the GIL, and calls into C++.
2. Variable (autograd) dispatch. The Variable layer unwraps tensors, records the operation in the autograd graph if requires_grad=True, then calls the underlying implementation.
3. Device/layout dispatch. A virtual call on a "Type" object routes to the right backend — CPUFloatType, CUDAHalfType, etc. This is a dynamic dispatch because CPU and CUDA kernels live in separate shared libraries.
4. dtype dispatch. Inside the backend kernel, AT_DISPATCH_ALL_TYPES (a switch macro) selects the concrete scalar-type specialization and jumps into the numeric computation.
All code up to step 4 is autogenerated from operator schemas defined in native_functions.yaml. This is why you can add a new operator, write a single kernel, and automatically get Python bindings, gradient support, and dtype dispatch — without touching binding or dispatch code manually.
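The device-then-dtype structure can be modeled in a toy Python sketch — all names here (cpu_add, DISPATCH, the dtype strings) are illustrative stand-ins, not PyTorch's real internals:

```python
# Toy model of the dispatch chain: device routing first, then a dtype switch.
def cpu_add(a, b, dtype):
    # dtype dispatch: pick the concrete scalar-type specialization
    # (stands in for the AT_DISPATCH_ALL_TYPES switch macro).
    impls = {"float32": lambda x, y: float(x) + float(y),
             "int64":   lambda x, y: int(x) + int(y)}
    return impls[dtype](a, b)

def cuda_add(a, b, dtype):
    raise NotImplementedError("no GPU backend in this sketch")

# Device dispatch table (stands in for the virtual "Type" objects).
DISPATCH = {"cpu": cpu_add, "cuda": cuda_add}

def add(a, b, device="cpu", dtype="float32"):
    return DISPATCH[device](a, b, dtype)

assert add(1, 2, dtype="int64") == 3
assert add(1.5, 2.5) == 4.0
```

The two tables are independent, which is the point of the design: adding a backend means a new row in the device table, adding a dtype means a new case in the inner switch, and neither touches the other.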
Key directories in the PyTorch codebase
- torch/ — Python modules you import. Easy to hack on.
- torch/csrc/ — C++ Python bindings + autograd engine + JIT.
- aten/src/ATen/native/ — the actual operator kernels (modern C++).
- c10/ — core abstractions: TensorImpl, Storage, device/dtype types.
PyTorch: Autograd Internals
PyTorch implements reverse-mode automatic differentiation. The forward pass builds a dynamic computation graph (a DAG of Function nodes). Each node stores the inputs needed to compute its backward pass. When you call loss.backward(), the autograd engine traverses this graph in reverse, accumulating gradients.
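The shape of the mechanism — each forward op records a node holding its backward rule, and backward() walks the graph in reverse accumulating gradients — fits in a minimal scalar sketch (plain Python, illustrative only, with naive recursion where the real engine topologically sorts the graph):

```python
class Var:
    """Scalar reverse-mode autodiff: each op records backward closures."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=[(self, lambda g: g * other.value),   # d(out)/d(self)
                            (other, lambda g: g * self.value)])  # d(out)/d(other)

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=[(self, lambda g: g), (other, lambda g: g)])

    def backward(self, grad=1.0):
        self.grad += grad                   # accumulate, don't overwrite
        for parent, vjp in self.parents:
            parent.backward(vjp(grad))      # push upstream gradient down

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x
z.backward()
assert x.grad == 5.0   # dz/dx = y + 1
assert y.grad == 3.0   # dz/dy = x
```

Note that x receives gradient contributions along two paths (through the multiply and through the add), and accumulation with += is what makes that correct — the same reason PyTorch accumulates into .grad rather than assigning.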
The variable wrapping adds an AutogradMeta struct to each tensor that stores:
- The
grad_fn: theFunctionthat produced this tensor (null for leaf tensors) - The accumulated
.grad - Whether
requires_gradis set
Technically, the quantities computed are not full Jacobians but vector-Jacobian products (VJPs: the upstream gradient vector multiplied through each local Jacobian). For scalar losses this simplifies to standard gradients. For vector-valued outputs you need torch.autograd.grad with explicit grad_outputs.
In-place operations (e.g. x += 1) that modify a tensor needed for a backward pass will raise a RuntimeError. This is not a limitation — it's a correctness guard. If you're hitting this in production, check for unexpected in-place ops in your model.
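PyTorch detects this with a version counter: every in-place op bumps the tensor's version, and backward checks that saved versions still match. A toy sketch of that mechanism in plain Python (class and method names here are made up, not PyTorch's real API):

```python
class Tensor:
    def __init__(self, value):
        self.value, self.version = value, 0

    def add_(self, x):          # in-place op bumps the version counter
        self.value += x
        self.version += 1

class Saved:
    """Mimics autograd saving a tensor for backward via a version snapshot."""
    def __init__(self, tensor):
        self.tensor, self.saved_version = tensor, tensor.version

    def unpack(self):
        if self.tensor.version != self.saved_version:
            raise RuntimeError("a tensor needed for backward was modified in-place")
        return self.tensor

t = Tensor(1.0)
saved = Saved(t)      # forward pass saves t for backward
t.add_(1)             # in-place modification (like x += 1)
try:
    saved.unpack()    # backward pass detects the stale version
except RuntimeError as e:
    print("caught:", e)
```

The check is cheap (an integer compare at unpack time), which is why autograd can afford to guard every saved tensor.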
PyTorch: What Writing a Kernel Actually Looks Like
A kernel in ATen native consists of:
- Schema in native_functions.yaml: declares signature, dispatch keys, and whether it has an out= variant or an in-place variant.
- Error checking using TORCH_CHECK(cond, "message") or higher-level checkDim() helpers.
- Output allocation: create the result tensor via at::empty or at::zeros.
- dtype dispatch via AT_DISPATCH_ALL_TYPES(input.scalar_type(), "kernel_name", [&] { ... }) — the lambda is instantiated once per supported dtype.
- Parallelism: at::parallel_for for CPU; CUDA kernels are implicitly parallel.
- Data access: use TensorAccessor (bounds-checked, index-calculated) or raw pointers via data_ptr<scalar_t>().
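The accessor idea in the last item can be sketched in plain Python — illustrative only; ATen's actual TensorAccessor is a C++ template, and the class here is a made-up stand-in:

```python
class Accessor2D:
    """Bounds-checked 2-D view over flat row-major storage."""
    def __init__(self, data, rows, cols):
        assert len(data) == rows * cols
        self.data, self.rows, self.cols = data, rows, cols

    def __getitem__(self, idx):
        i, j = idx
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError(f"({i}, {j}) out of bounds for {self.rows}x{self.cols}")
        return self.data[i * self.cols + j]  # index math done for the caller

a = Accessor2D(list(range(6)), rows=2, cols=3)
assert a[1, 2] == 5
```

The tradeoff is the usual one: the accessor buys safety and readable indexing inside the kernel, while raw data_ptr<scalar_t>() pointers skip the checks when the hot loop has already validated its bounds.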
Most production kernels also define a derivatives.yaml entry for autograd, which tells the codegen what backward formula to use — often composing existing ops, occasionally a hand-written _backward function.