While the Central Processing Unit (CPU) is designed as a latency-optimized generalist—capable of handling complex logic, branching, and sequential tasks with incredible speed—the Graphics Processing Unit (GPU) operates on a fundamentally different philosophy: massive parallelism. When a program offloads a task to the GPU, whether it is rendering a 3D scene or training a neural network, the execution model shifts from a single, fast lane to a massive, multi-lane highway.
The Host-Device Handshake
The process begins on the CPU, referred to in this context as the “host.” The host program prepares the data (geometry, textures, or matrices) and the instructions (shaders or compute kernels) required for the task. Because the CPU and GPU typically possess separate memory spaces, the host must marshal this data and transfer it across the PCI Express (PCIe) bus into the GPU’s dedicated Video RAM (VRAM). This bus often acts as a bottleneck, so efficient programs minimize transfer frequency, preferring to load large batches of data at once. The CPU then issues a draw call or a dispatch command, placing a packet of instructions into a command buffer. This signals the GPU driver, which translates these high-level commands into machine code that the specific GPU microarchitecture can execute.
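The preference for batched transfers can be made concrete with a toy cost model: each transfer pays a fixed setup overhead (driver and DMA bookkeeping) plus a per-byte copy cost. The constants below are illustrative assumptions, not measurements of any real bus.

```python
# Toy cost model for PCIe transfers: each transfer pays a fixed setup
# latency plus a per-byte cost. SETUP_US and US_PER_KB are illustrative
# assumptions, not measurements of real hardware.
SETUP_US = 10.0    # fixed per-transfer setup overhead (microseconds)
US_PER_KB = 0.03   # per-kilobyte copy cost at an assumed bus bandwidth

def transfer_cost_us(num_transfers: int, total_kb: float) -> float:
    """Total time to move total_kb across the bus in num_transfers chunks."""
    return num_transfers * SETUP_US + total_kb * US_PER_KB

# Moving 64 MB as 1,000 small copies vs. one batched copy:
small = transfer_cost_us(1000, 64 * 1024)
batched = transfer_cost_us(1, 64 * 1024)
print(f"1000 small copies: {small:.0f} us, 1 batched copy: {batched:.0f} us")
```

The per-byte cost is identical in both cases; only the repeated setup overhead separates them, which is why drivers and frameworks encourage staging data into large contiguous uploads.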
The Work Distributor
Once the command reaches the GPU (the “device”), a hardware scheduler such as NVIDIA’s “GigaThread Engine” takes over. Unlike the CPU, which might juggle a few dozen threads, the GPU is designed to manage tens of thousands of simultaneous threads. The scheduler decomposes the workload into a grid of thread blocks. These blocks are then distributed across the available Streaming Multiprocessors (SMs) or Compute Units. This distribution is dynamic; if the GPU has more hardware cores, it processes more blocks in parallel, allowing the same code to scale naturally across different tiers of hardware without manual intervention.
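The decomposition and distribution step can be sketched in a few lines. The function name, the round-robin assignment policy, and the parameters here are illustrative assumptions, not a real driver API; actual hardware schedulers use more sophisticated, occupancy-aware policies.

```python
import math

def launch_grid(num_elements: int, block_size: int, num_sms: int):
    """Decompose a 1-D workload into thread blocks and deal them out to
    SMs. The round-robin policy is an illustrative stand-in for the
    hardware scheduler, not how any real GPU assigns blocks."""
    num_blocks = math.ceil(num_elements / block_size)
    # Round-robin assignment: block i runs on SM (i mod num_sms).
    per_sm = [list(range(sm, num_blocks, num_sms)) for sm in range(num_sms)]
    return num_blocks, per_sm

# 1M elements, 256 threads per block, dealt across 8 SMs:
blocks, schedule = launch_grid(1_000_000, 256, num_sms=8)
# A GPU with more SMs drains the same grid in fewer waves -- the kernel
# code itself never changes.
```

This is the sense in which GPU code scales “naturally”: the grid describes the total work, and the hardware decides how much of it runs concurrently.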
Single Instruction, Multiple Threads (SIMT)
Inside the Streaming Multiprocessor, the execution model differs sharply from that of the CPU. The hardware groups threads into bundles, commonly known as “warps” (in NVIDIA terminology) or “wavefronts” (in AMD terminology), typically 32 threads for a warp and 64 for a wavefront. These threads operate in “lockstep,” meaning they all execute the exact same instruction at the same time, but on different pieces of data. This architecture, known as Single Instruction, Multiple Threads (SIMT), allows the GPU to devote the vast majority of its transistor count to Arithmetic Logic Units (ALUs) rather than control logic and caching.
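Lockstep execution can be mimicked with a vectorized array operation, where one expression plays the role of one instruction applied across every lane of a warp. This is an analogy in NumPy, not device code.

```python
import numpy as np

WARP_SIZE = 32  # NVIDIA warp width; AMD wavefronts are traditionally 64 wide

# One warp: 32 lanes, each holding a different data element. The single
# vectorized expression below stands in for one instruction executed in
# lockstep -- same operation, 32 different operands.
lane_id = np.arange(WARP_SIZE)
data = np.full(WARP_SIZE, 10.0)

result = data * 2.0 + lane_id  # every lane runs "multiply by 2, add lane id"
```

Each lane computes a distinct value, yet no lane ever runs a different instruction than its neighbors, which is exactly the property the SIMT hardware exploits.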
Branch Divergence and Masking
The rigidity of lockstep execution introduces a unique challenge known as branch divergence. If the code contains a conditional statement (an if-else block) where half the threads in a warp need to take the “true” path and the other half need to take the “false” path, the hardware cannot execute both simultaneously. Instead, it must serialize the execution. It runs the “true” path while masking off (deactivating) the threads that took the other branch. Once the first path is complete, it inverts the mask and executes the “false” path for the remaining threads. Both paths are executed for the warp, but each thread only commits results for the path it actually took. This phenomenon explains why complex branching logic can severely degrade GPU performance.
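The mask-and-invert mechanism can be simulated directly with boolean masks. The branch here (halve even values, apply 3x+1 to odd ones) is an arbitrary example; the point is that both paths are evaluated over the whole warp and the mask decides which lanes commit.

```python
import numpy as np

def diverge(data: np.ndarray) -> np.ndarray:
    """Simulate a warp hitting `if x is even: x //= 2 else: x = 3*x + 1`.
    Both paths execute over all lanes; the mask controls which lanes
    commit. A sketch of the hardware behavior, not real SIMT execution."""
    mask = (data % 2 == 0)             # lanes taking the "true" path
    out = data.copy()
    # Pass 1: "true" path runs; only active (masked-in) lanes commit.
    out[mask] = (data // 2)[mask]
    # Pass 2: mask inverted; "false" path commits for the remaining lanes.
    out[~mask] = (3 * data + 1)[~mask]
    return out                         # two serialized passes instead of one

warp = np.arange(8)  # a tiny 8-lane "warp" for illustration
print(diverge(warp))
```

With a 50/50 split the warp pays for both paths in sequence, so the effective throughput of the divergent region is roughly halved, which is the cost the prose above describes.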
The Memory Hierarchy and Output
As the ALUs crunch the numbers, they rely on a specialized memory hierarchy designed for throughput. While GPU memory has higher latency than CPU cache, it possesses significantly higher bandwidth, allowing it to feed the thousands of hungry cores simultaneously. Threads often use fast, on-chip “shared memory” to communicate with neighbors in the same block, reducing the need to reach out to the slower VRAM. Upon completion of the calculation or the pixel shading, the results are written to the output buffer. For graphics, this is the framebuffer, which is eventually scanned out to the monitor. For compute tasks, the data remains in VRAM until the CPU explicitly requests a transfer back across the PCIe bus to read the results.
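The shared-memory pattern is easiest to see in a block-wise reduction: each block stages its tile of global memory into a fast on-chip scratchpad, reduces it locally, and writes back a single partial result. The sketch below models that pattern in NumPy under stated assumptions; it is not actual device code, and the names are illustrative.

```python
import numpy as np

def block_sum(vram: np.ndarray, block_size: int) -> np.ndarray:
    """Each 'block' copies its tile of global memory ('vram') into a
    stand-in for on-chip shared memory, reduces it there, and writes one
    partial sum back -- one slow global write per block instead of one
    per thread. A NumPy sketch of the pattern, not real device code."""
    partials = []
    for start in range(0, len(vram), block_size):
        shared = vram[start:start + block_size].copy()  # on-chip scratchpad
        partials.append(shared.sum())                   # intra-block reduction
    return np.array(partials)  # partial sums stay in VRAM until the host reads them

data = np.ones(1024)
print(block_sum(data, 256))  # four blocks yield four partial sums
```

On real hardware the host would then either launch a second kernel to combine the partial sums or copy them back across the PCIe bus, mirroring the explicit read-back step described above.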