The Graphics Processing Unit (GPU) is architecturally engineered for throughput over latency, in sharp contrast to the CPU's emphasis on single-thread speed. When a program offloads a task, such as rendering geometry or performing large-scale matrix multiplication, the CPU (the host) marshals the data and instructions (kernels or shaders) and transfers them across the PCI Express (PCIe) bus to the GPU's dedicated Video RAM (VRAM). The GPU, acting as the device, distributes this workload across its many Streaming Multiprocessors (SMs) or Compute Units.

Within these units, execution follows the Single Instruction, Multiple Threads (SIMT) model: tens of thousands of lightweight threads may be in flight at once, grouped into small fixed-size bundles, typically 32 or 64 threads, known as warps (NVIDIA) or wavefronts (AMD). All threads within a warp execute the same instruction simultaneously, each operating on a different data element. This massive parallelism is highly efficient for data-parallel tasks, but it faces a major challenge in branch divergence: if a conditional statement requires threads in the same warp to follow different execution paths, the hardware must execute each path in turn, masking off the threads that are inactive on that path. This serialization significantly reduces performance and explains why GPU programming favors uniform, non-branching code.

Finally, the GPU manages data through a specialized memory hierarchy: high-bandwidth GDDR or HBM memory off-chip, complemented by a small amount of fast, on-chip shared memory accessible only to threads within the same block. Using this shared memory well is critical for minimizing high-latency VRAM accesses.
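The host-to-device workflow described above can be sketched in CUDA C++. This is a minimal, illustrative vector-add (the kernel name `vecAdd` and the launch configuration are assumptions, not a prescribed API usage); it shows the host allocating VRAM, copying data across PCIe, launching a kernel in which each SIMT thread handles one element, and copying the result back:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Each thread computes one output element; the hardware runs the
// threads of a warp in lock step over different data (SIMT).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host (CPU) buffers.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) buffers in VRAM; data crosses the PCIe bus
    // via explicit cudaMemcpy calls.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements;
    // the SMs schedule these blocks' warps for execution.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Note that the kernel body is branch-friendly: the only conditional is the bounds guard, which diverges in at most one warp at the tail of the array.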
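The interplay between shared memory and divergence can also be sketched. The block-level reduction below (a common pattern; the kernel name `blockSum` and the 256-thread block size are assumptions) stages data from VRAM into on-chip shared memory, then sums it with a stride-halving loop. Because the active threads at each step are the contiguous range `0..s-1`, entire warps drop out together, so intra-warp divergence only appears once fewer than a warp's worth of threads remain:

```cuda
#include <cuda_runtime.h>

// One partial sum per block, computed in fast on-chip shared memory
// instead of repeated high-latency VRAM round trips.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // visible only within this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;  // stage VRAM data on-chip
    __syncthreads();                     // wait for the whole block

    // Stride-halving reduction: active threads stay contiguous,
    // so warps retire whole until s drops below the warp size.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();                 // tile updates must be visible
    }

    if (tid == 0)                        // thread 0 publishes the result
        out[blockIdx.x] = tile[0];
}
```

A naive alternative, such as `if (tid % 2 == 0)` at each step, would scatter active and inactive threads through every warp and force the hardware to serialize both sides of the branch throughout the loop.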