GPU Architecture Explained: How Graphics Cards Work

In a nutshell: A GPU is a processor built for doing a very large number of similar calculations at the same time. A CPU is optimised for flexible control flow and fast decisions across varied tasks; a GPU is optimised for parallel throughput across huge batches of pixels, vertices, tensors, simulations or hashes. That is why the same basic idea powers games, rendering, machine learning, scientific computing and parts of modern AI infrastructure.

Quick definitions

Term	What it means
GPU	Graphics Processing Unit. A parallel processor originally built for graphics, now widely used for AI and compute workloads.
Graphics card	The full add-in board: GPU chip, graphics memory, power delivery, cooling, PCB and display or PCIe connectors.
CUDA core	NVIDIA's term for a small arithmetic execution unit used for shader and compute work.
Streaming Multiprocessor (SM)	A repeated block inside an NVIDIA GPU that contains CUDA cores, Tensor Cores, schedulers, cache and other execution resources.
Warp	A group of 32 CUDA threads that are scheduled together on NVIDIA GPUs.
Tensor Core	A specialised unit for matrix operations, especially useful in AI training, inference and some graphics features.
RT Core	A specialised unit for ray-tracing work such as ray-triangle intersection tests.
GDDR / HBM	High-bandwidth graphics memory placed close to the GPU so data can be fed to thousands of execution units quickly.

Watch the visual explanation

The article below builds on Branch Education's visual teardown of GPU architecture, using NVIDIA's GA102 family as a concrete example. The video is especially useful for seeing the repeated blocks inside the die and the physical layout of a graphics card.

GPU vs CPU: different jobs

The easiest mistake is to ask whether a GPU is "better" than a CPU. It is not a universal upgrade; it is a different kind of processor. A CPU has a smaller number of powerful, flexible cores. It is excellent at running an operating system, handling interrupts, making branch-heavy decisions, managing I/O and executing many kinds of instructions with low latency.

A GPU takes the opposite trade-off. It uses thousands of simpler execution units, organised into repeated blocks, and keeps them busy by sending many similar operations through the chip at once. If the workload can be split into many independent pieces, the GPU wins. If the workload is sequential, branch-heavy or constantly waiting on outside events, the CPU usually wins.
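One way to put rough numbers on this trade-off is Amdahl's law, which is not part of the original discussion but is a standard rule of thumb. The sketch below assumes a job has a fixed fraction that can be spread across workers and a remainder that must run sequentially; the function name and the sample fractions are illustrative only, and the model ignores memory, data transfer and scheduling costs.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many workers exist.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative fractions only; real workloads need to be measured.
for fraction in (0.50, 0.95, 0.999):
    speedup = amdahl_speedup(fraction, workers=10_000)
    print(f"{fraction:.1%} parallel on 10,000 workers: at most {speedup:.1f}x")
```

Under this simplified model, a job that is only half parallel can never run more than twice as fast, however many execution units are available, which is why sequential work tends to stay on the CPU.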

Question	CPU	GPU
Core design	Fewer, more complex cores	Many simpler parallel execution units
Best at	Control flow, OS tasks, sequential logic, low-latency decisions	High-throughput parallel arithmetic
Typical work	Business applications, databases, browsers, operating systems, orchestration	Pixels, vertices, matrices, tensors, simulations, batch compute
Failure mode	Can be underused if the job is massively parallel	Can stall if data is not ready or threads diverge heavily
Relationship	The host that prepares, schedules and coordinates work	The accelerator that executes parallel batches

Graphics card vs GPU die

In everyday speech, people often say "GPU" when they mean the whole graphics card. Technically, the GPU is the silicon die: the processor chip with billions of transistors. The graphics card is the full board that lets the chip work inside a PC or server.

GPU die. The processor itself, built from repeated blocks of execution units, caches, memory controllers and media/display logic.
Graphics memory. Usually GDDR on gaming/workstation cards, or HBM on many data-centre accelerators. It stores textures, frame buffers, model weights, activations and other data close to the processor.
PCB. The circuit board that connects the GPU, memory, power delivery, PCIe interface and display outputs.
Power delivery. Voltage regulators and power stages that convert supply power into the stable low voltages the chip needs.
Cooling. Heatsinks, fans, vapour chambers or liquid cooling that keep the chip and memory inside operating limits.

This distinction matters for buyers. Two cards can use related GPU silicon but differ in memory capacity, cooling, power limit, firmware, driver support, warranty and suitability for 24/7 workloads.

Inside a GPU: SMs, CUDA, Tensor and RT cores

Modern GPUs are deliberately repetitive. NVIDIA's GA102, the Ampere-generation die used across several RTX 30-series cards, is a useful example because the layout is easy to explain: it contains up to 7 Graphics Processing Clusters (GPCs), 84 Streaming Multiprocessors (SMs), 10,752 CUDA cores, 336 Tensor Cores and 84 RT Cores in the full configuration. Product SKUs may expose fewer active blocks.
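The repetition shows up in the arithmetic. The short sketch below takes the full-configuration GA102 totals quoted above and divides them out to get the per-SM and per-GPC figures; the variable and function names are chosen here for illustration.

```python
# Full-configuration GA102 totals as quoted in the text.
GPCS = 7
SMS = 84
CUDA_CORES = 10_752
TENSOR_CORES = 336
RT_CORES = 84


def evenly_divided(total: int, blocks: int) -> int:
    """Units per block, checking that the total splits evenly."""
    per_block, remainder = divmod(total, blocks)
    if remainder:
        raise ValueError(f"{total} does not divide evenly across {blocks} blocks")
    return per_block


per_sm = {
    "cuda_cores": evenly_divided(CUDA_CORES, SMS),
    "tensor_cores": evenly_divided(TENSOR_CORES, SMS),
    "rt_cores": evenly_divided(RT_CORES, SMS),
}
sms_per_gpc = evenly_divided(SMS, GPCS)

print("Per SM:", per_sm)
print("SMs per GPC:", sms_per_gpc)
```

Every total divides evenly, which is what you would expect from a chip built by stamping out the same SM many times.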

The SM is the main building block. Each SM has schedulers, registers, cache/shared memory and execution resources. CUDA cores handle general shader and arithmetic work. Tensor Cores accelerate matrix operations. RT Cores accelerate ray-tracing intersection work. They are not interchangeable; they are specialised units that let the chip handle different parts of a graphics or compute pipeline efficiently.

Unit	What it does	Where it helps
CUDA cores	General arithmetic such as add, multiply, fused multiply-add and shader operations.	Raster graphics, physics, simulation, general GPU compute.
Tensor Cores	Matrix multiply-accumulate operations at high throughput.	AI training, AI inference, upscaling, denoising and other matrix-heavy workloads.
RT Cores	Accelerate ray/bounding-box and ray/triangle intersection tests.	Real-time ray tracing, photorealistic rendering and some simulation workloads.
Schedulers and registers	Feed work to execution units and hold thread state.	Keeping thousands of threads moving without CPU-style overhead.
Caches and shared memory	Store frequently used data close to the execution units.	Reducing trips to external memory and improving throughput.

How GPU parallel work is scheduled

A GPU program breaks work into many threads. On NVIDIA GPUs, threads are grouped into warps of 32. The threads in a warp are scheduled together, which is why GPUs are happiest when neighbouring threads run the same instruction on different pieces of data. NVIDIA calls this model SIMT: Single Instruction, Multiple Threads.

Imagine shading a frame in a game. Millions of pixels and vertices need related calculations. The GPU can hand similar work to huge numbers of threads and run them in parallel. The same logic applies to AI: matrix multiplication is a large grid of repeated arithmetic, so Tensor Cores can process many parts of the calculation at once.
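To get a feel for the scale, the sketch below counts how many 32-thread warps a frame would need. It assumes, purely for illustration, a 1920x1080 frame with exactly one thread per pixel; real renderers launch work in more complicated ways, and the function names are made up for this example.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needed(thread_count: int) -> int:
    """Number of warps needed to cover thread_count threads (rounded up)."""
    return (thread_count + WARP_SIZE - 1) // WARP_SIZE


def idle_lanes(thread_count: int) -> int:
    """Lanes in the final warp that have no thread to run."""
    return warps_needed(thread_count) * WARP_SIZE - thread_count


# Assumption: one thread per pixel of a 1920x1080 frame.
pixels = 1920 * 1080
print(f"{pixels:,} threads need {warps_needed(pixels):,} warps")

# A thread count that is not a multiple of 32 leaves part of the last warp empty.
print("Idle lanes for 1,000 threads:", idle_lanes(1000))
```

Thread counts that are multiples of 32 fill every warp completely, which is one reason GPU code often sizes its work in multiples of the warp size.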

The catch: Parallelism only helps when the work can actually be split. If half the threads in a warp take one branch and half take the other, the warp has to run both paths one after the other, with the threads that are not on the current path sitting idle. This is called divergence, and it is one reason GPU programming rewards careful data layout and predictable control flow.
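Divergence can be illustrated with a toy model. The sketch below is not how real hardware is implemented: it assumes a warp needs one pass per distinct branch path taken by its 32 threads and that every path costs the same. The function name and the path labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence

WARP_SIZE = 32


def divergence_report(paths: Sequence[Hashable]) -> dict:
    """Toy model: one pass per distinct path taken inside a single warp."""
    if len(paths) != WARP_SIZE:
        raise ValueError(f"expected one path label per lane ({WARP_SIZE} lanes)")
    lanes_per_path = Counter(paths)
    passes = len(lanes_per_path)
    return {
        "passes": passes,
        "active_lanes_per_pass": dict(lanes_per_path),
        # With equal-cost paths, each lane is busy in one pass out of `passes`.
        "lane_utilisation": 1.0 / passes,
    }


uniform = ["if"] * WARP_SIZE
split = ["if" if lane < 16 else "else" for lane in range(WARP_SIZE)]

print("uniform warp:", divergence_report(uniform))
print("split warp:  ", divergence_report(split))
```

In this simplified picture a warp whose threads all agree finishes in one pass, while an evenly split warp needs two passes with half the lanes idle in each.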

Why GPU memory matters

A GPU can have enormous arithmetic throughput and still perform poorly if it cannot get data quickly enough. That is why graphics cards use fast graphics memory such as GDDR6X, and why data-centre accelerators often use HBM (High Bandwidth Memory). The memory subsystem has to feed thousands of execution units, not just a handful of CPU cores.

Capacity decides how large a scene, model or dataset can fit locally.
Bandwidth decides how quickly the GPU can move data in and out of memory.
Cache and shared memory decide how often data can be reused without going back to external memory.
Interconnect matters when multiple GPUs work together. PCIe, NVLink and networking determine how expensive it is to move data between accelerators.
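The interaction between bandwidth and compute can be sketched with a roofline-style estimate, a common simplification that is brought in here for illustration. The sketch assumes peak memory bandwidth is bus width times per-pin data rate, and that achievable throughput is capped by whichever is lower: the chip's peak arithmetic rate, or bandwidth multiplied by how many operations the code performs per byte it moves. All input values below are illustrative placeholders, not the specification of any particular card.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bits per transfer x transfers per second / 8."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


def attainable_tflops(peak_tflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline-style cap: the lower of the compute limit and the memory limit."""
    memory_limit_tflops = bandwidth_gb_s * flops_per_byte / 1000  # GFLOP/s to TFLOP/s
    return min(peak_tflops, memory_limit_tflops)


# Illustrative inputs only; check a vendor spec sheet for real figures.
bandwidth = peak_bandwidth_gb_s(bus_width_bits=384, data_rate_gbps_per_pin=19.5)
peak = 35.0  # hypothetical peak arithmetic rate in TFLOP/s

for label, intensity in (("low data reuse", 0.25), ("high data reuse", 100.0)):
    cap = attainable_tflops(peak, bandwidth, intensity)
    print(f"{label}: {intensity} FLOP/byte, capped at about {cap:.3f} TFLOP/s")
```

Under these assumptions, code that does little arithmetic per byte is limited by memory long before it reaches the chip's arithmetic peak, which is why caches and data reuse matter so much.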

For AI infrastructure, memory is often the first hard constraint. A model that fits comfortably in memory can run efficiently; a model that spills across devices needs careful partitioning, high-speed interconnects and more engineering effort.
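A first-pass sizing check can be sketched in a few lines. The example below counts model weights only and assumes 2 bytes per parameter (16-bit weights); it ignores activations, caches, optimiser state and framework overhead, all of which add to the real requirement. The 24 GB capacity and the parameter counts are hypothetical inputs, and the function names are made up for this sketch.

```python
GB = 10**9  # decimal gigabytes, as used on most spec sheets


def weights_bytes(parameter_count: int, bytes_per_parameter: int) -> int:
    """Memory needed for the weights alone."""
    return parameter_count * bytes_per_parameter


def min_devices(required_bytes: int, capacity_bytes: int) -> int:
    """Fewest devices whose combined memory covers the requirement (rounded up)."""
    return -(-required_bytes // capacity_bytes)


capacity = 24 * GB  # hypothetical per-GPU memory

for name, params in (("7-billion-parameter model", 7 * 10**9),
                     ("70-billion-parameter model", 70 * 10**9)):
    need = weights_bytes(params, bytes_per_parameter=2)
    print(f"{name}: {need / GB:.0f} GB of weights, "
          f"at least {min_devices(need, capacity)} device(s)")
```

This is a lower bound rather than a sizing recommendation: once a model needs more than one device, partitioning strategy and interconnect speed start to dominate.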

Why GPUs fit graphics and AI

GPUs became famous through graphics because graphics is naturally parallel. Each frame is built from a pipeline of geometry, rasterisation, shading, texture lookups and post-processing. Many of those steps can be applied to thousands or millions of pixels, vertices or samples independently.

AI uses the same hardware for a different reason: neural networks are dominated by matrix and vector operations. Training and inference both involve large batches of multiply-add operations. Tensor Cores exist because that pattern is so common and so performance-critical.
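The multiply-add pattern is easy to see in a plain matrix multiplication. The sketch below is deliberately naive pure Python, nothing like how a GPU library does it: it multiplies two small matrices and counts the multiply-accumulate steps. The function name is made up for this example.

```python
def matmul_with_mac_count(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), counting multiply-accumulates."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            # Each output element depends only on row i of a and column j of b,
            # so all m x n elements could be computed independently.
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
                macs += 1
    return c, macs


product, macs = matmul_with_mac_count([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "using", macs, "multiply-accumulates")
```

The count grows as m x n x k, so even modest layer sizes produce a very large amount of identical, independent arithmetic, which is the kind of work parallel hardware handles well.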

Workload	Why a GPU helps
Gaming and rendering	Many pixels, vertices, rays and samples can be processed in parallel.
AI training	Large matrix operations can be split across Tensor Cores and multiple GPUs.
AI inference	Batched requests and matrix-heavy model layers benefit from high throughput.
Scientific computing	Simulations often apply the same numerical method across many grid cells or particles.
Video processing	Encoding, decoding and image operations map well to repeated parallel steps.
Cryptographic hashing	Large numbers of independent hashes can be attempted in parallel, where the algorithm permits it.

Binning and product tiers

Not every manufactured die is perfect. Tiny defects can affect one section of a chip while the rest still works. GPU vendors design around this by making the chip highly repetitive and then disabling faulty or surplus blocks. The result is binning: tested chips are sorted into product tiers based on how many blocks work, what clocks they can sustain and how much power they require.

That is why several cards can come from the same die family but expose different core counts, memory configurations or power limits. In the GA102 family, for example, the full die has 84 SMs, while some products expose fewer active SMs. Binning improves manufacturing yield and lets one chip design serve multiple price and performance bands.
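Binning can be sketched as a simple sorting rule. In the example below, the per-SM core count comes from dividing the full-die figures quoted earlier (10,752 CUDA cores across 84 SMs). The tier names and SM thresholds are hypothetical and do not describe real products, and real binning also weighs clocks and power, which this sketch ignores.

```python
from typing import Optional, Tuple

FULL_DIE_SMS = 84
CUDA_CORES_PER_SM = 10_752 // FULL_DIE_SMS  # 128, from the full-die totals

# Hypothetical tiers, best first: (tier name, SMs enabled in that tier).
TIERS = [("tier-A", 84), ("tier-B", 82), ("tier-C", 68)]


def bin_die(working_sms: int) -> Optional[Tuple[str, int, int]]:
    """Return (tier, enabled SMs, enabled CUDA cores), or None if no tier fits."""
    if not 0 <= working_sms <= FULL_DIE_SMS:
        raise ValueError("working_sms must be between 0 and the full-die count")
    for name, enabled_sms in TIERS:
        if working_sms >= enabled_sms:
            # Working SMs beyond the tier's count are switched off as surplus.
            return name, enabled_sms, enabled_sms * CUDA_CORES_PER_SM
    return None


for working in (84, 83, 70, 50):
    print(working, "working SMs ->", bin_die(working))
```

A die with 83 working SMs lands in the 82-SM tier in this sketch, with one good SM disabled, which mirrors the "faulty or surplus blocks" idea described above.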

Buyer checklist

Is the workload graphics, AI training, AI inference, rendering, simulation or general desktop use?
Does the software stack support this GPU well: drivers, CUDA or ROCm, frameworks, plugins and vendor support?
How much local GPU memory is required for the model, scene, batch size or dataset?
Is memory bandwidth or compute throughput the bottleneck?
Will one GPU be enough, or does the workload need multi-GPU interconnect and cluster networking?
Can the workstation, rack or data centre support the card's power draw and cooling requirements?
Is this a 24/7 production workload that needs data-centre-class support rather than a consumer graphics card?
Are licensing, warranty, remote management and replacement timelines acceptable for the business risk?

Common misconceptions

More CUDA cores do not automatically mean better real-world performance. Memory capacity, bandwidth, clocks, software support, cooling and workload shape all matter.
A GPU is not a replacement CPU. It is usually an accelerator controlled by a host CPU and operating system.
Gaming cards and data-centre accelerators are not the same product with a different label. Reliability, firmware, cooling form factor, memory type, ECC support, virtualisation and support contracts can differ materially.
AI performance is not just about Tensor Cores. Input pipelines, storage, networking, batching, model architecture and orchestration can all become the bottleneck.
Ray tracing is not just "more shaders". Dedicated RT hardware exists because realistic ray traversal and intersection work has a different shape from ordinary raster shading.

Frequently asked questions

What is a GPU in simple terms?

A GPU is a processor designed to run many similar calculations at the same time. It was originally built for graphics, where millions of pixels and vertices can be processed in parallel, and is now widely used for AI, rendering, simulation and other parallel workloads.

How is a GPU different from a CPU?

A CPU has fewer, more flexible cores that are excellent at control flow, operating systems and low-latency decision-making. A GPU has many simpler execution units that deliver high throughput when a job can be split into thousands of similar parallel tasks.

What are CUDA cores?

CUDA cores are NVIDIA's term for the arithmetic execution units inside its GPUs. They handle general shader and compute operations. They are important, but real performance also depends on memory bandwidth, clocks, software support, cooling and the workload itself.

Why are GPUs useful for AI?

AI models rely heavily on matrix and vector operations. Those operations can be split into many repeated calculations, which is exactly what GPUs are built to do. Tensor Cores further accelerate matrix multiply-accumulate operations used in training and inference.

What is the difference between a GPU and a graphics card?

The GPU is the processor chip. The graphics card is the full board that includes the GPU, graphics memory, power delivery, cooling, PCB and connectors. In casual speech people often use GPU to mean the whole card, but technically they are different.
