Quick definitions
| Term | What it means |
|---|---|
| GPU | Graphics Processing Unit. A parallel processor originally built for graphics, now widely used for AI and compute workloads. |
| Graphics card | The full add-in board: GPU chip, graphics memory, power delivery, cooling, PCB, display outputs and the PCIe connector. |
| CUDA core | NVIDIA's term for a small arithmetic execution unit used for shader and compute work. |
| Streaming Multiprocessor (SM) | A repeated block inside an NVIDIA GPU that contains CUDA cores, Tensor Cores, schedulers, cache and other execution resources. |
| Warp | A group of 32 CUDA threads that are scheduled together on NVIDIA GPUs. |
| Tensor Core | A specialised unit for matrix operations, especially useful in AI training, inference and some graphics features. |
| RT Core | A specialised unit for ray-tracing work such as ray-triangle intersection tests. |
| GDDR / HBM | High-bandwidth graphics memory placed close to the GPU so data can be fed to thousands of execution units quickly. |
Watch the visual explanation
The article below builds from Branch Education's visual teardown of GPU architecture, using NVIDIA's GA102 family as a concrete example. The video is especially useful for seeing the repeated blocks inside the die and the physical layout of a graphics card.
GPU vs CPU: different jobs
The easiest mistake is to ask whether a GPU is "better" than a CPU. It is not a universal upgrade; it is a different kind of processor. A CPU has a smaller number of powerful, flexible cores. It is excellent at running an operating system, handling interrupts, making branch-heavy decisions, managing I/O and executing many kinds of instructions with low latency.
A GPU takes the opposite trade-off. It uses thousands of simpler execution units, organised into repeated blocks, and keeps them busy by sending many similar operations through the chip at once. If the workload can be split into many independent pieces, the GPU usually wins. If the workload is sequential, branch-heavy or constantly waiting on outside events, the CPU usually wins.
| Question | CPU | GPU |
|---|---|---|
| Core design | Fewer, more complex cores | Many simpler parallel execution units |
| Best at | Control flow, OS tasks, sequential logic, low-latency decisions | High-throughput parallel arithmetic |
| Typical work | Business applications, databases, browsers, operating systems, orchestration | Pixels, vertices, matrices, tensors, simulations, batch compute |
| Failure mode | Can be underused if the job is massively parallel | Can stall if data is not ready or threads diverge heavily |
| Relationship | The host that prepares, schedules and coordinates work | The accelerator that executes parallel batches |
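The trade-off above can be put into rough numbers with Amdahl's law, which relates overall speedup to the fraction of a job that can run in parallel. The sketch below is illustrative only: the function name and the figure of 1,000 parallel units are assumptions chosen for the example, and it ignores the fact that an individual GPU execution unit is slower than a CPU core.

```python
def amdahl_speedup(parallel_fraction: float, parallel_units: int) -> float:
    """Overall speedup when only part of a job can be spread across units."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_units < 1:
        raise ValueError("parallel_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at normal speed; the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / parallel_units)


# Hypothetical example: the same 1,000 parallel units, three workload shapes.
for fraction in (0.50, 0.95, 0.999):
    speedup = amdahl_speedup(fraction, parallel_units=1000)
    print(f"{fraction:.1%} parallel -> about {speedup:.1f}x faster")
```

Even with a thousand units available, a job that is only half parallel cannot quite double in speed, which is why branch-heavy, sequential work stays on the CPU.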
Graphics card vs GPU die
In everyday speech, people often say "GPU" when they mean the whole graphics card. Technically, the GPU is the silicon die: the processor chip with billions of transistors. The graphics card is the full board that lets the chip work inside a PC or server.
- GPU die. The processor itself, built from repeated blocks of execution units, caches, memory controllers and media/display logic.
- Graphics memory. Usually GDDR on gaming/workstation cards, or HBM on many data-centre accelerators. It stores textures, frame buffers, model weights, activations and other data close to the processor.
- PCB. The circuit board that connects the GPU, memory, power delivery, PCIe interface and display outputs.
- Power delivery. Voltage regulators and power stages that convert supply power into the stable low voltages the chip needs.
- Cooling. Heatsinks, fans, vapour chambers or liquid cooling that keep the chip and memory inside operating limits.
This distinction matters for buyers. Two cards can use related GPU silicon but differ in memory capacity, cooling, power limit, firmware, driver support, warranty and suitability for 24/7 workloads.
Inside a GPU: SMs, CUDA, Tensor and RT cores
Modern GPUs are deliberately repetitive. NVIDIA's GA102, the Ampere-generation die used across several RTX 30-series cards, is a useful example because the layout is easy to explain: it contains up to 7 Graphics Processing Clusters (GPCs), 84 Streaming Multiprocessors (SMs), 10,752 CUDA cores, 336 Tensor Cores and 84 RT Cores in the full configuration. Product SKUs may expose fewer active blocks.
The SM is the main building block. Each SM has schedulers, registers, cache/shared memory and execution resources. CUDA cores handle general shader and arithmetic work. Tensor Cores accelerate matrix operations. RT Cores accelerate ray-tracing intersection work. They are not interchangeable; they are specialised units that let the chip handle different parts of a graphics or compute pipeline efficiently.
| Unit | What it does | Where it helps |
|---|---|---|
| CUDA cores | General arithmetic such as add, multiply, fused multiply-add and shader operations. | Raster graphics, physics, simulation, general GPU compute. |
| Tensor Cores | Matrix multiply-accumulate operations at high throughput. | AI training, AI inference, upscaling, denoising and other matrix-heavy workloads. |
| RT Cores | Accelerate ray/bounding-box and ray/triangle intersection tests. | Real-time ray tracing, photorealistic rendering and some simulation workloads. |
| Schedulers and registers | Feed work to execution units and hold thread state. | Keeping thousands of threads moving without CPU-style overhead. |
| Caches and shared memory | Store frequently used data close to the execution units. | Reducing trips to external memory and improving throughput. |
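The full-die GA102 figures quoted earlier in this section imply a fixed recipe per SM. As a quick check, the sketch below divides the totals by the SM count; the dictionary and helper names are made up for illustration, and only the totals themselves come from the text.

```python
# Full-configuration GA102 totals as quoted in the article.
GA102_FULL = {
    "gpcs": 7,
    "sms": 84,
    "cuda_cores": 10_752,
    "tensor_cores": 336,
    "rt_cores": 84,
}


def units_per_sm(config: dict, unit: str) -> int:
    """Divide a die-wide total evenly across the SMs, failing if it is uneven."""
    count, remainder = divmod(config[unit], config["sms"])
    if remainder:
        raise ValueError(f"{unit} does not divide evenly across SMs")
    return count


cuda_per_sm = units_per_sm(GA102_FULL, "cuda_cores")
tensor_per_sm = units_per_sm(GA102_FULL, "tensor_cores")
rt_per_sm = units_per_sm(GA102_FULL, "rt_cores")
sms_per_gpc = GA102_FULL["sms"] // GA102_FULL["gpcs"]

print(f"{cuda_per_sm} CUDA cores, {tensor_per_sm} Tensor Cores "
      f"and {rt_per_sm} RT Core per SM; {sms_per_gpc} SMs per GPC")
```

The even division is what "deliberately repetitive" means in practice: each SM carries the same mix of units, so disabling an SM removes a predictable slice of every resource.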
How GPU parallel work is scheduled
A GPU program breaks work into many threads. On NVIDIA GPUs, threads are grouped into warps of 32 threads. A warp is scheduled together, which is why GPUs are happiest when neighbouring threads run the same instruction on different pieces of data. NVIDIA calls this model SIMT: Single Instruction, Multiple Threads.
Imagine shading a frame in a game. Millions of pixels and vertices need related calculations. The GPU can hand similar work to huge numbers of threads and run them in parallel. The same logic applies to AI: matrix multiplication is a large grid of repeated arithmetic, so Tensor Cores can process many parts of the calculation at once.
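A toy model can make the warp idea more concrete. The sketch below is a simplification written for this article, not NVIDIA's scheduler: it assumes each side of an `if/else` costs one pass, and that a warp whose threads disagree must run both sides with the inactive threads masked off. The function names are hypothetical.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs, as described above


def warps_needed(thread_count: int) -> int:
    """Number of warps required to cover a given number of threads."""
    return math.ceil(thread_count / WARP_SIZE)


def branch_passes(branch_taken: list) -> int:
    """Passes one warp needs for an if/else in this simplified model.

    branch_taken holds one boolean per thread. If every thread agrees, only
    one side of the branch runs; if they disagree, both sides run in turn.
    """
    if not 0 < len(branch_taken) <= WARP_SIZE:
        raise ValueError("a warp holds between 1 and 32 threads")
    return len(set(branch_taken))


uniform = [True] * WARP_SIZE                        # all threads take the same path
diverged = [i % 2 == 0 for i in range(WARP_SIZE)]   # neighbours disagree

print(warps_needed(1000))       # threads are rounded up to whole warps
print(branch_passes(uniform))   # one pass
print(branch_passes(diverged))  # two passes: both paths are executed
```

This is why the article says GPUs are happiest when neighbouring threads run the same instruction: divergence inside a warp turns parallel work back into serial passes.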
Why GPU memory matters
A GPU can have enormous arithmetic throughput and still perform poorly if it cannot get data quickly enough. That is why graphics cards use fast graphics memory such as GDDR6 or GDDR6X, and why data-centre accelerators often use HBM (High Bandwidth Memory). The memory subsystem has to feed thousands of execution units, not just a handful of CPU cores.
- Capacity decides how large a scene, model or dataset can fit locally.
- Bandwidth decides how quickly the GPU can move data in and out of memory.
- Cache and shared memory decide how often data can be reused without going back to external memory.
- Interconnect matters when multiple GPUs work together. PCIe, NVLink and networking determine how expensive it is to move data between accelerators.
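The interplay between bandwidth and data reuse in the points above can be estimated with a simple roofline-style calculation: a kernel takes at least as long as the slower of its arithmetic and its memory traffic. The peak figures below are round, hypothetical numbers chosen for illustration rather than the specification of any real card, and the matrix example assumes ideal reuse (each matrix crosses the memory bus once).

```python
# Hypothetical accelerator, round numbers for illustration only.
PEAK_FLOPS_PER_S = 30e12          # assumed arithmetic peak
BANDWIDTH_BYTES_PER_S = 900e9     # assumed memory bandwidth


def estimate_kernel_time(flops: float, bytes_moved: float):
    """Return (lower-bound seconds, which resource is the limit)."""
    compute_time = flops / PEAK_FLOPS_PER_S
    memory_time = bytes_moved / BANDWIDTH_BYTES_PER_S
    bound = "memory" if memory_time > compute_time else "compute"
    return max(compute_time, memory_time), bound


# Vector add on float32: one add per element, two reads and one write.
n = 100_000_000
vector_add = estimate_kernel_time(flops=n, bytes_moved=3 * 4 * n)

# Square float32 matrix multiply with ideal reuse of on-chip data.
size = 4096
matmul = estimate_kernel_time(flops=2 * size**3, bytes_moved=3 * 4 * size**2)

print("vector add is", vector_add[1], "bound")
print("matrix multiply is", matmul[1], "bound")
```

Under these assumptions the vector add is limited by memory while the matrix multiply is limited by arithmetic, because the multiply reuses each loaded value many times. Poor cache reuse would push the matrix multiply back toward the memory limit.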
For AI infrastructure, memory is often the first hard constraint. A model that fits comfortably in memory can run efficiently; a model that spills across devices needs careful partitioning, high-speed interconnects and more engineering effort.
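A first-pass sizing check is plain arithmetic: parameter count multiplied by bytes per parameter. The sketch below counts weights only and ignores activations, caches and any optimiser state, so it is an optimistic lower bound. The function names, the 24 GiB card and the 80% headroom factor are assumptions for illustration.

```python
GIB = 2**30


def weights_gib(parameter_count: int, bytes_per_parameter: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return parameter_count * bytes_per_parameter / GIB


def weights_fit(parameter_count: int, bytes_per_parameter: int,
                memory_gib: float, headroom: float = 0.8) -> bool:
    """True if the weights fit within an assumed usable share of memory."""
    return weights_gib(parameter_count, bytes_per_parameter) <= memory_gib * headroom


# Hypothetical models stored at 2 bytes per parameter (16-bit weights).
for params in (7_000_000_000, 70_000_000_000):
    size = weights_gib(params, 2)
    verdict = "fits" if weights_fit(params, 2, memory_gib=24) else "does not fit"
    print(f"{params / 1e9:.0f}B parameters: {size:.1f} GiB of weights, {verdict} in 24 GiB")
```

When the answer is "does not fit", the options are the ones the paragraph above describes: partition the model across devices, or reduce the bytes per parameter.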
Why GPUs fit graphics and AI
GPUs became famous through graphics because graphics is naturally parallel. Each frame is built from a pipeline of geometry, rasterisation, shading, texture lookups and post-processing. Many of those steps can be applied to thousands or millions of pixels, vertices or samples independently.
AI uses the same hardware for a closely related reason: neural networks are dominated by matrix and vector operations, which are just as parallel as pixel work. Training and inference both involve large batches of multiply-add operations. Tensor Cores exist because that pattern is so common and so performance-critical.
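To see why matrix work parallelises so well, it helps to write the multiplication out by hand. The pure-Python sketch below is deliberately naive and slow; it exists only to show that every output element is an independent chain of multiply-adds, which is the pattern a GPU spreads across its execution units. The function name is illustrative.

```python
def matmul_with_count(a, b):
    """Multiply two matrices (lists of rows) and count the multiply-adds."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * cols for _ in range(rows)]
    multiply_adds = 0
    for i in range(rows):
        for j in range(cols):
            # out[i][j] depends only on row i of a and column j of b,
            # so every (i, j) pair could be computed by a different thread.
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
                multiply_adds += 1
            out[i][j] = acc
    return out, multiply_adds


product, ops = matmul_with_count([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, ops)
```

The multiply-add count is rows × cols × inner, so it grows with the cube of the matrix size for square matrices, while the number of independent outputs grows with the square. Large layers therefore offer both plenty of work and plenty of parallelism.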
| Workload | Why a GPU helps |
|---|---|
| Gaming and rendering | Many pixels, vertices, rays and samples can be processed in parallel. |
| AI training | Large matrix operations can be split across Tensor Cores and multiple GPUs. |
| AI inference | Batched requests and matrix-heavy model layers benefit from high throughput. |
| Scientific computing | Simulations often apply the same numerical method across many grid cells or particles. |
| Video processing | Encoding, decoding and image operations map well to repeated parallel steps. |
| Cryptographic hashing | Large numbers of independent hashes can be attempted in parallel, where the algorithm permits it. |
Binning and product tiers
Not every manufactured die is perfect. Tiny defects can affect one section of a chip while the rest still works. GPU vendors design around this by making the chip highly repetitive and then disabling faulty or surplus blocks. The result is binning: tested chips are sorted into product tiers based on how many blocks work, what clocks they can sustain and how much power they require.
That is why several cards can come from the same die family but expose different core counts, memory configurations or power limits. In the GA102 family, for example, the full die has 84 SMs, while some products expose fewer active SMs. Binning improves manufacturing yield and lets one chip design serve multiple price and performance bands.
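The sorting step can be sketched as a simple lookup. The tier names and SM thresholds below are hypothetical, invented for this example rather than taken from any vendor's product stack; only the 84-SM full-die figure comes from the text. Real binning also weighs clocks and power, which this sketch leaves out.

```python
from dataclasses import dataclass

FULL_DIE_SMS = 84  # full GA102 configuration, as quoted above


@dataclass(frozen=True)
class Tier:
    name: str
    min_working_sms: int


# Hypothetical product tiers, ordered from most to least demanding.
TIERS = (
    Tier("tier-A", 84),
    Tier("tier-B", 80),
    Tier("tier-C", 68),
)


def bin_die(working_sms: int):
    """Return the best tier a die qualifies for, or None if it is scrapped."""
    if not 0 <= working_sms <= FULL_DIE_SMS:
        raise ValueError("working_sms must be between 0 and 84")
    for tier in TIERS:
        if working_sms >= tier.min_working_sms:
            # Surplus working SMs above the threshold would be disabled.
            return tier.name
    return None


for tested in (84, 82, 70, 60):
    print(tested, "working SMs ->", bin_die(tested))
```

A die with 82 working SMs misses the top tier but is still sold in the next one down, which is how binning turns imperfect silicon into usable products.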
Common misconceptions
- More CUDA cores do not automatically mean better real-world performance. Memory capacity, bandwidth, clocks, software support, cooling and workload shape all matter.
- A GPU is not a replacement CPU. It is usually an accelerator controlled by a host CPU and operating system.
- Gaming cards and data-centre accelerators are not the same product with a different label. Reliability, firmware, cooling form factor, memory type, ECC support, virtualisation and support contracts can differ materially.
- AI performance is not just about Tensor Cores. Input pipelines, storage, networking, batching, model architecture and orchestration can all become the bottleneck.
- Ray tracing is not just "more shaders". Dedicated RT hardware exists because realistic ray traversal and intersection work has a different shape from ordinary raster shading.
Frequently asked questions
What is a GPU in simple terms?
A GPU is a processor designed to run many similar calculations at the same time. It was originally built for graphics, where millions of pixels and vertices can be processed in parallel, and is now widely used for AI, rendering, simulation and other parallel workloads.
How is a GPU different from a CPU?
A CPU has fewer, more flexible cores that are excellent at control flow, operating systems and low-latency decision-making. A GPU has many simpler execution units that deliver high throughput when a job can be split into thousands of similar parallel tasks.
What are CUDA cores?
CUDA cores are NVIDIA's term for the arithmetic execution units inside its GPUs. They handle general shader and compute operations. They are important, but real performance also depends on memory bandwidth, clocks, software support, cooling and the workload itself.
Why are GPUs useful for AI?
AI models rely heavily on matrix and vector operations. Those operations can be split into many repeated calculations, which is exactly what GPUs are built to do. Tensor Cores further accelerate matrix multiply-accumulate operations used in training and inference.
What is the difference between a GPU and a graphics card?
The GPU is the processor chip. The graphics card is the full board that includes the GPU, graphics memory, power delivery, cooling, PCB and connectors. In casual speech people often say "GPU" to mean the whole card, but technically the two are different.