What counts as AI compute
AI workloads run on CPUs, GPUs and specialised accelerators. CPUs still handle orchestration, data preparation and many smaller models, but large language models, vision models and recommendation systems usually depend on GPUs or AI accelerators because they can process huge matrix operations in parallel.
The useful unit is increasingly the cluster, not the individual chip. A modern AI platform combines accelerators, high-bandwidth memory, fast scale-up interconnects inside a node or rack, scale-out networking between nodes, high-throughput storage, scheduling software and monitoring.
Training, fine-tuning and inference
| Workload | Main constraint | Typical buyer concern |
|---|---|---|
| Pre-training | Massive GPU clusters, fast interconnect and sustained storage throughput. | Rare outside hyperscalers, labs and national AI programmes. |
| Fine-tuning | GPU memory, data quality, experiment tracking and repeatability. | Right-size clusters and avoid idle reserved capacity. |
| Inference | Latency, throughput, cost per token or request, availability and scaling. | Optimise model size, batching, caching and autoscaling. |
| RAG and agents | Vector search, orchestration, tool calls and long-context cost. | End-to-end latency and governance, not only GPU speed. |
The bottlenecks beyond GPUs
- Memory. HBM capacity, not raw compute alone, often limits the model size and batch size an accelerator can run.
- Interconnect. Multi-GPU and multi-node workloads need fast communication to avoid idle accelerators.
- Storage. Training pipelines can starve GPUs if datasets cannot be read fast enough.
- Networking. East-west cluster traffic and north-south user traffic have different design needs.
- Power and cooling. Dense AI racks may require liquid cooling, higher rack power and facility upgrades.
- Software. Drivers, Kubernetes, schedulers, observability and MLOps tooling determine how much of the installed capacity is actually utilised.
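The memory point can be made concrete with a rough sizing sketch. The example below estimates how much HBM the weights and the key/value cache of a transformer model need, and how many concurrent sequences fit on one accelerator. It is a simplification under stated assumptions: the model shape (7 billion parameters, 32 layers, 32 KV heads, head dimension 128), the 80 GiB of HBM and the 10 percent reserve are hypothetical values chosen for illustration, and activations, framework overhead and fragmentation are folded into the reserve rather than modelled.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Memory taken by the model weights, in GiB."""
    return params_billion * 1e9 * bytes_per_param / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_value: int = 2) -> float:
    """Key/value cache for `batch` sequences of `seq_len` tokens, in GiB."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * layers * kv_heads * head_dim
                   * seq_len * batch * bytes_per_value)
    return total_bytes / GIB


def max_batch(hbm_gib: float, params_billion: float, bytes_per_param: float,
              layers: int, kv_heads: int, head_dim: int, seq_len: int,
              reserve_fraction: float = 0.1) -> int:
    """Largest batch whose KV cache fits next to the weights.

    `reserve_fraction` is an assumed safety margin for activations and
    runtime overhead, not a measured value.
    """
    usable = hbm_gib * (1 - reserve_fraction)
    free = usable - weights_gib(params_billion, bytes_per_param)
    per_sequence = kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch=1)
    return max(0, int(free // per_sequence))


if __name__ == "__main__":
    # Hypothetical 7B-parameter model served in 16-bit precision.
    print(f"weights: {weights_gib(7, 2):.1f} GiB")
    print(f"KV cache, batch 8 at 4096 tokens: "
          f"{kv_cache_gib(32, 32, 128, 4096, 8):.1f} GiB")
    print(f"max batch on 80 GiB: {max_batch(80, 7, 2, 32, 32, 128, 4096)}")
```

With these illustrative numbers the weights take about 13 GiB and every 4,096-token sequence adds 2 GiB of cache, so memory caps the batch size well before compute does. Real deployments should replace the estimate with measurements from the serving stack they actually use.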
Deployment options
Most enterprises choose between public cloud GPUs, managed AI platforms, colocation with owned hardware, hosted private GPU clusters or specialist GPU clouds. Public cloud is fast to start and useful for bursty demand. Owned or hosted clusters can be cheaper at sustained utilisation but require capacity planning, operations and lifecycle management.
The break-even point depends on utilisation. A reserved cluster running at 20 percent utilisation is expensive even if the headline hourly rate looks attractive, because each hour of useful work then costs five times that rate. A small cloud deployment can also become expensive if inference volume grows and nobody optimises models, prompts or batching.
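The utilisation argument can be put into numbers with a short sketch. It assumes a simple model in which reserved capacity is billed for every hour while on-demand capacity is billed only for hours actually used; the rates are hypothetical, and the function names are introduced here for illustration only. Staffing, power, networking and data egress costs are left out.

```python
def effective_hourly_cost(reserved_rate: float, utilisation: float) -> float:
    """Cost per utilised accelerator-hour on capacity billed around the clock."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    return reserved_rate / utilisation


def break_even_utilisation(reserved_rate: float, on_demand_rate: float) -> float:
    """Utilisation above which reserved capacity is cheaper than on-demand."""
    if reserved_rate <= 0 or on_demand_rate <= 0:
        raise ValueError("rates must be positive")
    return reserved_rate / on_demand_rate


if __name__ == "__main__":
    reserved = 2.00   # hypothetical reserved price per accelerator-hour
    on_demand = 4.00  # hypothetical on-demand price per accelerator-hour

    for utilisation in (0.2, 0.5, 0.8):
        cost = effective_hourly_cost(reserved, utilisation)
        print(f"{utilisation:.0%} utilised: {cost:.2f} per useful hour")

    threshold = break_even_utilisation(reserved, on_demand)
    print(f"reserved wins above {threshold:.0%} utilisation")
```

Under these assumed prices the reserved cluster only beats on-demand once it is busy more than half the time; at 20 percent utilisation each useful hour costs more than twice the on-demand rate.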
Benchmarks and sizing
Benchmarks are useful only when they match your workload. MLPerf provides public benchmark suites for training and inference, but buyers should still run proof-of-concept tests using their own model family, sequence lengths, batch sizes, precision settings and latency targets.
For inference, measure cost per useful request, p95 and p99 latency, throughput, failure rate and scaling behaviour. For training or fine-tuning, measure time to train, GPU utilisation, data pipeline throughput and checkpoint/restart behaviour.
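As a minimal sketch of the inference measurements, the example below summarises a load-test window from per-request records. The `Request` record, the `summarise` function and the definition of a "useful" request (one that completed successfully) are assumptions made for illustration; a proof of concept might also count responses that miss a latency target as not useful. Percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request completed successfully


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarise(requests: list[Request], window_seconds: float,
              window_cost: float) -> dict[str, float]:
    """Latency, throughput, failure rate and cost for one test window."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    if not ok_latencies:
        raise ValueError("no successful requests in this window")
    succeeded = len(ok_latencies)
    failed = len(requests) - succeeded
    return {
        "p95_ms": percentile(ok_latencies, 95),
        "p99_ms": percentile(ok_latencies, 99),
        "throughput_rps": succeeded / window_seconds,
        "failure_rate": failed / len(requests),
        # Failed requests still consume capacity, so the whole window's
        # cost is spread over the successful ones only.
        "cost_per_useful_request": window_cost / succeeded,
    }


if __name__ == "__main__":
    # Synthetic data: 100 successes with 1..100 ms latency and 25 failures.
    sample = [Request(float(i), True) for i in range(1, 101)]
    sample += [Request(0.0, False) for _ in range(25)]
    for name, value in summarise(sample, window_seconds=50, window_cost=5.0).items():
        print(f"{name}: {value}")
```

Running the same summary at several load levels gives a first view of scaling behaviour: watch where p99 latency and the failure rate start to climb faster than throughput.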
AI compute buyer checklist
Sources and further reading
- NVIDIA Enterprise Reference Architectures
- NVIDIA GB200 NVL72 overview
- Open Compute Project AI Computing Continuum
- MLPerf Training benchmark paper
- MLPerf Inference benchmark paper
- TechDirectory: Liquid cooling racks explained