NVIDIA Blackwell sets new records
NVIDIA has once again pushed the boundaries of AI performance with its latest results in the MLPerf Inference V5.0 benchmarks.
The company's new Blackwell platform set multiple records, marking NVIDIA’s first MLPerf submission using the GB200 NVL72 system – a rack-scale solution designed for AI inference.
The rise of AI factories
As AI continues to advance, traditional data centres are evolving into AI factories – specialised infrastructures that manufacture intelligence at scale. These AI factories are designed to process vast amounts of data efficiently, delivering accurate responses to queries while minimising costs and maximising accessibility.
The increasing complexity of AI models, now reaching billions or even trillions of parameters, demands substantial computational power. This growth presents challenges in maintaining high inference throughput while keeping the cost per token low. To address this, rapid innovation is required across silicon, network systems, and software.
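To make the throughput-versus-cost trade-off concrete, here is a minimal back-of-envelope sketch of cost per token. The hourly price and throughput figures are invented for illustration; they are not MLPerf results.

```python
# Back-of-envelope cost per token for an AI factory.
# The hourly cost and throughput figures below are illustrative
# assumptions, not benchmark results.

def cost_per_million_tokens(hourly_system_cost: float,
                            tokens_per_second: float) -> float:
    """Cost per one million output tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3_600
    return hourly_system_cost / tokens_per_hour * 1_000_000

# Hypothetical 8-GPU server rented at $98/hour serving 12,000 tokens/s:
print(f"${cost_per_million_tokens(98.0, 12_000):.2f} per 1M tokens")  # $2.27
# Doubling throughput (e.g. via software optimisation) halves the cost:
print(f"${cost_per_million_tokens(98.0, 24_000):.2f} per 1M tokens")  # $1.13
```

The point of the sketch: at a fixed hardware cost, every gain in tokens per second translates directly into a lower cost per token, which is why the software optimisations discussed later matter as much as new silicon.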
MLPerf Inference V5.0: new challenges and achievements
MLPerf Inference, a widely recognised industry benchmark, introduced new tests, including the Llama 3.1 405B model – one of the largest and most computationally demanding open-weight models. The Llama 2 70B Interactive benchmark was also introduced, featuring stricter latency requirements to better reflect real-world deployment conditions.
NVIDIA's Blackwell and Hopper platforms demonstrated exceptional performance across these benchmarks. The GB200 NVL72 system, which connects 72 NVIDIA Blackwell GPUs to function as a single large GPU, achieved up to 30 times higher throughput on the Llama 3.1 405B benchmark than NVIDIA's H200 NVL8 submission. That leap was driven by more than triple the throughput per GPU and a ninefold increase in NVIDIA NVLink interconnect bandwidth.
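The arithmetic behind that headline figure is worth spelling out: the GB200 NVL72 submission has nine times as many GPUs as the eight-GPU H200 NVL8 system, and each GPU delivers more than triple the throughput. A minimal sketch of that decomposition, where the 3.3x per-GPU factor is a nominal assumption rather than a measured value:

```python
# Rough decomposition of the ~30x Llama 3.1 405B throughput gain.
# The 3.3x per-GPU factor is a nominal assumption consistent with
# "more than triple the performance per GPU"; it is not a measured value.

h200_nvl8_gpus = 8       # GPUs in the H200 NVL8 submission
gb200_nvl72_gpus = 72    # GPUs in the GB200 NVL72 submission
per_gpu_speedup = 3.3    # assumed per-GPU throughput factor

scale_factor = gb200_nvl72_gpus / h200_nvl8_gpus   # 9x more GPUs
system_speedup = per_gpu_speedup * scale_factor    # ~30x overall
print(f"{scale_factor:.0f}x GPUs x {per_gpu_speedup}x per GPU "
      f"= {system_speedup:.0f}x system throughput")
```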
Real-world AI inference performance
AI inference responsiveness is often measured by two key latency metrics, illustrated in the code sketch after this list:
- Time to First Token (TTFT): how quickly a response begins after a query is submitted
- Time Per Output Token (TPOT): the average time taken to generate each subsequent token
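As a concrete illustration of how these metrics are derived, the sketch below computes TTFT and TPOT from the timestamps of a streamed response. The `fake_stream` generator and its 40 ms delay are invented stand-ins for a real streaming inference API.

```python
import time

def measure_latency(stream_tokens, prompt):
    """Derive TTFT and TPOT from timestamps of a streamed response."""
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream_tokens(prompt)]

    ttft = token_times[0] - start  # Time to First Token
    # Time Per Output Token: mean gap between consecutive tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

def fake_stream(prompt, n_tokens=16, delay=0.04):
    """Hypothetical stand-in for a streaming inference API."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulate generation time per token
        yield f"token{i}"

ttft, tpot = measure_latency(fake_stream, "What is an AI factory?")
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.0f} ms")
```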
The new Llama 2 70B Interactive benchmark set stricter constraints, requiring a TPOT five times shorter and a TTFT 4.4 times lower than the original Llama 2 70B test. On this benchmark, an NVIDIA DGX B200 system with eight Blackwell GPUs delivered triple the performance of an equivalent eight-GPU H200 system.
Ongoing optimisation of Hopper AI factories
NVIDIA's Hopper architecture, introduced in 2022, continues to power AI inference and training workloads. Ongoing software optimisations have significantly increased throughput for Hopper-based AI factories. On the Llama 2 70B benchmark, H100 GPU throughput has improved 1.5 times since the benchmark's introduction in MLPerf Inference V4.0, while the H200 GPU has extended that gain to 1.6 times.
Hopper remains highly versatile, running all benchmarks in this MLPerf round, including Llama 3.1 405B and new graph neural network tests, ensuring it can meet the demands of increasingly complex AI models.
Expanding the NVIDIA AI ecosystem
Fifteen technology partners submitted results in this MLPerf round using NVIDIA hardware, including ASUS, Cisco, CoreWeave, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Lambda, Lenovo, Oracle Cloud Infrastructure, Quanta Cloud Technology, Supermicro, Sustainable Metal Cloud, and VMware. The widespread adoption of NVIDIA’s AI platforms highlights their availability across cloud service providers and enterprise server manufacturers worldwide.
The continuous evolution of MLPerf Inference benchmarks, overseen by MLCommons, ensures IT decision-makers have access to peer-reviewed performance data. This helps organisations select the most effective AI infrastructure to meet their needs as AI applications continue to scale.