How NVIDIA Blackwell is levelling up real-time LLMs
NVIDIA has announced how it aims to improve cutting-edge real-time LLMs with its newest NVIDIA Blackwell platform developments for data centres.
NVIDIA has recently been working to show its Blackwell offering fully utilised within a data centre, as well as the sort of numbers it can achieve at optimum levels.
NVIDIA’s data centre racks, utilising the newest NVIDIA Blackwell technology
Blackwell as a platform
To achieve optimum results, NVIDIA stresses, Blackwell must be treated not as a product that starts and ends with the GPU, but as a platform and the sum of many parts.
The NVIDIA chip family that enables the Blackwell rack system
“The GPU is just the beginning,” explains Dave Salvator, Director of Accelerated Computing Products at NVIDIA. “As you can see across the top row, these are all of the chips that go into a Blackwell system. These are what makes it able to do what it does, taking us into that next era of generative AI.” It is all of these components that must come together to make each individual tray and the fully-fledged rack system. Among these chips is the integral NVLink Switch chip. “This really gives you a sense of the amount of system engineering that goes into making Blackwell happen,” he continued.
The need for multi-GPU inference
There is a continually growing demand for more powerful AI models to be accessible in real time; the question is how to achieve this. As Salvator explains: “As we’ve seen models grow in size over time and the fact that most generative AI applications are expected to run in real-time, the requirement for inference has gone up dramatically over the last several years.
“And so, one of the things that real-time large language model inferencing needs is multiple GPUs.”
To enable cutting-edge LLMs to operate in real time, two requirements must be met: more compute power and lower latency. NVIDIA has found that the best way to meet these requirements is a multi-GPU system. Even if a large model could fit on a single GPU, multiple GPUs working together can achieve far lower latency. Multi-GPU inference works by splitting the calculations across several GPUs, reducing the load on any single one. This is critical for applications that need low latency whilst maintaining high throughput.
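As a rough illustration of that splitting (a simplified sketch using NumPy arrays in place of real GPUs, not NVIDIA's implementation): each device holds only a shard of a layer's weights, computes its own partial result, and the partial results are then gathered together, which is precisely the step that generates GPU-to-GPU traffic.

```python
import numpy as np

# Simplified sketch of tensor parallelism: a layer's weight matrix is split
# column-wise across NUM_GPUS "devices" (a plain Python list stands in for the
# GPUs here), so each device performs only a fraction of the matrix multiply.
NUM_GPUS = 4
HIDDEN = 1024

x = np.random.randn(1, HIDDEN)            # activations for one token
W = np.random.randn(HIDDEN, 4 * HIDDEN)   # full feed-forward weight matrix

# Each "GPU" holds one shard of the weights and computes one slice of the output.
shards = np.split(W, NUM_GPUS, axis=1)
partials = [x @ shard for shard in shards]

# The per-device slices are then gathered into the full result; in a real system
# this gather step is the GPU-to-GPU traffic carried by NVLink.
y = np.concatenate(partials, axis=1)

# Sanity check: the sharded computation matches a single-device multiply.
assert np.allclose(y, x @ W)
```

In production, frameworks perform this sharding with collective operations (all-gather, all-reduce) running over NVLink rather than an in-process concatenate, but the shape of the problem is the same: partial results must be exchanged between every GPU on every layer.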
For example, LLMs operate on a ‘token’ system to deliver their results as words on a screen. Tokens aren’t a 1:1 match with words either, with the ratio varying anywhere from 1.2 to 1.5 tokens per word. User experiences demand different outcomes, but generally people want the full result immediately, so that they can skim-read it or copy it for use elsewhere. This increases demand on the system, especially when using a newer model: the largest Llama 3.1 model has around 405 billion parameters. As a result, real-time expectations have risen from around 5 tokens/sec/user to almost 50 tokens/sec/user.
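To put those throughput figures into words on a screen, here is a back-of-the-envelope conversion based only on the ratios quoted above (the 1.35 tokens-per-word figure is simply the midpoint of the 1.2 to 1.5 range):

```python
# Rough conversion of the quoted token rates into words per second,
# using the middle of the 1.2-1.5 tokens-per-word range.
tokens_per_word = 1.35

for label, tokens_per_sec in [("older expectation", 5), ("real-time expectation", 50)]:
    words_per_sec = tokens_per_sec / tokens_per_word
    print(f"{label}: {tokens_per_sec} tokens/sec/user ≈ {words_per_sec:.0f} words/sec")

# older expectation: 5 tokens/sec/user ≈ 4 words/sec   (roughly reading pace)
# real-time expectation: 50 tokens/sec/user ≈ 37 words/sec  (the full answer appears almost at once)
```

At ten times the per-user rate, a response long enough to skim or copy lands in a couple of seconds rather than being drip-fed at reading speed.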
Whilst multi-GPU processing eases this demanding workload, it also creates a new problem to solve: each GPU in a multi-GPU system must send the results of its own calculations to every other GPU across the system, which demands an incredible amount of GPU-to-GPU communication bandwidth. To meet this challenge, NVIDIA is utilising its new-generation NVLink and NVSwitch technologies.
NVIDIA’s Hopper-generation NVSwitch provides 900GB/s of GPU-to-GPU bandwidth between all GPUs in a system of up to eight GPUs. This alone provides an improvement of almost 1.5x when analysing results from real-time Llama 3.1 70B inference on H200 Tensor Core GPUs.
NVIDIA’s Hopper NVSwitch tray (Left) & a comparison of Llama 3.1 performance with and without NVSwitch (Right)
However, NVIDIA understands that the AI industry isn’t a static one, and that as models continue to grow, so must its technology. To meet the inevitable future of trillion-parameter models, NVIDIA has also launched the Blackwell generation of NVSwitch. This is NVIDIA’s cutting-edge offering for multi-GPU systems, providing GPU-to-GPU bandwidth of up to 1.8TB/s across a staggering 72 GPUs in GB200 NVL72. Two of these chips are used per NVLink Switch tray, delivering 14.4TB/s of total bandwidth per tray.
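As a quick sanity check on those numbers (a back-of-the-envelope sketch derived only from the figures quoted above, not an official NVIDIA breakdown):

```python
# Back-of-the-envelope check of the GB200 NVL72 bandwidth figures quoted above.
gpus = 72
per_gpu_bw = 1.8        # TB/s of GPU-to-GPU bandwidth per GPU
per_tray_bw = 14.4      # TB/s per NVLink Switch tray (two NVSwitch chips)

aggregate_bw = gpus * per_gpu_bw            # 129.6 TB/s across the whole rack
per_chip_bw = per_tray_bw / 2               # 7.2 TB/s per NVSwitch chip
trays_implied = aggregate_bw / per_tray_bw  # 9 switch trays to carry it all

print(f"{aggregate_bw:.1f} TB/s aggregate, {per_chip_bw:.1f} TB/s per chip, {trays_implied:.0f} trays")
# -> 129.6 TB/s aggregate, 7.2 TB/s per chip, 9 trays
```

The implied aggregate of roughly 130TB/s across nine switch trays is consistent with the headline NVLink bandwidth NVIDIA has quoted elsewhere for GB200 NVL72.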
An NVLink Switch tray, featuring the Blackwell generation of NVSwitch
Liquid cooling in data centres
NVIDIA has also been experimenting with liquid cooling in these demanding scenarios to further amplify performance. As generative AI and LLMs continue to drive demand for accelerated computing, new solutions are needed to keep pace, and this is where liquid cooling comes in.
An illustration of a liquid-cooled NVIDIA Blackwell data centre
NVIDIA has found that liquid cooling can enable Blackwell technologies to achieve considerable performance enhancements both in the training and inference stages. “Liquid cooling has a lot of advantages over traditional air cooling, and there are a number of different approaches to it,” says Salvator.
Presently, NVIDIA is leaning into what it dubs ‘warm-water direct-to-chip cooling’, which offers improved cooling efficiency, lower operating costs, extended IT server life, and the possibility of recycling heat energy back into the system. NVIDIA believes that this approach, or ones like it, can result in an overall reduction of 28% in data centre facility power.
This topic will be further explored in NVIDIA’s Hot Chips talks.