NVIDIA announces Blackwell & Hopper MLPerf results

29th August 2024
Harry Fowle

NVIDIA has announced its results from the recent MLPerf Inference v4.1 round for its new Blackwell and established Hopper architectures, with strong showings across the board.

Large language model (LLM) inference is a challenge that spans the entire technology stack. Achieving high throughput at low latency requires powerful GPUs, high-bandwidth connections between them, efficient acceleration libraries, and a well-optimised inference engine.

The latest version of the MLPerf Inference benchmarks, version 4.1, has been released by the MLCommons consortium. This widely recognised benchmark suite includes various popular AI models, addressing a range of use cases from LLMs and generative AI to recommendation systems and computer vision. The benchmarks are regularly updated to maintain their relevance in the rapidly evolving market.

In the most recent round of submissions, NVIDIA achieved notable results, showcasing advancements across its technology stack. Key highlights include:

  • The first submission featuring the NVIDIA Blackwell architecture, which delivered up to four times the performance on Llama 2 70B compared to the NVIDIA H100 Tensor Core GPU.
  • Submissions for the NVIDIA H200 Tensor Core GPU covered every data centre workload, offering up to 1.5 times the performance of previous H100 submissions.
  • Software enhancements alone delivered up to 27% more performance on the H200 compared with the previous round.
  • The debut of Llama 2 70B submissions using the NVIDIA Triton Inference Server, achieving comparable performance to submissions using NVIDIA TensorRT-LLM.
  • A performance improvement of up to 6.2 times on the GPT-J benchmark in the Edge category, using the NVIDIA Jetson AGX Orin platform, compared to the previous round.

NVIDIA Blackwell’s debut

Unveiled at NVIDIA GTC 2024, the NVIDIA Blackwell architecture represents a new category of AI superchip. Built with 208 billion transistors using the TSMC 4NP process specifically optimised for NVIDIA, it stands as the largest GPU ever constructed. The Blackwell architecture incorporates a second-generation Transformer Engine, which leverages the advanced Blackwell Tensor Core technology along with TensorRT-LLM innovations to deliver rapid and precise FP4 AI inference.

For this MLPerf Inference round, NVIDIA made its initial submissions featuring the Blackwell architecture. On the Llama 2 70B large language model benchmark, Blackwell achieved up to four times the tokens per second per GPU compared to the H100 GPU.

Table 1. Per-GPU performance increases compared to NVIDIA Hopper on the MLPerf Llama 2 70B benchmark. H100 per-GPU throughput obtained by dividing submitted eight-GPU results by eight
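
The per-GPU normalisation described in the Table 1 note is simple enough to show directly. The throughput figures below are placeholders for illustration, not the actual submitted results:

```python
# Placeholder figures for illustration; not the actual MLPerf submission numbers.
h100_system_tokens_per_s = 24_000.0   # hypothetical eight-GPU H100 result
h100_gpus = 8

# MLPerf reports system-level throughput; Table 1 derives per-GPU numbers
# by dividing the eight-GPU H100 result by eight.
h100_per_gpu = h100_system_tokens_per_s / h100_gpus

# Blackwell's quoted per-GPU gain on Llama 2 70B is up to 4x.
blackwell_per_gpu = 4.0 * h100_per_gpu

print(f"H100 per GPU:      {h100_per_gpu:,.0f} tokens/s")
print(f"Blackwell per GPU: {blackwell_per_gpu:,.0f} tokens/s")
```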

This performance was largely due to Blackwell's FP4 Transformer Engine. The submission was made in the Closed division, meaning the results were achieved without altering the model while still meeting the benchmark's stringent accuracy criteria. FP4 quantisation was performed with the NVIDIA TensorRT Model Optimizer library, which integrates cutting-edge model optimisation techniques and eliminates the need for model re-training.
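
To make the FP4 idea concrete, here is a minimal NumPy sketch of block-scaled FP4 (E2M1) weight quantisation. The block size and max-based scaling are illustrative assumptions; this is not the Model Optimizer's actual algorithm or API:

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 (E2M1); with a sign
# bit this gives the format's 16 code points.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_fp4_quantize(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate FP4 weight quantisation with a per-block scale.

    Conceptual sketch only; requires w.size to be a multiple of `block`.
    """
    shape = w.shape
    w = w.reshape(-1, block)
    # Scale each block so its largest magnitude lands on the top FP4 value.
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0
    scaled = w / scale
    # Snap every scaled weight to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
print(f"mean abs quantisation error: {np.abs(w - fake_fp4_quantize(w)).mean():.4f}")
```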

NVIDIA H200 Tensor Core GPU delivers top performance

The NVIDIA H200 GPU enhances the NVIDIA Hopper architecture by integrating HBM3e, the fastest AI memory available in the industry. This enhancement increases memory capacity by 1.8 times and memory bandwidth by 1.4 times compared to the H100, providing significant advantages for memory-intensive applications.
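
Bandwidth matters so much for LLM inference because, at small batch sizes, token generation is memory-bound: every generated token streams the full weight set from memory once. A rough upper-bound estimate, assuming FP8 (one byte per weight) serving of Llama 2 70B and the public bandwidth specs:

```python
# Back-of-the-envelope, memory-bound decode ceiling.
def max_decode_tokens_per_s(bandwidth_tb_s: float, params_billions: float,
                            bytes_per_param: float) -> float:
    """Each decoded token reads all weights once, so bandwidth sets the cap."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# H100 HBM3 is ~3.35 TB/s; the 1.4x uplift above puts H200 HBM3e near 4.8 TB/s.
for name, bw in (("H100", 3.35), ("H200", 4.8)):
    print(f"{name}: <= {max_decode_tokens_per_s(bw, 70, 1.0):.0f} tokens/s per stream")
```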

In this benchmark round, NVIDIA submitted results for every workload using eight H200 GPUs, participating in all available categories.

Table 2. NVIDIA MLPerf Inference v4.1 data centre results using H200 GPUs. Llama 2 70B results are based on the H200 configured at 1,000 W; all other results use the H200 at 700 W

Jetson AGX Orin sees GenAI leap

The Jetson AGX Orin platform combines high AI computing power, expansive unified memory, and a robust software suite designed for generative AI applications at the edge. Through extensive software optimisations, the NVIDIA Jetson AGX Orin 64 GB has achieved significant improvements for edge generative AI models, delivering up to 6.2 times higher throughput and 2.4 times lower latency on the GPT-J 6B-parameter large language model benchmark. These models can transform sensor data, such as images and video, into actionable, real-time insights with enhanced contextual understanding.

Backed by the comprehensive NVIDIA software stack, Jetson AGX Orin is well suited to running transformer models such as GPT-J, vision transformers, and Stable Diffusion at the edge. Developers can draw on additional platform resources, such as the Jetson Generative AI Lab and Jetson Platform Services, to build and deploy innovative solutions.

Table 3. GPT-J LLM performance in MLPerf Inference Edge (v4.0 and v4.1) on Jetson AGX Orin

This performance gain comes from several software optimisations in TensorRT-LLM, including in-flight batching and INT4 Activation-aware Weight Quantisation (AWQ). AWQ keeps the top 1% of “salient weights” at higher FP16 precision while quantising the remaining weights to four-bit integer (INT4) precision. This significantly reduces the memory footprint, allowing larger batch sizes and thereby boosting inference throughput.
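
A toy illustration of the mixed-precision idea described above: score input channels by average activation magnitude, keep the top 1% at full precision, and round the rest to INT4. The salience metric and per-row scaling here are simplifications, not the production AWQ recipe:

```python
import numpy as np

def awq_style_quantize(w: np.ndarray, act_scale: np.ndarray,
                       keep_frac: float = 0.01) -> np.ndarray:
    """Toy mixed-precision quantiser in the spirit of the AWQ description.

    w         : (out_features, in_features) weight matrix
    act_scale : (in_features,) average activation magnitude per input channel
    keep_frac : fraction of channels kept at full precision (the salient ones)
    """
    # Salience: channels whose activations are large matter most to the output.
    n_keep = max(1, int(keep_frac * w.shape[1]))
    salient = np.argsort(act_scale)[-n_keep:]

    # Symmetric per-output-row INT4 quantisation for the remaining weights.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # INT4 range: -8..7
    deq = np.clip(np.round(w / scale), -8, 7) * scale

    # Restore the salient channels at full precision.
    deq[:, salient] = w[:, salient]
    return deq

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 512)).astype(np.float32)
act = rng.random(512); act[:5] *= 50     # pretend channels 0-4 are "loud"
wq = awq_style_quantize(w, act)
print(f"mean abs error: {np.abs(w - wq).mean():.4f}")
```

The real AWQ algorithm goes further, folding per-channel scales into the weights so that no mixed-precision kernel is needed at runtime; the sketch above only mirrors the simplified description in the paragraph.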

NVIDIA also submitted results for the demanding Llama 2 70B model running on Jetson AGX Orin in the Open division, highlighting the potential of advanced model optimisation techniques. This submission used the same depth- and width-pruned 16B model as in the H200 submission. The INT4 AWQ technique employed in the Closed division GPT-J submission for Jetson AGX Orin was also applied here. Together, parameter pruning and INT4 quantisation reduced the memory footprint of the model weights to approximately 8 GB.
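
The ~8 GB figure follows directly from the parameter count and the precision, as a quick check confirms:

```python
# ~16B parameters at 4 bits each: two parameters per byte.
params = 16e9
weight_bytes = params * 4 / 8            # 4 bits per weight
print(f"~{weight_bytes / 1e9:.0f} GB of weights")   # ~8 GB
```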
