Artificial Intelligence

NVIDIA and xAI unveil Colossus, world’s largest AI supercomputer

30th October 2024
Paige West
0

In a landmark development, NVIDIA and xAI revealed that the Colossus supercomputer cluster, composed of 100,000 NVIDIA Hopper GPUs, has been deployed in Memphis, Tennessee.

Built on NVIDIA’s Spectrum-X Ethernet networking platform, the supercomputer represented a significant leap in scalable AI, tailored for the demands of multi-tenant, hyperscale AI operations and standards-based Ethernet.

Designed to support xAI’s Grok language models, the Colossus supercomputer powers AI-driven chatbots available to X Premium subscribers. The system achieved this unprecedented scale through Spectrum-X’s advanced capabilities, particularly Remote Direct Memory Access (RDMA) support, enabling high-performance networking without the flow collisions that impact conventional Ethernet systems. Colossus marked the highest performance rate for any AI supercomputer, sustaining 95% data throughput across its network fabric with zero latency degradation or packet loss.

In an impressive feat of engineering and collaboration, NVIDIA and xAI constructed the Colossus facility in just 122 days – a fraction of the time typically required for such infrastructure, which often takes years to complete. Only 19 days passed from the installation of the initial racks to the initiation of the first training operations.

“AI is becoming mission-critical and requires increased performance, security, scalability, and cost-efficiency,” stated Gilad Shainer, NVIDIA’s Senior Vice President of Networking. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis, and execution of AI workloads, accelerating the development, deployment, and time-to-market of AI solutions.”

NVIDIA’s Spectrum-X networking architecture is anchored by its Spectrum SN5600 Ethernet switch, capable of port speeds of up to 800Gb/s, and built on the Spectrum-4 ASIC. This switch, coupled with NVIDIA BlueField-3 SuperNICs, allowed xAI to achieve the colossal bandwidth required for the system while minimising latency. The Colossus network platform enabled adaptive routing with NVIDIA Direct Data Placement technology, congestion control, and enhanced visibility, making it uniquely suited to generative AI Clouds and expansive enterprise environments.

Elon Musk remarked on X: “Colossus is the most powerful training system in the world. Nice work by the xAI team, NVIDIA, and our many partners and suppliers.”

An xAI spokesperson added: “With NVIDIA’s Hopper GPUs and Spectrum-X, we’re pioneering large-scale AI model training with unmatched acceleration and efficiency.”

As Colossus now begins an ambitious expansion to 200,000 NVIDIA Hopper GPUs, the system demonstrates the power of accelerated AI ‘factories’ grounded in Ethernet technology, setting new standards for high-performance, scalable AI infrastructure.

Featured products

Upcoming Events

View all events
Newsletter
Latest global electronics news
© Copyright 2024 Electronic Specifier