xAI’s Colossus supercomputer cluster uses 100,000 Nvidia Hopper GPUs — and it was all made possible using Nvidia’s Spectrum-X Ethernet networking platform


  • Nvidia and xAI collaborate on Colossus development
  • xAI has markedly cut down on 'flow collisions' during AI model training
  • Spectrum-X has been important in training the Grok AI model family

Nvidia has shed light on how xAI’s ‘Colossus’ supercomputer cluster can keep a handle on 100,000 Hopper GPUs - and it’s all down to using the chipmaker's Spectrum-X Ethernet networking platform.

Spectrum-X, the company revealed, is designed to provide massive performance capabilities to multi-tenant, hyperscale AI factories using its Remote Direct Memory Access (RDMA) network.

The platform has been deployed at Colossus, the world’s largest AI supercomputer, since its inception. The Elon Musk-owned firm has been using the cluster to train its Grok series of large language models (LLMs), which power the chatbots offered to X users.

The facility was built in collaboration with Nvidia in just 122 days, and xAI is currently in the process of expanding it, with plans to deploy a total of 200,000 Nvidia Hopper GPUs.

Training Grok takes serious firepower

The Grok AI models are extremely large, with Grok-1 measuring in at 314 billion parameters and Grok-2 outperforming Claude 3.5 Sonnet and GPT-4 Turbo at the time of launch in August.

Naturally, training these models requires significant network performance. Using Nvidia’s Spectrum-X platform, xAI recorded zero application latency degradation or packet loss as a result of ‘flow collisions’, or bottlenecks within AI networking paths.

xAI revealed it has been able to maintain 95% data throughput, enabled by Spectrum-X’s congestion control capabilities. The company added that this level of performance cannot be delivered at this scale via standard Ethernet.


Traditional Ethernet at this scale typically creates thousands of flow collisions while delivering only 60% data throughput, according to Nvidia.
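To put the two percentages side by side, here is a rough back-of-the-envelope sketch. It uses only figures quoted in this article: 95% and 60% data throughput, plus the 800Gb/s port speed Nvidia cites for the SN5600 switch. Applying those percentages to a single port is an assumption made purely for illustration, not a measurement from xAI or Nvidia, and the function and constant names are hypothetical.

```python
# Illustrative arithmetic only. The percentages and the port speed are the
# figures quoted in the article; applying them to one port is an assumption.

def effective_gbps(port_speed_gbps: float, throughput_fraction: float) -> float:
    """Return the usable bandwidth of a port at a given throughput fraction."""
    if not 0.0 <= throughput_fraction <= 1.0:
        raise ValueError("throughput_fraction must be between 0 and 1")
    return port_speed_gbps * throughput_fraction


PORT_SPEED_GBPS = 800.0            # maximum SN5600 port speed quoted by Nvidia
SPECTRUM_X_FRACTION = 0.95         # throughput xAI reports with Spectrum-X
STANDARD_ETHERNET_FRACTION = 0.60  # Nvidia's figure for traditional Ethernet

spectrum_x = effective_gbps(PORT_SPEED_GBPS, SPECTRUM_X_FRACTION)
standard = effective_gbps(PORT_SPEED_GBPS, STANDARD_ETHERNET_FRACTION)

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s usable per port")
print(f"Standard Ethernet: {standard:.0f} Gb/s usable per port")
print(f"Ratio:             {spectrum_x / standard:.2f}x")
```

Whatever the real per-port numbers are, the ratio between the two quoted percentages stays the same: roughly 1.6 times more usable bandwidth on the same hardware.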

A spokesperson for xAI said the combination of Hopper GPUs and Spectrum-X has allowed the company to “push the boundaries of training AI models” and created a “super-accelerated and optimized AI factory”.

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at Nvidia.

“The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

The Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC, according to Nvidia.

xAI opted to combine the Spectrum-X SN5600 switch with Nvidia BlueField-3 SuperNICs for higher performance.
