Elon Musk's supercomputer with 100,000 Nvidia GPUs uses proprietary Spectrum-X networking platform


In brief: Elon Musk's chaotic foray into the AI business has resulted in the building of a massive supercomputer in record time. Curiously, Nvidia notes that this supersystem doesn't use the traditional InfiniBand networking standard to transfer data, as one might expect.

The high-performance computing system built by xAI, featuring 100,000 Hopper GPUs, is named Colossus. The system uses Nvidia's Spectrum-X networking platform instead of InfiniBand, which Nvidia acquired in 2019 along with the last independent supplier of the technology, Mellanox.

Nvidia stated that the designers of Colossus achieved the system's massive scale largely thanks to Spectrum-X. This technology significantly improves direct memory access network performance while using "standards-based" Ethernet communication devices. Colossus was constructed in record time, and the xAI team is now in the process of doubling its capacity by installing an additional 100,000 Hopper GPUs into the system.

Standard Ethernet devices are insufficient for Colossus, as they can cause thousands of flow collisions and deliver a meager 60 percent data throughput. In contrast, Spectrum-X guarantees "zero application latency degradation" and eliminates packet loss due to flow collisions, maintaining a significantly higher 95 percent data throughput through its "congestion control" system. Colossus is training large language models belonging to the Grok family and requires "unprecedented" network performance to do so.

Spectrum-X isn't your run-of-the-mill Ethernet technology. The core of the platform is the Spectrum SN5600 Ethernet switch, which Nvidia claims can support up to 800 Gbps per single port. This switch is built on a Spectrum-4 custom ASIC, and xAI has paired it with Nvidia BlueField-3 SuperNICs to effectively accelerate GPU-to-GPU communication.
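To put the quoted figures side by side, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the 60 percent and 95 percent throughput figures apply uniformly to a single link running at the 800 Gbps per-port peak. Nvidia has not described its numbers this way, and the function and constant names are hypothetical.

```python
def effective_throughput_gbps(line_rate_gbps: float, efficiency: float) -> float:
    """Return usable bandwidth for a nominal line rate and a throughput fraction."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return line_rate_gbps * efficiency


# Figures quoted in the article; applying them to one port is an assumption.
PORT_RATE_GBPS = 800.0              # claimed per-port peak of the SN5600
STANDARD_ETHERNET_EFFICIENCY = 0.60
SPECTRUM_X_EFFICIENCY = 0.95

standard = effective_throughput_gbps(PORT_RATE_GBPS, STANDARD_ETHERNET_EFFICIENCY)
spectrum = effective_throughput_gbps(PORT_RATE_GBPS, SPECTRUM_X_EFFICIENCY)
relative_gain = spectrum / standard - 1.0

print(f"Standard Ethernet: {standard:.0f} Gbps usable")
print(f"Spectrum-X:        {spectrum:.0f} Gbps usable")
print(f"Relative gain:     {relative_gain:.0%}")
```

Under those assumptions, the gap works out to roughly 480 Gbps versus 760 Gbps of usable bandwidth per port, or about 58 percent more data moved over the same link.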


InfiniBand was specifically designed to meet the communication needs of HPC systems, keeping packet loss to an absolute minimum. While Ethernet has a significantly higher rate of data loss, it remains highly popular, even in the speed-sensitive HPC market, due to factors such as high compatibility, vendor choice, and potentially higher bandwidth capabilities per single port.

Nvidia stated that its Spectrum-X Ethernet networking platform can accelerate the development of powerful AI systems like Colossus, reducing the time needed to bring massive HPC machines online. Spectrum-X technology is scalable and can potentially provide networking features that were previously available only through InfiniBand solutions.

Source Tech Spot