Meta AI wants to build the world's most powerful supercomputer

On January 24th, Meta introduced the AI Research SuperCluster (RSC), which is among the fastest AI supercomputers running today and will be the fastest in the world once fully built out in mid-2022, according to Meta.

AI can currently perform tasks like translating text between languages and helping identify potentially harmful content, but developing the next generation of AI will require powerful supercomputers capable of quintillions of operations per second.

RSC will help Meta’s AI researchers build better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images and video together; develop new augmented reality tools and more. Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.

Why We Need AI at This Scale

Since 2013, Facebook has been making significant strides in AI, including self-supervised learning, where algorithms can learn from vast numbers of unlabeled examples, and transformers, which allow AI models to reason more effectively by focusing on certain areas of their input. To fully realize the benefits of advanced AI, domains such as vision, speech, and language will require training increasingly large and complex models, especially for critical use cases like identifying harmful content. In early 2020, Meta decided that the best way to accelerate progress was to design a new computing infrastructure: RSC.
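As a rough illustration of the attention mechanism that lets transformers focus on certain areas of their input (a generic sketch, not Meta's implementation), minimal single-head scaled dot-product attention in PyTorch looks like this:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: each query position attends to every
    key position, weighting the values by softmaxed similarity."""
    d_k = q.size(-1)
    # Query/key similarity, scaled by sqrt(d_k) to keep softmax well-behaved.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # how strongly each position is "focused on"
    return weights @ v

# Toy example: batch of 1, sequence of 4 tokens, embedding size 8.
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention over the sequence
print(out.shape)  # torch.Size([1, 4, 8])
```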

Using RSC to Build for the Metaverse

With RSC, Meta can more quickly train models that use multimodal signals to determine whether an action, sound, or image is harmful or benign. This research will help keep people safe not only on Meta's services today, but also in the future, as Meta builds for the metaverse.
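To make the idea concrete, a multimodal model of this kind might project text and image embeddings into a shared space and fuse them before classifying. The sketch below is purely illustrative; the class name, dimensions, and fusion scheme are assumptions, not Meta's architecture:

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    """Hypothetical sketch: fuse text and image embeddings, then classify
    the content as benign or harmful. All dimensions are illustrative."""
    def __init__(self, text_dim=768, image_dim=1024, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # text into shared space
        self.image_proj = nn.Linear(image_dim, hidden)  # image into shared space
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, 2),  # logits: [benign, harmful]
        )

    def forward(self, text_emb, image_emb):
        # Concatenate the projected modalities and classify the fused signal.
        fused = torch.cat([self.text_proj(text_emb),
                           self.image_proj(image_emb)], dim=-1)
        return self.classifier(fused)

model = MultimodalClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 2])
```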

RSC: Under the hood

AI supercomputers are built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC today comprises 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, with each A100 GPU being more powerful than the V100 used in Meta's previous system. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC's storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
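To sketch how training code typically targets a cluster built this way (generic PyTorch, not RSC-specific code), each process drives one GPU, and gradients are synchronized across GPUs and nodes via NCCL over the fabric:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; a launcher such as torchrun sets RANK,
    # WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")  # NCCL handles GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    # DDP all-reduces gradients across every GPU in the job after backward().
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = model(x).square().mean()
    loss.backward()  # gradient sync runs here, over the network between nodes
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with a tool like torchrun (one process per GPU), the same script scales from a single DGX node to many, with the InfiniBand fabric carrying the inter-node gradient traffic.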

Early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.
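For a sense of what an NCCL workload exercises, here is a toy all-reduce timing probe. This is an illustrative sketch that assumes a process group has already been initialized with the NCCL backend; it is not Meta's benchmark:

```python
import torch
import torch.distributed as dist

def time_allreduce(numel=64 * 1024 * 1024, iters=20):
    """Rough all-reduce throughput probe; assumes dist.init_process_group("nccl")
    has already run, with one GPU bound per process."""
    buf = torch.randn(numel, device="cuda")
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(buf)  # sums the buffer across all GPUs via NCCL
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000  # elapsed_time returns milliseconds
    gb = buf.element_size() * numel * iters / 1e9  # data reduced per GPU
    print(f"~{gb / seconds:.1f} GB/s per GPU (rough estimate)")
```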

Phase two and beyond

RSC is up and running today, but its development is ongoing. Once phase two of building out RSC is completed, it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed-precision compute. Through 2022, Meta will work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x, roughly in line with the 2.6x growth in GPU count (16,000 / 6,080 ≈ 2.6). The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.