Microsoft has released the latest details of its AI supercomputer system, but stopped short of delivering the power figures. The system is to use the latest A100 GPU from Nvidia with processors from AMD.
Microsoft announced back in May that it would host an AI supercomputer in the cloud with OpenAI, but did not detail the technology it would use. The virtual machine (VM) developed by Microsoft for AI will combine eight Nvidia A100 Ampere GPUs with an AMD processor.
This architecture, called the ND A100 v4 VM series, can scale up to thousands of GPUs with 1.6 Tbit/s of interconnect bandwidth per VM. Each 400W GPU is provided with its own dedicated topology-agnostic 200 Gbit/s NVIDIA Mellanox HDR InfiniBand connection.
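The per-VM bandwidth figure follows directly from the per-GPU links. A minimal sketch of that arithmetic, using only the numbers quoted above (the constant and function names are illustrative, not from Microsoft):

```python
# Figures quoted in the article; names are illustrative only.
GPUS_PER_VM = 8      # A100 GPUs in each ND A100 v4 VM
LINK_GBIT_S = 200    # one dedicated HDR InfiniBand link per GPU


def vm_bandwidth_tbit_s(gpus=GPUS_PER_VM, link_gbit_s=LINK_GBIT_S):
    """Aggregate interconnect bandwidth per VM, in Tbit/s."""
    return gpus * link_gbit_s / 1000


print(vm_bandwidth_tbit_s())  # 1.6
```

Eight links at 200 Gbit/s each account for the full 1.6 Tbit/s per VM.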
These GPU sub-systems will be coupled with AMD’s ‘Rome’ processors. These use a hybrid multi-die architecture that decouples two streams: eight dies for the processor cores to map directly to the GPUs, and one I/O die that supports security and communication outside the processor. The latest 64 core / 128 thread version, the EPYC 7H12, is built on a 7nm process from TSMC with a 14nm I/O chip in the package. This is designed for liquid-cooled data centre operation with a 2.6GHz base frequency and power consumption up to 280W and delivers up to 4.2TFLOPS.
This gives an estimated power envelope of about 3.7kW for each VM.
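How the 3.7kW figure was derived is not spelled out. A rough check, assuming the envelope is simply the sum of the GPU and CPU power figures quoted above (memory, networking and cooling are ignored), and treating the number of processor sockets as a free parameter because it is not stated:

```python
# Power figures quoted in the article; the summing model is an assumption.
GPU_COUNT = 8
GPU_WATTS = 400   # per A100 GPU
CPU_WATTS = 280   # per EPYC 7H12, upper bound


def vm_power_kw(cpu_sockets):
    """Summed GPU and CPU power per VM in kW (hypothetical simple model)."""
    return (GPU_COUNT * GPU_WATTS + cpu_sockets * CPU_WATTS) / 1000


for sockets in (1, 2):
    print(sockets, vm_power_kw(sockets))
```

Under these assumptions one processor gives 3.48kW and two give 3.76kW, so the quoted 3.7kW sits nearer the two-socket case.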
All this is needed to handle large machine learning models, says Microsoft. Training large models at this scale requires large clusters of hundreds of machines with specialized AI accelerators, interconnected by high-bandwidth networks inside and across the machines.
This builds on the previous public cloud offering of clusters of VMs with Nvidia’s V100 Tensor Core GPUs, connected by a Mellanox InfiniBand network. “Most customers will see an immediate boost of 2x to 3x compute performance over the previous generation of systems based on NVIDIA V100 GPUs with no engineering work,”