Graphcore, regarded as the world's leading AI chip developer, has detailed the power performance of its 7nm Colossus Mk2 AI chip, highlighting a complex balance of size, power and performance.
The first-generation chip, the Mk1, showed a significant power advantage over graphics processing units (GPUs) in training AI networks, and the Mk2 takes this further at the system level. Lower power consumption is the key advantage of the Graphcore chip architecture, which integrates 1472 processor tiles with a record 896Mbytes of on-chip SRAM. Running with a 1.325GHz clock, the chip consumes 300W and can process a peak of 250Tflop/s of floating point operations for machine learning.
“One third of the cost [of AI in the data centre] is power so power efficiency is an important cost factor for AI,” said Simon Knowles, CTO and co-founder of Graphcore.
He gave details of the Colossus Mk2 chip at the Hot Chips conference this week. The chip has a very specific 59,334,610,787 active transistors in 823mm² on a 7nm process, compared to the 16nm process of the first generation.
Four of the Mk2 chips form the basic building block of the M2000 intelligent processor unit (IPU), providing a peak performance of 1Pflop/s in a 1U rack unit. This has a 1.5kW thermal design power (TDP), and machine learning training applications typically use 1kW, for example when training the BERT-Large model.
“What we learned from the first generation was that bulk synchrony can really challenge power supplies, especially within the power envelope of PCIe cards, so we deliver the Mk2 in a disaggregated pizza box chassis,” said Knowles. “The distributed memory architecture limits the energy required for memory and transport so that the energy of arithmetic dominates, which is what we want.”
The 300W per chip power figure for the Mk2 compares to 400 to 500W for Nvidia’s A100 GPU, 450W for Google’s TPUv3, and 315W for the two-chip PCIe card using Graphcore’s Mk1 chip.
Instead of using PCIe, a specially designed interconnect called IPU-Fabric uses standard copper or optical OSFP connectors, linking IPUs up and down the rack via a separate dedicated gateway chip, the GC4000. In larger configurations, communication between IPU-PODs uses tunnelling-over-Ethernet technology to maintain throughput, while allowing the use of standard QSFP interconnect and 100Gb Ethernet switches.
At a system level, this means a Graphcore Pod with 16 Mk2 IPU chips, server and DDR DRAM memory consumes 7kW, or 437 system watts per chip, and delivers 4Pflop/s of peak performance in a 4U footprint. As a more standard metric, this is 571Gflops/W.
This compares to 6.5kW for the 8 GPU Nvidia system that delivers 2.5Pflop/s (384Gflops/W) or 9.3kW for a 16 chip TPU system with 1.9Pflop/s (204Gflops/W).
To compare equal sized 4U systems, the first generation Dell design using the Mk1 IPUs had 16 processors on eight PCIe cards delivering 1.6Pflop/s in a 2.5kW thermal envelope. This is 156 system watts per chip or 640Gflops/W.
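The efficiency comparison above is simple division, and a short sketch makes it easy to re-check. The Python below takes the peak throughput, system power and chip count quoted for each system and derives throughput per watt and system watts per chip. The system labels and function names are illustrative only, not anything from Graphcore; note that dividing petaflops by kilowatts lands in the hundreds of Gflop/s per watt.

```python
# Quoted figures from the article: (peak Pflop/s, system power in kW, chips).
# The dictionary keys are illustrative labels, not official product names.
SYSTEMS = {
    "Graphcore Mk2 pod": (4.0, 7.0, 16),
    "Nvidia 8-GPU system": (2.5, 6.5, 8),
    "16-chip TPU system": (1.9, 9.3, 16),
    "Dell Mk1 system": (1.6, 2.5, 16),
}


def gflops_per_watt(peak_pflops, power_kw):
    """Peak throughput per watt of system power, in Gflop/s per W."""
    flops = peak_pflops * 1e15   # Pflop/s -> flop/s
    watts = power_kw * 1e3       # kW -> W
    return flops / watts / 1e9   # flop/s per W -> Gflop/s per W


def watts_per_chip(power_kw, chips):
    """System power divided evenly across the processor chips, in W."""
    return power_kw * 1e3 / chips


if __name__ == "__main__":
    for name, (pflops, kw, chips) in SYSTEMS.items():
        eff = gflops_per_watt(pflops, kw)
        per_chip = watts_per_chip(kw, chips)
        print(f"{name}: {eff:.1f} Gflop/s per W, {per_chip:.1f} W per chip")
```

These are peak figures, so the results show only how the quoted numbers relate to each other, not measured training efficiency.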