Power management for 14,336 cores in a chiplet-based waferscale AI engine

October 15, 2021 // By Nick Flaherty
Researchers in the US have developed a custom power architecture for the world’s largest waferscale machine learning engine, with 14,336 ARM Cortex-M3 cores.

The researchers at UCLA and the University of Illinois have developed a custom power management architecture for a waferscale AI engine based on chiplets.

The system comprises an array of 1024 tiles, where each tile is composed of two chiplets, for a total of 2048 chiplets and about 15,000 mm2 of total area.

“To the best of our knowledge, this is the largest chiplet assembly based system ever attempted,” said the team in a recent paper. “In terms of active area, our prototype system is about 10x larger than a single chiplet-based system from Nvidia/AMD and about 100x larger than the 64-chiplet Simba research system from Nvidia.”

The chiplet-based waferscale system developed by UCLA uses a silicon interconnect fabric (Si-IF) to tightly integrate many chiplets on a high-density interconnect wafer. The chiplets attach through fine-pitch copper pillar I/Os (10µm pitch), which are at least 16x denser than the conventional µ-bumps used in interposer-based systems, with an inter-chiplet spacing of about 100µm.
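As a rough illustration of the density claim, the sketch below assumes I/Os sit on a regular square grid, so that areal density scales with the inverse square of the pitch. Under that assumption (which is ours, not the paper's), a 16x density advantage at 10µm pitch implies a comparison µ-bump pitch of about 40µm. The function names are hypothetical.

```python
import math

PILLAR_PITCH_UM = 10.0  # copper pillar pitch quoted in the article
DENSITY_RATIO = 16.0    # "at least 16x denser", from the article


def ios_per_mm2(pitch_um: float) -> float:
    """I/Os per square millimetre, assuming a regular square grid."""
    per_mm = 1000.0 / pitch_um
    return per_mm ** 2


def implied_pitch_um(fine_pitch_um: float, density_ratio: float) -> float:
    """Coarser pitch that would be density_ratio times less dense per area."""
    return fine_pitch_um * math.sqrt(density_ratio)


pillar_density = ios_per_mm2(PILLAR_PITCH_UM)
bump_pitch = implied_pitch_um(PILLAR_PITCH_UM, DENSITY_RATIO)

print(f"Pillar density: {pillar_density:.0f} I/Os per mm^2")
print(f"Implied micro-bump pitch: {bump_pitch:.0f} um")
```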

“The scale of this prototype system forced us to rethink several aspects of the design flow. Because this is the first attempt at building such a system, there were several unknowns around the manufacturing and assembly process,” said the team in the paper. “As a result, fault tolerance and resiliency, was one of the primary drivers behind the design decisions we took. We also ensured that the design decisions were not too complex, such that they could be reliably implemented by a small team,” they said.

Each tile comprises two chiplets: a compute chiplet and a memory chiplet. Each 40nm compute chiplet contains 14 independently programmable ARM Cortex-M3 processor cores with 64kbits of local SRAM, while the memory chiplet provides 512KB of globally shared memory. The system is architected as a unified memory system where any core on any tile can directly access the globally shared memory across the entire waferscale system using the interconnect.
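The headline figures follow directly from the per-tile numbers. The short sketch below simply multiplies them out, assuming all 1024 tiles are populated and taking 1MB as 1024KB; the variable names are illustrative.

```python
TILES = 1024                     # tile count given in the article
CHIPLETS_PER_TILE = 2            # one compute chiplet plus one memory chiplet
CORES_PER_COMPUTE_CHIPLET = 14   # Cortex-M3 cores per compute chiplet
SHARED_MEMORY_PER_TILE_KB = 512  # globally shared memory per memory chiplet

total_chiplets = TILES * CHIPLETS_PER_TILE
total_cores = TILES * CORES_PER_COMPUTE_CHIPLET
# Assumption: 1 MB = 1024 KB
total_shared_memory_mb = TILES * SHARED_MEMORY_PER_TILE_KB // 1024

print(f"Chiplets: {total_chiplets}")
print(f"Cores: {total_cores}")
print(f"Globally shared memory: {total_shared_memory_mb} MB")
```

This gives 2048 chiplets and 14,336 cores, matching the article, and about 512MB of globally shared memory across the wafer.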

The chiplets are designed and fabricated in the TSMC 40nm-LP process and the peak power per tile is about 350mW when operating at a voltage of 1.21V. This means about 290A of current needs to be delivered to the chiplets across the wafer. The number of metal layers in the substrate is restricted to four in order to maximize yield. Two are dedicated to inter-chip signaling, leaving two layers available for power distribution.
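The total current figure can be checked with a back-of-the-envelope calculation. The sketch below assumes every one of the 1024 tiles draws its peak power at the same time and ignores any losses in regulation or distribution, so it is an estimate rather than the team's own calculation.

```python
TILE_COUNT = 1024              # tiles on the wafer, from the article
PEAK_POWER_PER_TILE_W = 0.350  # about 350 mW per tile
OPERATING_VOLTAGE_V = 1.21     # operating voltage quoted in the article


def total_peak_power_w(tiles: int, per_tile_w: float) -> float:
    """Wafer-level peak power if all tiles peak simultaneously."""
    return tiles * per_tile_w


def supply_current_a(power_w: float, voltage_v: float) -> float:
    """Current needed to deliver power_w at voltage_v (I = P / V)."""
    return power_w / voltage_v


wafer_power_w = total_peak_power_w(TILE_COUNT, PEAK_POWER_PER_TILE_W)
wafer_current_a = supply_current_a(wafer_power_w, OPERATING_VOLTAGE_V)

print(f"Peak power: {wafer_power_w:.1f} W")
print(f"Supply current: {wafer_current_a:.0f} A")
```

The result is roughly 358W and a little under 300A, in line with the article's figure of about 290A.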

With a global power distribution network, the chiplets near the edge receive power at a much higher voltage (2.5V), while the chiplets far from the edge receive power at a lower voltage of 1.4V because of the voltage droop caused by resistive power loss.

This makes the design of the low-dropout (LDO) regulator challenging, as it has to produce a stable voltage of 1.1V (nominal) for the logic devices while the DC supply voltage can vary between 1.4V and 2.5V depending on where the chiplet is placed on the wafer. The team therefore built a custom LDO which can track this wide input voltage range.
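To see what this input range means for a linear regulator, the sketch below computes the headroom (input minus output voltage) and the ideal efficiency at both ends of the range. It uses the textbook approximation that a linear regulator's efficiency is at most Vout/Vin, ignoring quiescent current; it does not model the team's custom LDO.

```python
V_OUT_NOMINAL = 1.1  # regulated logic voltage from the article
V_IN_CENTRE = 1.4    # supply seen by chiplets far from the wafer edge
V_IN_EDGE = 2.5      # supply seen by chiplets near the wafer edge


def headroom_v(v_in: float, v_out: float = V_OUT_NOMINAL) -> float:
    """Voltage dropped across the regulator's pass device."""
    return v_in - v_out


def ideal_efficiency(v_in: float, v_out: float = V_OUT_NOMINAL) -> float:
    """Upper bound on linear regulator efficiency (quiescent current ignored)."""
    return v_out / v_in


for label, v_in in (("centre", V_IN_CENTRE), ("edge", V_IN_EDGE)):
    print(
        f"{label}: headroom {headroom_v(v_in):.2f} V, "
        f"ideal efficiency {ideal_efficiency(v_in):.0%}"
    )
```

Under these assumptions the regulator works with only about 0.3V of headroom at the centre of the wafer, while at the edge it must drop about 1.4V and can be at most 44% efficient.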

The other challenge is that the LDO regulator has to support up to 350mW of peak power while sustaining up to 200mA current demand fluctuation (worst case) within a few cycles. In order to achieve good regulation under these operating conditions, the LDO regulator needs sufficient decoupling capacitance at the output. Such high capacitance requirements are usually fulfilled using off-chip discrete decoupling capacitors.

However, the waferscale design means off-chip capacitors can only be placed around the edge of the array. As a result, the chiplets at the centre of the array can be as far as 70 mm away from the nearest capacitor. To address this, the team designed a custom on-chip decoupling capacitor and dedicated ∼35% of the total tile area to decoupling capacitance, giving about 20 nF per tile. This design ensures that the regulated voltage is always between 1.0V and 1.2V across process/voltage/temperature corners.
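A first-order estimate shows why this much capacitance matters. The sketch below treats the 20 nF as an ideal capacitor that alone supplies a 200mA load step, using ΔV = I·Δt/C, and asks how long it can do so before the rail falls from the 1.1V nominal to the 1.0V lower bound. It ignores the regulator's own loop response and any parasitic resistance or inductance, so it is only an illustration and not the team's analysis.

```python
DECAP_PER_TILE_F = 20e-9  # about 20 nF of on-chip decoupling per tile
LOAD_STEP_A = 0.200       # worst-case current demand fluctuation
V_NOMINAL = 1.1           # nominal regulated voltage
V_LOWER_BOUND = 1.0       # lower regulation bound from the article


def droop_v(load_step_a: float, duration_s: float, cap_f: float) -> float:
    """Voltage droop when an ideal capacitor alone supplies a current step."""
    return load_step_a * duration_s / cap_f


def hold_up_time_s(cap_f: float, allowed_droop_v: float, load_step_a: float) -> float:
    """Time until the droop reaches allowed_droop_v (inverse of droop_v)."""
    return cap_f * allowed_droop_v / load_step_a


allowed_droop = V_NOMINAL - V_LOWER_BOUND
t_hold = hold_up_time_s(DECAP_PER_TILE_F, allowed_droop, LOAD_STEP_A)

print(f"Allowed droop: {allowed_droop * 1e3:.0f} mV")
print(f"Hold-up time: {t_hold * 1e9:.1f} ns")
```

With these numbers the capacitor can carry the full step for about 10 ns before the rail reaches the lower bound, which gives a sense of how quickly the regulator has to respond.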

www.ucla.edu
