Power management is key to 10nm 48-core ARM server chip
Details of Qualcomm's Centriq 2400 system-on-chip (SoC) and its Falkor processor core were shown at the Hot Chips conference this week. This is the first ARM-based server chip built on a 10nm process, running at 2GHz from a 1.0V supply.
Over half the servers sold by 2020 will be deployed for cloud computing services, according to market researchers IDC, and the processors need to be optimized to address the demand for scalable performance under the unique characteristics of cloud software and services.
Cloud services need to perform well in heavily loaded, multi-tenant environments, and the hardware platform needs to maximize aggregate compute performance while reducing the cloud operator's operating costs, which are largely driven by the cost of power and cooling.
The Falkor core was designed from the ground up specifically for the cloud datacentre server market. Each ARMv8 64-bit core pairs two custom Falkor CPUs with a shared L2 cache and a shared bus interface to the Qualcomm System Bus (QSB) ring interconnect. This modular building block serves as the foundation for the 48-core Centriq 2400 SoC design.
A range of power management techniques were included in the design from the start. Each of the CPUs and the L2 has independent power-state control, with entry to and exit from low-power states handled by hardware state machines for ultra-fast transitions. Power-collapsed sleep states use hardware state retention, which also allows ultra-fast recovery.
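The independent power-state control described above can be pictured with a small model. This is only a sketch under stated assumptions: the state names, the `PowerDomain` and `Duplex` classes and the `context` dictionary are hypothetical and are not taken from Qualcomm's design, and real hardware does this in state machines rather than software.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names; the article does not list the real ones.
    RUN = "run"
    COLLAPSED = "collapsed"


class PowerDomain:
    """One independently controlled unit: a CPU or the shared L2."""

    def __init__(self, name):
        self.name = name
        self.state = State.RUN
        self.context = {}       # live state, lost when power is collapsed
        self._retained = None   # copy held by the modelled retention logic

    def sleep(self):
        # Save state before collapsing power so that wake-up is fast.
        if self.state is State.RUN:
            self._retained = dict(self.context)
            self.context = {}
            self.state = State.COLLAPSED

    def wake(self):
        # Restore the retained state instead of rebooting the unit.
        if self.state is State.COLLAPSED:
            self.context = self._retained
            self._retained = None
            self.state = State.RUN


class Duplex:
    """Two CPUs plus L2, each with its own power state (assumed layout)."""

    def __init__(self):
        self.domains = {n: PowerDomain(n) for n in ("cpu0", "cpu1", "l2")}
```

In this model, putting `cpu0` to sleep leaves `cpu1` and `l2` running, and waking it brings back whatever was in `context` before the collapse.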
The micro-architecture of each CPU has a 4-issue, 8-dispatch heterogeneous pipeline that is designed to optimize performance per unit of power, with variable-length pipelines that are tuned per function to maximize throughput and minimize idle hardware. It also uses out-of-order and rename resources to keep instruction retirement off the performance-critical path, avoiding stalls that waste energy.
The core is also designed to handle memory-intensive workloads more efficiently, as these can burn significant amounts of power in the datacentre. It uses a new split instruction cache comprising a single-cycle, low-power 24KB L0 instruction cache that complements its 64KB L1 I-cache. The two caches are managed exclusively, so a line is held in only one of them at a time, to provide a total of 88KB of low-latency I-cache. The core also has a 32KB L1 data cache with a 3-cycle load-use latency, backed by a multi-level hardware prefetch engine that dynamically adapts to system conditions.
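The 88KB figure follows from exclusive management: because no line is duplicated, the capacities add up (24KB + 64KB). The toy model below illustrates that idea only. It assumes 64-byte lines, full associativity and LRU replacement, none of which is stated in the article, and the `ExclusiveICache` class is a hypothetical name.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed line size; the article does not give one


class ExclusiveICache:
    """Toy two-level exclusive cache: a line lives in L0 or L1, never both."""

    def __init__(self, l0_bytes=24 * 1024, l1_bytes=64 * 1024):
        self.l0_lines = l0_bytes // LINE_BYTES
        self.l1_lines = l1_bytes // LINE_BYTES
        self.l0 = OrderedDict()  # line number -> None, least recent first
        self.l1 = OrderedDict()

    def _fill_l0(self, line):
        self.l0[line] = None
        if len(self.l0) > self.l0_lines:
            # The L0 victim moves down to L1 rather than being dropped.
            victim, _ = self.l0.popitem(last=False)
            self.l1[victim] = None
            if len(self.l1) > self.l1_lines:
                self.l1.popitem(last=False)

    def fetch(self, addr):
        line = addr // LINE_BYTES
        if line in self.l0:
            self.l0.move_to_end(line)
            return "L0"
        if line in self.l1:
            # Promote to L0 and remove from L1 to stay exclusive.
            del self.l1[line]
            self._fill_l0(line)
            return "L1"
        self._fill_l0(line)
        return "miss"

    def capacity_bytes(self):
        return (self.l0_lines + self.l1_lines) * LINE_BYTES
```

In this model a loop whose code footprint is exactly 88KB misses only on its first pass; later passes are served from L0 or L1, which would not hold if the L0 contents were duplicated in L1.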
The 48 Falkor CPUs (in 24 cores) are connected by a high-bandwidth, low-latency ring interconnect that extends out to the chip's large L3 cache and multiple memory controllers, avoiding on-die non-uniform memory access (NUMA) effects (see diagram above). The memory subsystem also uses shared resource management techniques such as L3 Quality of Service (QoS) extensions, and it raises effective memory bandwidth through in-line, transparent memory compression.
The chip comes in a 55 x 55mm socketed package. It is sampling to key customers (think Google, Facebook and Amazon) and will enter production later this year. Qualcomm has also released a motherboard design based around Microsoft's Project Olympus specification, which is used by the Open Compute Project Foundation to standardise datacentre building blocks.