Baidu is the newest member of the Open Compute Project (OCP), and its collaboration on the OCP Accelerator Module (OAM) is intended to shorten the development time of AI accelerators and speed up their large-scale adoption.
A common specification is needed because AI's rapid evolution is creating an explosion of new hardware accelerators for machine learning and deep learning. The lack of interoperability among these accelerators has slowed development and increased time to adoption. Silicon giants and start-ups alike are developing new AI accelerators expected to be deployed in late 2019, giving large internet companies more choice.
“Baidu is excited to work together with Facebook and Microsoft to define the OAM specification, which significantly increases interoperability of AI accelerators and speeds up their large-scale deployment. We believe the global AI ecosystem will benefit a lot from this,” said Zhenyu Hou, Baidu Vice President.
Baidu has been building an AI ecosystem around its X-MAN technology, which uses hardware disaggregation, resource pooling, liquid cooling and modular hardware, hence its interest in the AI module specification. The third-generation X-MAN 3.0 consists of two independent 4U AI modules, each supporting eight of Nvidia's latest V100 Tensor Core GPUs. The two AI modules are connected by high-speed interconnect backplanes carrying 48 NVLink links. It also has a two-level PCIe switch fabric supporting interconnection among CPUs, AI accelerators and other I/O. The logical relationship between the CPUs and GPUs is defined in software to support a wide range of AI workloads without system bottlenecks.
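The software-defined CPU-GPU relationship described above can be pictured as carving a shared pool of accelerators into per-workload groups at the software layer rather than fixing the wiring in hardware. The following is a hypothetical sketch only; the names, API and greedy assignment policy are illustrative assumptions, not Baidu's actual software:

```python
# Hypothetical sketch of a software-defined logical topology, loosely
# modelled on the X-MAN 3.0 idea of mapping workloads to GPUs in software.
# All identifiers here are illustrative assumptions, not a real Baidu API.

from itertools import islice

# Two 4U AI modules, each with eight GPUs, as described for X-MAN 3.0.
GPUS = [f"module{m}/gpu{g}" for m in (0, 1) for g in range(8)]

def assign_gpus(workloads):
    """Greedily carve the shared GPU pool into per-workload groups.

    `workloads` maps a workload name to the number of GPUs it needs;
    returns a mapping from workload name to its assigned GPU list,
    raising if the pool is exhausted.
    """
    pool = iter(GPUS)
    assignment = {}
    for name, count in workloads.items():
        gpus = list(islice(pool, count))
        if len(gpus) < count:
            raise RuntimeError(f"not enough GPUs for workload {name!r}")
        assignment[name] = gpus
    return assignment

# Example: one training job spanning a full module, one smaller inference job.
topology = assign_gpus({"training": 8, "inference": 4})
```

Because the grouping lives in software, the same 16-GPU pool can be re-partitioned for a different workload mix without rewiring anything.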
One key partner will be Inspur, which builds the X-MAN for Baidu's data centres across China, as well as an all-in-one compute and storage platform called ABC and a cold storage server based on the Scorpio Rack Standard.