BY DR. PHIL GARROU, Contributing Editor
The need for ever more computational power continues to grow, and exaflop (10^18 floating-point operations per second) capabilities may soon become necessary. A paper by AMD, “Design and Analysis of an APU for Exascale Computing,” presented at the IEEE High Performance Computer Architecture conference (HPCA), laid out AMD’s vision for an exascale node architecture, including low-power and high-performance CPU cores, integrated energy-efficient GPU units, in-package high-bandwidth 3D memory, die-stacking and chiplet technologies, and advanced memory systems.
Two of the building blocks of this exascale node architecture are: (1) its chiplet-based approach, which decouples performance-critical processing components such as CPUs and GPUs from components that do not scale well with technology (e.g., analog components), allowing each piece to be fabricated in an individually optimized process technology for cost reduction and design reuse in other market segments; and (2) the use of in-package 3D memory, stacked directly above the high-bandwidth-consuming GPUs.
The exascale heterogeneous processor (EHP) (Figure 1) is an accelerated processing unit (APU) consisting of CPU and GPU compute integrated with in-package 3D DRAM. The overall structure makes use of a modular “chiplet” design, with the chiplets 3D-stacked on “active interposer” base chips. As the authors note, “The use of advanced packaging technologies enables a large amount of computational and memory resources to be located in a single package.” The exascale targets for memory bandwidth and energy efficiency are extremely challenging for off-package memory solutions, so AMD proposes integrating 3D-stacked DRAM into the EHP package.
In the center of the EHP are two CPU clusters, each consisting of four multi-core CPU chiplets stacked on an active interposer base die. On either side of the CPU clusters are a total of four GPU clusters, each consisting of two GPU chiplets on a respective active interposer. A 3D stack of DRAM sits atop each GPU chiplet; the DRAM is stacked directly on the GPU chiplets to maximize bandwidth. The interposers underneath the chiplets provide interconnection between the chiplets along with other functions such as external I/O interfaces, power distribution and system management. The interposers maintain high-bandwidth connectivity among themselves through wide, short-distance, point-to-point paths.
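To make the floorplan concrete, the short sketch below simply tallies the die counts implied by the description above. The cluster and chiplet counts follow the article; everything else (variable names, the one-stack-per-GPU-chiplet assumption) is an illustrative reading of the text, not a figure taken from the AMD paper.

```python
# Illustrative tally of the EHP package composition described above.
# Cluster/chiplet counts follow the article; the rest is an assumption.

CPU_CLUSTERS = 2                    # each on its own active interposer
CPU_CHIPLETS_PER_CLUSTER = 4
GPU_CLUSTERS = 4                    # each on its own active interposer
GPU_CHIPLETS_PER_CLUSTER = 2
DRAM_STACKS_PER_GPU_CHIPLET = 1     # one 3D DRAM stack atop each GPU chiplet (assumed)

cpu_chiplets = CPU_CLUSTERS * CPU_CHIPLETS_PER_CLUSTER       # 8
gpu_chiplets = GPU_CLUSTERS * GPU_CHIPLETS_PER_CLUSTER       # 8
dram_stacks = gpu_chiplets * DRAM_STACKS_PER_GPU_CHIPLET     # 8
active_interposers = CPU_CLUSTERS + GPU_CLUSTERS             # 6 base dies

print(f"CPU chiplets: {cpu_chiplets}, GPU chiplets: {gpu_chiplets}, "
      f"DRAM stacks: {dram_stacks}, active interposers: {active_interposers}")
```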
Chiplets
The performance targets require a large amount of compute and memory to be integrated into a single package. Rather than build a single, monolithic system on chip (SOC), AMD proposes to leverage advanced die-stacking technologies to decompose the EHP into smaller components consisting of active interposers and chiplets. Each chiplet houses either multiple GPU compute units or CPU cores. The chiplet approach differs from conventional multi-chip module (MCM) designs in that each individual chiplet is not a complete chip; for example, the CPU chiplet contains CPU cores and caches but lacks memory interfaces and external I/O.
A monolithic SOC imposes a single process technology choice on all components in the system. With chiplets and interposers, each discrete piece of silicon can be optimized for its own functions. Smaller chiplets are also expected to yield better because of their reduced area and, when combined with known good die (KGD) testing, can be assembled into larger systems at reasonable cost.
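The yield argument can be illustrated with a simple defect-density calculation. The sketch below uses the standard Poisson yield approximation, Y = exp(-D x A); the defect density, die areas and chiplet count are arbitrary numbers chosen for illustration, not values from the AMD paper.

```python
import math

# Simple Poisson yield model: Y = exp(-D * A), where D is defect density
# (defects per cm^2) and A is die area (cm^2). All numbers are illustrative.

DEFECT_DENSITY = 0.2      # defects per cm^2 (assumed)
MONOLITHIC_AREA = 6.0     # cm^2, one large monolithic SOC (assumed)
CHIPLET_AREA = 0.75       # cm^2, one of eight chiplets with the same total area (assumed)
CHIPLETS_NEEDED = 8

def poisson_yield(defect_density: float, area: float) -> float:
    """Probability that a die of the given area has zero defects."""
    return math.exp(-defect_density * area)

monolithic_yield = poisson_yield(DEFECT_DENSITY, MONOLITHIC_AREA)
chiplet_yield = poisson_yield(DEFECT_DENSITY, CHIPLET_AREA)

# With KGD testing, only working chiplets are assembled, so a defect costs
# one small die instead of the whole SOC. Without testing, the package is
# only as good as its worst chiplet, and the advantage disappears.
print(f"Monolithic SOC yield:       {monolithic_yield:.1%}")
print(f"Single chiplet yield:       {chiplet_yield:.1%}")
print(f"Untested 8-chiplet package: {chiplet_yield ** CHIPLETS_NEEDED:.1%}")
```

With these assumed numbers, a single small chiplet yields roughly 86% while the large monolithic die yields about 30%; assembling eight untested chiplets lands right back at ~30%, which is why KGD testing is the other half of the cost argument.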
The decomposition (or disintegration, as I prefer to call it) of the EHP into smaller pieces is also expected to enable silicon-level reuse of IP (note: this is one of the main drivers of the DARPA CHIPS program).