When AMD bought ATI, the long-term plan was to integrate the GPU cores on the same die as the CPU cores, bringing a new scale of heterogeneous computing. The integration turned out to be harder than originally planned, and only the latest Kaveri APUs are starting to show its benefits.

However, there is no doubt that APUs are here to stay. Silicon scaling projections show that the cost of moving information off-die will increase in the coming years. The Pacific Northwest National Laboratory puts it succinctly:

There is a strong agreement among researchers on the increasing cost of data movement with respect to computation. This ratio will further increase in future systems that will approach NTV operation levels: the energy consumption of a double precision register-to-register floating point operation is expected to decrease by 10x by 2018. The energy cost of moving data from memory to processor is not expected to follow the same trend, hence the relative energy cost of data movement with respect to performing a register-to-register operation will increase (energy wall — analogous to the memory wall).

This implies that moving data will, in relative terms, cost about 10 times more in the future than it does now. The original DARPA report on exascale challenges identified the solution to this energy wall problem: locality. All computations have to be performed locally, minimizing the movement of data.

The principle of locality imposes strong constraints on the architectures that will be developed in the future. A traditional architecture based on a discrete CPU and a discrete GPU connected by some interconnect, PCIe or otherwise, will not work, because moving data between the CPU and an external GPU will be too expensive in energy terms, which will hurt performance. Even moving data to external memory on the board will be too expensive. As a consequence, all future designs for exascale supercomputers will integrate the computation units on the same die and the main memory on the same die or package. Those exascale supercomputers will be capable of performing more than one quintillion calculations per second, roughly 50 times faster than today's fastest available supercomputers. I will discuss the designs by IBM, Fujitsu, Intel, Nvidia and others elsewhere; in this article I will only discuss the AMD proposal.

For the physical and technological reasons stated above, the only option left to AMD for future high-performance systems is the APU. A traditional architecture based on a discrete CPU and a discrete GPU connected by some interconnect doesn't scale up enough (see the figure above). Those high-performance APUs will not be the ones in your current laptop. Current APUs are constrained in at least two ways: current node density and the DDR3 memory bottleneck.

But these constraints are temporary. Stacked DRAM will eliminate the DDR3 memory bottleneck, and future nodes will bring the needed density. At the 7nm node, the dual-module iCPU of Kaveri would occupy only about 3% of the total die of the future APU. We can increase the number of iCPU cores from four to eight and then double the size of each individual core, and the iCPU would still occupy only 12% of the total die, the remaining 88% being iGPU. This means that a discrete GPU could be, a priori, at most about 12% faster than the iGPU on the APU, making the development of discrete GPUs irrelevant. Moreover, this "12% faster" is not even correct, because it ignores the performance penalty of moving data between the discrete CPU and the discrete GPU, and it also ignores that the discrete GPU would spend part of that 12% of extra space on its separate die for required logic (its own memory controller, interconnect logic…). When all the technical details are considered, the net result is that a traditional combination of discrete CPU and discrete GPU is slower than the high-performance APU.
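The die-area argument above can be checked with simple arithmetic. This sketch uses the article's own 3% starting estimate for a Kaveri-class dual-module (four-core) iCPU at 7nm; the scaling factors are those stated in the text.

```python
# Back-of-envelope check of the die-area argument.
# The 3% starting figure is the article's own estimate for a
# dual-module (four-core) Kaveri-class iCPU at the 7nm node.
kaveri_icpu_fraction = 0.03   # share of total APU die at 7nm

cores_scale = 8 / 4           # grow from four to eight cores
size_scale = 2                # double the size of each core

icpu_fraction = kaveri_icpu_fraction * cores_scale * size_scale
igpu_fraction = 1 - icpu_fraction

print(f"iCPU share: {icpu_fraction:.0%}")   # 12%
print(f"iGPU share: {igpu_fraction:.0%}")   # 88%
```

Even after quadrupling the iCPU's area budget, nearly nine tenths of the die remains available for the iGPU, which is the upper bound the text assigns to any discrete GPU advantage.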

To achieve supercomputers about 50 times faster, AMD will increase power consumption by about 2 times and efficiency by about 25 times. The preliminary design presented by AMD is shown in the next figure.

The 10 TFLOP/s of performance is for double-precision computations and corresponds to an efficiency of between 40 and 50 GFLOP/s per watt; the projected bandwidth to stacked DRAM is 4 TB/s. For comparison, the fastest and most expensive discrete card released by AMD, the FirePro S9150, gives a peak performance of 2.53 TFLOP/s with an efficiency of 10.8 GFLOP/s per watt and communicates with its GDDR5 memory at a peak rate of 0.32 TB/s.
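These figures imply power budgets that are worth making explicit. A small sketch, using only the numbers quoted above, derives the per-node power of the proposed APU and of the FirePro S9150:

```python
# Implied power budgets from the stated performance and efficiency
# figures (all numbers taken from the text; power = perf / efficiency).
apu_perf_gflops = 10.0 * 1000          # 10 TFLOP/s double precision
apu_eff_range = (40.0, 50.0)           # GFLOP/s per watt, quoted range

apu_power_range = tuple(apu_perf_gflops / e for e in apu_eff_range)
print(f"Exascale APU node: {apu_power_range[1]:.0f}-{apu_power_range[0]:.0f} W")

s9150_power = 2.53 * 1000 / 10.8       # FirePro S9150
print(f"FirePro S9150: {s9150_power:.0f} W")
```

In other words, the proposed APU node would stay in roughly the same 200-250 W envelope as today's S9150 while delivering about four times its double-precision throughput.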

It is a bit shocking that AMD uses a modified Bulldozer/Piledriver architecture to represent future iCPUs, especially after its former president Rory Read admitted in public that the Bulldozer family was a fiasco and would be replaced by a new architecture in 2016. It is also odd that AMD mentions the bloated x86 ISA instead of the more modern and elegant A64 ISA of ARMv8, praised by AMD's Jim Keller during a recent conference. Note that the proposed node uses 256-bit FMAC units; this implies that the octo-core iCPU in the APU would have twice the total throughput of the FX-8350, an octo-core CPU based on the Piledriver architecture.
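The factor of two can be verified by counting peak double-precision FLOPs per cycle. The FX-8350 figures follow the known Piledriver layout (two 128-bit FMACs shared per dual-core module, four modules); assuming one 256-bit FMAC per core in the proposed iCPU (an assumption, since the slide does not specify the per-core unit count):

```python
# Peak double-precision FLOPs per cycle. An FMA counts as two
# operations (multiply + add). The "one 256-bit FMAC per core"
# figure for the proposed iCPU is an assumption; the FX-8350
# layout (4 modules x 2 shared 128-bit FMACs) is the known one.
def dp_flops_per_cycle(units, bits_per_unit):
    doubles = bits_per_unit // 64      # 64-bit doubles packed per unit
    return units * doubles * 2         # x2: FMA = multiply + add

fx8350 = dp_flops_per_cycle(units=4 * 2, bits_per_unit=128)
apu = dp_flops_per_cycle(units=8, bits_per_unit=256)

print(fx8350, apu, apu / fx8350)   # 32 64 2.0
```

Doubling both the vector width and the effective number of units per core pair is what yields the 2x throughput at equal clocks.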

The central APU represented above will be complemented by a set of smaller APUs that will perform PIM (Processing In Memory). Again, this is a consequence of the principle of locality: moving memory-intensive computations closer to memory presents another opportunity to reduce both energy and data-movement overheads while increasing performance. The red lines denote optical interconnects.

As the figure shows, those smaller APUs will have NVRAM stacked directly on top. The big APU, however, will have DRAM stacked near it. Thermal challenges are a key impediment to stacking memory directly on top of a high-performance processor: heat generated by the processor cannot be conveniently dissipated through the stack of memory, requiring the throttling of processor performance; moreover, the undissipated heat also affects the stacked DRAM, increasing the required memory refresh rate. I explained above the rationale for choosing high-performance APUs over discrete CPUs and GPUs for achieving the highest performance. The reasons for choosing APUs for PIM are as follows:

Both the host and in-memory processors in our system organization are accelerated processing units (APU). Each APU consists of CPU and GPU cores on the same silicon die. We believe the choice of an APU as the in-memory processor has several benefits. First, the CPU and GPU components support familiar programming models and lower the barrier of using in-memory processors. Second, the programmability allows a broad range of applications to exploit PIM. Third, the use of existing GPU and CPU designs lowers the investment and risk inherent in developing PIM. Finally, the architectural uniformity of the host and in-memory processors ease porting of code to exploit PIM. Porting a subset of an application’s routines (e.g., the memory-intensive kernels) to PIM does not require a fundamental rewrite as the same languages, run-time systems and abstractions are supported on both host and in-memory processors. Syntactically, simply annotating the routines that are to be executed using PIM is sufficient.

The total memory per node (both DRAM and NVRAM) is 4 TB. AMD does not give any further relevant details about its design, not even core frequencies or IPC levels. However, we guess that 3 GHz for the CPU cores and 1 GHz for the GPU cores of the host APU, and 650 MHz for the PIM APUs, must be close to the final values. We also expect the use of 10nm or 7nm nodes with FinFETs. Prototypes could be ready by 2018, but we expect the first commercial products between 2020 and 2025.
