The official website for the YouTube channel Coreteks

AMD's Master Plan: Achieving Exascale through Heterogeneous Computing

For over a decade the industry has been preparing to reach a new milestone for modern computer systems: exascale computing. However, getting a computer system to 10^18 floating-point operations per second, in an era when it could no longer depend on Moore's Law or Dennard scaling, forced the industry as a whole into an unprecedented race for innovation.

In this period, AMD has been executing its plan to achieve exascale, which culminated in the emergence of HBM, the Zen architecture and other very significant developments. However, looking deeply into the latest research and patents released by AMD, we will see that AMD has not yet reached the apex of its master plan: the development of an exascale heterogeneous processor architecture. In this brief article we will examine some of the research and development AMD is doing to achieve exascale, and show that a future exascale APU is getting ever closer to becoming a reality.

AMD announcement in San Francisco, California, Wednesday, August 17, 2016.
(Photo credit: Paul Sakuma Photography)

The challenges of achieving exascale

First of all, to understand how challenging it is to achieve exascale, we need to think about the power budget of such a system. For years, HPC systems were judged almost entirely on raw computational power, with little consideration for the average energy cost per operation. Everything changed radically with the end of Dennard scaling, as it was no longer enough to simply wait for the next process node to get more operations out of the same power budget. As the cost of operation and local infrastructure for research centers rose, the existing paradigm needed radical change. After all, how many research centers in the world have their own dedicated nuclear power plant to feed their supercomputers?

AMD presentation slide on advancements for exascale computing, in 2016

Partnership programs to accelerate exascale R&D, such as the US DOE FastForward program, have been challenging the HPC community to target a 20 MW power budget for an exascale system. That limits the average energy cost per operation to 20 pJ/FLOP, a very demanding proposition, since massive innovations in architecture, memory, and interconnect were essentially required to achieve it.
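To see where that 20 pJ/FLOP limit comes from, a quick back-of-the-envelope calculation from the two figures above is enough:

```python
# Back-of-the-envelope check of the 20 pJ/FLOP target implied by the
# DOE goals: 1 exaFLOP/s of throughput inside a 20 MW power budget.
power_budget_w = 20e6   # 20 MW system power target
target_flops = 1e18     # 1 exaFLOP/s

# Energy per operation = power / throughput (J/s divided by FLOP/s).
joules_per_flop = power_budget_w / target_flops
picojoules_per_flop = joules_per_flop * 1e12

print(f"{picojoules_per_flop:.0f} pJ/FLOP")  # 20 pJ/FLOP
```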

This is because data-intensive applications on conventional von Neumann systems cause massive data movement between processors and memory elements, inducing dramatic performance and energy overheads. Without significant advances, the cost of moving data could easily exceed the cost of operating on that data. Since we cannot trust that some deus ex machina will provide a magical solution to this problem, both academia and industry focused on three major lines of research as ways to reduce the energy per operation: 3D integration, processing in memory, and non-volatile memory.
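The scale of the data-movement problem can be illustrated with rough, order-of-magnitude energy figures (the numbers below are common ballpark estimates, not values from the article):

```python
# Illustrative ballpark figures only (not from the article): why data
# movement, rather than arithmetic, can dominate the energy budget.
flop_energy_pj = 20.0     # exascale target energy per FLOP
dram_access_pj = 1000.0   # rough cost of one off-chip DRAM access (64-bit word)

# If every operand had to travel from off-chip DRAM, the data movement
# would dwarf the cost of the arithmetic itself.
ratio = dram_access_pj / flop_energy_pj
print(f"Fetching one word costs ~{ratio:.0f}x the arithmetic")
```

This is exactly why 3D integration and processing in memory, which shorten or eliminate that data path, are so attractive.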

Seeking to solve all the challenges in each of these lines of research, AMD has developed an innovative master plan to achieve exascale, integrating all these technologies in a single system.

The Exascale Node Architecture

Seeking to address all these challenges, in 2016 AMD presented its vision of how to build an exascale APU that would combine many leading-edge technologies in a single massive package, using aggressive die-stacking and chiplet technologies and advanced 3D memory systems. The proposed exascale APU consisted of two CPU clusters, each with four CPU chiplets stacked on an active interposer base die, interconnected with four GPU clusters, each GPU chiplet carrying an HBM module stacked on top. This arrangement provides the necessary computational throughput while minimizing memory-related data-movement energy and total package footprint.

The Exascale APU [Link]

In the original article, even using a relatively modest configuration (32 CPU cores, 320 CUs, 3 TB/s of memory bandwidth, 160 W node power budget), the authors showed through simulation that it is feasible to develop an APU as a solution for exascale computing. (Given the performance goal of 1 exaflop and a power budget of 20 MW for a 100,000-node machine, each node would need to provide 10 TFLOPs of performance in a 200 W power envelope.) However, when we look at the latest research from AMD alongside some of its most recently published patents, we realize that a possible implementation of an exascale APU could go well beyond what was originally envisaged in that article.
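The per-node numbers quoted in parentheses follow directly from the system-level targets:

```python
# Dividing the system-level exascale targets across the proposed
# 100,000-node machine gives the per-node budget cited in the paper.
nodes = 100_000
system_flops = 1e18     # 1 exaFLOP/s
system_power_w = 20e6   # 20 MW

tflops_per_node = system_flops / nodes / 1e12   # TFLOP/s per node
watts_per_node = system_power_w / nodes          # W per node

print(tflops_per_node, "TFLOP/s per node,", watts_per_node, "W per node")
```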

In reality, it is still very likely that in its initial implementation AMD will design the exascale APU using the originally proposed 2.5D approach. However, there is a possibility that AMD may eventually introduce a 3D design for both the CPU and the GPU in a future implementation of the Exascale Heterogeneous Processor (EHP). Several patents have emerged showing AMD's extensive effort in this direction, such as the patent below, which shows a 3D chip stack with integrated voltage regulation.

Patent: 3D chip stack with integrated voltage regulation – AMD [link]

Naturally, one of the major 3D-IC design problems is heat dissipation and thermal distribution within the die stack, something the article (linked above) covered extensively because of the 3D stacking of HBM modules on GPU chiplets. In the past year, several patents have emerged showing how AMD plans to handle these problems. The first proposed solution (1) is to use dummy materials that are thermally and mechanically coupled to the die stack so as to improve thermal conduction between the processor and the IHS. The second proposed solution (2) is to use dummy TSVs (through-silicon vias) to improve thermal distribution by reducing heat concentration at critical points inside the stack. Without this architectural trick, those hot spots could cause a significant loss of performance due to thermal throttling, decrease processor lifespan, or even destroy the chip in extreme cases. As we can see, AMD has developed passive dissipation solutions in these patents to mitigate some of these problems in its architectural design. Nothing out of the ordinary so far…

LEFT Patent: Arrangement and thermal management of 3d stacked dies – AMD [Link]
RIGHT Patent: Dummy TSV to improve process uniformity and heat dissipation – AMD [Link]

However, the most daring and innovative solution AMD has proposed for its future 3D-IC thermal design is to use a thermoelectric device to improve heat dissipation (3). Based on the combined use of the Peltier effect (production of a temperature gradient at a semiconductor junction subjected to an electrical potential difference) and the Seebeck effect (production of an electrical potential difference at a semiconductor junction subjected to a temperature gradient), AMD's patent describes an energy-efficient thermoelectric cooling device, where cooling is achieved with minimal external energy, making it a viable thermal solution for the EHP.
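For a rough sense of the physics involved, the Seebeck voltage available from a waste-heat gradient can be estimated with textbook values (the coefficient and temperature delta below are generic assumptions, not figures from the patent):

```python
# Textbook Seebeck estimate (generic values, not from the patent).
seebeck_v_per_k = 200e-6   # ~200 uV/K, typical of Bi2Te3-class thermoelectrics
delta_t_k = 30.0           # assumed temperature gradient across the stack

# Seebeck effect: the gradient itself produces a usable voltage, which is
# why the patented cooler can operate with minimal external energy.
open_circuit_v = seebeck_v_per_k * delta_t_k
print(f"{open_circuit_v * 1e3:.1f} mV per junction")  # 6.0 mV
```

Small per-junction voltages like this are why practical thermoelectric devices stack many junctions in series.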

LEFT Patent: Integrated thermoelectric cooler for three-dimensional stacked DRAM and temperature-inverted cores – AMD – In the image: An overview of the thermoelectric device applied between logic and memory dies. [Link]

RIGHT (Cont.) Patent: Integrated thermoelectric cooler for three-dimensional stacked DRAM and temperature-inverted cores – AMD – In the image: A conceptual view of the thermoelectric device developed by AMD.

Thus, all of these innovations used together make even more extreme configurations possible, such as the 3D stacking of an HBM module directly on a GPU, as widely discussed in the original article and shown in the following patent.

Patent: Extreme-bandwidth scalable performance-per-watt GPU architecture – AMD [Link]. In the image: A simplified representation of the 3D stacking between the HBM module and a GPU, proposed in the patent.

Processing in memory and non-volatile memory technologies: The next frontier

The industry as a whole has been working on in-memory computing and non-volatile memory solutions as part of its strategy to reduce the energy cost per operation. Although not widely discussed in the original article, over the years AMD has also been working on these key technologies as part of its strategy to achieve exascale. One of the initial strategies for introducing these technologies in the exascale APU would be to implement in-memory processing of some bitwise operations using the available L3 SRAM, as shown in the following patents. In them we find some of the methods necessary for large-scale acceleration through processing in memory, such as a means of determining whether the cost of operating in memory exceeds that of operating in the processor, and the methods necessary to maintain cache coherence.
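The offload decision those patents describe can be sketched abstractly; the cost model below is purely hypothetical and is not AMD's actual method:

```python
# Hypothetical cost model (not AMD's actual method): offload a bitwise
# operation to the memory array only when moving its operands to the host
# would cost more energy than executing it in memory.
def should_offload_to_pim(bytes_touched: int,
                          link_pj_per_byte: float = 10.0,
                          host_op_pj: float = 100.0,
                          pim_op_pj: float = 300.0) -> bool:
    host_cost = host_op_pj + bytes_touched * link_pj_per_byte  # compute + transfer
    pim_cost = pim_op_pj                                       # data never moves
    return pim_cost < host_cost

print(should_offload_to_pim(4))     # small operand: stay on the host -> False
print(should_offload_to_pim(4096))  # large operand: offload -> True
```

The real mechanism would also have to account for cache coherence, which is exactly what the second patent addresses.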

Patent: In memory logic functions using memory arrays – AMD [link]. In the image: block diagram of one implementation of a memory.
Patent: Cache coherence for processing in memory – AMD [link]. In the image: functional block diagram of an exemplary coherent cache link between a host processor and a processing-in-memory device.

Another possible strategy for introducing these technologies could be the development of a huge L4 cache using STT-MRAM, which could also include in-memory computing. Although there is still a considerable gap between the read and write latencies of spin-transfer torque RAM, a growing body of research suggests that an L4 cache built with STT-MRAM could become a reality very soon.

In both the Intel and AMD camps, research has shown that design trade-offs can significantly reduce write latency, making STT-MRAM a viable solution for a huge LLC. In addition, when we look at research on the manufacturing process, from both Intel and TSMC (the foundry currently responsible for manufacturing AMD processors), we see very significant advances. Last year at IEDM 2019, Intel presented its progress toward a viable L4 cache solution using STT-MRAM, building 2 MB arrays of scaled MTJ devices that meet L4 cache specifications across all operating temperatures, demonstrating a 20 ns write time, a 4 ns read time, endurance of 10^12 cycles, and memory retention at elevated temperature (110 °C). This February, TSMC will present at ISSCC 2020 a work titled “A 22nm 32Mb Embedded STT-MRAM with 10ns Read Speed, 1M Cycle Write Endurance, 10 Years Retention at 150°C and High Immunity to Magnetic Field Interference”.
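Using the MTJ figures Intel reported (4 ns read, 20 ns write), the effective latency of a hypothetical STT-MRAM L4 depends heavily on the read/write mix; the 70/30 split below is an assumption for illustration, not a measured workload:

```python
# Effective STT-MRAM access latency for a given read/write mix, using the
# IEDM 2019 figures (4 ns read, 20 ns write). The workload mix is assumed.
read_ns, write_ns = 4.0, 20.0
read_fraction = 0.7   # hypothetical: 70% reads, 30% writes

effective_ns = read_fraction * read_ns + (1 - read_fraction) * write_ns
print(f"Effective access latency: {effective_ns:.1f} ns")  # 8.8 ns
```

Read-heavy workloads therefore see latencies much closer to the fast read path, which is one reason an STT-MRAM LLC is plausible despite the slow writes.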

In the end, everything suggests that both companies will be able to deliver LLC solutions using this type of non-volatile memory in the coming years, and that such a solution may well find its way into the exascale APU proposed by AMD.

A final insight

In these concluding notes I would like to emphasize to the reader the speculative character of this article. Even a careful analysis of patents and related research does not confirm the existence of future products. Even though there is demonstrably great research being done, and patents that could plausibly enable the development of an exascale APU, it is still quite possible that it will never become a reality. Everything will depend on AMD's success in the coming years, with successive Zen and Radeon generations to come…

What this article does highlight, however, is the great research and development work that AMD has been doing over the past decade, which should eventually give the world a brilliant heterogeneous solution for exascale. In the author's view, AMD's future has never looked so bright, and it seems very possible that a huge AMD Italian fortress will dominate the HPC market in the coming years.

Underfox is a Physicist, Telecom Engineering lover, HPC Enthusiast and Prog Rock/Metal fan. Underfox’s views are his own and do not necessarily reflect Coreteks’s. You can find Underfox on Twitter @Underfox3

Some references and further reading:



The Optimist, the Pessimist, and the Global Race to Exascale in 20 Megawatts


Patent: In memory logic functions using memory arrays – AMD

Patent: Programming In-memory accelerators to improve the efficiency of datacenter operations – AMD

Patent: Cache coherence for processing in memory – AMD

Patent: Method and apparatus for controlling cache line storage in cache memory – AMD

Patent: Nondeterministic memory access requests to non-volatile memory – AMD

Patent: Integrated thermoelectric cooler for three-dimensional stacked DRAM and temperature-inverted cores – AMD

Patent: High-performance on-module caching architectures for non-volatile dual in-line memory module (nvdimm) – AMD

Patent: Die stacking for multi-tier 3D integration – AMD

Patent: Low power and low latency GPU coprocessor for persistent computing – AMD

Patent: Memory pools in a memory model for a unified computing system – AMD

Patent: Cache coherency using die-stacked memory device with logic die – AMD

Patent: Out-of-Order Cache Returns – AMD

Patent: Offset-aligned three-dimensional integrated circuit – AMD

Patent: Multi-chip package with offset 3D structure – AMD

Patent: Method and apparatus for power delivery to a die stack via a heat spreader – AMD

Patent: Configuration of multi-die modules with through-silicon vias – AMD

Patent: Mechanisms to improve data locality for distributed GPUs – AMD

Patent: Extreme-bandwidth scalable performance-per-watt GPU architecture – AMD

Patent: 3d chip stack with integrated voltage regulation – AMD

Patent: Self identifying interconnect topology – AMD

Patent: Method and apparatus of integrating memory stacks – AMD

