For over a decade, the industry has been preparing to reach a new level for modern computer systems: exascale computing. However, getting a computer system to reach 10^18 floating point operations per second at a time when it could no longer depend on Moore’s Law or Dennard scaling forced the industry as a whole into an unprecedented race for innovation.
In this period, AMD has been executing its plans to achieve exascale, which culminated in the emergence of HBM, the Zen architecture and other very significant developments. However, looking closely at the latest research and patents released by AMD, we will see that AMD has not yet reached the apex of its master plan: the development of an exascale heterogeneous processor architecture. In this brief article we will present some of the developments and research that AMD is pursuing to achieve exascale, and show that a future exascale APU is getting closer to becoming a reality.
The challenges to achieve exascale.
To understand how challenging it is to achieve exascale, we first need to think about the power budget of such a system. For years, HPC systems were designed around raw computational power, with little consideration for the average energy cost per operation. Everything changed radically with the end of Dennard scaling, as it was no longer enough to wait for the next node to get more operations within the same power budget. As the cost of operation and local infrastructure for research centers has risen, the existing paradigm needed radical change. After all, how many research centers in the world have their own nuclear power plant as an energy solution for their supercomputers?
Partnership programs to accelerate exascale R&D, like US DOE FastForward, have been encouraging and challenging the HPC community to target a 20 MW power budget for an exascale system, which would limit the average energy cost per operation to 20 pJ/FLOP. This is a very challenging proposition, since massive innovations in architecture, memory, and interconnect are required to achieve these goals.
The difficulty arises because data-intensive applications on conventional von Neumann systems cause massive data movement between processors and memory elements and induce dramatic performance and energy overheads. Without significant advances, the cost of moving data could easily exceed the cost of operating on that data. Since we cannot trust that any deus ex machina will provide us with a magical solution to this problem, both academia and industry have focused on three major lines of research as possible ways to reduce the amount of energy per operation: 3D integration, processing in memory and non-volatile memory.
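The 20 pJ/FLOP figure follows directly from dividing the power budget by the target throughput. A minimal sketch of that arithmetic, where the function and constant names are only illustrative:

```python
# Energy budget per floating point operation under a system-level power cap.
# The two constants are the targets quoted in the text; the names are ours.
EXAFLOP = 1e18          # floating point operations per second
POWER_BUDGET_W = 20e6   # 20 MW system power budget


def energy_per_flop_pj(power_watts: float, flops_per_second: float) -> float:
    """Average energy per operation in picojoules (1 W = 1 J/s)."""
    joules_per_flop = power_watts / flops_per_second
    return joules_per_flop * 1e12  # joules -> picojoules


budget_pj = energy_per_flop_pj(POWER_BUDGET_W, EXAFLOP)
print(f"{budget_pj:.1f} pJ/FLOP")  # prints "20.0 pJ/FLOP"
```

Note that this is an average over the whole machine, so it has to cover memory, interconnect and every other overhead, not only the arithmetic units.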
Seeking to solve all the challenges in each of these lines of research, AMD has developed an innovative master plan to achieve exascale, integrating all these technologies in a single system.
The Exascale Node Architecture
In 2016, AMD presented its vision of how to build an exascale APU that would include many leading-edge technologies in a single massive package, using aggressive die-stacking and chiplet technologies together with advanced 3D memory systems. The proposed exascale APU consisted of two CPU clusters, each with four CPU chiplets, stacked on an active interposer base die and interconnected with four GPU clusters, with each GPU chiplet carrying an HBM module stacked on top. This arrangement provides the necessary computational throughput while minimizing memory-related data movement energy and total package footprint.
In the original article, even using a relatively modest configuration (32 CPU cores, 320 CUs, 3 TB/s memory bandwidth, 160 W node power budget), the authors showed through simulations that it is feasible to develop an APU as a solution to achieve exascale computing. (Given the performance goal of 1 exaflop and a power budget of 20 MW for a 100,000-node machine, each node would provide 10 TFLOPS of performance in a 200 W power envelope.) However, when we look at the latest research developed by AMD in conjunction with some of its most recently published patents, we realize that a possible implementation of an exascale APU can go well beyond what was envisaged in the original article.
In reality, it is still very likely that, in its initial implementation, AMD will design the exascale APU using the originally proposed 2.5D implementation. However, AMD may eventually introduce a 3D design for both CPU and GPU in a future implementation of the EHP (Exascale Heterogeneous Processor). Several patents have emerged showing AMD’s extensive effort in this direction, such as the patent describing a 3D chip stack with integrated voltage regulation.
Naturally, one of the major 3D-IC design problems is heat dissipation and thermal distribution within the die stack, something that was covered extensively in the original article because of the 3D stacking of HBM modules on GPU chiplets. In the past year, some patents have emerged showing how AMD plans to handle these problems. The first proposed solution (1) would be to use dummy materials that are thermally and mechanically connected to the die stack in such a way as to improve thermal conduction between the processor and the IHS. The second proposed solution (2) would be to use dummy TSVs (through-silicon vias) to improve the thermal distribution by decreasing the heat concentration at critical points inside the stack. Without this architectural trick, such hot spots could cause a significant loss of performance due to thermal throttling, decrease the processor’s lifespan or even destroy it in extreme cases. As we can see, AMD has developed passive dissipation solutions in these patents to mitigate some of these problems in its architectural design. Nothing out of the ordinary so far…
However, the most daring and innovative solution proposed by AMD for its future 3D-IC thermal design is the use of a thermoelectric device to improve heat dissipation (3). Based on the combined use of the Peltier effect (production of a temperature gradient at a semiconductor junction when subjected to an electrical potential difference) and the Seebeck effect (production of an electrical potential difference at a semiconductor junction when subjected to a temperature gradient), AMD’s patent describes an energy-efficient thermoelectric cooling device in which cooling is achieved with minimal external energy, making it a viable thermal solution for the EHP.
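The per-node targets in the parenthetical above can be reproduced by splitting the system goals evenly across nodes. The sketch below assumes an even split with no allowance for shared infrastructure, and its function and variable names are illustrative only:

```python
# Per-node performance and power targets for an evenly partitioned machine.
# Inputs are the figures quoted in the text; an even split is an assumption.
def per_node_targets(system_flops: float, system_power_w: float, nodes: int):
    """Return (FLOP/s per node, watts per node)."""
    return system_flops / nodes, system_power_w / nodes


node_flops, node_power_w = per_node_targets(1e18, 20e6, 100_000)
node_tflops = node_flops / 1e12
# Efficiency each node must sustain to stay inside its power envelope.
node_gflops_per_watt = node_flops / node_power_w / 1e9

print(f"{node_tflops:.0f} TFLOPS per node in {node_power_w:.0f} W")
print(f"{node_gflops_per_watt:.0f} GFLOPS/W required")
```

The resulting efficiency requirement is simply the system-wide energy-per-operation limit expressed at node level.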
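For readers less familiar with the two effects, the idealized textbook relations are a Seebeck voltage V = S·ΔT and a Peltier heat rate Q = Π·I, with Π = S·T (the second Kelvin relation). The sketch below only evaluates these relations. It is not a model of the device in AMD’s patent, it ignores Joule heating and heat conduction back through the device, and every numeric value in it is an illustrative placeholder rather than a measured figure.

```python
# Idealized thermoelectric relations; all numeric inputs below are placeholders.
def seebeck_voltage(seebeck_v_per_k: float, delta_t_k: float) -> float:
    """Open-circuit voltage produced by a temperature difference: V = S * dT."""
    return seebeck_v_per_k * delta_t_k


def peltier_heat_rate(seebeck_v_per_k: float, junction_temp_k: float,
                      current_a: float) -> float:
    """Heat pumped per second at a junction: Q = Pi * I, with Pi = S * T."""
    peltier_coeff_v = seebeck_v_per_k * junction_temp_k
    return peltier_coeff_v * current_a


S = 200e-6  # V/K, placeholder Seebeck coefficient
harvested_v = seebeck_voltage(S, delta_t_k=30.0)
pumped_w = peltier_heat_rate(S, junction_temp_k=350.0, current_a=2.0)
print(f"Seebeck voltage: {harvested_v * 1e3:.1f} mV")
print(f"Ideal Peltier heat rate: {pumped_w:.2f} W")
```

The point of combining both effects is that a temperature difference elsewhere in the stack can supply part of the electrical energy that the cooling junction consumes.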
Thus, using all of these innovations together makes even more extreme applications such as 3D stacking of an HBM module on a GPU possible, as widely discussed in the original article and shown in the following patent.
Processing in memory and non-volatile memory technologies: The next frontier.
The industry as a whole has been working on developing in-memory computing and non-volatile memory solutions as part of its strategy to reduce energy costs per operation. Although not widely discussed in the original article, over the years AMD has also been working on these key technologies in its strategy to achieve exascale. One of the initial strategies for introducing these technologies in the exascale APU would be to implement processing in memory of some bitwise operations using the available L3 SRAM, as shown in the following patents. In them we can find some of the methods necessary for the large-scale implementation of acceleration through processing in memory, such as a means of determining whether the cost of operating in memory is greater than that of operating in the processor, or the methods necessary to maintain cache coherence.
Another possible strategy to introduce these technologies could be through the development of a huge L4 cache using STT-MRAM, which could also include in-memory computing. Although there is still a considerable difference between the read and write latency of spin-transfer torque RAM, many studies have shown that an L4 cache using STT-MRAM could become a reality very soon.
In both the Intel and AMD camps, some studies have shown that certain design trade-offs can significantly reduce these write latencies, making STT-MRAM a viable solution for a huge LLC. In addition, when we look at research related to the manufacturing process, both from Intel and from TSMC (the foundry currently responsible for manufacturing AMD processors), we see very significant advances. Last year at IEDM 2019, Intel presented its advances in the development of a viable L4 cache solution using STT-MRAM, building 2MB arrays of scaled MTJ devices that meet L4 cache specifications across all operating temperatures, demonstrating a 20ns write time, 4ns read time, endurance of 10^12 cycles, and memory retention at elevated temperature (110°C). This year, in February, TSMC will present at ISSCC 2020 a work titled “A 22nm 32Mb Embedded STT-MRAM with 10ns Read Speed, 1M Cycle Write Endurance, 10 Years Retention at 150°C and High Immunity to Magnetic Field Interference”.
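To make the idea of such a cost comparison concrete, here is a minimal sketch of an offload decision based on an energy estimate. It is not the method described in the patents: the cost model, the class, the function names and every number are hypothetical and serve only to show the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass
class BitwiseOp:
    """A candidate operation; every field is a hypothetical model input."""
    bytes_touched: int            # operand bytes that would travel to the core
    core_compute_pj: float        # energy to execute on the processor core
    in_memory_compute_pj: float   # energy to execute inside the memory array


def energy_on_core_pj(op: BitwiseOp, move_pj_per_byte: float) -> float:
    """Running on the core pays for data movement plus the computation."""
    return op.bytes_touched * move_pj_per_byte + op.core_compute_pj


def should_run_in_memory(op: BitwiseOp, move_pj_per_byte: float) -> bool:
    """Offload only when in-memory execution is estimated to be cheaper."""
    return op.in_memory_compute_pj < energy_on_core_pj(op, move_pj_per_byte)


MOVE_PJ_PER_BYTE = 2.0  # placeholder data movement cost
small_op = BitwiseOp(bytes_touched=64, core_compute_pj=5.0,
                     in_memory_compute_pj=400.0)
large_op = BitwiseOp(bytes_touched=4096, core_compute_pj=300.0,
                     in_memory_compute_pj=2000.0)

print(should_run_in_memory(small_op, MOVE_PJ_PER_BYTE))  # False
print(should_run_in_memory(large_op, MOVE_PJ_PER_BYTE))  # True
```

With these placeholder numbers, the small operation stays on the core because the fixed cost of in-memory execution dominates, while the large one is offloaded because moving its operands would cost more than computing in place.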
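To get a feel for what the read/write asymmetry in Intel’s figures means, the sketch below computes a simple weighted mean access latency from the 4ns read and 20ns write times quoted above. The read fractions are hypothetical workload mixes, and the model deliberately ignores write buffering, queuing and bank-level parallelism, all of which would matter in a real cache.

```python
# Weighted mean latency for an asymmetric memory. The read/write times are the
# figures quoted in the text; the read fractions are hypothetical mixes.
READ_NS = 4.0
WRITE_NS = 20.0


def mean_access_latency_ns(read_fraction: float,
                           read_ns: float = READ_NS,
                           write_ns: float = WRITE_NS) -> float:
    """Average latency per access for a given share of reads (0.0 to 1.0)."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    return read_fraction * read_ns + (1.0 - read_fraction) * write_ns


for read_fraction in (0.5, 0.8, 0.95):
    latency = mean_access_latency_ns(read_fraction)
    print(f"{read_fraction:.0%} reads -> {latency:.1f} ns mean latency")
```

Even in this crude model, the more read-dominated the traffic reaching the cache, the less the slower writes weigh on the average.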
In the end, everything suggests that both companies will be able to present solutions for LLC using this type of non-volatile memory in the coming years and that this solution may be included in the exascale APU proposed by AMD.
A final insight
I would like to emphasize to the reader the speculative character of this article in these concluding notes. Even a careful analysis of patents and their related research does not confirm the existence of future related products. Even though there is demonstrably great research being done, and there are patents that would enable the development of an exascale APU, it is still quite possible that this will not become a reality. Everything will depend on AMD’s success in the coming years, with successive Zen and Radeon generations to come…
However, what can be highlighted in this article is the great research and development work that AMD has been doing over the past decade, which should eventually give the world a brilliant heterogeneous solution to exascale. In the author’s view, AMD’s future has never looked so bright, and it seems very possible that a huge AMD Italian fortress will dominate the HPC market in the coming years.
Underfox is a Physicist, Telecom Engineering lover, HPC Enthusiast and Prog Rock/Metal fan. Underfox’s views are his own and do not necessarily reflect Coreteks’s. You can find Underfox on Twitter @Underfox3
Some references and further reading:
The Optimist, the Pessimist, and the Global Race to Exascale in 20 Megawatts – https://ieeexplore.ieee.org/document/6128005
Patent: In memory logic functions using memory arrays – AMD http://www.freepatentsonline.com/20190334524.pdf
Patent: Programming In-memory accelerators to improve the efficiency of datacenter operations – AMD https://patentimages.storage.googleapis.com/21/35/d4/a5e744755ee516/US20180081583A1.pdf
Patent: Cache coherence for processing in memory – AMD https://patentimages.storage.googleapis.com/9f/8e/40/cf03b3804a7188/US20170344479A1.pdf
Patent: Method and apparatus for controlling cache line storage in cache memory – AMD http://www.freepatentsonline.com/20190205253.pdf
Patent: Nondeterministic memory access requests to non-volatile memory – AMD https://patentimages.storage.googleapis.com/df/60/83/d2e04600379db2/US20180060257A1.pdf
Patent: Integrated thermoelectric cooler for three-dimensional stacked DRAM and temperature-inverted cores – AMD https://patentimages.storage.googleapis.com/59/a8/51/c2e7e43372021c/US10210912.pdf
Patent: High-performance on-module caching architectures for non-volatile dual in-line memory module (nvdimm) – AMD http://www.freepatentsonline.com/20190189210.pdf
Patent: Die stacking for multi-tier 3D integration – AMD http://www.freepatentsonline.com/20190371763.pdf
Patent: Low power and low latency GPU coprocessor for persistent computing – AMD http://www.freepatentsonline.com/20180144435.pdf
Patent: Memory pools in a memory model for a unified computing system – AMD http://www.freepatentsonline.com/20190303302.pdf
Patent: Cache coherency using die-stacked memory device with logic die – AMD http://www.freepatentsonline.com/9170948.pdf
Patent: Out-of-Order Cache Returns – AMD http://www.freepatentsonline.com/20180165790.pdf
Patent: Offset-aligned three-dimensional integrated circuit – AMD http://www.freepatentsonline.com/20190326272.pdf
Patent: Multi-chip package with offset 3D structure – AMD https://patentscope.wipo.int/search/docs2/pct/WO2019209460/pdf/xa9embc93bq3-li5rOozU5UphNaIDFICx88KbjuoMJ-6ff7Xa_Od9ZojT49oOF8WIDqGX3LRKonzerYE4PPN1pt1FvtGqMQJmdhXx9Z8I2jOI7MXDffDWsN201m6wg-w?docId=id00000051005547
Patent: Method and apparatus for power delivery to a die stack via a heat spreader – AMD http://www.freepatentsonline.com/20190333876.pdf
Patent: Configuration of multi-die modules with through-silicon vias – AMD http://www.freepatentsonline.com/20190332561.pdf
Patent: Mechanisms to improve data locality for distributed GPUs – AMD http://www.freepatentsonline.com/20180115496.pdf
Patent: Extreme-bandwidth scalable performance-per-watt GPU architecture – AMD http://www.freepatentsonline.com/20190196742.pdf
Patent: 3D chip stack with integrated voltage regulation – AMD https://patentimages.storage.googleapis.com/32/bb/f6/6ef8ec827402bd/US20190103153A1.pdf
Patent: Self identifying interconnect topology – AMD http://www.freepatentsonline.com/20190199617.pdf
Patent: Method and apparatus of integrating memory stacks – AMD https://patentimages.storage.googleapis.com/33/a9/fd/ad7ab064979843/US20180341613A1.pdf