Coreteks

The official website for the youtube channel Coreteks

AMD Master Plan Pt.2 – Heterogeneous Revolution

The death of Dennard scaling and the rise of the phenomenon known as “dark silicon” have brought the IC industry to a crossroads. How can microprocessors keep improving under current power constraints without raising on-chip heat density to the point where it reduces reliability and useful life? How can single- and multi-threaded performance keep growing without an excessive increase in cooling cost, so that innovation can still reach environments with significant cooling limitations? In the real world, where no amount of financial horsepower can break the laws of physics, a fundamental rethink of architectures is necessary if semiconductor companies are to continue innovating.

42 Years of Microprocessor Trend Data. Orange: Moore’s Law trend; Purple: Dennard scaling breakdown; Green & Red: immediate implications of the Dennard scaling breakdown; Blue: slowdown in single-thread (ST) performance gains; Black: focus switches to higher parallelism. Source: https://bit.ly/3ljVkDR

In 2012, AMD therefore adopted a new approach to the development of its architecture: the Heterogeneous System Architecture (HSA). Using virtual memory, memory coherence, and architected dispatch mechanisms, it would allow the GPU to operate as a true peer of the host CPU, so that the two work together much more efficiently and applications that take advantage of both become much easier to write. However, AMD’s plans were stalled by Bulldozer’s outright failure, along with many poor management decisions that seriously put the company’s future at risk. With “innovate to survive” as its only remaining option, AMD continued to develop its heterogeneous vision, and its most recently published patents now give real insight into the scale of the revolution AMD is preparing.


AMD heterogeneous CPU – Enabling Multi-ISA heterogeneity using x86 as ISA superset

Fig. 1 – A heterogeneous processor system that includes a high-feature processor configured to support the entirety of a set of ISA features and a low-feature processor configured to support a subset of that set. [link]

The first major disruptive move visible in AMD’s patents is the development of a new heterogeneous CPU that takes heterogeneity to the limit. Unlike ARM big.LITTLE, where both core types implement the same ISA, this processor pairs a large “high-feature” processor configured to support the entire set of x86 ISA features with a small “low-feature” processor configured to support only a compact subset of them (fig. 1). The way the patent describes this heterogeneous processor reminded me of a brilliant paper presented last year at HPCA, which was among the symposium’s best paper nominees.
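To make the high-feature/low-feature split concrete, here is a minimal sketch of how a scheduler could decide where a thread runs. Everything named here is an assumption for illustration: the feature groups, the `CORES` table, and the `pick_core` helper are hypothetical and do not come from the patent, which does not enumerate which x86 features the low-feature processor keeps.

```python
from enum import Flag, auto


class IsaFeature(Flag):
    # Hypothetical feature groups; the patent does not list these names.
    BASE = auto()
    X87 = auto()
    SSE = auto()
    AVX = auto()


FULL_ISA = IsaFeature.BASE | IsaFeature.X87 | IsaFeature.SSE | IsaFeature.AVX

# Assumed split: the low-feature core implements only a subset of the full ISA.
CORES = {
    "high_feature": FULL_ISA,
    "low_feature": IsaFeature.BASE | IsaFeature.SSE,
}


def pick_core(required: IsaFeature) -> str:
    """Prefer the low-feature core when it implements every feature needed."""
    if required & CORES["low_feature"] == required:
        return "low_feature"
    # Fall back to the core that supports the entire feature set.
    return "high_feature"


print(pick_core(IsaFeature.BASE | IsaFeature.SSE))  # low_feature
print(pick_core(IsaFeature.BASE | IsaFeature.AVX))  # high_feature
```

The only point of the sketch is the subset test: code that stays inside the compact subset can run on the small core, and anything else has to go to the large one.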

Fig. 2 – A composite x86 ISA architecture, employing compact cores that implement fully customized x86 ISAs, derived from a single large x86 ISA superset. [link]


Fig. 3 – Conventional heterogeneous multi-core design employing the x86 ISA.

In their HPCA 2019 paper, the researchers proposed a composite-ISA architecture that employs compact cores implementing fully customized x86 ISAs, all derived from a single large x86 ISA superset. They showed that this approach can outperform fully heterogeneous-ISA designs, thanks to the greatly increased flexibility in creating cores that mix and match specific sets of features. They proposed two sets of designs. Designs optimized for multi-threaded mixed workloads achieved an 18% performance improvement and a 35% reduction in energy-delay product (EDP) over single-ISA heterogeneous designs, without sacrificing most of the benefits of a single ISA. Designs optimized for single-thread performance and efficiency achieved a speedup of 20% and an EDP reduction of 28% on average.
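Since energy-delay product is simply energy multiplied by execution time, the quoted figures can be turned into a rough energy estimate. The sketch below is back-of-the-envelope arithmetic under two assumptions of mine, not results from the paper: that "performance improvement" means an execution-time speedup, and that the reported averages can be combined this way (averages over workloads do not compose exactly). The function names are hypothetical.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def implied_energy_ratio(speedup: float, edp_reduction: float) -> float:
    """Energy ratio (new / old) implied by a speedup and an EDP reduction.

    delay_new  = delay_old / speedup
    edp_new    = (1 - edp_reduction) * edp_old
    energy_new = edp_new / delay_new
               = (1 - edp_reduction) * speedup * energy_old
    """
    return (1.0 - edp_reduction) * speedup


# Multi-threaded designs: 18% faster, 35% lower EDP.
print(round(implied_energy_ratio(1.18, 0.35), 3))  # 0.767
# Single-thread designs: 20% faster, 28% lower EDP.
print(round(implied_energy_ratio(1.20, 0.28), 3))  # 0.864
```

Read this way, the numbers would correspond to roughly 23% and 14% less energy respectively, on top of the speedups, but treat that as an illustration of the metric rather than a figure reported by the authors.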

Thus, looking at the results of the massive design space exploration carried out in this work, and paying due attention to the proposed compiler and runtime strategies, we get an indication of the performance improvements and power reductions that future AMD heterogeneous processors could bring to the mobile market. In addition, there is already a wide range of related patents covering pipeline recovery (fig. 4), cache coherence (fig. 5), and wakeup latency (fig. 6) that could be folded into this new design approach and bring even greater improvements to the architecture as a whole.

Fig. 4 – A new method for performing efficient processor pipeline flush recovery [link]

Fig. 5 – Locality-aware and sharing-aware cache coherence for collections of processors [link]


Fig. 6 – Method for reducing chiplet interrupt latency [link]

However, AMD went even further in exploring new heterogeneous solutions for its products. In high-performance computing, where energy efficiency and chip utilization need to be pushed to the limit, AMD is bringing the same heterogeneous revolution to GPUs.


Heterogeneous GPU – Maximizing chip utilization through the use of variable width SIMD units

Fig. 7 – Heterogeneous graphics processing unit for scheduling thread groups for execution on variable width SIMD units [link]


Perhaps even more impressive is a new patent filed by AMD that aims to improve chip utilization in its Exascale projects. As you may know, many GPU workloads are non-uniform and have numerous wavefronts with predicated-off threads. Unfortunately, predicated-off threads still occupy SIMD lanes: they take up space, waste power, and generate heat while producing no useful output. Even the most modern GPU microarchitectures cannot cope with certain dynamic runtime behaviors that are very difficult to predict at compile time.
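The cost of predicated-off threads can be expressed as a simple lane utilization figure. The following is a minimal sketch, assuming the execution mask is an integer with one bit per thread and a 64-thread wavefront (the size used by AMD's GCN GPUs); the `lane_utilization` helper is a hypothetical name of mine.

```python
def lane_utilization(exec_mask: int, wavefront_size: int = 64) -> float:
    """Fraction of a wavefront's lanes that do useful work.

    Bit i of exec_mask is 1 when thread i is active and 0 when it is
    predicated off.
    """
    lane_bits = (1 << wavefront_size) - 1
    active = bin(exec_mask & lane_bits).count("1")
    return active / wavefront_size


# Only the lowest 16 of 64 threads are active.
mask = 0x0000_0000_0000_FFFF
print(lane_utilization(mask))  # 0.25
```

In this example three quarters of the lanes are occupied by threads that produce nothing, which is exactly the waste the patent is trying to recover.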

To solve this problem, AMD has proposed a disruptive approach that pushes chip utilization to the limit: a new GPU architecture whose SIMD units have different numbers of ALUs, so that each SIMD unit can run a different number of threads (fig. 7). By providing, within each GPU compute unit, a set of execution resources tailored to a range of execution profiles, the GPU can handle irregular workloads more efficiently.

This approach also works very well with branch divergence within a wavefront. When a branch diverges, some threads follow one control flow path and the others do not, which means that many threads are predicated off and only a subset of threads is effectively running at any time. When it is determined that the active threads fit in a smaller-width SIMD unit, they are moved to that unit, and any unused SIMD units are not powered up. Likewise, if control divergence or other problems reduce the number of active threads in a wavefront, the narrower execution unit can also be the more efficient choice.
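The scheduling idea described above can be sketched in a few lines. This is an illustration under my own assumptions, not the mechanism claimed in the patent: the available widths in `SIMD_WIDTHS` are hypothetical, masks are plain integers with one bit per thread, and the sketch ignores how active threads would be compacted into the narrower unit.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical SIMD widths available in one compute unit.
SIMD_WIDTHS = (2, 4, 8, 16)


def split_on_branch(exec_mask: int, taken_mask: int) -> Tuple[int, int]:
    """Split the active threads into (taken, not-taken) execution masks."""
    return exec_mask & taken_mask, exec_mask & ~taken_mask


def smallest_fitting_width(
    exec_mask: int, widths: Sequence[int] = SIMD_WIDTHS
) -> Optional[int]:
    """Narrowest SIMD width that holds all active threads in one pass."""
    active = bin(exec_mask).count("1")
    if active == 0:
        return None  # nothing to schedule, no unit needs to be powered up
    for width in sorted(widths):
        if active <= width:
            return width
    return max(widths)  # more threads than lanes: needs several passes


# 16 active threads, 3 of which take the branch.
taken, not_taken = split_on_branch(0xFFFF, 0x0007)
print(smallest_fitting_width(taken))      # 4
print(smallest_fitting_width(not_taken))  # 16
```

The three threads on the taken path fit in a 4-wide unit, so under these assumptions a 16-wide unit would not have to be powered up for them.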

Fig. 8 – Block diagram detailing the implementation of multiple instances proposed by AMD, which includes multi-tasking support in each pipeline stage [link]


To properly support this new GPU architecture, AMD has already filed two other patents. The first covers systems, apparatuses, and methods for abstracting tasks in virtual memory identifier containers: a multiple-instance implementation that includes multi-tasking support in each pipeline stage, which I already explained in my previous article (fig. 8). The second covers new methods for processing variable wavefront sizes on a GPU, which provide the dynamic warp subdivision needed to make proper use of the resources of this new architecture (fig. 9).
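One way to picture processing a wavefront in portions is to cut its execution mask into fixed-size slices and issue only the slices that contain active threads. The sketch below is my own reading for illustration and should not be taken as the method claimed in the patent; the 64-thread wavefront, the 16-thread portion size, and the `active_portions` helper are all assumptions.

```python
from typing import List, Tuple


def active_portions(
    exec_mask: int, wavefront_size: int = 64, portion_size: int = 16
) -> List[Tuple[int, int]]:
    """Return (portion index, portion mask) for portions with active threads."""
    portion_bits = (1 << portion_size) - 1
    portions = []
    for index in range(wavefront_size // portion_size):
        portion_mask = (exec_mask >> (index * portion_size)) & portion_bits
        if portion_mask:  # skip portions where every thread is predicated off
            portions.append((index, portion_mask))
    return portions


# Active threads only in the lowest and the highest 16-thread portion.
mask = 0x000F_0000_0000_00F0
print(active_portions(mask))  # [(0, 240), (3, 15)]
```

In this example only two of the four portions would need to be issued, so the two empty ones cost neither cycles nor power.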

Fig. 9 – Indicating instruction scheduling mode for processing wavefront portions [link]


The arduous path for revolution

A patent is a strong indicator of long-term R&D planning, and a detailed analysis of it can provide valuable clues about the path a company is taking in its search for innovation. Anyone who dismisses these clear signs as insignificant demonstrates total ignorance of how intellectual property is developed, and of technological development as a whole.

Fig. 10 – Asymmetric multi-core processor with native switching mechanism – Via Technologies [link]


That said, given the difficulty of the proposals presented and the question of their feasibility, it is fair to say that the path AMD is taking to bring its heterogeneous revolution to the world is arduous. Nothing presented here is necessarily breaking news. As far back as 2014, VIA also filed a patent for an asymmetric multi-core processor, which was completely abandoned in subsequent years. The fundamental point here is the huge effort AMD is putting into heterogeneous solutions that previously would have been avoided. It is very clear that AMD realized that the emergence of “dark silicon” could no longer simply be ignored in the coming years, and that no deus ex machina would provide a disruptive solution to these problems. After all, the era of silicon tricks is over, and the only way to leave no transistor behind is to face challenges that were previously avoided in order to deliver real innovation in the future.

Some references and further reading:

  1. Mednick, E. H., Mclellan, E., “Instruction subset implementation for low power operation”, US10698472, 2020.
  2. Meng, J., Tarjan, D., Skadron, K., “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance”, ACM SIGARCH Computer Architecture News, 2010.
  3. Venkat, A., Basavaraj, H., Tullsen, D. M., “Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA”, HPCA, 2019.
  4. Borkar, S., Chien, A. A., “The future of microprocessors”, Communications of the ACM, 2011.

Disclaimer: The views, analysis and opinions in this article are the author’s and aren’t necessarily shared by Coreteks.

Underfox
