Will AMD’s new BIOS patch lead to system-wide failures?

Author: Essen

10th September 2019

This morning I was going through my Discord feed and saw this article by Tom’s Hardware on the leaked AMD BIOS patch for reaching higher frequencies. I remember looking at the figures, seeing the beautifully smooth 95 degrees Celsius CPU temperature, and thinking to myself that it must be a very good cooler and that Tom’s Hardware often has very nice builds. As the Discord conversation went on, we joked around, posting memes about how these CPUs would die young. Fun times, R.I.P. But then I remembered a talk given by Janak H. Patel at Stanford called “CMOS Process Variations: A Critical Operation Point Hypothesis” and realized that the BIOS update by AMD might lead to massive system crashes even sooner than expected. Because let us be honest, we don’t all have nice rigs like Tom’s Hardware.

Disclaimer 1: I am not saying that everything I describe here will happen for certain once the patch is released and the owners of 3000 series AMD CPUs have updated their BIOS. I hope that in a month or two we can look back at this article the way we look back on the last episode of season 8 of Game of Thrones: much ado about nothing. Regardless, I suspect there is a risk that we will see sudden and massive system-wide failures, and this is what leads me to my thoughts in this article.

Disclaimer 2: I am not affiliated with AMD or Intel. This was not written with the intent of harming AMD’s reputation or attacking their products. Rather, it was simply written, in good faith, as a follow-up to a Discord discussion that didn’t fit into the 2000-character limit.

Before we can understand why such a massive failure might occur we first need to take a step back and look at what frequency actually means for a CPU.

Flip-flops

Team Red Flip-Flops

CPUs are composed of layers of combinational logic. This logic is generally built using CMOS transistors. Calculations are done inside a CPU by propagating signals through these layers of combinational logic, and this propagation takes time. The major contributors to this time are the time it takes to switch the state of a transistor and, increasingly, the time spent on the path between the transistors. Different paths take different amounts of time depending on the depth of the logic the signal is going through and how it is organized on the die.

We can separate two propagations by creating a “sample point” that will temporarily store the state of a signal and write the state back on the wire cadenced by a clock signal. At the beginning of a clock cycle we are writing the states of the signals out onto the wire. During the cycle these are being propagated through the logic. And at the end of the cycle we are sampling the values of the signals on the end of the combinational logic so that we can use them in the next clock cycle as input to some other combinational logic. This sampling and saving of values is done by a particular type of logic structure called a “flip-flop”.

So, by design we have a certain amount of logic we wish to go through in the span of one clock cycle. This logic and its interconnections are fixed and cannot be altered. There is a difference between the time it takes a signal to propagate through our logic and settle to a stable state, and the time at which we perform our final sampling: we call this the “slack time”. If our slack time becomes negative, it means we were unable to get through all of our logic, and so unable to perform the operations the logic was designed to accomplish. Long story short, if this ever happens the behavior of our CPU becomes massively unpredictable and, depending on how important that logic was and what it controlled, you have a pretty high chance of crashing.

Because of this, the paths with the least slack are called the “critical paths” and are the limiting factor for how short a single clock cycle can be. Since frequency is the inverse of the period, a shorter clock cycle means a higher frequency: at 4GHz you have a period of 250ps and at 5GHz 200ps. As stated above, we cannot modify the combinational logic of a CPU once it has been built, but we can modify the clock period by modifying the frequency.
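To make the frequency, period and slack relationship concrete, here is a minimal Python sketch. The 230ps critical path delay is a made-up number for illustration, and the model ignores real-world details such as flip-flop setup time and clock skew.

```python
def clock_period_ps(frequency_ghz: float) -> float:
    """Return the clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / frequency_ghz


def slack_ps(frequency_ghz: float, path_delay_ps: float) -> float:
    """Slack = clock period minus the time the path needs to settle."""
    return clock_period_ps(frequency_ghz) - path_delay_ps


# Hypothetical critical path that needs 230 ps to settle (illustrative value).
CRITICAL_PATH_DELAY_PS = 230.0

for f in (4.0, 4.2, 4.4, 5.0):
    s = slack_ps(f, CRITICAL_PATH_DELAY_PS)
    status = "meets timing" if s >= 0 else "FAILS timing"
    print(f"{f:.1f} GHz: period {clock_period_ps(f):.1f} ps, "
          f"slack {s:+.1f} ps -> {status}")
```

With these assumed numbers the path still fits at 4GHz, but the slack turns negative somewhere between 4.2GHz and 4.4GHz.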

Now, when the design goes from “combinational logic” to the way it is mapped out onto the silicon, the designers will try to optimize its layout for area occupancy, power and timing. Because of this optimization, the distribution of the path delays for all my paths is NOT a nicely centered Gaussian with only a few outliers. It is much closer to an accumulation at the edge, with a high concentration of paths with little slack. So if my chip hits a configuration where my logic propagation exceeds the available period, I will not have just a small number of paths not finishing, but a massive number of them, and so a pretty catastrophic system crash.

Gaussian-like distribution of the path delays for a 5GHz processor: this is what we would expect intuitively.

Actual distribution of the path delays for a 5GHz processor, after it has gone through the optimization. We now have a lot more critical paths with very little slack.
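The difference between the two distributions can be sketched with a small simulation. Everything here is an assumption chosen for illustration, not data from the talk or from any real chip: the Gaussian is centered at 140ps with a 15ps spread, and the "optimized" case is modelled as paths whose slack follows an exponential distribution with a mean of 8ps.

```python
import random

PERIOD_PS = 200.0  # clock period at 5 GHz


def gaussian_delays(n, rng):
    """Intuitive picture: delays centered well below the period (assumed 140 +/- 15 ps)."""
    return [min(max(rng.gauss(140.0, 15.0), 0.0), PERIOD_PS - 1.0) for _ in range(n)]


def optimized_delays(n, rng):
    """After optimization: delays pile up just under the period.

    Modelled, as an assumption, with an exponentially distributed slack (mean 8 ps).
    """
    return [max(PERIOD_PS - 1.0 - rng.expovariate(1.0 / 8.0), 0.0) for _ in range(n)]


def failing_fraction(delays, slowdown):
    """Fraction of paths that miss the period when every delay grows by `slowdown`."""
    failing = sum(1 for d in delays if d * (1.0 + slowdown) > PERIOD_PS)
    return failing / len(delays)


rng = random.Random(0)  # fixed seed so the run is repeatable
gaussian = gaussian_delays(10_000, rng)
optimized = optimized_delays(10_000, rng)

for slowdown in (0.0, 0.01, 0.03, 0.05):
    print(f"slowdown {slowdown:4.0%}: "
          f"gaussian {failing_fraction(gaussian, slowdown):6.2%} failing, "
          f"optimized {failing_fraction(optimized, slowdown):6.2%} failing")
```

Under these assumptions, a slowdown of a few percent barely touches the Gaussian population but pushes a large share of the optimized paths past the period at once.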

Okay, that doesn’t sound very good… Let’s take a look at some of the parameters that will influence the time it takes for a signal to be propagated through the logic.

I would like to focus your attention on two of them: temperature and voltage. Toggling my logic produces heat: the more I toggle, the hotter it gets. So when I increase my frequency, at a given voltage, my transistors consume more power. The power consumed by a transistor’s activity is known as dynamic power, as opposed to its power draw when idle, known as static power. This increase in power consumption will, in turn, result in more heat. Overheating is known to significantly reduce the lifespan of a CPU. This is a major problem because we don’t want our transistors to die on us, but you probably already know this and did not come all this way just to read trivial information. What is also of interest here is that increased heat affects the time it takes to propagate through our logic: it increases the resistance of our wires and so increases the wire propagation delay.

Let’s assume we are AMD, and we still want to maximize our frequency, preferably without our chips dropping like flies. A solution for circumventing this heat problem is reducing the operating voltage, a.k.a. undervolting. Now we can push our frequencies higher without using so much dynamic power on our transistors. All is good, right? Problem fixed?

Well, now we have another fun property of CMOS transistors that kicks in: the switching rate of a transistor decreases at lower voltages, as the gate delay increases. Also, we have only delayed our heat problem: we just reduced the amount of dynamic power that the transistors draw at a given frequency. If we increase the frequency further, the heat produced will still increase.
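The trade-off between voltage and frequency can be sketched with the standard first-order approximation for CMOS dynamic power, P = a · C · V² · f, where a is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. The numeric values below are illustrative assumptions, not figures for any real Ryzen part.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS dynamic power: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Illustrative, assumed values (not measurements of any real chip).
ACTIVITY = 0.2          # fraction of the capacitance switched each cycle
CAPACITANCE_F = 50e-9   # total switched capacitance in farads
VOLTAGE_V = 1.4
FREQUENCY_HZ = 4.2e9

baseline = dynamic_power_w(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, FREQUENCY_HZ)
undervolted = dynamic_power_w(ACTIVITY, CAPACITANCE_F, 0.9 * VOLTAGE_V, FREQUENCY_HZ)
undervolted_faster = dynamic_power_w(ACTIVITY, CAPACITANCE_F, 0.9 * VOLTAGE_V,
                                     1.1 * FREQUENCY_HZ)

print(f"baseline:                  {baseline:.1f} W")
print(f"10% undervolt:             {undervolted:.1f} W")
print(f"10% undervolt + 10% clock: {undervolted_faster:.1f} W")
```

Because power scales with the square of the voltage, a 10% undervolt cuts dynamic power to about 81% of the baseline, and raising the clock by 10% then claws part of that saving back.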

And this is where things get interesting, and it’s the basis for my hunch on system-wide crashes. If you increase the heat too much, your wire delay increases and you fail to meet timing. If you decrease your voltage too much, your gate delay increases and you fail to meet timing. If you fail to meet timing, since a big proportion of your paths have very little slack, a lot of your logic becomes unpredictable, leading to a catastrophic system failure. AMD are walking along the edge of a cliff: one misstep and everything comes crumbling down. But why would there be any misstep?
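Here is a rough sketch of how the two effects can combine. It is a toy model under stated assumptions, not AMD’s timing model: gate delay follows the commonly used alpha-power law approximation, delay ∝ V / (V − Vth)^α, wire delay scales with the resistance of copper (about 0.39% per degree Celsius), and the split of the critical path into 130ps of gates and 85ps of wire, the threshold voltage and the exponent are all invented for illustration.

```python
V_NOM = 1.30         # nominal core voltage in volts (assumed)
T_NOM = 60.0         # nominal die temperature in deg C (assumed)
V_TH = 0.35          # transistor threshold voltage in volts (assumed)
ALPHA = 1.3          # exponent of the alpha-power law (assumed)
TCR_COPPER = 0.0039  # temperature coefficient of copper resistance per deg C, ref. 20 C
GATE_PS = 130.0      # gate share of the critical path at nominal conditions (assumed)
WIRE_PS = 85.0       # wire share of the critical path at nominal conditions (assumed)


def _raw_gate_delay(voltage):
    """Alpha-power law, up to a constant factor: delay ~ V / (V - Vth)^alpha."""
    return voltage / (voltage - V_TH) ** ALPHA


def gate_scale(voltage):
    """Gate delay relative to nominal voltage; grows as the voltage drops."""
    return _raw_gate_delay(voltage) / _raw_gate_delay(V_NOM)


def wire_scale(temp_c):
    """Wire delay relative to nominal temperature; grows with wire resistance."""
    return (1.0 + TCR_COPPER * (temp_c - 20.0)) / (1.0 + TCR_COPPER * (T_NOM - 20.0))


def path_delay_ps(voltage, temp_c):
    return GATE_PS * gate_scale(voltage) + WIRE_PS * wire_scale(temp_c)


def meets_timing(frequency_ghz, voltage, temp_c):
    return path_delay_ps(voltage, temp_c) <= 1000.0 / frequency_ghz


for voltage, temp_c in ((1.30, 60.0), (1.30, 95.0), (1.20, 60.0), (1.20, 95.0)):
    ok = meets_timing(4.4, voltage, temp_c)
    print(f"{voltage:.2f} V, {temp_c:.0f} C: {path_delay_ps(voltage, temp_c):.1f} ps "
          f"-> {'meets timing' if ok else 'FAILS timing'} at 4.4 GHz")
```

With these made-up parameters, either the hot chip or the undervolted chip alone still fits inside the 4.4GHz period, but the combination of the two does not.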

Not all chips are created equal: there is a variation in quality, sometimes referred to as the silicon lottery. Some chips will have lower delays out of the box and some will not. In addition to that, not all setups are created equal: people have cooling systems and power supplies of varying quality, and this is where things might start to go south. For my system to fail I need one of two things to happen: too much heat or too low a voltage. Let us look first at voltage.

Power is like a river, and your supply has inertia when ramping up the amount of power flowing through a local part of your chip. If you have a sudden spike in power draw, depending on the quality of your supply, there will be a varying amount of inertia before you can meet that demand. If your PSU fails to supply enough power, you will have a local drop in voltage. Naturally, there are checks in place to prevent this from happening on the CPU as well as the board side, but we are operating on the edge so any small inconsistency could be sufficient to send us overboard.
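The "inertia" described above can be approximated, very roughly, by treating the power delivery path as a series resistance and inductance: a current step then causes a droop of about ΔI · R + L · ΔI / Δt. This is a simplification that ignores decoupling capacitors and regulator response, and the component values below are assumptions picked only to show the orders of magnitude.

```python
def supply_droop_v(current_step_a, rise_time_s, resistance_ohm, inductance_h):
    """Estimate the voltage droop caused by a sudden load step.

    Resistive part: dI * R. Inductive part: L * dI / dt.
    """
    resistive = current_step_a * resistance_ohm
    inductive = inductance_h * current_step_a / rise_time_s
    return resistive + inductive


# Assumed values: 40 A load step, 1 milliohm and 10 pH in the delivery path.
for rise_time_ns in (100.0, 10.0, 5.0):
    droop = supply_droop_v(40.0, rise_time_ns * 1e-9, 1e-3, 10e-12)
    print(f"40 A step in {rise_time_ns:5.1f} ns -> droop of about {droop * 1000:.0f} mV")
```

The faster the load step, the larger the inductive term, which is why a sudden burst of activity is harder on the supply than a slow ramp.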

On the other side we have our cooling system: it has inertia when ramping up, as well as a maximum cooling capacity. Its smooth operation also depends on a sufficient flow of power from our potentially struggling power supply. I think you can see where this is going. Not being able to ramp up cooling fast enough could lead to an overshoot past the maximum operating temperature threshold, and my delay will increase, leading to the death of my system.

Now the engineers at AMD are not beginners; this is not their first rodeo. So how and why could a “bug”-inducing BIOS update like this be released? It would be a combination of multiple factors. The first is the recent public backlash they have faced following the release of der8auer’s survey numbers. This would have pressured AMD to push a BIOS patch out the door sooner, cutting a few corners, one of them being testing. Insufficient testing, or testing focused mainly on high-end builds, might cause them to miss this problem. The second factor is that this type of failure doesn’t set in gradually: there is no gentle slope of crashes with a linearly increasing probability under stress. It is closer to a step with a very steep slope which, once crossed, causes a near-certain crash.

Linear increase of the error rate: the behavior we would intuitively expect.

Step-like increase of the error rate: no errors are detected until the tipping point.

Conclusion

The essence of it comes down to the following: whenever you are in a configuration with a combination of high frequency and low voltage, with a power supply and/or a cooling solution that can no longer keep up, this upcoming BIOS update might cause a system-wide crash. This can occur at any moment, on heavy single- and multi-threaded workloads, as long as you are running at the boost frequency, and will be more likely to happen on builds with less reactive power supplies and cooling solutions. Depending on your OS, this crash might be a blue screen of death or a kernel panic. Either way, it is unrecoverable.

a recipe for disaster

Sources and references:
CPU animation, original video: https://youtu.be/urqPobwPOzs
der8auer, Ryzen 3000 boost clock survey: https://youtu.be/DgSoZAdk_E8
Tom’s Hardware, BIOS fix leak article: https://www.tomshardware.com/news/amd-ryzen-3000-boost-frequency-bios-fix-agesa,40359.html
Janak H. Patel, CMOS Process Variations: A Critical Operation Point Hypothesis: https://youtu.be/rf8qTpW6BH4

About the guest author: Essen is a patron of Coreteks and an active member of the community on the Discord Server. Her views do not necessarily reflect those of Coreteks.