Article Discussion: Speed vs Throughput



This topic contains 5 replies, has 3 voices, and was last updated by  Essen 2 months, 1 week ago.

Viewing 6 posts - 1 through 6 (of 6 total)


    Use this thread to discuss this article



    Nice article, and I would like to add another point regarding throughput. The article focused heavily on the computational potential of the CPU, influenced by: SMT availability (thread scheduling at the CPU level), single-threaded performance, and the number of cores available. But in my opinion you forgot to mention that throughput is determined by the weakest node in a network of connected dependencies. And, in my perception, for most people the weakest node currently slowing down the chain is … memory. The most common example, which most of us have seen at least once, is a system where you have, say, 12 threads all stalling because next to that brand-new CPU sits 2 GB of RAM and an HDD as backing storage.

    In "memory" I also include the mostly forgotten caches and the system that keeps them fed: your internal memory buses. And this is where things get complex, because now we have an internal network through which all that data has to transit: CPU <-> L1 <=> L2 <=> (L3) <=> RAM <=> DISK, over either un-shared buses (the <-> symbol) or, mostly, shared buses (the <=> symbol). So even in a good scenario we have a bunch of parallel threads firing up and displacing each other's data in the caches, then needing to request it back: they go out on the buses asking the RAM for memory, possibly causing a page fault, and potentially stalling because of it.

    Okay, that's not a pretty situation, but you can argue: "Yeah, but that is why we have threads: to balance between stalled threads and the ones that have filled all the requirements to be executed." And you would be totally right … but. That vision assumes the threads are independent, and in some cases they aren't. In a 'nice' example (nice because this example doesn't actually hurt your performance to its full potential), suppose I am on an Intel x86 chip: I have 3 threads, 2 reading a shared memory region and 1 writing to that shared memory region.
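    To make that last scenario concrete, here is a minimal, hypothetical C++ sketch (the names and iteration count are mine, not from the article): one writer repeatedly stores to a shared location while two readers keep loading it, so every store dirties the cached copy of the line that the readers hold.

    ```cpp
    #include <atomic>
    #include <iostream>
    #include <thread>

    int main() {
        // Hypothetical shared region for illustration. Every store by the
        // writer dirties the cache line, so the readers' next load has to
        // re-fetch it across the internal buses.
        std::atomic<long> shared{0};
        std::atomic<bool> done{false};

        auto reader = [&] {
            while (!done.load(std::memory_order_acquire))
                (void)shared.load(std::memory_order_acquire);  // re-fetch the (possibly invalidated) line
        };

        std::thread w([&] {
            for (long i = 1; i <= 100000; ++i)
                shared.store(i, std::memory_order_release);    // each store invalidates remote copies
            done.store(true, std::memory_order_release);
        });
        std::thread r1(reader), r2(reader);
        w.join(); r1.join(); r2.join();

        std::cout << "final value: " << shared.load() << "\n";  // 100000
    }
    ```

    Functionally this is trivial, but on real hardware each of those 100000 stores can trigger coherence traffic toward whichever cores hold the line.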
    Okay, so what will happen? According to Intel's manual, "In a multiple-processor system, the following ordering principles apply":
    … stuff …
    • Writes by a single processor are observed in the same order by all processors
    … more stuff …
    So how does this affect our case? Well, my guess is that when you write to this shared memory, you broadcast your "cache line dirty" signal to all CPUs (because we don't know which CPUs your sharing threads are running on, or will be running on) to make sure no stale version of your variable is still stored anywhere. Then a copy of the new data has to be re-acquired everywhere the line was flushed.
    CPU_w -> L1 -> L2 -> L3 -> L2 -> L1 -> CPU_r
                          |
                          +--> L2 -> L1 -> CPU_r
    And now, remember that small but significant "observed in the same order by all processors"? It translates to "redo the invalidate/flush/super-expensive acquire" for every write to shared memory, while the thread running on CPU_r stalls because it doesn't have the data and can't do any more computation on it (the cache line was invalidated), and your thread on CPU_w waits for the acknowledgement telling it that everybody got the signal and the data. The result is 3 threads stalling for a bit while broadcasting synchronization signals to every other core and consuming very precious bandwidth on my internal buses, every time we use shared memory. Did I mention this is a 'nice' example?
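    One way to feel the cost of that invalidate/re-acquire traffic is to compare threads hammering a single shared counter (the line bounces between cores on every increment) against threads using private, cache-line-padded counters. A rough sketch, under my own assumptions (64-byte cache lines, 4 threads, C++17 for over-aligned allocation):

    ```cpp
    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const int kThreads = 4;
        const long kIters = 1'000'000;

        auto time_ms = [](auto fn) {
            auto t0 = std::chrono::steady_clock::now();
            fn();
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };

        // Case 1: every thread increments ONE shared atomic -> the cache
        // line ping-pongs between cores on each fetch_add.
        std::atomic<long> contended{0};
        double shared_ms = time_ms([&] {
            std::vector<std::thread> ts;
            for (int t = 0; t < kThreads; ++t)
                ts.emplace_back([&] {
                    for (long i = 0; i < kIters; ++i)
                        contended.fetch_add(1, std::memory_order_relaxed);
                });
            for (auto& th : ts) th.join();
        });

        // Case 2: each thread owns a counter padded to its own (assumed
        // 64-byte) cache line -> no coherence traffic until the final sum.
        long total = 0;
        double private_ms = time_ms([&] {
            struct alignas(64) Slot { long v = 0; };
            std::vector<Slot> slots(kThreads);
            std::vector<std::thread> ts;
            for (int t = 0; t < kThreads; ++t)
                ts.emplace_back([&, t] {
                    for (long i = 0; i < kIters; ++i)
                        slots[t].v++;
                });
            for (auto& th : ts) th.join();
            for (auto& s : slots) total += s.v;
        });

        std::cout << "contended: " << contended.load() << " in " << shared_ms << " ms\n";
        std::cout << "private:   " << total << " in " << private_ms << " ms\n";
    }
    ```

    On most multi-core x86 machines the private version finishes considerably faster even though both do the exact same amount of arithmetic; the difference is pure coherence and bus traffic, and the exact ratio depends on the core count and interconnect.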
    Basically, the idea I am trying to introduce is that, past a certain level of multi-threading, allowing a program to truly run in parallel on multiple cores might not only hurt the performance of that program but also hurt the performance of all your other programs, by saturating the internal buses.

    Reference:
    Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3, p. 263



    That’s an interesting observation. My (admittedly short) thesis in the article was more general and consumer-oriented. As you point out, there are definitely bottlenecks today that limit throughput in various scenarios/systems. I’m not too familiar with the caching model in your example, so I can’t comment on whether that’s how it would work; I need to study up on it. Would you say that this is Intel-specific?



    I would not say this is Intel-specific; rather, it is an architectural decision of the x86 memory model (a mostly strong memory model) that was made long before we really had multicore systems with fast CPUs and slow memory. So it’s a legacy issue that cannot be fixed without taking the risk of breaking legacy compatibility.
    Here is a very short explanation from the most reliable source on the internet ( I’m only half joking ), the Linux kernel documentation :

    It has to be assumed that the conceptual CPU is weakly-ordered but that it will
    maintain the appearance of program causality with respect to itself. Some CPUs
    (such as i386 or x86_64) are more constrained than others (such as powerpc or
    frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
    of arch-specific code.
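    That "assume the weakest ordering" advice is exactly what C++'s release/acquire primitives encode. As a sketch of the classic message-passing pattern (my own example, not from the kernel doc): on x86 the plain stores would happen to be observed in order anyway, but the explicit ordering makes the guarantee portable to weakly-ordered CPUs.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <thread>

    // Producer writes a payload, then raises a flag. The release/acquire
    // pair guarantees the consumer that sees the flag also sees the payload,
    // on every architecture, not just strongly-ordered x86.
    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                     // payload (plain store)
        ready.store(true, std::memory_order_release);  // publish: payload ordered before flag
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // subscribe: flag ordered before payload read
            ;
        assert(data == 42);  // would NOT be guaranteed with relaxed ordering on ARM/POWER/Alpha
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
    }
    ```

    Note that making the ordering explicit doesn't make the coherence traffic free; it just makes the cost (and the correctness) visible in the code.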

    The main idea I’m trying to express is that, on architectures with fairly strong memory models, we might be heading toward a point where, the more cores you have running multi-threaded programs, if one of the programs you are running happens to have code that makes a lot of accesses to memory shared between threads, you might end up with:
    + ( not only ) lower and lower per-thread performance relative to the number of threads
    + ! your entire system’s performance suffering, even separate programs / other VMs !

    P.S.: The only way I can think of to prevent this from coming to fruition would be adding even more overhead to voluntarily upper-bound the amount of chaos a single process can cause.




    This article has, sadly, confirmed that I don’t even need the Ryzen 1800X I’ve got, much less the 3950X I’m going to buy – if I don’t hold out for a 24-core.

    Fortunately, Tom over at “Moore’s Law Is Dead” mentioned that home desktop enthusiasts desperate to rationalize a big Zen 2 chip can credit disabling the SMT. That will remove any potential drag from poorly-threaded games and still leave gobs of real cores for the fun stuff.

    He didn’t put it quite that way but I feel sure he was encouraging me specifically.

    Now I only need to see which chip will most reliably hit the magic some-cores 5GHz, the other component of my 7nm e-lust.



    Fortunately, Tom over at “Moore’s Law Is Dead” mentioned that home desktop enthusiasts desperate to rationalize a big Zen 2 chip can credit disabling the SMT. That will remove any potential drag from poorly-threaded games and still leave gobs of real cores for the fun stuff.

    Sorry, but I think you might have missed my point: even in a system without SMT the earlier problem still arises, so:

    remove any potential drag from poorly-threaded games

    is not really going to happen … but since both AMD and Intel chips currently suffer from the drawbacks of these memory-architecture choices, it’s harder to realize how bad this gets.
    But then why did Tom say that disabling SMT leads to performance gains? How does that change anything?
    By disabling Simultaneous MultiThreading you are just moving threads off a single core and onto other potentially available cores. Now, if you do have a “big Zen 2 chip”, the probability of a core being available is higher, so that badly coded program’s threads might still all be running. The advantages for overall performance that I can currently think of from disabling SMT would be:
    + lower pipeline utilization -> lower power draw -> less heat -> less overheating -> less thermal throttling / performance capped by HW/OS
    + only one thread executing per core -> while it runs, only its own data is fetched/loaded into that core -> more relevant L0/L1/L2 data in cache -> fewer cache misses -> fewer trips to far-away, slower memory | less sitting around idly waiting for data that, with SMT on, might have been displaced by another concurrently running thread
    + less aliasing in the branch predictor’s target buffer and history -> more accurate predictions -> reduced risk of needing to flush instructions that were executed based on wrong speculation
    + my machine is still a network, and the weakest node in the network is still going to bottleneck my performance -> e.g. if my execution stage is that of a very wide superscalar but my instruction-fetch unit is less powerful, then depending on how much less powerful it is, running one more thread may well make fetch the bottleneck
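    The branch-predictor point above can be illustrated even outside any SMT context with the classic sorted-vs-unsorted sum (my own toy example, with made-up sizes): the same work, the same branch, but a predictable pattern in one case and a near-random one in the other.

    ```cpp
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    // Sums only the elements >= 128. With random bytes the branch is taken
    // roughly 50/50 and the predictor misses constantly; after sorting, the
    // branch is perfectly predictable and the same work usually runs much faster.
    long sum_over_threshold(const std::vector<int>& v) {
        long sum = 0;
        for (int x : v)
            if (x >= 128) sum += x;   // the data-dependent branch
        return sum;
    }

    int main() {
        std::vector<int> v(1 << 22);
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> d(0, 255);
        for (int& x : v) x = d(rng);

        auto time_ms = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long s = sum_over_threshold(v);
            auto t1 = std::chrono::steady_clock::now();
            std::cout << label << ": sum=" << s << " in "
                      << std::chrono::duration<double, std::milli>(t1 - t0).count()
                      << " ms\n";
            return s;
        };

        long unsorted = time_ms("unsorted");
        std::sort(v.begin(), v.end());    // same elements, predictable branch
        long sorted = time_ms("sorted");
        std::cout << (unsorted == sorted ? "sums match\n" : "mismatch!\n");
    }
    ```

    The two sums are identical; only the branch-miss rate changes, which is exactly the kind of microarchitectural state (history, target buffer) that two SMT siblings would otherwise be sharing and polluting for each other.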

    Now, I am not saying that SMT hurts performance; quite the contrary according to its supporters. On our standard benchmarks it does improve performance by:
    + allowing better core utilization -> less stalling while waiting for memory -> more threads executing at any given moment -> a thread ready to execute has a higher probability of doing so -> less latency before execution
    + we have superscalars + reordering -> both hardware threads’ needs can be met by reordering their data requests onto different execution pathways -> no contention

    In the end, SMT might help in some applications and hurt performance in others; it will depend on its implementation and on the resource needs, optimization, and behavior of the software. But disabling it is NOT going to solve the multi-threading wall that architectures with strong memory models are going to face in the near future.

    P.S.: Advantages/disadvantages were cited for an average case. For example, in the case of branch prediction, more history information and more data relevant to the currently executing code in the target buffer doesn’t automatically translate to better accuracy in all cases. But the consensus is that, on average, compared to a 50/50 blind prediction, it does.
    P.P.S.: This is a personal perception based on my current understanding; if you find ANYTHING you think is wrong, then please tell me right away. I’m only human.

