At work recently, I found myself trying to explain the Work-Efficiency vs. Step-Efficiency tradeoff to a coworker, but when I searched for online resources to help, I couldn’t find any that I liked, so I decided to take a shot at writing my own. I first found this idea presented in a video lecture series about programming for GPGPUs on YouTube a while ago, but it’s just as applicable to any form of parallel processing, from SIMD instructions running on a single CPU core up to massive clusters of thousands of computers.
Good explanation of the difference between work efficiency and step efficiency when talking about parallel algorithms.
Overall I mostly agree. I am not sure that I would call these things “efficiency”, but that is just terminology. It seems like you’re really talking about execution time, be it of the critical path, some sort of sum across parallel threads, or maybe energy consumption. Also, as far as I can see, the rule of what to optimize, work vs. step, really comes down to fitting the sub-tasks into an integral multiple of the number of cores, and, as I think you said, probably the smallest integral multiple possible.
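As a minimal sketch of that rule of thumb (my own illustration with made-up numbers, not anything from the article):

    import math
    import os

    def waves(n_tasks, n_cores=None):
        """How many full passes ("waves") over the cores a task count needs,
        and how many core-slots sit idle during the last wave."""
        n_cores = n_cores or os.cpu_count()
        n_waves = math.ceil(n_tasks / n_cores)
        idle_slots = n_waves * n_cores - n_tasks
        return n_waves, idle_slots

    # 100 equal-sized tasks on 8 cores: 13 waves, with 4 of the 8 cores idle
    # during the last wave. Splitting the same work into 104 smaller tasks
    # instead would fill every slot.
    print(waves(100, 8))   # (13, 4)
    print(waves(104, 8))   # (13, 0)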
There are exceptions to this rule too. Some processors have thermal limits, so thermal load, clock speed, and the number of usable cores can be important. There are also memory bandwidth limitations. Then there is code generation, optimization, and vectorization, which I’m not sure fits directly into your model very well. Plus there is the effect of vectorization on memory bandwidth limitations. On my workstation I can often get the same step time in two ways: no vectorization using all real cores, or vectorization using only half of the real cores. I would need higher memory bandwidth to actually use both; that is, 4-port versus the more common 2-port memory.
Not sure what I am getting at. Maybe that I more or less agree, but at least for me, thinking specifically about what efficiency means and how a task might be mapped onto hardware seems clearer. Or maybe I just misunderstand the whole thing.
I think you’re maybe looking at it too “close to the hardware”, when this is more about theoretical concepts in time complexity analysis. Work efficiency can be thought of as the total number of primitive operations that your processors execute, and step efficiency tells you how much of that work you have to do sequentially. Whether you use SIMD or whatever doesn’t really matter; it’s already figured into the work and step efficiency.
This means that, e.g., the work efficiency of a non-parallel algorithm is always going to be greater than or equal to that of its parallel counterpart: parallelisation always “costs” you synchronization etc., so the most work-efficient algorithm is always going to be single-threaded. Not great for your wall clock time though, hence why you also need to care about step efficiency.
Edit: this wiki page was decent at explaining this stuff in a slightly more rigorous way, but not too rigorous. Note that depth / span is the same as “step efficiency” https://en.wikipedia.org/wiki/Analysis_of_parallel_algorithms
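To make the tradeoff concrete, here is a minimal sketch using the classic prefix-sum (scan) example from GPGPU courses; picking Hillis–Steele as the step-efficient variant is my assumption, not necessarily what the article used. With p processors, wall time is roughly bounded by work/p + steps (Brent’s law, from that wiki page), which is why both numbers matter:

    def serial_scan(xs):
        """Inclusive prefix sum: O(n) work, O(n) steps (fully sequential)."""
        out, total, work = [], 0, 0
        for x in xs:
            total += x
            work += 1
            out.append(total)
        return out, work, len(xs)   # steps == n: every add waits on the previous one

    def hillis_steele_scan(xs):
        """Step-efficient inclusive scan: O(n log n) work, O(log n) steps.
        Each "step" is one pass whose additions could all run in parallel."""
        xs = list(xs)
        n, work, steps, d = len(xs), 0, 0, 1
        while d < n:
            xs = [xs[i] + xs[i - d] if i >= d else xs[i] for i in range(n)]
            work += n - d    # additions performed in this step
            steps += 1
            d *= 2
        return xs, work, steps

    xs = list(range(1, 17))              # n = 16
    _, w1, s1 = serial_scan(xs)
    _, w2, s2 = hillis_steele_scan(xs)
    print(w1, s1)   # 16 adds in 16 steps: most work-efficient
    print(w2, s2)   # 49 adds in  4 steps: far more step-efficient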
Yes, I was thinking about that some more. Synchronization can be large in terms of time if you have work units that do not have a known duration or count. Actually, this is an exception too: you may want the number of work units to be large compared to the number of cores, for example, which I think is actually the opposite of what you said; except, of course, the more work units, probably the more total overhead.
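A toy model of that point, assuming nothing beyond greedy “grab the next unit when a core frees up” scheduling (the task times are random, purely for illustration):

    import random

    def makespan(unit_times, n_cores):
        """Greedy dynamic scheduling: each unit goes to the core that frees
        up first; returns the finish time of the last core."""
        cores = [0.0] * n_cores
        for t in unit_times:
            cores[cores.index(min(cores))] += t
        return max(cores)

    random.seed(0)
    tasks = [random.uniform(0.1, 2.0) for _ in range(64)]   # unknown, uneven durations

    # Same total work, two splits, 8 cores: 8 coarse units (one per core)
    # vs. 64 fine units that can be rebalanced as cores free up.
    coarse = [sum(tasks[i::8]) for i in range(8)]
    print(round(makespan(coarse, 8), 2))   # stuck with whatever the split gave us
    print(round(makespan(tasks, 8), 2))    # finer units even out the load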
Yeah, this is exactly it: when you start splitting work between multiple processors, there’s often more overhead involved in simply doing that splitting and coordinating the results. That’s why an explicitly non-parallel algorithm’s work efficiency is going to be as good as or better than a parallel one’s. It’s impossible to parallelize an algorithm and make it more work efficient than the non-parallel version, even though it’ll (hopefully 😅) use less wall clock time thanks to better step efficiency, since it doesn’t have to do the whole thing sequentially.
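For example, a minimal sketch of where that extra work shows up, assuming Python’s multiprocessing (the function names are just illustrative):

    from multiprocessing import Pool

    def chunk_sum(chunk):
        return sum(chunk)   # each worker's share of the "real" work

    def parallel_sum(xs, n_workers=4):
        # Everything below is work the serial sum(xs) never does:
        # splitting the input, shipping chunks to worker processes,
        # and combining the partial results at the end.
        size = (len(xs) + n_workers - 1) // n_workers
        chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
        with Pool(n_workers) as pool:
            partials = pool.map(chunk_sum, chunks)
        return sum(partials)   # the final combine is itself sequential

    if __name__ == "__main__":
        xs = list(range(1_000_000))
        assert parallel_sum(xs) == sum(xs)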
Yes. In the end it is usually the total wall time of the critical path that matters, unless there are special considerations like power consumption or heat generation, etc.
Not the same thing, but CPUs are designed to optimize different things too. I have often had laptops that were faster for single-threaded tasks, but my engineering workstation has typically had much higher throughput for a parallel workload, to say nothing of compute clusters. These differences between single-threaded and parallel performance are fairly common.
There also tend to be fairly expensive setup and synchronization steps related to highly parallel stuff like MPI or GPUs, though sometimes less so in the case of, say, vectorization or threading.
Thanks, interesting.
It is interesting, isn’t it? This is why I love computing as a field; the whole “spectrum” from theory to practice is just super fascinating.
Yes. Took my first computer class in 1978. Fortran, of all things. Hooked immediately. Considered CS, but actually went the science side of things, doing a combination of design, simulation, instrument control, and machine control. Whatever was needed. Lots of changes in hardware, software, languages, etc. on the one hand, but on the other hand, not so much. At some level it has all been the same.