OpenMP is one of the easiest ways to make existing code run across CPU cores. In the simplest cases you add a single #pragma to C code and it runs close to N times faster on N cores. That applies when you're calling a function in a loop and the function has no side effects. Some examples I've done:
1) ray tracing. Looping over all the pixels in an image using ray tracing to determine the color of each pixel. The algorithm and data structures are complex but don't change during the rendering. N cores is about N times as fast.
2) in Solvespace we had a small loop which calls a tessellation function on a bunch of NURBS surfaces. The function was appending triangles to a list, so I made a thread-local list for each call and combined them afterwards to avoid writes to a shared data structure. Again N times faster with very little effort.
The code is also fine to build single threaded without change if you don't have OpenMP. Your compiler will just ignore the #pragmas.
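To make the ray tracing case concrete, here is a minimal sketch of a side-effect-free pixel loop. `shade_pixel` and `render` are made-up names standing in for the real per-pixel work, not code from any actual renderer. Built without OpenMP, the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive per-pixel computation such as
   tracing one ray. It reads only its arguments and has no side effects. */
static unsigned shade_pixel(int x, int y)
{
    unsigned h = (unsigned)x * 73856093u ^ (unsigned)y * 19349663u;
    for (int i = 0; i < 64; i++)
        h = h * 1664525u + 1013904223u;
    return h;
}

/* Fill a width*height image. Every iteration writes only to its own
   elements of `out`, so rows can be computed in any order. */
void render(unsigned *out, int width, int height)
{
    assert(out != NULL && width > 0 && height > 0);

    #pragma omp parallel for
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            out[(size_t)y * width + x] = shade_pixel(x, y);
}
```

With GCC or Clang the pragma takes effect when you pass `-fopenmp`; without the flag the same file builds as a plain serial loop.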
> OpenMP is one of the easiest ways to make existing code run across CPU cores.
True (or with Intel TBB). However, as someone with a lot of experience optimising HPC algorithms for rendering, geometry processing and simulation, I can say there are caveats. Quite often, existing code that is naively parallelised this way ends up spending a disproportionate amount of CPU time on spinlocks inside OpenMP or TBB instead of doing useful work. (I've also noticed the same thing happening with Rayon in Rust.)
Sometimes I've looked at code that colleagues have "parallelised" this way, and they've said "yes, it's using multiple threads". But when you profile it with perf or VTune, it's clearly not doing much *useful* parallel work, and sometimes it's even slower than single-threaded in wall-clock terms. People just didn't check whether it was faster: they looked at the CPU usage and didn't notice the spinlocks.
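A cheap first check, before reaching for a profiler, is to time the region by wall clock rather than looking at CPU usage. A rough sketch, where `wall_seconds`, `time_call` and the `sum_job` workload are hypothetical names made up for illustration:

```c
#include <assert.h>
#include <time.h>

/* Wall-clock seconds via C11 timespec_get(). Unlike clock(), this does
   not add up CPU time across threads, so spinning threads cannot make
   the result look better than it is. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Elapsed wall-clock time of a single call to fn(arg). */
double time_call(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    double t0 = wall_seconds();
    fn(arg);
    return wall_seconds() - t0;
}

/* Hypothetical workload: a harmonic sum computed with a reduction. */
struct sum_job { long n; double result; };

static void sum_job_run(void *arg)
{
    struct sum_job *job = arg;
    double s = 0.0;

    #pragma omp parallel for reduction(+:s)
    for (long i = 1; i <= job->n; i++)
        s += 1.0 / (double)i;

    job->result = s;
}
```

Run the same binary with `OMP_NUM_THREADS=1` and with the default thread count and compare what `time_call` reports. If the multi-threaded number isn't clearly smaller, the extra CPU usage is probably going somewhere other than useful work.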
OpenMP is great. I’ve done something similar to your second case (thread-local objects that are filled in parallel and later combined). In the case of “OpenMP off” (pragmas ignored), is it possible to avoid the overhead of the thread-local object essentially getting copied into the final object (since no OpenMP means only a single thread-local object)? I avoided this by implementing a separate code path, but I’m wondering if there are any tricks I missed that would still allow a single code path.
Give one of the threads (thread ID 0, for instance) special privileges. Its list is the one everything else is appended to, then there's only concatenation or copying if you have more than one thread.
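A minimal sketch of that idea, assuming a simple growable list of ints. `IntList`, `list_push`, `emit_item` and `collect` are hypothetical names, not Solvespace code. Without OpenMP the block runs once as thread 0, so everything goes straight into the final list and nothing is copied.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

typedef struct { int *data; size_t len, cap; } IntList;

static void list_push(IntList *l, int v)
{
    if (l->len == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 16;
        int *p = realloc(l->data, cap * sizeof *p);
        if (!p) abort();
        l->data = p;
        l->cap = cap;
    }
    l->data[l->len++] = v;
}

/* Hypothetical work item that emits a variable number of values. */
static void emit_item(int i, IntList *out)
{
    for (int k = 0; k <= i % 3; k++)
        list_push(out, i);
}

void collect(IntList *result, int n)
{
    assert(result != NULL);

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        IntList local = {0};
        /* Thread 0 writes directly into the final list. */
        IntList *out = (tid == 0) ? result : &local;

        /* The implicit barrier at the end of this loop guarantees that
           thread 0 has finished before anyone else touches `result`. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            emit_item(i, out);

        if (tid != 0) {
            #pragma omp critical
            {
                for (size_t k = 0; k < local.len; k++)
                    list_push(result, local.data[k]);
            }
            free(local.data);
        }
    }
}
```

The order of elements in `result` depends on thread scheduling, so this only fits cases where order doesn't matter or where you sort afterwards.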
Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.
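A sketch of the pre-allocated variant, assuming every item produces at most `MAX_PER_ITEM` values (an invented bound for illustration; `emit_into` and `fill_slots` are hypothetical names too). Each iteration owns a fixed slice of the output plus one entry in `counts`, so there is nothing to combine afterwards.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PER_ITEM = 3 };  /* assumed upper bound on outputs per item */

/* Hypothetical work item: writes up to MAX_PER_ITEM values into `slot`
   and returns how many it wrote. */
static int emit_into(int i, int *slot)
{
    int count = i % MAX_PER_ITEM + 1;
    for (int k = 0; k < count; k++)
        slot[k] = i;
    return count;
}

/* `out` must hold n * MAX_PER_ITEM ints and `counts` must hold n ints.
   Item i writes only to out[i*MAX_PER_ITEM ...] and counts[i]. */
void fill_slots(int *out, int *counts, int n)
{
    assert(out != NULL && counts != NULL);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        counts[i] = emit_into(i, out + (size_t)i * MAX_PER_ITEM);
}
```

The trade-off is wasted memory when most items produce far fewer values than the bound, and readers have to skip the unused tail of each slice using `counts`.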
You can now (already in OpenMP 5) use it to write GPU programs. Intel's oneAPI uses OpenMP 5.x offloading to write programs for the Intel Ponte Vecchio GPUs, which are on par with the Nvidia A100.
It is a bit better now, but OpenMP is yet another standard born in UNIX HPC clusters, so I am surprised that Microsoft bothered at all. It seems to be something to tick a checkbox.
I was just googling to see if there's any Emscripten/WASM implementation of OpenMP. The emscripten github issue [1] has a link to this "simpleomp" [2][3] where
> In ncnn project, we implement a minimal openmp runtime for webassembly target
> It only works for #pragma omp parallel for num_threads(N)
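For reference, that one supported form looks like this in use. This is a generic sketch of the directive, not code taken from ncnn, and `scale` is a made-up function name:

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of v by k, asking for 4 threads explicitly. */
void scale(float *v, int n, float k)
{
    assert(v != NULL);

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; i++)
        v[i] *= k;
}
```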
I used it a while ago, but got burned by very uneven support across compilers: MSVC required special tweaks, and old GCC would generate crashy code without warning.
It was okay for basic embarrassingly parallel for loops. I ended up not using any more advanced features, because apart from even worse compiler support, non-trivial multi-threading in C without any safeguards is just too easy to mess up.
The best I've found so far:
https://cdn.kernel.org/pub/linux/kernel/people/paulmck/perfb...
And some other good reading:
https://www.amazon.com/Systems-Performance-Brendan-Gregg/dp/...
https://fgiesen.wordpress.com/2014/08/18/atomics-and-content...
https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
So easiest depends on the target audience.
https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...
GCC also supports offloading to NVIDIA and AMD GPUs:
https://gcc.gnu.org/wiki/Offloading
Here is an example of how you can use OpenMP to run a kernel on an NVIDIA A100:
https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...
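For a rough idea of what offloaded code looks like (this is a generic saxpy sketch, not the code from the linked notes), the `target` directive with `map` clauses moves the arrays to the device and runs the loop there. On a compiler without offload support the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* y = a*x + y. The map clauses copy x to the device and copy y both
   ways; `saxpy` is just an illustrative name. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether this actually lands on a GPU depends on the compiler being built with offloading enabled and on the offload target flags you pass, which differ between GCC, Clang and Intel's compilers.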
[1] https://github.com/emscripten-core/emscripten/issues/13892
[2] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h
[3] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...