August 21st, 2024

SIMD Matters: Graph Coloring

The article discusses SIMD's role in enhancing CPU performance in game development, particularly in Box2D, highlighting challenges, the use of graph coloring, and benchmarks showing significant performance improvements over scalar implementations.

Read original articleLink Icon
SIMD Matters: Graph Coloring

The article discusses the implementation of SIMD (Single Instruction, Multiple Data) in game development, particularly in the Box2D physics engine. While SIMD is often seen as a key to enhancing CPU performance, its practical benefits can be elusive, especially in scenarios where operations are not easily parallelizable. The author highlights challenges in game physics, such as the need for piecemeal vector math and the random access of bodies, which complicate the use of SIMD. To address these issues, the author introduces graph coloring as a method to group contact constraints that can be solved simultaneously without race conditions. This technique allows for efficient processing of multiple contact pairs, significantly improving performance. Benchmarks conducted on different hardware show that SIMD implementations (AVX2, SSE2, and Neon) outperform scalar implementations, with AVX2 yielding the best results. The author concludes that while implementing SIMD requires considerable effort, the performance gains justify the investment, enabling games to run faster and manage more rigid bodies. Additionally, the article touches on the limited effectiveness of compiler vectorization compared to hand-optimized SIMD code.

- SIMD can significantly enhance CPU performance in game physics.

- Graph coloring allows for efficient grouping of contact constraints for simultaneous processing.

- Benchmarks show SIMD implementations outperform scalar implementations by a substantial margin.

- Implementing SIMD requires considerable effort but yields worthwhile performance improvements.

- Compiler vectorization may not match the efficiency of hand-optimized SIMD code.

Link Icon 13 comments
By @mikewarot - 5 months
>This is the magic of graph coloring and enables me to solve multiple contact constraints simultaneously without a race condition.

<TANGENT> This hits me, like a ton of bricks, as one of the most elegant ways to describe why I add 2 phases of clocking ("like colors on a chessboard" is the phrase I've been using) to my BitGrid[1] hobby project.

I wonder what other classes of problems this could solve. This feels oddly parallel, like a mapping of the Langlands program into computer science.[2]

[1] https://esolangs.org/wiki/Bitgrid

[2] https://en.wikipedia.org/wiki/Langlands_program

By @a1o - 5 months
> The conventional wisdom is that it is difficult to achieve real gains from SIMD.

This has been my experience often times I misunderstood how much can be gained by using SIMD, and preparing the data to be "eaten" by SIMD instructions is not trivial. I have many times attempted to use just to profile and see it didn't improve things at all and made the code really hard to understand.

Kudos for Erin here this is really hard work and it's great it paid off well and gave good results here!

By @mgaunard - 5 months
The main problem of SIMD remains the same: it requires working with structures of arrays rather than arrays of structures, which makes it more difficult to combine and layer objects on top of each other.
By @patrick451 - 5 months
> I also implemented all the wide math functions to work with this. It seems that I have arranged all the data perfectly for the compiler to use automatic vectorization. But it seems this doesn’t really happen to a sufficient degree to compete with my hand written SSE2.

I will keep this example in mind the next time somebody trots out the line that you should just trust the compiler.

By @unwind - 5 months
Very cool and informative article, and I love that Box2d is in C nowadays that really makes it clear. Great job!

I saw a small typo:

    // wide float
    typedef b2FloatW __m128;
The `typedef` is backwards, the alias and the underlying type name are in the wrong order and need to be swapped around.
By @openasocket - 5 months
They didn't really explain why graph coloring can be done quickly in this case. I'm not familiar with graph coloring theory, but I know the general graph coloring problems are NP-complete. I'm guessing the reason you can use a greedy algorithm to get a decent coloring quickly is because you know the graph is planar, and because you don't actually care about getting a minimal coloring.
By @intalentive - 5 months
Is the AVX2 implementation using the full 256-bit register width? If so, I am surprised that it is only 14% faster than SSE. If not, I would like to see how the results compare with the rest of the tests.
By @xoranth - 5 months
General questions for gamedevs here. How useful is SIMD given that now we have compute shaders on the GPU? If so, what workloads still require SIMD/why would you choose one over the other?
By @noelwelsh - 5 months
> This is a bit ironic because on x64 all math is SIMD math, it is just inefficient SIMD math.

I don't understand this statement at the end of the article? Can anyone explain? TIA.

By @CreepGin - 5 months
I really appreciate the write-up and showing the most important code snippets.

Intuitively, this feels like a narrower version of using Z-order curve.

By @ccppurcell - 5 months
This is a total longshot but my PhD and research has been in graph algorithms including colouring, and I'm looking for a career change. If anyone has any advice at all (roles or companies that use these skills, tools to learn, etc) I'd be very interested.

Caveats: my knowledge is mostly theoretical (eg proving np-hardness or algorithm existence results) but I'm very good at thinking algorithmically. I have only hobbyist programming skills but I am a fast learner. Thanks!

By @baq - 5 months
> The Apple M2 smokes!

I was about this surprised when I made a jupyter notebook with a few gigs of numbers shuffled around and xgboosted and after I was done prototyping on an M1 Air and ran it on my serious box (a 12700k) it was actually slower, and noticeably.

By @vkazanov - 5 months
The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.

And for 80% of the cases by the point there is enough vectorizable data for a programmer to look into simd, a gpu can provide 1000%+ of perf AND a certain level of portability.

So right now simd is a niche tool for super low-level things: certain decompression algos, bits of math here and there, solvers, etc.

And it also takes a lot of space on your cpu die. Like, A LOT.