August 12th, 2024

Spice: Fine-grained parallelism with sub-nanosecond overhead in Zig

Spice is a Zig-based parallelism framework offering sub-nanosecond overhead and contention-free operation. It is still a research project with limited testing and documentation, so caution is advised for production use.

Spice is a parallelism framework written in the Zig programming language, designed to enable efficient parallel execution with minimal overhead. Its primary goal is to let developers add parallelism to their functions without significant performance penalties, achieving sub-nanosecond overhead per call. A key feature is contention-free operation: threads never compete for the same tasks, so performance holds up even with a high number of threads.

Benchmarks indicate that Spice excels in scenarios involving very fast operations, such as summing the nodes of a binary tree, and that it shows lower overhead than comparable frameworks like Rayon in Rust. For work distribution it uses a heartbeat scheduling mechanism, which keeps overhead low by scheduling only at infrequent, periodic intervals.

However, Spice is still a research project with notable limitations: insufficient testing and documentation, no built-in support for arrays or slices, and potential issues from improper usage. The documentation includes an example that uses Spice to sum the values in a binary tree, illustrating its task-management API. While Spice presents an innovative approach to parallelism in Zig, users should weigh these limitations before considering it for production applications.

- Spice provides efficient parallelism with sub-nanosecond overhead.

- It avoids common parallelism issues like contention and inefficient work-stealing.

- The framework is still in the research phase and lacks comprehensive testing.

- Users should be cautious of its limitations when considering production use.

- An example usage is provided for summing values in a binary tree (sketched below).
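
The binary-tree example reads roughly like this (a paraphrase of the README's sketch, not verbatim; the names spice.Task, spice.Future, fork, join, and t.call follow the README's description, but check the repo for the exact current API):

    const spice = @import("spice"); // assumed module name

    const Node = struct {
        val: i64,
        left: ?*Node = null,
        right: ?*Node = null,
    };

    // Each potentially-parallel function takes a *spice.Task as its first
    // argument. fork() only publishes the right subtree; whether another
    // thread actually picks it up depends on the heartbeat, and join()
    // reports which case happened.
    fn sum(t: *spice.Task, node: *Node) i64 {
        var res: i64 = node.val;
        if (node.left) |left| {
            if (node.right) |right| {
                var fut = spice.Future(*Node, i64).init();
                fut.fork(t, sum, right); // may run on another worker
                res += t.call(i64, sum, left); // always runs here
                if (fut.join(t)) |right_sum| {
                    res += right_sum; // a worker did the right side
                } else {
                    res += t.call(i64, sum, right); // nobody took it; run inline
                }
                return res;
            }
            return res + t.call(i64, sum, left);
        }
        if (node.right) |right| res += t.call(i64, sum, right);
        return res;
    }

The entry point then goes through Spice's thread pool (something like thread_pool.call(i64, sum, root)) rather than invoking sum directly.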

AI: What people are saying
The comments on the Spice framework reveal a mix of insights and critiques regarding its implementation and documentation.
  • Some users appreciate the research behind the framework, particularly the concept of heartbeat scheduling.
  • Concerns are raised about the claim of "sub-nanosecond overhead," with some labeling it as misleading marketing.
  • Users find the documentation and README helpful, though some areas remain unclear.
  • Links to related research papers and limitations of the project are shared, indicating ongoing interest in its development.
  • There is a distinction made between this framework and other projects, such as SpiceDB.
13 comments
By @shwestrick - 8 months
For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.

Related papers:

(2018) Heartbeat Scheduling: Provable Efficiency for Nested Parallelism. https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

(2021) Task Parallel Assembly Language for Uncompromising Parallelism. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Compiling Loop-Based Nested Parallelism for Irregular Workloads. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Automatic Parallelism Management. https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf
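
To give a feel for the amortization argument, here is a toy model (my own illustration, nothing to do with Spice's actual code): every call pays only a counter check, and the expensive promotion step runs at most once per heartbeat interval:

    const std = @import("std");

    const Node = struct {
        val: i64,
        left: ?*const Node = null,
        right: ?*const Node = null,
    };

    var calls: u64 = 0;
    var promotions: u64 = 0;
    const heartbeat: u64 = 1000; // promote at most once per 1000 calls

    fn sum(node: ?*const Node) i64 {
        const n = node orelse return 0;
        calls += 1; // common path: one increment and one branch
        if (calls % heartbeat == 0) {
            // Rare path: in a real scheduler this is where the oldest
            // pending fork would be handed to an idle worker thread.
            promotions += 1;
        }
        return n.val + sum(n.left) + sum(n.right);
    }

    pub fn main() void {
        const a = Node{ .val = 1 };
        const b = Node{ .val = 2 };
        const root = Node{ .val = 3, .left = &a, .right = &b };
        std.debug.print("sum={d} promotions={d}\n", .{ sum(&root), promotions });
    }

The point is that the per-call overhead is constant and tiny, while the cost of creating real parallelism is spread over heartbeat-many calls.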

By @nirushiv - 8 months
I haven’t read through the code in detail, but I can tell you “sub-nanosecond overhead” is misleading marketing fluff. On first look, the measure seems to be some convoluted “time per thing” where the number of threads is far, far smaller than the number of “thing”s.
By @akovaski - 8 months
I'm not terribly familiar with this space, but I do like the concurrency model presented here.

I think the README here is very well written, and I have a good idea of what's going on just from reading it, but there are a few areas where I'm left scratching my head. Thankfully the code is fairly easy to read.

By @lcof - 8 months
Interesting research work! Besides the code itself, there is some good reasoning, and the documentation is well written.

The 2018 paper on heartbeat scheduling is also an interesting read https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

By @geertj - 8 months
Per the description, this uses busy waiting in the workers to get to nanosecond-level latencies. I wonder if anyone has a perspective on how realistic busy waiting is in large applications with tens of thousands of tasks? Maybe it works if the tasks are async (i.e. not thread-based), so that you only have N waiters, where N is the size of the executor’s thread pool. In any case, the energy consumption of such an architecture would be higher.

Related: I’ve been interested for a while in whether there’s a faster way for a producer of work to have a consumer wake up without resorting to busy waiting, possibly by running the consumer in the producer’s time slice.

Also related: I’ve wondered if it’s possible to have a user-space FUTEX_WAKE operation that would halve the typical penalty of waking up a consumer (to just the consumer side).
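
For reference, the usual middle ground is spin-then-park: busy-wait for a bounded number of iterations to catch the fast case, then fall back to a futex sleep. A sketch using Zig's std.Thread.Futex (API as of roughly Zig 0.12; names have moved around between releases, so treat it as illustrative):

    const std = @import("std");

    // One shared word: 0 = no work yet, 1 = work published.
    var state = std.atomic.Value(u32).init(0);

    fn consumer() void {
        var spins: u32 = 0;
        while (state.load(.acquire) == 0) {
            spins += 1;
            if (spins < 1000) {
                std.atomic.spinLoopHint(); // fast path: stay hot, low latency
            } else {
                // Slow path: sleep in the kernel until the producer wakes us.
                // wait() only blocks if state still equals 0, so the wakeup
                // can't be lost between the check and the sleep.
                std.Thread.Futex.wait(&state, 0);
            }
        }
        std.debug.print("consumer woke after {d} spins\n", .{spins});
    }

    pub fn main() !void {
        const t = try std.Thread.spawn(.{}, consumer, .{});
        std.time.sleep(std.time.ns_per_ms); // pretend to prepare work
        state.store(1, .release); // publish the work
        std.Thread.Futex.wake(&state, 1); // wake at most one waiter
        t.join();
    }

The spin bound is the knob: more spinning buys lower wakeup latency at exactly the energy cost you mention.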

By @gyrovagueGeist - 8 months
This is neat and links to some great papers. I wish the comparison was with OpenMP tasks, though; I’ve heard Rayon has a reputation for being a bit slow.
By @raggi - 8 months
cooperative scheduling is the basis for so many patterns with great metrics :)
By @assafe - 8 months
This is great!
By @pgt - 8 months
Not to be confused with SpiceDB by AuthZed: https://authzed.com/spicedb