August 12th, 2024

Spice: Fine-grained parallelism with sub-nanosecond overhead in Zig

Spice is a Zig-based parallelism framework offering sub-nanosecond overhead and contention-free operation. It is still a research project with limited testing and documentation, so caution is advised for production use.

Spice is a parallelism framework written in the Zig programming language, designed to enable efficient parallel execution with minimal overhead. Its primary goal is to let developers add parallelism to their functions without significant performance penalties, achieving sub-nanosecond overhead per call. A key feature is contention-free operation: threads never compete for the same tasks, so performance holds up even with a high number of threads.

Benchmarks indicate that Spice excels in scenarios involving very fast operations, such as summing the nodes of a binary tree, and that it shows lower overhead than comparable frameworks like Rayon in Rust. For work distribution it uses a heartbeat scheduling mechanism, which keeps overhead low by scheduling only at infrequent, periodic intervals.

However, Spice is still a research project with notable limitations: insufficient testing and documentation, no built-in support for arrays or slices, and potential issues from improper usage. The documentation includes an example that uses Spice to sum the values in a binary tree, illustrating its task-management API. While Spice presents an innovative approach to parallelism in Zig, users should weigh these limitations before considering it for production applications.

- Spice provides efficient parallelism with sub-nanosecond overhead.

- It avoids common parallelism issues like contention and inefficient work-stealing.

- The framework is still in the research phase and lacks comprehensive testing.

- Users should be cautious of its limitations when considering production use.

- An example usage is provided for summing values in a binary tree (sketched below).
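
The binary-tree example reads roughly like this (a paraphrase of the README's sketch, not verbatim; the names spice.Task, spice.Future, fork, join, and t.call follow the README's description, but check the repo for the exact current API):

    const spice = @import("spice"); // assumed module name

    const Node = struct {
        val: i64,
        left: ?*Node = null,
        right: ?*Node = null,
    };

    // Each potentially-parallel function takes a *spice.Task as its first
    // argument. fork() only publishes the right subtree; whether another
    // thread actually picks it up depends on the heartbeat, and join()
    // reports which case happened.
    fn sum(t: *spice.Task, node: *Node) i64 {
        var res: i64 = node.val;
        if (node.left) |left| {
            if (node.right) |right| {
                var fut = spice.Future(*Node, i64).init();
                fut.fork(t, sum, right); // may run on another worker
                res += t.call(i64, sum, left); // always runs here
                if (fut.join(t)) |right_sum| {
                    res += right_sum; // a worker did the right side
                } else {
                    res += t.call(i64, sum, right); // nobody took it; run inline
                }
                return res;
            }
            return res + t.call(i64, sum, left);
        }
        if (node.right) |right| res += t.call(i64, sum, right);
        return res;
    }

The entry point then goes through Spice's thread pool (something like thread_pool.call(i64, sum, root)) rather than invoking sum directly.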

AI: What people are saying
The comments on the Spice framework reveal a mix of insights and critiques regarding its implementation and documentation.
  • Some users appreciate the research behind the framework, particularly the concept of heartbeat scheduling.
  • Concerns are raised about the claim of "sub-nanosecond overhead," with some labeling it as misleading marketing.
  • Users find the documentation and README helpful, though some areas remain unclear.
  • Links to related research papers and limitations of the project are shared, indicating ongoing interest in its development.
  • There is a distinction made between this framework and other projects, such as SpiceDB.
13 comments
By @shwestrick - 8 months
For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.

Related papers:

(2018) Heartbeat Scheduling: Provable Efficiency for Nested Parallelism. https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

(2021) Task Parallel Assembly Language for Uncompromising Parallelism. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Compiling Loop-Based Nested Parallelism for Irregular Workloads. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Automatic Parallelism Management. https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf
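
To give a feel for the amortization argument, here is a toy model (my own illustration, nothing to do with Spice's actual code): every call pays only a counter check, and the expensive promotion step runs at most once per heartbeat interval:

    const std = @import("std");

    const Node = struct {
        val: i64,
        left: ?*const Node = null,
        right: ?*const Node = null,
    };

    var calls: u64 = 0;
    var promotions: u64 = 0;
    const heartbeat: u64 = 1000; // promote at most once per 1000 calls

    fn sum(node: ?*const Node) i64 {
        const n = node orelse return 0;
        calls += 1; // common path: one increment and one branch
        if (calls % heartbeat == 0) {
            // Rare path: in a real scheduler this is where the oldest
            // pending fork would be handed to an idle worker thread.
            promotions += 1;
        }
        return n.val + sum(n.left) + sum(n.right);
    }

    pub fn main() void {
        const a = Node{ .val = 1 };
        const b = Node{ .val = 2 };
        const root = Node{ .val = 3, .left = &a, .right = &b };
        std.debug.print("sum={d} promotions={d}\n", .{ sum(&root), promotions });
    }

The point is that the per-call overhead is constant and tiny, while the cost of creating real parallelism is spread over heartbeat-many calls.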

By @nirushiv - 8 months
I haven’t read through the code in detail, but I can tell you “sub-nanosecond overhead” is misleading marketing fluff. On first look, the measure seems to be some convoluted “time per thing” where the number of threads is far, far smaller than the number of “thing”s.
By @akovaski - 8 months
I'm not terribly familiar with this space, but I do like the concurrency model presented here.

I think the README here is very well written, and I have a good idea of what's going on just from reading it, but there are a few areas where I'm left scratching my head. Thankfully the code is fairly easy to read.

By @lcof - 8 months
Interesting research work! Besides the code itself, there is some good reasoning, and the documentation is well written.

The 2018 paper on heartbeat scheduling is also an interesting read https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

By @geertj - 8 months
Per the description, this uses busy waiting in the workers to get to nanosecond-level latencies. I wonder if anyone has a perspective on how realistic busy waiting is in large applications with tens of thousands of tasks? Maybe it works if the tasks are async (i.e. not thread-based), so that you only have N waiters, where N is the size of the executor’s thread pool. In any case, the energy consumption of such an architecture would be higher.

Related: I’ve been interested for a while in whether there’s a faster way for a producer of work to have a consumer wake up without resorting to busy waiting, possibly by running the consumer in the producer’s time slice.

Also related: I’ve wondered if it’s possible to have a user-space FUTEX_WAKE operation that would halve the typical penalty of waking up a consumer (to just the consumer side).
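
For reference, the usual middle ground is spin-then-park: busy-wait for a bounded number of iterations to catch the fast case, then fall back to a futex sleep. A sketch using Zig's std.Thread.Futex (API as of roughly Zig 0.12; names have moved around between releases, so treat it as illustrative):

    const std = @import("std");

    // One shared word: 0 = no work yet, 1 = work published.
    var state = std.atomic.Value(u32).init(0);

    fn consumer() void {
        var spins: u32 = 0;
        while (state.load(.acquire) == 0) {
            spins += 1;
            if (spins < 1000) {
                std.atomic.spinLoopHint(); // fast path: stay hot, low latency
            } else {
                // Slow path: sleep in the kernel until the producer wakes us.
                // wait() only blocks if state still equals 0, so the wakeup
                // can't be lost between the check and the sleep.
                std.Thread.Futex.wait(&state, 0);
            }
        }
        std.debug.print("consumer woke after {d} spins\n", .{spins});
    }

    pub fn main() !void {
        const t = try std.Thread.spawn(.{}, consumer, .{});
        std.time.sleep(std.time.ns_per_ms); // pretend to prepare work
        state.store(1, .release); // publish the work
        std.Thread.Futex.wake(&state, 1); // wake at most one waiter
        t.join();
    }

The spin bound is the knob: more spinning buys lower wakeup latency at exactly the energy cost you mention.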

By @gyrovagueGeist - 8 months
This is neat and links to some great papers. I wish the comparison was with OpenMP tasks, though; I’ve heard Rayon has a reputation for being a bit slow.
By @raggi - 8 months
cooperative scheduling is the basis for so many patterns with great metrics :)
By @assafe - 8 months
This is great!
By @pgt - 8 months
Not to be confused with SpiceDB by AuthZed: https://authzed.com/spicedb