July 5th, 2024

Scan HTML faster with SIMD instructions: .NET/C# Edition

WebKit and Chromium now scan HTML content with fast SIMD routines. .NET 8 supports fast SIMD instructions in C#, reaching speeds comparable to C/C++ implementations.

WebKit and Chromium recently adopted fast SIMD routines for scanning HTML content. Using vectorized classification, these engines locate specific characters such as <, &, \r, and \0 in blocks of text with the SIMD instructions available on modern processors. A C/C++ implementation reaches 7 GB/s on an Apple MacBook, and the .NET 8 runtime now exposes fast SIMD instructions to C# as well. In a comparison between a conventional C# function and a SIMD-optimized version, the SIMD version was more than 4 times faster. The SIMD function reached 6.2 GB/s on ARM NEON processors and 7.5 GB/s on an Intel Ice Lake system with AVX2, matching the C/C++ implementations. To reach this speed in C#, the code must be written so that the compiler inlines the scanning function. Overall, SIMD instructions in .NET/C# can yield substantial performance gains, which makes the optimization worthwhile for developers.
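To make the comparison concrete, here is a minimal sketch of a conventional scan next to a SIMD scan, assuming .NET 7 or later and the portable Vector128 API. The class and method names (HtmlScan, IndexOfSpecialScalar, IndexOfSpecialSimd) are hypothetical and not taken from the original article. The sketch also uses four plain equality compares per block rather than the table-lookup classification the article describes, so it illustrates the idea only and will not necessarily reproduce the quoted throughput.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class HtmlScan
{
    // The four characters the scanner looks for: <, &, \r, and \0.
    private static bool IsSpecial(byte c) =>
        c == (byte)'<' || c == (byte)'&' || c == (byte)'\r' || c == 0;

    // Conventional version: one byte per iteration.
    // Returns the index of the first special byte, or data.Length if none.
    public static int IndexOfSpecialScalar(ReadOnlySpan<byte> data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (IsSpecial(data[i])) return i;
        }
        return data.Length;
    }

    // SIMD version: 16 bytes per iteration, with a scalar tail.
    public static int IndexOfSpecialSimd(ReadOnlySpan<byte> data)
    {
        int i = 0;
        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<byte> lt = Vector128.Create((byte)'<');
            Vector128<byte> amp = Vector128.Create((byte)'&');
            Vector128<byte> cr = Vector128.Create((byte)'\r');
            Vector128<byte> nul = Vector128<byte>.Zero;
            int width = Vector128<byte>.Count; // 16 bytes

            for (; i + width <= data.Length; i += width)
            {
                Vector128<byte> block = Vector128.Create(data.Slice(i, width));
                // Each matching lane becomes 0xFF; all other lanes stay 0.
                Vector128<byte> hits =
                    Vector128.Equals(block, lt) | Vector128.Equals(block, amp) |
                    Vector128.Equals(block, cr) | Vector128.Equals(block, nul);
                // One bit per lane; the lowest set bit is the first match.
                uint mask = hits.ExtractMostSignificantBits();
                if (mask != 0) return i + BitOperations.TrailingZeroCount(mask);
            }
        }
        // Scalar tail (or the whole input when SIMD is unavailable).
        for (; i < data.Length; i++)
        {
            if (IsSpecial(data[i])) return i;
        }
        return data.Length;
    }
}
```

Both functions return the same index for the same input, so the scalar version can serve as a reference when testing or benchmarking the SIMD one.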

Related

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Why Google Sheets ported its calculation worker from JavaScript to WasmGC

Google Sheets transitioned its calculation worker to WasmGC from JavaScript for improved performance. Collaboration between Sheets and Chrome teams led to optimizations, overcoming challenges for near-native speed on the web.

Using SIMD for Parallel Processing in Rust

SIMD is vital for performance in Rust. Options include auto-vectorization, platform-specific intrinsics, and the std::simd module. Balancing performance, portability, and ease of use is key. Leveraging auto-vectorization and intrinsics optimizes Rust projects for high-performance computing, multimedia, systems programming, and cryptography.

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog post explores high-performance matrix multiplication in C that outperforms NumPy (backed by OpenBLAS) on an AMD Ryzen 7700 CPU. The code is scalable and portable, uses OpenMP, and targets Intel Core and AMD Zen CPUs. The post also discusses BLAS and CPU performance limits, and hints at GPU optimization.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

1 comment
By @smnc - 5 months
> As an optimization, it is helpful to use a local variable for the reference to the first pointer. Doing so improves the performance substantially: C# is not happy when we repeatedly modify a reference. Thus, at the start of the function, you may set byte* mystart = start, use mystart throughout, and then, just before a return, you set start = mystart.

Did not expect this.
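
The pattern quoted in the comment above can be sketched roughly as follows. This is a scalar illustration only: the class and method names (PointerScan, AdvanceToSpecial) are hypothetical, the end parameter is an assumption about how the bounds are passed, and the code must be compiled with unsafe blocks enabled.

```csharp
public static unsafe class PointerScan
{
    // Advances 'start' to the first <, &, \r, or \0 byte, or to 'end' if none is found.
    public static void AdvanceToSpecial(ref byte* start, byte* end)
    {
        // Work on a local copy so the loop never writes through the 'ref' parameter.
        byte* mystart = start;
        while (mystart < end)
        {
            byte c = *mystart;
            if (c == (byte)'<' || c == (byte)'&' || c == (byte)'\r' || c == 0)
            {
                start = mystart; // publish the result just before returning
                return;
            }
            mystart++;
        }
        start = mystart; // nothing found: 'start' now equals 'end'
    }
}
```

The only writes to the ref parameter happen immediately before each return, which is the point the quoted passage makes.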