July 20th, 2024

Memory and ILP handling in 2D convolutions

A 2D convolution extracts image features by applying a filter to a signal that has been discretized into a tensor. Cross-correlation, which omits the filter flip, gives the same result for symmetric filters. Memory-access optimization and SIMD instructions enhance efficiency when processing MNIST images.

Read original article

A convolution operation in 2D processes images by applying a filter to extract features. Continuous signals are first sampled and quantized into discrete values represented as tensors. Each element of the output tensor is a sum of input values weighted by the filter, with the summation bounded by the filter size. Cross-correlation, a closely related operation that omits the filter flip, yields the same result when the filter is symmetric. The implementation focuses on optimizing memory access and utilizing SIMD vector instructions for efficient computation; the code runs on the author's hardware and processes a batch of images from the MNIST dataset. The filter is initialized from specific distributions for effective feature extraction, and the inner loops are vectorized, with filter weights hoisted into stack variables for efficient processing.
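
To make the memory and ILP idea concrete, here is a minimal sketch in C of a direct 2D cross-correlation (the no-flip variant most ML code actually implements), not the author's exact kernel; the function name and dimensions are illustrative assumptions. Looping over filter taps outermost turns the innermost loop into a contiguous streaming multiply-add, and the hoisted weight sits in a register:

    /* Direct 2D cross-correlation, valid padding:
       out is (h-kh+1) x (w-kw+1), row-major. */
    void conv2d(const float *in, int h, int w,
                const float *k, int kh, int kw, float *out) {
        int oh = h - kh + 1, ow = w - kw + 1;
        for (int i = 0; i < oh * ow; i++)
            out[i] = 0.0f;                 /* clear accumulators */
        for (int i = 0; i < kh; i++)
            for (int j = 0; j < kw; j++) {
                float kij = k[i * kw + j]; /* weight hoisted to a stack variable */
                for (int y = 0; y < oh; y++)
                    for (int x = 0; x < ow; x++)  /* unit stride: auto-vectorizes */
                        out[y * ow + x] += kij * in[(y + i) * w + (x + j)];
            }
    }

A true convolution would index the filter as k[(kh-1-i)*kw + (kw-1-j)]; when the filter is symmetric the two operations coincide, which is why cross-correlation can stand in for it.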

Related

Show HN: UNet diffusion model in pure CUDA

The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.

Convolutions, Fast Fourier Transform and Polynomials

Alvaro Revuelta explains efficient polynomial multiplication using convolution, the Fast Fourier Transform (FFT), and Python. The FFT reduces the complexity from $O(n^2)$ to $O(n \log n)$, showcasing significant efficiency gains for high-degree polynomials.
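
As a rough illustration of the technique (a sketch in C rather than the post's Python, not Revuelta's code; the names and power-of-two padding are the usual assumptions), polynomial multiplication is a convolution of coefficient vectors, and the FFT turns that convolution into a pointwise product:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Recursive radix-2 Cooley-Tukey FFT; n must be a power of two.
       sign = -1 for the forward transform, +1 for the inverse (unscaled). */
    static void fft(double complex *a, int n, int sign) {
        if (n == 1) return;
        double complex even[n / 2], odd[n / 2];
        for (int i = 0; i < n / 2; i++) {
            even[i] = a[2 * i];
            odd[i]  = a[2 * i + 1];
        }
        fft(even, n / 2, sign);
        fft(odd,  n / 2, sign);
        const double pi = acos(-1.0);
        for (int i = 0; i < n / 2; i++) {
            double complex w = cexp(sign * 2.0 * pi * I * i / n);
            a[i]         = even[i] + w * odd[i];
            a[i + n / 2] = even[i] - w * odd[i];
        }
    }

    /* r gets the np + nq - 1 coefficients of p * q. */
    void poly_mul(const double *p, int np, const double *q, int nq, double *r) {
        int n = 1;
        while (n < np + nq - 1) n <<= 1;   /* pad to a power of two */
        double complex fa[n], fb[n];
        for (int i = 0; i < n; i++) {
            fa[i] = i < np ? p[i] : 0.0;
            fb[i] = i < nq ? q[i] : 0.0;
        }
        fft(fa, n, -1);
        fft(fb, n, -1);
        for (int i = 0; i < n; i++) fa[i] *= fb[i];  /* pointwise product */
        fft(fa, n, +1);
        for (int i = 0; i < np + nq - 1; i++)
            r[i] = creal(fa[i]) / n;       /* scale the unscaled inverse */
    }

    int main(void) {
        double p[] = {1, 2}, q[] = {3, 4}, r[3];  /* (1+2x)(3+4x) */
        poly_mul(p, 2, q, 2, r);
        printf("%.1f %.1f %.1f\n", r[0], r[1], r[2]);  /* 3.0 10.0 8.0 */
    }

The two forward transforms and the inverse cost $O(n \log n)$ and the pointwise product $O(n)$, versus $O(n^2)$ for the schoolbook double loop.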

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing NumPy's OpenBLAS backend on an AMD Ryzen 7700 CPU. The code is scalable and portable, parallelized with OpenMP, and targets Intel Core and AMD Zen CPUs. The post discusses BLAS, CPU performance limits, and hints at future GPU optimization.
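
The usual core of such posts is cache blocking plus OpenMP; the sketch below shows that idea in plain C (not Salykov's actual kernels, which add explicit SIMD intrinsics and register tiling; the block size is an assumption to tune per cache level):

    #define BS 64  /* block edge, a tunable assumption */

    /* Blocked C += A * B for row-major n x n float matrices.
       Each thread owns distinct (ii, jj) tiles of C, so there are no races. */
    void matmul_blocked(int n, const float *A, const float *B, float *C) {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            float a = A[i * n + k];  /* reused across the j loop */
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Compile with something like cc -O3 -fopenmp; blocking keeps tiles of A, B, and C resident in cache so arithmetic, not memory traffic, becomes the bottleneck.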

The Fourier Transform: What's Wrong with It?

The Fourier Transform is a versatile tool for signal analysis, converting functions of time into functions of frequency. Practical applications face challenges such as accuracy limits and the impact of data windowing. Understanding these limitations is crucial for obtaining meaningful results in engineering.
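
Windowing is the most concrete of those issues: truncating a signal to N samples implicitly multiplies it by a rectangular window, which leaks energy across frequencies. A common mitigation, sketched here in C (function name and in-place convention are my assumptions; n > 1), is to taper the data with a Hann window before transforming:

    #include <math.h>

    /* Apply a Hann window in place: x[i] *= 0.5 * (1 - cos(2*pi*i/(n-1))). */
    void hann_window(double *x, int n) {
        const double pi = acos(-1.0);
        for (int i = 0; i < n; i++)
            x[i] *= 0.5 * (1.0 - cos(2.0 * pi * i / (n - 1)));
    }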

Show HN: I created a Neural Network from scratch, in scratch

The article discusses implementing a one-layer feed-forward network in Scratch for image classification. Handling multi-dimensional data in Scratch proved challenging, but training on a limited subset of the MNIST dataset produced promising results.
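
For scale: the forward pass of such a one-layer classifier is a matrix-vector product followed by an argmax. Here is a minimal C sketch of the same computation (dimensions for 28x28 MNIST; every name is an assumption, and the Scratch version builds this out of blocks instead):

    #include <math.h>

    #define IN  784  /* 28 x 28 pixels */
    #define OUT 10   /* digit classes */

    /* logits = W x + b; softmax is monotonic, so comparing raw
       logits yields the same argmax as comparing probabilities. */
    int predict(const float x[IN], const float W[OUT][IN], const float b[OUT]) {
        float best = -INFINITY;
        int arg = 0;
        for (int o = 0; o < OUT; o++) {
            float z = b[o];
            for (int i = 0; i < IN; i++)
                z += W[o][i] * x[i];
            if (z > best) { best = z; arg = o; }
        }
        return arg;
    }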

5 comments
By @mratsim - 7 months
Years ago I started a collection of convolution optimization resources: https://github.com/mratsim/laser/wiki/Convolution-optimisati...

Also checked and apparently Nvidia Cutlass now supports generic convolutions: https://github.com/NVIDIA/cutlass

By @epistasis - 7 months
Interesting article, thanks, IMHO mostly for the low level performance analysis.

When it comes to actual computation of convolutions, the fast Fourier transform should at least be mentioned, even if in passing. Early in grad school I peeked at the source for R's density() function, and was blown away that it was using FFT, and that I had not picked up that trick in my math classes (or maybe I had just forgotten it...)

For a 2d example:

https://stackoverflow.com/questions/50453981/implement-2d-co...

And a recent HN thread that was very good:

https://news.ycombinator.com/item?id=40840396

By @imtringued - 7 months
As cool as this is, I can't help but think how pointless the goal itself is.

XDNA 2 will have 12 TFLOPS, roughly matching the 96-core Threadripper Pro 7995WX at a much lower price point.

By @toxik - 7 months
ILP is instruction-level parallelism, if, like me, you had a hard time remembering.