Building a data compression utility in Haskell using Huffman codes
Building a data compression utility in Haskell with Huffman codes. Explains Huffman codes, prefix-free codes, creating codes with binary trees, encoding/decoding text, laziness, encoding binary data, and serializing compressed output.
Read original articleThis post discusses building a data compression utility in Haskell using Huffman codes. It explains the concept of Huffman codes, which map characters to unique bit sequences for compression. The post details how to create prefix-free codes to avoid ambiguity during decoding. It then outlines the process of creating prefix-free codes using a complete binary tree, emphasizing shorter codes for more frequent characters. The post provides code snippets for encoding and decoding text using Huffman codes in Haskell, highlighting the importance of laziness for memory efficiency. It also touches on encoding binary data using the same approach and serializing the compressed output with the frequency map for decoding. The post concludes by showcasing how to serialize the compressed data into bytes for storage in a binary file.
Related
Compressing graphs and indexes with recursive graph bisection (2016)
Graph reordering is applied to boost compression in graphs and indexes. A new compression-friendly technique using recursive graph bisection shows improved rates for large-scale graphs, enhancing storage and processing efficiency.
A working entropy encoder with output below the Shannon limit
The GitHub repository features Valli encoding, an entropy encoder aiming for compression below the Shannon limit. It includes a C++ implementation, algorithm explanations, code instructions, and project support. Visit the repository for more.
CPS in Hoot
The Hoot Scheme-to-Wasm compiler uses CPS transformation to handle push calls in WebAssembly. Strategies like generic slicing and CPS transformation are explored, enabling features like Fibers and promise integration. Performance impact is under evaluation.
Beyond monospace: the search for the perfect coding font
Designing coding fonts involves more than monospacing. Key considerations include hyphens resembling minus signs, aligning symbols, distinguishing zero from O, and ensuring clarity for developers and type designers. Testing with proofing strings is recommended.
Weekend projects: getting silly with C
The C programming language's simplicity and expressiveness, despite quirks, influence other languages. Unconventional code structures showcase creativity and flexibility, promoting unique coding practices. Subscription for related content is encouraged.
I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.
While the tree approach is intuitive and illuminating, it probably makes more sense to work with in-place arrays, since the situations when you care most about compression are probably the situations when you have a lot of data and want to run fast.
In-Place Calculation of Minimum-Redundancy Codes
Moffat, Katajainen. 1995.
http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf
Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambigous, and a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be
a 1
b 00
c 10
While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.https://www.coursera.org/learn/scala-functional-programming?...
Also, the microprogram had to deal with branch prediction because it was a non-superscalar pipelined processor model. Guess the wrong branch, and enjoy wasting cycles on a pipeline stall while the correct branch filters forward.
Otherwise a great read, thanks!
Direct link: https://lazamar.github.io/images/data-compressor.svg
The only reason Huffman codes are used is they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.
But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.
If Haskell sucks like all languages it’s because Haskell excels at using computers to compute something: Haskell considers data shuffling a strictly secondary concern compared to doing actual computations.
Related
Compressing graphs and indexes with recursive graph bisection (2016)
Graph reordering is applied to boost compression in graphs and indexes. A new compression-friendly technique using recursive graph bisection shows improved rates for large-scale graphs, enhancing storage and processing efficiency.
A working entropy encoder with output below the Shannon limit
The GitHub repository features Valli encoding, an entropy encoder aiming for compression below the Shannon limit. It includes a C++ implementation, algorithm explanations, code instructions, and project support. Visit the repository for more.
CPS in Hoot
The Hoot Scheme-to-Wasm compiler uses CPS transformation to handle push calls in WebAssembly. Strategies like generic slicing and CPS transformation are explored, enabling features like Fibers and promise integration. Performance impact is under evaluation.
Beyond monospace: the search for the perfect coding font
Designing coding fonts involves more than monospacing. Key considerations include hyphens resembling minus signs, aligning symbols, distinguishing zero from O, and ensuring clarity for developers and type designers. Testing with proofing strings is recommended.
Weekend projects: getting silly with C
The C programming language's simplicity and expressiveness, despite quirks, influence other languages. Unconventional code structures showcase creativity and flexibility, promoting unique coding practices. Subscription for related content is encouraged.