July 4th, 2024

Building a data compression utility in Haskell using Huffman codes

Building a data compression utility in Haskell with Huffman codes. Explains Huffman codes, prefix-free codes, creating codes with binary trees, encoding/decoding text, laziness, encoding binary data, and serializing compressed output.

This post walks through building a data compression utility in Haskell using Huffman codes, which map characters to bit sequences so that more frequent characters get shorter codes. It explains why the codes must be prefix-free to avoid ambiguity during decoding, and how to construct such codes from a binary tree built out of a character frequency map. The post provides Haskell code for encoding and decoding text, highlighting how laziness keeps memory usage low, then extends the same approach to arbitrary binary data. Finally, it shows how to serialize the compressed bits, together with the frequency map needed for decoding, into bytes for storage in a binary file.
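
For readers who want a feel for the construction before clicking through, a rough sketch of the tree-based approach might look like the following (an illustrative sketch only, not the article's code; all names here are made up):

    import qualified Data.Map as Map
    import Data.List (insertBy, sortOn)
    import Data.Ord (comparing)

    -- A Huffman tree is a leaf per character or a merge of two subtrees,
    -- each carrying its total weight (frequency).
    data HTree = Leaf Int Char | Node Int HTree HTree

    weight :: HTree -> Int
    weight (Leaf w _)   = w
    weight (Node w _ _) = w

    -- Count character frequencies, then repeatedly merge the two
    -- lightest trees until a single tree remains.
    buildTree :: String -> Maybe HTree
    buildTree text = go (sortOn weight leaves)
      where
        freqs  = Map.toList (Map.fromListWith (+) [(c, 1) | c <- text])
        leaves = [Leaf w c | (c, w) <- freqs]
        go []           = Nothing
        go [t]          = Just t
        go (t1:t2:rest) =
          go (insertBy (comparing weight)
                       (Node (weight t1 + weight t2) t1 t2) rest)

    -- Read a code off the tree: '0' for a left branch, '1' for a right one.
    codes :: HTree -> [(Char, String)]
    codes (Leaf _ c) = [(c, "0")]  -- degenerate one-symbol input
    codes tree       = go "" tree
      where
        go path (Leaf _ c)   = [(c, reverse path)]
        go path (Node _ l r) = go ('0':path) l ++ go ('1':path) r

A priority queue would be the usual choice for the repeated merge; the sorted-list insertion above just keeps the sketch short.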

16 comments
By @mrkeen - 3 months
There exists an array-based, in-place algorithm for this, reducing the need to allocate trees and chase pointers.

I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.

While the tree approach is intuitive and illuminating, it probably makes more sense to work with in-place arrays, since the situations when you care most about compression are probably the situations when you have a lot of data and want to run fast.

  In-Place Calculation of Minimum-Redundancy Codes
  Moffat, Katajainen.  1995.
  http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf
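
For a sense of what that looks like, here is a sketch of the paper's three phases in Haskell over a single mutable unboxed array (my own transcription of the published pseudocode, not code from the paper or the article, and only lightly checked):

    import Control.Monad (forM_)
    import Data.Array.ST (newListArray, readArray, runSTUArray, writeArray)
    import Data.Array.Unboxed (elems)

    -- In-place code-length calculation in the style of Moffat & Katajainen.
    -- Input: at least two frequencies, sorted in non-decreasing order.
    -- Output: one code length per symbol, in non-increasing order.
    codeLengths :: [Int] -> [Int]
    codeLengths freqs = elems $ runSTUArray $ do
      let n = length freqs
      a <- newListArray (0, n - 1) freqs

      -- Take the cheaper of the next unused leaf (index s) and the next
      -- unused internal node (index r); a consumed internal slot is
      -- overwritten with the index of its parent t.
      let pick s r t = do
            takeInternal <-
              if s > n - 1
                then pure True
                else if r < t
                       then (<) <$> readArray a r <*> readArray a s
                       else pure False
            if takeInternal
              then do w <- readArray a r
                      writeArray a r t
                      pure (w, s, r + 1)
              else do w <- readArray a s
                      pure (w, s + 1, r)

      -- Phase 1: build the tree implicitly; slot t receives the weight
      -- of internal node t.
      let phase1 s r t
            | t > n - 2 = pure ()
            | otherwise = do
                (w1, s1, r1) <- pick s r t
                (w2, s2, r2) <- pick s1 r1 t
                writeArray a t (w1 + w2)
                phase1 s2 r2 (t + 1)
      phase1 0 0 0

      -- Phase 2: overwrite parent pointers with internal-node depths,
      -- starting from the root at index n-2.
      writeArray a (n - 2) 0
      forM_ [n - 3, n - 4 .. 0] $ \t -> do
        parent <- readArray a t
        d      <- readArray a parent
        writeArray a t (d + 1)

      -- Phase 3: each internal node at depth d opens two slots at depth
      -- d+1; slots not taken by internal nodes are leaves, whose code
      -- lengths are written into the tail of the same array.
      let countInternal t u d
            | t < 0     = pure (t, u)
            | otherwise = do
                v <- readArray a t
                if v == d then countInternal (t - 1) (u + 1) d
                          else pure (t, u)
          fillLeaves x k d
            | k <= 0    = pure x
            | otherwise = writeArray a x d >> fillLeaves (x - 1) (k - 1) d
          phase3 avail d t x
            | avail <= 0 = pure ()
            | otherwise  = do
                (t', u) <- countInternal t 0 d
                x'      <- fillLeaves x (avail - u) d
                phase3 (2 * u) (d + 1) t' x'
      phase3 1 0 (n - 2) (n - 1)

      pure a

Frequencies go in sorted ascending and lengths come out sorted descending; the actual codewords can then be assigned canonically from the lengths alone, so no explicit tree ever exists.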
By @tromp - 3 months
> To make it unambiguous we must make sure that no code word is a prefix of another code word.

Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambiguous, and it is a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be

    a 1
    b 00
    c 10
While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.
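
To make the reverse-decoding argument concrete, a small sketch for that exact code (illustrative only): reversing the bit string turns the codewords into the prefix code 1, 00, 01, which decodes greedily, and the recovered symbols are then reversed back.

    -- Decode the suffix code a=1, b=00, c=10 by scanning the bits in
    -- reverse with the reversed codewords, then reversing the result.
    decodeSuffix :: String -> Maybe String
    decodeSuffix = fmap reverse . go . reverse
      where
        go []             = Just []
        go ('1':rest)     = ('a' :) <$> go rest
        go ('0':'0':rest) = ('b' :) <$> go rest
        go ('0':'1':rest) = ('c' :) <$> go rest
        go _              = Nothing

    -- decodeSuffix "10010" == Just "abc"   ("abc" encodes to 1 ++ 00 ++ 10)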
By @atlintots - 3 months
This is great! Are there any other similar tutorials that walk through writing a Haskell program, but with some more advanced features (monad transformers, lenses, etc.)?
By @goldfishgold - 3 months
Coursera’s functional programming course (in Scala) includes a pretty similar Huffman coding assignment with autograder if anybody wants to take a stab at it themselves.

https://www.coursera.org/learn/scala-functional-programming?...

By @banish-m4 - 3 months
Last time I used Huffman codes, it was to run a MICMAC processor macroprogram (assembly text) in the fewest microcycles while using the fewest microinstructions in the microprogram (microcode). Starting with a histogram of the macroinstructions executed (IIRC, I first wrote an interpreter in C to count how many of each were executed), I crafted a progressive-decoding microcode program to implement all of the required ISA macro-operations. IIRC, the macro instruction ISA I created was bit-granular instead of byte-oriented. In the real world, it would've been slow and inconvenient. What's nice about Huffman codes is that you can vary the prefix depth based on the distribution of values, so you don't end up with lopsided codes based on 1-bit prefixes.

Also, the microprogram had to deal with branch prediction because it was a non-superscalar pipelined processor model. Guess the wrong branch, and enjoy wasting cycles on a pipeline stall while the correct branch filters forward.

By @ykonstant - 3 months
Hey, since this is likely to attract Haskell programmers: how fast is Haskell these days for a programmer intent on writing optimized code? I am particularly interested in its performance for numerical crunching like matrix operations and other work that benefits from SIMD.
By @alwinaugustin - 3 months
Thanks for sharing. Very nice and insightful.
By @chvrchbvrner - 3 months
I think there is a typo in the table of the "Creating prefix-free codes" section. D should be '0010' (not '0110').

Otherwise a great read, thanks!

By @lo_zamoyski - 3 months
Btw, what is that on the girl's shirt in the image?

Direct link: https://lazamar.github.io/images/data-compressor.svg

By @londons_explore - 3 months
For all readers: arithmetic codes are better in nearly all ways. They can be implemented with less RAM and code, they compress to a better ratio, and the probabilities of different symbols can be updated dynamically during the stream far more easily.

The only reason Huffman codes are still used is that they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.
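
To illustrate the difference, here is a toy sketch of arithmetic coding using exact Rationals (a real coder uses fixed-precision integers with renormalisation; this only shows the interval-narrowing idea): each symbol shrinks the current interval in proportion to its probability, so a skewed model can spend well under one bit per symbol, which a Huffman code, forced to a whole number of bits per symbol, cannot do.

    import Data.List (find)
    import Data.Ratio ((%))

    -- A two-symbol model: P(a) = 4/5, P(b) = 1/5, as cumulative intervals.
    model :: [(Char, (Rational, Rational))]
    model = [('a', (0, 4 % 5)), ('b', (4 % 5, 1))]

    -- Encoding narrows [0,1) once per symbol, in proportion to its
    -- probability; any number in the final interval identifies the message.
    encode :: String -> Rational
    encode = go 0 1
      where
        go lo _  []     = lo
        go lo hi (c:cs) =
          case lookup c model of
            Just (l, h) -> let w = hi - lo in go (lo + w * l) (lo + w * h) cs
            Nothing     -> error "symbol not in model"

    -- Decoding replays the same narrowing; it needs to know how many
    -- symbols to emit (a real coder uses an end-of-stream symbol instead).
    decode :: Int -> Rational -> String
    decode 0 _ = []
    decode k x =
      case find (\(_, (l, h)) -> l <= x && x < h) model of
        Just (c, (l, h)) -> c : decode (k - 1) ((x - l) / (h - l))
        Nothing          -> []

    -- decode 4 (encode "aaab") == "aaab"

With P(a) = 0.8, the entropy is about 0.72 bits per symbol, while a Huffman code for this two-symbol alphabet cannot do better than 1 bit per symbol.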

By @lynx23 - 3 months
Very nice read, thanks for sharing!
By @bdahz - 3 months
How is the performance when compared to similar implementations in C/C++ or Rust?
By @benreesman - 3 months
Haskell is a really nice language. In general I don’t identify as an X programmer for any value of X: I tend to write in a half dozen languages daily and they all suck in their own special way.

But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.

If Haskell sucks like all languages, it's because Haskell excels at using computers to compute something: Haskell considers data shuffling a strictly secondary concern compared to doing actual computations.

By @2-3-7-43-1807 - 3 months
And yet another episode in the series of "look what I did with Haskell".