July 4th, 2024

Building a data compression utility in Haskell using Huffman codes

Building a data compression utility in Haskell with Huffman codes. Explains Huffman codes, prefix-free codes, creating codes with binary trees, encoding/decoding text, laziness, encoding binary data, and serializing compressed output.

Read original articleLink Icon
Building a data compression utility in Haskell using Huffman codes

This post discusses building a data compression utility in Haskell using Huffman codes. It explains the concept of Huffman codes, which map characters to unique bit sequences for compression. The post details how to create prefix-free codes to avoid ambiguity during decoding. It then outlines the process of creating prefix-free codes using a complete binary tree, emphasizing shorter codes for more frequent characters. The post provides code snippets for encoding and decoding text using Huffman codes in Haskell, highlighting the importance of laziness for memory efficiency. It also touches on encoding binary data using the same approach and serializing the compressed output with the frequency map for decoding. The post concludes by showcasing how to serialize the compressed data into bytes for storage in a binary file.

Link Icon 16 comments
By @mrkeen - 7 months
There exists an array-based, in-place algorithm for this, reducing the need to allocate trees and chase pointers.

I mention this only because, when I learned the tree-based approach at uni, I simply wasn't aware that there was another way to do it, and I'm wondering how many of you that's true for as well.

While the tree approach is intuitive and illuminating, it probably makes more sense to work with in-place arrays, since the situations when you care most about compression are probably the situations when you have a lot of data and want to run fast.

  In-Place Calculation of Minimum-Redundancy Codes
  Moffat, Katajainen.  1995.
  http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf
By @tromp - 7 months
> To make it unambiguous we must make sure that no code word is a prefix of another code word.

Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambigous, and a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be

    a 1
    b 00
    c 10
While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.
By @atlintots - 7 months
This is great! Are there any other similar tutorials going through writing a Haskell program, but with some more advanced features (monad transformers, lenses, etc)
By @goldfishgold - 7 months
Coursera’s functional programming course (in Scala) includes a pretty similar Huffman coding assignment with autograder if anybody wants to take a stab at it themselves.

https://www.coursera.org/learn/scala-functional-programming?...

By @banish-m4 - 7 months
Last time I used Huffman codes, it was to run a MICMAC processor macroprogram (assembly text) in the fewest number of microcycles and to use the fewest microinstructions in the microprogram (microcode). So starting with a histogram of the macroinstructions executed (IIRC, I first wrote an interpreter in C to count how many of each were executed), I crafted a progressive decoding microcode program to implement all of the required ISA macro-operations. IIRC, the macro instruction ISA I created was bit-granular instead of byte-oriented. In the real world, it would've been slow and inconvenient. What's nice about Huffman codes is that you can vary the prefix depth based on the distribution of values, so you don't have to have lopsided codes based on 1 bit prefixes.

Also, the microprogram had to deal with branch prediction because it was a non-superscalar pipelined processor model. Guess the wrong branch, and enjoy wasting cycles on a pipeline stall while the correct branch filters forward.

By @ykonstant - 7 months
Hey, since this is likely to attract Haskell programmers: how fast is Haskell these days for a programmer intent on writing optimized code? I am particularly interested in its performance for numerical crunching like matrix operations and other stuff that benefit from SIMD.
By @alwinaugustin - 7 months
Thanks for sharing. Very nice and insightful.
By @chvrchbvrner - 7 months
I think there is a typo in the table of the "Creating prefix-free codes" section. D should be '0010' (not '0110').

Otherwise a great read, thanks!

By @lo_zamoyski - 7 months
Btw, what is that on the girl's shirt in the image?

Direct link: https://lazamar.github.io/images/data-compressor.svg

By @londons_explore - 7 months
For all readers, arithmetic codes are better in nearly all ways. They can be implemented in less RAM and code, they compress and decompress to a better ratio, and the probabilities of different symbols appearing can be dynamically updated during the stream far more easily.

The only reason Huffman codes are used is they were invented first and arithmetic codes were patented. That patent has now expired, so we should use the better design.

By @lynx23 - 7 months
Very nice read, thanks for sharing!
By @bdahz - 7 months
How is the performance when compared to similar implementations in C/C++ or Rust?
By @benreesman - 7 months
Haskell is a really nice language. In general I don’t identify as an X programmer for any value of X: I tend to write in a half dozen languages daily and they all suck in their own special way.

But on two separate occasions I made important career decisions with opportunity cost to work with highly lethal GHC contributors: those people are just really good.

If Haskell sucks like all languages it’s because Haskell excels at using computers to compute something: Haskell considers data shuffling a strictly secondary concern compared to doing actual computations.

By @2-3-7-43-1807 - 7 months
and yet another episode in the series of "look, what I did with Haskell"