June 24th, 2024

A working entropy encoder with output below the Shannon limit

The linked GitHub repository describes Valli encoding, a proof-of-concept entropy encoder designed to achieve compressed output sizes below the Shannon limit for lossless compression. It includes a C++ implementation of the encoder, an explanation of the algorithm, instructions for running the code, and ways to support the project.

Related

Unique3D: Image-to-3D Generation from a Single Image

The GitHub repository hosts Unique3D, offering efficient 3D mesh generation from a single image. It includes author details, project specifics, setup guides for Linux and Windows, an interactive demo, ComfyUI support, tips, acknowledgements, collaborations, and citations.

Trealla Prolog: Compact and efficient Prolog interpreter

Trealla Prolog is a compact interpreter written in C, compliant with ISO Prolog. It supports unbounded integers, UTF-8 atoms, efficient strings, and runs on Linux, Android, and WebAssembly. It offers C integration, SQLite access, concurrency features, and experimental functionalities. The project is open source under the MIT license.

Video annotator: a framework for efficiently building video classifiers

The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.

A portable lightweight C FFI for Lua, based on libffi

A GitHub repository offers a portable, lightweight C FFI for Lua, based on libffi and developed in C. It aims for LuaJIT FFI compatibility and covers features, examples, basic types, build instructions, testing, and acknowledgements.

Show HN: Python lib to run evals across providers: OpenAI, Anthropic, etc.

The GitHub repository provides details on LLM Safety Evals, accessible on evals.gg. It features a results bar chart, a linked Twitter post, setup guidelines, and commands for running the code, along with contact information for further support.

12 comments
By @klauspost - 4 months
Looking at it quickly, it seems like you're just moving entropy to the frequency table.

If the frequency table isn't included in the count, I could just make an "infinite" compressor by storing the frequency and cut one byte off the end. From the frequency table I could then deduce what the last byte should be.
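For example, a rough Python sketch of that trick (the helper names are mine, not from the repo):

    from collections import Counter

    def trim_last(message: bytes):
        counts = Counter(message)        # the frequency table, sent "for free"
        return message[:-1], counts      # ship one byte less than the message

    def restore_last(payload: bytes, counts: Counter):
        missing = counts - Counter(payload)   # whatever the payload lacks
        [last] = missing.elements()           # exactly one byte is unaccounted for
        return payload + bytes([last])

    msg = b"valli encoding"
    payload, table = trim_last(msg)
    assert restore_last(payload, table) == msg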

> typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol,

No. ANS (and range/arithmetic coding) allows symbols to be coded with a fractional number of bits, according to their probabilities.
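For instance, an ideal ANS/arithmetic coder pays about -log2(p) bits per symbol, which need not be a whole number (the probabilities below are just an example):

    from math import log2

    probs = {"a": 0.9, "b": 0.1}
    for sym, p in probs.items():
        print(sym, f"{-log2(p):.3f} bits")               # a: 0.152 bits, b: 3.322 bits
    print(sum(-p * log2(p) for p in probs.values()))     # ~0.469 bits/symbol on average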

By @peter-ebert - 4 months
I was initially surprised to find that the size (in bits) of the multinomial coefficient is less than the entropy predicted by the Shannon source coding theorem, which is generally accepted as a lower bound. Not sure if this encoder is new, but hopefully others find it interesting. Step-by-step math here: https://github.com/Peter-Ebert/Valli-Encoding/blob/main/the-...
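A rough Python version of that comparison (my own sketch, not the repo's code), for 8 symbols in a 1:1 ratio:

    from math import comb, log2

    def multinomial_bits(counts):
        remaining, total = sum(counts), 1
        for c in counts:
            total *= comb(remaining, c)      # arrangements of this symbol
            remaining -= c
        return log2(total)

    counts = [4, 4]                          # 8 symbols, 1:1 ratio
    n = sum(counts)
    H = -sum(c / n * log2(c / n) for c in counts)
    print(multinomial_bits(counts))          # log2(70) ~ 6.13 bits
    print(n * H)                             # 8 bits from the n*H bound
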
By @enthdegree - 4 months
The author has built the following encoder:

- A genie hands the encoder the type of an iid source sequence (the function N here: [1])

- The encoder produces a representation of the source with fewer bits than what would be possible using a Shannon-efficient encoder that didn't know the type.

- You also need to know the type to decode.

The reduction in bits below Shannon's theorem is explained by the genie revealing this extra information, which reduced the input sequence's entropy (also causing it to become non-iid). The collection of letter-typical sequences of a specific type is indeed often going to be smaller than 2^(entropy rate x blocklength), and that is what we are seeing here. This is where the author's explanation starts to go astray:

> The Shannon limit uses Stirling's approximation which is an asymptotic approximation for factorials and as such the approximation is too large for small values and becomes more accurate as the factorial size (i.e. message length and symbol count) approaches infinity.

[1]: https://en.wikipedia.org/wiki/Typical_set#Strongly_typical_s...

I would recommend the author read any information theory textbook, starting, at the latest, at the part that covers the Asymptotic Equipartition Property. That is the crucial definition that will make the source coding theorem concrete, in my opinion.
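As a small sanity check (my own sketch, tiny n): each type class of an 8-flip fair-coin sequence can be indexed in fewer than 8 bits, but the classes together still tile all 2^8 sequences, so the saving is exactly the information the genie supplies by naming the class.

    from math import comb, log2

    n = 8
    for k in range(n + 1):
        size = comb(n, k)                            # sequences with exactly k ones
        print(k, size, f"{log2(size):.2f} bits")     # cost of indexing within the class
    print(sum(comb(n, k) for k in range(n + 1)))     # 256 = 2^8: the classes cover everything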

By @sebzim4500 - 4 months
I think I'm misunderstanding what this does, since a naive reading suggests it's doing something which is pretty easy to prove impossible.

EDIT: Oh. I guess OP isn't including the symbol counts as part of the message when calculating the message length? I guess I can see why this could be useful, but I think this should be made much more explicit.

By @slwvx - 4 months
One's default when seeing a claim to do better than the Shannon limit is to guess that the author is a scammer, confused, or both.

My impression in a quick read of this github page is that this person has no background in channel coding or information theory. Maybe not quite a scammer, but it appears this should be ignored.

By @AnotherGoodName - 4 months
A better way to do this is plain old arithmetic encoding. It also effectively stores a real number of bits for each symbol, which is where the savings here comes from, as you stated.

I.e., you'll get an even better result if you fix plain old arithmetic coding to a 1/70 probability for each symbol (70 was the non-binary-aligning example stated here) and just encode straight through. Arithmetic encoding is blisteringly fast too.

https://en.wikipedia.org/wiki/Arithmetic_coding

I'd also say it's not really cheating the Shannon limit. It's just that a binary channel likes to align on base-two boundaries, and arithmetic coding is the way to store data optimally without needing to align on those boundaries. With arithmetic coding, at worst you waste 7 bits at the end of the file as you align to 8 bits for storage.
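A minimal sketch of that 1/70 case (with a fixed uniform model, arithmetic coding collapses to exact base-70 radix coding; a real coder streams with renormalisation instead of big integers, and symbol values 0..69 are assumed):

    from math import ceil, log2

    BASE = 70

    def encode(symbols):
        value = 0
        for s in symbols:
            value = value * BASE + s                 # accumulate as one base-70 number
        n_bits = ceil(len(symbols) * log2(BASE))     # ~6.13 bits per symbol
        return value, n_bits

    def decode(value, count):
        out = []
        for _ in range(count):
            value, s = divmod(value, BASE)
            out.append(s)
        return list(reversed(out))

    msg = [3, 69, 0, 41, 7, 7, 12]
    code, bits = encode(msg)
    assert decode(code, len(msg)) == msg
    print(bits, "bits for", len(msg), "symbols")     # 43 bits vs 49 at a fixed 7 bits/symbol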

By @minitech - 4 months
> Given a message with two symbols in a 1:1 ratio, typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol, with 0 representing one symbol and 1 representing the other.

If you make use of the same information required here (exact counts) and update the probability model as you go, the result becomes the same for arithmetic coding/ANS (e.g., even with Huffman, after all symbols of one kind have been seen, the remaining symbols are encoded in 0 bits each).
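A quick check of that claim (my own sketch): with exact remaining counts updated after every symbol, the ideal cost of the whole message is log2 of the multinomial coefficient, matching the permutation-counting argument exactly.

    from collections import Counter
    from math import comb, log2

    def adaptive_ideal_bits(message):
        remaining = Counter(message)
        total, bits = len(message), 0.0
        for sym in message:
            bits += -log2(remaining[sym] / total)    # -log2 P(sym | counts seen so far)
            remaining[sym] -= 1
            total -= 1
        return bits

    print(adaptive_ideal_bits("aaaabbbb"))           # log2(C(8,4)) ~ 6.13 bits
    print(log2(comb(8, 4)))                          # same value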

By @hmry - 4 months
So if I'm reading this correctly, this works by knowing how often each symbol occurs ahead of time? (Since the table of frequencies is not counted in the compressed size but is required for decompression.)
By @gzalo - 4 months
If the symbol frequency is known beforehand and fixed for all messages, it's not really i.i.d., as each symbol's distribution depends on the previous symbols.
By @thesz - 4 months
This can be seen as encoding the ranges that span each character. E.g., "asdoasd" needs to encode just 3 for 'o'; remove 'o' from "asdoasd" and encode 0 and 2 for 'a', then 0 and 1 for 's', and then only "dd" remains: encode 'd', encode a stop bit, and you are done.
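One way to sketch that transform in Python (my reading uses absolute positions in the remaining string rather than the gaps in the example above, but the round trip is the same idea):

    def to_position_lists(text):
        lists, work = [], list(text)
        while work:
            sym = work[0]                                        # next symbol in first-seen order
            lists.append((sym, [i for i, c in enumerate(work) if c == sym]))
            work = [c for c in work if c != sym]                 # remove it, recurse on the rest
        return lists

    def from_position_lists(lists):
        work = []
        for sym, positions in reversed(lists):
            for p in positions:                                  # positions are ascending
                work.insert(p, sym)
        return "".join(work)

    lists = to_position_lists("asdoasd")
    assert from_position_lists(lists) == "asdoasd"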

So I implemented that transformed idea in Haskell, using a crude approximation of some integer encoding, and got approximately 5432177 bits for the first 1e6 bytes of enwik9. This translates to 5.43 bits per byte, which is not that bad: order-0 encoding gets about 5.56 bits per byte there. My approach can be improved, of course, and the improved result can be even smaller.

So, in my unimportant opinion, it is a good approach. The only drawback I see is that I cannot fathom how to extend this encoder to higher-order contexts.

PS

My variant adds the counts of runs (the count of each character's appearances), so it does add a frequency table implicitly.