A working entropy encoder with output below the Shannon limit
The GitHub repository features Valli encoding, an entropy encoder aiming for compression below the Shannon limit. It includes a C++ implementation, algorithm explanations, code instructions, and project support. Visit the repository for more.
The linked GitHub repository contains information about Valli encoding, a proof-of-concept entropy encoder designed to achieve compression output sizes below the Shannon limit for lossless compression. It offers a C++ implementation of the encoder along with explanations of the algorithm, instructions for running the code, and ways to support the project. For further details or specific inquiries about Valli encoding, visit the GitHub repository.
Related
Unique3D: Image-to-3D Generation from a Single Image
The GitHub repository hosts Unique3D, offering efficient 3D mesh generation from a single image. It includes author details, project specifics, setup guides for Linux and Windows, an interactive demo, ComfyUI, tips, acknowledgements, collaborations, and citations.
Trealla Prolog: Compact and efficient Prolog interpreter
Trealla Prolog is a compact interpreter written in C, compliant with ISO Prolog. It supports unbounded integers, UTF-8 atoms, efficient strings, and runs on Linux, Android, and WebAssembly. It offers C integration, SQLite access, concurrency features, and experimental functionalities. The project is open source under the MIT license.
Video annotator: a framework for efficiently building video classifiers
The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.
A portable lightweight C FFI for Lua, based on libffi
A GitHub repository offers a portable lightweight C FFI for Lua, based on libffi. It aims for LuaJIT FFI compatibility, developed in C. Includes features, examples, basic types, build instructions, testing, and acknowledgements.
Show HN: Python lib to run evals across providers: OpenAI, Anthropic, etc.
The GitHub repository provides details on LLM Safety Evals, accessible on evals.gg. It features a bar chart, a Twitter post, setup guidelines, and code execution commands. Contact for further support.
If the frequency table isn't included in the count, I could just make an "infinite" compressor by storing the frequency and cut one byte off the end. From the frequency table I could then deduce what the last byte should be.
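The trick in this comment can be sketched directly: if the decoder already knows the exact symbol counts, the final byte carries no information. A minimal illustration (the function names are made up for this sketch):

```python
# Hypothetical sketch of the "infinite compressor": cut one byte off the
# end of the message, and recover it later by subtracting the counts of
# the truncated body from the known frequency table.
from collections import Counter

def drop_last_byte(message: bytes):
    """Split a message into (truncated body, full frequency table)."""
    return message[:-1], Counter(message)

def recover_last_byte(body: bytes, freqs: Counter) -> bytes:
    """Deduce the byte that was cut off from the frequency table."""
    remaining = freqs - Counter(body)   # exactly one symbol left over
    (last_byte, count), = remaining.items()
    assert count == 1
    return body + bytes([last_byte])

body, freqs = drop_last_byte(b"abracadabra")
assert recover_last_byte(body, freqs) == b"abracadabra"
```

Which is exactly why a fair size comparison has to charge the encoder for transmitting the frequency table.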
> typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol,
No. ANS (and range/arithmetic coding) allows for probabilities to be stored with fractional bits.
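The fractional-bit point is easy to check numerically: the ideal code length for a symbol of probability p is -log2(p) bits, which ANS/arithmetic coding approaches arbitrarily closely, whereas Huffman must spend at least one whole bit per symbol. A quick sketch:

```python
# Ideal (fractional-bit) cost of an iid sequence vs. Huffman's
# 1-bit-per-symbol floor, for a heavily skewed source.
import math

def ideal_bits(counts):
    """Total -log2(p) cost of an iid sequence with these symbol counts."""
    n = sum(counts)
    return sum(c * -math.log2(c / n) for c in counts)

# Skewed source: 'a' with p = 0.99, 'b' with p = 0.01, over 1000 symbols.
counts = [990, 10]
print(ideal_bits(counts))   # ~80.8 bits total, i.e. ~0.08 bits/symbol
print(sum(counts))          # Huffman would need at least 1000 bits
```

The more skewed the distribution, the larger the gap between the two.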
- A genie hands the encoder the type of an iid source sequence (the function N here: [1])
- The encoder produces a representation of the source with fewer bits than what would be possible using a Shannon-efficient encoder that didn't know the type.
- You also need to know the type to decode.
The reduction in bits from Shannon's theorem is explained by the genie revealing this extra information, which reduces the input sequence's entropy (and also makes it non-iid). The set of letter-typical sequences of a specific type is indeed often smaller than 2^(entropy rate × blocklength), and that is what we are seeing here. This is where the author's explanation starts to go astray:
> The Shannon limit uses Stirling's approximation which is an asymptotic approximation for factorials and as such the approximation is too large for small values and becomes more accurate as the factorial size (i.e. message length and symbol count) approaches infinity.
[1]: https://en.wikipedia.org/wiki/Typical_set#Strongly_typical_s...
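The type-class size argument above can be checked numerically. The number of sequences with an exact set of symbol counts is the multinomial coefficient, and its log2 is strictly below n·H for any finite n (the gap is what this encoder is exploiting). A rough check, assuming the comment's framing:

```python
# Compare log2 of the type-class size (multinomial coefficient) against
# the Shannon bound n*H for the same empirical distribution.
import math

def log2_type_class(counts):
    """log2 of the multinomial coefficient n! / (c1! * c2! * ...)."""
    n = sum(counts)
    total = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return total / math.log(2)

def n_times_entropy(counts):
    """The Shannon bound: blocklength times empirical entropy."""
    n = sum(counts)
    return sum(c * -math.log2(c / n) for c in counts if c)

counts = [700, 200, 100]          # n = 1000, three symbols
print(log2_type_class(counts))    # bits to index a sequence within its type
print(n_times_entropy(counts))    # n*H, a handful of bits larger
```

The difference between the two numbers is the Stirling-approximation slack the repository's README attributes to "beating" the limit.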
I would recommend the author read any information theory textbook, starting, at the latest, at the part that covers the Asymptotic Equipartition Property. In my opinion, this is the crucial definition that makes the source coding theorem concrete.
EDIT: Oh. I guess OP isn't including the symbol counts as part of the message when calculating the message length? I guess I can see why this could be useful, but I think this should be made much more explicit.
My impression in a quick read of this github page is that this person has no background in channel coding or information theory. Maybe not quite a scammer, but it appears this should be ignored.
I.e. you'll get an even better result if you limit plain old arithmetic coding to a 1/70th probability for each symbol (70 was the non-binary-aligning example stated here) and just encode using arithmetic coding. Arithmetic coding is blisteringly fast too.
https://en.wikipedia.org/wiki/Arithmetic_coding
I'd also say it's not really cheating Shannon limits. It's just that a binary channel likes to align on base-two boundaries. Arithmetic coding is the way to store optimally without needing to align on those boundaries. With arithmetic coding you waste at worst 7 bits at the end of the file as you pad out to 8-bit storage.
If you make use of the same information required here (exact counts) and update the probability model as you go, the result becomes the same for arithmetic coding/ANS (e.g. even with Huffman, after all symbols of one kind are seen, the remaining symbol is encoded in 0 bits).
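The adaptive-model claim above can be verified exactly: if the coder knows the remaining counts at each step, the per-symbol cost is -log2(remaining[s] / remaining_total), and the total telescopes to precisely log2 of the multinomial coefficient, with the final symbol costing 0 bits. A small sketch:

```python
# Adaptive coding with exact remaining counts: the total cost equals
# log2(n! / (c1! * c2! * ...)), the enumerative bound, and the last
# symbol (probability 1) costs 0 bits.
import math
from collections import Counter

def adaptive_cost_bits(message: bytes) -> float:
    remaining = Counter(message)
    total, bits = len(message), 0.0
    for s in message:
        bits += -math.log2(remaining[s] / total)  # cost under current model
        remaining[s] -= 1
        total -= 1
    return bits

msg = b"aaabbc"
# Multinomial coefficient: 6! / (3! * 2! * 1!) = 60
multinomial = math.factorial(6) // (math.factorial(3) * math.factorial(2))
print(adaptive_cost_bits(msg))    # equals log2(60) ~ 5.907 bits
```

So the "below Shannon" savings and the adaptive arithmetic/ANS result are the same quantity, reached by different bookkeeping.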
So I implemented that transformed idea in Haskell, using a rough approximation of some encoding of the integers, and got approximately 5432177 bits for the first 1e6 bytes of enwik9. This translates to 5.43 bits per byte, which is not that bad - order-0 encoding gets about 5.56 bits per byte there. My approach can be improved, of course, and the improved result can be even smaller.
So, in my not important opinion, it is a good approach. The only drawback I see is that I cannot fathom how to extend that encoder to higher order contexts.
PS
My variant adds counts of runs (how many times each character appears), so it does include a frequency table implicitly.