History of lossless data compression algorithms (2014)

optimalsolver · on June 29, 2022

Data compression is apparently equivalent to general intelligence:

http://mattmahoney.net/dc/rationale.html

...which gives way to the theory of mind known as compressionism:

http://ceur-ws.org/Vol-1419/paper0045.pdf

joe_the_user · on June 29, 2022

Perfect compression is equivalent to finding the Kolmogorov complexity of a string. That's incomputable and so perfect compression is beyond any algorithm, even a "generally intelligent" algorithm. Generality in compression might increase an algorithms effectiveness but there's no proof this generality is equivalent to generality in dealing with the world humans deal with.

pessimizer · on June 30, 2022

Not "perfect compression" but any compression. Perfect compression would be equivalent to perfectly digested experiences i.e. that anything that could be deduced from a collection of sensations had been deduced. The link above explains it well.

> In 2000, Hutter proved that finding the optimal behavior of a rational agent is equivalent to compressing its observations. Essentially he proved Occam's Razor, the simplest answer is usually the correct answer. The proof applies to any goal-seeking agent in any unknown environment which can be simulated by a computer. For example, an agent could be a person or computer whose goal is to accumulate money or some other reward signal in an unknown world.

> Formally, the agent and environment are a pair of interacting Turing machines that pass messages back and forth at each time step. In addition, the environment outputs a reward signal (a number) to the agent, and the agent's goal is to maximize the accumulated reward. What Hutter proved is that the optimal behavior of an agent is to guess that the environment is controlled by the shortest program that is consistent with all of the interaction observed so far.

> The problem of finding this program known as AIXI. AIXI implies a solution to optimal algorithmic compression: coding a string as the shortest program that outputs it. The problem is worth studying because it is hard. The general problem is not computable, although Hutter proved that if we assume time bounds t and space bounds l on the environment, then this restricted problem, known as AIXItl, can be solved in O(t2l) time.

edit: I mean, it makes intuitive sense. To have perfect compression of a string would be to have extracted all of the order out of it, every possible pattern, and every possible pattern of patterns etc. It's obviously impossible to know when you've gotten to that point.

joe_the_user · on June 30, 2022

Hutter proved that finding the optimal behavior of a rational agent is equivalent to compressing its observations. Essentially he proved Occam's Razor, the simplest answer is usually the correct answer.

But "rational agent" isn't a standard object mathematicians work with. This person no doubt defined an object they called a "rational agent" and then proved things about it. That's still just plausible reasoning relative to the many informal and incompatible concepts of "rational agent" out-there, not to mention models of environment.

chriswarbo · on June 30, 2022

> But "rational agent" isn't a standard object mathematicians work with

Yes it is; it's a standard definition used across many fields, including parts of mathematics (e.g. game theory and decision theory) https://en.wikipedia.org/wiki/Rational_agent

joe_the_user · on July 1, 2022

Rational agent as general concept is common in many places. But rational agent doesn't have a single mathematical definition in the fashion of a group or a ring. You should notice that the definition of rational agent found in game theory doesn't imply any kind of prediction or computation, what the OP was claiming about "rational agents"

chriswarbo · on July 1, 2022

> But rational agent doesn't have a single mathematical definition in the fashion of a group or a ring.

Sure it's not an algebraic object; but it's still a well-defined mathematical notion.

> You should notice that the definition of rational agent found in game theory doesn't imply any kind of prediction or computation

Of course it does! Game theory (and rational agents in general) is about optimisation ( https://en.wikipedia.org/wiki/Optimization_problem#Continuou... https://en.wikipedia.org/wiki/Optimal_decision#Formal_mathem... https://en.wikipedia.org/wiki/Rational_choice_theory#Formal_... ); it's trying to solve a partially-observable Markov decision problem ( https://en.wikipedia.org/wiki/Markov_decision_process#Defini... ). That requires prediction (of other agents, and/or the environment), and computation (to search through action-space, for Bayesian updating, for backwards-induction, etc.)

> what the OP was claiming about "rational agents"

Since prediction and compression are isomorphic, OP's point is that compression algorithms are relevant to these decision processes. This actually references the work of Marcus Hutter, who added an extra axiom to these established definitions: that agents and environments are (Turing) computable.

This lets us apply techniques from Algorithmic Information Theory, like Solomonoff Induction, and prove that the optimal policy is to:

- Compress all of the observations received so far, into the smallest program possible (for a predetermined Universal Turing Machine). Note that this is uncomputable, since it would tell us its Kolmogorov complexity!

- Predict the outcome of our possible actions, by simulating them against that program

- Choose the action whose simulation leads to the highest reward

joe_the_user · on July 1, 2022

Sure it's not an algebraic object; but it's still a well-defined mathematical notion.

Nah, it's just an informal concept. A game theoretic "rational agent" in general isn't assumed to have any limitations on it's computing ability, for example.

[Game theory is] trying to solve a partially-observable Markov decision problem

Not according to standard texts on game theory. In fact, this is just can't be true in any usual interpretation. POMDPs are "single agent" problems whereas game theory nearly always deals with two competing players/agents each assume attempting to optimize their returns - it's the difference between a single optimum and a saddle point, maximum and minimax.

This actually references the work of Marcus Hutter, who added an extra axiom to these established definitions: that agents and environments are (Turing) computable.

This is my point. You can't "add an extra axiom" and be referencing a standard mathematical object at the same time.

joquarky · on June 30, 2022

Is there a way to do a simulated annealing approach that lets me put as much compute as I want into it for some marginal gains?

chriswarbo · on June 30, 2022

Optimisation algorithms like simulated annealing require some representation to optimise. The most general representation is a computer program, in which case genetic programming might be a better fit :)

Dylan16807 · on June 29, 2022

That's a factor to some extent, but if you're sticking with lossless then the better your compression gets the more your remaining data is dominated by noise.

gfody · on June 29, 2022

that's why you need two brains, one to losslessly compress through symbolic expressiveness and another to lossily compress by generating noise like the incompressible noise

anonGone73 · on June 29, 2022

The Head and the Heart

benj111 · on June 30, 2022

I thought men had their brains elsewhere?

tomcam · on June 30, 2022

Username particularly appropriate

natly · on June 29, 2022

Not mentioned here is Fabrice Bellards record breaking (at an enormous compute cost) compressor https://bellard.org/nncp/.

lucb1e · on June 29, 2022

What does "ratio (bpb)" mean? I'd guess bytes-per-byte or something, like how many bytes of original you get for each byte of compression, but it doesn't work out: the original size is 1e9 bytes, compressed (rounded) 3.2e8, so that's a ratio of 3.1 (1e9/3.2e8=3.09989). The program size amounts to a rounding error on that figure. The bpb value given is 2.58, nowhere near 3.1.

Edit: the paper defines it as "bits per input byte". What kinda measure is that, it's like "how well did it compress as compared to a factor 8", why 8?!

jltsiren · on June 30, 2022

The bit is the most fundamental unit of information. A base-e unit might be more elegant from a certain mathematical perspective, but the connections to formal logic and the ease of implementation make the base-2 bit the natural choice. At least when talking about things like information, entropy, and compression.

Bytes, on the other hand, are entirely arbitrary. At some point, the industry converged to using groups of 8 bits as the primary semantically meaningful unit smaller than a word. Probably because people at that time thought that having 256 distinct characters would be more or less the right choice. And because groups of power-of-2 bits are convenient on hardware level.

Entropy is usually expressed as bits per symbol (or bits per character), because that's what you get when you sum -P(c) log P(c) over all symbols c. People who are used to that convention often extend it to representing compression ratios. Using bits per byte is rare, because bytes are rarely semantically meaningful.

isaacimagine · on June 29, 2022

Given 8 bits of input, how many bits does it take to reconstruct the original 8, on average?

8 / 2.58 = 3.1

lucb1e · on June 29, 2022

Comparing bytes to bits is still apples to appleseeds rather than apples to apples, to my mind at least.

dgacmu · on June 30, 2022

It's a common way to represent entropy (the information content). One could measure bits per x for any x, of course, but bits per character (nee byte) is quite common and goes back to Shannon.

ctur · on June 29, 2022

This seems pretty incomplete if it doesn't include anything about brotli, lz4, or zstd, or Finite State Entropy in general. It's basically missing the past decade or more.

There's a lot of good history here, but just be aware it is missing the state of the art and thus shouldn't be used for deciding anything about algorithms or approaches to use today.

_delirium · on June 29, 2022

> It's basically missing the past decade or more.

The page history [1] shows that the article was mostly written in December 2011, with only minor edits since then, so that sounds about right.

[1] https://ethw.org/w/index.php?title=History_of_Lossless_Data_...

lucb1e · on June 29, 2022

Does that mean this is also outdated? I'm not sure if zstd, brotli, etc. are PAQ-like

> on the bleeding edge of archival software are the PAQ* formats

For "a history of" I don't mind if they don't include the latest decade which is still subject to interpretation for how to structure the text and emphasize key moments/inventions and such, but if they make statements pertaining to the present that's a different story.

felixhandte · on June 29, 2022

No widely deployed compressors are PAQ-like. The speed-up anticipated by the OP never materialized.

pixelpoet · on June 30, 2022

The story of Phil Katz (the PK in PKZIP) is a fascinating and very sad one: https://web.archive.org/web/20000829071343/http://www2.jsonl...

motohagiography · on June 29, 2022

Is the theoretical limit to compression basically what Kolmolgorov complexity is, or is there another shannon/information theory theorem that defines the maximum possible lossless compression in terms of symbols over a channel, where the channel represents the algorithm? ("compression theorem" refers to something else more abstract I think.)

Intuitively it seems like there would have to be one for lossless compression, and a proof that a theorem for the limits of lossy compression was either impossible or an open loop/Hard problem.

joe_the_user · on June 29, 2022

Yes, the Kolmolgorov complexity of a string X is the minimum length of a program/string that when fed into an efficient universal computer will yield back the string X. The Kolmolgorov complexity of a string is also undecideable - it's essentially equivalent to the halting problem.

"Best Lossy Compression" is ambiguous - how much loss is acceptable is isn't defined in the phrase.

dhdc · on June 30, 2022

The theorem you are looking for is Shannon's Source Coding Theorem[1]. It basically states that no encoding scheme can losslessly compress data beyond the given data's Shannon entropy.

[1]: https://en.m.wikipedia.org/wiki/Shannon's_source_coding_theo...

motohagiography · on June 30, 2022

Amazing, thank you, it had to be something like that. It makes much more sense as an encoding limitation than as something abstract as the "shortest possible program" via Kolmolgorov complexity. I was wondering whether KC may just be an inconsistently defined thought experiment compared to concrete ideas from information theory, and if it's not, how much in common it would necessarily have with the coding theorem, as how are they not talking about the same thing.

chriswarbo · on June 30, 2022

What you're after is "minimum description length": choose some representation, e.g. gzip, and try to find the shortest encoding of our input in that representation.

The simplest representation is just raw data. There's only one way to represent our input in this format: just write it out verbatim. Hence its minimum description length is identical to the input size.

A more powerful representation is run-length encoding, where runs of identical symbols can be stored as the symbol and the number of repetitions; everything else is stored raw. Note that there are multiple ways to represent an input in this format: e.g. "50 zeros" is the same as "25 zeros 25 zeros", "00000 45 zeros", etc. yet the latter aren't minimum descriptions.

A more powerful representation is to use back-references: this lets us compress runs of multiple repeated symbols (run-length encoding only handles runs with 1 repeated symbol), and re-use commonly-occuring 'phrases'. This is essentially how LZ works. Again, there are multiple ways to represent an input in this format; finding a minimum description is incredibly expensive; most compressors don't bother, and instead have a configurable 'effort' (e.g. 1 to 9).

Adding more power to a representation makes it able to compress more complex patterns, but also makes it harder to find a minimum description. The most powerful representation is a Universal Turing Machine, which can represent any computable pattern; its minimum description length is the Kolmogorov complexity, and finding it is uncomputable in general.

Note that both Shannon information and algorithmic information (i.e. Kolmogorov complexity) can be gamed if we focus on a single message; for example, I can define an encoding like this:

- To encode a snapshot of Wikipedia as of 2022-06-30T14:00:00Z, output a single `1`

- To encode any other data, run it through `zip` and prepend a `0`

- To decode data which begins with a `1`, output a snapshot of Wikipedia as of 2022-06-30T14:00:00Z

- To decode data which begins with a `0`, send everything after that `0` through `unzip`

Hence we need to focus on the distribution of possible messages, and the big-O behaviour after many messages have been sent, rather than individual messages themselves.

Shannon information is essentially a frequentist approach, based on counting and expectations. Algorithmic information is essentially Bayesian, where the choice of Universal Turing Machine defines a prior, the bits of a message (AKA a program) are evidence, and decoding that message (AKA running the program) updates that prior.

motohagiography · on June 30, 2022

So rewarding, thank you! Naievely, doesn't the expectation in the frequentist approach function as a prior in the Bayesian case? It's like there is a logical homomorphism between the two concepts of KC and SI.

dhdc · on June 30, 2022

In a sense, Kolmolgorov complexity and Shannon entropy can basically be considered as equivalent concepts: they both define the minimum amount of information required to fully define a given piece of data.

bumper_crop · on June 29, 2022

Very timely; I found out yesterday that the UPX program (EXE specific compressor from many years ago) was made by the same guys who made LZO. I had this realization that there is a progeny from people who write compression stuff. In a similar vein, The Rob Pike post from yesterday mentioned Ken Thompson had made a custom audio compression for the Plan 9 release. He also made the UTF-8 compressor too. I love seeing how these people keep refining the state of the art.

Jerrrry · on June 29, 2022

The thoughts around information theory, entropy, compression, emergent behavior, and the intersectional limits of all of them are quite reminiscent and memetic to those who are interested.

Complexity is fuckin cool.

a-dub · on June 29, 2022

you may enjoy this text: http://www.inference.org.uk/itila/book.html

Jerrrry · on June 29, 2022

thank you, saved

mtn_nerd · on June 29, 2022

How could they not include the middle-out algorithm?

dang · on June 29, 2022

Discussed at the time:

History of Lossless Data Compression Algorithms - https://news.ycombinator.com/item?id=8088842 - July 2014 (44 comments)

kloch · on June 29, 2022

Does anyone else remember a joke/malware DOS program that circulated on BBS's in the very early 1990's that promised 98% compression of any file?

It just shortened the size in the FAT table (or similar gimmick that would result in data loss as you wrote new data to the disk)...

lizardactivist · on June 29, 2022

This article is relevant and intersects with some of the algorithms in the topic link:

https://oku.edu.mie-u.ac.jp/~okumura/compression/history.htm...

peter_d_sherman · on June 30, 2022

In reading this article, my intuition says to me that the Burrows-Wheeler Transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...) and the Fourier Transform (https://en.wikipedia.org/wiki/Fourier_transform) -- are related -- in some as-of-yet unknown to me, mathematical way... like maybe one is a specific, specialized sub-case of the other, or something like that...

I don't have the mathematical ability to prove or disprove this conjecture -- but if some Mathematician in the future wanted to go down the path of either proving or disproving this, then I'd start with looking at the idea that both Transforms have reverse Transforms that preserve information, and that the Fourier Transform is basically its own reverse Transform... in other words, the Fourier Transform is sort of like an "identity" in the world of transforms...

It seems that there would, or should be, in the world of Math, transform functions of "different orders", i.e., if Fourier is its own opposite, and if data compression comprises two transformation functions, one to compress data, another to decompress it -- then couldn't there also exist "Trinary Transforms" -- where you run transformation function A, get some data out of it, then run transformation function B on that resulting data, and get some different data out of it, then run transformation function C on that data -- and after running transformation function C -- you get your original data back.

If those exist, then maybe N-ary Transformation Functions do too, where you run a set of data through N functions -- and you get your original data back after the Nth time... "transformation function families" -- for lack of better terminology!

I don't know -- I don't have the Mathematical expertise to prove or disprove it!

Disclaimer: All of the above is based on rampant speculation and conjecture -- and could be (and probably is!) absolutely and completely wrong on one or more levels -- so take with the proverbial "grain of salt"!

asimpletune · on June 30, 2022

Part of me wonders if there is some kind of connection between the theoretical minimum compression and radix economy. Like when you calculate the minimal amount of entropy required to encode some data, you end up in fractions of bits. Similarly, the theoretically most efficient radix is base e.

harveywi · on June 29, 2022

Grammar-based compression algorithms are completely missing from this "history".

lucb1e · on June 29, 2022

Haven't heard of that before. Typing it into Wikipedia, I got https://en.wikipedia.org/wiki/Grammar-based_code which explains it in so much as "To compress a data sequence x=x[0..n], a grammar-based code transforms x into a context-free grammar G." (Very helpful, love it when they use pseudocode with single-letter recursive variable definitions which needs accompanying text for it to make sense... and the text isn't there.) There is an example given on the right in image form, which looks like recursive dictionary-based string substitution:

    "We hold these truths"
    compressed: we_holdQ5ths
    Q=Rse we_holdRse_5ths
    R=Se  we_holdSese_5ths
    S=_th we_hold_these_5ths
    5=tru we_hold_these_truths

This method isn't described anywhere nor was the original included, but this seems to be the algorithm used.

duskwuff · on June 29, 2022

I'm not sure grammar-based compressors are a meaningful part of the history of lossless data compression. They're a topic in coding theory research, but, as far as I'm aware, they haven't had a significant impact in practical algorithms used in the field.