

Why Compression Is Just Lying About Size

Compression algorithms basically just find repetitive patterns and replace them with shorter lies. It’s like summarizing War and Peace as “Russians are sad for a long time” - technically accurate but missing some nuance. At ExxonMobil, we compressed our log files so aggressively that the compression metadata was larger than the actual logs. We ended up storing a very detailed description of nothing.
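
If you want to see that in miniature, here’s a rough sketch using Python’s standard gzip module (the log line is invented): compress something tiny and the fixed header, CRC, and size trailer cost more than whatever the pattern-matching saves you.

```python
import gzip

# A short, hypothetical log line: tiny inputs are where the overhead dominates.
log_line = b"2024-01-01 00:00:00 INFO heartbeat ok\n"

compressed = gzip.compress(log_line)

print(f"original:   {len(log_line)} bytes")
print(f"compressed: {len(compressed)} bytes")  # bigger: the gzip header, CRC, and size trailer outweigh the savings
```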

The best part about compression is when people try to compress already compressed data, like trying to make a ZIP file smaller by putting it in another ZIP file. It’s like trying to save space by putting your suitcase inside another suitcase. The file actually gets bigger because now you’re storing “this is compressed” metadata twice. I’ve seen systems that automatically compressed everything, including JPEGs and MP3s, turning a storage optimization into a storage pessimization.
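
For the skeptical, here’s a rough illustration - random bytes standing in for a JPEG or MP3, since both look equally patternless to a compressor:

```python
import gzip
import os

# Random bytes stand in for already-compressed data: no patterns left to find.
already_compressed = os.urandom(100_000)

once = gzip.compress(already_compressed)
twice = gzip.compress(once)

print(f"original: {len(already_compressed)} bytes")
print(f"once:     {len(once)} bytes")   # slightly bigger: header and trailer overhead, nothing saved
print(f"twice:    {len(twice)} bytes")  # bigger still: now the overhead is stored twice
```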

My favorite compression bug was when we were compressing JSON by removing whitespace, but our parser required pretty-printed JSON. So we’d compress it for storage, decompress it for parsing, then pretty-print it for logging. The CPU cycles we burned moving spaces around could have powered a small country. Eventually, someone realized we were basically running an elaborate no-op and just stored the JSON as-is. Sometimes the best optimization is not optimizing.
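
The round trip looked roughly like this - the names are invented, but the shape of the no-op is faithful:

```python
import json

def store(record: dict) -> str:
    # "Compress" by stripping every space and newline before writing to storage.
    return json.dumps(record, separators=(",", ":"))

def parse(stored: str) -> dict:
    # Our parser (hypothetically) only accepted pretty-printed JSON,
    # so put all the whitespace back before parsing it.
    pretty = json.dumps(json.loads(stored), indent=2)
    return json.loads(pretty)

def log_it(record: dict) -> str:
    # ...then pretty-print it again for the logs.
    return json.dumps(record, indent=2)

record = {"user": "example", "action": "login", "ok": True}
print(log_it(parse(store(record))))  # a lot of CPU to end up exactly where we started
```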
