Unzipping the Mystery: How ZIP Files Work


If you've ever had to email, upload or download several large files or programs, you've most likely encountered ZIP files. A ZIP file is an archive: it bundles one or more files into a single compressed file with the extension .zip or .ZIP, reducing the overall size and making the contents easier to transmit.
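To make the idea of bundling several files into one archive concrete, here is a minimal sketch using Python's standard zipfile module. The file names and contents are made up for illustration, and the archive is built in memory rather than on disk.

```python
import io
import zipfile

# Hypothetical file names and contents, invented for this illustration.
files = {
    "notes.txt": b"readers can help make Mashable smarter\n" * 50,
    "report.txt": b"Mashable can help make readers smarter\n" * 50,
}

# Write both files into a single in-memory archive, compressing each one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)

original_size = sum(len(data) for data in files.values())
archive_size = len(buffer.getvalue())
print(original_size, archive_size)

# Read the archive back and collect every file it contains.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    restored = {name: archive.read(name) for name in archive.namelist()}

print(restored == files)
```

Because the sample text is so repetitive, the archive should come out much smaller than the two files combined, and the restored contents match the originals byte for byte.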

Phil Katz created the ZIP format in 1989, and it was first implemented in the PKZIP program from Katz's company, PKWARE, Inc. Eventually, support for the format was built into popular operating systems. Microsoft Windows and Apple's Mac OS include built-in utilities to compress and unzip files, and programs like WinRAR, WinZip and StuffIt can expand them.


But how does it all work? What kind of technological magic is at play that makes your files smaller while maintaining all of the information for later?

That "magic" is actually a pretty straightforward algorithm: it finds the parts of a file that repeat and replaces each repetition with a much shorter reference.

For an easy-to-understand example, let's take the sentence, "Mashable can help make readers smarter; readers can help make Mashable smarter," and pretend it's a file.

Every word in the example sentence appears twice. If each character and space in this sentence made up one unit of memory, the whole thing would have a file size of 78 units. If we created a numbered code — or "dictionary" — for this sentence, it could go something like this:

1. Mashable
2. can
3. help
4. make
5. readers
6. smarter

1 2 3 4 5 6; 5 2 3 4 1 6

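This new sentence has only 24 units. The compressed file would therefore need just 24 units of memory for the text itself, plus the numbered dictionary, which has to be stored alongside it so that the compression program knows what each number stands for. The dictionary takes up space too, so the real saving is smaller than the drop from 78 to 24 suggests, and it only pays off when words repeat often enough. This is called "lossless compression": all of the original information is retained and can be rebuilt exactly.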
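The numbered-dictionary idea can be sketched in a few lines of Python. This is a toy illustration of the example above, not how a real ZIP program stores data; the variable names are made up, and it assumes words consist only of letters.

```python
import re

sentence = ("Mashable can help make readers smarter; "
            "readers can help make Mashable smarter")

# Build the numbered dictionary in order of first appearance (codes start at 1).
dictionary = {}
for word in re.findall(r"[A-Za-z]+", sentence):
    if word not in dictionary:
        dictionary[word] = str(len(dictionary) + 1)

# Swap each word for its code; spaces and punctuation pass through unchanged.
encoded = re.sub(r"[A-Za-z]+", lambda m: dictionary[m.group()], sentence)

# Decoding reverses the mapping, which is what makes the scheme lossless.
reverse = {code: word for word, code in dictionary.items()}
decoded = re.sub(r"\d+", lambda m: reverse[m.group()], encoded)

# The dictionary words must be stored as well, so count their characters too.
dictionary_size = sum(len(word) for word in dictionary)

print(encoded)                                         # 1 2 3 4 5 6; 5 2 3 4 1 6
print(len(sentence), len(encoded), dictionary_size)    # 78 24 33
print(decoded == sentence)                             # True
```

Counting the 33 characters of dictionary words alongside the 24 units of encoded text shows why the saving on such a short sentence is modest; the technique pays off as the same words recur more often.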

An actual compression program is a little more complicated than this example. Rather than working word by word, it looks for repeated patterns of characters wherever they occur. One such pattern here is the letter "e" followed by a space, which appears after "Mashable" and "make." But because that pattern is short and doesn't occur very often, the program would most likely pass over it in favor of a longer or more frequent one. By weighing patterns this way, the actual program is able to find a much more efficient dictionary and compressed file than we could by hand.
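As a rough illustration of pattern hunting, the sketch below compares the short "e " pattern with the longest stretch of characters that repeats in the example sentence. The function name is made up, and the brute-force search is only an assumption for readability; real compressors use far more efficient matching.

```python
def longest_repeated_substring(text):
    """Return the longest run of characters that occurs at least twice."""
    # Try the longest candidate lengths first and stop at the first repeat.
    for length in range(len(text) - 1, 0, -1):
        seen = set()
        for start in range(len(text) - length + 1):
            chunk = text[start:start + length]
            if chunk in seen:
                return chunk
            seen.add(chunk)
    return ""

sentence = ("Mashable can help make readers smarter; "
            "readers can help make Mashable smarter")

short_pattern_count = sentence.count("e ")
best = longest_repeated_substring(sentence)

print(short_pattern_count)   # 4
print(repr(best))            # ' can help make '
print(len(best))             # 15
```

Replacing one 15-character repeat saves more than replacing a handful of 2-character ones, which is the kind of trade-off a compression program weighs when it chooses its patterns.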

According to educational and instructional website HowStuffWorks, it's common for languages to have redundant patterns, which is why text files compress so easily. But how much a file shrinks depends on several factors, including the file's type and size and how the program chooses to compress it.
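The effect of file content on compression can be checked with Python's standard zlib module, which implements the Deflate method commonly used inside ZIP files. This is a sketch under stated assumptions: the sample text and the pseudo-random "noise" are invented for the comparison, and exact sizes will vary with the zlib version.

```python
import random
import zlib

def compression_ratio(data):
    """Compressed size divided by original size (smaller means better)."""
    return len(zlib.compress(data)) / len(data)

# Highly redundant input: the example sentence repeated 100 times.
text = ("Mashable can help make readers smarter; "
        "readers can help make Mashable smarter. ").encode("ascii") * 100

# Input with no useful patterns: pseudo-random bytes of the same length.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

text_ratio = compression_ratio(text)
noise_ratio = compression_ratio(noise)
round_trip_ok = zlib.decompress(zlib.compress(text)) == text

print(f"text:  {text_ratio:.3f}")
print(f"noise: {noise_ratio:.3f}")
print(round_trip_ok)
```

The repetitive text should shrink to a small fraction of its size, while the patternless bytes barely shrink at all and may even grow slightly.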

In contrast, images and MP3 files contain more unique information and fewer repeating patterns. That's where "lossy compression" comes in: the compression program throws away information it deems unnecessary. If you had a scanned image with a blue sky, for example, a compression program could pick a single shade of blue and use it for every pixel in the sky. If the compression scheme works well, the change wouldn't be very noticeable, but the file size would be significantly smaller.

The issue with lossy compression, though, is that you can't rebuild the original file from the compressed one. That makes it a poor fit when you need to retain all of the original information, such as when you're downloading databases and certain applications.
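The blue-sky example can be sketched as follows. This is a deliberately simplified illustration, not how real image or audio formats work: the pixel values are invented, and averaging followed by run-length encoding stands in for whatever a real lossy scheme would do.

```python
import random

rng = random.Random(1)

# Hypothetical "sky": 1,000 pixels, each a slightly different blue (R, G, B).
sky = [(rng.randint(95, 105), rng.randint(145, 155), rng.randint(230, 240))
       for _ in range(1000)]

# Lossy step: replace every pixel with the single average shade.
average = tuple(round(sum(pixel[c] for pixel in sky) / len(sky))
                for c in range(3))
flattened = [average] * len(sky)

def run_length_encode(pixels):
    """Collapse consecutive identical pixels into [pixel, count] pairs."""
    runs = []
    for pixel in pixels:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs

original_runs = run_length_encode(sky)
flattened_runs = run_length_encode(flattened)

print(len(original_runs), len(flattened_runs))
print(flattened == sky)   # False: the individual shades are gone for good
```

Once every pixel holds the same shade, the whole sky collapses into a single run, but nothing in the flattened data records what the original shades were, which is why the step cannot be undone.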

Mashable composite image courtesy of iStockphoto, tose, Auris.

