Information Theory and Pornography
In 1948, Claude Shannon published a very important paper entitled "A Mathematical Theory of Communication". In it, Shannon essentially invented information theory. While it sounds very technical and boring, information theory is crazy important in today's world because it tells us exactly how much porn we can download. That's right - Shannon dropped a knowledge bomb so large that the aftershocks are still felt today. Shannon ties entropy, negative logarithms, and way nerdier concepts into a model that underpins nearly every bit of electronic communication today. From error detection to error correction to encryption to compression, all electronic communication is rooted in Shannon's work.
I mentioned porn because it is relevant (inasmuch as it is a compressible form of information), but also it helps to draw the reader in. Also, it increases my SEO score.
Entropy - Why Does It Matter?
Great question! Let me feed you, baby bird... If we can calculate the entropy of a message, we can determine exactly how much the message can be compressed. This is important because all Internet traffic is metered at some point. Someone is paying for each and every bit that ventures onto the Internet. Bandwidth costs are a real, measurable expense - maybe not for the home consumer paying $50/month for a cable modem connection, but certainly for the large data centers that handle significant portions of Internet traffic. According to this outdated Wikipedia article, the Internet saw nearly 1 petabyte of traffic every day in 2012. If we extrapolate along a roughly linear slope, we are approaching 2 petabytes of traffic per day in 2016. To put things in perspective, a petabyte is a 1 with fifteen 0's after it. And petabytes count bytes, not bits - a byte is 8 bits, so multiply by 8 for the number of bits. The Internet sees
1,000,000,000,000,000 bytes per day
That is a lot. Huge, in fact. Many basic calculators will not even display numbers that big. Let me put this in perspective for you. A dollar bill is about 0.0043 inches thick, so one thousand dollar bills make a stack about 4.3 inches high. If you had 1,000,000,000,000,000 dollar bills, the stack would be roughly 70 million miles high - from here to the moon and back almost 150 times over. The traffic generated by your cable modem in one day (with an aggressive estimate of 8GB) would make a stack a few hundred miles high - which sounds impressive, but isn't even a rounding error of a rounding error of the petabyte stack.
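The dollar-bill arithmetic above is easy to sanity-check. Here is a quick back-of-the-envelope script; the bill thickness and Earth-moon distance are approximate round numbers, and `stack_miles` is a helper name made up for this sketch:

```python
# Back-of-the-envelope check of the dollar-bill stacks above.
BILL_THICKNESS_IN = 0.0043           # approximate thickness of one bill, in inches
INCHES_PER_MILE = 63360
MOON_ROUND_TRIP_MILES = 2 * 238_900  # average Earth-moon distance, there and back

def stack_miles(num_bills):
    """Height of a stack of bills, in miles."""
    return num_bills * BILL_THICKNESS_IN / INCHES_PER_MILE

petabyte_stack = stack_miles(1_000_000_000_000_000)
print(round(petabyte_stack / 1_000_000))              # ~68 million miles
print(round(petabyte_stack / MOON_ROUND_TRIP_MILES))  # ~142 moon round trips
```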
All this to say: if everyone could shave even 1% off the bytes sent across the Internet, we would save on the order of 20 terabytes of traffic every single day. So yes, compression is important.
Don't Forget The Balls
Let's relate entropy to balls. Specifically, shipping balls in a box. Consider a scenario where I have to ship hundreds of balls across the country.

[Image: Free Balls]
All the boxes I have are fairly standard shipping boxes, and it will take a number of boxes to fit all the balls. Each box costs $12 to send via USPS, so it is in my best interest to reduce the number of boxes I send. The quick solution is to toss the balls into the boxes haphazardly - fast, but it wastes space and uses more boxes.
[Image: Packaged Balls]
Alternatively, I could place each ball in the box very carefully, lining up the balls to optimize the space in each box. I would likely use fewer boxes - this approach takes longer, but it has a lower financial cost.
Using Shannon's work, we can calculate the entropy of the balls in the box - a number that tells us the maximum number of balls we can place in a box. With that in mind, I can figure out how best to spend my time and money: I might choose to spend more money and pack the balls quickly into a lot of boxes, or I might spend a lot of time packing the balls into as few boxes as possible. Most likely, I will choose a happy medium of cost vs. time. But it is important to recognize that Shannon's work tells me what I am capable of.
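For messages rather than balls, Shannon's entropy is the average information per symbol: H = -Σ p·log₂(p), where p is the probability of each symbol. Here is a minimal sketch; `entropy_bits` is a name made up for this example:

```python
import math
from collections import Counter

def entropy_bits(message):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A message with less variety has lower entropy, so it compresses better.
print(entropy_bits("aaaaaaab"))   # low entropy: mostly one symbol
print(entropy_bits("abcdefgh"))   # maximum entropy for 8 distinct symbols: 3.0
```

A lower entropy means fewer bits are needed, on average, to represent each symbol - which is exactly the limit a compressor is chasing.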
A Bad Example To Follow That Bad Analogy
This is a terrible example of entropy and how it can be used for compression. Forgive me. Let's imagine that we want to send the lyrics of the Bryan Adams classic hit, "Have You Ever Really Loved A Woman?". Our goal is to send the smallest amount of information. For this example, assume that any key we create is available to the recipient of the message.
If you want to play along at home, the lyrics I am using can be found here. There are 328 words in the song, counting duplicates (95 unique words). Incidentally, the ratio of unique words to all words is an indicator of how well we might be able to compress the message.
After some processing, it turns out that the most common word is "you", which shows up 26 times (nearly 8% of the words in the song). In second place is "her" with 22 occurrences, and "really" shows up 20 times. Here is the breakdown of the top 10 words:
Percent Occurrences Word
8% 26 you
7% 22 her
6% 20 really
5% 17 woman
5% 16 a
4% 14 tenderness
4% 12 loved
3% 10 ever
3% 8 you
2% 6 yeah
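Counts like the table above can be produced in a few lines of Python. The `lyrics` string here is a short stand-in, not the real song (which has 328 words, 95 unique):

```python
from collections import Counter

# A stand-in snippet of lyrics - substitute the full song text here.
lyrics = "have you ever really really really ever loved a woman"

words = lyrics.lower().split()
counts = Counter(words)

# Print a frequency table like the one above: percent, count, word.
for word, n in counts.most_common(3):
    print(f"{n / len(words):3.0%} {n:2d} {word}")
```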
The 10 most common words in this song make up 46% of the total words in the lyrics. We can assign a number to each word in the list.
Percent Occurrences Word Number
8% 26 you 0
7% 22 her 1
6% 20 really 2
5% 17 woman 3
5% 16 a 4
4% 14 tenderness 5
4% 12 loved 6
3% 10 ever 7
3% 8 you 8
2% 6 yeah 9
We can pair the number with the word and refer to that as our "key". Our key, then, looks like this:
0 you
1 her
2 really
... and so on
If we were to use these numbers to represent the lyric "really really ever loved a woman", we would write "2 2 7 6 4 3". The original phrase is 32 characters (think of a "character" as a keystroke), whereas the "compressed" phrase is 11 characters long - roughly 1/3rd of the original message!
We can assign a number (or symbol) to each word or phrase in the song. When the recipient gets the compressed message, they can use our key to figure out what the original message was. For example, when the recipient sees the phrase "2 2 7 6 4 3" and applies the key, the recipient can recover the original phrase ("really really ever loved a woman").
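The key scheme above can be sketched in a few lines of Python. The `compress`/`decompress` helpers are my own illustration, and I use only the first eight entries of the ranked table as the key:

```python
# Build the key from the ranked word list above (rank -> word).
key = ["you", "her", "really", "woman", "a",
       "tenderness", "loved", "ever"]        # first 8 entries of the table
encode_map = {word: str(i) for i, word in enumerate(key)}
decode_map = {str(i): word for i, word in enumerate(key)}

def compress(phrase):
    """Replace each word with its number from the key."""
    return " ".join(encode_map[w] for w in phrase.split())

def decompress(codes):
    """Apply the key in reverse to recover the original words."""
    return " ".join(decode_map[c] for c in codes.split())

original = "really really ever loved a woman"
packed = compress(original)
print(packed)                          # "2 2 7 6 4 3"
print(decompress(packed) == original)  # True - the key is lossless
print(len(original), len(packed))      # 32 vs 11 characters
```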
Note that all is not perfect! Only ten words can receive a one-digit number. When we assign a number to a word outside the top ten, we must use a two-digit number (all the one-digit numbers have been used). For a one- or two-letter word far down the list, this means the compressed representation is longer than the uncompressed representation!
The Takeaway
Please keep in mind that this exercise is simply meant to show the basic concepts of entropy and how it might be useful for compression.

Nerd Spoiler Alert
If you are really interested in some of the complexity, think about breaking the lyrics into repeating "patterns", not "words". For example, "really really really" is a pattern that shows up. It would be extremely beneficial to represent that whole pattern as "1" - compressing the phrase to 1/20th of its original size. But it might be the case that the best pattern is actually every 6 characters, regardless of word boundaries - it all depends on a lot of things! We also need to store the data representing the decompression key - all in less space than the original message. And we can use clever tricks such as changing keys based on the input stream, for starters.
Keep in mind that the choice of compression strategy also depends on available compute power, on the type of data, and on its entropy. There is no single right answer.
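Real-world compressors do this pattern-hunting automatically. As a quick demonstration, Python's standard `zlib` module (DEFLATE, the algorithm behind gzip and ZIP) crushes repetitive text like a chorus:

```python
import zlib

# Repetitive text compresses extremely well; the decompressed bytes
# come back identical, so the scheme is lossless.
chorus = b"really really really ever loved a woman " * 20
packed = zlib.compress(chorus)

print(len(chorus), len(packed))           # compressed is far smaller
assert zlib.decompress(packed) == chorus  # every original byte is recovered
```

Low-entropy input (a chorus repeated 20 times) is exactly the easy case Shannon's theory predicts; random, high-entropy input would barely shrink at all.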