Thursday, February 4, 2016

Shannon's Balls

Information Theory and Pornography

In 1948, Claude Shannon published a very important paper entitled "A Mathematical Theory of Communication". In it, Shannon basically invented information theory. While it sounds very technical and boring, information theory is crazy important in today's world because it tells us exactly how much porn we can download.

That's right - Shannon drops a knowledge bomb on us so large that the aftershocks are still felt today. Shannon ties entropy, negative logarithms and way nerdier concepts into a model that is the basis of nearly every bit of electronic communication today. From error detection to error correction to encryption to compression, all electronic communication is rooted in Shannon's work.

I mentioned porn because it is relevant (inasmuch as it is a compressible form of information), but also it helps to draw the reader in. Also, it increases my SEO score.


Entropy - Why Does It Matter?

Great question! Let me feed you, baby bird... If we can calculate the entropy of a message, we can determine exactly how much the message can be compressed. This is important because all Internet traffic is metered at some point - someone is paying for each and every bit that ventures onto the Internet. Bandwidth is a real, measurable cost. Maybe not for the home consumer paying $50/month for a cable modem connection, but certainly for the large data centers that handle significant portions of Internet traffic.
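Shannon's entropy formula is short enough to show here. This is a minimal sketch, not anything from Shannon's paper verbatim: it sums -p·log2(p) over the symbols in a message, which gives the average number of bits you need per symbol.

```python
# A minimal sketch of Shannon entropy: the average number of bits
# needed per symbol, given each symbol's frequency in the message.
from collections import Counter
from math import log2

def entropy(message):
    counts = Counter(message)
    total = len(message)
    # (n/total) is the probability of a symbol; log2(total/n) is its "surprise"
    return sum((n / total) * log2(total / n) for n in counts.values())

# A message with only one symbol carries no information...
print(entropy("aaaaaaaa"))  # 0.0
# ...while one that uses four symbols evenly needs 2 bits per symbol.
print(entropy("abcdabcd"))  # 2.0
```

A lower entropy means more redundancy, which means more room for compression.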

According to this outdated Wikipedia article, the Internet saw nearly 1 petabyte of traffic every day in 2012. If we extrapolate just past a linear slope, we are approaching 2 petabytes of traffic per day in 2016. To put things in perspective, a petabyte is a 1 with fifteen 0's after it. And petabytes count bytes, not bits; a byte is 8 bits, so multiply by 8 for the number of bits. The Internet sees

1,000,000,000,000,000 bytes per day


That is a lot. Huge, in fact. Any basic calculator more than a few years old will not allow numbers that big. Let me put this in perspective for you. If you had one thousand dollar bills, you would have a stack of bills about 4.3 inches high. If you had 1,000,000,000,000,000 dollar bills, you would have a stack that would go from here to the moon and back nearly 150 times over (roughly 70 million miles). If you look at the traffic from your cable modem in one day (with an aggressive estimate of 8GB), you are looking at a stack of bills over 500 miles high - and even that is only a few millionths of the petabyte stack.
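If you want to check the dollar-bill arithmetic yourself, here is the back-of-the-envelope version. It assumes a bill is 0.0043 inches thick and the moon is about 238,900 miles away - both round figures, not precise measurements.

```python
# Back-of-the-envelope check of the dollar-bill stacks above.
# Assumes a bill is 0.0043 inches thick and the moon is ~238,900 miles away.
BILL_INCHES = 0.0043
INCHES_PER_MILE = 63360
MOON_MILES = 238_900

def stack_miles(bills):
    return bills * BILL_INCHES / INCHES_PER_MILE

petabyte_stack = stack_miles(10**15)
print(petabyte_stack)                     # roughly 68 million miles
print(petabyte_stack / (2 * MOON_MILES))  # ~142 round trips to the moon
print(stack_miles(8 * 10**9))             # one day of cable-modem traffic: ~540 miles
```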

All this to say: if everyone saved even 1% of the bytes sent across the Internet, we would save roughly 20 terabytes of bandwidth every day - more than the entire Internet carried in a day back in the early 1990s. So yes, compression is important.

Don't Forget The Balls

Let's relate entropy to balls. Specifically, shipping balls in a box. Consider a scenario where I have to ship hundreds of balls across the country.

Free Balls

All the boxes I have are fairly standard shipping boxes, and it will take a number of boxes to fit all the balls. Each box costs $12 to send via USPS, so it is in my best interest to reduce the number of boxes I send.

Packaged Balls
A simple and fast solution might be to haphazardly throw balls into a box until no more fit, seal the box and start over with the next empty box. This will work and it will be quick for me - but the speed comes at a financial cost (more boxes).

Alternatively, I could place each ball in the box very carefully, lining up all the balls and optimizing the space in each box. It is likely that I would use fewer boxes - this is a long solution, but it has a lower financial cost.

Using Shannon's work, we can calculate the entropy of the balls in the box. This number tells us the maximum number of balls that can fit in a box, and with that in mind I can figure out how best to spend my time and money. I might choose to spend more money and package the balls quickly into a lot of boxes; conversely, I may choose to spend a lot of time packing the balls into as few boxes as possible. Most likely, I will choose a happy medium of cost vs. time. But it is important to recognize that Shannon's work tells me what I am capable of.

A Bad Example To Follow That Bad Analogy

This is a terrible example of entropy and how it can be used for compression. Forgive me.

Let's imagine that we want to send the lyrics for the Bryan Adams classic hit, "Have You Ever Really Loved A Woman?". Our goal is to send the smallest amount of information. For this example, assume that any key we create is available to the recipient of the message.

If you want to play along at home, the lyrics I am using can be found here. There are 328 words in the song, counting duplicates (95 unique words). Incidentally, the ratio of unique words to all words is an indicator of how well we might be able to compress the message.

After some processing, it turns out that the most common word is "you", which shows up 26 times (nearly 8% of the words in the song). Second place goes to "her" with 22 occurrences, and "really" shows up 20 times. Here is the breakdown of the top 10 words:

Percent  Occurrences  Word
   8%        26        you
   7%        22        her
   6%        20        really
   5%        17        woman
   5%        16        a
   4%        14        tenderness
   4%        12        loved
   3%        10        ever
   3%        8         you
   2%        6         yeah
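Counts like the table above take only a few lines of Python to produce. The `lyrics` string below is a stand-in (I can't reprint the song here), so the numbers it prints are for the sample text, not the real song:

```python
# Count word frequencies the way the table above was built.
# `lyrics` is a stand-in string, not the actual song.
from collections import Counter

lyrics = "really really really ever loved a woman you you her"
words = lyrics.lower().split()
counts = Counter(words)

# Print percent, occurrences, and word - same columns as the table.
for word, n in counts.most_common(3):
    print(f"{n / len(words):4.0%} {n:2d}  {word}")
```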

The 10 most common words in this song make up 46% of the total words in the lyrics. We can assign a number to each word in the list.

Percent  Occurrences  Word          Number
   8%        26        you            0
   7%        22        her            1
   6%        20        really         2
   5%        17        woman          3
   5%        16        a              4
   4%        14        tenderness     5      
   4%        12        loved          6
   3%        10        ever           7
   3%        8         you            8
   2%        6         yeah           9

We can pair the number with the word and refer to that as our "key". Our key, then, looks like this:
0   you
1   her
2   really
... and so on

If we were to use these numbers to represent the lyric "really really ever loved a woman", we would write "2 2 7 6 4 3". The original phrase is 32 characters (think of a "character" as a keystroke), whereas the "compressed" phrase is 11 characters long - about a third of the original message!

We can assign a number (or symbol) for each word or phrase in the song. When the recipient gets the compressed message, they can use our key to figure out what the original message was. For example, when the recipient sees the phrase "2 2 7 6 4 3" and applies the key, the recipient can reconstruct the original phrase ("really really ever loved a woman").
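The encode/decode round trip above can be sketched in a few lines of Python, using the first eight numbers from the table:

```python
# A sketch of the key-based scheme above: word -> number from the table,
# and the reverse mapping for the recipient.
key = {"you": "0", "her": "1", "really": "2", "woman": "3",
       "a": "4", "tenderness": "5", "loved": "6", "ever": "7"}
reverse_key = {number: word for word, number in key.items()}

def encode(phrase):
    return " ".join(key[word] for word in phrase.split())

def decode(numbers):
    return " ".join(reverse_key[n] for n in numbers.split())

compressed = encode("really really ever loved a woman")
print(compressed)          # 2 2 7 6 4 3
print(len(compressed))     # 11 characters, down from 32
print(decode(compressed))  # really really ever loved a woman
```

Note that the recipient needs the `key` dictionary too - which is exactly the caveat discussed in the takeaway below.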

Note that all is not perfect! Once all ten one-digit numbers are used up, the remaining words must be assigned two-digit numbers. For a one-letter word that falls outside the top ten, that means the compressed representation is actually longer than the uncompressed one!

The Takeaway

Please keep in mind that this exercise is simply to help show the basic concepts of entropy and how it might be useful with compression.

This is a simplification of sorts, and there are a lot of considerations with compression and entropy. For example, the "key" needs to be sent along with the message; consequently, in order to be useful, the "compressed" message plus the key needs to be smaller than the original message. As well, I cheated and used spaces in the compressed message. In real life, word delineation would happen a different way - in fact, a lot of complicated things happen. But the concept is sound and hopefully you have a better understanding of compression.



Nerd Spoiler Alert

If you are really interested in some of the complexity, think about breaking the lyrics into "patterns" that repeat, not "words". For example, "really really really" is a pattern that shows up. It would be extremely beneficial to represent that 20-character pattern with a single symbol - compressing the phrase to 1/20th of its original size. But it might be the case that the best pattern is really every 6 characters, regardless of word boundaries - it all depends on a lot of things!
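Real compressors find those repeating patterns automatically. Python's built-in zlib module (which implements DEFLATE) makes the point in a few lines - highly repetitive text shrinks dramatically:

```python
# Repetitive input compresses extremely well under DEFLATE,
# because the repeated pattern is stored once and referenced.
import zlib

text = ("really really really " * 20).encode()
compressed = zlib.compress(text)

print(len(text), len(compressed))        # 420 bytes in, far fewer out
assert zlib.decompress(compressed) == text  # lossless round trip
```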

But we also need to store the data which represents the decompression key - all in less space than the original message. And we can use clever tricks such as changing keys based on stream input, for starters.

Keep in mind that the strategy used to compress is also based on compute power. And the type of data (and the related entropy). There is no single right answer.

