Sunday, February 28, 2016

Quantification of Successful Concept Predictors

The 5 Core Qualities

As we discuss concepts (for the sake of this conversation, consider a "concept" as an idea, software project, or some tangible product), it becomes important to define the potential of success for a concept. I argue that there are five core qualities which can be used as predictors for the success of a concept. The classification of these core qualities is not perfect and there is overlap between some of the qualities. However, it is a good starting point and will allow concepts to be compared alongside each other. This is an important theme moving forward. Recall, a good deal of the content of this blog is theoretical in nature; consequently, there are often no actual results with which to gauge success. It is therefore imperative that we have a means to estimate how successful an idea may become. In particular, this estimator can be used to decide which projects should be afforded more research and which projects should be tabled.

I'd like to introduce the core qualities here, but I hope to expand on them and construct more precise definitions at some point.

Value

Does the concept have inherent value or can the concept bring some value? Value is perhaps the most nebulous of the core qualities. For example, it is difficult to quantify the value obtained from a piece of software and even more challenging to compare that value against value derived from a tangible good. Nevertheless, we must strive to quantify the value so that we can predict success. Most people will probably see "value" as the most important core quality, though I suspect that it is the hardest core quality to assess.

Accessibility

Is the concept accessible? Consider cost and required resources in order to execute the concept.

This core quality is assessed with respect to the average person, not the intended concept consumer; this approach leads to less biased comparisons across concepts. Consider a domain-specific assessment for like-minded concepts.

For example, if a concept describes a manufacturing process for piston rings which will not wear over time, what are the execution requirements? Almost certainly there is a need for precision manufacturing equipment and specialized raw materials. Accessibility for the average person, then, would be low. Conversely, a concept describing a new way to tie a shoe would have a very high accessibility score because nearly everyone in the world has the equipment and ability to execute the concept.

Usability

How easy is it to execute the concept? Here, the definition of "usability" changes from concept to concept. For software, usability might describe the user interface. For a car, usability may describe steering, acceleration and braking. Some people may think of usability as a means to describe simplicity or intuitiveness.

Classic example: if the software interface is not intuitive and the instruction manual is written in Klingon, then the concept scores low with respect to usability because most people cannot easily use the software.

As a note, when I am involved with designing software, I often apply the "mother rule" to common or complicated functionality. I ask whether or not my mother could use the functionality without needing someone to explain it to her. This turns out to be a good litmus test for other developers on my team - perhaps because other developers can relate to the situation. To wit, most developers have fixed their mother's computer, fielded calls about email attachments, or configured a new computer.

Awareness

How are others made aware of the concept? It does not matter how awesome my concept is or how easy it is to implement or execute - if no one knows about my concept, it may as well not exist.

Bringing awareness to a concept can sometimes be challenging. It is not my intent to discuss how to generate awareness; rather, my intent is to simply state that awareness of a concept is fundamental in predicting success. More to the point, the ability to generate awareness is a prerequisite for success.

Reliability

Is the concept reliable? Another nebulous determinant, but still important. For a car manufacturer, the car needs to run well with little maintenance. For iPhone applications, the app needs to not crash frequently (or at all). How we measure reliability will differ between concepts, but we should be able to capture some value of reliability.

Now What?

We need to identify a way to capture all the core qualities at once. How can we easily convey the potential of success for a concept? I think that a good visualization should be able to convey its message with no words (legends, keys, tips, etc.). Radar charts are our friend here. These charts allow us to render multivariate data in an easy-to-understand manner. At a glance, we can see how a concept's core qualities rank with respect to each other.
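
As a sketch of how such a chart can be built, the snippet below spreads the five qualities evenly around a circle and closes the polygon - the scores are made-up example values, and the plotting itself (e.g. matplotlib's polar axes) is left as a comment:

```python
import math

# Made-up example scores (0-10 scale) for a hypothetical concept.
scores = {"Value": 5, "Accessibility": 9, "Usability": 8,
          "Awareness": 5, "Reliability": 2}

def radar_outline(scores):
    """Place each quality at an even angle around the circle and repeat
    the first point so the outline closes. Feeding these angles/values
    to a polar plot (e.g. matplotlib ax.plot on projection='polar')
    draws the radar chart."""
    n = len(scores)
    angles = [2 * math.pi * i / n for i in range(n)]
    values = list(scores.values())
    return angles + angles[:1], values + values[:1]

angles, values = radar_outline(scores)
```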

If you are not familiar with a radar chart, I think it is easier to see an imperfect concept before an ideal (or perfect) one. Take a look at the chart below - hopefully you are able to infer that the concept lacks in reliability, is average with respect to value and awareness, and scores well with usability and accessibility.
[Radar chart: Poor Reliability]
Below, I illustrate the "perfect concept". This radar chart shows what a theoretically perfect concept would look like. I find this quite boring when compared to an imperfect concept.
[Radar chart: Perfect Concept]

Comparing Concepts

Comparing concepts against each other is an art. We can look at a number of concepts simultaneously:
[Radar charts: Comparisons]
While somewhat helpful, this is not ideal. It is difficult to rank these concepts from best to worst (most-likely to succeed and least-likely to succeed). My brain cannot easily quantify and rank these concepts (with the notable exception of the ideal concept).

I have used various strategies with limited success when comparing concepts. From assigning "point values" and success coefficients to coloring the points to calculating volume, I have not seen anything that works better than intuition derived from some form of visual representation. And I think there is inherent value in gut feelings.

Multiple Concepts

Sometimes I am evaluating multiple concepts where the usability does not matter much - this happens often with software projects I write for myself (I simply do not care if anyone else can use the software other than me). I guess what I am trying to say (poorly) is that I need to evaluate groups of concepts with unique constraints on the set (i.e., "usability does not matter" or "awareness is inconsequential"). Regardless of the constraints, if I know what I am looking for, it becomes easier to process the radar charts.

Takeaway

The main message to take away is that we need a way to quantify the potential for success of a concept. This is important because I will be talking about a wide variety of concepts and it is valuable to have a sense of which projects I should be pursuing. By defining and understanding the five core qualities, I can assign unbiased values to each quality and represent the assignments as a radar chart. This permits me to be "roughly right" in the "resource assignment" phase of the concept. I simply cannot afford to invest in the concepts which predict little chance of success.

Wednesday, February 17, 2016

Checksums and Bugs

Bugs Are a Joke

Those familiar with the history of computers know all too well that bugs are a joke. Literally. Like "Walt Disney on Ice" literally. The first well-known defect (the term we use professionally to describe a bug, or "undesired/unexpected behavior") was a moth. That's right - not a typo. You see, back in the days of Shannon (remember him?), computers had a lot of moving parts. And sometimes, bugs would fly into the moving parts. And sometimes the bug might have an unfortunate experience with the moving parts. True story - a moth flew into a relay of the Harvard Mark II back in 1947, and Grace Hopper's team found the moth (essentially finding the first "bug"). Incidentally, Grace Hopper is totally awesome and was a true pioneer in the field of computer science.

Checksums

A checksum is a fancy term we use for making sure computers didn't miss something. Not surprisingly, checksums are used in Information Theory (again, remember Shannon?). Bugs notwithstanding, the computer world is not perfect. Stuff happens, and that stuff can be out of our control. Signals are susceptible to interference. In the computer world, all communication boils down to 1's and 0's. But nature doesn't always play nicely, and sometimes all those pretty copper or fiber lines that carry your Internet traffic pick up noise. If too many corrupted bits go unnoticed, bad things tend to happen. Happily, pioneers of Information Theory understood this (keep in mind that this dates back to WWII, when radio communication was expensive but extremely important). They came up with the notion of a checksum. There are a ton of really good uses, but this is a trivial example:

For every 7 bits I send, I will add an 8th bit which will make the number of 1's in the 8 bit string even; when you receive my message, you can check that the number of 1's is even.

Example:
I send the 7 bit value:
0110011

There are 4 1's in the string - an even number; consequently, I do NOT need to add another 1, so I add a 0 instead.
01100110

Now, when you receive the value, you will remove the last 0 from the byte (a byte is 8 bits), resulting in the original message:
0110011

You will count the number of 1's and see that there are 4 - which is even!

But what happens when something goes awry during the transmission? Let's take a look. I send the original message (with the "0" appended):
01100110

But because of transmission errors, you receive:
01000110

You count the number of 1's in the message and see that the value is odd, not even. You have just detected an error (because of checksums)! All that you can deduce is that there was an error in the transmission - you do not know where the error occurred - just that there is an error.

This is a simple example, but hopefully you get the gist of checksums. Be aware that the particular example I illustrated will only detect one error. Multiple errors may go undetected. Fear not - science has provided more complex checksum methods to account for that; as well, some methods can offer error correction. That is to say that not only can errors be detected, but in some cases, the errors can be corrected - wild!
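
The parity scheme above takes only a few lines of code - a toy illustration, not a production checksum:

```python
def add_parity(bits):
    """Append an even-parity bit so the total count of 1's is even."""
    return bits + ("1" if bits.count("1") % 2 else "0")

def check_parity(byte):
    """A received byte passes the check if its count of 1's is even."""
    return byte.count("1") % 2 == 0

sent = add_parity("0110011")   # four 1's already, so a 0 is appended
assert sent == "01100110"
assert check_parity(sent)

corrupted = "01000110"         # one bit flipped in transit
assert not check_parity(corrupted)
```

Note that flipping two bits of `sent` would still pass the check - exactly the single-error limitation described above.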

OK, But What About The Real World?

Great question! You see checksums all the time. UPC symbols have them - so any time you buy something that is scanned, you are seeing checksums at play (technically it is a check digit, but the concept is the same).

Checksums are also used in ISBNs. If you are old enough to remember what a real book looked like, you might recall that each book has its own ISBN. Boring, but it might help on Final Jeopardy.

Try It At Home

Need a weekend project? Try this on for size: credit card authorization did not always happen instantly. Merchants would batch up all the transactions over a set amount of time and then process them with the credit card company. And sometimes the processing used unreliable communication links (sound familiar yet?). In order to help identify which credit card numbers were transmitted in error, credit cards have a checksum built in (a close cousin of the scheme below - the Luhn algorithm - is still used today). Take your favorite credit card and do the following:

  1. Multiply the first number by 1
  2. Multiply the second number by 2
  3. Multiply the third number by 1
  4. Multiply the fourth number by 2
  5. ...
  6. Repeat for all numbers except for the last number
  7. Sum up all the products
  8. Add the last number to the sum of the products
  9. If the total sum is evenly divisible by 10 (modulo 10 if you are mathy), the number is valid
For example, given the 16 digit credit card number and the multipliers:
  1234 1234 1234 123?
x 1212 1212 1212 121 

We get the sum:
      1438 1438 1438 143 = 56

The check digit needs to be a value such that when added to 56, will be divisible by 10. We need a 4. The complete credit card number is then:
1234 1234 1234 1234
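
Here is a sketch of the scheme exactly as walked through above. Be aware this is a simplified cousin of the real Luhn algorithm, which doubles alternate digits starting from the right and also sums the digits of any product greater than 9; with the small digits in our example, that difference never shows up:

```python
def checksum_valid(card_number):
    """Validate a card number using the simplified 1-2-1-2 scheme above.
    (Not the real Luhn algorithm - see the caveat in the text.)"""
    digits = [int(c) for c in card_number if c.isdigit()]
    # Multiply digits by 1, 2, 1, 2, ... - all except the last (check) digit.
    total = sum(d * (2 if i % 2 else 1) for i, d in enumerate(digits[:-1]))
    # Add the check digit; the grand total must be divisible by 10.
    return (total + digits[-1]) % 10 == 0

print(checksum_valid("1234 1234 1234 1234"))  # True: 56 + 4 = 60
```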

And How Does This Relate To Bugs?

Fair question. This is where the boring part comes into play, but it is important to understand. Remember when I was talking about motion detection? When I was writing the detection algorithm, I noticed that the edges of the image did not look quite right. When I looked into it a bit further, I found that the image was being improperly handled: my algorithm was cropping part of the image (to focus on the section of the image which I was interested in), but the code I used to crop the image was not properly handling decompression/compression. As a consequence, the parts of the image that were cropped off still had data lingering in the compressed representation. Think about it like this:

The compressed image has localized data: basically a count of the number of pixels of each color within each quadrant of the image, along with a checksum of that value. But when I removed part of the image, the count was not updated and the checksum no longer validated.

By way of example, let me illustrate. Remember this dude?
[Image: The Guy]
If we want to compress the image, we might break up the image into quadrants like this:
[Image: Quadrants]
Now, count the number of black squares in each quadrant and add a check digit for that value. For each of the top two quadrants, we have 11 full black squares. In the lower quadrants, we have 7 black squares. Using a modulo-10 check digit, our values for the top quadrants are 11 and 9 (11 is the count, 9 is the check digit) and the lower quadrants are 7 and 3. We can represent the values as tuples, and they look like this: (11, 9) and (7, 3).

Let's say that I crop the image, like so:
[Image: Cropped for profile pic]


But whoops! I did not update the compression data with checksums! The top two quadrants will still validate (11, 9), but the two lower quadrants are in error. The lower quadrants should be (4, 6) but are still (7, 3). By virtue of not recalculating the compression values, the data we have about the image describes part of the image which is no longer present.
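
The stale-metadata situation can be sketched in a few lines. The 4x4 "image" below is made-up toy data (not the actual sprite or my real compression format), but the failure mode is the same:

```python
# Toy 4x4 "image": 1 = black pixel, 0 = white (made-up data).
image = [
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

def quadrant_tuples(img):
    """For each 2x2 quadrant, return (black-pixel count, mod-10 check digit)."""
    h, w = len(img) // 2, len(img[0]) // 2
    result = []
    for r0 in (0, h):
        for c0 in (0, w):
            count = sum(img[r][c] for r in range(r0, r0 + h)
                                  for c in range(c0, c0 + w))
            result.append((count, (10 - count % 10) % 10))
    return result

def validates(count, check):
    return (count + check) % 10 == 0

stale = quadrant_tuples(image)            # metadata computed before the crop
assert all(validates(c, k) for c, k in stale)

# "Crop" the bottom half of the image but forget to refresh the metadata...
cropped = image[:2] + [[0] * 4, [0] * 4]
fresh = quadrant_tuples(cropped)
# ...and the bottom quadrants no longer validate against the stale tuples.
assert not validates(fresh[2][0], stale[2][1])
```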

Is This A Problem?

No, not at all. This does not affect how we perceive the image - the image will continue to look as it should because the compression error has no visual manifestation. But software that looks at the checksum will notice that something does not quite add up (pun intended).

Endgame

What is the takeaway? Hopefully you have a slightly better understanding of what a checksum is and how it works. Additionally, you now have an anecdote to tell the next time someone says that they found a bug in some software.

Also, for the purpose of this post, I use the term "checksum" interchangeably with the term "check digit". While the two terms are related, they are not exactly the same - checksums are typically more complex than what I describe, but the concept is the same. "Checksum" is the more widely known term, so we'll stick with that.

Thursday, February 4, 2016

Shannon's Balls

Information Theory and Pornography

In 1948, a very important paper was published by Claude Shannon. The paper was entitled "A Mathematical Theory of Communication". In it, Shannon basically invented information theory. While very technical and boring sounding, information theory is crazy important in today's world because it tells us exactly how much porn we can download.

That's right - Shannon drops a knowledge bomb on us so large that the aftershocks are still realized today. Shannon relates entropy, negative logarithms and way nerdier concepts into a model that is the basis of nearly every bit of electronic communication today. From error detection to error correction to encryption to compression, all electronic communication is rooted in Shannon's work.

I mentioned porn because it is relevant (inasmuch as it is a compressible form of information), but also it helps to draw the reader in. Also, it increases my SEO score.


Entropy - Why Does It Matter?

Great question! Let me feed you, baby bird... If we can calculate the entropy of a message, we can determine exactly how "compressed" the message can be. This is important because all Internet traffic is metered at some point. Someone is paying for each and every bit that ventures onto the Internet. Bandwidth costs are a real, viable metric. Maybe not for the home consumer where you pay $50/month for a cable modem connection - but for large data centers that handle significant portions of Internet traffic.

According to this outdated Wikipedia article, the Internet saw nearly 1 petabyte of traffic every day in 2012. If we extrapolate along a slightly-more-than-linear slope, we are approaching 2 petabytes of traffic per day in 2016. To put things in perspective, a petabyte is a 1 with fifteen 0's after it. And petabytes count bytes, not bits. A byte is 8 bits, so multiply by 8 for a more accurate number. The Internet sees

1,000,000,000,000,000 bytes per day


That is a lot. Huge, in fact. Any basic calculator more than a few years old will not allow numbers that big. Let me put this in perspective for you. If you had one thousand one-dollar bills, you would have a stack of bills 4.3 inches high. If you had 1,000,000,000,000,000 dollar bills, you would have a stack that would go from here to the moon and back almost 150 times over (roughly 70 million miles). If you map the traffic generated from your cable modem in one day (with an aggressive estimate of 8GB) to dollar bills, you are looking at a stack a few hundred miles high - which still isn't even a rounding error of a rounding error of a petabyte.

All this to say, if everyone could save even 1% of the bytes sent across the Internet, the savings would be measured in tens of terabytes every single day. So yes, compression is important.

Don't Forget The Balls

Let's relate entropy to balls. Specifically, shipping balls in a box. Consider a scenario where I have to ship hundreds of balls across the country.

[Image: Free Balls]

All the boxes I have are fairly standard shipping boxes, and it will take a number of boxes to fit all the balls. Each box costs $12 to send via USPS, so it is in my best interest to reduce the number of boxes that I want to send.

[Image: Packaged Balls]
A simple and fast solution might be to haphazardly throw balls into a box until no more fit, seal the box, and start over with the next empty box. This is a quick solution, but it comes at a financial cost (more boxes).

Alternatively, I could place each ball in the box very carefully, lining up all the balls and optimizing the space in each box. It is likely that I would use fewer boxes - this is a long solution, but it has a lower financial cost.

Using Shannon's work, we can calculate the entropy of the balls in the box. This number will tell us the maximum number of balls we can place in the box. With that in mind, I can figure out how best to spend my time and money. Using the ball entropy, I can determine my best course of action. I might choose to spend more money and package the balls quickly into a lot of boxes; conversely, I may choose to spend a lot of time to package the balls in as few boxes as possible. What is most likely is that I will choose a happy medium of cost vs. time. But it is important to recognize that Shannon's work tells me what I am capable of.

A Bad Example To Follow That Bad Analogy

This is a terrible example of entropy and how it can be used for compression. Forgive me.

Let's imagine that we want to send the lyrics for the Bryan Adams classic hit, "Have You Ever Really Loved A Woman?". Our goal is to send the smallest amount of information. For this example, assume that any key we create is available to the recipient of the message.

If you want to play along at home, the lyrics I am using can be found here. There are 328 words in the song, counting duplicates (95 unique words). Incidentally, the ratio of unique words to all words is an indicator of how well we might be able to compress the message.
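
The connection to entropy can be made concrete: the snippet below computes the Shannon entropy (bits per word) of a word sequence - the more skewed the word distribution, the fewer bits per word we need on average. The sample phrases are my own toy examples, not the actual lyrics:

```python
import math
from collections import Counter

def shannon_entropy(words):
    """Bits per word: -sum(p * log2(p)) over the word distribution."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Repetition lowers entropy, so the repetitive phrase is more compressible.
repetitive = "really really really loved a woman really really".split()
varied = "eight different words appear exactly once in here".split()
assert shannon_entropy(repetitive) < shannon_entropy(varied)
```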

After some processing, it turns out that the most common word is "you", which shows up 26 times (nearly 8% of the words in the song). Rounding out second place is "her" with 22 occurrences. The word "really" shows up 20 times. Here is the break down of the top 10 words:

Percent  Occurrences  Word
   8%        26        you
   7%        22        her
   6%        20        really
   5%        17        woman
   5%        16        a
   4%        14        tenderness
   4%        12        loved
   3%        10        ever
   3%        8         you
   2%        6         yeah

The 10 most common words in this song make up 46% of the total words in the lyrics. We can assign a number to each word in the list.

Percent  Occurrences  Word          Number
   8%        26        you            0
   7%        22        her            1
   6%        20        really         2
   5%        17        woman          3
   5%        16        a              4
   4%        14        tenderness     5      
   4%        12        loved          6
   3%        10        ever           7
   3%        8         you            8
   2%        6         yeah           9

We can pair the number with the word and refer to that as our "key". Our key, then, looks like this:
0   you
1   her
2   really
... and so on

If we were to use these numbers to represent the lyric "really really ever loved a woman", we would write "2 2 7 6 4 3". The original phrase is 32 characters (think of a "character" as a keystroke), whereas the "compressed" phrase is 11 characters long - nearly 1/3rd of the original message!

We can assign a number (or symbol) for each word or phrase in the song. When the recipient gets the compressed message, they can use our key to figure out what the original message was. For example, when the recipient sees the phrase "2 2 7 6 4 3" and applies the key, the recipient can recover the original phrase ("really really ever loved a woman").

Note that all is not perfect! A one-letter word like "a" is already as short as its single-digit code, and once all the 1 digit numbers have been used, the remaining words are assigned 2 digit numbers. For short or rare words, the "compressed" representation can end up no shorter - or even longer - than the uncompressed representation!
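
The whole scheme can be sketched as a toy dictionary coder. The key below is an excerpt of the table above (a real coder would carry the full 95-word key and account for its overhead):

```python
# Excerpt of the word-to-number key from the frequency table above.
key = {"you": "0", "her": "1", "really": "2", "woman": "3",
       "a": "4", "tenderness": "5", "loved": "6", "ever": "7"}

def compress(phrase):
    """Replace each word with its number from the key."""
    return " ".join(key[w] for w in phrase.split())

def decompress(coded):
    """Invert the key to recover the original words."""
    reverse = {v: k for k, v in key.items()}
    return " ".join(reverse[t] for t in coded.split())

phrase = "really really ever loved a woman"
coded = compress(phrase)              # -> "2 2 7 6 4 3"
assert decompress(coded) == phrase
# 32 characters shrink to 11 - roughly a third of the original.
assert (len(phrase), len(coded)) == (32, 11)
```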

The Takeaway

Please keep in mind that this exercise is simply to help show the basic concepts of entropy and how it might be useful with compression.

This is a simplification of sorts, and there are a lot of considerations with compression and entropy. For example, the "key" needs to be sent along with the message; consequently, in order to be useful, the "compressed" message plus the key needs to be smaller than the original message. As well, I cheated and used spaces in the compressed message. In real life, word delineation would happen a different way - in fact, a lot of complicated things happen. But the concept is sound and hopefully you have a better understanding of compression.



Nerd Spoiler Alert

If you are really interested in some of the complexity, think about breaking the lyrics into "patterns" that repeat, not "words". For example, "really really really" is a pattern that shows up. It would be extremely beneficial to represent that pattern as "1" - we have compressed that phrase to 1/20th of its original size. But it might be the case that the best pattern is really every 6 characters, regardless of word boundaries - it all depends on a lot of things!

But we also need to store the data which represents the decompression key - all in less space than the original message. And we can use clever tricks such as changing keys based on stream input, for starters.

Keep in mind that the strategy used to compress is also based on compute power. And the type of data (and the related entropy). There is no single right answer.