1. Information Content
Information content is a measure of the surprise associated with an event from a random variable.
The formula for the information content of an event \(x\) with probability \(p(x)\) is as follows,
\[
I(x) = -\log_2 p(x).
\]
Let us build some intuition around why this is a good measure of surprise.
Claude suggested that a good measure of information content should satisfy the following axioms:
Events that are more unlikely should result in more information.
An event with 100% probability should result in zero information.
If two independent events are measured, the total amount of information gained should be the sum of their individual \(I\) values.
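A quick check shows that \(I(x) = -\log_2 p(x)\) satisfies all three requirements:
\[
\begin{aligned}
p(x) \to 0 &\;\Longrightarrow\; I(x) = -\log_2 p(x) \to \infty,\\
p(x) = 1 &\;\Longrightarrow\; I(x) = -\log_2 1 = 0,\\
I(x, y) &= -\log_2\big(p(x)\,p(y)\big) = -\log_2 p(x) - \log_2 p(y) = I(x) + I(y),
\end{aligned}
\]
where the last line uses the fact that the joint probability of two independent events factorizes.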
Let us look at a simple example. Suppose we flip a fair coin. The outcomes of this random variable, \(X\), are either \(H\) or \(T\), each of which has a 50% probability of occurring. What is the information content of these events?
The examples presented in this notebook are lifted from Information Theory, Inference, and Learning Algorithms by David J.C. MacKay.
import math

def entropy(p):
    # Information content (self-information), in bits, of an event with probability p.
    return -math.log2(p)

print(entropy(0.5))
1.0
The information we gain from this coin flip is 1 bit. This makes sense: the outcome of a fair coin flip can be encoded using exactly 1 bit.
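As a small aside (a hypothetical example, not one of MacKay's), a biased coin makes the first axiom concrete: the common outcome carries little information, while the rare outcome carries a lot.

# Hypothetical biased coin with P(H) = 0.9 and P(T) = 0.1
print(entropy(0.9))  # ~0.15 bits: heads is expected, so little surprise
print(entropy(0.1))  # ~3.32 bits: tails is rare, so much more surprise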
Now let us suppose we play a simple game of battleship on an 8 x 8 grid, in which we are trying to guess the position of a single battleship that takes up one square. The probability that we find the battleship on our first guess is \(\frac{1}{64}\). What is the information we gain from this event?
entropy(1 / 64)
6.0
We get 6 bits of information, which is more than the coin flip gave us. This is not surprising, as finding the battleship is a rarer event.
Suppose now that we only hit the battleship on the 32nd guess, after 31 misses on distinct squares. The total information gained from this sequence of guesses is computed as follows.
total_information = 0
n = 64  # squares not yet ruled out
for i in range(1, 32):
    # 31 misses: each has probability (n - 1) / n given the squares still in play
    total_information += entropy((n - 1) / n)
    n -= 1
total_information += entropy(1 / n)  # the hit on the 32nd guess, with 33 squares left
total_information
6.0
In fact, no matter when we find the battleship, the total information is 6 bits: the per-guess probabilities telescope, so their product is always \(\frac{1}{64}\) and the logarithms sum to \(\log_2 64 = 6\).
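A quick check (a small sketch in the same spirit as the cell above, not from MacKay) confirms this for every guess on which the ship could be found:

# For every possible winning guess k from 1 to 64, add up the information
# from the k - 1 misses and the final hit; the total is always 6 bits.
for k in range(1, 65):
    info = 0.0
    remaining = 64
    for _ in range(k - 1):
        info += entropy((remaining - 1) / remaining)
        remaining -= 1
    info += entropy(1 / remaining)
    assert abs(info - 6.0) < 1e-9
print("6 bits no matter when the ship is found")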
Things are not as nice, however, when the number of squares in the grid is not a power of two. For example, if we have 65 squares, the information we get from finding the ship on the first guess is,
entropy(1 / 65)
6.022367813028454
We don’t get the nice round numbers we were seeing before. Therefore, information content should not be thought of as simply the number of bits needed to encode the events of a random variable.
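That said, the telescoping argument still holds in the 65-square grid: however long it takes to find the ship, the total information comes out to \(\log_2 65 \approx 6.02\) bits; it just isn't a whole number. A quick check, reusing the cell structure from above (again a sketch, not from the book):

# Same 31-misses-then-a-hit scenario as before, but on a hypothetical 65-square grid
total_information = 0
n = 65
for i in range(1, 32):
    total_information += entropy((n - 1) / n)
    n -= 1
total_information += entropy(1 / n)
print(total_information, math.log2(65))  # both are ~6.022 bits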