A friend of mine posed this brain teaser to me recently:
What's the length of the shortest bit sequence that's never been sent over the Internet?
We can never know for sure because we don't have a comprehensive list of all the data. But what can we say probabilistically? Restating it like so:
At what value of X is there a 50% chance that some sequence X bits in length hasn't been transmitted yet?
What does your intuition say? Obviously every 8-bit sequence has been sent, since there are only 256 values. By downloading this HTML page over TLS you've probably used up every 8-bit value. Has every 100-byte message been sent?
This is how my intuition went: it's probably less than 128 bits, because UUIDs are 128 bits and they're universally unique. It's probably greater than 48 bits, given how common collisions are at that size for hashes and CRCs, and how much traffic the Internet has generated.
How would we determine the right value?
I decided to model the data as if each bit sent were a fair coin flip. This isn't strictly true, of course, but with encryption becoming more prevalent, it's getting close.
So how many coin flips does it take, on average, to get $n$ heads in a row?
I found this neat little paper deriving the following formula, where $n$ is the number of heads in a row and $E$ is the expected number of flips:
$$ E = 2^{n+1} - 2 $$
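If you'd rather sanity-check that formula than read the derivation, a quick Monte Carlo simulation gets close to it for small $n$. This is just a sketch I threw together in Python; the trial count is arbitrary.

```python
import random

def flips_until_n_heads(n):
    """Flip a fair coin until we see n heads in a row; return the flip count."""
    flips = 0
    streak = 0
    while streak < n:
        flips += 1
        if random.random() < 0.5:   # heads
            streak += 1
        else:                       # tails resets the streak
            streak = 0
    return flips

for n in range(1, 8):
    trials = 20_000
    avg = sum(flips_until_n_heads(n) for _ in range(trials)) / trials
    print(f"n={n}: simulated {avg:.1f} vs formula {2**(n + 1) - 2}")
```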
We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is, since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length $n$, and given that 0 and 1 are equally likely, the sequence 00110 is just as likely as 11111.
(Of course, different sequences on the Internet are not all equally likely, but we're simplifying to make this calculable.)
We're looking for $n$, however, not the number of flips. So what should the number of flips be set to? It should be the total number of bits ever sent over the Internet, which we need to estimate. I found a nice table estimating how many petabytes per month were sent each year.
Adding them up gets you $3.4067 \cdot 10^{22}$ bits, which is in the same rough neighborhood as the number of grains of sand on Earth! Neat.
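For reference, the conversion is plain unit arithmetic: petabytes per month, times 12 months, times $10^{15}$ bytes per petabyte, times 8 bits per byte. Here's a sketch of the sum; the per-year figures below are placeholders, not the table's actual values.

```python
# Hypothetical petabytes/month per year -- substitute the real table's figures here.
pb_per_month_by_year = {
    2014: 60_000,
    2015: 72_000,
    2016: 96_000,
}

BITS_PER_PETABYTE = 8 * 10**15  # decimal petabytes: 10^15 bytes, 8 bits per byte

total_bits = sum(pb * 12 * BITS_PER_PETABYTE for pb in pb_per_month_by_year.values())
print(f"{total_bits:.4e} bits")
```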
To solve for $n$:
\begin{equation} \begin{aligned} E &= 2^{n + 1} - 2 \\ 3.4067 \cdot 10^{22} &= 2^{n + 1} - 2 \\ \log_2 (3.4067 \cdot 10^{22} + 2) - 1 &= n \\ n &= 73.85 \end{aligned} \end{equation}
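Or, doing the same arithmetic in code as a quick check (the only input is the total-bits estimate above):

```python
import math

total_bits = 3.4067e22              # estimated bits ever sent over the Internet
n = math.log2(total_bits + 2) - 1
print(f"n = {n:.2f}")               # n = 73.85
```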
So there's a 50% chance a message of length $73.85$ bits has not been sent yet. This matched my intuition nicely!
Using some forecasting estimates from Cisco, here's how $n$ changes over the next few years:
| Year | Estimate |
|------|----------|
| 2017 | $n = 74.28$ |
| 2018 | $n = 74.67$ |
| 2019 | $n = 75.05$ |
| 2020 | $n = 75.41$ |
| 2021 | $n = 75.75$ |
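To see what those projections imply in raw traffic, you can run the formula in reverse: each $n$ corresponds to a cumulative total of $2^{n+1} - 2$ bits. A quick sketch using the table's values:

```python
# Projected n values from the table above; invert E = 2^(n+1) - 2
# to see the cumulative traffic each projection implies.
projected_n = {2017: 74.28, 2018: 74.67, 2019: 75.05, 2020: 75.41, 2021: 75.75}

for year, n in projected_n.items():
    implied_bits = 2 ** (n + 1) - 2
    print(f"{year}: ~{implied_bits:.2e} bits sent in total")
```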
What do you think? Am I right? Is there a different way to solve this problem? Let me know via email or Twitter.