sean cassidy : What's the length of shortest bit sequence that's never been sent over the Internet?

A friend of mine posed this brain teaser to me recently:

What's the length of shortest bit sequence that's never been sent over the Internet?

We can never know for sure because we don't have a comprehensive list of all the data. But what can we say probabilistically? Restating it like so:

At what value for X is there a 50% chance there's a sequence of X-bits in length that hasn't been transmitted yet?

What does your intuition say? Obviously every 8-bit sequence has been sent since there's only 256 values. By downloading this HTML page over TLS you've probably used up every 8-bit value. Has every 100 byte message been sent?

This is how my intuition went: it's probably less than 128 bits because UUIDs are 128 bits, and they're universally unique. It's probably greater than 48 bits because of how common collisions are at that end for hashes and CRCs, and the Internet has generated a lot of traffic.

How would we determine the right value?

I decided to model data as each bit sent is like flipping a coin. This isn't strictly true, of course, but with encryption becoming more prevalent, it's getting to be close.

So how many flips of a coin does it take to expect to get n heads in a row?

I found this neat little paper deriving the following formula, where $n$ is number of heads in a row, and $E$ is the expected number of flips:

$$ E = 2^{n+1} - 2 $$

We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length $n$, and given that both 0 and 1 are equally likely, the sequence 00110 is equally likely as 11111.

(Of course, different sequences on the Internet are not all equally likely, but we're simplifying to make this calculable.)

We're looking for $n$, however, and not the number of flips. What should the number of flips be set to? We need to estimate the total amount of data ever sent over the Internet. I found a nice table estimating how many petabytes per month are sent for each year.

Adding them up gets you $3.4067 \cdot 10^{22}$ bits, which is in the same rough neighborhood as the number of grains of sand on Earth! Neat.

To solve for $n$:

\begin{equation} \begin{aligned} E &= 2^{n + 1} - 2 \\ 3.4067 \cdot 10^{22} &= 2^{n + 1} - 2 \\ \log_2 (3.4067 \cdot 10^{22} + 2) - 1 &= n \\ n &= 73.85 \end{aligned} \end{equation}

So there's a 50% chance a message of length $73.85$ bits has not been sent yet. This matched my intuition nicely!

Using some forecasting estimates from Cisco, here's how $n$ changes over the next few years:

2017$n = 74.28$
2018$n = 74.67$
2019$n = 75.05$
2020$n = 75.41$
2021$n = 75.75$

What do you think? Am I right? Is there a different way to solve this problem? Let me know via email or Twitter.

Articles

  1. Browser Extension Password Managers Should Not Be Used
  2. How to Implement Crypto Poorly
  3. Genius Blocker
  4. LostPass
  5. Your Own Verifiable Hardware RNG with bladeRF SDR
  6. Code as Risk
  7. Sherlock Holmes Debugging
  8. Plural gTLDs are evil
  9. Your Interface is what Matters
  10. Better Java
  11. When names outlive their usefulness
  12. Diagnosis of the OpenSSL Heartbleed Bug
  13. The Intuition Trap
  14. The Story of the GnuTLS Bug
  15. Wrong Solutions
  16. Host an infodump session
  17. So, you want to crypto
  18. Strings are untyped
  19. Don't Pipe to your Shell
  20. How to Organize Your Brain with Bookmark Tags
  21. You are not a 10x Developer
  22. Windows ruins everything
  23. On Being Nice
  24. Bus Factors and Walk Score
  25. Wiggle the mouse to fix the test
  26. A Difficult Bug
  27. The Origins of the Diluvian Network
  28. Zipf your variable names
  29. On Accepting Interview Question Answers
  30. Rate Limiting per User
  31. Write your own Data Structures