sean cassidy : What's the length of shortest bit sequence that's never been sent over the Internet?

A friend of mine posed this brain teaser to me recently:

What's the length of shortest bit sequence that's never been sent over the Internet?

We can never know for sure because we don't have a comprehensive list of all the data. But what can we say probabilistically? Restating it like so:

At what value for X is there a 50% chance there's a sequence of X-bits in length that hasn't been transmitted yet?

What does your intuition say? Obviously every 8-bit sequence has been sent since there's only 256 values. By downloading this HTML page over TLS you've probably used up every 8-bit value. Has every 100 byte message been sent?

This is how my intuition went: it's probably less than 128 bits because UUIDs are 128 bits, and they're universally unique. It's probably greater than 48 bits because of how common collisions are at that end for hashes and CRCs, and the Internet has generated a lot of traffic.

How would we determine the right value?

I decided to model data as each bit sent is like flipping a coin. This isn't strictly true, of course, but with encryption becoming more prevalent, it's getting to be close.

So how many flips of a coin does it take to expect to get n heads in a row?

I found this neat little paper deriving the following formula, where $n$ is number of heads in a row, and $E$ is the expected number of flips:

$$ E = 2^{n+1} - 2 $$

We're looking for a specific sequence, though, not a specific number of heads in a row. We don't even know what the sequence is since it hasn't been sent yet. Is that a problem? Not at all! We're looking for some sequence of length $n$, and given that both 0 and 1 are equally likely, the sequence 00110 is equally likely as 11111.

(Of course, different sequences on the Internet are not all equally likely, but we're simplifying to make this calculable.)

We're looking for $n$, however, and not the number of flips. What should the number of flips be set to? We need to estimate the total amount of data ever sent over the Internet. I found a nice table estimating how many petabytes per month are sent for each year.

Adding them up gets you $3.4067 \cdot 10^{22}$ bits, which is in the same rough neighborhood as the number of grains of sand on Earth! Neat.

To solve for $n$:

\begin{equation} \begin{aligned} E &= 2^{n + 1} - 2 \\ 3.4067 \cdot 10^{22} &= 2^{n + 1} - 2 \\ \log_2 (3.4067 \cdot 10^{22} + 2) - 1 &= n \\ n &= 73.85 \end{aligned} \end{equation}

So there's a 50% chance a message of length $73.85$ bits has not been sent yet. This matched my intuition nicely!

Using some forecasting estimates from Cisco, here's how $n$ changes over the next few years:

2017$n = 74.28$
2018$n = 74.67$
2019$n = 75.05$
2020$n = 75.41$
2021$n = 75.75$

What do you think? Am I right? Is there a different way to solve this problem? Let me know via email or Twitter.

Articles

  1. Browser Extension Password Managers Should Not Be Used
  2. How to Implement Crypto Poorly
  3. Meditations Redux
  4. Genius Blocker
  5. LostPass
  6. Privilege
  7. Your Own Verifiable Hardware RNG with bladeRF SDR
  8. We, the Weapons
  9. The Practice Startup
  10. Code as Risk
  11. Sherlock Holmes Debugging
  12. Plural gTLDs are evil
  13. Your Interface is what Matters
  14. Write in the Margins
  15. Meditations
  16. Better Java
  17. When names outlive their usefulness
  18. Diagnosis of the OpenSSL Heartbleed Bug
  19. The Intuition Trap
  20. Ambition
  21. The Story of the GnuTLS Bug
  22. Wrong Solutions
  23. Host an infodump session
  24. So, you want to crypto
  25. Hackers and Engineering School
  26. Strings are untyped
  27. Don't Pipe to your Shell
  28. How to Organize Your Brain with Bookmark Tags
  29. You are not a 10x Developer
  30. Windows ruins everything
  31. Don't Give Up and Die
  32. On Being Nice
  33. Bus Factors and Walk Score
  34. Wiggle the mouse to fix the test
  35. A Difficult Bug
  36. The Origins of the Diluvian Network
  37. Zipf your variable names
  38. H.264 and VP8, compared
  39. On Accepting Interview Question Answers
  40. Rate Limiting per User
  41. Write your own Data Structures