A friend of mine used to ask this interview question:

Given a gigantic stream of unsorted data, calculate the median. You cannot store it in memory.

If this question were asking for the mean, life would be easy! Candidates normally struggle for a long time while increasingly obvious hints (obvious to the interviewer, that is) are thrown at them, such as, "What do you need to get the median?" When the candidate says, "The sorted data set," the retort is, "All of it?" Sometimes more requirements are thrown in up front, such as variance or skewness.

And so it goes for maybe 15 minutes, until the interviewer says that he doesn't really care about the exact median; an estimate will do just fine. What tools are good for making estimates of very large things? Sampling, of course. If you randomly sample the data set, you can then do all sorts of statistics on the sample, and it should be fairly accurate. That is the "accepted" answer to the question.
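One concrete way to sample a stream of unknown length in bounded memory is reservoir sampling; the post doesn't name a specific technique, so this is one plausible sketch of the "accepted" answer, with an arbitrary reservoir size:

```python
import random

def reservoir_median(stream, k=1001):
    """Sketch of the sampling answer: keep a uniform random sample of
    fixed size k via reservoir sampling, then report the sample median.
    k = 1001 is an arbitrary choice trading memory for accuracy."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)           # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # keeps each item with prob k/(i+1)
            if j < k:
                reservoir[j] = x
    ordered = sorted(reservoir)
    return ordered[len(ordered) // 2]
```

Only `k` values are ever held in memory, and the same reservoir supports any other sample statistic you care to compute.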

One candidate, however, came up with something rather interesting. What if you tracked an estimate of the median, incrementing it for every new datum larger than the estimate and decrementing it for every datum smaller? It wouldn't be sensitive to outliers, since it uses counts, rather than values, to move the median. Unfortunately, that's where the thought process ended due to the inevitable question, "What about other statistics like the 90th percentile or standard deviation?", but I thought it was a flash of genius.

So, what to do when confronted with a new algorithm? Code it up, of course. I wrote this Python module, which effectively tracks the median to within one one-hundredth of a percent on large data sets. The key insight, for me, was recognizing that the amount to increment or decrement should be variable. In fact, it should be the smallest difference between the sample median and any data point ever seen. This allows it to scale to larger data sets more effectively, and it converges on the median more quickly than a static small increment would.

And that's how it sat for almost a year, until I discovered a rather interesting paper: "The P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations" by Raj Jain and Imrich Chlamtac, which describes a rather sophisticated algorithm for tracking arbitrary quantiles.

The algorithm is this: given a quantile you wish to track (such as 0.9 for the 90th percentile), maintain five data points, or markers. Two are the minimum and maximum. The central one is the best guess of the quantile in question. The other two are used to fit a polynomial, giving a second-degree approximation. The paper shows that second-degree fits work substantially better than linear fits. I wanted to find more papers on this topic, but unfortunately the ACM and IEEE slow down research with paywalls.

This was pretty cool, though. When consulting the Internet, I couldn't find many solutions that did online statistics generation. I decided to generalize my small class, bring in the P2 algorithm, and make it an actual usable project that I call LiveStats. I hope that it's useful. One of the remaining gaps I see is the lack of online mode estimation. I've found a few sources on this, but none that I am completely satisfied with. If you find one, please let me know.

So, while many interview questions are interesting, they don't come with "the one and only" answer. In fact, some useful tools can result from thinking outside the constraints of the expected answer. It's also important to recognize flashes of insight that may not lie down the path you thought the problem should take. Explore them, and some fascinating conversation may result.