Streaming Median

Problem: Compute a reasonable approximation to a “streaming median” of a potentially infinite sequence of integers.

Solution: (in Python)

def streamingMedian(seq):
   seq = iter(seq)
   m = 0

   for nextElt in seq:
      if m > nextElt:
         m -= 1
      elif m < nextElt:
         m += 1

      yield m

DiscussionBefore we discuss the details of the Python implementation above, we should note a few things.

First, because the input sequence is potentially infinite, we can’t store any amount of information that is increasing in the length of the sequence. Even though storing something like $ O(\log(n))$ integers would be reasonable for the real world (note that the log of a petabyte is about 60 bytes), we should not let that stop us from shooting for the ideal $ O(1)$ space bound, and exploring what sorts of solutions arise under that constraint. For the record, I don’t know of any algorithms to compute the true streaming median which require $ O(\log(n))$ space, and I would be very interested to see one.

Second, we should note the motivation for this problem. If the process generating the stream of numbers doesn’t change over time, then one can find a reasonably good approximation to the median of the entire sequence using only a sufficiently large, but finite prefix of the sequence. But if the process does change, then all bets are off. We need an algorithm which can compensate for potentially wild changes in the statistical properties of the sequence. It’s unsurprising that such a naive algorithm would do the trick, because it can’t make any assumptions.

In words, the algorithm works as follows: start with some initial guess for the median $ m$. For each element $ x$ in the sequence, add one to $ m$ if $ m$ is less than $ x$; subtract one if $ m$ is greater than $ x$, and do nothing otherwise.

In the Python above, we make use of generators to represent infinite sequences of data. A generator is an object with some iterable state, which yields some value at each step. The simplest possible non-trivial generator generates the natural numbers $ \mathbb{N}$:

def naturalNumbers():
   n = 1
   while True:
      yield n
      n += 1

What Python does with this function is translate it into a generator object “g” which works as follows: when something calls next(g), the function computes as usual until it reaches a yield statement; then it returns the yielded value, saves all of its internal state, and then returns control to the caller. The next time next(g) is called, this process repeats. A generator can be infinite or finite, and it terminates the iteration process when the function “falls off the end,” returning None as a function which has no return statement would.

We should note as well that Python knows how to handle generators with the usual “for … in …” language form. This makes it extremely handy, because programmers don’t have to care whether they’re using a list or an iterator; the syntax to work with them is identical.

Now the “streamingMedian” function we began with accepts as input a generator (or any iterable object, which it converts to a generator with the “iter” function). It then computes the streaming median of that generator as part of another generator, so that one can call it, e.g., with the code:

for medianSoFar in streamingMedian(naturalNumbers()):
   print medianSoFar

10 thoughts on “Streaming Median

    • Just for your information, there are some convergence issues. On one hand, it converges quite slowly. Because of this, this algorithm can’t cope with a quickly-changing stream (that is to say, the changes can be arbitrarily ‘wild’ but there can’t be too many in rapid succession or this method will give bad results). But for such streams, a median is arguably not particularly useful.

      For an alternative method (which doesn’t incorporate all information in the stream, and admittedly requires more computation than this method), see ProgrammingPraxis’s sliding median.

  1. What about powers of two (seq = 1, 2, 4, 8, 16, 32…)? The median of every prefix is also a power of two, however m (the median approximation) is only be incremented by 1 and will never “catch up”.

    • This is a common issue with streaming algorithms. In this particular example, the total set of numbers involved has an unbounded median. This algorithm does allow the median to change and it will change its output accordingly, but it is unreasonable to expect correctness in this case. The crazier the sequence is, the less sensible this algorithm’s output will be.

      Of course, if we restrict our algorithm to use sub-linear space (this algorithm uses logarithmic space) then we shouldn’t expect much better. In this example we shoot primarily for reasonable approximations in non-pathological cases on data which is potentially large-dimensional. For instance, this algorithm works well for video data.

  2. There’s a median-approximation algorithm that takes log(n) space and can be used in a streaming way: The first layer outputs the median of every three inputs it absorbs. The second layer gives the median of every three numbers from the first layer, and so forth. This would be fun to write recursively. Don’t remember the ref, sorry.

  3. This looks like the paper:

    The guarantee is a probability distribution. Speaking in Python, say the input is random.shuffle(range(N)) where N is odd. Then the true median is always (N-1)/2. They give the function that computes the probability that the algorithm will output i (0 <= i < N). Say error(i, N) = abs(i – (N-1)/2), then they give tables with the average error and the standard deviation of the error. (I don't understand why people would think of average abs error as the default rather than RMS error).

Leave a Reply