Finding the majority element of a stream

Problem: Given a massive data stream of $ n$ values in $ \{ 1, 2, \dots, m \}$ and the guarantee that one value occurs more than $ n/2$ times in the stream, determine exactly which value does so.

Solution: (in Python)

def majority(stream):
   # accept any iterable, not just an iterator
   stream = iter(stream)
   held = next(stream)   # current candidate for the majority element
   counter = 1           # number of unpaired copies of `held` seen so far

   for item in stream:
      if item == held:
         counter += 1
      elif counter == 0:
         # everything seen so far is paired off; start tracking a new candidate
         held = item
         counter = 1
      else:
         # pair this item with one copy of `held` and "kill" the pair
         counter -= 1

   return held

Discussion: Let’s prove correctness. Say that $ s$ is the unknown value that occurs more than $ n/2$ times. The idea of the algorithm is that if you could pair up elements of your stream so that distinct values are paired up, and then you “kill” these pairs, then $ s$ will always survive. The way this algorithm pairs up the values is by holding onto the most recent value that has no pair (implicitly, by keeping a count of how many unpaired copies of that value you have seen). Then when you come across an element different from the one you are holding, you decrement the counter and implicitly form, and kill, one new pair.
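For a concrete feel, here is a toy run (the particular stream below is just an illustrative example of mine):

# 1 occurs 5 times out of 8, so it is guaranteed to survive the pairing
print(majority([1, 2, 1, 3, 1, 1, 2, 1]))   # prints 1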

Let’s analyze the complexity of the algorithm. Clearly the algorithm only uses a single pass through the data. Next, if the stream has size $ n$, then this algorithm uses $ O(\log(n) + \log(m))$ space. Indeed, if the stream entirely consists of a single value (say, a stream of all 1’s) then the counter will be $ n$ at the end, which takes $ \log(n)$ bits to store. On the other hand, if there are $ m$ possible values then storing the largest requires $ \log(m)$ bits.

Finally, the guarantee that one value occurs more than $ n/2$ times is necessary. If it does not hold, the algorithm could output anything (including the most infrequent element!). Moreover, without this guarantee, every algorithm that solves the problem exactly must use at least $ \Omega(n)$ space in the worst case. In particular, say that $ m=n$, the first $ n/2$ items are all distinct, and the last $ n/2$ items are all equal to the majority value $ s$. If you do not know $ s$ in advance, then you must remember which symbols occurred in the first half of the stream, at least one bit per symbol, because any of them could turn out to be $ s$; that alone costs $ \Omega(n)$ bits. So the guarantee allows us to bypass that barrier.

This algorithm can be generalized to detect the (at most $ k$) items with frequency above the threshold $ n/(k+1)$ using space $ O(k(\log n + \log m))$. The idea is to keep $ k$ counters instead of one, starting to track a new element whenever one of the counters is zero. When you see an element not tracked by any of your $ k$ counters (which are all positive), you decrement all the counters by 1. This implicitly kills a group of $ k+1$ distinct values at once, so it is like a $ (k+1)$-way matching rather than a pairing. A sketch of this generalization is given below.
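Here is a minimal sketch of that generalization (often called the Misra–Gries summary); the function name and the dictionary-based implementation are my own choices, and a second pass over the stream is needed to verify which candidates actually exceed the threshold.

def frequentCandidates(stream, k):
   # at most k candidate values, each with a counter of unpaired copies
   counters = {}

   for item in stream:
      if item in counters:
         counters[item] += 1
      elif len(counters) < k:
         counters[item] = 1
      else:
         # kill a group of k+1 distinct values: this item plus one copy
         # of each tracked candidate
         for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
               del counters[key]

   # every value occurring more than n/(k+1) times is guaranteed to be a key here
   return counters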

Reservoir Sampling

Problem: Given a data stream of unknown size $ n$, pick an entry uniformly at random. That is, each entry has a $ 1/n$ chance of being chosen.

Solution: (in Python)

import random

def reservoirSample(stream):
   chosen = None   # holds the current sample (stays None only if the stream is empty)

   for k, x in enumerate(stream, start=1):
      # replace the held element with the k-th element with probability 1/k
      if random.random() < 1.0 / k:
         chosen = x

   return chosen

Discussion: This is one of many techniques used to solve a problem called reservoir sampling. We often encounter data sets that we’d like to sample elements from at random. But with the advent of big data, the lists involved are so large that we either can’t fit them into memory all at once, or we don’t even know how big they are because the data arrives in the form of a stream (e.g., the stream of atomic price movements in the stock market over a year). Reservoir sampling is the problem of sampling from such streams, and the technique above is one way to achieve it.

In words, the above algorithm holds one element from the stream at a time, and when it inspects the $ k$-th element (indexing $ k$ from 1), it flips a coin of bias $ 1/k$ to decide whether to keep its currently held element or to drop it in favor of the new one.

We can prove quite easily by induction that this works. Indeed, let $ n$ be the (unknown) size of the list, and suppose $ n=1$. In this case there is only one element to choose from, and so the probability of picking it is 1. The case of $ n=2$ is similar, and more illustrative. Now suppose the algorithm works for lists of size $ n$, and suppose we increase the size of the list by 1 by adding some new element $ y$ to the end. For any given $ x$ among the first $ n$ elements, the probability we’re holding $ x$ when we inspect $ y$ is $ 1/n$ by induction. Now we flip a coin which lands heads with probability $ 1/(n+1)$; if it lands heads we take $ y$, and otherwise we keep $ x$. The probability we get $ y$ is exactly $ 1/(n+1)$, as desired, and the probability we keep $ x$ is $ \frac{1}{n} \cdot \frac{n}{n+1} = \frac{1}{n+1}$. Since $ x$ was arbitrary, this means that after the last step of the algorithm each entry is held with probability $ 1/(n+1)$.

$ \square$
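If you want empirical reassurance on top of the proof, a quick simulation does the job (the stream and the trial count below are arbitrary choices for illustration):

from collections import Counter

# sample from a 5-element stream 100,000 times; each value should show up
# roughly 20% of the time
counts = Counter(reservoirSample(range(5)) for _ in range(100000))
print(counts)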

It’s easy to see how one could increase the number of coins being flipped to get a sampling algorithm that picks any finite number of elements (with replacement, although a variant without replacement is not so hard to construct using this method); a sketch follows below. Other variants exist, such as distributed and weighted sampling.
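Here is a minimal sketch of the with-replacement variant just described, running $ k$ independent copies of the coin flip in a single pass (the function name is my own):

import random

def reservoirSampleWithReplacement(stream, k):
   # each of the k slots independently holds a uniform sample of the stream,
   # so together they form a sample of size k drawn with replacement
   chosen = [None] * k

   for n, x in enumerate(stream, start=1):
      for i in range(k):
         if random.random() < 1.0 / n:
            chosen[i] = x

   return chosen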

Python’s generators make this algorithm for reservoir sampling particularly nice. One can define a generator which abstractly represents a data stream (perhaps querying the entries from files distributed across many different disks), and this logic is hidden from the reservoir sampling algorithm. Indeed, this algorithm works for any iterable, although if we knew the size of the list we could sample much faster (by uniformly generating a random number and indexing the list appropriately). The start parameter given to the enumerate function makes the $ k$ variable start at 1.
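For example, one can feed the sampler a generator expression over a file, so that only one line is ever in memory at a time (the filename here is a placeholder):

# pick a uniformly random line from a potentially enormous file whose
# length we never compute; "data.txt" is a hypothetical filename
with open("data.txt") as f:
   print(reservoirSample(line.strip() for line in f))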

Streaming Median

Problem: Compute a reasonable approximation to a “streaming median” of a potentially infinite sequence of integers.

Solution: (in Python)

def streamingMedian(seq):
   seq = iter(seq)
   m = 0   # the running estimate of the median

   for nextElt in seq:
      # nudge the estimate one step toward the newest element
      if m > nextElt:
         m -= 1
      elif m < nextElt:
         m += 1

      yield m

Discussion: Before we discuss the details of the Python implementation above, we should note a few things.

First, because the input sequence is potentially infinite, we can’t store any amount of information that grows with the length of the sequence. Even though storing something like $ O(\log(n))$ integers would be reasonable for the real world (note that the base-2 log of a petabyte is only about 50), we should not let that stop us from shooting for the ideal $ O(1)$ space bound and exploring what sorts of solutions arise under that constraint. For the record, I don’t know of any algorithm that computes the true streaming median in $ O(\log(n))$ space, and I would be very interested to see one.

Second, we should note the motivation for this problem. If the process generating the stream of numbers doesn’t change over time, then one can find a reasonably good approximation to the median of the entire sequence using only a sufficiently large, but finite, prefix of the sequence. But if the process does change, then all bets are off. We need an algorithm which can compensate for potentially wild changes in the statistical properties of the sequence. It’s not so surprising that such a naive algorithm does the trick, precisely because it makes no assumptions about those properties.

In words, the algorithm works as follows: start with some initial guess for the median $ m$. For each element $ x$ in the sequence, add one to $ m$ if $ m$ is less than $ x$; subtract one if $ m$ is greater than $ x$, and do nothing otherwise.

In the Python above, we make use of generators to represent infinite sequences of data. A generator is an object with some iterable state, which yields some value at each step. The simplest possible non-trivial generator generates the natural numbers $ \mathbb{N}$:

def naturalNumbers():
   n = 1
   while True:
      yield n
      n += 1

What Python does with this function is translate it into a generator object “g” which works as follows: when something calls next(g), the function computes as usual until it reaches a yield statement; then it returns the yielded value, saves all of its internal state, and returns control to the caller. The next time next(g) is called, this process repeats. A generator can be infinite or finite, and it terminates the iteration process when the function “falls off the end” (as a function with no return statement would), at which point the generator raises StopIteration to signal that it is exhausted.
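For instance, one possible interaction with the naturalNumbers generator above:

g = naturalNumbers()
print(next(g))   # 1
print(next(g))   # 2; the function resumed right after its last yield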

We should note as well that Python knows how to handle generators with the usual “for … in …” language form. This makes generators extremely handy, because programmers don’t have to care whether they’re working with a list or an iterator; the syntax for either is identical.

Now the “streamingMedian” function we began with accepts as input a generator (or any iterable object, which it converts to an iterator with the “iter” function). It then yields the running median as a generator itself, so that one can consume it, e.g., with the code:

for medianSoFar in streamingMedian(naturalNumbers()):
   print(medianSoFar)