Finding the majority element of a stream

Problem: Given a massive data stream of n values in \{ 1, 2, \dots, m \} and the guarantee that one value occurs more than n/2 times in the stream, determine exactly which value does so.

Solution: (in Python)

def majority(stream):
   held = next(stream)
   counter = 1

   for item in stream:
      if item == held:
         counter += 1
      elif counter == 0:
         held = item
         counter = 1
      else:
         counter -= 1

   return held

Discussion: Let’s prove correctness. Say that s is the unknown value that occurs more than n/2 times. The idea of the algorithm is that if you could pair up elements of your stream so that distinct values are paired up, and then you “kill” these pairs, then s will always survive. The way this algorithm pairs up the values is by holding onto the most recent value that has no pair (implicitly, by keeping a count how many copies of that value you saw). Then when you come across a new element, you decrement the counter and implicitly account for one new pair.

Let’s analyze the complexity of the algorithm. Clearly the algorithm only uses a single pass through the data. Next, if the stream has size n, then this algorithm uses O(\log(n) + \log(m)) space. Indeed, if the stream entirely consists of a single value (say, a stream of all 1’s) then the counter will be n at the end, which takes \log(n) bits to store. On the other hand, if there are m possible values then storing the largest requires \log(m) bits.

Finally, the guarantee that one value occurs more than n/2 times is necessary. If it is not the case the algorithm could output anything (including the most infrequent element!). And moreover, if we don’t have this guarantee then every algorithm that solves the problem must use at least \Omega(n) space in the worst case. In particular, say that m=n, and the first n/2 items are all distinct and the last n/2 items are all the same one, the majority value s. If you do not know s in advance, then you must keep at least one bit of information to know which symbols occurred in the first half of the stream because any of them could be s. So the guarantee allows us to bypass that barrier.

This algorithm can be generalized to detect k items with frequency above some threshold n/(k+1) using space O(k \log n). The idea is to keep k counters instead of one, adding new elements when any counter is zero. When you see an element not being tracked by your k counters (which are all positive), you decrement all the counters by 1. This is like a k-to-one matching rather than a pairing.

Advertisements

2 thoughts on “Finding the majority element of a stream

  1. Frequent items has a very nice generalization to the case where the observations are real vectors, and you want to find the most frequent “directions”, which is just another way to say you want the top singular vectors of a streaming matrix.
    “Simple and deterministic matrix sketching”, Edo Liberty 2013

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s