Given a stream of numbers, how do we find the median of the last seen N numbers in an efficient way? Let M be the number of numbers we’ve seen up to this point.

Exact approaches

Naive sorting in o(nlogn)

One simple way to solve this problem is to sort the last seen N numbers and then return the middle number as the median. This takes o(nlogn) time and o(n) space.

Dual heap in o(1) time and o(m) space

A more efficient solution is to simply keep track of two heaps: a min heap that represents one half of the seen numbers, and a max heap the other half. If an item is smaller than the max heap, then add it to the max heap. If an item is greater than the min heap, then add it to the min heap. Then balance the heaps. If one heap has two more elements than the other, remove its root element and add it to the other. These heap operations take o(1) time in the average case, but can take o(logn) in the worst case.

The max heap represents all numbers less than the median. The min heap represents all numbers greater than the median.

Then we can calculate the median as simply the root of the heap with more elements, or the mean of the two roots if the heaps are of equal size.

The problem with this solution is that it’s space intensive since we have to keep track of all elements seen. So it takes o(1) average time but o(m) space.

Quickselect in o(n) time and o(n) space

Quickselect is an average o(n) time algorithm to find the kth smallest element in an unsorted array. The idea is similar to quicksort – we choose a pivot and partition the array into subarrays greater and less than the pivot. Then we recurse into the only possible side that contains our kth smallest element.

In the worst case, this algorithm performs in o(n^2) time when we have bad pivots. For example, when we have an input array sorted from least to greatest and are searching for the greatest element, and we pick the first element as pivot. We’d go from [1,2,3..n] to [2,3..n] to [3..n], … [n]. Sum from 1 to n is n^2/2 which is o(n^2).

For a function select(left, right, k), the recursive call is

def select(lst, left, right, k):
    if left == right:
        return lst[left]

    pivot = partition(lst, left, right)

    if k == pivot:
        return lst[k]

    if k < pivot:
        select(lst, left, pivot - 1, k)
    else:
        select(lst, pivot + 1, right, k)

Finding the median

We can adapt this algorithm to simply find the n/2th element in the array. This gives us an o(n) time and o(n) space algorithm.

Notes

  • Introselect is a hybrid algorithm that guarantee o(n) average and o(nlogn) worst case by using a more costly way to pick pivots when quickselect fails to make sufficient progress. For example, if k calls have been made without halving the list size (where k is some small integer), then switch to the alternative pivot algorithm. We could also use the linear median of medians algorithm (Blum-Floyd-Pratt-Rivest-Tarjan) to achieve o(n) worst case – however in most cases this isn’t desirable since the median of medians approach takes too long to compute the medians.

Counting sort in o(n) time and o(k) space

Counting sort is a linear time o(n+k) sorting algorithm for hashable elements in some range. The basic idea is that we create a frequency counter of the elements in the array, and then iterate over the keys in the frequency counter in sorted order to constructed the sorted array. If the input range is unknown or extremely large, then counting sort may not be a good idea, as we have to construct an O(k) array.

Simple example

Suppose you have an unsorted array A of numbers: [3,5,2,5,1].

We create an array COUNTS of length 5 of all zeros: [0,0,0,0,0]. We’d like for the first element of COUNTS to represent the number of 1s in A. We’d also like for the second element of COUNTS to represent the number of 2s in A, and so on.

To achieve this, we loop through A, and for each element in A, we increment its frequency in COUNTS:

  • The first element in A is 3. So we increment the 3rd element of COUNTS and get [0,0,1,0,0].
  • The second element is 5. So we increment the 5th element and get [0,0,1,0,1].
  • The third element is 2. So we increment the 2nd element and get [0,1,1,0,1].
  • The fourth element is 5. So we increment the 5th element again and get [0,1,1,0,2].
  • The fifth element is 1. So we increment the 1st element and get [1,1,1,0,2].

Now we simply iterate through the COUNTS array and append to an output array the number represented by each index in COUNTS. First we append a single 1. Then we append a single 2. Then we append a single 3. And a single 4.

Finding the median

By iterating halfway through the COUNTS array generated by counting sort, we can find the median in o(n) time while taking o(k) space.

Notes

  • Counting sort is not a comparison-based algorithm, so the nlogn lower bound for comparison based sorting algorithms does not apply.

Skip Lists

TBD

Probabilistic Approach

TBD

References

Linear Time Median Finding by Russell Cohen