July 25th, 2024

My Favorite Algorithm: Linear Time Median Finding

The article explains the median-of-medians algorithm for finding the median in linear time, contrasting it with traditional sorting and quickselect methods, and discusses practical applications and performance considerations.

The article discusses linear time median finding, specifically the median-of-medians approach, which determines the median of a list in deterministic linear time. The author begins by explaining the traditional method of finding the median by sorting, which takes O(n log n) time. The quickselect algorithm, developed by Tony Hoare, is then introduced as a more efficient, average-case O(n) method that uses a pivot to partition the list. The process recursively narrows the search to the partition that contains the median until it is found.

However, quickselect can degrade to O(n^2) in the worst case if poor pivots are consistently chosen. To address this, the median-of-medians algorithm is proposed, which guarantees linear time performance by selecting a good pivot. This method involves dividing the list into groups of five, finding the median of each group, and then recursively determining the median of these medians to use as a pivot.
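
A minimal sketch of the two ideas summarized above (quickselect with a pluggable pivot, plus a median-of-medians pivot function); this is illustrative only and not the article's exact code:

    import random

    def quickselect(l, k, pivot_fn=random.choice):
        # Return the k-th smallest element of l (0-indexed).
        if len(l) == 1:
            return l[0]
        pivot = pivot_fn(l)
        lows = [x for x in l if x < pivot]
        highs = [x for x in l if x > pivot]
        pivots = [x for x in l if x == pivot]
        if k < len(lows):
            return quickselect(lows, k, pivot_fn)
        elif k < len(lows) + len(pivots):
            return pivot  # the k-th smallest equals the pivot
        else:
            return quickselect(highs, k - len(lows) - len(pivots), pivot_fn)

    def median_of_medians_pivot(l):
        # Median of the medians of groups of 5, used as a guaranteed-good pivot.
        chunks = [l[i:i + 5] for i in range(0, len(l), 5)]
        medians = [sorted(c)[len(c) // 2] for c in chunks]
        if len(medians) == 1:
            return medians[0]
        return quickselect(medians, len(medians) // 2, median_of_medians_pivot)

Calling quickselect(l, len(l) // 2, median_of_medians_pivot) then gives the deterministic linear-time behavior the article describes.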

The article concludes by noting that while the median-of-medians approach is theoretically sound, in practice, random pivot selection often suffices for efficient performance. The author also mentions the C++ standard library's introselect algorithm, which combines quickselect and heapselect for improved performance. Overall, the article provides a comprehensive overview of median finding algorithms, emphasizing the balance between theoretical efficiency and practical application.

AI: What people are saying
The comments on the article about the median-of-medians algorithm reveal various insights and discussions among readers.
  • Several commenters share their experiences with different median algorithms, noting the complexity and performance of the median-of-medians compared to other methods.
  • Practical implementations and variations of the algorithm are discussed, including references to code repositories and alternative approaches like biased pivots and streaming algorithms.
  • Some users express confusion about specific algorithmic details, such as the choice of group sizes and the handling of non-full chunks.
  • There is a philosophical discussion about the significance of proving linear time selection algorithms and their implications in computer science.
  • Critiques of code quality and implementation choices in the article are raised, particularly regarding Python's handling of integer division.
33 comments
By @danlark - 6 months
Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

By @rented_mule - 6 months
10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.

It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.

As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.

The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
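
A sketch of the histogram idea described above; the 999ms cap and the timing units mirror the comment, while the names and the even-count behavior are just illustrative:

    from collections import Counter

    def capped_histogram_median(timings_ms, cap=999):
        # One pass: count occurrences, lumping everything >= cap into the top bin.
        # The per-machine Counters are trivially mergeable in a MapReduce setting.
        hist = Counter(min(t, cap) for t in timings_ms)
        total = sum(hist.values())
        target = total // 2  # index of the (upper) median, 0-indexed
        seen = 0
        for value in range(cap + 1):
            seen += hist.get(value, 0)
            if seen > target:
                return value
        return cap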

By @xinok - 6 months
> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper’s author, Andrei Alexandrescu for bringing it to my attention!

He also gave a talk about his algorithm in 2016. He's an entertaining presenter; I highly recommend it!

There's Treasure Everywhere - Andrei Alexandrescu

https://www.youtube.com/watch?v=fd1_Miy1Clg

By @mabbo - 6 months
I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. Its runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.

I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might be an even faster linear algorithm." Really got into the philosophy of what Computer Science is about in the first place.

The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.

By @kwantam - 6 months
One of the fun things about the median-of-medians algorithm is its completely star-studded author list.

Manuel Blum - Turing award winner in 1995

Robert Floyd - Turing award winner in 1978

Ron Rivest - Turing award winner in 2002

Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)

Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).

Four independent Turing awards! SPARCstations! This paper has it all.

By @someplaceguy - 6 months

    return l[len(l) / 2]
I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?

I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.

Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
                           quickselect(l, len(l) / 2, pivot_fn))
That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.
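
For what it's worth, a corrected sketch of that function with floor division in both branches (assuming the `quickselect(l, k, pivot_fn)` quoted from the article):

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            # // keeps both indices as ints
            return 0.5 * (quickselect(l, len(l) // 2 - 1, pivot_fn) +
                          quickselect(l, len(l) // 2, pivot_fn))
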
By @TacticalCoder - 6 months
I really enjoyed TFA but this:

> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you’d actually have O(n²) performance instead of O(n)

If adversarial input is a concern, doing an O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.

If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can happen, I'll answer that then "technically" an attacker could also guess correctly every bit of your 256 bits private key.

Our world is built on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.

From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.
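
A minimal sketch of that idea, reusing the article's `quickselect_median(l, pivot_fn)` (an assumption about its signature) with a deliberately cheap deterministic pivot:

    import random

    def median_after_shuffle(l):
        l = list(l)
        random.shuffle(l)  # O(n); makes adversarial input orderings irrelevant
        return quickselect_median(l, pivot_fn=lambda xs: xs[0])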

By @furstenheim - 6 months
Floyd-Rivest also does the job. A bit more efficient IIRC.

However I never managed to understand how it works.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...

By @throwaway294531 - 6 months
If you're selecting the n:th element, where n is very small (or large), using median-of-medians may not be the best choice.

Instead, you can use a biased pivot as in [1] or something I call "j:th of k:th". Floyd-Rivest can also speed things up. I have a hobby project that gets 1.2-2.0x throughput when compared to a well implemented quickselect, see: https://github.com/koskinev/turboselect

If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.

[1] https://doi.org/10.4230/LIPIcs.SEA.2017.24

By @mgaunard - 6 months
You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.
By @anonymoushn - 6 months
One commonly sees the implication that radix sort cannot be used for data types other than integers, or for composite data types, or for large data types. For example, TFA says you could use radix sort if your input is 32-bit integers. But you can use it on anything. You can use radix sort to sort strings in O(n) time.
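
A rough sketch of the idea for strings (MSD radix / bucket sort, illustrative only); the total work is proportional to the number of characters examined rather than n log n comparisons:

    def radix_sort_strings(strings, pos=0):
        if len(strings) <= 1:
            return list(strings)
        done = [s for s in strings if len(s) == pos]  # exhausted at this position
        buckets = {}
        for s in strings:
            if len(s) > pos:
                buckets.setdefault(s[pos], []).append(s)
        out = done
        for ch in sorted(buckets):  # bounded alphabet, so constant work per level
            out.extend(radix_sort_strings(buckets[ch], pos + 1))
        return out

    # radix_sort_strings(["banana", "band", "apple", "ape"])
    # -> ['ape', 'apple', 'banana', 'band']
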
By @ncruces - 6 months
An implementation in Go, that's (hopefully) simple enough to be understandable, yet minimally practical:

https://github.com/ncruces/sort/blob/main/quick/quick.go

By @Xcelerate - 6 months
I received a variant of this problem as an interview question a few months ago. Except the linear time approach would not have worked here, since the list contains trillions of numbers, you only have sequential read access, and the list cannot be loaded into memory. 30 minutes — go.

First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an odd number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it’s fairly quick.
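
A sketch of that kind of value-range bisection (the `read_stream` callable that re-reads the data on each pass is an assumption about the setup, not something from the problem statement):

    def kth_smallest_by_bisection(read_stream, k, lo=-2**31, hi=2**31 - 1):
        # Find the k-th smallest (0-indexed) 32-bit integer using only
        # sequential passes: binary search on the value range, counting
        # how many elements are <= mid on each pass. At most ~32 passes.
        while lo < hi:
            mid = (lo + hi) // 2
            count = sum(1 for x in read_stream() if x <= mid)  # one full pass
            if count >= k + 1:
                hi = mid        # answer is <= mid
            else:
                lo = mid + 1    # answer is > mid
        return lo

    # For an even count n, the median averages the results for
    # k = n//2 - 1 and k = n//2.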

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the “median file integer” question I received is one of the most asked questions on the list of “hard” problems.

By @jagged-chisel - 6 months
It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.
By @someplaceguy - 6 months
I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)
Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

By @ignoramous - 6 months
If an approximation is enough, the p2 quantile estimator (O(1) memory) is pretty neat: https://news.ycombinator.com/item?id=25201093
By @saagarjha - 6 months
This is hinted at in the post but if you're using C++ you will typically have access to quickselect via std::nth_element. I've replaced many a sort with that in code review :) (Well, not many. But at least a handful.)
By @chpatrick - 6 months
Another nice one is O(1) weighted sampling (after O(n) preprocessing).

https://en.wikipedia.org/wiki/Alias_method
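
A sketch of the alias method (Vose's variant), with O(n) setup and O(1) per sample; the names are illustrative:

    import random

    def build_alias_table(weights):
        n = len(weights)
        total = sum(weights)
        prob = [w * n / total for w in weights]  # rescaled so the mean is 1
        alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1.0]
        large = [i for i, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                 # s's leftover probability is covered by l
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        for i in small + large:          # leftovers are 1.0 up to rounding
            prob[i] = 1.0
        return prob, alias

    def alias_sample(prob, alias):
        i = random.randrange(len(prob))  # pick a column uniformly
        return i if random.random() < prob[i] else alias[i]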

By @melonmouse - 6 months
The linked proof that median of medians is O(n) feels counterintuitive to me. Here's a (simpler?) alternative.

  T(0) = 0
  T(1) = 1
  T(n) = n + T(n/5) + T(7/10*n)
We want to prove that:

  T(n) ≤ C*n
It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:

Induction base: it holds for all a+b < 1, the only case being a=0, b=0:

  T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)
Induction step: suppose it holds for all a+b < k. Let a+b = k.

  T(a+b) = T(k)
         = k + T(k/5) + T(7/10*k)
         ≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
         = [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
         = T(a) + T(b)
Because T is superadditive:

  T(n) = n + T(n/5) + T(7/10*n)
       ≤ n + T(n/5 + 7/10*n)
       = n + T(9/10*n)
Now we can apply the master theorem. Or to write out the proof (using a geometric series):

  T(n) ≤ n + T(9/10*n)
       ≤ n * ∑ᵢ₌₀^∞ (9/10)^i
       = n * 1/(1-9/10)
       = 10*n
So, we have shown the algorithm is O(n) with C=10 (or less).
By @hammeiam - 6 months
The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me
By @Someone - 6 months
FTA:

“Proof of Average O(n)

On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”

That “therefore” doesn’t follow, so this is more an intuition than a proof. The problem with it is that the median is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.

What saves you is that O(n) doesn’t say anything about constants.

Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn’t matter for this algorithm). Is that true?

By @sfpotter - 6 months
A nice way to approximate the median: https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf
By @RcouF1uZ4gsC - 6 months
> The C++ standard library uses an algorithm called introselect which utilizes a combination of heapselect and quickselect and has an O(n log n) bound.

Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.

By @Tarean - 6 months
Love this algorithm. It feels like magic, and then it feels obvious and basically like binary search.

Similar to the algorithm to parallelize the merge step of merge sort. Split the two sorted sequences into four sequences so that `merge(left[0:leftSplit], right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])` is sorted. leftSplit+rightSplit should be half the total length, and the elements in the left partition must be <= the elements in the right partition.

Seems impossible, and then you think about it and it's just binary search.
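
A sketch of how that binary search can look (illustrative names, not from the comment):

    def merge_split(left, right):
        # Find (leftSplit, rightSplit) so that every element of
        # left[:leftSplit] + right[:rightSplit] is <= every element of
        # left[leftSplit:] + right[rightSplit:], and the first half holds
        # half (rounded down) of the total length. Binary search on leftSplit.
        k = (len(left) + len(right)) // 2
        lo, hi = max(0, k - len(right)), min(k, len(left))
        while lo < hi:
            i = (lo + hi) // 2  # candidate leftSplit
            j = k - i           # corresponding rightSplit
            if left[i] < right[j - 1]:
                lo = i + 1      # too few elements taken from left
            else:
                hi = i
        return lo, k - lo

    # merge(left[:i], right[:j]) and merge(left[i:], right[j:]) can now
    # be produced independently, e.g.:
    # i, j = merge_split([1, 3, 5, 7], [2, 4, 6, 8])  # -> (2, 2)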

By @teo_zero - 6 months
> On average, the pivot will split the list into 2 approximately equal-sized pieces.

Where does this come from?

Even assuming a perfect random function, this would be true only for distributions that show some symmetry. But if the input is all 10s and one 5, each step will generate quite different-sized pieces!

By @runiq - 6 months
Why is it okay to drop not-full chunks? The article doesn't explain that and I'm stupid.

Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.

By @ValleZ - 6 months
I was asked to invent this algorithm on a whiteboard in 30 minutes. Loved it.
By @beyondCritics - 6 months
> It’s not straightforward to prove why this is O(n).

Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
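
Sketched out, the induction step (assuming T(m) <= 10m for all m < n) is just:

    T(n) = n + T(floor(n/5)) + T(floor(7n/10))
        <= n + 10*floor(n/5) + 10*floor(7n/10)
        <= n + 2n + 7n
         = 10n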

By @kccqzy - 6 months
> Quickselect gets us linear performance, but only in the average case. What if we aren’t happy to be average, but instead want to guarantee that our algorithm is linear time, no matter what?

I don't agree with the need for this guarantee. Note that the article already says the pivot is selected at random. You can simply choose a good random function to prevent an attacker from crafting an input that needs quadratic time. You'll never be unlucky enough for this to be a problem. This is basically the same kind of mindset that leads people to worry: what if these two different strings I hash with SHA256 come out with the same hash?

By @zelphirkalt - 6 months
You can simply pass once over the data, and while you do that, count occurrences of the elements, memorizing the last maximum. Whenever an element is counted, you check, if that count is now higher than the previous maximum. If it is, you memorize the element and its count as the maximum, of course. Very simple approach and linear in time, with minimal book keeping on the way (only the median element and the count (previous max)).

I don't find it surprising or special at all that finding the median works in linear time, since even this ad-hoc approach is linear time.

EDIT: Ah right, I mixed up mode and median. My bad.

By @vismit2000 - 6 months
This is covered in section 9.3 of the CLRS book: Medians and Order Statistics.
By @SkiFire13 - 6 months
I wonder what's the reason of picking groups of 5 elements instead of 2 or 8.
By @nilslindemann - 6 months
"ns" instead of "l" and "n" instead of "el" would have been my choice (seen in Haskell code).