## Problem

Input: array $A[1..n]$, $k \in \mathbb{N}$.

Output: the $k$th smallest number of $A$.

Consider $k=1$, which is just finding the minimum. This can obviously be done in $O(n)$, or $O(n \lg n)$ if we sort first. Finding $k=2$ is simple too: first find the smallest element, remove it from $A$, then rerun to find the new smallest.

*This doesn’t scale though*, especially for larger values of $k$. Since $k \in O(n)$, we may have to iterate through $A$ up to $k$ times, so the runtime is $O(n \cdot k)$. Imagine we want to find the median, $m = \frac{n}{2}$. Then this algorithm runs in $\Theta(n^2)$, and we could have instead just sorted the array to find the $k$th element.
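To make the $O(n \cdot k)$ cost concrete, here is a minimal sketch of the repeated-minimum approach (function name is illustrative, $k$ is 1-indexed):

```python
def kth_smallest_naive(A, k):
    """Find the kth smallest element (1-indexed) by repeatedly
    extracting the minimum -- O(n) per pass, O(n * k) overall."""
    remaining = list(A)  # copy so we don't mutate the caller's array
    for _ in range(k - 1):
        remaining.remove(min(remaining))  # each pass scans all of remaining
    return min(remaining)
```

Each of the $k$ passes scans the whole remaining array, which is exactly where the $O(n \cdot k)$ bound comes from.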

That would take $O(n \lg n)$, but we can actually find the solution in *linear time*.

## Procedure

The algorithm works as follows: at each step, we (somehow) pick an element as the pivot and use it to partition $A$, just as in quicksort. Then we check which position the pivot ended up in.

- If $k$ is less than the position, we know the $k$th smallest element has to be in the subarray to the left of the pivot’s position.
- If $k$ is greater than the position, then the element is in the subarray on the right.
- If $k$ equals the pivot position, then we’ve found the element. The pivot *is* the answer.

Naturally, this leads to a divide and conquer algorithm.
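The three cases translate directly into a recursive procedure. In this sketch the "somehow pick" step is a random choice, standing in for the pivot-selection rule discussed below (names are illustrative, $k$ is 1-indexed):

```python
import random

def quickselect(A, k):
    """Return the kth smallest element of A (1-indexed)."""
    pivot = random.choice(A)
    L = [x for x in A if x < pivot]  # everything left of the pivot
    R = [x for x in A if x > pivot]  # everything right of the pivot
    # copies of the pivot occupy positions len(L)+1 .. len(A)-len(R)
    if k <= len(L):
        return quickselect(L, k)                       # answer lies in L
    elif k > len(A) - len(R):
        return quickselect(R, k - (len(A) - len(R)))   # shift k into R
    else:
        return pivot                                   # pivot is the answer
```

Note that when we recurse into $R$, $k$ is shifted down by the number of elements we discarded on the left.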

## Picking the best pivot to partition around

Pick an element as the pivot, scan through the array, and partition $A$ such that everything in the left part $L$ is smaller than the pivot, and everything in the right part $R$ is greater than the pivot.
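The partition scan can be done in place with a single pass, as in quicksort. This is a Lomuto-style sketch (function name and index conventions are my own):

```python
def partition(A, lo, hi, pivot_index):
    """Partition A[lo..hi] in place around A[pivot_index].
    Returns the pivot's final position: everything before it is
    smaller, everything after it is greater or equal."""
    A[pivot_index], A[hi] = A[hi], A[pivot_index]  # park the pivot at the end
    pivot = A[hi]
    store = lo
    for i in range(lo, hi):
        if A[i] < pivot:
            A[store], A[i] = A[i], A[store]  # move smaller element left
            store += 1
    A[hi], A[store] = A[store], A[hi]        # swap the pivot into place
    return store
```

The returned position is exactly what the three cases above compare $k$ against.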

This choice determines how many elements we recurse into, and therefore the overall runtime of the algorithm. The fewer elements we recurse into at each subproblem, the faster the algorithm.

To divide and partition the array, we scan across the whole array, so that takes $O(n)$. Then, the three cases determine the recursive runtime:

- We’ve found the $k$th element, no other steps are needed. This is $O(1)$.
- Recursing on $L$ means next step takes $T(∣L∣)$.
- Recursing on $R$ means next step takes $T(∣R∣)$.

How big $L$ and $R$ are depends on what we choose as the pivot.

### Worst case pivot

If the pivot was the min or max element in $A$, we’re essentially performing $O(n)$ work for a single unit of progress (one element has been “removed” from $A$). So avoid that.

This is $T(n)=T(n-1)+O(n)$, which is the same as implementing insertion sort using D&C. The recurrence expands to $T(n)=\sum_{h=0}^{n-1} c_1(n-h)$. Here we are summing consecutive numbers, which gives us an overall runtime of $O(n^2)$.

### Picking the median as the pivot

Consider if the pivot was the $\frac{n}{2}$th element of $A$, aka the median. Then the recurrence relation is given by $T(n)=T\left(\frac{n}{2}\right)+O(n)$. Expanding gives us

$$T(n)=\sum_{h=0}^{\lg n} \frac{c_1 n}{2^h} = c_1 n \cdot \sum_{h=0}^{\lg n} \frac{1}{2^h}$$

Solving the geometric series tells us that its value is less than $2$. Therefore $T(n) \le 2c_1 n$, and therefore $T(n)=O(n)$. We just made the selection algorithm linear!

Issue is, how do we find the median of the array? Isn’t that what we were trying to find in the first place, *using* this algorithm?

It turns out that finding the median will take way too long. We can, however, find a *good enough* pivot such that the time it takes to find it still fits in our budget, and also allows us to still eliminate enough elements during each recursion for the overall algorithm to remain linear.

### Picking a “good enough” pivot

Consider the pivot that splits the array such that $|L|, |R| \ge \frac{3n}{10}$ (both arrays contain at least $\frac{3n}{10}$ elements). This means that at every recursion step, we remove at least $\frac{3n}{10}$ of the elements.

Solving the recurrence for this still gives us a geometric series that evaluates to a constant, and so the runtime is still $O(n)$.

How do we find such a good pivot during each divide step quickly enough that the $O(n)$ term is still the upper limit? The answer is by using the **median of medians** as the pivot.

## Algorithm

- Group the array into collections of 5. There may be one remaining group with fewer than 5 elements.
- Find the median within each collection, including that last group.
- Recursively call the algorithm on the array of medians, with $k$ equal to half that array’s length. We want to find the median of these medians.
- Partition $A$ using the median of the medians. Keep track of the final position of our pivot.
- Depending on the three cases, we either return $A[pivot]$ as our answer, recurse into $L$, or recurse into $R$ (maybe updating $k$ along the way, because we’re now considering a smaller array).
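The steps above can be sketched as follows (function name illustrative, $k$ 1-indexed; partitioning is done with list comprehensions rather than in place, for clarity):

```python
def select(A, k):
    """Return the kth smallest element of A (1-indexed) in O(n) time
    using the median-of-medians pivot."""
    if len(A) <= 5:
        return sorted(A)[k - 1]  # base case: tiny array, sort directly

    # Group the array into collections of 5 (last one may be smaller).
    chunks = [A[i:i + 5] for i in range(0, len(A), 5)]
    # Find the median within each collection.
    medians = [sorted(c)[len(c) // 2] for c in chunks]

    # Recursively find the median of the medians.
    pivot = select(medians, (len(medians) + 1) // 2)

    # Partition A around the pivot.
    L = [x for x in A if x < pivot]
    R = [x for x in A if x > pivot]

    # The three cases: recurse left, recurse right (updating k), or done.
    if k <= len(L):
        return select(L, k)
    elif k > len(A) - len(R):
        return select(R, k - (len(A) - len(R)))
    else:
        return pivot
```

The pivot is excluded from both $L$ and $R$, so every recursive call is on a strictly smaller array.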

## Runtime

Grouping the array takes $O(n)$ time, and finding the median of each collection takes $O(1)$ time (because it’s at most 5 elements, we can find it in constant time).

We recursively call the algorithm on this list of medians. There are $O\left(\frac{n}{5}\right)$ such collections.

Partitioning takes $O(n)$ time (if we just swap elements in place). Since we remove *at least* $\frac{3n}{10}$ of the elements when we recurse, the subproblem has *at most* $\frac{7n}{10}$ of the original array. The recurrence relation is given by

$$T(n) = T\left(\frac{7n}{10}\right) + T\left(\frac{n}{5}\right) + O(n)$$

There is a really long, formal proof that shows this reduces to a term that is $O(n)$, but intuitively, when we consider the two recursive steps, we have

$$\frac{7n}{10} + \frac{n}{5} = \frac{7n+2n}{10} = \frac{9n}{10} < n$$

Recalling the intuition behind the master theorem: when the recursive step takes less time than the merging step, the latter will dominate. So the final runtime is $O(n)$.

## Using selection to find the median efficiently

Selection is commonly used within other algorithms in order to find the median of a list. Its efficiency means that it rarely dominates the overall runtime and basically can be used “for free.”

One example of this is in deterministic quicksort, where we find the exact median of each subarray to use as the pivot, guaranteeing balanced splits and bringing the worst-case runtime down to $O(n \lg n)$.
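As a sketch of this idea (with a compact, self-contained copy of the linear-time selection routine, so the example runs standalone; all names are illustrative):

```python
def select(A, k):
    """Linear-time kth smallest (1-indexed) via median of medians."""
    if len(A) <= 5:
        return sorted(A)[k - 1]
    chunks = [A[i:i + 5] for i in range(0, len(A), 5)]
    medians = [sorted(c)[len(c) // 2] for c in chunks]
    pivot = select(medians, (len(medians) + 1) // 2)
    L = [x for x in A if x < pivot]
    R = [x for x in A if x > pivot]
    if k <= len(L):
        return select(L, k)
    if k > len(A) - len(R):
        return select(R, k - (len(A) - len(R)))
    return pivot

def quicksort(A):
    """Deterministic quicksort: pivot on the exact median, so each
    level splits the array roughly in half."""
    if len(A) <= 1:
        return A
    pivot = select(A, (len(A) + 1) // 2)  # true median, found in O(n)
    L = [x for x in A if x < pivot]
    E = [x for x in A if x == pivot]
    R = [x for x in A if x > pivot]
    return quicksort(L) + E + quicksort(R)
```

Because the median split keeps both sides balanced, the recursion depth is $O(\lg n)$ and the worst case becomes $O(n \lg n)$; in practice, randomized pivots are usually preferred because the median-of-medians constant factor is large.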