General proof of the lower bound of comparison-based sorting algorithms using decision trees

May 18, 2023

Intro

It has been more than a year since my last article on this blog. So, I wanted to come back with something unique: stepping away from purely technical developer topics toward a more mathematical one. But don't get scared by the mathematical notation throughout the page, as I have made it as developer-friendly as possible.

In this article, I present one of the proofs that the lower bound of comparison-based sorting algorithms is $\Omega(n\log n)$, i.e. the best time performance that can be achieved by any sorting algorithm that relies on comparison operators. Moreover, I outline some of the well-known comparison-based sorting algorithms and conduct their analysis.

Since the dawn of computing and computers, the problem of inventing optimal sorting algorithms has been a central topic of discussion among computer scientists and mathematicians. Hundreds of sorting algorithms have been proposed so far, and the largest part of this research concerns comparison-based sorting algorithms.

Definition 1. Comparison-based sorting algorithms, being one of the types of sorting algorithms, order the elements of an array or list by using a single predicate, either $\le$ or $\ge$, on any pair of elements. The following two conditions must be satisfied (see the sketch after the list):

  1. if $a \le b$ and $b \le c$, then $a \le c$
  2. $\forall a, b$: $a \le b$ or $b \le a$
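
To make the definition concrete, here is a minimal Python sketch (the predicate name leq is hypothetical) showing that a single $\le$ predicate satisfying the two conditions above is enough to drive a sort:

```python
import functools

# A single "a <= b" predicate; transitivity and totality (the two
# conditions above) are what make the resulting order well defined.
def leq(a, b):
    return a <= b

# Derive a three-way comparator from the single predicate.
def cmp(a, b):
    if leq(a, b) and leq(b, a):
        return 0          # a and b are equivalent
    return -1 if leq(a, b) else 1

print(sorted([3, 1, 2], key=functools.cmp_to_key(cmp)))  # [1, 2, 3]
```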

Now let’s go over some of the most popular comparison-based sorting algorithms before proceeding to a general proof of the lower bound.

Example algorithms from the class

Quick Sort

This is a very famous algorithm that follows the divide-and-conquer approach, as proposed in the original research article by C. A. R. Hoare. It requires no extra memory, and the average number of comparisons it uses is $2(N-M)\ln(N-M)$, where $M$ and $N$ are the lower and upper indices of the array $A$ to be sorted, respectively. To operate, it calls a helper method partition(A, M, N) to find the pivot index $I$ (an index used for dividing the array into two parts). Subsequently, it recursively calls quicksort(A, M, I) and quicksort(A, I+1, N). In the best case (i.e. the lower bound), the partition index $I$ always comes out in the center of the array $A$, so the algorithm recurs twice on $\frac{n}{2}$ elements and the partition() function takes $O(n)$ work. Hence, the recurrence relation for the time complexity is:

$$T(n) = 2T\left(\frac{n}{2}\right) + O(n) \tag{1}$$

By the Master Theorem, equation $(1)$ solves to $\Theta(n\log n)$, i.e. the Quick Sort algorithm's time complexity is at least $\Omega(n\log n)$.
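
For reference, here is a minimal Python sketch of the scheme described above. It uses the Hoare partition; the pivot choice (the middle element) is an illustrative assumption, not a prescription from the text:

```python
def partition(A, lo, hi):
    # Hoare partition: returns an index I with A[lo..I] <= A[I+1..hi].
    pivot = A[(lo + hi) // 2]  # illustrative pivot choice
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while A[i] < pivot:
            i += 1
        j -= 1
        while A[j] > pivot:
            j -= 1
        if i >= j:
            return j
        A[i], A[j] = A[j], A[i]

def quicksort(A, lo, hi):
    if lo < hi:
        I = partition(A, lo, hi)  # pivot index dividing the array
        quicksort(A, lo, I)       # sort the left part
        quicksort(A, I + 1, hi)   # sort the right part

data = [5, 3, 8, 1, 9, 2]
quicksort(data, 0, len(data) - 1)
print(data)  # [1, 2, 3, 5, 8, 9]
```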

Heapsort

Heapsort is another comparison-based sorting algorithm that makes use of the heap data structure. First, the algorithm builds a max heap from the given array $A$ using a procedure called heapify(), whose time complexity is $O(\log n)$ per call. A max heap is a complete binary tree in which the value in each internal node is greater than or equal to the values in the children of that node.

An example of a max heap is shown in the following figure:

Figure 1. A max heap

Hence, the maximum element of the input array $A$ is stored at the root of the max heap. For the next step, the algorithm swaps the root element $A[0]$ with the last element of the heap, $A[i]$ (so we take out the root element, and the size of the heap is decremented). It then calls heapify() on the root again, repeating until there is no element left. If we let the size of the input $A$ be $n$, then heapify() is called $n$ times, each call costing $O(\log n)$. Finally, we can conclude that the total time complexity is at least $\Omega(n\log n)$.
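
A compact Python sketch of this procedure (the heapify() signature below is an assumption for illustration; the text only names the procedure):

```python
def heapify(A, n, i):
    # Sift A[i] down so the subtree rooted at i obeys the max-heap
    # property; this walks one root-to-leaf path, hence O(log n).
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and A[left] > A[largest]:
        largest = left
    if right < n and A[right] > A[largest]:
        largest = right
    if largest != i:
        A[i], A[largest] = A[largest], A[i]
        heapify(A, n, largest)

def heapsort(A):
    n = len(A)
    # Build a max heap by sifting down every internal node.
    for i in range(n // 2 - 1, -1, -1):
        heapify(A, n, i)
    # Repeatedly swap the root (the maximum) with the last heap element,
    # shrink the heap, and restore the heap property at the root.
    for i in range(n - 1, 0, -1):
        A[0], A[i] = A[i], A[0]
        heapify(A, i, 0)

data = [5, 3, 8, 1, 9, 2]
heapsort(data)
print(data)  # [1, 2, 3, 5, 8, 9]
```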

Timsort

Timsort is one of the more recently invented hybrid algorithms, also comparison-based, and was developed by Tim Peters. It is generally preferred for sorting real-world data in modern software (it is, for instance, the algorithm behind Python's built-in sort). The algorithm is thoroughly described in this article. An array $A$ is divided into blocks called runs, where a run of an array $A$ is a non-increasing or non-decreasing sub-array of $A$.

Normally, the runs of an array are small, so we apply insertion sort to each of them sequentially, since insertion sort is well suited for small lists. Finally, the algorithm combines the runs using the same merge procedure as the Merge Sort algorithm. The theorem in the article referenced above proves that the total time complexity of Timsort is also $\Omega(n \log n)$.
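
Below is a heavily simplified Python sketch of this structure. Real Timsort detects natural runs adaptively and uses galloping merges; the fixed RUN length here is an assumption for illustration only:

```python
RUN = 32  # illustrative fixed run length; real Timsort picks runs adaptively

def insertion_sort(A, lo, hi):
    # Sort the small slice A[lo..hi] in place.
    for i in range(lo + 1, hi + 1):
        key, j = A[i], i - 1
        while j >= lo and A[j] > key:
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = key

def merge(A, lo, mid, hi):
    # Standard Merge Sort merge of A[lo..mid] and A[mid+1..hi].
    left, right = A[lo:mid + 1], A[mid + 1:hi + 1]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            A[k] = left[i]; i += 1
        else:
            A[k] = right[j]; j += 1
        k += 1
    A[k:hi + 1] = left[i:] + right[j:]

def timsort(A):
    n = len(A)
    for lo in range(0, n, RUN):            # insertion-sort each run
        insertion_sort(A, lo, min(lo + RUN - 1, n - 1))
    size = RUN
    while size < n:                        # merge runs pairwise, doubling
        for lo in range(0, n, 2 * size):
            mid, hi = lo + size - 1, min(lo + 2 * size - 1, n - 1)
            if mid < hi:
                merge(A, lo, mid, hi)
        size *= 2
    return A

print(timsort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```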

Prerequisites for the proof

In the previous section, we gave a formal definition of comparison-based sorting algorithms. Afterwards, we analyzed implementations of three algorithms of this class, namely Quick Sort, Heapsort and Timsort. We deduced that each of the above-mentioned algorithms requires at least $\Omega(n\log n)$ work.

This section is devoted to a formal and general mathematical proof that any sorting algorithm that uses comparisons needs at least $\Omega(n\log n)$ comparisons.

Since we are concerned with proving the lower bound using decision trees, it is wise to present a formal mathematical definition of a tree beforehand:

Definition 2. A tree is an undirected graph in which any two vertices are connected by exactly one path.

Figure 2. A tree ($a$ is the root of the tree; $b$ and $c$ are children of $a$; $b$, $c$, $e$ and $g$ are internal nodes; $d$, $h$, $f$, $i$ and $j$ are leaves of the tree)

The next piece of terminology we will make use of during the proof is the so-called m-way tree:

Definition 3. An m-way tree is a tree where each internal node has at most $m$ children.

Figure 3. Examples of m-way trees: 2-way, 3-way, and 4-way trees

From combinatorics, we know that $n$ distinct elements have $n!$ permutations. The desired sorted order is one of those $n!$ arrangements. We can visualize these permutations using a decision tree (see Figure 4). From the figure, it is clear that for $n=3$ we get $n!=6$ permutations (which is also the number of leaves in the tree). The most comparisons occur along the longest path in the tree, i.e. the height of the tree, $h=3$. We note that the efficiency of a comparison sort depends on the number of such comparisons.

Figure 4. A decision tree showing the order permutations of 3 distinct elements $a$, $b$ and $c$
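
As a quick numeric check of these counts (an illustrative snippet, not part of the original argument):

```python
from itertools import permutations
from math import factorial, log2, ceil

n = 3
leaves = len(list(permutations("abc")))
print(leaves, factorial(n))      # 6 6 -- leaves of the decision tree
print(ceil(log2(factorial(n))))  # 3 -- the height h of the tree
```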

Theorem 1. An m-way tree with height $h$ has at most $l = m^h$ leaves.

Proof. We use mathematical induction on the height. Initially, let $h=1$. Such a tree has only a root and no more than $m$ children, all of which are leaves. Therefore, for $h=1$, the tree has at most $m^1 = m$ leaves.

Now, we assume that the theorem is true for all m-way trees of height less than $h$. Let $T$ be an m-way tree with height $h$. The leaves of the tree $T$ are the union of the leaves of the subtrees rooted at the children of $T$'s root.

The height of each of these subtrees is at most $h-1$. Hence, by the induction hypothesis, each subtree of $T$ possesses at most $m^{h-1}$ leaves. Since we have at most $m$ such subtrees, each of them having at most $m^{h-1}$ leaves, the given tree has at most $l = m \cdot m^{h-1} = m^h$ leaves.

Theorem 1 gives that the number of leaves is at most $m^h$, i.e. $l \le m^h$. Taking the base-$m$ logarithm of both sides yields $\log_m l \le \log_m m^h \Longleftrightarrow h \ge \log_m l$. Since $h \in \mathbb{Z}$, we get $h \ge \lceil \log_m l \rceil$, where $\lceil x \rceil$ is the ceiling function.

Result 1. The minimum value of the height $h$ of an m-way tree with $l$ leaves is $\lceil \log_m l \rceil$, and for a decision tree this is equivalent to the minimum number of comparisons.

Result 2. An algorithm based on binary comparisons requires at least $\lceil \log_2 n! \rceil$ comparisons to be performed.
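
To see Result 2 in action, here is a small Python experiment (a sketch for illustration only) that measures the worst-case number of comparisons Python's built-in sort performs over all permutations of $n$ elements and compares it with $\lceil \log_2 n! \rceil$. Note that the bound is a worst-case guarantee: a particular easy input may need fewer comparisons, but no correct algorithm can beat the bound on every input:

```python
import functools
from itertools import permutations
from math import factorial, log2, ceil

def comparisons_used(data):
    # Count the binary comparisons one sorting run performs.
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(data, key=functools.cmp_to_key(cmp))
    return count

for n in range(2, 8):
    worst = max(comparisons_used(p) for p in permutations(range(n)))
    print(n, worst, ceil(log2(factorial(n))))  # worst >= the lower bound
```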

Final proof

Finally, using the results obtained above, we prove that the best efficiency any comparison-based sorting algorithm can achieve is $\Omega(n \log n)$. Let us recall the definition of Big Omega (see here for the differences between Big Omega, Big O and Big Theta):

Definition 4. Let $f,g:\mathbb{Z} \rightarrow \mathbb{R}$ or $f,g:\mathbb{R} \rightarrow \mathbb{R}$. If there exist positive constants $C$ and $k$ such that for all $x > k$ the following inequality holds:

$$|f(x)| \ge C|g(x)|$$

then $f(x)$ is said to be $\Omega(g(x))$, i.e. $g(x)$ is a lower bound of $f(x)$.

We need the following Lemma:

Lemma 1. $\forall n > 4$, the inequality $\log \frac{n}{2} > \frac{1}{2}\log n$ holds.

Proof: We reduce the inequality to a simple equivalent one. Since $\frac{1}{2}\log n = \log \sqrt{n}$ and the logarithm is an increasing function, we get $\log \frac{n}{2} > \log \sqrt{n} \Longleftrightarrow \frac{n}{2} > \sqrt{n} \Longleftrightarrow \frac{\sqrt{n}}{2} > 1 \Longleftrightarrow \sqrt{n} > 2 \Longleftrightarrow n > 4$, which concludes the proof of this lemma.

Theorem 2 (the main theorem of this article). The number of comparisons, which is $\lceil \log n! \rceil$ for any comparison-based sorting algorithm, belongs to the class of functions $\Omega(n \log n)$, where $n$ is the number of elements in the input list.

Proof. Let us take $k=4$ and $n > k$. Since $n! = n \cdot (n-1) \cdots 2 \cdot 1$:

$$|\log n!| = \log n! = \log\left(n \cdot (n-1) \cdots 2 \cdot 1\right) = \log n + \log(n-1) + \dots + \log 2 + \log 1$$

We first drop the smaller half of the $n$ terms in the last sum, keeping only the $\frac{n}{2}$ terms from $\log n$ down to $\log \frac{n}{2}$, and then replace each remaining term by the smallest of them, $\log \frac{n}{2}$; each step can only make the sum smaller:

$$\log n + \log(n-1) + \dots + \log 2 + \log 1 \ge \\ \log n + \log(n-1) + \dots + \log \frac{n}{2} \ge \\ \log \frac{n}{2} + \log \frac{n}{2} + \dots + \log \frac{n}{2} = \frac{n}{2} \cdot \log \frac{n}{2}$$

Now, by Lemma 1 and the last inequality:

$$\frac{n}{2} \cdot \log \frac{n}{2} > \frac{n}{2} \cdot \frac{1}{2}\log n = \frac{n}{4}\log n = \frac{1}{4}|n\log n|$$

Finally, for $n > 4$, we have proven $|\log n!| \ge \frac{1}{4}|n\log n|$.

Hence, by Definition 4, $\log n!$ belongs to the class of functions $\Omega(n\log n)$ (with $C=\frac{1}{4}$ and $k=4$).
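
As a final numeric sanity check of the inequality we just proved (an illustrative snippet; it assumes the natural logarithm, but the base does not affect the inequality):

```python
from math import lgamma, log

# Check log(n!) >= (1/4) * n * log(n) for a few n > 4.
# lgamma(n + 1) computes log(n!) without overflow for large n.
for n in (5, 10, 100, 10_000):
    print(n, lgamma(n + 1) >= 0.25 * n * log(n))  # True for all of them
```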

Wrapping up. Why should you care?

As software engineers, we use comparison-based sorting algorithms directly or indirectly (via third-party libraries) in almost every application we develop. Even if not every mathematical detail in this article is essential to understand, the article provides a rigorous mathematical guarantee that whatever comparison-based algorithm we use (whether an existing one or something new we come up with), the best time complexity we can achieve is $\Omega(n \log n)$, where $n$ is the length of an arbitrary list.


abdullaev.dev

Personal blog by Azamat Abdullaev. I write my <discoveries />.