General proof of the lower bound of comparison-based sorting algorithms using decision trees

May 18, 2023

Intro

It has been more than a year since my last article on this blog. So, I wanted to come back with something unique: stepping away from purely technical developer topics toward a more mathematical one. But don't get scared by the mathematical notation throughout the page, as I have made it as developer-friendly as possible.

In this article, I present one of the proofs that the lower bound of comparison-based sorting algorithms is $\Omega(n\log n)$, i.e. the best time performance that can be achieved by any sorting algorithm that relies on comparison operators. Moreover, I outline some of the well-known comparison-based sorting algorithms and conduct their analysis.

Since the dawn of computing and computers, the problem of inventing optimal sorting algorithms has been a central topic of discussion among computer scientists and mathematicians. Hundreds of sorting algorithms have been proposed so far, and the largest part of this research concerns comparison-based sorting algorithms.

Definition 1. Comparison-based sorting algorithms, being one of the types of sorting algorithms, order the elements of an array or list by using a single predicate, either $\le$ or $\ge$, on any pair of elements. The following two conditions must be satisfied (see the sketch after the list):

  1. if $a \le b$ and $b \le c$, then $a \le c$
  2. $\forall a, b$: $a \le b$ or $b \le a$
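
To make the definition concrete, here is a minimal Python sketch (the predicate name leq is hypothetical) showing that a single $\le$ predicate satisfying the two conditions above is enough to drive a sort:

```python
import functools

# A single "a <= b" predicate; transitivity and totality (the two
# conditions above) are what make the resulting order well defined.
def leq(a, b):
    return a <= b

# Derive a three-way comparator from the single predicate.
def cmp(a, b):
    if leq(a, b) and leq(b, a):
        return 0          # a and b are equivalent
    return -1 if leq(a, b) else 1

print(sorted([3, 1, 2], key=functools.cmp_to_key(cmp)))  # [1, 2, 3]
```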

Now let’s go over some of the most popular comparison-based sorting algorithms before proceeding to a general proof of the lower bound.

Example algorithms from the class

Quick Sort

This is a very famous algorithm that follows the divide-and-conquer approach, as proposed in the original research article by C. A. R. Hoare. It requires no extra memory, and the average number of comparisons it uses is $2(N-M)\ln(N-M)$, where $M$ and $N$ are the lower and upper indices of the array $A$ to be sorted, respectively. To operate, it calls a helper method partition(A, M, N) to find the pivot index $I$ (an index used for dividing the array into two parts). Subsequently, it recursively calls quicksort(A, M, I) and quicksort(A, I+1, N). In the best case (i.e. the lower bound), the partition index $I$ always comes out in the center of the array $A$, so the algorithm recurs twice on $\frac{n}{2}$ elements and the partition() function takes $O(n)$ work. Hence, the recurrence relation for the time complexity is:

$$T(n) = 2T\left(\frac{n}{2}\right) + O(n) \tag{1}$$

By the Master Theorem, equation $(1)$ solves to $\Theta(n\log n)$, i.e. the Quick Sort algorithm's time complexity is at least $\Omega(n\log n)$.
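
For reference, here is a minimal Python sketch of the scheme described above. It uses the Hoare partition; the pivot choice (the middle element) is an illustrative assumption, not a prescription from the text:

```python
def partition(A, lo, hi):
    # Hoare partition: returns an index I with A[lo..I] <= A[I+1..hi].
    pivot = A[(lo + hi) // 2]  # illustrative pivot choice
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while A[i] < pivot:
            i += 1
        j -= 1
        while A[j] > pivot:
            j -= 1
        if i >= j:
            return j
        A[i], A[j] = A[j], A[i]

def quicksort(A, lo, hi):
    if lo < hi:
        I = partition(A, lo, hi)  # pivot index dividing the array
        quicksort(A, lo, I)       # sort the left part
        quicksort(A, I + 1, hi)   # sort the right part

data = [5, 3, 8, 1, 9, 2]
quicksort(data, 0, len(data) - 1)
print(data)  # [1, 2, 3, 5, 8, 9]
```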

Heapsort

Heapsort is another comparison-based sorting algorithm that makes use of the heap data structure. First, the algorithm builds a max heap from the given array $A$ using a procedure called heapify(), whose time complexity is $O(\log n)$ per call. A max heap is a complete binary tree in which the value in each internal node is greater than or equal to the values in the children of that node.

An example of a max heap is shown in the following figure:

Figure 1. A max heap

Hence, the maximum element of the input array $A$ is stored at the root of the max heap. For the next step, the algorithm swaps the root element $A[0]$ with the last element of the heap, $A[i]$ (so we take out the root element, and the size of the heap is decremented). It then calls heapify() on the root again, repeating until there is no element left. If we let the size of the input $A$ be $n$, then heapify() is called $n$ times, each call costing $O(\log n)$. Finally, we can conclude that the total time complexity is at least $\Omega(n\log n)$.
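
A compact Python sketch of this procedure (the heapify() signature below is an assumption for illustration; the text only names the procedure):

```python
def heapify(A, n, i):
    # Sift A[i] down so the subtree rooted at i obeys the max-heap
    # property; this walks one root-to-leaf path, hence O(log n).
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and A[left] > A[largest]:
        largest = left
    if right < n and A[right] > A[largest]:
        largest = right
    if largest != i:
        A[i], A[largest] = A[largest], A[i]
        heapify(A, n, largest)

def heapsort(A):
    n = len(A)
    # Build a max heap by sifting down every internal node.
    for i in range(n // 2 - 1, -1, -1):
        heapify(A, n, i)
    # Repeatedly swap the root (the maximum) with the last heap element,
    # shrink the heap, and restore the heap property at the root.
    for i in range(n - 1, 0, -1):
        A[0], A[i] = A[i], A[0]
        heapify(A, i, 0)

data = [5, 3, 8, 1, 9, 2]
heapsort(data)
print(data)  # [1, 2, 3, 5, 8, 9]
```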

Timsort

Timsort is one of the more recently invented hybrid algorithms, also comparison-based, and was developed by Tim Peters. It is generally preferred for sorting real-world data in modern software (it is, for instance, the algorithm behind Python's built-in sort). The algorithm is thoroughly described in this article. An array $A$ is divided into blocks called runs, where a run of an array $A$ is a non-increasing or non-decreasing sub-array of $A$.

Normally, the runs of an array are small, so we apply insertion sort to each of them sequentially, since insertion sort is well suited for small lists. Finally, the algorithm combines the runs using the same merge procedure as the Merge Sort algorithm. The theorem in the article referenced above proves that the total time complexity of Timsort is also $\Omega(n \log n)$.
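
Below is a heavily simplified Python sketch of this structure. Real Timsort detects natural runs adaptively and uses galloping merges; the fixed RUN length here is an assumption for illustration only:

```python
RUN = 32  # illustrative fixed run length; real Timsort picks runs adaptively

def insertion_sort(A, lo, hi):
    # Sort the small slice A[lo..hi] in place.
    for i in range(lo + 1, hi + 1):
        key, j = A[i], i - 1
        while j >= lo and A[j] > key:
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = key

def merge(A, lo, mid, hi):
    # Standard Merge Sort merge of A[lo..mid] and A[mid+1..hi].
    left, right = A[lo:mid + 1], A[mid + 1:hi + 1]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            A[k] = left[i]; i += 1
        else:
            A[k] = right[j]; j += 1
        k += 1
    A[k:hi + 1] = left[i:] + right[j:]

def timsort(A):
    n = len(A)
    for lo in range(0, n, RUN):            # insertion-sort each run
        insertion_sort(A, lo, min(lo + RUN - 1, n - 1))
    size = RUN
    while size < n:                        # merge runs pairwise, doubling
        for lo in range(0, n, 2 * size):
            mid, hi = lo + size - 1, min(lo + 2 * size - 1, n - 1)
            if mid < hi:
                merge(A, lo, mid, hi)
        size *= 2
    return A

print(timsort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```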

Prerequisites for the proof

In the previous section, we gave a formal definition of comparison-based sorting algorithms. Afterwards, we analyzed implementations of three algorithms of this class, namely Quick Sort, Heapsort and Timsort. We deduced that each of the above-mentioned algorithms requires at least $\Omega(n\log n)$ work.

This section is devoted to a formal and general mathematical proof that any sorting algorithm that uses comparisons needs at least $\Omega(n\log n)$ comparisons.

Since we are concerned with proving the lower bound using decision trees, it is wise to present a formal mathematical definition of a tree beforehand:

Definition 2. A tree is an undirected graph in which any two vertices are connected by exactly one path.

Figure 2. A tree ($a$ is the root of the tree; $b$ and $c$ are children of $a$; $b$, $c$, $e$ and $g$ are internal nodes; $d$, $h$, $f$, $i$ and $j$ are leaves of the tree)

The next piece of terminology we will make use of during the proof is the so-called m-way tree:

Definition 3. An m-way tree is a tree where each internal node has at most $m$ children.

Figure 3. Examples of m-way trees: 2-way, 3-way, and 4-way trees

From combinatorics, we know that $n$ distinct elements have $n!$ permutations. The desired sorted order is one of those $n!$ arrangements. We can visualize these permutations using a decision tree (see Figure 4). From the figure, it is clear that for $n=3$ we get $n!=6$ permutations (which is also the number of leaves in the tree). The most comparisons occur along the longest path in the tree, i.e. the height of the tree, $h=3$. We note that the efficiency of a comparison sort depends on the number of such comparisons.

Figure 4. A decision tree showing the order permutations of 3 distinct elements $a$, $b$ and $c$
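
As a quick numeric check of these counts (an illustrative snippet, not part of the original argument):

```python
from itertools import permutations
from math import factorial, log2, ceil

n = 3
leaves = len(list(permutations("abc")))
print(leaves, factorial(n))      # 6 6 -- leaves of the decision tree
print(ceil(log2(factorial(n))))  # 3 -- the height h of the tree
```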

Theorem 1. An m-way tree with height $h$ has at most $l = m^h$ leaves.

Proof. We use mathematical induction on the height. Initially, let $h=1$. Such a tree has only a root and no more than $m$ children, all of which are leaves. Therefore, for $h=1$, the tree has at most $m^1 = m$ leaves.

Now, we assume that the theorem is true for all m-way trees of height less than $h$. Let $T$ be an m-way tree with height $h$. The leaves of the tree $T$ are the union of the leaves of the subtrees rooted at the children of $T$'s root.

The height of each of these subtrees is at most $h-1$. Hence, by the induction hypothesis, each subtree of $T$ possesses at most $m^{h-1}$ leaves. Since we have at most $m$ such subtrees, each of them having at most $m^{h-1}$ leaves, the given tree has at most $l = m \cdot m^{h-1} = m^h$ leaves.

Theorem 1 gives that the number of leaves is at most $m^h$, i.e. $l \le m^h$. Taking the base-$m$ logarithm of both sides yields $\log_m l \le \log_m m^h \Longleftrightarrow h \ge \log_m l$. Since $h \in \mathbb{Z}$, we get $h \ge \lceil \log_m l \rceil$, where $\lceil x \rceil$ is the ceiling function.

Result 1. The minimum value of the height $h$ of an m-way tree with $l$ leaves is $\lceil \log_m l \rceil$, and for a decision tree this is equivalent to the minimum number of comparisons.

Result 2. An algorithm based on binary comparisons requires at least $\lceil \log_2 n! \rceil$ comparisons to be performed.
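
To see Result 2 in action, here is a small Python experiment (a sketch for illustration only) that measures the worst-case number of comparisons Python's built-in sort performs over all permutations of $n$ elements and compares it with $\lceil \log_2 n! \rceil$. Note that the bound is a worst-case guarantee: a particular easy input may need fewer comparisons, but no correct algorithm can beat the bound on every input:

```python
import functools
from itertools import permutations
from math import factorial, log2, ceil

def comparisons_used(data):
    # Count the binary comparisons one sorting run performs.
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(data, key=functools.cmp_to_key(cmp))
    return count

for n in range(2, 8):
    worst = max(comparisons_used(p) for p in permutations(range(n)))
    print(n, worst, ceil(log2(factorial(n))))  # worst >= the lower bound
```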

Final proof

Finally, using the results obtained above, we prove that the best efficiency any comparison-based sorting algorithm can achieve is $\Omega(n \log n)$. Let us recall the definition of Big Omega (see here for the differences between Big Omega, Big O and Big Theta):

Definition 4. Let $f,g:\mathbb{Z} \rightarrow \mathbb{R}$ or $f,g:\mathbb{R} \rightarrow \mathbb{R}$. If there exist positive constants $C$ and $k$ such that for all $x > k$ the following inequality holds:

$$|f(x)| \ge C|g(x)|$$

then $f(x)$ is said to be $\Omega(g(x))$, i.e. $g(x)$ is a lower bound of $f(x)$.

We need the following Lemma:

Lemma 1. $\forall n > 4$, the inequality $\log \frac{n}{2} > \frac{1}{2}\log n$ holds.

Proof: We reduce the inequality to a simple equivalent one. Since $\frac{1}{2}\log n = \log \sqrt{n}$ and the logarithm is an increasing function, we get $\log \frac{n}{2} > \log \sqrt{n} \Longleftrightarrow \frac{n}{2} > \sqrt{n} \Longleftrightarrow \frac{\sqrt{n}}{2} > 1 \Longleftrightarrow \sqrt{n} > 2 \Longleftrightarrow n > 4$, which concludes the proof of this lemma.

Theorem 2 (the main theorem of this article). The number of comparisons, which is $\lceil \log n! \rceil$ for any comparison-based sorting algorithm, belongs to the class of functions $\Omega(n \log n)$, where $n$ is the number of elements in the input list.

Proof. Let us take $k=4$ and $n > k$. Since $n! = n \cdot (n-1) \cdots 2 \cdot 1$:

$$|\log n!| = \log n! = \log\left(n \cdot (n-1) \cdots 2 \cdot 1\right) = \log n + \log(n-1) + \dots + \log 2 + \log 1$$

We first drop the smaller half of the $n$ terms in the last sum, keeping only the $\frac{n}{2}$ terms from $\log n$ down to $\log \frac{n}{2}$, and then replace each remaining term by the smallest of them, $\log \frac{n}{2}$; each step can only make the sum smaller:

$$\log n + \log(n-1) + \dots + \log 2 + \log 1 \ge \\ \log n + \log(n-1) + \dots + \log \frac{n}{2} \ge \\ \log \frac{n}{2} + \log \frac{n}{2} + \dots + \log \frac{n}{2} = \frac{n}{2} \cdot \log \frac{n}{2}$$

Now, by Lemma 1 and the last inequality:

$$\frac{n}{2} \cdot \log \frac{n}{2} > \frac{n}{2} \cdot \frac{1}{2}\log n = \frac{n}{4}\log n = \frac{1}{4}|n\log n|$$

Finally, for $n > 4$, we have proven $|\log n!| \ge \frac{1}{4}|n\log n|$.

Hence, by Definition 4, $\log n!$ belongs to the class of functions $\Omega(n\log n)$ (with $C=\frac{1}{4}$ and $k=4$).
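
As a final numeric sanity check of the inequality we just proved (an illustrative snippet; it assumes the natural logarithm, but the base does not affect the inequality):

```python
from math import lgamma, log

# Check log(n!) >= (1/4) * n * log(n) for a few n > 4.
# lgamma(n + 1) computes log(n!) without overflow for large n.
for n in (5, 10, 100, 10_000):
    print(n, lgamma(n + 1) >= 0.25 * n * log(n))  # True for all of them
```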

Wrapping up. Why should you care?

As software engineers, we use comparison-based sorting algorithms directly or indirectly (via third-party libraries) in almost every application we develop. Even if not every mathematical detail in this article is essential to understand, the article provides a rigorous mathematical guarantee that whatever comparison-based algorithm we use (whether an existing one or something new we come up with), the best time complexity we can achieve is $\Omega(n \log n)$, where $n$ is the length of an arbitrary list.


abdullaev.dev

Personal blog by Azamat Abdullaev. I write my <discoveries />.