**A portfolio of efficient algorithms for different sorting disciplines is the basis of sustainable sorting.**

The most generic (and expensive) sorting algorithms have the following features:

- generic sorting sorts by comparison
- generic sorting is stable
- generic sorting can sort elements of varying size
- generic sorting has worst case complexity \(\mathcal{O}(N\log{N})\)
- generic sorting can be implemented with concurrent divide and concurrent conquer

**The greeNsort^{®} Portfolio provides algorithms for different sorting disciplines, ranked here from most generic to most specialized.**

A popular prior-art method is an indirect *Quicksort*, where pointers to the size-varying elements are swapped.^{1} This is relatively memory-parsimonious, particularly if the elements are big; however, it comes with high random-access CPU costs. The *greeNsort^{®}* research compares this to a direct approach that sorts the size-varying elements themselves.
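The indirect principle can be sketched in a few lines of Python (a hypothetical illustration of the general technique, not the greeNsort reference implementation): only small fixed-size indices are moved, but every comparison must follow a "pointer" to the element, which is a random access.

```python
# Indirect sorting: instead of moving size-varying elements (here: strings),
# sort an array of indices ("pointers") into the element array.

def indirect_sort(elements):
    """Return a permutation of indices that sorts `elements`."""
    order = list(range(len(elements)))
    # Only the small indices are rearranged; each comparison, however,
    # must dereference the index to reach the actual element.
    order.sort(key=lambda i: elements[i])
    return order

elements = ["pear", "apple", "fig", "banana"]
order = indirect_sort(elements)
print([elements[i] for i in order])  # ['apple', 'banana', 'fig', 'pear']
```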

**By adapting the main greeNsort^{®} algorithm Frogsort to the size-varying case, an algorithm is obtained that needs only 50% buffer and still fulfills all of the above criteria for a generic sorting algorithm.**

Most prior-art algorithms deal with equally sized elements only.

The standard methods for stable sorting that allow for concurrent execution use 100% buffer, for example *Mergesort* as used for stable sorting in C++^{2}. *Mergesort* can easily be tuned for adaptivity: the more the data is presorted, the more of the comparison cost is smoothly skipped. **The greeNsort^{®} algorithm Omitsort improves upon this by also smoothly skipping the cost of movements, which smoothly takes the algorithm towards \(\mathcal{O}(N)\) for presorted data.**
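The adaptivity idea can be sketched as follows (a hypothetical illustration of the merge-skipping principle, not the actual Omitsort implementation): a single comparison detects whether two sorted halves are already in order, in which case the whole merge, and all of its movements, is omitted.

```python
# Adaptive mergesort sketch: if the halves are already ordered, the merge
# is skipped entirely, so fully presorted input costs only O(N) comparisons.

def adaptive_mergesort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    adaptive_mergesort(a, lo, mid)
    adaptive_mergesort(a, mid, hi)
    if a[mid - 1] <= a[mid]:  # halves already in order:
        return                # omit comparisons AND movements
    a[lo:hi] = merge(a[lo:mid], a[mid:hi])

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # `<=` keeps the sort stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

data = [5, 1, 4, 2, 3]
adaptive_mergesort(data)
print(data)  # [1, 2, 3, 4, 5]
```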

*Timsort*, as used in Python and Java, is a highly tuned adaptive *natural Mergesort* which, towards the best case of presorted data, smoothly approaches \(\mathcal{O}(N)\) cost, and which needs only 50% buffer. However, beyond its implementation complexity, the disadvantages of Timsort are much higher cost on truly unsorted data and strongly serial logic that hampers concurrency. **The greeNsort^{®} algorithms Frogsort and Geckosort in variants 0 and 1 achieve the generality of Mergesort with only 50% buffer and without such disadvantages: they are automatically adaptive (without tuning), albeit \(\mathcal{O}(N\log{N})\).**
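Since both Timsort and Frogsort get by with 50% buffer, here is a naive Python sketch of the classical 50%-buffer merging trick (a well-known prior-art technique, not the greeNsort Frogsort algorithm): copy the sorted left half into a buffer of size N/2, then merge buffer and right half back into the full array from the front.

```python
# Classical 50%-buffer mergesort sketch. Real implementations reuse one
# preallocated buffer; Python slices copy, so this is illustrative only.

def halfbuffer_mergesort(a):
    n = len(a)
    if n < 2:
        return
    mid = n // 2
    left = a[:mid]                 # 50% buffer: copy of the left half
    halfbuffer_mergesort(left)
    right = a[mid:]
    halfbuffer_mergesort(right)
    a[mid:] = right
    # Merge buffer and in-place right half into `a` from the front.
    # The write index k never overtakes the read index j, so no element
    # of the right half is overwritten before it is read.
    i, j, k = 0, mid, 0
    while i < mid and j < n:
        if left[i] <= a[j]:        # `<=` keeps the sort stable
            a[k] = left[i]; i += 1
        else:
            a[k] = a[j]; j += 1
        k += 1
    a[k:k + (mid - i)] = left[i:]  # drain remaining buffer elements
```

A limitation noted in the footnotes: both halves share the single buffer, so they cannot be sorted concurrently.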

The *greeNsort^{®}* research is not aware of prior-art algorithms that can match the speed of standard *Mergesort* with substantially less than 100% buffer.

Academic research was for decades obsessed with Sedgewick's *Holy Grail of Sorting*: finding a true stable inplace-Mergesort that can compete with standard *Mergesort*. Only recently, Andrey Astrelin was able to implement *Grailsort*, an algorithm of Huang & Langston, such that it was not more than 1.5 times slower than standard *Mergesort*. Astrelin himself showed that this remains academic: he provides a variant (*Sqrtsort*) that uses negligible \(\mathcal{O}(\sqrt{N})\) buffer and is only 1.3 times slower than standard *Mergesort*. **The greeNsort^{®} algorithms Walksort and Jumpsort also work with such little \(\mathcal{O}(\sqrt{N})\) buffer and achieve approximately the speed of standard Mergesort**.

It is well known that sacrificing stability enables effective inplace-partitioning of equally-sized elements. Further sacrificing worst-case \(\mathcal{O}(N\log{N})\) for only average-case \(\mathcal{O}(N\log{N})\) (by using random pivots instead of deterministic median-pivots) enables inplace-partitioning with *Quicksort* at about the speed of *Mergesort*. In other words: when these sacrifices are acceptable, *Quicksort* tends to be more sustainable than *Mergesort* according to the *greeNsort^{®}* footprint definition.
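This kind of inplace-partitioning with a random pivot can be sketched as follows (a minimal Lomuto-style illustration of the general technique, not a greeNsort algorithm):

```python
import random

# In-place quicksort with random pivots: no buffer, average-case
# O(N log N) on all inputs, but swapping destroys stability.

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = random.randint(lo, hi)       # random pivot: quadratic runtime
    a[p], a[hi] = a[hi], a[p]        # becomes improbable for any input
    pivot, k = a[hi], lo
    for i in range(lo, hi):          # Lomuto partition, in place
        if a[i] < pivot:
            a[i], a[k] = a[k], a[i]  # swapping loses the original order
            k += 1                   # of equal elements (not stable)
    a[k], a[hi] = a[hi], a[k]
    quicksort(a, lo, k - 1)
    quicksort(a, k + 1, hi)
```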

Recently, *Super Scalar Sample Sort* (SSSS, or S4 for short) has been developed with the goal of replacing standard sorting libraries. While it is faster than current standard sorting libraries, it is not stable^{4}, it is more complex, it requires more buffer^{5}, and its serial^{6} implementation (IS4o) is slower than the best algorithms in the following section.

**The greeNsort^{®} algorithms Zacksort and Zucksort provide simple and elegant solutions to a question open since Hoare's 1961 inception of Quicksort: how to obtain early termination on ties without the cost of extra operations slowing down the algorithm on non-tied data.**

**A bit of history**: The original *Quicksort* as published by Hoare (1961) seemed simple and almost perfect, but was vulnerable to quadratic runtime for certain data inputs. Sedgewick (1977) compared different alternatives and recommended a (modified) *Quicksort2* of Singleton (1969), which had average-case \(\mathcal{O}(N\log{N})\) for all data inputs but no longer early-terminated to \(\mathcal{O}(N\log{d})\) in the tied case. Threeway Quicksort by Wegner (1985), improved and popularized by Bentley & McIlroy (1993), re-introduced early termination on ties by isolating a third partition with pivot-ties, but due to its extra operations was slower on untied data (*Quicksort3*). Yaroslavskiy's (2009) faster dual-pivot Quicksort (*Dupisort*) used such extra operations to create a real third partition^{7}, but the complex^{8} algorithm is strongly serial and probably impossible to implement branchless. Edelkamp & Weiß (2016) published the much faster, simple, branchless Block-Quicksort (*Quicksort2B*)^{9}, but only with a rudimentary and expensive early-termination mechanism. Orson Peters (2015) combined a branchless algorithm with proper tuning for ties and proper tuning for presorted data; the overhead of the extra operations in his *Pattern-Defeating Quicksort* is very small, so it is close to optimal, and I recommend his excellent C++ implementation on GitHub.

Sorting algorithms exploiting specific situations play an important role in reducing the footprint of sorting. *greeNsort^{®}* focuses on generic algorithms; however, it is important to know the following classes of specific algorithms, which do not sort by comparisons.

Analytic sorting algorithms sort elements by incrementally analyzing the keys for recursive partitioning; in other words, they require the specific situation where the sorting order can be determined by analyzing the keys alone, without mutually comparing them. Malte Skarupke (2015) has stressed the importance of offering a convenient API for as many binary data types as possible; his C++ code can be found here.

Incremental analysis of string keys is known to everyone who has seen a phone book. Strings are recursively partitioned character by character; sorting ends as soon as the number of remaining elements is one (or small enough to delegate the remaining work to another algorithm). This method can handle strings of varying lengths, but unlike generic sorting it cannot handle locale-specific collation orders. The straightforward way of doing this is partitioning into \(d_i\) buckets, where \(d_i\) is the number of distinct characters at the i-th position of the string. This works well for small alphabets; if the number of distinct characters becomes prohibitively high, an alternative related to *threeway Quicksort* with a fan-out between 2 and 3 is *Multi-key Quicksort*.
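The phone-book method can be sketched as follows (an illustrative MSD string sort, with names and details of my own choosing): strings are bucketed by their character at the current depth, and recursion terminates early on single-element buckets.

```python
from collections import defaultdict

# MSD string sort sketch: recursively partition strings by the character
# at position `depth`; stop as soon as a bucket holds a single string.

def msd_string_sort(strings, depth=0):
    if len(strings) <= 1:
        return strings                  # early termination
    buckets = defaultdict(list)
    done = []                           # strings exhausted at this depth
    for s in strings:
        if len(s) <= depth:
            done.append(s)              # shorter strings sort first
        else:
            buckets[s[depth]].append(s)
    out = done
    for ch in sorted(buckets):          # small alphabet: few buckets
        out += msd_string_sort(buckets[ch], depth + 1)
    return out

print(msd_string_sort(["fig", "apple", "fir", "app"]))
# ['app', 'apple', 'fig', 'fir']
```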

Incremental byte-by-byte analysis of binary types such as integers works the same way and is called *MSD-Radixsort*. While *MSD-Radixsort* supports early termination, *LSD-Radixsort* does not: it always processes all bytes. With a fixed number of *k* bytes in a specific binary type, the cost of *LSD-Radixsort* is \(\mathcal{O}(N \cdot k)\), which makes some people believe that Radixsort has cost linear in *N*. They claim that the \(\mathcal{O}(N\log{N})\) rule applies only to generic comparison sorts, but that is a misinterpretation which ignores that a fixed number of bytes limits the size *D* of the domain - for example, how many different integers the binary type can represent. For more information see Wikipedia.
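The \(\mathcal{O}(N \cdot k)\) structure is visible in a short sketch (for non-negative integers; an illustration of the general technique, not a tuned implementation): exactly *k* stable distribution passes are made, one per byte, least significant byte first, with no early termination.

```python
# LSD-Radixsort sketch: one stable counting/distribution pass per key
# byte, least significant first -- always all k passes, cost O(N * k).

def lsd_radix_sort(a, key_bytes=4):
    for shift in range(0, 8 * key_bytes, 8):   # k passes, no early exit
        buckets = [[] for _ in range(256)]
        for x in a:                            # stable distribution pass
            buckets[(x >> shift) & 0xFF].append(x)
        a = [x for b in buckets for x in b]    # collect in byte order
    return a

print(lsd_radix_sort([70000, 3, 256, 65536, 1]))
# [1, 3, 256, 65536, 70000]
```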

Synthetic sorting algorithms sort only keys that stem from a limited domain. This requires that the elements contain nothing but keys, that there is a limited number of distinct keys, and that those keys can be projected onto consecutive positions. Typically a linear projection is used, i.e. synthetic sorting handles interval-scale data, but unlike generic sorting algorithms it cannot handle ordinal-scale data (which requires mutual comparisons).

*Counting sort* starts with a vector of counters initialized to zero, then it takes one pass over the data to project each key to a position and increase a counter at this position. Once all elements have been counted, one pass over the counters *synthesizes* the keys in the order of the counter positions. This obviously has only linear \(\mathcal{O}(N)\) cost. For more information see Wikipedia.
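The two passes described above can be sketched as follows (a minimal illustration with a linear projection; the domain bounds `lo` and `hi` are parameters of my choosing):

```python
# Counting sort sketch: pass 1 projects each key to a counter position,
# pass 2 over the counters synthesizes the keys in sorted order.

def counting_sort(a, lo, hi):
    counts = [0] * (hi - lo + 1)
    for x in a:
        counts[x - lo] += 1           # linear projection key -> position
    out = []
    for pos, c in enumerate(counts):  # synthesize in position order
        out.extend([pos + lo] * c)
    return out

print(counting_sort([5, 3, 5, 1, 3, 3], 1, 5))  # [1, 3, 3, 3, 5, 5]
```

Note that the second pass visits every counter, so the cost is linear in *N* only as long as the domain size does not dominate *N*.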

If it is known that each key can only occur once, instead of a vector of counters a vector of bits is sufficient. After initializing all bits to 0, *Bit sort* takes one pass over the data to project each key to a position and set that bit to 1. Once all keys have been set, one pass over the bits *synthesizes* the unique keys in the order of the bit positions. This obviously has only linear \(\mathcal{O}(N)\) cost and is very memory parsimonious. Sorting by marking bits has been described in the chapter *Cracking the Oyster* of the famous book *Programming Pearls* by Jon Bentley (1999).
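A bit-marking sketch in the spirit of *Cracking the Oyster* (an illustration, not Bentley's original C code; Python's big integers stand in for a packed bit array):

```python
# Bit sort sketch for unique keys: one bit per possible key instead of a
# counter -- very memory-parsimonious when keys are known to be distinct.

def bit_sort(a, lo, hi):
    bits = 0
    for x in a:
        bits |= 1 << (x - lo)    # mark the key's projected position
    return [pos + lo             # synthesize keys in bit-position order
            for pos in range(hi - lo + 1)
            if (bits >> pos) & 1]

print(bit_sort([7, 2, 9, 4], 1, 10))  # [2, 4, 7, 9]
```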

The following examples demonstrate the power and limitations of specialized algorithms. Say we want to sort 1 million integers. We use R and our package bit.

```
# package 'bit' version 4.0.0 is not yet on CRAN
# hence installation from github
# devtools::install_github("truecluster/bit")
require(bit)
require(microbenchmark)
```

If the integers are sampled from a small domain (2 million values) with few or no duplicates, the specialized radixsort is faster than quicksort, and the even more specialized bitsort is faster still, particularly if we tell bitsort that there are no duplicates:

```
if (!exists("sx")) {
  x <- sample(2e6, 1e6)
  sx <- rbind(
      quicksort      = summary(microbenchmark(sort(x, method = "quick"), times = 10), unit = "ms")
    , radixsort      = summary(microbenchmark(sort(x, method = "radix"), times = 10), unit = "ms")
    , multip_bitsort = summary(microbenchmark(bit_sort(x, has.dup = TRUE ), times = 10), unit = "ms")
    , unique_bitsort = summary(microbenchmark(bit_sort(x, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sx[, -1], 1)
```

```
## min lq mean median uq max neval
## quicksort 74.7 75.3 76.8 75.9 78.1 82.3 10
## radixsort 52.2 55.2 56.4 56.1 57.7 60.5 10
## multip_bitsort 31.5 35.9 39.7 39.2 41.6 50.8 10
## unique_bitsort 16.4 17.0 17.8 17.7 18.2 20.5 10
```

However, if we sample from a huge domain (1 billion values), the more specialized bitsort algorithms take significantly longer:

```
if (!exists("sy")) {
  y <- sample(1e9, 1e6)
  sy <- rbind(
      quicksort      = summary(microbenchmark(sort(y, method = "quick"), times = 10), unit = "ms")
    , radixsort      = summary(microbenchmark(sort(y, method = "radix"), times = 10), unit = "ms")
    , multip_bitsort = summary(microbenchmark(bit_sort(y, has.dup = TRUE ), times = 10), unit = "ms")
    , unique_bitsort = summary(microbenchmark(bit_sort(y, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sy[, -1], 1)
```

```
## min lq mean median uq max neval
## quicksort 73.2 73.4 74.8 73.8 76.1 79.4 10
## radixsort 41.1 43.3 48.2 47.9 53.4 55.8 10
## multip_bitsort 145.9 146.8 148.9 147.2 150.7 158.9 10
## unique_bitsort 1487.4 1523.1 1561.6 1574.2 1593.9 1633.0 10
```

---

1. although *Quicksort* has only average-case complexity \(\mathcal{O}(N\log{N})\) and although concurrent conquer is difficult
2. A standard trick for reducing the buffer need of such algorithms based on merging (*Mergesort*) is allocating 50% buffer: use this buffer first for sorting the left half, then for sorting the right half, and finally for merging both halves. A limitation of this approach is that both halves cannot be sorted concurrently.
3. This implies that if in a given system sufficient memory is available that is currently not needed for other tasks, then investing negligible buffer into *[Walk|Jump]sort* or little buffer into *[Frog|Gecko]sort[2|3]* can eliminate the remaining risk of quadratic runtime.
4. The original publications do not specify stability, but the implementors clarify: *"So far there is no stable implementation planned. A stable implementation might be difficult due to the algorithm's design."*
5. The inplace version needs at least \(\mathcal{O}(\sqrt{N})\) buffer.
6. For sorting doubles, the parallel implementation IPS4o is faster; we have not compared it to a parallel implementation of *Pattern-Defeating Quicksort*.
7. This improves the \(\mathcal{O}(N\log_2{d})\) algorithm to \(\mathcal{O}(N\log_3{d})\) regarding moves.
8. *greeNsort^{®}* research discovered unneeded moves in certain situations.
9. Fun fact: the optimization published by Edelkamp & Weiß (2016) was already suggested in a little-known paper of Hoare (1962), in which he gives context and explanations for his 1961 algorithm.

Copyright © 2020 Dr. Jens Oehlschlägel - All rights reserved