A portfolio of efficient algorithms for different sorting disciplines is the basis for sustainable sorting.

# Generic sorting

The most generic (and expensive) sorting algorithms have the following features:

• generic sorting sorts by comparison
• generic sorting is stable
• generic sorting can sort elements of varying size
• generic sorting has worst-case complexity $$\mathcal{O}(N\log{N})$$
• generic sorting can be implemented with concurrent divide and concurrent conquer

The greeNsort® Portfolio provides algorithms for different sorting disciplines, ranked here from most generic to most specialized.

## stable size-varying

A popular prior-art method is an indirect Quicksort, where pointers to size-varying elements are swapped.[1] This is relatively memory-parsimonious, particularly if elements are big; however, it comes with high random-access CPU costs. The greeNsort® research compares this to a direct Mergesort (100% buffer), which can be more sustainable.

### 50% buffer

Adapting the main greeNsort® algorithm Frogsort to the size-varying case yields an algorithm that needs only 50% buffer and still fulfills all of the above criteria for a generic sorting algorithm.

## stable equally-sized

Most prior-art algorithms deal with equally sized elements only.

### 100% buffer

The standard methods for stable sorting which allow for concurrent execution use 100% buffer, for example Mergesort as used for stable sorting in C++.[2] Mergesort can easily be tuned for adaptivity: the more the data is presorted, the more comparison cost is smoothly skipped. The greeNsort® algorithm Omitsort improves upon this by also smoothly skipping the cost of movements, which smoothly brings the algorithm towards $$\mathcal{O}(N)$$.
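The comparison-skipping part of this adaptivity can be sketched as follows. This is a minimal illustrative Mergesort with 100% buffer, not greeNsort's Omitsort (which additionally omits movements); the function name and structure are for illustration only:

```python
def adaptive_mergesort(a):
    """Mergesort with 100% buffer that skips merge work on presorted runs.

    If the last element of the left run is <= the first element of the
    right run, the two runs are already in order and the merge comparisons
    are skipped entirely; fully presorted input needs only O(N) comparisons.
    """
    n = len(a)
    if n < 2:
        return list(a)
    mid = n // 2
    left = adaptive_mergesort(a[:mid])
    right = adaptive_mergesort(a[mid:])
    if left[-1] <= right[0]:          # runs already ordered: skip the merge
        return left + right
    out = []                          # 100% buffer
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:        # stable: prefer left element on ties
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out
```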

### 50% buffer

Timsort, as used in Python and Java, is a highly tuned adaptive natural Mergesort which, towards the best case of presorted data, smoothly approaches $$\mathcal{O}(N)$$ cost and needs only 50% buffer. However, beyond its implementation complexity, the disadvantages of Timsort are much higher cost on really unsorted data and strongly serial logic which hampers concurrency. The greeNsort® algorithms Frogsort and Geckosort in variants 0 and 1 achieve the generality of Mergesort with only 50% buffer and without such disadvantages: they are automatically adaptive (without tuning), albeit $$\mathcal{O}(N\log{N})$$.

### low buffer

The greeNsort® research is not aware of prior-art algorithms that can match the speed of standard Mergesort with less than 50% buffer. The greeNsort® algorithms Frogsort and Geckosort in variants 2 and 3 can sort faster than standard Mergesort with only about 13% buffer; further reduction of buffer is possible.

### sqrt buffer

Academic research was for decades obsessed with finding Sedgewick's Holy Grail of Sorting: a truly stable inplace-Mergesort that can compete with standard Mergesort. Only recently was Andrey Astrelin able to implement Grailsort, an algorithm of Huang & Langston, such that it was not more than 1.5 times slower than standard Mergesort. Astrelin himself shows that this is still academic: he provides a variant (Sqrtsort) that uses negligible $$\mathcal{O}(\sqrt{N})$$ buffer and is only 1.3 times slower than standard Mergesort. The greeNsort® algorithms Walksort and Jumpsort also work with as little as $$\mathcal{O}(\sqrt{N})$$ buffer and achieve approximately the speed of standard Mergesort.

## unstable equally-sized

It is well known that sacrificing stability enables effective inplace-partitioning of equally-sized elements. Further sacrificing worst-case $$\mathcal{O}(N\log{N})$$ for only average-case $$\mathcal{O}(N\log{N})$$ (by using random pivots instead of deterministic median-pivots) enables inplace-partitioning with Quicksort at about the speed of Mergesort. In other words: when these sacrifices are acceptable, Quicksort tends to be more sustainable than Mergesort according to the greeNsort® footprint definition.[3]
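The trade-off can be sketched as follows. This is an illustrative Hoare-style Quicksort with random pivots (unstable, in-place, average-case $$\mathcal{O}(N\log{N})$$), not a greeNsort implementation:

```python
import random

def quicksort_inplace(a, lo=0, hi=None):
    """Unstable in-place Quicksort; random pivots defeat adversarial input."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        p = a[random.randint(lo, hi)]   # random pivot: average O(N log N)
        i, j = lo, hi
        while i <= j:                    # Hoare-style partition exchange
            while a[i] < p:
                i += 1
            while a[j] > p:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]  # swap elements in place, no buffer
                i += 1
                j -= 1
        # recurse on the smaller side, loop on the larger:
        # bounds the stack depth to O(log N)
        if j - lo < hi - i:
            quicksort_inplace(a, lo, j)
            lo = i
        else:
            quicksort_inplace(a, i, hi)
            hi = j
```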

### unstable sqrt buffer

Recently, Super Scalar Sample Sort (SSSS, or S4 for short) has been developed with the goal of replacing standard sorting libraries. While it is faster than current standard sorting libraries, it is not stable,[4] it is more complex, it requires more buffer,[5] and its serial[6] implementation (IS4o) is slower than the best algorithms in the following section.

### unstable no buffer

The greeNsort® algorithms Zacksort and Zucksort provide simple and elegant solutions to a question open since Hoare's 1961 inception of Quicksort: how to obtain early termination on ties without the cost of extra operations slowing down the algorithm on non-tied data.

A bit of history: the original Quicksort as published by Hoare (1961) seemed simple and almost perfect, but was vulnerable to quadratic runtime on certain data inputs. Sedgewick (1977) compared different alternatives and recommended a (modified) Quicksort2 of Singleton (1969), which had average-case $$\mathcal{O}(N\log{N})$$ for all data inputs, but no longer early-terminated to $$\mathcal{O}(N\log{d})$$ in the tied case. Threeway Quicksort by Wegner (1985), improved and popularized by Bentley & McIlroy (1993), re-introduced early termination on ties by isolating a third partition with pivot-ties, but due to extra operations was slower on untied data (Quicksort3). Yaroslavskiy's (2009) faster dual-pivot Quicksort (Dupisort) used such extra operations to create a real third partition,[7] but the complex[8] algorithm is strongly serial and probably impossible to implement branchless. Edelkamp & Weiß (2016) published the much faster, simple, branchless Block-Quicksort (Quicksort2B),[9] but only with a rudimentary and expensive early-termination mechanism. Orson Peters (2015) combined a branchless algorithm with proper tuning for ties and proper tuning for presorted data; the tuning overhead of extra operations in his Pattern-Defeating Quicksort is very small, so it is close to optimal, and I recommend his excellent C++ implementation on GitHub.
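The threeway partitioning re-introduced by Wegner and Bentley & McIlroy can be sketched as follows. This is an illustrative Dutch-national-flag variant showing early termination on ties, not the Zacksort/Zucksort method:

```python
def quicksort3(a, lo=0, hi=None):
    """Threeway Quicksort: pivot ties are isolated in a middle partition,
    so with only d distinct keys the cost drops towards O(N log d)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = a[(lo + hi) // 2]
    lt, i, gt = lo, lo, hi
    while i <= gt:                     # Dutch-national-flag partitioning
        if a[i] < p:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > p:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1                     # ties stay put and are never recursed on
    quicksort3(a, lo, lt - 1)          # recurse only into < and > partitions
    quicksort3(a, gt + 1, hi)
```

On all-tied input a single partitioning pass finishes the job, while the extra bookkeeping per element is what makes this scheme slower than two-way Quicksort on untied data.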

# Specific sorting

Sorting algorithms exploiting specific situations play an important role in reducing the footprint of sorting. greeNsort® focuses on generic algorithms; however, it is important to know the following classes of specific algorithms, which do not sort by comparisons.

## analytic algorithms

Analytic sorting algorithms sort elements by incrementally analyzing the keys for recursive partitioning; in other words, they require the specific situation where the sorting order can be determined by analyzing the keys alone, without mutually comparing them. Malte Skarupke (2015) has stressed the importance of offering a convenient API for as many binary data types as possible; his C++ code can be found here.

### char-by-char

Incremental analysis of string keys is known to everyone who has seen a phone book. Strings are recursively partitioned character-by-character, and sorting ends as soon as the number of remaining elements is one (or small enough to delegate the remaining work to another algorithm). This method can handle strings of varying lengths, but unlike generic sorting it cannot handle locale-specific collation orders. The straightforward way of doing this is partitioning into $$d_i$$ buckets, where $$d_i$$ is the number of distinct characters at the i-th position of the string. This works well for small alphabets; if the number of distinct characters becomes prohibitively high, an alternative related to threeway Quicksort, with a fan-out between 2 and 3, is Multi-key Quicksort.
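The character-by-character partitioning can be sketched as follows. This is an illustrative bucket-based MSD string sort (function name and bucket representation are for illustration; production implementations avoid per-bucket lists):

```python
from collections import defaultdict

def msd_string_sort(strings, depth=0):
    """Recursive character-by-character (MSD) string sort, phone-book style.

    Partitions by the character at position `depth`; strings exhausted at
    this depth come first. Recursion stops as soon as a bucket holds a
    single string, so keys are analyzed only as far as needed.
    """
    if len(strings) < 2:
        return list(strings)
    done = [s for s in strings if len(s) == depth]   # exhausted strings first
    buckets = defaultdict(list)
    for s in strings:
        if len(s) > depth:
            buckets[s[depth]].append(s)
    out = done
    for ch in sorted(buckets):       # bucket order = character order
        out += msd_string_sort(buckets[ch], depth + 1)
    return out
```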

### byte-by-byte

Incremental byte-by-byte analysis of binary types such as integers works the same way and is called MSB-Radixsort. While MSB-Radixsort supports early termination, LSD-Radixsort does not: it always processes all bytes. With a fixed number of k bytes in a specific binary type, the cost of LSD-Radixsort is $$\mathcal{O}(N \cdot k)$$, which makes some people believe that Radixsort has cost linear in N. They claim that the $$\mathcal{O}(N\log{N})$$ rule applies only to generic comparison sorts, but that is a misinterpretation which ignores that a fixed number of bytes limits the size D of the domain, for example how many different integers the binary type can represent. For more information see Wikipedia.
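The fixed $$\mathcal{O}(N \cdot k)$$ cost of LSD-Radixsort can be sketched as follows, assuming non-negative integers of a fixed byte width (an illustrative sketch, not a tuned implementation):

```python
def lsd_radix_sort(a, width=4):
    """LSD Radixsort over fixed-width non-negative integers, byte by byte.

    Processes all `width` bytes regardless of the data (no early
    termination), for a total cost of O(N * width). Each pass is a stable
    distribution into 256 buckets, so earlier byte orders are preserved.
    """
    for shift in range(0, 8 * width, 8):       # least significant byte first
        buckets = [[] for _ in range(256)]
        for x in a:
            buckets[(x >> shift) & 0xFF].append(x)
        a = [x for b in buckets for x in b]    # stable concatenation
    return a
```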

## synthetic algorithms

Synthetic sorting algorithms sort only keys that stem from a limited domain. This requires that elements contain nothing but keys, that there is a limited number of distinct keys, and that those keys can be projected onto consecutive positions. Typically a linear projection is used, i.e. synthetic sorting handles interval-scale data, but unlike generic sorting algorithms it cannot handle ordinal-scale data (which requires mutual comparisons).

### counting

Counting sort starts with a vector of counters initialized to zero, then it takes one pass over the data to project each key to a position and increase the counter at this position. Once all elements have been counted, one pass over the counters synthesizes the keys in the order of the counter positions. This obviously has only linear $$\mathcal{O}(N)$$ cost (plus the cost of one pass over the counters, i.e. over the domain). For more information see Wikipedia.
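The two passes described above can be sketched as follows, assuming integer keys in `range(domain)` (names are illustrative):

```python
def counting_sort(keys, domain):
    """Counting sort for integer keys in range(domain): O(N + domain)."""
    counts = [0] * domain
    for k in keys:                      # pass 1: project key, count it
        counts[k] += 1
    out = []
    for k, c in enumerate(counts):      # pass 2: synthesize keys in order
        out.extend([k] * c)
    return out
```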

### marking

If it is known that each key can only occur once, then instead of a vector of counters a vector of bits is sufficient. After initializing all bits to 0, Bit sort takes one pass over the data to project each key to a position and set that bit to 1. Once all keys have been set, one pass over the bits synthesizes the unique keys in the order of the bit positions. This obviously has only linear $$\mathcal{O}(N)$$ cost and is very memory-parsimonious. Sorting by marking bits has been described in the chapter Cracking the Oyster of the famous book Programming Pearls by Jon Bentley (1999).
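The marking idea can be sketched as follows, assuming unique integer keys in `range(domain)` (an illustrative sketch, not the `bit` package implementation used below):

```python
def bit_mark_sort(keys, domain):
    """Sort unique integer keys in range(domain) by marking bits.

    One bit per possible key: set each key's bit in a single pass over the
    data, then synthesize the sorted keys from the bit positions. Memory
    use is only domain/8 bytes.
    """
    bits = bytearray(domain // 8 + 1)
    for k in keys:                        # mark each (unique) key
        bits[k >> 3] |= 1 << (k & 7)
    return [k for k in range(domain)      # scan bits in position order
            if bits[k >> 3] & (1 << (k & 7))]
```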

## an example with specialized algorithms

The following examples demonstrate the power and limitations of specialized algorithms. Say we want to sort 1 million integers. We use R and our package bit.

```r
# package 'bit' version 4.0.0 is not yet on CRAN
# hence installation from github
# devtools::install_github("truecluster/bit")
require(bit)
require(microbenchmark)
```

If the integers are sampled from a small domain (2 million values) with few or no duplicates, the specialized radixsort is faster than quicksort, and the even more specialized bitsort is faster still, particularly if we tell bitsort that there are no duplicates in this example:

```r
if (!exists("sx")) {
  x <- sample(2e6, 1e6)
  sx <- rbind(
    quicksort      = summary(microbenchmark(sort(x, method = "quick"), times = 10), unit = "ms")
  , radixsort      = summary(microbenchmark(sort(x, method = "radix"), times = 10), unit = "ms")
  , multip_bitsort = summary(microbenchmark(bit_sort(x, has.dup = TRUE ), times = 10), unit = "ms")
  , unique_bitsort = summary(microbenchmark(bit_sort(x, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sx[,-1], 1)
##                 min   lq mean median   uq  max neval
## quicksort      74.7 75.3 76.8   75.9 78.1 82.3    10
## radixsort      52.2 55.2 56.4   56.1 57.7 60.5    10
## multip_bitsort 31.5 35.9 39.7   39.2 41.6 50.8    10
## unique_bitsort 16.4 17.0 17.8   17.7 18.2 20.5    10
```

However, if we sample from a huge domain (1 billion values), the more specialized bitsort algorithms take significantly longer:

```r
if (!exists("sy")) {
  y <- sample(1e9, 1e6)
  sy <- rbind(
    quicksort      = summary(microbenchmark(sort(y, method = "quick"), times = 10), unit = "ms")
  , radixsort      = summary(microbenchmark(sort(y, method = "radix"), times = 10), unit = "ms")
  , multip_bitsort = summary(microbenchmark(bit_sort(y, has.dup = TRUE ), times = 10), unit = "ms")
  , unique_bitsort = summary(microbenchmark(bit_sort(y, has.dup = FALSE), times = 10), unit = "ms")
  )
}
round(sy[,-1], 1)
##                   min     lq   mean median     uq    max neval
## quicksort        73.2   73.4   74.8   73.8   76.1   79.4    10
## radixsort        41.1   43.3   48.2   47.9   53.4   55.8    10
## multip_bitsort  145.9  146.8  148.9  147.2  150.7  158.9    10
## unique_bitsort 1487.4 1523.1 1561.6 1574.2 1593.9 1633.0    10
```

1. Although Quicksort has only average-case complexity $$\mathcal{O}(N\log{N})$$ and although concurrent conquer is difficult.

2. A standard trick for reducing the buffer need of such merging-based algorithms (Mergesort) is allocating 50% buffer, using this buffer first for sorting the left half, then for sorting the right half, and finally merging both halves; a limitation of this approach is that the two halves cannot be sorted concurrently.

3. This implies that if in a given system sufficient memory is available that is currently not needed for other tasks, then investing negligible buffer into [Walk|Jump]sort or little buffer into [Frog|Gecko]sort[2|3] can eliminate the remaining risk of quadratic runtime.

4. The original publications do not specify stability, but the implementors clarify: “So far there is no stable implementation planned. A stable implementation might be difficult due to the algorithm’s design.”

5. The inplace version needs at least $$\mathcal{O}(\sqrt{N})$$ buffer.

6. Sorting doubles, the parallel implementation IPS4o is faster; we have not compared it to a parallel implementation of Pattern-Defeating Quicksort.

7. This improves the $$\mathcal{O}(N\log_2{d})$$ algorithm to $$\mathcal{O}(N\log_3{d})$$ regarding moves.

8. greeNsort® research discovered unneeded moves in certain situations.

9. Fun fact: the optimization published by Edelkamp & Weiß (2016) was already suggested in a little-known paper by Hoare (1962), in which he gives context and explanations for his 1961 algorithm.