ThegreeNsortin-memory sorting algorithms improve merging, partitioning and introduce a novel class of in-memory sorting algorithms^{®}

Sorting problems and certain statistical experiments (…) require a memory for the material which is being treated — John von Neumann

The *greeNsort ^{®}* research shows how to construct sorting algorithms based on merging with less memory. Further the

**The greeNsort^{®} algorithms achieve better sustainability than the combined teaching on Mergesort by Donald Knuth, Robert Sedgewick, Jyrki Katajainen, Huang&Langston and Tim Peters. The greeNsort^{®} research completes an open topic since the invention of Mergesort by John von Neumann in 1945.**

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. — Sir Charles Antony Richard Hoare

The *greeNsort ^{®}* research shows that an algorithm based on partitioning (\(\mathcal{O}(N\log{N})\)) can smoothly be tuned to linear cost for the best-case of purely tied keys (\(\mathcal{O}(N\log{d})\), where

**The greeNsort^{®} algorithms achieve \(\mathcal{O}(N\log{d})\) early termination without extra overhead, unlike the Quicksort algorithms taught by R. C. Singleton (1969), Robert Sedgewick (1977), Wegner (1985) and Bentley&McIlroy (1993), Vladimir Yaroslavskiy (2009) and Edelkamp&Weiß (2016)^{1}. The greeNsort^{®} research hence completes an open topic since the invention of Quicksort by C. A. R. Hoare in 1961.**

The psychological profiling [of a programmer] is mostly the ability to shift levels of abstraction, from low level to high level. To see something in the small and to see something in the large. — Donald Knuth

The *greeNsort ^{®}* research has developed a new class of algorithms spanning merging and partitioning

- some embodiments are simple and elegant
- some embodiments are robust and adaptive
- some embodiments are more sustainable than the prior-art: have better trade-offs between speed (energy) and memory requirements (hardware)
- some embodiments are more adaptive than the prior-art: have better trade-offs between worst-case and best-case performance

Beware of bugs in the above code; I have only proved it correct, not tried it. — Donald Knuth

Without the genius of Donald Knuth we’d better verify our conclusions, even correct proofs can be based on wrong assumptions. Our testbed includes 65 *greeNsort ^{®}* algorithms and 50 prior-art algorithms of comparable implementation quality

- all implementations are single threaded (however,
*greeNsort*algorithms are designed for parallel execution)^{®} - most implementations are binary algorithms (with a few 3-ary and 4-ary exceptions)
- the algorithms have been tested on numerous data patterns
- the algorithms have been tested for correctness (and stable ones for correct stability)

- the algorithms have been tested for performance (they return their runtime, memory and footprint)
- the algorithms have been tested for correct memory access through the Valgrind checker

- the algorithms have been validated by Intel RAPL energy measurements
- the algorithms have been validated against numerous prior-art algorithms

Above all else show the data — Edward Tufte

The best *greeNsort ^{®}* algorithms have been compared against the best prior-art algorithms for a couple of important sorting disciplines, for details see the

The prior-art shows a typical **space-time trade-of** from *Grailsort* (Grail: 0% buffer) being slowest, over *Sqrtsort* (Sqrt: almost 0% buffer) and *Timsort*^{3} (Tim: 50% buffer) to *Mergesort* (Knuth: 100% buffer) being fastest. The *greeNsort ^{®}* algorithms are consistently faster than the prior-art given equal amount of buffer. Note that

**How much the greeNsort^{®} algorithms are better can bee seen when looking at the fair Footprint measure instead of Runtime** (the latter favors algorithms using much memory whereas the former gives a fair comparison): Compared to the fastest

We call this the race to idle. The faster we can complete the workload and get the computer back to idle, the more energy we can save. — Intel

^{5}

Multi-threading not only delivers better performance but also better energy-efficiency. If we can achieve this with less hardware and embedded carbon: all the better. However, in some systems such as latency-critical servers we do not want a single session do grab all available hardware threads, there we want one or just a few threads. Hence both is important: efficient single-threaded and multi-threaded execution. *greeNsort ^{®}* algorithms match the prior art in multi-threaded efficiency, beat the prior-art in single-threaded efficiency and require less memory hence less embedded energy in the hardware.

In the following we compare multi-threaded implementations of *Mergesort* (Knuth’s merge), *Frogsort0*, *Frogsort1*, *Frogsort2* and *Frogsort3*. *Mergesort* needs 100% buffer, *Frogsort0* needs 50% buffer, *Frogsort2* is used with 14% buffer and *Frogsort3* is used with 12.5% buffer, the lowest hardware-need hence allowing the most sustainable use of aged computers. For each combination of algorithm and degree of threading we ran 50 experiments with randomly permuted input on a I7-8565U (4-cores, 8 hyperthreads) and excluded the two most extreme outliers.

The first chart shows that regarding runtime our implementation of *Mergesort* (black) scales very-well with the number of threads used, the same is true for ,*Frogsort0* (red), *Frogsort1* (blue) and *Frogsort3* (cyan). With serial versions *Frogsort2* (green) is much faster, this advantage dilutes with increasing number of threads. We can also see what Intel said: multi-threading saves energy (on sufficiently large data sets), here down to about 50%.

The next chart compares *Frogsort2* and *Quicksort*. We see that *Quicksort* scaling has high variance and rarely achieves an appropriate speed-up (only if random pivots on both branches are equally good and fast and hence avoid waiting). By contrast, *Frogsort2* - like all split&merge algorithms - achieves reliably good work-balance and scaling.

Now we compare just the most important algorithms just for serial and maximally multithreaded runs. Regarding runtime and energy *Frogsort2* is the clear winner in serial sorting, for parallel sorting *Frogsort2* is almost as good as *Mergesort* and clearly better than *Quicksort*.

Regarding footprint and efootprint *Frogsort2* is the clear winner in serial and parallel sorting over *Mergesort* and *Quicksort*, hence achieves the most sustainable combination of runtime energy and hardware cost.

Simple is better than complex … If the implementation is hard to explain, it’s a bad idea. — Tim Peters

Some languages and systems use *Timsort* behind their core sorting API (*Python*, *Java*, *Android* and *Rust*). Tim Peters - the author of *Timsort* is a clever and nice guy, and his algorithm is a masterpiece of algorithm engineering. After we have said these pleasantries, it is time to say clearly what *Timsort* is: rather an algorithm good at avoiding work on pre-sorted data than good at sorting non-sorted data. We believe that the main use-case of a sorting API is converting non-sorted data into sorted data, and not minimizing the work in case the data is already sorted. If data is already sorted, there are better ways - metadata - to avoid sorting it again. The usual justification for *Timsort* as the core algorithm behind a sorting API is the expectation that *“More often than not, data will have some preexisting internal structure”*. In the following we will deconstruct this argument, explain the problems of *Timsort* and show that it rarely is better than other algorithms.

A good technical writer, trying not to be obvious about it, but says everything twice: formally and informally. Or maybe three times. — Donald Knuth

*Timsort*is based on*Natural-Mergesort*, which is biased towards good performance on presorted data and rather*not so good performance on non-sorted data*- the merge strategy of
*Timsort*avoids movements on presorted data, but*duplicates movements when it truly comes to merging* *Timsort*heavily invests overhead and comparisons in the hope to save forthcoming comparisons and movements,*for truly non-sorted data these investments are misinvested costs*^{6}- while
*Timsort*shows impressive speedups in presorted data (up to factor 20 faster than*Mergesort*), is is usually overlooked that*Mergesort*even without adaptivity-tuning is very adaptive to presorting (say factor 6), and hence*Timsort**realizes only small absolute savings on presorted data but big absolute losses when it comes to sorting tasks)* *Timsort*shines if data is locally presorted and globally doesn’t need merging,*on real real-world data Timsort is slow**Timsort*is so complicated,*that an algorithmic bug escaped attention for 13 years*^{7}and the attempt to fix it in Java resulted in yet another bug discovered 3 years later^{8}- implementing
*Timsort*is*rather tricky and it is inherently serial*

** Timsort is king in presorted data, but when sorting algorithms are needed for sorting APIs in a multicore-global-warming-world, then the greeNsort^{®} algorithms are faster, more sustainable and can be parallelized. Furthermore the tuned greeNsort^{®} algorithms eat deep into Timsorts domain and sometimes beat Timsort in adaptivity**, as the following sections show.

If the standard is lousy, then develop another standard — Edward Tufte

While assuming some “preexisting internal structure” is reasonable, for a certain percentage of API-calls, *Timsort* wants extreme structure: *Timsort* can beat *Mergesort* and other algorithms if data is locally perfectly presorted and has strong global structure, such that few runs actually need merging and mostly just can be copied together. This is not what real-world data looks like, and if so, it does not need sorting. We choose to evaluate algorithms with a mix of four data generators that provide a realistic degree of structure:

**asclocal**: \(\sqrt{N}\) locally sorted but overlapping sequences of length \(\sqrt{N}\) need global merging (the first half of the work is already done, the second not)

**ascglobal**: \(\sqrt{N}\) globally ascending blocks each with \(\sqrt{N}\)randomly permuted keys need local sorting (the first half of the work is sorting each block, the second part - global merging - can be skipped)**tieboot**: a bootstrap sample of \(1..N\) keys, this implies about \(2/3\) unique keys and \(1/3\) ties**tiesqrt**: \(N\) keys are sampled from only \(\sqrt{N}\) distinct keys, hence we have about \(\sqrt{N}\) tied values of each

The first two of these generators give fair opportunity to *Timsort* for exploiting preexisting internal structure. The second two generators provide week and strong opportunity to partitioning sorts such as *Quicksort* for exploiting ties. We expected *Timsort* to benefit from this structured portfolio, but *Timsort* not even convinces on the two semi-presorted use-cases. To our surprise *Timsort* was not faster than *Knuthsort* on the half-presorted data structures:

This is because all compared algorithms are faster on the easier *tiesqrt*, *asclocal* and *ascglobal* inputs than on the more difficult *tieboot* inputs. All competitors are faster than *Timsort*, even on the two half-presorted data structures *asclocal* and *ascglobal*. The *greeNsort ^{®}* algorithms are also more sustainable, the ratios relative to

Some circumstantial evidence is very strong, as when you find a trout in the milk — Henry David Thoreau

The following experiments shows how easily *Timsorts* performance breaks down if the input data is not perfectly presorted. We take perfectly ascending data, and disturb it with increasing amount of uniform random data defined by a parameter \(e\). We perform 20 experiments for \(e \in \{ 0, 1,\ldots, 19\}\) each replicated \(10 \cdot (20 - e)\) times, i.e. the easier experiments with shorter runtimes are replicated more often. The input data is generated by the R-function `seq(n-2^e, 0, length=n) + runif(n, 0, 2*(2^e-1))`

and visualized in the following chart:

Note that only in the few red experiments \(e \in {0,1,2,3}\) with very little random noise *Timsort* is faster than classical *Mergesort*. *greeNsort ^{®}* algorithms also beat

*Timsorts* performance breaks down even more if the input data is reverse-sorted. We repeat the above experiments with descending data disturbed by an increasing amount of noise. Let the charts speak for themselves:

I

stillhave to lecture him [Guido van Rossum] about his true vision of what Python should be — Tim Peters

Now what is the true vision of what *Timsort* should be? Given that is cannot compete with many other sorting algorithms. Let’s rehearse where *Timsort* beats them all: if the data is locally presorted and globally needs just re-arrangement (copying, not merging) of identified quite non-overlapping runs. The implementation note of Javas stable sorting API says about *Timsort*: *“It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array”*. This is a call to waste energy, why should we ask *Timsort* to identify existing runs if we know them beforehand? Lets take the simple example of two sorted arrays and compare the costs:

Proper way | Timsort way | |
---|---|---|

non-overlapping sequences | N allocations and N copies for concatenation | N allocations and N copies for concatenation, N comparisons for identifying runs, if not copied in the right sequence: another 0.5 N allocations and 1.5 copies |

overlapping sequences | N allocations, N comparisons and N copies for merging | N allocations and N copies for concatenation, N comparisons for identifying runs, another 0.5 allocations and N comparisons and 1.5 N copies for merging |

So even in *Timsorts* domain of dealing with presorted non-overlapping sequences the Java suggestion is nonsensical. It would make more sense to split the serial *Timsort* into two easily parallelizable APIs

- a first API that identifies multiple runs of possibly varying lengths
- a second API that takes identified runs of possibly varying lengths, generates a merge-plan and executes it

However, in certain input patterns *Timsorts* restriction to only merge neighbor runs will turn out to be suboptimal and other strategies will be superior. But if the first API is not used in the Java example and the second API is heavily modified to merge non-neighbor runs, what is left for *Timsort*? Well, the *“proper way of merging”* fails, if the input sequences unexpectedly turn out not to be completely sorted! The *“Timsort way”* is robust against input problems. But again: then we should rather verify each input sequence with *Timsort*, and then call the second API which executes an optimal plan.

**Finally we have peeled Timsort to its core: it is a robust \(\mathcal{O}(N)\) sorting verifier, most likely the best one possible, because it gracefully degrades to \(\mathcal{O}(N\log{N})\) sorting: Tim peters did a great Job on this. But when sorting algorithms are needed for sorting APIs in a multicore-global-warming-world, then the greeNsort^{®} algorithms are faster, more sustainable and can be parallelized.**

Results! Why, man, I have gotten a lot of results. I know several thousand things that won’t work. — Thomas Edison

All those algorithms have been fairly compared by us with similar implementation quality. Orson Peters (2015)

*Pattern Defeating Quicksort*- see*Pdqsort*- is not on the above list, because his extremely professional and tuned C++ implementation cannot be compared. Theoretically the*greeNsort*algorithms are slightly more efficient however, this has yet to be empirically checked↩︎^{®}Exceptions are Tim Peters

*Timsort*the implementation of Timothy Van Slyke is used, and Huang&Langston’s*Grailsort*and related algorithms implemented by Andrey Astrelin↩︎Note that

*Timsort*is lower than proper standard textbook*Mergesort*on all four input data patterns, including globally presorted data and locally presorted data↩︎Note that

*Sqrtsort*can only sort equally sized elements, whereas*Frogsort2*can handle elements of different size↩︎The implementers of

*Timsort*in*Rust*have realized that and implemented*Timsort*without ‘galopping mode’.↩︎Gow et. al. (2015) “OpenJDK’s Java.utils.Collection.sort() Is Broken: The Good, the Bad and the Worst Case”↩︎

Auger et. al. (2018) “On the Worst-Case Complexity of TimSort”↩︎

Copyright © 2020 Dr. Jens Oehlschlägel - All rights reserved - Terms - Privacy - Impressum