Call to action: apply greeNsort knowledge to your artifacts

Many people in different roles can contribute to a reduction of CO2-footprint in sorting: CS researchers and teachers, library-maintainers, application-programmers.

We comments on other languages, corrections, updates and - most importantly - improvements at feedback(at)greensort.org .

Prior art

The greeNsort^® portfolio of algorithms provides more sustainable alternatives to implementing your own sorting code with popular algorithms such as Quicksort2, Quicksort3, Mergesort, Timsort, Peeksort, Powersort.

Available algorithms

The following greeNsort^® algorithms are available to create better implementations:

stable 100% buffer Omitsort or Octosort are efficient adaptive (\(\mathcal{O}(N)\)) replacements for Mergesorts (and faster than Timsort in the \(\mathcal{O}(N\log{N})\) cases)
stable 100% buffer Kiwisort or Swansort are replacements for Mergesorts when \(\mathcal{O}(N\log{D})\) adaptivity to ties is paramount
stable ≤50% buffer Frogsort Geckosort and Squidsort are efficient memory-parsimonious replacements for Mergesorts (incl. Timsort)
stable √N buffer Walksort or Jumpsort are efficient replacements for in-place Mergesort, Grailsort and similar algorithms
unstable 0% buffer ZacksortB or ZucksortB are efficient replacements for Quicksort with cheap \(\mathcal{O}(N\log{D})\) adaptivity to ties

For more information on the algorithms see the Portfolio.

Choosing an algorithm

Insensitive choice of algorithms and data structures lead to significant energy wasting — Fethullah Goekkus¹

When choosing sorting algorithms consider the following:

parallel

if possible use parallel algorithms: preferably IPS4o or good parallel radix algorithms, a parallel Mergesort are most easy to implement but Frogsorts (also easy) provide a better compromise between saving memory and parallel execution speed
On single user workstations - e.g. for Data science: prefer algorithms that can parallelize with little buffer such as the IPS4o, PFrogsort2 or PWalksort, this saves hardware (or cloud resources)
For predictable throughput on servers serving many clients: use efficient serial algorithms, if possible radix-sorts such as Skasort or generic algorithms such as unstable Ducksort or stable Frogsort2.

stable

if sorting limited domains allow it: use a bitsort, a counting sort or a MSB radix sort like Skasort
if Speed and Energy minimization is paramount: Frogsort2 or once available a stable version of PS4o or IPS4o
if Adaptivity to presorting is paramount: Octosort (max-adaptivity) or Squidsort (min-memory)
if Saving memory is paramount: Walksort or Jumpsort (distance-reducing at expense of extra-movements)
if Adaptivity to ties is paramount: look at Kiwisort, Swansort or in rare cases Storksort

unstable

whenever possible use IPS4o (or in rare cases a superior radix-sort), or Pdqsort
if you need to implement yourself use Ducksort
if partial sorting is needed: Duckpart

Improving libraries

Compatibility means deliberately repeating other people’s mistakes — David Wheeler

All the following programming languages had their reasons to shape their sorting APIs and implementations as they did, however, that was before the climate crisis. Now programming languages as a whole are measured by their energy efficiency, see Energy Efficiency across Programming Languages, and - no surprise - it turns out that some languages seem more efficient than others, particularly compiled languages, but also Java. Sadly, some of the languages gaining popularity for expensive big-data applications are interpreted (R and Python). If - for some reason - people need to stick with those instead of switching to more efficient languages (e.g. Julia), then it becomes even more important to maximize the efficiency of the compiled sorting libraries used.

Three types of actions can improve sorting efficiency in standard libraries:

improving implementations (within unchanged interfaces)
enhancing interfaces (new parameters, new interfaces)
replacing interfaces (changing existing interfaces for the better)
gives quick wins, 2. should be doable, and perhaps even 3.

Before we comment on particular languages, we shortly

recall the most important greeNsort^® algorithms
list the quick win actions for each language
describe the ideal interfaces
comment on string sorting
comment on multithreading

Quick win actions

The following quick-wins can be achieved by improving implementations:

	Opportunity	Action
C	faster sorting still adaptive to ties	replace Quicksort3 with ZacksortB
C++	faster sorting with the same amount of memory or simply save memory	replace Quicksort2 with ZacksortB or Pdqsort, at 0%-buffer replace inplace-mergesort with Grailsort, at negligible sqrt(N) buffer add Walksort, at parametrizable 14% buffer add Frogsort2, replace 50%-buffer Mergesort with Frogsort1, replace 100%-buffer Mergesort with the more adaptive Omitsort
Clojure	see Java	see Java
Groovy	see Java	see Java
Java	faster sorting still adaptive to presorting, faster sorting still adpative to ties	replace Timsort with Squidsort1, replace Dupisort with ZacksortB
JRuby	see Java	see Java
Kotlin	see Java	see Java
Python	faster sorting still adaptive to presorting	replace Timsort with Squidsort1
Rust	faster sorting still adaptive to presorting	replace Timsort with Squidsort1
Scala	faster sorting still adaptive to ties, half buffer and half copying, see Java	replace Quicksort3 with ZacksortB, replace Copysort with Frogsort1, see Java

Ideal interfaces

After quick-wins, more can be saved by creating optimal interfaces.

an ideal interface should rather take declarative specification arguments (e.g. need-stable Yes/NO) than imperative arguments (use-quicksort)
an ideal interface should choose the most efficient algorithm to fulfill the specification
an ideal interface should choose unstable inplace algorithms when sorting keys-only where stability doesn’t matter (Zacksort)
an ideal interface should allow to specify a key-function or a comparison function (or none of them)
an ideal interface should have energy-efficient defaults, such as using radix-sorts for atomic types and strings (Skasort)
an ideal interface should allow specification of user-preferences regarding memory-speed-trade-offs (to choose between Walksort, Frogsort2 and Frogsort1)
an ideal interface should implement algorithms for equally-sized elements (Frogsort) and size-varying elements (VFrogsort) or (XFrogsort)
an ideal interface should explicit prior knowledge about structure in the input data (Squidsort1 and Squidsort2, Swansort)
an ideal interface should provide efficient string sorting, with a radix-sorting default (Skasort) and a size-varying method that is adaptive to duplicates VSwansort), see below
an ideal interface should allow parallel execution (with a user-specified limit on number of threads, processes, machines), see below

It strongly depends on the programming language, how such interfaces look in detail and whether need just enhancement or replacement.

String sorting

Any frequently executed code that manipulates strings is fertile ground for optimization — Kurt Guntheroth²

String sorting is unduly expensive. The reason is localization: respecting the locale order of letters prevents us from using a simple efficient radix sort on strings. Instead we use a more expensive comparison sort, and for each comparison we call an expensive function handling dozens of special cases of the given locale. We found localized string sorting with C++ std::sort up to 40 times more expensive than radix sorting with Skasort. The most effective way to reduce the cost of string sorting would be an international agreement on a character set (UTF-8) and no locale, i.e. simple byte order.

All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection — David Wheeler

There is another reason that string sorting is unduly expensive: The prior art often sorts strings indirectly by moving pointers to strings. This seems clever because - if the pointer is smaller than the string - it can theoretically save movements costs in true random-access-memory. Also it is convenient to use Quicksort with pointers, because we know Quicksort to be fast and it cannot handle size-varying strings, just equally-sized pointers. However, in today’s deep cache-hierarchies this approach doesn’t scale to big \(N\): access costs depend on access distance, moving pointers instead of strings implies random-access to the memory where strings are actually stored, whether or not the strings are finally re-arranged. Suitably implemented Mergesort can directly move strings during the merge-passes, this can easily reduce sorting time and energy by 33%. And the greeNsort^® algorithm Frogsort can do the same, just with less memory.

In summary: direct radix-sorting of strings seems most promising. Where comparison sorting cannot be avoided greeNsort^® algorithms are an alternative, either direct Frogsort, Geckosort, Squidsort or indirect Zacksort, Zucksort.

Multithreading

Multi-threading your software delivers better performance and as well as better energy-efficiency — Intel³

Not all algorithms parallelize equally well. For example Mergesort with 100% buffer can be multithreaded very well, not only with parallel branch execution but also with parallelizing the the work phase (parallel merge), the same is true for [Frogsort](glossary.html#frogsort, see the deep dive: multi-threading.

By contrast, the work-phase of Quicksort (partitioning) has a strong serial character: parallel inplace-partitioning is possible, but complicated and costs extra read or move operations on certain input patterns, see for example Tsigas & Zhang. Parallel partitioning gets easier when 100% buffer memory is allowed, however, there are still challenges regarding work distribution due to random pivots, particularly when ties occur. From a practical⁴ point of view, it seems beneficial to not insist of parallelizing pure Quicksort, it is much easier to follow the approach in Java 8 and use parallel binary Mergesort (better Frogsort or a parallel Multiway-Mergesort with help of a LoserTree at the top levels of the divide&conquer process and switch to e.g. Quicksort (better ZacksortB) once a branch becomes small enough to run it on a single thread.

In summary: parallel branching is easy, parallel work phase is harder: parallel merging is a practical approach to parallelizing the work phase in both, merging sorts and in the top-levels of partitioning sorts, greeNsort^® algorithms help to lower the costs of parallel merging.

Languages

C⁵

C is quirky, flawed, and an enormous success — Dennis Ritchie

C is a compiled language and many other languages have interfaces to C. C has a standard library function qsort which takes four parameters

base − This is the pointer to the first element of the array to be sorted
nitems − This is the number of elements in the array pointed by base
size − This is the size in bytes of each element in the array
compar − This is the function that compares two elements and returns a tri-Boolean {-1,0-1}

Limitations of the interface are:

does not sort stable
does sort only equally sized elements (sorting strings indirectly via pointers results in expensive random-access)
does not sort multi-threaded
the tri-Boolean comparison function requires too much work for always returning {-1,0-1} although a simple Boolean would be enough
implicit: implementation used to be inplace and deterministic (difficult to switch to probabilistic)

Underlying is a Quicksort3 and limitations of the implementation are:

average \(\mathcal{O}(N \log(N))\) but \(\mathcal{O}(N^2)\) worst case (unlike Introsort)
best \(\mathcal{O}(N \log(D))\) case on ties but costs many extra operations, hence energy waste on non-tied data (unlike true binary Quicksort2)
no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
no acceleration for presorting (unlike Pdqsort)

Opportunity for improving the implementation within the given interface:

use ZacksortB giving the same performance guarantees but identifying ties at cost of only one extra comparison per partitioning branch
possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding⁶ with Heapsort
possibly tune for presorting along the lines of Pdqsort

Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper Boolean operators (e.g. LESS) instead of more expensive tri-Boolean comparators: - offer a new interface for unstable sorting of equally-sized elements (ZacksortB), also multi-threaded - offer a new interface for stable 50%-buffer sorting of equally-sized elements (Frogsort1), also multi-threaded - offer a new interface for stable low-buffer sorting of equally-sized elements (Frogsort2) with an parameter for the buffer fraction⁷ - offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)⁸ - offer a new interface for stable varying sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded - offer a new interface for specific sorting of atomic types, also multi-threaded

C++⁹

There are only two kinds of languages: the ones people complain about and the ones nobody uses — Bjarne Stroustrup

C++ is an compiled language with template support for many data-types, many other languages have interfaces to C++. C++ has two standard-library sorting functions, unstable std::sort and std::stable_sort, furthermore C’s qsort is available. Since C++11 the implementations must guarantee a \(\mathcal{O}(N \log(N))\) worst case.

libstdc++ std::stable_sort (gcc and clang)

std::stable_sort has a clever implementation of Mergesort that combines three types of merging (inplace-merge with \(\mathcal{O}(N \log(N))\) merge-cost, halfcopy-merge copying data to 50% buffer and merging back, nocopy-merge merging forward to 100% buffer). The algorithm tries to achieve maximum speed by allocating 100% buffer space, if less is obtained it starts with a more expensive merge and switches to a cheaper one as soon as the available buffer is big enough relative to the problem size in the recursion; for example the biggest top-level-merge is done with inplace-merge, the next level with halfcopy-merge and then nocopy-merge. We found the following limitations of the implementation:

it optimizes for time, but not for footprint, this is good on an idle desktop but not good on a busy server
it may allocate memory that is never used, if for example 75% memory are obtained, only 50% are used¹⁰
with little available buffer it is an expensive inplace-mergesort with \(\mathcal{O}(N \log(N)^2)\) although \(\mathcal{O}(N \log(N))\) would be possible with only \(\mathcal{O}(\sqrt{N})\) buffer.
with 50% buffer there is too much copying, even in the prior art it is known how to avoid this, see Ininsort.
despite of 100% buffer a best case of \(\mathcal{O}(N)\) for presorted data is not achieved
the 50% buffer half-merge can not run concurrently on both branches

Opportunity for improving the implementation within the given interface:

with 100% buffer Omitsort¹¹ provides adaptivity with a \(\mathcal{O}(N)\) best case unlike the nocopy Mergesort
with 50% buffer Frogsort1 avoids unnecessary copying compared to halfcopy Mergesort
with parametrizable 14% buffer Frogsort2 has peak sustainability and can be faster than any other algorithm¹²
with negligible \(\mathcal{O}(\sqrt{N})\) buffer Walksort is even more faster than inplace Mergesort
with 0% buffer Grailsort is much faster than inplace Mergesort

libstdc++ std::sort (gcc)

std::sort is using Introsort, based on a simple Quicksort with a pivot deterministically chosen as a middle element. In order to avoid vulnerability algorithmic complexity attacks (or simply unlucky input that leads to quadratic runtime or worse) if the recursion depth gets too deep it falls back on Heapsort. We found the following limitations of this implementation:

no acceleration to expected \(\mathcal{O}(N \log(D))\) in case of tied input
no graceful degradation¹³ if falling back to Heapsort
no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
no acceleration for presorting (unlike Pdqsort)
implicit: implementation used to be inplace and deterministic (difficult to switch to probabilistic)

Opportunity for improving the implementation within the given interface:

use ZacksortB with a best case of \(\mathcal{O}(N \log(D))\) for tied data at cost of only one extra comparison per partitioning (and Heapsort for a \(\mathcal{O}(N \log(N))\) worst case)
optionally tune for presorting along the lines of Pdqsort

libc++ std::sort (clang)

std::sort in libc++ of the clang compiler uses s different implementation that covers more special cases and is tuned for tied and presorted data. Unfortunately it seems to contain a vulnerability for quadratic runtime reported 2014 by Orson Peters and has not been fixed since then¹⁴. Hence libc++ std::sort violates the C++ standard and cannot be recommended for production use.

parallel sorting

Since C++17 the standard specifies parallel sorting with a serial and multi-threaded policy, since C++20 it adds a SIMD-policy. The GCC 9 implements the parallel Standard Template Library (STL) via Intel’s Thread Building Blocks (TBB) library. However, the TBB has no stable sort and its unstable sort was incompletely parallelized (parallel branching but not parallel partitioning). In 2018 Intel open-sourced its Parallel Standard Template Library (PSTL) which implements a stable merge sort which parallelizes the branches and the merging. Now the llvm PSTL has the following limitations of the implementation:

the parallel merge of the PSTL does a primitive binary split of the merge task which does only guarantee problem size reduction of 1/4
the unstable parallel sort seems to call the stable parallel sort (and hence does not exploit the dropped stability-requirement)
the stable parallel sort requires 100% buffer (although 50% would be enough)

Opportunities for improving the implementation within the given interface:

Frogsort1 allows a similar implementation with similar complexity but with only 50% buffer optionally use Squidsort1 trading higher code-complexity for adaptivity
k-threaded unstable tie-adaptive sorting could be implemented by parallel sorting k parts of the data with ZacksortB, then allocating 50% buffer and parallel merging those using Frogsort1 techniques.¹⁵

Opportunities for enhancing the interface:

the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Frogsort2 provides between 50% buffer and little or no buffer¹⁶.
the current interfaces could be overloaded with a new parameter alternative to the comparison-function: a function for key-extraction that could be used in a radix-sort along the lines of Skasort.

Opportunities for improving the interface:

specific sorting could be made the default for atomic types when users do neither provide a comparison-function nor a key-extraction-function.
offer a new interface for stable sustainable sorting of variable-sized elements, and optionally multi-threaded

Clojure

I wanted a language as acceptable as Java or C#, but supporting a much simpler programming model — Rich Hickey

Closure is a modern, dynamic, and functional dialect of the Lisp programming language on the Java platform, it uses the sorting library of Java.

Groovy

Groovy’s sort ultimately calls Java’s Collections’ sort— Guillaume Laforge

Groovy is a Java-syntax-compatible object-oriented programming language for the Java platform, it uses the sorting library of Java.

Java

Java is C++ without the guns, clubs and knives — James Gosling

Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. Java 7 introduced Dual-Pivot-Quicksort for primitive types and Timsort for object-types behind its sort API, and Java 8 introduced the Arrays.parallelSort API, which uses Mergesort for the parallel top-levels of recursion. From the greeNsort^® perspective these are limitations of the implementation:

Dual-Pivot-Quicksort is nicely adaptive to ties but it does not reach the speed of Block-Quicksort for sorting untied data (and implementation complexity is too high)
Timsort is an algorithm extremely spezialized on presorted data, for sorting unsorted data it is too expensive (and implementation complexity is too high), see the deep-dive in the results section. Furthermore the Java documentation encourages inappropriate use of Timsort which dramatically waists energy¹⁷.
parallelSort requires 100% buffer (and its parallel binary merging allows imbalance up to 25:75%)

Opportunities for improving the implementation within the given interface:

Dual-Pivot-Quicksort can be replaced with ZacksortB which is simpler, faster and as adaptive.
Timsort can be replaced with Squidsort1 which is simpler, faster and also very adaptive (sometimes more, sometimes less)
parallelSort can be replaced with a parallel-version of Frogsort1 which requires only 50% buffer space (and using a global median would guarantee 50:50% balance)

Opportunities for enhancing the interface:

the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Squidsort2 provides¹⁸.
the current interfaces could be overloaded with a new parameter alternative to the comparison-function: a function for key-extraction that could be used in a radix-sort along the lines of Skasort.

Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper Boolean operators (e.g. LESS) instead of more expensive tri-Boolean comparators:

offer a new interface for unstable sorting of equally-sized elements (Zacksort), also multi-threaded
offer a new interface for stable 50%-buffer sorting of equally-sized elements (Frogsort1), also multi-threaded
offer a new interface for stable low-buffer sorting of equally-sized elements (Frogsort2) with an parameter for the buffer fraction¹⁹
offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)²⁰
offer a new interface for stable varying sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded
offer a new interface for specific sorting of atomic types, also multi-threaded

The following languages benefit from improvements to the Java sorting library:

Closure
JRuby
Kotlin
Scala (partially)

JRuby

JRuby always has and always will try to match CRuby behavior — Charles Nutter

JRuby is an implementation of the Ruby programming language atop the Java Virtual Machine, written largely in Java. Actually, JRuby switched to Java’s sort library in 2016.

Kotlin

Develop with pleasure — Dmitry Jemerov

Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference. Kotlin is designed to interoperate fully with Java, and the JVM version of Kotlin’s standard library depends on the Java Class Library.

Python

I stepped down as ‘Benevolent Dictator’, as I was called — Guido van Rossum

Python is an interpreted, high-level and general-purpose programming language written in C. Python2 supported a cmp parameter to handle user specified comparison functions, in python3 this was completely abandoned for a key parameter to use specified key-extraction functions. The python3 API has two methods list.sort() which sorts an object insitu and a sorted() method that sorts exsitu and does not modify its input.

These are limitations of the interface:

although a stable Natural-Mergesort is used, the sort is not generic because it no longer takes a comparison function
the python documentation promotes sorting by multiple keys, but the interface does force users to sort multiple times which is much more expensive than using multiple keys in one comparison function²¹

These are limitations of the implementation:

Timsort is slower on non-sorted data than nocopy-Mergesort
Timsort, although an algorithm extremely specialized on presorted data, is only faster than Mergesort, when very little sorting work is left to be done, see deep-dive in the results section
Timsort is not adaptive to ties
Timsort is inherently serial

Opportunities for improving the implementation within the given interface:

Timsort can be replaced with Squidsort1 which is simpler, faster and also very adaptive (sometimes more, sometimes less)
for atomic key-extraction functions specific sorting algorithmns could be used such as radix-sort along the lines of Skasort

Opportunities for enhancing the interface:

the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Squidsort2 provides²².
the current interfaces could be overloaded with a new parameter alternative to the key-extraction-function: a comparison-function that would allow fully generic sorting
the current interfaces could be overloaded with a new parameter for the number of threads used for parallel sorting

Opportunity for enhancing the interface:

offer a new interface for unstable sorting of equally-sized elements (Zacksort), also multi-threaded
offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)²³
offer a new interface for stable varying sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded

Rust

Less-well-known things I’m very proud that rust shipped 1.0 without — Graydon Hoare

Rust is a systems programming language similar to C++ or Java but memory-save without a garbage-collector. It has sorting APIs for stable and unstable sorting using a comparison-function or a key-extraction-function. Stable sort uses a lean version of Timsort without galloping-mode and sort_unstable uses Pdqsort. We found the sorting libraries slower than C/C++ implementations.

Limitations of the implementation are:

Pdqsort uses more operations than needed which can slow it down with an expensive comparison function
Timsort still suffers from a merging that costs to many moves and and potentially too imbalanced merging, see Python and the results section.

Opportunity for improving the implementation within the given interface:

use ZacksortB instead of Pdqsort to reduce ‘low-overhead’ to ‘minimum-overhead’, possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding²⁴ with Heapsort, possibly tune for presorting along the lines of Pdqsort
use the nocopy-Squidsort1 instead of Timsort for a faster adaptive algorithm

Opportunity for enhancing the interface:

enhance the sort API with an optional parameter for the buffer fraction (Squidsort2) and optionally multi-threaded
offer a new interface for stable sustainable sorting of variable-sized elements (or at least for efficient string-sorting in arbitrary collation), and optionally multi-threaded

Opportunities for improving the interface:

specific sorting could be made the default for atomic types when users do neither provide a comparison-function nor a key-extraction-function.

Scala

It stands for scalable language — Martin Odersky

Scala is a general-purpose programming language providing support for both object-oriented programming and functional programming. Scala provides language interoperability with Java, so that libraries written in either language may be referenced directly in Scala or Java code. Scala’s Sorting object provides convenience wrappers for java.util.Arrays.sort, see Java. Beyond that Scala implements its own sorting algorithms based on a tri-Boolean comparator, a counting sort for tabulating Booleans, a primitve Mergesort (Copysort) that copies data to 100% buffer and merges back, a ternary Quicksort3.

Obvious limitations of the Quicksort implementation are:

average \(\mathcal{O}(N \log(N))\) but \(\mathcal{O}(N^2)\) worst case of Quicksort (unlike Introsort)
best \(\mathcal{O}(N \log(D))\) case on ties but costs many extra operations, hence energy waste on non-tied data (unlike true binary Quicksort2)
no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
no acceleration for presorting (unlike Pdqsort)

Obvious limitations of the Mergesort implementation are:

200% memory instead of 150% memory
200% moves instead of 100% moves

Opportunity for improving the implementation within the given interface:

use ZacksortB to get adaptivity to ties without the operation overhead of Quicksort3
possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding²⁵ with Heapsort
possibly tune for presorting along the lines of Pdqsort
use 50% buffer nocopy-Frogsort1 instead of 100% buffer copy-Mergesort

Opportunity for enhancing the interface:

offer a new interface for unstable sorting that is happy with a cheaper Boolean LESS comparison function and optionally multi-threaded
offer a new interface for stable sustainable sorting of equally sized elements with an optional parameter for the buffer fraction (Frogsort2) and optionally multi-threaded
offer a new interface for stable sustainable sorting of variable-sized elements (or at least for efficient string-sorting in arbitrary collation), and optionally multi-threaded
offer a new interface for msb-radix sorting of integer and float types (and ideally for strings in standard collation also), and optionally multi-threaded
interfaces could be defined that use Boolean operators instead of tri-Boolean comparators

Feedback

Please inform us about

new sorting algorithms
implementations of greeNsort^® algorithms in new languages
new usage of greeNsort^® algorithms in libraries
new usages of greeNsort^® algorithms in applications

The limits of my language mean the limits of my world — Ludwig Wittgenstein

University of Zurich, Informatics and Sustainability Research ↩︎
Optimized C++ by Kurt Guntheroth ↩︎
“Developing Green Software”, no longer available at software.intel.com/content/dam/develop/external/us/en/documents/developing-green-software-183291.pdf↩︎
Note that the very efficient IPS4o is a parallel k-ary partitioning algorithm requiring only \(\mathcal{O}(\sqrt{N})\) buffer, however, at the price of a quite complicated implementation.↩︎
October 2020↩︎
not needed when random-pivots are used, since Zucksort have probabilistic convergence guarantees↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
October 2020↩︎
Related to this, I found a logical glitch: __inplace_merge tries to get 100% buffer although it never uses more than 50% when calling std::__merge_adaptive↩︎
alternatives are the slightly faster but less adaptive Knuthsort4 or the slightly slower but more adaptive Octosort↩︎
measured with sorting doubles↩︎
for graceful degradation in the context of sorting see Implementing Sorting in Database Systems ↩︎
The maintainers neither fixed thsi the usual way with Heapsort nor did they switched to Pdqsort offered by Orson Peters which is now in boost and Rust ↩︎
A more complicated alternative would be using the completely parallel Quicksort of Tsigas & Zhang based on the implementation of Singler&Sanders in the GCC Parallel Mode, however that is not adaptive to ties. Another complicated alternative would be IPS4o by Axtmann, Witt, Ferizovic & Sanders, but note that the serial version IPS4o is slower than optimal Quicksort algorithms.↩︎
Note that parallelizing the low-memory versions will more challenging↩︎
The documentation unteaches the well-founded distinction between \(\mathcal{O}(N)\) merging and \(\mathcal{O}(N\log{N})\) sorting: “It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array”↩︎
Note that parallelizing the low-memory versions will more challenging↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
The documentation claims that “The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset”, but in fact stability of Timsort enables sorting by multiple keys, taking advantage of a previous sort requires the two keys to be correlated, not orthogonal, and rather perfectly correlated↩︎
Note that parallelizing the low-memory versions will more challenging↩︎
Multithreading is possible but more complicated↩︎
not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎
not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎

Call to action: apply greeNsort knowledge to your artifacts

Prior art

Available algorithms

Choosing an algorithm

parallel

stable

unstable

Improving libraries

Quick win actions

Ideal interfaces

String sorting

Multithreading

Languages

C5

C++9

libstdc++ std::stable_sort (gcc and clang)

libstdc++ std::sort (gcc)

libc++ std::sort (clang)

parallel sorting

Clojure

Groovy

Java

JRuby

Kotlin

Python

Rust

Scala

Feedback

C⁵

C++⁹