Here we comment on the energy efficiency status of sorting in some programming languages and recommend actions for improvments.

We welcome comments on other languages, corrections, updates and - most importantly - improvements



Plan of attack


Compatibility means deliberately repeating other people’s mistakes — David Wheeler

All the following programming languages had their reasons to shape their sorting APIs and implementations as they did, however, that was before the climate crisis. Now programming languages as a whole are measured by their energy efficiency, see Energy Efficiency across Programming Languages, and - no surprise - it turns out that some languages seem more efficient than others, particularly compiled languages, but also Java. Sadly, some of the languages gaining popularity for expensive big-data applications are interpreted (R and Python). If - for some reason - people need to stick with those instead of switching to more efficient languages (e.g. Julia), then it becomes even more important to maximize the efficiency of the compiled sorting libraries used.

Three types of actions can improve sorting efficiency in standard libraries:

  1. improving implementations (within unchanged interfaces)

  2. enhancing interfaces (new parameters, new interfaces)

  3. replacing interfaces (changing existing interfaces for the better)

  4. gives quick wins, 2. should be doable, and perhaps even 3.

Before we comment on single languages, we shortly



Available algorithms


Insensitive choice of algorithms and data structures lead to significant energy wasting — Fethullah Goekkus1

The following greeNsort® algorithms are available to create better implementations:

For more information on the algorithms see the Portfolio.



Quick win actions


The following quick-wins can be achieved by improving implementations:

Opportunity Action
C faster sorting still adaptive to ties replace Quicksort3 with ZacksortB
C++ faster sorting with the same amount of memory or simply save memory replace inplace Mergesort with Walksort,
replace 50% Mergesort with Frogsort1,
replace 100% Mergesort with Omitsort
Clojure see Java see Java
Groovy see Java see Java
Java faster sorting still adaptive to presorting,
faster sorting still adpative to ties
replace Timsort with Squidsort1,
replace Dupisort with ZacksortB
JRuby see Java see Java
Kotlin see Java see Java
Python faster sorting still adaptive to presorting replace Timsort with Squidsort1
Rust faster sorting still adaptive to presorting replace Timsort with Squidsort1
Scala faster sorting still adaptive to ties,
half buffer and half copying,
see Java
replace Quicksort3 with ZacksortB,
replace Copysort with Frogsort1,
see Java



Ideal interfaces


After quick-wins, more can be saved by creating optimal interfaces.

It strongly depends on the programming language, how such interfaces look in detail and whether need just enhancement or replacement.



String sorting


Any frequently executed code that manipulates strings is fertile ground for optimization — Kurt Guntheroth2

String sorting is unduly expensive. The reason is localization: respecting the locale order of letters prevents us from using a simple efficient radix sort on strings. Instead we use a more expensive comparison sort, and for each comparison we call an expensive function handling dozens of special cases of the given locale. We found localized string sorting with C++ std::sort up to 40 times more expensive than radix sorting with Skasort. The most effective way to reduce the cost of string sorting would be an international agreement on a character set (UTF-8) and no locale, i.e. simple byte order.

All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection — David Wheeler

There is another reason that string sorting is unduly expensive: The prior art often sorts strings indirectly by moving pointers to strings. This seems clever because - if the pointer is smaller than the string - it can theoretically save movements costs in true random-access-memory. Also it is convenient to use Quicksort with pointers, because we know Quicksort to be fast and it cannot handle size-varying strings, just equally-sized pointers. However, in todays deep cache-hierarchies this approach doesn’t scale to big \(N\): access costs depend on access distance, moving pointers instead of strings implies random-access to the memory where strings are actually stored, whether or not the strings are finally re-arranged. Suitably implemented Mergesort can directly move strings during the merge-passes, this can easily reduce sorting time and energy by 33%. And the greeNsort® algorithm Frogsort can do the same, just with less memory.

In summary: direct radix-sorting of strings seems most promising. Where comparison sorting cannot be avoided greeNsort® algorithms are an alternative, either direct Frogsort, Geckosort, Squidsort or indirect Zacksort, Zucksort.



Multithreading


Multi-threading your software delivers better performance and as well as better energy-efficiency — Intel3

Not all algorithms parallelize equally well. For example Mergesort with 100% buffer can be multithreaded very well, not only with parallel branch execution but also with parallelizing the the work phase (parallel merge).

By contrast, the work-phase of Quicksort (partitioning) has a strong serial character: parallel inplace-partitioning is possible, but complicated and can cost extra move operations on certain input patterns, see for example Tsigas & Zhang. Parallel partitioning gets easier when 100% buffer memory is allowed, however, there are still challenges regarding work distribution due to random pivots, particularly when ties occur. From a practical point of view, it seems beneficial to not insist of parallelizing pure Quicksorts or partitioning sorts, it is much easier to follow the approach in Java 8 and use parallel binary Mergesort (better Frogsort or a parallel Multiway-Mergesort with help of a LoserTree at the top levels of the divide&conquer process and switch to e.g. Quicksort (better ZacksortB) once a branch becomes small enough to run it on a single thread.

In summary: parallel branching is easy, parallel workphase is harder: parallel merging is a practical approach to parallelizing the work phase in both, merging sorts and in the top-levels of partitioning sorts, greeNsort® algorithms help to lower the costs of parallel merging.



Languages


C4

C is quirky, flawed, and an enormous success — Dennis Ritchie

C is a compiled language and many other languages have interfaces to C. C has a standard library function qsort which takes four parameters

  • base − This is the pointer to the first element of the array to be sorted
  • nitems − This is the number of elements in the array pointed by base
  • size − This is the size in bytes of each element in the array
  • compar − This is the function that compares two elements and returns a tri-boolean {-1,0-1}

Limitations of the interface are:

  • does not sort stable
  • does sort only equally sized elements (sorting strings indirectly via pointers results in expensive random-access)
  • does not sort multi-threaded
  • the tri-boolean comparison function requires too much work for always returning {-1,0-1} although a simple boolean would be enough
  • implicit: implementation used to be inplace and deterministic (difficult to switch to probabilistic)

Underlying is a Quicksort3 and limitations of the implementation are:

  • average \(\mathcal{O}(N \log(N))\) but \(\mathcal{O}(N^2)\) worst case (unlike Introsort)
  • best \(\mathcal{O}(N \log(D))\) case on ties but costs many extra operations, hence energy waste on non-tied data (unlike true binary Quicksort2)
  • no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
  • no acceleration for presorting (unlike Pdqsort)

Opportunity for improving the implementation within the given interface:

  • use ZacksortB giving the same performance guarantees but identifying ties at cost of only one extra comparison per partitioning branch
  • possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding5 with Heapsort
  • possibly tune for presorting along the lines of Pdqsort

Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper boolean operators (e.g. LESS) instead of more expensive tri-boolean comparators: - offer a new interface for unstable sorting of equally-sized elements (ZacksortB), also multi-threaded - offer a new interface for stable 50%-buffer sorting of equally-sized elements (Frogsort1), also multi-threaded - offer a new interface for stable low-buffer sorting of equally-sized elements (Frogsort2) with an parameter for the buffer fraction6 - offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)7 - offer a new interface for stable varying sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded - offer a new interface for specific sorting of atomic types, also multi-threaded


C++8

There are only two kinds of languages: the ones people complain about and the ones nobody uses — Bjarne Stroustrup

C++ is an compiled language with template support for many data-types, many other languages have interfaces to C++. C++ has two standard-library sorting functions, unstable std::sort and std::stable_sort, furthermore C’s qsort is available. Since C++11 the implementations must guarantee a \(\mathcal{O}(N \log(N))\) worst case.

libstdc++ std::stable_sort (gcc and clang)

std::stable_sort has a clever implementation of Mergesort that combines three types of merging (inplace-merge with \(\mathcal{O}(N \log(N))\) merge-cost, halfcopy-merge copying data to 50% buffer and merging back, nocopy-merge merging forward to 100% buffer). The algorithm tries to achieve maximum speed by allocating 100% buffer space, if less is obtained it starts with a more expensive merge and switches to a cheaper one as soon as the available buffer is big enough relative to the problem size in the recursion; for example the biggest top-level-merge is done with inplace-merge, the next level with halfcopy-merge and then nocopy-merge. We found the following limitations of the implementation:

  • it optimizes for time, but not for footprint, this is good on an idle desktop but not good on a busy server
  • it may allocate memory that is never used, if for example 75% memory are obtained, only 50% are used9
  • with little available buffer it is an expensive inplace-mergesort with \(\mathcal{O}(N \log(N)^2)\) although \(\mathcal{O}(N \log(N))\) would be possible with only \(\mathcal{O}(\sqrt{N})\) buffer.
  • with 50% buffer there is too much copying, even in the prior art it is known how to avoid this, see H.W. Lang.
  • despite of 100% buffer a best case of \(\mathcal{O}(N)\) for presorted data is not achieved
  • the 50% buffer half-merge can not run concurrently on both branches

Opportunity for improving the implementation within the given interface:

  • with little buffer use Walksort instead of inplace Mergesort worst case, opportunity to parallelize to \(k\) threads at cost of \(\mathcal{O}(k\sqrt{N})\) buffer)
  • with 50% buffer use Frogsort1 instead of halfmerge Mergesort (avoids unnecessary copying, opportunity to parallelize)
  • with 100% buffer use Omitsort10 instead of fullmerge Mergesort (adaptivity with \(\mathcal{O}(N)\) best case, opportunity to parallelize)

libstdc++ std::sort (gcc)

std::sort is using Introsort, based on a simple Quicksort with a pivot deterministically chosen as a middle element. In order to avoid vulnerability algorithmic complexity attacks (or simply unlucky input that leads to quadratic runtime or worse) if the recursion depth gets too deep it falls back on Heapsort. We found the following limitations of this implementation:

  • no acceleration to expected \(\mathcal{O}(N \log(D))\) in case of tied input
  • no graceful degradation11 if falling back to Heapsort
  • no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
  • no acceleration for presorting (unlike Pdqsort)
  • implicit: implementation used to be inplace and deterministic (difficult to switch to probabilistic)

Opportunity for improving the implementation within the given interface:

  • use ZacksortB with a best case of \(\mathcal{O}(N \log(D))\) for tied data at cost of only one extra comparison per partitioning (and Heapsort for a \(\mathcal{O}(N \log(N))\) worst case)
  • optionally tune for presorting along the lines of Pdqsort

libc++ std::sort (clang)

std::sort in libc++ of the clang compiler uses s different implementation that covers more special cases and is tuned for tied and presorted data. Unfortunately it seems to contain a vulnerabilty for quadratic runtime reported 2014 by Orson Peters and has not been fixed since then12. Hence libc++ std::sort violates the C++ standard and cannot be recommended for production use.

parallel sorting

Since C++17 the standard specifies parallel sorting with a serial and multi-threaded policy, since C++20 it adds a SIMD-policy. The GCC 9 implements the parallel Standard Template Library (STL) via Intel’s Thread Building Blocks (TBB) library. However, the TBB has no stable sort and its unstable sort was incompletely parallelized (parallel branching but not parallel partitioning). In 2018 Intel open-sourced its Parallel Standard Template Library (PSTL) which implements a stable merge sort which parallelizes the branches and the merging. Now the llvm PSTL has the following limitations of the implementation:

  • the parallel merge of the PSTL does a primitive binary split of the merge task which does only guarantee problem size reduction of 1/4
  • the unstable parallel sort seems to call the stable parallel sort (and hence does not exploit the dropped stability-requirement)
  • the stable parallel sort requires 100% buffer (although 50% would be enough)

Opportunities for improving the implementation within the given interface:

  • Frogsort1 allows a similar implementation with similar complexity but with only 50% buffer optionally use Squidsort1 trading higher code-complexity for adaptivity
  • k-threaded unstable tie-adaptive sorting could be implemented by parallel sorting k parts of the data with ZacksortB, then allocating 50% buffer and parallel merging those using Frogsort1 techniques^[a more complicated alternative would be using the completely parallel Quicksort of Tsigas & Zhang based on the implementation of Singler&Sanders in the GCC Parallel Mode, however that is not adaptive to ties. Another complicated alternative would be IPS4o by Axtmann, Witt, Ferizovic & Sanders, but note that the serial version IPS4o is slower than optimal Quicksort algorithms.

Opportunities for enhancing the interface:

  • the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Frogsort2 and Squidsort2 provide13.
  • the current interfaces could be overloaded with a new parameter alternative to the comparison-function: a function for key-extraction that could be used in a radix-sort along the lines of Skasort.

Opportunities for improving the interface:

  • specific sorting could be made the default for atomic types when users do neither provide a comparison-function nor a key-extraction-function.
  • offer a new interface for stable sustainable sorting of variable-sized elements, and optionally multi-threaded

Clojure

I wanted a language as acceptable as Java or C#, but supporting a much simpler programming model — Rich Hickey

Closure is a modern, dynamic, and functional dialect of the Lisp programming language on the Java platform, it uses the sorting library of Java.


Groovy

Groovy’s sort ultimately calls Java’s Collections’ sort— Guillaume Laforge

Groovy is a Java-syntax-compatible object-oriented programming language for the Java platform, it uses the sorting library of Java.


Java

Java is C++ without the guns, clubs and knives — James Gosling

Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. Java 7 introduced Dual-Pivot-Quicksort for primitive types and Timsort for object-types behind its sort API, and Java 8 introduced the Arrays.parallelSort API, which uses Mergesort for the parallel top-levels of recursion. From the greeNsort® perspective these are limitations of the implementation:

  • Dual-Pivot-Quicksort is nicely adaptive to ties but it does not reach the speed of Block-Quicksort for sorting untied data (and implementation complexity is too high)
  • Timsort is an algorithm extremely spezialized on presorted data, for sorting unsorted data it is too expensive (and implementation complexity is too high), see the deep-dive in the results section. Furthermore the Java documentation encourages inappropriate use of Timsort which dramatically waists energy14.
  • parallelSort requires 100% buffer (and its parallel binary merging allows imbalance up to 25:75%)

Opportunities for improving the implementation within the given interface:

  • Dual-Pivot-Quicksort can be replaced with ZacksortB which is simpler, faster and as adaptive.
  • Timsort can be replaced with Squidsort1 which is simpler, faster and also very adaptive (sometimes more, sometimes less)
  • parallelSort can be replaced with a parallel-version of Frogsort1 which requires only 50% buffer space (and using a global median would guarantee 50:50% balance)

Opportunities for enhancing the interface:

  • the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Squidsort2 provides15.
  • the current interfaces could be overloaded with a new parameter alternative to the comparison-function: a function for key-extraction that could be used in a radix-sort along the lines of Skasort.

Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper boolean operators (e.g. LESS) instead of more expensive tri-boolean comparators:

The following languages benefit from improvements to the Java sorting library:


JRuby

JRuby always has and always will try to match CRuby behavior — Charles Nutter

JRuby is an implementation of the Ruby programming language atop the Java Virtual Machine, written largely in Java. Actually, JRuby switched to Java’s sort library in 2016.


Kotlin

Develop with pleasure — Dmitry Jemerov

Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference. Kotlin is designed to interoperate fully with Java, and the JVM version of Kotlin’s standard library depends on the Java Class Library.


Python

I stepped down as ‘Benevolent Dictator’, as I was called — Guido van Rossum

Python is an interpreted, high-level and general-purpose programming language written in C. Python2 supported a cmp parameter to handle user specified comparison functions, in python3 this was completely abandoned for a key parameter to use specified key-extraction functions. The python3 API has two methods list.sort() which sorts an object insitu and a sorted() method that sorts exsitu and does not modify its input.

These are limitations of the interface:

  • although a stable Natural-Mergesort is used, the sort is not generic because it no longer takes a comparison function
  • the python documentation promotes sorting by multiple keys, but the interface does force users to sort multiple times which is much more expensive than using multiple keys in one comparison function18

These are limitations of the implementation:

  • Timsort is slower on non-sorted data than nocopy-Mergesort
  • Timsort, although an algorithm extremely specialized on presorted data, is only faster than Mergesort, when very little sorting work is left to be done, see deep-dive in the results section
  • Timsort is not adaptive to ties
  • Timsort is inherently serial

Opportunities for improving the implementation within the given interface:

  • Timsort can be replaced with Squidsort1 which is simpler, faster and also very adaptive (sometimes more, sometimes less)
  • for atomic key-extraction functions specific sorting algorithmns could be used such as radix-sort along the lines of Skasort

Opportunities for enhancing the interface:

  • the current interface could be overloaded with a new parameter specifying the desired fraction of buffer, this would allow to benefit from the optimal trade-off that Squidsort2 provides19.
  • the current interfaces could be overloaded with a new parameter alternative to the key-extraction-function: a comparison-function that would allow fully generic sorting
  • the current interfaces could be overloaded with a new parameter for the number of threads used for parallel sorting

Opportunity for enhancing the interface:


Rust

Less-well-known things I’m very proud that rust shipped 1.0 without — Graydon Hoare

Rust is a systems programming language similar to C++ or Java but memory-save without a garbage-collector. It has sorting APIs for stable and unstable sorting using a comparison-function or a key-extraction-function. Stable sort uses a lean version of Timsort without galloping-mode and sort_unstable uses Pdqsort. We found the sorting libraries slower than C/C++ implementations.

Limitations of the implementation are:

  • Pdqsort uses more operations than needed which can slow it down with an expensive comparison function
  • Timsort still suffers from a merging that costs to many moves and and potentially too imbalanced merging, see Python and the results section.

Opportunity for improving the implementation within the given interface:

  • use ZacksortB instead of Pdqsort to reduce ‘low-overhead’ to ‘minimum-overhead’, possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding21 with Heapsort, possibly tune for presorting along the lines of Pdqsort
  • use the nocopy-Squidsort1 instead of Timsort for a faster adaptive algorithm

Opportunity for enhancing the interface:

  • enhance the sort API with an optional parameter for the buffer fraction (Squidsort2) and optionally multi-threaded
  • offer a new interface for stable sustainable sorting of variable-sized elements (or at least for efficient string-sorting in arbitrary collation), and optionally multi-threaded

Opportunities for improving the interface:

  • specific sorting could be made the default for atomic types when users do neither provide a comparison-function nor a key-extraction-function.

Scala

It stands for scalable language — Martin Odersky

Scala is a general-purpose programming language providing support for both object-oriented programming and functional programming. Scala provides language interoperability with Java, so that libraries written in either language may be referenced directly in Scala or Java code. Scala’s Sorting object provides convenience wrappers for java.util.Arrays.sort, see Java. Beyond that Scala implements its own sorting algorithms based on a tri-boolean comparator, a counting sort for tabulating booleans, a primitve Mergesort (Copysort) that copies data to 100% buffer and merges back, a ternary Quicksort3.

Obvious limitations of the Quicksort implementation are:

  • average \(\mathcal{O}(N \log(N))\) but \(\mathcal{O}(N^2)\) worst case of Quicksort (unlike Introsort)
  • best \(\mathcal{O}(N \log(D))\) case on ties but costs many extra operations, hence energy waste on non-tied data (unlike true binary Quicksort2)
  • no block-tuning reducing branch-misprediction (unlike Edelkamp&Weiss Block-Quicksort)
  • no acceleration for presorting (unlike Pdqsort)

Obvious limitations of the Mergesort implementation are:

  • 200% memory instead of 150% memory
  • 200% moves instead of 100% moves

Opportunity for improving the implementation within the given interface:

  • use ZacksortB to get adaptivity to ties without the operation overhead of Quicksort3
  • possibly guarantee \(\mathcal{O}(N \log(D))\) worst case by safeguarding22 with Heapsort
  • possibly tune for presorting along the lines of Pdqsort
  • use 50% buffer nocopy-Frogsort1 instead of 100% buffer copy-Mergesort

Opportunity for enhancing the interface:

  • offer a new interface for unstable sorting that is happy with a cheaper boolean LESS comparison function and optionally multi-threaded
  • offer a new interface for stable sustainable sorting of equally sized elements with an optional parameter for the buffer fraction (Frogsort2) and optionally multi-threaded
  • offer a new interface for stable sustainable sorting of variable-sized elements (or at least for efficient string-sorting in arbitrary collation), and optionally multi-threaded
  • offer a new interface for msb-radix sorting of integer and float types (and ideally for strings in standard collation also), and optionally multi-threaded
  • interfaces could be defined that use boolean operators instead of tri-boolean comparators

The limits of my language mean the limits of my world — Ludwig Wittgenstein


  1. University of Zurich, Informatics and Sustainability Research↩︎

  2. Optimized C++ by Kurt Guntheroth↩︎

  3. Developing Green Software↩︎

  4. October 2020↩︎

  5. not needed when random-pivots are used, since Zucksort have probabilistic convergence guarantees↩︎

  6. Multithreading is possible but more complicated↩︎

  7. Multithreading is possible but more complicated↩︎

  8. October 2020↩︎

  9. Related to this, I found a logical glitch: __inplace_merge tries to get 100% buffer although it never uses more than 50% when calling std::__merge_adaptive↩︎

  10. Alternatives are the slightly faster but less adaptive Knuthsort4, the slightly slower but more adaptive Octosort, or the sligthly slower but memory parsimonious Frogsort1↩︎

  11. for graceful degradation in the context of sorting see Implementing Sorting in Database Systems↩︎

  12. The maintainers neither fixed thsi the usual way with Heapsort nor did they switched to Pdqsort offered by Orson Peters which is now in boost and Rust↩︎

  13. Note that parallelizing the low-memory versions will more challenging↩︎

  14. The documentation unteaches the well-founded distinction between \(\mathcal{O}(N)\) merging and \(\mathcal{O}(N\log{N})\) sorting: “It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array”↩︎

  15. Note that parallelizing the low-memory versions will more challenging↩︎

  16. Multithreading is possible but more complicated↩︎

  17. Multithreading is possible but more complicated↩︎

  18. The documentation claims that “The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset”, but in fact stability of Timsort enables sorting by multiple keys, taking advantage of a previous sort requires the two keys to be correlated, not orthogonal, and rather perfectly correlated↩︎

  19. Note that parallelizing the low-memory versions will more challenging↩︎

  20. Multithreading is possible but more complicated↩︎

  21. not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎

  22. not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎


greeNsort logo


Copyright © 2020 Dr. Jens Oehlschlägel - All rights reserved - TermsPrivacyImpressum