Here we comment on the energy efficiency status of sorting in some programming languages and recommend actions for improvement.
We welcome comments on other languages, corrections, updates and - most importantly - improvements.
Compatibility means deliberately repeating other people’s mistakes — David Wheeler
All the following programming languages had their reasons to shape their sorting APIs and implementations as they did; however, that was before the climate crisis. Now programming languages as a whole are measured by their energy efficiency, see Energy Efficiency across Programming Languages, and - no surprise - it turns out that some languages seem more efficient than others, particularly compiled languages, but also Java. Sadly, some of the languages gaining popularity for expensive big-data applications are interpreted (R and Python). If - for some reason - people need to stick with those instead of switching to more efficient languages (e.g. Julia), then it becomes even more important to maximize the efficiency of the compiled sorting libraries used.
Three types of actions can improve sorting efficiency in standard libraries:
1. improving implementations (within unchanged interfaces)
2. enhancing interfaces (new parameters, new interfaces)
3. replacing interfaces (changing existing interfaces for the better)

Action 1 gives quick wins, action 2 should be doable, and perhaps even action 3.
Before we comment on individual languages, we briefly address some general topics.
Insensitive choice of algorithms and data structures lead to significant energy wasting — Fethullah Goekkus1
The following greeNsort® algorithms are available to create better implementations:
For more information on the algorithms see the Portfolio.
The following quick-wins can be achieved by improving implementations:
Language | Opportunity | Action |
---|---|---|
C | faster sorting still adaptive to ties | replace Quicksort3 with ZacksortB |
C++ | faster sorting with the same amount of memory or simply save memory | replace Quicksort2 with ZacksortB or Pdqsort, at 0%-buffer replace inplace-mergesort with Grailsort, at negligible sqrt(N) buffer add Walksort, at parametrizable 14% buffer add Frogsort2, replace 50%-buffer Mergesort with Frogsort1, replace 100%-buffer Mergesort with the more adaptive Omitsort |
Clojure | see Java | see Java |
Groovy | see Java | see Java |
Java | faster sorting still adaptive to presorting, faster sorting still adaptive to ties | replace Timsort with Squidsort1, replace Dupisort with ZacksortB |
JRuby | see Java | see Java |
Kotlin | see Java | see Java |
Python | faster sorting still adaptive to presorting | replace Timsort with Squidsort1 |
Rust | faster sorting still adaptive to presorting | replace Timsort with Squidsort1 |
Scala | faster sorting still adaptive to ties, half buffer and half copying, see Java | replace Quicksort3 with ZacksortB, replace Copysort with Frogsort1, see Java |
After quick-wins, more can be saved by creating optimal interfaces.
How such interfaces look in detail, and whether they need just enhancement or replacement, depends strongly on the programming language.
Any frequently executed code that manipulates strings is fertile ground for optimization — Kurt Guntheroth2
String sorting is unduly expensive. The reason is localization: respecting the locale order of letters prevents us from using a simple, efficient radix sort on strings. Instead we use a more expensive comparison sort, and for each comparison we call an expensive function handling dozens of special cases of the given locale. We found localized string sorting with C++ std::sort up to 40 times more expensive than radix sorting with Skasort. The most effective way to reduce the cost of string sorting would be an international agreement on a character set (UTF-8) and no locale, i.e. simple byte order.
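To make the cost difference concrete, here is a minimal C++ sketch contrasting a locale-aware comparator (via std::collate, which works through the locale's special cases on every call) with plain byte order; the three example strings are arbitrary:

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Locale-aware comparison: every call goes through the collate facet
// of the environment's locale, which handles its many special cases.
bool less_locale(const std::string& a, const std::string& b) {
    static const std::locale loc("");   // the environment's locale
    static const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}

int main() {
    std::vector<std::string> v = {"Zebra", "Äpfel", "apple"};
    // Localized comparison sort: correct collation, but expensive per comparison.
    std::sort(v.begin(), v.end(), less_locale);
    // Plain byte-order sort: cheap per comparison and radix-sortable,
    // but the order may differ from what the locale prescribes.
    std::sort(v.begin(), v.end());
}
```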
All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection — David Wheeler
There is another reason that string sorting is unduly expensive: the prior art often sorts strings indirectly by moving pointers to strings. This seems clever because - if the pointer is smaller than the string - it can theoretically save movement costs in true random-access memory. It is also convenient to use Quicksort with pointers, because we know Quicksort to be fast and it cannot handle size-varying strings, just equally-sized pointers. However, in today's deep cache-hierarchies this approach doesn't scale to big \(N\): access costs depend on access distance, and moving pointers instead of strings implies random access to the memory where the strings are actually stored, whether or not the strings are finally re-arranged. Suitably implemented Mergesort can directly move strings during the merge passes; this can easily reduce sorting time and energy by 33%. And the greeNsort® algorithm Frogsort can do the same, just with less memory.
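A minimal sketch of the two approaches (std::sort stands in for any comparison sort; the point is only what gets moved):

```cpp
#include <algorithm>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> strings = {"pear", "apple", "orange"};

    // Indirect sort: only pointers are moved; every comparison dereferences
    // into wherever the strings happen to live, i.e. random access for large N.
    std::vector<const std::string*> ptrs;
    for (const auto& s : strings) ptrs.push_back(&s);
    std::sort(ptrs.begin(), ptrs.end(),
              [](const std::string* a, const std::string* b) { return *a < *b; });

    // Direct sort: the string objects themselves are moved, so a merge-based
    // sort can stream through memory instead of chasing pointers.
    std::sort(strings.begin(), strings.end());
}
```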
In summary: direct radix-sorting of strings seems most promising. Where comparison sorting cannot be avoided, greeNsort® algorithms are an alternative, either direct Frogsort, Geckosort, Squidsort or indirect Zacksort, Zucksort.
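A toy sketch of byte-wise MSD radix sorting, which only produces a meaningful order if plain byte order is the agreed collation (the example strings are arbitrary):

```cpp
#include <array>
#include <string>
#include <vector>

// Toy MSD (most-significant-digit-first) radix sort on raw bytes:
// bucket 0 holds strings that end at this depth, buckets 1..256 hold byte+1.
void msd_radix_sort(std::vector<std::string>& v, size_t depth = 0) {
    if (v.size() < 2) return;
    std::array<std::vector<std::string>, 257> buckets;
    for (auto& s : v) {
        size_t b = depth < s.size() ? (unsigned char)s[depth] + 1 : 0;
        buckets[b].push_back(std::move(s));
    }
    v.clear();
    for (size_t b = 0; b < buckets.size(); ++b) {
        if (b > 0) msd_radix_sort(buckets[b], depth + 1);  // bucket 0 strings are already equal
        for (auto& s : buckets[b]) v.push_back(std::move(s));
    }
}

int main() {
    std::vector<std::string> v{"pear", "apple", "peach", "app"};
    msd_radix_sort(v);   // app, apple, peach, pear (byte order)
}
```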
Multi-threading your software delivers better performance as well as better energy-efficiency — Intel3
Not all algorithms parallelize equally well. For example, Mergesort with 100% buffer can be multithreaded very well, not only with parallel branch execution but also by parallelizing the work phase (parallel merge); the same is true for [Frogsort](glossary.html#frogsort), see the deep dive: multi-threading.
By contrast, the work phase of Quicksort (partitioning) has a strong serial character: parallel inplace-partitioning is possible, but complicated, and costs extra read or move operations on certain input patterns, see for example Tsigas & Zhang. Parallel partitioning gets easier when 100% buffer memory is allowed; however, there are still challenges regarding work distribution due to random pivots, particularly when ties occur. From a practical4 point of view, it seems beneficial not to insist on parallelizing pure Quicksort; it is much easier to follow the approach of Java 8 and use a parallel binary Mergesort (better: Frogsort) or a parallel Multiway-Mergesort with the help of a LoserTree at the top levels of the divide&conquer process, and switch to e.g. Quicksort (better: ZacksortB) once a branch becomes small enough to run it on a single thread.
In summary: parallel branching is easy, the parallel work phase is harder. Parallel merging is a practical approach to parallelizing the work phase in both merging sorts and the top levels of partitioning sorts; greeNsort® algorithms help to lower the costs of parallel merging.
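As a toy illustration of this difference, the following sketch parallelizes only the branches of a two-way Mergesort while the merge itself stays serial; real parallel merging (as in Java 8's parallel sort or the PSTL) also splits the work phase across threads:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Toy sketch: parallel branch execution with a serial work phase.
void branch_parallel_sort(std::vector<int>& v) {
    auto mid = v.begin() + v.size() / 2;
    auto left = std::async(std::launch::async,
                           [&] { std::sort(v.begin(), mid); });  // branch 1 on another thread
    std::sort(mid, v.end());                                     // branch 2 on this thread
    left.wait();
    std::inplace_merge(v.begin(), mid, v.end());                 // serial merge: the hard part
}

int main() {
    std::vector<int> v{5, 2, 8, 1, 9, 3};
    branch_parallel_sort(v);
}
```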
C is quirky, flawed, and an enormous success — Dennis Ritchie
C is a compiled language and many other languages have interfaces to C. C has a standard library function qsort, which takes four parameters.
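For reference, a minimal call showing the four parameters of qsort (base pointer, number of elements, element size, tri-valued comparator); the comparator for doubles is just an illustration:

```cpp
#include <cstdio>
#include <cstdlib>   // std::qsort

// Tri-valued comparator as required by qsort: negative, zero or positive.
static int cmp_double(const void* a, const void* b) {
    const double x = *static_cast<const double*>(a);
    const double y = *static_cast<const double*>(b);
    return (x > y) - (x < y);
}

int main() {
    double v[] = {3.0, 1.0, 2.0};
    // qsort(base, number of elements, size of one element, comparator)
    std::qsort(v, 3, sizeof(double), cmp_double);
    for (double x : v) std::printf("%g ", x);
}
```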
Limitations of the interface are:
The underlying implementation is a Quicksort3; its limitations are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper boolean operators (e.g. LESS) instead of more expensive tri-boolean comparators (see the sketch after this list):

- offer a new interface for unstable sorting of equally-sized elements (ZacksortB), also multi-threaded
- offer a new interface for stable 50%-buffer sorting of equally-sized elements (Frogsort1), also multi-threaded
- offer a new interface for stable low-buffer sorting of equally-sized elements (Frogsort2) with a parameter for the buffer fraction7
- offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)8
- offer a new interface for stable sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded
- offer a new interface for specific sorting of atomic types, also multi-threaded
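As a sketch of the proposed comparator style: a boolean LESS comparator needs a single comparison per call, whereas the tri-valued comparator demanded by qsort must distinguish three outcomes. The typedefs and the commented-out frogsort1 declaration below are purely illustrative; none of them exists in the C standard library.

```cpp
#include <cstddef>

// Existing qsort style: tri-valued comparator encoding <, == and >.
typedef int  (*trivalued_cmp)(const void*, const void*);
// Proposed cheaper style: boolean LESS comparator.
typedef bool (*less_cmp)(const void*, const void*);

// Hypothetical stable 50%-buffer sort in Frogsort1 style (illustration only):
// void frogsort1(void* base, size_t nmemb, size_t size, less_cmp less);

static bool less_double(const void* a, const void* b) {
    // one comparison per call, instead of distinguishing <, == and >
    return *static_cast<const double*>(a) < *static_cast<const double*>(b);
}

int main() {
    double x = 1.0, y = 2.0;
    return less_double(&x, &y) ? 0 : 1;   // trivial use of the boolean comparator
}
```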
There are only two kinds of languages: the ones people complain about and the ones nobody uses — Bjarne Stroustrup
C++ is a compiled language with template support for many data types, and many other languages have interfaces to C++. C++ has two standard-library sorting functions, unstable std::sort and std::stable_sort; furthermore C's qsort is available. Since C++11 the implementations must guarantee a \(\mathcal{O}(N \log(N))\) worst case.
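A minimal usage sketch of the two standard sorting functions; the last line shows the C++17 execution-policy overload discussed in the parallel STL section below (whether it actually runs in parallel, and whether it needs TBB, depends on the standard library implementation):

```cpp
#include <algorithm>
#include <execution>   // C++17 execution policies
#include <vector>

int main() {
    std::vector<double> v{3.0, 1.0, 2.0, 1.0};
    std::sort(v.begin(), v.end());                        // unstable, O(N log N) worst case since C++11
    std::stable_sort(v.begin(), v.end());                 // stable, merge-based
    std::sort(std::execution::par, v.begin(), v.end());   // parallel policy (C++17), see below
}
```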
std::stable_sort has a clever implementation of Mergesort that combines three types of merging (inplace-merge with \(\mathcal{O}(N \log(N))\) merge-cost, halfcopy-merge copying data to 50% buffer and merging back, nocopy-merge merging forward to 100% buffer). The algorithm tries to achieve maximum speed by allocating 100% buffer space; if less is obtained, it starts with a more expensive merge and switches to a cheaper one as soon as the available buffer is big enough relative to the problem size in the recursion; for example the biggest top-level merge is done with inplace-merge, the next level with halfcopy-merge and then with nocopy-merge. We found the following limitations of the implementation:
Opportunity for improving the implementation within the given interface:
std::sort is using Introsort, based on a simple Quicksort with a pivot deterministically chosen as a middle element. In order to avoid vulnerability to algorithmic complexity attacks (or simply unlucky input that leads to quadratic runtime or worse), it falls back on Heapsort if the recursion depth gets too deep. We found the following limitations of this implementation:
Opportunity for improving the implementation within the given interface:
std::sort in libc++ of the clang compiler uses a different implementation that covers more special cases and is tuned for tied and presorted data. Unfortunately it seems to contain a vulnerability to quadratic runtime reported in 2014 by Orson Peters, which has not been fixed since then14. Hence libc++ std::sort violates the C++ standard and cannot be recommended for production use.
Since C++17 the standard specifies parallel sorting with a serial and a multi-threaded policy, and since C++20 it adds a SIMD policy. GCC 9 implements the parallel Standard Template Library (STL) via Intel's Threading Building Blocks (TBB) library. However, the TBB has no stable sort and its unstable sort was incompletely parallelized (parallel branching but not parallel partitioning). In 2018 Intel open-sourced its Parallel Standard Template Library (PSTL), which implements a stable merge sort that parallelizes the branches and the merging. The LLVM PSTL now has the following limitations of the implementation:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunities for improving the interface:
I wanted a language as acceptable as Java or C#, but supporting a much simpler programming model — Rich Hickey
Clojure is a modern, dynamic, and functional dialect of the Lisp programming language on the Java platform; it uses the sorting library of Java.
Groovy’s sort ultimately calls Java’s Collections’ sort— Guillaume Laforge
Groovy is a Java-syntax-compatible object-oriented programming language for the Java platform; it uses the sorting library of Java.
Java is C++ without the guns, clubs and knives — James Gosling
Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. Java 7 introduced Dual-Pivot-Quicksort for primitive types and Timsort for object types behind its sort API, and Java 8 introduced the Arrays.parallelSort API, which uses Mergesort for the parallel top levels of recursion. From the greeNsort® perspective these are limitations of the implementation:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunity for enhancing the interface, all the following interfaces would benefit from cheaper boolean operators (e.g. LESS) instead of more expensive tri-boolean comparators:
The following languages benefit from improvements to the Java sorting library:
JRuby always has and always will try to match CRuby behavior — Charles Nutter
JRuby is an implementation of the Ruby programming language atop the Java Virtual Machine, written largely in Java. Actually, JRuby switched to Java’s sort library in 2016.
Develop with pleasure — Dmitry Jemerov
Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference. Kotlin is designed to interoperate fully with Java, and the JVM version of Kotlin’s standard library depends on the Java Class Library.
I stepped down as ‘Benevolent Dictator’, as I was called — Guido van Rossum
Python is an interpreted, high-level, general-purpose programming language written in C. Python 2 supported a cmp parameter for user-specified comparison functions; in Python 3 this was completely abandoned in favour of a key parameter for user-specified key-extraction functions. The Python 3 API has two methods: list.sort(), which sorts an object insitu, and sorted(), which sorts exsitu and does not modify its input.
These are limitations of the interface:
These are limitations of the implementation:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunity for enhancing the interface:
Less-well-known things I’m very proud that rust shipped 1.0 without — Graydon Hoare
Rust is a systems programming language similar to C++ or Java, but memory-safe without a garbage collector. It has sorting APIs for stable and unstable sorting using a comparison function or a key-extraction function. The stable sort uses a lean version of Timsort without galloping mode, and sort_unstable uses Pdqsort. We found the sorting libraries slower than C/C++ implementations.
Limitations of the implementation are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface:
sort API with an optional parameter for the buffer fraction (Squidsort2) and optionally multi-threaded

Opportunities for improving the interface:
It stands for scalable language — Martin Odersky
Scala is a general-purpose programming language providing support for both object-oriented programming and functional programming. Scala provides language interoperability with Java, so that libraries written in either language may be referenced directly in Scala or Java code. Scala's Sorting object provides convenience wrappers for java.util.Arrays.sort, see Java. Beyond that, Scala implements its own sorting algorithms based on a tri-boolean comparator: a counting sort for tabulating booleans, a primitive Mergesort (Copysort) that copies data to 100% buffer and merges back, and a ternary Quicksort3.
Obvious limitations of the Quicksort implementation are:
Obvious limitations of the Mergesort implementation are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface:
LESS comparison function and optionally multi-threaded

The limits of my language mean the limits of my world — Ludwig Wittgenstein
University of Zurich, Informatics and Sustainability Research↩︎
Note that the very efficient IPS4o is a parallel k-ary partitioning algorithm requiring only \(\mathcal{O}(\sqrt{N})\) buffer, however, at the price of a quite complicated implementation.↩︎
October 2020↩︎
not needed when random-pivots are used, since Zucksort has probabilistic convergence guarantees↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
October 2020↩︎
Related to this, I found a logical glitch: __inplace_merge tries to get 100% buffer although it never uses more than 50% when calling std::__merge_adaptive↩︎
alternatives are the slightly faster but less adaptive Knuthsort4 or the slightly slower but more adaptive Octosort↩︎
measured with sorting doubles↩︎
for graceful degradation in the context of sorting see Implementing Sorting in Database Systems↩︎
The maintainers neither fixed this the usual way with Heapsort nor did they switch to Pdqsort offered by Orson Peters, which is now in boost and Rust↩︎
A more complicated alternative would be using the completely parallel Quicksort of Tsigas & Zhang based on the implementation of Singler & Sanders in the GCC Parallel Mode; however, that is not adaptive to ties. Another complicated alternative would be IPS4o by Axtmann, Witt, Ferizovic & Sanders, but note that the serial version of IPS4o is slower than optimal Quicksort algorithms.↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
The documentation unteaches the well-founded distinction between \(\mathcal{O}(N)\) merging and \(\mathcal{O}(N\log{N})\) sorting: “It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array”↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
The documentation claims that “The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset”, but in fact it is the stability of Timsort that enables sorting by multiple keys; taking advantage of a previous sort would require the two keys to be correlated, not orthogonal, and indeed rather perfectly correlated↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
Multithreading is possible but more complicated↩︎
not needed when random-pivots are used, since Zacksort has probabilistic convergence guarantees↩︎
not needed when random-pivots are used, since Zacksort has probabilistic convergence guarantees↩︎