Many people in different roles can contribute to reducing the CO2-footprint of sorting: CS researchers and teachers, library maintainers, and application programmers.
We welcome comments on other languages, corrections, updates and - most importantly - improvements at feedback(at)greensort.org .
The greeNsort® portfolio of algorithms provides more sustainable alternatives to implementing your own sorting code with popular algorithms such as Quicksort2, Quicksort3, Mergesort, Timsort, Peeksort, Powersort.
The following greeNsort® algorithms are available to create better implementations:
For more information on the algorithms see the Portfolio.
Insensitive choice of algorithms and data structures lead to significant energy wasting — Fethullah Goekkus1
When choosing sorting algorithms consider the following:
Compatibility means deliberately repeating other people’s mistakes — David Wheeler
All the following programming languages had their reasons to shape their sorting APIs and implementations as they did; however, that was before the climate crisis. Now programming languages as a whole are measured by their energy efficiency, see Energy Efficiency across Programming Languages, and - no surprise - it turns out that some languages are more efficient than others, particularly compiled languages, but also Java. Sadly, some of the languages gaining popularity for expensive big-data applications are interpreted (R and Python). If - for some reason - people need to stick with those instead of switching to more efficient languages (e.g. Julia), then it becomes even more important to maximize the efficiency of the compiled sorting libraries used.
Three types of actions can improve sorting efficiency in standard libraries:
1. improving implementations (within unchanged interfaces)
2. enhancing interfaces (new parameters, new interfaces)
3. replacing interfaces (changing existing interfaces for the better)

Action 1. gives quick wins, 2. should be doable, and perhaps even 3.
Before we comment on particular languages, we shortly discuss some general topics: quick wins from better implementations, optimal interfaces, string sorting, and multi-threading.
The following quick-wins can be achieved by improving implementations:
| Language | Opportunity | Action |
|---|---|---|
| C | faster sorting still adaptive to ties | replace Quicksort3 with ZacksortB |
| C++ | faster sorting with the same amount of memory or simply save memory | replace Quicksort2 with ZacksortB or Pdqsort, at 0%-buffer replace inplace-mergesort with Grailsort, at negligible sqrt(N) buffer add Walksort, at parametrizable 14% buffer add Frogsort2, replace 50%-buffer Mergesort with Frogsort1, replace 100%-buffer Mergesort with the more adaptive Omitsort |
| Clojure | see Java | see Java |
| Groovy | see Java | see Java |
| Java | faster sorting still adaptive to presorting, faster sorting still adaptive to ties | replace Timsort with Squidsort1, replace Dupisort with ZacksortB |
| JRuby | see Java | see Java |
| Kotlin | see Java | see Java |
| Python | faster sorting still adaptive to presorting | replace Timsort with Squidsort1 |
| Rust | faster sorting still adaptive to presorting | replace Timsort with Squidsort1 |
| Scala | faster sorting still adaptive to ties, half buffer and half copying, see Java | replace Quicksort3 with ZacksortB, replace Copysort with Frogsort1, see Java |
After quick-wins, more can be saved by creating optimal interfaces.
How such interfaces look in detail, and whether they need just enhancement or full replacement, strongly depends on the programming language.
Any frequently executed code that manipulates strings is fertile ground for optimization — Kurt Guntheroth2
String sorting is unduly expensive. The reason is
localization: respecting the locale order of letters prevents
us from using a simple efficient radix sort on strings. Instead we use a
more expensive comparison sort, and for each comparison we call an
expensive function handling dozens of special cases of the given locale.
We found localized string sorting with C++ std::sort
up to
40 times more expensive than radix sorting with Skasort. The most effective
way to reduce the cost of string sorting would be an international
agreement on a character set (UTF-8) and no locale, i.e. simple byte
order.
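To illustrate why localized comparisons are so costly, here is a minimal C++ sketch (not greeNsort® code) that contrasts a locale-aware comparator built on `std::collate` with plain byte-order comparison; the byte-order variant is the kind of ordering an agreed UTF-8-without-locale convention would allow and is also amenable to radix sorting:

```cpp
// Sketch: locale-aware vs. byte-order string sorting (illustrative only).
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

void sort_localized(std::vector<std::string>& v, const std::locale& loc) {
    // Every comparison goes through the locale's collation facet, which has to
    // handle dozens of special cases -- far more expensive than comparing bytes.
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    std::sort(v.begin(), v.end(),
              [&coll](const std::string& a, const std::string& b) {
                  return coll.compare(a.data(), a.data() + a.size(),
                                      b.data(), b.data() + b.size()) < 0;
              });
}

void sort_bytewise(std::vector<std::string>& v) {
    // Simple byte order: operator< compares char by char.
    std::sort(v.begin(), v.end());
}
```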
All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection — David Wheeler
There is another reason that string sorting is unduly expensive: the prior art often sorts strings indirectly by moving pointers to strings. This seems clever because - if the pointer is smaller than the string - it can theoretically save movement costs in true random-access memory. It is also convenient to use Quicksort with pointers, because we know Quicksort to be fast, and Quicksort cannot handle size-varying strings, only equally-sized elements such as pointers. However, in today’s deep cache-hierarchies this approach doesn’t scale to big \(N\): access costs depend on access distance, and moving pointers instead of strings implies random access to the memory where the strings are actually stored, whether or not the strings are finally re-arranged. A suitably implemented Mergesort can directly move strings during the merge-passes; this can easily reduce sorting time and energy by 33%. The greeNsort® algorithm Frogsort can do the same, just with less memory.
In summary: direct radix-sorting of strings seems most promising. Where comparison sorting cannot be avoided, greeNsort® algorithms are an alternative, either the direct Frogsort, Geckosort and Squidsort, or the indirect Zacksort and Zucksort.
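A minimal sketch of the two access patterns, using only standard C++ (not the greeNsort® implementations): the indirect variant sorts pointers and dereferences into distant memory on every comparison, while the direct variant lets `std::stable_sort` rearrange the elements themselves during its merge passes. Note that `std::string` keeps long character data on the heap, so a true direct string sort as described above would move the character data itself:

```cpp
// Sketch: indirect (pointer) sorting vs. direct sorting of the elements.
#include <algorithm>
#include <string>
#include <vector>

// Indirect: only pointers are moved, but every comparison dereferences into
// the memory where the strings actually live -- random access for big N.
void sort_indirect(std::vector<const std::string*>& ptrs) {
    std::sort(ptrs.begin(), ptrs.end(),
              [](const std::string* a, const std::string* b) { return *a < *b; });
}

// Direct: a merging sort moves the elements themselves during its merge passes,
// keeping accesses local to the ranges being merged.
void sort_direct(std::vector<std::string>& strings) {
    std::stable_sort(strings.begin(), strings.end());
}
```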
Multi-threading your software delivers better performance and as well as better energy-efficiency — Intel3
Not all algorithms parallelize equally well. For example, Mergesort with 100% buffer can be multithreaded very well, not only with parallel branch execution but also by parallelizing the work phase (parallel merge); the same is true for [Frogsort](glossary.html#frogsort), see the deep dive: multi-threading.
By contrast, the work phase of Quicksort (partitioning) has a strong serial character: parallel inplace-partitioning is possible, but complicated, and costs extra read or move operations on certain input patterns, see for example Tsigas & Zhang. Parallel partitioning gets easier when 100% buffer memory is allowed; however, there are still challenges regarding work distribution due to random pivots, particularly when ties occur. From a practical4 point of view, it seems beneficial not to insist on parallelizing pure Quicksort: it is much easier to follow the approach of Java 8 and use a parallel binary Mergesort (better Frogsort or a parallel Multiway-Mergesort with the help of a LoserTree) at the top levels of the divide&conquer process, and switch to e.g. Quicksort (better ZacksortB) once a branch becomes small enough to run on a single thread.
In summary: parallel branching is easy, a parallel work phase is harder. Parallel merging is a practical approach to parallelizing the work phase both in merging sorts and in the top levels of partitioning sorts; greeNsort® algorithms help to lower the costs of parallel merging.
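As a toy C++ sketch of the easy part, the following mergesort parallelizes only the branch execution with `std::async`; the merge (the work phase) stays serial, which is exactly where parallel merging or greeNsort® algorithms such as Frogsort would improve matters:

```cpp
// Sketch: mergesort with parallel branch execution; the merge stays serial.
#include <algorithm>
#include <future>
#include <vector>

template <typename T>
void parallel_mergesort(std::vector<T>& v, std::size_t lo, std::size_t hi, int depth) {
    if (hi - lo < 2) return;
    if (depth <= 0) {                                   // leaf: run single-threaded
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // Parallel branching: the two halves are independent, so this part is easy.
    auto left = std::async(std::launch::async,
                           [&] { parallel_mergesort(v, lo, mid, depth - 1); });
    parallel_mergesort(v, mid, hi, depth - 1);
    left.wait();
    // Serial work phase: parallel merging would be needed to speed this up too.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```

A call like `parallel_mergesort(data, 0, data.size(), 3)` uses up to roughly 2^3 threads for the branches, while the top-level merges remain the serial bottleneck.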
C is quirky, flawed, and an enormous success — Dennis Ritchie
C is a compiled language and many other languages have interfaces to C. C has a standard library function `qsort` which takes four parameters.
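For reference, the four parameters are the base pointer of the array, the number of elements, the element size in bytes, and a tri-Boolean comparison function; a minimal usage sketch (callable from C++ via `<cstdlib>`):

```cpp
// Sketch: the classic qsort interface.
#include <cstdlib>

// Tri-Boolean comparator: returns <0, 0 or >0.
static int compare_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

int main() {
    int data[] = {3, 1, 4, 1, 5, 9, 2, 6};
    // base pointer, element count, element size, comparison function
    std::qsort(data, sizeof(data) / sizeof(data[0]), sizeof(int), compare_int);
}
```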
Limitations of the interface are:
The underlying algorithm is a Quicksort3, and limitations of the implementation are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface; all the following interfaces would benefit from cheaper Boolean operators (e.g. `LESS`) instead of more expensive tri-Boolean comparators (see the sketch after this list):

- offer a new interface for unstable sorting of equally-sized elements (ZacksortB), also multi-threaded
- offer a new interface for stable 50%-buffer sorting of equally-sized elements (Frogsort1), also multi-threaded
- offer a new interface for stable low-buffer sorting of equally-sized elements (Frogsort2) with a parameter for the buffer fraction7
- offer a new interface for stable sqrt-buffer sorting of equally-sized elements (Walksort)8
- offer a new interface for stable sorting of size-varying elements (or at least for efficient string-sorting in arbitrary collation), also multi-threaded
- offer a new interface for specific sorting of atomic types, also multi-threaded
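To make the comparator distinction concrete, here is an illustrative C++ sketch; the `zacksortB` signature in the comment is hypothetical, not an existing API:

```cpp
// Sketch: tri-Boolean comparator vs. cheaper Boolean LESS (illustrative only).

// Tri-Boolean (qsort-style): must distinguish <, == and >, which typically
// costs two comparisons (or a subtraction) per call.
static int tri_compare(const void* a, const void* b) {
    const double x = *static_cast<const double*>(a);
    const double y = *static_cast<const double*>(b);
    return (x > y) - (x < y);
}

// Boolean LESS: one comparison answers the only question most sorting
// algorithms need per step. A hypothetical enhanced interface could accept it:
//   void zacksortB(void* base, size_t n, size_t size,
//                  bool (*less)(const void*, const void*));
static bool less_compare(const void* a, const void* b) {
    return *static_cast<const double*>(a) < *static_cast<const double*>(b);
}
```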
There are only two kinds of languages: the ones people complain about and the ones nobody uses — Bjarne Stroustrup
C++ is a compiled language with template support for many data-types, and many other languages have interfaces to C++. C++ has two standard-library sorting functions, unstable `std::sort` and `std::stable_sort`; furthermore C’s `qsort` is available. Since C++11 the implementations must guarantee an \(\mathcal{O}(N \log(N))\) worst case.
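For orientation, a minimal usage sketch of the two interfaces with a Boolean less-than comparator; the implementations behind them are discussed next:

```cpp
// Sketch: the two C++ standard-library sorting interfaces.
#include <algorithm>
#include <string>
#include <vector>

struct Person { std::string name; int age; };

int main() {
    std::vector<Person> people = {{"Ada", 36}, {"Bob", 29}, {"Cal", 36}};
    auto by_age = [](const Person& a, const Person& b) { return a.age < b.age; };

    // Unstable: equal elements may be reordered (Introsort, see below).
    std::sort(people.begin(), people.end(), by_age);

    // Stable: equal elements keep their relative order (adaptive Mergesort, see below).
    std::stable_sort(people.begin(), people.end(), by_age);
}
```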
std::stable_sort
has a clever implementation of Mergesort that combines
three types of merging (inplace-merge with \(\mathcal{O}(N \log(N))\) merge-cost,
halfcopy-merge copying data to 50% buffer and merging back, nocopy-merge
merging forward to 100% buffer). The algorithm tries to achieve maximum
speed by allocating 100% buffer space; if less is obtained it starts
with a more expensive merge and switches to a cheaper one as soon as the
available buffer is big enough relative to the problem size in the
recursion; for example the biggest top-level-merge is done with
inplace-merge, the next level with halfcopy-merge and then nocopy-merge.
We found the following limitations of the
implementation:
Opportunity for improving the implementation within the given interface:
`std::sort` uses Introsort, based on a simple Quicksort with a pivot deterministically chosen as a middle element. In order to avoid vulnerability to algorithmic complexity attacks (or simply unlucky input that leads to quadratic runtime or worse), it falls back on Heapsort if the recursion depth gets too deep. We found the following limitations of this implementation:
Opportunity for improving the implementation within the given interface:
`std::sort` in libc++ of the clang compiler uses a different implementation that covers more special cases and is tuned for tied and presorted data. Unfortunately it seems to contain a vulnerability to quadratic runtime, reported in 2014 by Orson Peters, which has not been fixed since then14. Hence libc++ `std::sort` violates the C++ standard and cannot be recommended for production use.
Since C++17 the standard specifies parallel sorting with a serial and a multi-threaded policy; since C++20 it adds a SIMD-policy (see the sketch at the end of this section). GCC 9 implements the parallel Standard Template Library (STL) via Intel’s Threading Building Blocks (TBB) library. However, TBB has no stable sort, and its unstable sort was incompletely parallelized (parallel branching but not parallel partitioning). In 2018 Intel open-sourced its Parallel Standard Template Library (PSTL), which implements a stable merge sort that parallelizes both the branches and the merging. The LLVM PSTL now has the following implementation limitations:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunities for improving the interface:
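A minimal sketch of the policy-based interface mentioned above; note that with libstdc++ the parallel policies need a backend such as TBB to be installed:

```cpp
// Sketch: C++17/C++20 execution policies for standard sorting.
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v = {3.1, 1.4, 1.5, 9.2, 6.5, 3.5};

    std::sort(std::execution::seq, v.begin(), v.end());        // serial policy
    std::sort(std::execution::par, v.begin(), v.end());        // multi-threaded policy
    std::sort(std::execution::par_unseq, v.begin(), v.end());  // threads + SIMD
    std::sort(std::execution::unseq, v.begin(), v.end());      // C++20: SIMD-only policy

    // A stable parallel sort is specified as well:
    std::stable_sort(std::execution::par, v.begin(), v.end());
}
```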
I wanted a language as acceptable as Java or C#, but supporting a much simpler programming model — Rich Hickey
Clojure is a modern, dynamic, and functional dialect of the Lisp programming language on the Java platform; it uses the sorting library of Java.
Groovy’s sort ultimately calls Java’s Collections’ sort — Guillaume Laforge
Groovy is a Java-syntax-compatible object-oriented programming language for the Java platform; it uses the sorting library of Java.
Java is C++ without the guns, clubs and knives — James Gosling
Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. Java 7 introduced Dual-Pivot-Quicksort for primitive types and Timsort for object-types behind its `sort` API, and Java 8 introduced the `Arrays.parallelSort` API, which uses Mergesort for the parallel top-levels of recursion. From the greeNsort® perspective these are limitations of the implementation:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunity for enhancing the interface; all the following interfaces would benefit from cheaper Boolean operators (e.g. `LESS`) instead of more expensive tri-Boolean comparators:
The following languages benefit from improvements to the Java sorting library:
JRuby always has and always will try to match CRuby behavior — Charles Nutter
JRuby is an implementation of the Ruby programming language atop the Java Virtual Machine, written largely in Java. Actually, JRuby switched to Java’s sort library in 2016.
Develop with pleasure — Dmitry Jemerov
Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference. Kotlin is designed to interoperate fully with Java, and the JVM version of Kotlin’s standard library depends on the Java Class Library.
I stepped down as ‘Benevolent Dictator’, as I was called — Guido van Rossum
Python is an interpreted, high-level and general-purpose programming language written in C. Python 2 supported a `cmp` parameter to handle user-specified comparison functions; in Python 3 this was completely abandoned in favor of a `key` parameter to use specified key-extraction functions. The Python 3 API has two entry points: the `list.sort()` method, which sorts an object in-situ, and the `sorted()` function, which sorts ex-situ and does not modify its input.
These are limitations of the interface:
These are limitations of the implementation:
Opportunities for improving the implementation within the given interface:
Opportunities for enhancing the interface:
Opportunity for enhancing the interface:
Less-well-known things I’m very proud that rust shipped 1.0 without — Graydon Hoare
Rust is a systems programming language similar to C++ or Java but memory-safe without a garbage-collector. It has sorting APIs for stable and unstable sorting using a comparison-function or a key-extraction-function. Stable `sort` uses a lean version of Timsort without galloping-mode, and `sort_unstable` uses Pdqsort. We found the sorting libraries slower than C/C++ implementations.
Limitations of the implementation are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface:
- a `sort` API with an optional parameter for the buffer fraction (Squidsort2) and optionally multi-threaded

Opportunities for improving the interface:
It stands for scalable language — Martin Odersky
Scala is a general-purpose programming language providing support for both object-oriented programming and functional programming. Scala provides language interoperability with Java, so that libraries written in either language may be referenced directly in Scala or Java code. Scala’s Sorting object provides convenience wrappers for `java.util.Arrays.sort`, see Java. Beyond that Scala implements its own sorting algorithms based on a tri-Boolean comparator: a counting sort for tabulating Booleans, a primitive Mergesort (Copysort) that copies data to 100% buffer and merges back, and a ternary Quicksort3.
Obvious limitations of the Quicksort implementation are:
Obvious limitations of the Mergesort implementation are:
Opportunity for improving the implementation within the given interface:
Opportunity for enhancing the interface:
- `LESS` comparison function and optionally multi-threaded

Please inform us about
The limits of my language mean the limits of my world — Ludwig Wittgenstein
University of Zurich, Informatics and Sustainability Research↩︎
“Developing Green Software”, no longer available at software.intel.com/content/dam/develop/external/us/en/documents/developing-green-software-183291.pdf↩︎
Note that the very efficient IPS4o is a parallel k-ary partitioning algorithm requiring only \(\mathcal{O}(\sqrt{N})\) buffer, however, at the price of a quite complicated implementation.↩︎
October 2020↩︎
not needed when random-pivots are used, since Zucksort have probabilistic convergence guarantees↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
October 2020↩︎
Related to this, I found a logical glitch: `__inplace_merge` tries to get 100% buffer although it never uses more than 50% when calling `std::__merge_adaptive`↩︎
alternatives are the slightly faster but less adaptive Knuthsort4 or the slightly slower but more adaptive Octosort↩︎
measured with sorting doubles↩︎
for graceful degradation in the context of sorting see Implementing Sorting in Database Systems↩︎
The maintainers neither fixed this the usual way with Heapsort nor did they switch to Pdqsort offered by Orson Peters, which is now in Boost and Rust↩︎
A more complicated alternative would be using the completely parallel Quicksort of Tsigas & Zhang based on the implementation of Singler & Sanders in the GCC Parallel Mode; however, that is not adaptive to ties. Another complicated alternative would be IPS4o by Axtmann, Witt, Ferizovic & Sanders, but note that the serial version of IPS4o is slower than optimal Quicksort algorithms.↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
The documentation unteaches the well-founded distinction between \(\mathcal{O}(N)\) merging and \(\mathcal{O}(N\log{N})\) sorting: “It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array”↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
Multithreading is possible but more complicated↩︎
Multithreading is possible but more complicated↩︎
The documentation claims that “The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset”, but in fact it is the stability of Timsort that enables sorting by multiple keys; taking advantage of a previous sort requires the two keys to be correlated rather than orthogonal, and indeed rather perfectly correlated↩︎
Note that parallelizing the low-memory versions will be more challenging↩︎
Multithreading is possible but more complicated↩︎
not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎
not needed when random-pivots are used, since Zacksort have probabilistic convergence guarantees↩︎