What makes blqsort faster than almost any other Quicksort around – with C and C++ interfaces

This blog supplements and corrects this post in a few respects.: Fast Branchless Quicksort.

The following benchmarks show the execution times for sorting 50 million double values using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an Intel i5-8400 Linux system using GCC, both compiled with the -O3 option.

Implementation Apple M1 Intel i5-8400
std::sort 1.33s 3.97s
blqsort (old version) 0.93s 1.29s
blqsort 0.66 1.29s

For a fair comparison, the single-threaded version of blqs was used here. On an M1, the threaded versions are another factor of 3 faster.

blqsort

Full source code is included on Github. There are four implementations of blqsort here, each provided as a single header file.

File Description
blqs.h C++ Single-Threaded
blqs_thr.h C++ Multi-Threaded
blqsort.h C Single-Threaded
blqsort_thr.h C Multi-Threaded

Branchless programming

On modern CPUs, avoiding branch mispredictions is a key technique to speed up programs. This:

for (int i = 0; i < 10000; i++) {
    if (numbers[i] < 500) {
        small_numbers[smlen] = numbers[i];
        smlen += 1;
    }
}
is slower than this:

for (int i = 0; i < 10000; i++) {
    small_numbers[smlen] = numbers[i];
    smlen += (numbers[i] < 500);
}
When ‘if’ slows you down, avoid it.

Partitionierung in Quicksort

The following discussion assumes data types that are cheap to copy, such as int, long, double, int128, pointers, or small structs.

So far so good. In Quicksort, the partition step splits values into two halves: values smaller than the pivot go to the left, larger ones to the right. Conceptually, this can be implemented like this. This version is not in-place (as Quicksort usually is), but that is not important for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE 1000000
int inp[SIZE];
int outp[SIZE];

double ts(void) {
    static double t0;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    double h = t0;
    t0 = tv.tv_sec + tv.tv_usec / 1000000.0;
    return t0 - h;
}
void init(void) {
    for (int i = 0; i < SIZE; i++) inp[i] = rand() % 1000;
    ts();
}
void partition(void) {
    int* left = outp;
    int* right = outp + SIZE - 1;
    int i = 0;
    while (i < SIZE) {
        int n = inp[i++];
        if (n < 500) {
            *left = n;
            left++;
        }
        else {
            *right = n;
            right--;
        }
    }
}
int main(void) {
    init();
    for (int i = 0; i < 1000; i++) partition();
    printf("Time: %.2f\n", ts());
}
On an M1 macOS system:

Time: 3.26
Branchless version:

void partition(void) {
    int* left = outp;
    int* right = outp + SIZE - 1;
    int i = 0;
    while (i < SIZE) {
        int n = inp[i++];
        int cond = n < 500;
        *left = n;
        left += cond;
        *right = n;
        right -= !cond;
    }
}
Time: 0.69
This is roughly in line with expectations.

A subtle but surprising change

Now we make a tiny change to the original if version:

Instead of *left = n; left++; we write: *left++ = n;

Until recently, I would have bet this was purely cosmetic and would generate identical machine code. But:

void partition(void) {
    int* left = outp;
    int* right = outp + SIZE - 1;
    int i = 0;
    while (i < SIZE) {
        int n = inp[i++];
        if (n < 500) *left++ = n;
        else *right-- = n;
    }
}
Time if: 0.58
That is quite surprising. What happened?

The compiler generates CSEL (conditional select) instructions instead of branches. In this configuration, that works extremely well: the CPU effectively computes both results and then selects one, avoiding a branch. That’s why the ++/-- must be part of the left-hand side expression.

On ARM, CSEL is very fast.

Example assembly (simplified):

10000066c:
    ldr  w14, [x10], #4     ; val = *in++
    cmp  w14, #0x1f4        ; val < 500
    csel x15, x9, x8, lt    ; dest = left/right
    csel x16, xzr, x12, lt  ; +4 or 0
    add  x8, x8, x16        ; left++
    csel x16, x13, xzr, lt  ; -4 or 0
    add  x9, x9, x16        ; right--
    str  w14, [x15]         ; *dest = val
    subs x11, x11, #1       ; --
    b.ne 0x10000066c        ; loop

Results on Intel i5

Compact if version (with gcc)

Time: 3.12
Clearly branch-based code was generated.

Branchless version with GCC:

Time: 0.75
Much faster (similar to branchless on M1)

Compact if version with Clang

Time: 0.76
Here, the compiler also emits specialized instructions, namely CMOV (conditional move). However, they are not quite as effective as ARM’s CSEL.

    mov  edi, [rdx + rcx*4]      ; val = in[i]
    xor  r8d, r8d
    cmp  edi, 0x1f4              ; val < 500
    setge r8b                    ; r8 = (>=500)
    mov  r9, rax
    cmovl r9, rsi                ; dest = left/right
    shl  r8d, 2                  ; 0 or 4
    add  rax, r8                 ; left += (>=500 ? 4 : 0)
    lea  r10, [rsi + r8]
    add  r8, rsi
    sub  r8, 4                   ; right--
    mov  [r9], edi               ; store val
    inc  ecx                     ; i++
    cmp  ecx, 0xf4240
    jne  loop

Loop unrolling

The code can be improved further via loop unrolling. The compiler detects that it can unroll the inner loop in the first loop, writes the loop body 16 times consecutively (in this example), and removes the inner loop.

void partition(void) {
    int* left = outp;
    int* right = outp + SIZE - 1;
    int i = 0;
    while (i < SIZE - 16) {
        for (int j = 16; --j;) {
            int n = inp[i++];
            if (n < 500) *left++ = n;
            else *right-- = n;
        }
    }
    while (i < SIZE) {
        int n = inp[i++];
        if (n < 500) *left++ = n;
        else *right-- = n;
    }
}
On the M1 this yields:

Time: 0.50

Consequences

There are now different code paths: one for the branchless variant and one for the if variant, which on macOS (Clang/ARM64) compiles to efficient CSEL instructions.

In the blqs header, it is currently defined as:

#if defined(__arm64__)
    #define PREFER_IF 1
#else
    #define PREFER_IF 0
#endif
and used in the code like this:

#if PREFER_IF
        T x = *left++;
        if (comp(x, piv)) *lwr++ = x;
        else *sw++ = x;
#else
        int h = comp(*left, piv);
        *lwr = *sw = *left++;
        lwr += h; sw += !h;
#endif
Since Clang on Linux also generates efficient code for the if-version, the following variant may also make sense:

#if defined(__GNUC__) && !defined(__clang__)
    #define PREFER_IF 0
#else
    #define PREFER_IF 1
#endif

Further details

Additional topics covered in blqsort (such as in-place buffering, sorting networks, support for non-trivially copyable types, etc.) are described here:

Fast Branchless Quicksort.

Links

Interactive sorting demo


christof.kaser@gmail.com