This blog supplements and corrects this post in a few respects.: Fast Branchless Quicksort.
The following benchmarks show the execution times for sorting 50 million double values using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an Intel i5-8400 Linux system using GCC, both compiled with the -O3 option.
| Implementation | Apple M1 | Intel i5-8400 |
|---|---|---|
| std::sort | 1.33s | 3.97s |
| blqsort (old version) | 0.93s | 1.29s |
| blqsort | 0.66 | 1.29s |
For a fair comparison, the single-threaded version of blqs was used here. On an M1, the threaded versions are another factor of 3 faster.
Full source code is included on Github. There are four implementations of blqsort here, each provided as a single header file.
| File | Description |
|---|---|
| blqs.h | C++ Single-Threaded |
| blqs_thr.h | C++ Multi-Threaded |
| blqsort.h | C Single-Threaded |
| blqsort_thr.h | C Multi-Threaded |
On modern CPUs, avoiding branch mispredictions is a key technique to speed up programs. This:
for (int i = 0; i < 10000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
is slower than this:
for (int i = 0; i < 10000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
When ‘if’ slows you down, avoid it.
The following discussion assumes data types that are cheap to copy, such as int, long, double, int128, pointers, or small structs.
So far so good. In Quicksort, the partition step splits values into two halves: values smaller than the pivot go to the left, larger ones to the right. Conceptually, this can be implemented like this. This version is not in-place (as Quicksort usually is), but that is not important for illustration:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define SIZE 1000000
int inp[SIZE];
int outp[SIZE];
double ts(void) {
static double t0;
struct timeval tv;
gettimeofday(&tv, NULL);
double h = t0;
t0 = tv.tv_sec + tv.tv_usec / 1000000.0;
return t0 - h;
}
void init(void) {
for (int i = 0; i < SIZE; i++) inp[i] = rand() % 1000;
ts();
}
void partition(void) {
int* left = outp;
int* right = outp + SIZE - 1;
int i = 0;
while (i < SIZE) {
int n = inp[i++];
if (n < 500) {
*left = n;
left++;
}
else {
*right = n;
right--;
}
}
}
int main(void) {
init();
for (int i = 0; i < 1000; i++) partition();
printf("Time: %.2f\n", ts());
}
On an M1 macOS system:
Time: 3.26
Branchless version:
void partition(void) {
int* left = outp;
int* right = outp + SIZE - 1;
int i = 0;
while (i < SIZE) {
int n = inp[i++];
int cond = n < 500;
*left = n;
left += cond;
*right = n;
right -= !cond;
}
}
Time: 0.69
This is roughly in line with expectations.
Now we make a tiny change to the original if version:
Instead of *left = n; left++; we write: *left++ = n;
Until recently, I would have bet this was purely cosmetic and would generate identical machine code. But:
void partition(void) {
int* left = outp;
int* right = outp + SIZE - 1;
int i = 0;
while (i < SIZE) {
int n = inp[i++];
if (n < 500) *left++ = n;
else *right-- = n;
}
}
Time if: 0.58
That is quite surprising. What happened?
The compiler generates CSEL (conditional select) instructions instead of branches. In this configuration, that works extremely well: the CPU effectively computes both results and then selects one, avoiding a branch. That’s why the ++/-- must be part of the left-hand side expression.
On ARM, CSEL is very fast.
Example assembly (simplified):
10000066c:
ldr w14, [x10], #4 ; val = *in++
cmp w14, #0x1f4 ; val < 500
csel x15, x9, x8, lt ; dest = left/right
csel x16, xzr, x12, lt ; +4 or 0
add x8, x8, x16 ; left++
csel x16, x13, xzr, lt ; -4 or 0
add x9, x9, x16 ; right--
str w14, [x15] ; *dest = val
subs x11, x11, #1 ; --
b.ne 0x10000066c ; loop
Compact if version (with gcc)
Time: 3.12
Clearly branch-based code was generated.
Branchless version with GCC:
Time: 0.75
Much faster (similar to branchless on M1)
Compact if version with Clang
Time: 0.76
Here, the compiler also emits specialized instructions, namely CMOV (conditional move). However, they are not quite as effective as ARM’s CSEL.
mov edi, [rdx + rcx*4] ; val = in[i]
xor r8d, r8d
cmp edi, 0x1f4 ; val < 500
setge r8b ; r8 = (>=500)
mov r9, rax
cmovl r9, rsi ; dest = left/right
shl r8d, 2 ; 0 or 4
add rax, r8 ; left += (>=500 ? 4 : 0)
lea r10, [rsi + r8]
add r8, rsi
sub r8, 4 ; right--
mov [r9], edi ; store val
inc ecx ; i++
cmp ecx, 0xf4240
jne loop
The code can be improved further via loop unrolling. The compiler detects that it can unroll the inner loop in the first loop, writes the loop body 16 times consecutively (in this example), and removes the inner loop.
void partition(void) {
int* left = outp;
int* right = outp + SIZE - 1;
int i = 0;
while (i < SIZE - 16) {
for (int j = 16; --j;) {
int n = inp[i++];
if (n < 500) *left++ = n;
else *right-- = n;
}
}
while (i < SIZE) {
int n = inp[i++];
if (n < 500) *left++ = n;
else *right-- = n;
}
}
On the M1 this yields:
Time: 0.50
There are now different code paths: one for the branchless variant and one for the if variant, which on macOS (Clang/ARM64) compiles to efficient CSEL instructions.
In the blqs header, it is currently defined as:
#if defined(__arm64__)
#define PREFER_IF 1
#else
#define PREFER_IF 0
#endif
and used in the code like this:
#if PREFER_IF
T x = *left++;
if (comp(x, piv)) *lwr++ = x;
else *sw++ = x;
#else
int h = comp(*left, piv);
*lwr = *sw = *left++;
lwr += h; sw += !h;
#endif
Since Clang on Linux also generates efficient code for the if-version, the following variant may also make sense:
#if defined(__GNUC__) && !defined(__clang__)
#define PREFER_IF 0
#else
#define PREFER_IF 1
#endif
Additional topics covered in blqsort (such as in-place buffering, sorting networks, support for non-trivially copyable types, etc.) are described here:
christof.kaser@gmail.com