*Real Programmers* will *love* this! (Except that they always knew…)

At CppCon’15, Chandler Carruth, lead of the Clang team at Google, gave a great talk titled “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!”. The talk, which is mostly live-coding and micro-benchmarking, is both informative and entertaining. Among other things, Chandler presents an optimized version of the good old modulus. Let `n` be a non-negative and `m` a positive integer. Then he proposes replacing `n % m` by `n < m ? n : n % m` because – after all – if `n` is less than `m`, there is nothing to compute.

What was supposedly meant as a joke, proposing a dumb micro-optimization that actually makes the code run slower, turned out to be a genuine improvement. This is astonishing because it violates just about every principle we have learned about writing fast code. In particular:

- Don’t try to outsmart the compiler’s optimizer.
- Avoid unnecessary branches.

I wouldn’t have believed this if I hadn’t seen it with my own eyes, so I sat down and coded some benchmarks using the following function.

```cpp
void modulus ( int m, const std::vector<int>& input, std::vector<int>& output );
```

The idea is of course that `m` is a positive integer and `input` and `output` are vectors of equal size, where `input` is filled with non-negative integers on entry and, on exit, `output[i]` shall be filled with `input[i] % m` for all valid subscripts `i`. I have implemented and compared six versions of this function.

## Implementations

### The Normal Solution

If I were given the task to write this function, without any further thinking, I would have written it like this.

```cpp
void modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
    assert ( m > 0 );
    assert ( input.size () == output.size () );
    const auto op = [ m ] ( const int n ) { return n % m; };
    std::transform ( input.cbegin (), input.cend (), output.begin (), op );
}
```

It’s simple, it’s obvious, and it’s supposed to be optimal.

Next, I implemented two versions that leave the modulus as-is but attempt to optimize the looping. We only do a single load, a single arithmetic operation, and a single store per iteration, so if the loop is not optimized, it could easily dominate the benchmark.

### Duff’s Device

Of course, there are people who don’t trust `std::transform` to produce a loop as fast as the one they would write themselves. A standard technique for speeding up loops is to *unroll* them so that the relative overhead of the loop control logic is reduced. An infamous idiom to achieve this, dating back to the days when compilers couldn’t unroll loops themselves, is *Duff’s Device*. Many programmers don’t feel very comfortable when first seeing this construct, but it is actually perfectly valid syntax in both C and C++. I’m not going into any more detail here; please refer to the Wikipedia article for more information. Note that I have nested the `case` labels in a functionally equivalent but probably less confusing way compared to the canonical form from Tom Duff’s original publication. The number 8 for the unrolling factor has no special justification, but it’s what everybody seems to be using.

```cpp
void modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
    assert ( m > 0 );
    assert ( input.size () == output.size () );
    const auto n = input.size ();
    if ( n == 0 )
        return;
    auto src = input.data ();
    auto dst = output.data ();
    auto i = ( n + 7 ) / 8;
    switch ( n % 8 )
    {
        do
        {
        case 0: *dst++ = *src++ % m;
        case 7: *dst++ = *src++ % m;
        case 6: *dst++ = *src++ % m;
        case 5: *dst++ = *src++ % m;
        case 4: *dst++ = *src++ % m;
        case 3: *dst++ = *src++ % m;
        case 2: *dst++ = *src++ % m;
        case 1: *dst++ = *src++ % m;
        }
        while ( --i );
    }
}
```

### Double Loop

A less dramatic way to exploit loop unrolling with a modern compiler is to break a loop up into an outer loop that iterates a variable number of times and an inner loop that always iterates a fixed number of times. A compiler might be more willing to unroll a loop with a fixed iteration count (precisely because it avoids the nastiness seen with Duff’s Device). A few iterations might remain to be done after the double loop if the inner loop’s iteration count does not divide the total iteration count.

```cpp
void modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
    assert ( m > 0 );
    assert ( input.size () == output.size () );
    const auto n = input.size ();
    auto src = input.cbegin ();
    auto dst = output.begin ();
    for ( std::size_t i = 0; i < n / 8; ++i )
        for ( std::size_t j = 0; j < 8; ++j )
            *dst++ = *src++ % m;
    while ( src != input.cend () )
        *dst++ = *src++ % m;
}
```

It is not as elegant as Duff’s Device but it is easier to understand and shows more trust in the compiler. I would only use this idiom if the total iteration count is *known* a priori to be a multiple of the inner loop’s iteration count, so that the trailing `while` loop is not needed. Otherwise, the code duplication is ugly and potentially harmful to performance.

### Conditional Modulus

Now, finally, to the proposed conditional modulus.

```cpp
void modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
    assert ( m > 0 );
    assert ( input.size () == output.size () );
    const auto op = [ m ] ( const int n ) { return n < m ? n : n % m; };
    std::transform ( input.cbegin (), input.cend (), output.begin (), op );
}
```

Since the manual loop unrolling did not speed up the normal modulus, I didn’t repeat it for the conditional one. Instead, I tried what Chandler also did in his talk: hint to the compiler how likely the branch in the conditional is to be taken. GCC and Clang allow you to do this with the `__builtin_expect` intrinsic. Because not all compilers support it and its syntax is a bit freaky, it is usually not used directly but wrapped in a macro.

```cpp
#define LIKELY( P )   __builtin_expect ( static_cast<bool>( P ), true )
#define UNLIKELY( P ) __builtin_expect ( static_cast<bool>( P ), false )
```

I then also added versions of the above `modulus` implementation with the unary operator `op` defined as

```cpp
const auto op = [ m ] ( const int n ) { return LIKELY ( n < m ) ? n : n % m; };
```

and

```cpp
const auto op = [ m ] ( const int n ) { return UNLIKELY ( n < m ) ? n : n % m; };
```

respectively.

## Benchmark

To benchmark the various functions, I like using random data and a large number of samples to make it less likely that an (un)fortunate choice of inputs will give a false impression. The following procedure was used.

```
REPEAT 100 TIMES
BEGIN
    n ← drawn randomly from { 2^10, …, 2^20 }
    m ← drawn randomly from { 1, …, MAX_INT }
    v_in[ 1, …, n ] ← drawn randomly from { 0, …, MAX_INT }
    v_out[ 1, …, n ] ← 0
    ;; input data is in random order
    FOR i ← 0 TO 10
    BEGIN
        t_i ← execution time of modulus( m, v_in, v_out )
    END
    t_mean  ← 1 / 10 Σ_{i = 1, …, 10} t_i
    t_stdev ← ( 1 / 9 Σ_{i = 1, …, 10} ( t_i − t_mean )^2 )^{1 / 2}
    OUTPUT ( t_mean, t_stdev )
END
```

Note that `t_0` is dropped because the algorithm might need to warm up first.

## Results

The code was compiled with GCC 5.2.0 and Clang 3.6.2 using the flags `-std=c++14 -DNDEBUG -Wall -Wextra -Werror -pedantic -O3`

in both cases.

The results look like this. Plotted are the values of `t_mean / n` (that is, the average time in nanoseconds spent per individual integer) with `t_stdev / n` as error bars, as a function of the fraction of small integers (that is, the relative frequency of numbers in the input that are less than the modulus), overlaid with a linear regression of the form `f(x) = α + β x`.

Now *that’s* confusing!

First note that the normal solution behaves as expected. The regression is almost flat, as it ought to be, because the algorithm is not data-dependent. Also note that the figures suggest that the days of Duff’s Device are probably over. While there is no significant difference between the version using `std::transform` and the double loop, the version using Duff’s Device might even be slightly slower, although the difference is not very significant.

The conditional versions perform poorly if the fraction of small integers is close to 0. This is expected because in this case we have to do the check every time and then also the modulus most of the time. The performance increases as the fraction of small integers approaches 1, which is also expected because then we rarely have to actually compute the modulus. What calls for explanation is the maximum (performance minimum) around 0.5. A closer look reveals that the curves appear like a decreasing linear function added to a bell curve. One explanation for this observation is *branch prediction*. If we are close to the extremes (0 or 1), the conditional will be predictably true or false most of the time and the hardware can foresee this, making branches less expensive. Around 0.5, each branch is taken with about equal probability and the branch predictor can only guess. Amazingly enough, the overall performance is still very competitive. The likely / unlikely annotations seem to have some effect. Especially the Clang plot shows that the likely annotation marginally improves performance for fractions greater than 0.5 and harms it for fractions less than 0.5. The regressions are not as meaningful here and I won’t read any more into them: the three slopes are not that different, and since the overall shape of the curve is far from linear, it is not clear what a linear approximation can tell us about it.

The gray data series at the bottom labeled memcpy is a reference algorithm that doesn’t compute any modulus but simply copies the input array to the output array (using `std::copy`).

To harden my hypothesis about the effect of branch prediction, I then modified the benchmark code to *sort* the inputs before feeding them to the algorithms (at the place marked with the “input data is in random order” comment in the above pseudo-code). The results are compelling.

Even for fractions of small integers as low as 0.1, the conditional version outperforms the normal one. Of course, having sorted data is not something we can usually assume, but I find this quite impressive. This time, the clock times drop in an almost perfectly linear fashion as the fraction of small integers goes from 0 to 1. The likely / unlikely annotations have a relatively small but expected effect.

## Conclusion

While the benchmark case of computing the modulus over kilobytes of contiguous storage of non-negative integers might not be very realistic, it is remarkable how much this seemingly silly premature optimization can achieve. The gains are already significant for unsorted data with a moderate bias, but they become tremendous when branch prediction can partially compensate for the additional cost of the conditional.

Chandler concludes his talk by saying that he will think about how to integrate this into Clang. That would of course be the best solution, because programmers are notoriously bad at applying the right optimization in the right place. But until that happens, and if you can justify it with profiling data for your real-world application, a function template like the following might come in handy.

```cpp
#include <type_traits>

template < typename T1, typename T2 >
using void_if_use_normal_modulo = std::enable_if_t
<
    std::is_integral< T1 >::value && std::is_integral< T2 >::value
    && ! ( std::is_unsigned< T1 >::value && std::is_unsigned< T2 >::value )
>;

template < typename T1, typename T2 >
using void_if_use_conditional_modulo = std::enable_if_t
<
    std::is_integral< T1 >::value && std::is_integral< T2 >::value
    && std::is_unsigned< T1 >::value && std::is_unsigned< T2 >::value
>;

template < typename T1, typename T2, typename = void >
struct mod_helper;

template < typename T1, typename T2 >
struct mod_helper< T1, T2, void_if_use_normal_modulo< T1, T2 > >
{
    static auto help ( const T1 n, const T2 m ) noexcept
    {
        return n % m;
    }
};

template < typename T1, typename T2 >
struct mod_helper< T1, T2, void_if_use_conditional_modulo< T1, T2 > >
{
    static auto help ( const T1 n, const T2 m ) noexcept
    {
        return n < m ? n : n % m;
    }
};

template < typename T1, typename T2 >
auto mod ( const T1 n, const T2 m ) noexcept
{
    return mod_helper< T1, T2 >::help ( n, m );
}
```

It can probably be improved.

## Downloads

The code used for the benchmarks can be downloaded from my website. You’ll notice that the code snippets shown in this article were simplified. If you run it and get different results, I’d be interested to hear about it.

## Change Log

- 2015-10-05 23:45 (UTC) Small changes to wording and noted the influence of `UNLIKELY` for unsorted data with Clang.