int modulus(int n, int m) { return n < m ? n : n % m; }

Real Programmers will love this! (Except that they always knew…)

At CppCon’15, Chandler Carruth, lead of the Clang team at Google, gave a great talk titled Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!. The talk, which consists mostly of live coding and micro-benchmarking, is both informative and entertaining. Among other things, Chandler presents an optimized version of the good old modulus. Let n be a non-negative and m a positive integer. He proposes replacing n % m by n < m ? n : n % m because, after all, if n is less than m, there is nothing to compute.

What was supposedly meant as a joke, proposing a dumb micro-optimization that actually makes the code run slower, turned out to be a genuine improvement. This is astonishing because it violates just about every principle we’ve learned about writing fast code. In particular:

  • Don’t try to outsmart the compiler’s optimizer.
  • Avoid unnecessary branches.

I wouldn’t have believed this had I not seen it with my own eyes, so I sat down and coded some benchmarks using the following function.

void modulus ( int m, const std::vector<int>& input, std::vector<int>& output );

The idea is of course that m is a positive integer and input and output are vectors of equal size, where input is filled with non-negative integers on entry and, on exit, output[i] shall hold input[i] % m for all valid subscripts i. I have implemented and compared six versions of this function.

Implementations

The Normal Solution

If I were given the task to write this function, without any further thinking, I would have written it like this.

void
modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
  assert ( m > 0 );
  assert (input.size () == output.size () );
  const auto op = [ m ] ( const int n ) { return n % m; };
  std::transform ( input.cbegin (), input.cend (), output.begin (), op );
}

It’s simple, it’s obvious and it’s supposed to be optimal.

Next, I implemented two versions that leave the modulus as-is but attempt to optimize the looping. We only do a single load, a single arithmetic operation and a single store per iteration. If the loop is not optimized, it could easily dominate the benchmark.

Duff’s Device

Of course, there are people who don’t trust std::transform to produce a loop as fast as they could write themselves. A standard technique for speeding up loops is to unroll them so that the relative overhead of the loop control logic is reduced. An infamous idiom to achieve this, dating back to the days when compilers were unable to unroll loops themselves, is Duff’s Device. Many programmers don’t feel very comfortable when first seeing this construct, but it’s actually perfectly valid syntax in both C and C++. I’m not going into any more detail here; please refer to the Wikipedia article for more information. Note that I have nested the case labels in a functionally equivalent but probably less confusing way compared to the canonical form from Tom Duff’s original publication. The number 8 for the unrolling factor has no special justification, but it’s what everybody seems to be using.

void
modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
  assert ( m > 0 );
  assert ( input.size () == output.size () );
  const auto n = input.size ();
  if ( n == 0 )
    return;
  auto src = input.data ();
  auto dst = output.data ();
  auto i = ( n + 7 ) / 8;
  switch ( n % 8 )
    {
      do
        {
        case 0: *dst++ = *src++ % m;
        case 7: *dst++ = *src++ % m;
        case 6: *dst++ = *src++ % m;
        case 5: *dst++ = *src++ % m;
        case 4: *dst++ = *src++ % m;
        case 3: *dst++ = *src++ % m;
        case 2: *dst++ = *src++ % m;
        case 1: *dst++ = *src++ % m;
        }
      while ( --i );
    }
}

Double Loop

A less dramatic way to exploit loop unrolling with a modern compiler is to break a loop up into an outer loop that iterates a variable number of times and an inner loop that always iterates a fixed number of times. A compiler might be more willing to unroll a loop with a fixed iteration count (precisely because it avoids the nastiness seen with Duff’s Device). A few iterations might remain to be done after the double loop if the inner loop’s iteration count does not divide the total iteration count.

void
modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
  assert ( m > 0 );
  assert ( input.size () == output.size () );
  const auto n = input.size ();
  auto src = input.cbegin ();
  auto dst = output.begin ();
  for ( std::size_t i = 0; i < n / 8; ++i )
    for ( std::size_t j = 0; j < 8; ++j )
      *dst++ = *src++ % m;
  while ( src != input.cend () )
    *dst++ = *src++ % m;
}

It is not as elegant as Duff’s Device but easier to understand, and it shows more trust in the compiler. I would only use this idiom if the total iteration count is known a priori to be a multiple of the inner loop’s iteration count, so the while loop at the end is not needed; otherwise the code duplication is ugly and potentially harmful to performance. A sketch of that restricted variant follows.
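For illustration, this is what the restricted variant might look like, with the precondition expressed as an assertion. The function name is mine and not taken from the benchmark code.

void
modulus_size_multiple_of_8 ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
  assert ( m > 0 );
  assert ( input.size () == output.size () );
  assert ( input.size () % 8 == 0 );  // precondition: no left-over iterations
  auto src = input.cbegin ();
  auto dst = output.begin ();
  for ( std::size_t i = 0; i < input.size () / 8; ++i )
    for ( std::size_t j = 0; j < 8; ++j )
      *dst++ = *src++ % m;
}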

Conditional Modulus

Now finally to the proposed conditional modulus.

void
modulus ( const int m, const std::vector<int>& input, std::vector<int>& output )
{
  assert ( m > 0 );
  assert ( input.size () == output.size () );
  const auto op = [ m ] ( const int n ) { return n < m ? n : n % m; };
  std::transform ( input.cbegin (), input.cend (), output.begin (), op );
}

Since the manual loop unrolling did not speed up the normal modulus, I didn’t repeat it for the conditional one. Instead, I tried what Chandler also did in his talk: hint to the compiler how likely the branch in the conditional is to be taken. GCC and Clang allow you to do this with the __builtin_expect intrinsic. Because not all compilers support it and its syntax is a bit freaky, it is usually not used directly but wrapped into a macro.

#define LIKELY( P )    __builtin_expect ( static_cast<bool>( P ), true  )
#define UNLIKELY( P )  __builtin_expect ( static_cast<bool>( P ), false )
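If portability is a concern, the macro definitions can be guarded so that the hint silently disappears on compilers that lack the builtin. A possible sketch (the feature test below is my assumption, not taken from the benchmark code):

#if defined ( __GNUC__ ) || defined ( __clang__ )
#  define LIKELY( P )    __builtin_expect ( static_cast<bool>( P ), true  )
#  define UNLIKELY( P )  __builtin_expect ( static_cast<bool>( P ), false )
#else
#  define LIKELY( P )    static_cast<bool>( P )   // no hint, just the plain condition
#  define UNLIKELY( P )  static_cast<bool>( P )
#endif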

I then also added versions of the above modulus implementation with the unary operator op defined as

const auto op = [ m ] ( const int n ) { return LIKELY ( n < m ) ? n : n % m; };

and

const auto op = [ m ] ( const int n ) { return UNLIKELY ( n < m ) ? n : n % m; };

respectively.

Benchmark

To benchmark the various functions, I like to use random data and a large number of samples, so as to make it less likely that an (un)fortunate choice of inputs will give a false impression. The following procedure was used.

REPEAT 100 TIMES
BEGIN
    n ← drawn randomly from { 2^10, …, 2^20 }
    m ← drawn randomly from { 1, …, MAX_INT }
    vin[1, …, n] ← drawn randomly from { 0, …, MAX_INT }
    vout[1, …, n] ← 0
    ;; input data is in random order
    FOR i ← 0 TO 10
    BEGIN
        t_i ← execution time of modulus(m, vin, vout)
    END
    t_mean ← (1 / 10) Σ_{i = 1, …, 10} t_i
    t_stdev ← ( (1 / 9) Σ_{i = 1, …, 10} (t_i - t_mean)^2 )^(1/2)
    OUTPUT (t_mean, t_stdev)
END

Note that t_0 is dropped because the algorithm might need to warm up first.
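For reference, a stripped-down version of the timing harness might look as follows. This is only a sketch of the procedure above, not the actual benchmark code (which can be downloaded below); the random engine, the clock, and the output format are my choices.

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

// the function under test: any of the variants shown above
void modulus ( int m, const std::vector<int>& input, std::vector<int>& output );

void
run_benchmark ()
{
  std::mt19937 rng { std::random_device {} () };
  std::uniform_int_distribution<std::size_t> n_dist ( std::size_t { 1 } << 10, std::size_t { 1 } << 20 );
  std::uniform_int_distribution<int> m_dist ( 1, std::numeric_limits<int>::max () );
  std::uniform_int_distribution<int> v_dist ( 0, std::numeric_limits<int>::max () );
  for ( int rep = 0; rep < 100; ++rep )
    {
      const auto n = n_dist ( rng );
      const auto m = m_dist ( rng );
      std::vector<int> vin ( n );
      std::vector<int> vout ( n );
      for ( auto& v : vin )
        v = v_dist ( rng );
      double t [ 11 ];
      for ( int i = 0; i <= 10; ++i )  // t[0] is a warm-up run and discarded below
        {
          const auto start = std::chrono::steady_clock::now ();
          modulus ( m, vin, vout );
          const auto stop = std::chrono::steady_clock::now ();
          t [ i ] = std::chrono::duration<double, std::nano> ( stop - start ).count ();
        }
      double mean = 0.0;
      for ( int i = 1; i <= 10; ++i )
        mean += t [ i ] / 10.0;
      double var = 0.0;
      for ( int i = 1; i <= 10; ++i )
        var += ( t [ i ] - mean ) * ( t [ i ] - mean ) / 9.0;
      std::printf ( "%g %g\n", mean, std::sqrt ( var ) );
    }
}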

Results

The code was compiled with GCC 5.2.0 and Clang 3.6.2 using the flags -std=c++14 -DNDEBUG -Wall -Wextra -Werror -pedantic -O3 in both cases.

The results look like this. Plotted are the values of t_mean / n (that is, the average time in nanoseconds spent per individual integer) with t_stdev / n as error bars, as a function of the fraction of small integers (that is, the relative frequency of numbers in the input that are less than the modulus), overlaid with a linear regression of the form f(x) = α + β x.

[Figure: Benchmarks with GCC 5.2.0 using random data (SVG version available).]

[Figure: Benchmarks with Clang 3.6.2 using random data (SVG version available).]

Now that’s confusing!

First note that the normal solution behaves as expected. The regression is almost flat, as it ought to be, because the algorithm is not data-dependent. Also note that the figures suggest that the days of Duff’s Device are probably over. While there is no significant difference between the version using std::transform and the double loop, the version using Duff’s Device might even be slightly slower, although the difference is not very significant.

The conditional versions perform poorly if the fraction of small integers is close to 0. This is expected because in this case we have to do the check every time and then also the modulus most of the time. The performance increases as the fraction of small integers approaches 1, which is also expected because in this case we won’t have to actually do the modulus most of the time. What calls for explanation is the maximum (performance minimum) around 0.5. A closer look reveals that the curves appear somewhat like a decreasing linear function added to a bell curve. One explanation for this observation is branch prediction. If we are close to the extremes (0 or 1), the conditional will be predictably true or false most of the time and the hardware can foresee this, making branches less expensive. Around 0.5, each branch is taken with about equal probability and the branch predictor can only guess. Amazingly enough, the overall performance is still very competitive.

The likely / unlikely annotations seem to have some effect. Especially the Clang plot shows that the likely annotation marginally improves performance for fractions greater than 0.5 and harms it for fractions less than 0.5. The regressions are not as meaningful here, and I won’t read any more into them: the three slopes are not that different and, since the overall shape of the curve is far from linear, it is not clear what a linear approximation can tell us about it.

The gray data series at the bottom labeled memcpy is a reference algorithm that doesn’t compute any modulus but simply copies the input array to the output array (using std::copy).

To test my hypothesis about the effect of branch prediction, I then modified the benchmark code to sort the inputs before feeding them to the algorithms (at the place marked with the “input data is in random order” comment in the above pseudo-code, as sketched below). The results are compelling.
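In code, the modification amounts to a single additional line after filling the input vector (assuming a vector named vin as in the pseudo-code, and #include <algorithm>):

std::sort ( vin.begin (), vin.end () );  // make the branch in the conditional modulus predictable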

[Figure: Benchmarks with GCC 5.2.0 using sorted random data (SVG version available).]

[Figure: Benchmarks with Clang 3.6.2 using sorted random data (SVG version available).]

Even for fractions of small integers as low as 0.1, the conditional version outperforms the normal one. Of course, having sorted data is not something we can usually assume, but I find this quite impressive. This time, the clock times drop in an almost perfectly linear fashion as the fraction of small integers goes from 0 to 1. The likely / unlikely annotations have a relatively small but expected effect.

Conclusion

While the benchmark case of computing the modulus over kilobytes of contiguous storage of non-negative integers might not be very realistic, it is remarkable how much this seemingly silly premature optimization can achieve. The gains are already significant for unsorted data with a moderate bias, but they become tremendous when branch prediction can partially compensate for the additional cost of the conditional.

Chandler concludes his talk by saying that he will think about how to integrate this into Clang. This would of course be the best solution because programmers are notoriously bad at applying the right optimization in the right place. But until this happens, and provided you can justify it with profiling data for your real-world application, a function template like the following might come in handy.

#include <type_traits>

// enabled (that is, void) if at least one operand is signed:
// n might be negative, so we fall back to the plain modulus
template < typename T1, typename T2 >
using void_if_use_normal_modulo = std::enable_if_t
<
  std::is_integral< T1 >::value && std::is_integral< T2 >::value
  && ! ( std::is_unsigned< T1 >::value && std::is_unsigned< T2 >::value )
>;

// enabled (that is, void) if both operands are unsigned:
// n is known to be non-negative, so the conditional modulus is safe
template < typename T1, typename T2 >
using void_if_use_conditional_modulo = std::enable_if_t
<
  std::is_integral< T1 >::value && std::is_integral< T2 >::value
  && std::is_unsigned< T1 >::value && std::is_unsigned< T2 >::value
>;

template < typename T1, typename T2, typename = void >
struct mod_helper;

template < typename T1, typename T2 >
struct mod_helper< T1, T2, void_if_use_normal_modulo< T1, T2 > >
{
  static auto
  help ( const T1 n, const T2 m ) noexcept
  {
    return n % m;
  }
};

template < typename T1, typename T2 >
struct mod_helper< T1, T2, void_if_use_conditional_modulo< T1, T2 > >
{
  static auto
  help ( const T1 n, const T2 m ) noexcept
  {
    return n < m ? n : n % m;
  }
};

template < typename T1, typename T2 >
auto
mod ( const T1 n, const T2 m ) noexcept
{
  return mod_helper< T1, T2 >::help ( n, m );
}

It can probably be improved.
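As a quick sanity check of the dispatch (my own example, not part of the original code), something like the following should hold:

#include <cassert>

int
main ()
{
  assert ( mod ( 5, 7 ) == 5 );      // at least one signed operand: plain n % m
  assert ( mod ( 15u, 7u ) == 1u );  // both unsigned, n >= m: conditional modulus divides
  assert ( mod ( 3u, 7u ) == 3u );   // both unsigned, n < m: the division is skipped
  return 0;
}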

Downloads

The code used for the benchmarks can be downloaded from my website. You’ll notice that the code snippets shown in this article were simplified. If you run it and get different results, I’d be interested to hear about it.

Change Log

  • 2015-10-05 23:45 (UTC) Small changes to wording and noted the influence of UNLIKELY for unsorted data with Clang.