Suboptimal Optimisations
While theoretical performance and asymptotic runtimes are nice, they don’t take constant factors into account. Sadly, microbenchmarks don’t represent real-world performance well either. To benchmark data structure and algorithm implementations for real-world usage, you should measure them inside a real program, so that you get realistic input and the memory behaviour of an actual program.
Since you usually want to compare the speed of one implementation against another, this is slightly problematic. While it is easy to measure the absolute difference between implementations, it is hard to measure the relative difference: all implementations carry overhead from I/O and other computations not directly related to the data structure or algorithm itself. It’s usually possible to remove some of the I/O overhead from the timings, but the computational overhead is difficult to get rid of without making the program unrealistic.
One way of figuring out the relative performance is by making dummy programs. The dummy program does the exact same computation, but without any data structure in place. One problem is that such a program generally does, well, nothing. Compilers may see that these parts of the program are effectively dead code and remove them. To avoid dead code elimination, one tends to replace the data structure operations with an expression incrementing a counter or something similar. To keep the counter itself from being considered dead, you usually return it from the worker functions and keep it volatile in the main function.
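A minimal sketch of the idea could look like the following (not the actual grep code; the file handling and names here are invented for illustration). The counter increment stands in for the data structure call, while the surrounding work stays in place so the compiler cannot fold the loop away:
#include <stdint.h>
#include <stdio.h>

// Dummy worker: the data structure operation has been replaced by a plain
// counter increment, while the rest of the work (here, reading lines)
// remains, so the loop still does something measurable.
static uint32_t dummy_worker(FILE *file) {
    char line[4096];
    uint32_t counter = 0;
    while (fgets(line, sizeof line, file) != NULL) {
        counter++; // stands in for the real data structure call
    }
    return counter;
}

int main(void) {
    // Keeping the result in a volatile variable stops the compiler from
    // treating the counter, and thus the whole loop, as dead code.
    volatile uint32_t sink = dummy_worker(stdin);
    (void)sink;
    return 0;
}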
I did this for a (naive) parallel grep implementation, where each core ran completely independently of the others. Imagine my surprise when the dummy program was a tiny bit slower than one of the data structure programs!
Obviously, getting superlinear speedup on a task much more expensive than incrementing an integer set off some warning lights. Had I made a mistake somewhere and modified more than just the data structure function calls? No, only the calls were replaced. Well, perhaps the data structure implementation for some reason optimises better? The optimisations do modify the programs somewhat, but there was no difference between the two programs that would imply the real program should be more efficient than the dummy. So what could be the issue?
Hardware and Memory
At this point, I guessed that there was some subtle hardware performance trick which doesn’t kick in when the dummy program runs. Cache thrashing? Nope, the dummy reduces the possibility of that. Branch misprediction? Nope, the dummy just removes a lot of those.
After some head scratching, I finally figured out the underlying reason, which lies within the simplified code snippet below. The code is run in parallel on all cores, and line_matches is not sharing any state across the threads. See if you can spot the reason before you read on!
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_arr = input->output_arr;
uint32_t contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        contained_lines++;
    }
}
output_arr[own_tid] = contained_lines;
There are several optimisations done in this piece of code, but the only one which affects performance negatively is a variable elimination: contained_lines was removed, and the operations were performed directly on output_arr[own_tid]. In effect, the resulting semantics were something like this:
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_ptr = &input->output_arr[own_tid];
*output_ptr = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        (*output_ptr)++;
    }
}
People who know how caches work on multicore processors will see what happens here: heavy cache invalidation, better known as false sharing. As cache lines on modern processors are either 64 or 128 bytes long, all the output values in the output array fit inside a single cache line. And since every core has its own L1 cache, a write to that cache line invalidates the other cores’ copies, which they must fetch again to keep the caches coherent. A single invalidation is not incredibly expensive, but in this case they accumulated and made the dummy slower than the real program.
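As an aside, a common general remedy for this kind of false sharing is to pad each per-thread slot so it occupies its own cache line. A rough sketch, assuming 64-byte lines and a made-up thread count:
#include <stdint.h>

#define CACHE_LINE_SIZE 64 // assumption: 64-byte lines (some CPUs use 128)
#define NUM_THREADS 8      // hypothetical thread count for this sketch

// Each counter gets a full cache line to itself, so a write from one
// thread never invalidates the line holding another thread's counter.
struct padded_counter {
    uint32_t count;
    uint8_t pad[CACHE_LINE_SIZE - sizeof(uint32_t)];
};

static struct padded_counter output_arr[NUM_THREADS];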
Volatility
So, how can we speed up the dummy program? Essentially, what you have to do to reduce the cache invalidations is to prevent the variable elimination. It is, in fact, possible to do this through malloc:
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_arr = input->output_arr;
uint32_t *contained_lines = malloc(sizeof(uint32_t));
*contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        (*contained_lines)++;
    }
}
output_arr[own_tid] = *contained_lines;
free(contained_lines);
Curiously, this runs faster than the original dummy. It isn’t the optimal solution, however: in this case, turning output_arr into a volatile variable solves the problem better. Even though volatile generally signals that the variable’s contents can be changed or read by others, it also prevents certain optimisations. This is generally what you want, but be a bit careful: data structure operations can, in theory, be optimised more heavily than volatile variables. As long as that doesn’t happen, volatile away!
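For reference, here is a sketch of how I read that fix: declaring the pointer as pointing to volatile data means the compiler must perform exactly the volatile accesses written, so it cannot fold contained_lines into repeated writes through output_arr inside the loop.
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
// pointer to volatile data: exactly one store to the shared array may
// be emitted, so the local counter survives optimisation
volatile uint32_t *output_arr = input->output_arr;
uint32_t contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        contained_lines++;
    }
}
output_arr[own_tid] = contained_lines; // single volatile store per thread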