Suboptimal Optimisations
While theoretical performance and asymptotic runtimes are nice, they don’t take constant factors into account. Sadly, microbenchmarks don’t represent real-world performance well either. To benchmark data structure and algorithm implementations for real-world usage, you should measure them inside a real program, so that you get realistic input and the memory behaviour of an actual program.
Since you usually want to compare the speed of one implementation against another, this is slightly problematic. While it is easy to measure the absolute difference between implementations, it is hard to measure the relative difference: all implementations carry overhead from I/O and other computations not directly related to the data structure or algorithm itself. It’s usually possible to remove some of the I/O overhead from the timings, but the computational overhead is difficult to get rid of without making the program unrealistic.
One way of figuring out the relative performance is by making dummy programs. The dummy program does the exact same computation, but without any data structure in place. One problem is that such a program generally does, well, nothing. Compilers may see that these parts of the program are effectively dead code and remove them. To avoid dead code elimination, one tends to replace the data structure operations with an expression incrementing a counter or something similar. To keep the counter itself from being considered dead, you usually return it from the worker functions and keep it volatile in the main function.
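A minimal sketch of the idea could look like the following (not the actual grep code; the file handling and names here are invented for illustration). The counter increment stands in for the data structure call, while the surrounding work stays in place so the compiler cannot fold the loop away:
#include <stdint.h>
#include <stdio.h>

// Dummy worker: the data structure operation has been replaced by a plain
// counter increment, while the rest of the work (here, reading lines)
// remains, so the loop still does something measurable.
static uint32_t dummy_worker(FILE *file) {
    char line[4096];
    uint32_t counter = 0;
    while (fgets(line, sizeof line, file) != NULL) {
        counter++; // stands in for the real data structure call
    }
    return counter;
}

int main(void) {
    // Keeping the result in a volatile variable stops the compiler from
    // treating the counter, and thus the whole loop, as dead code.
    volatile uint32_t sink = dummy_worker(stdin);
    (void)sink;
    return 0;
}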
I did this for a (naive) parallel grep implementation, where each core ran completely independently of the others. Imagine my surprise when the dummy program was a tiny bit slower than one of the data structure programs!
Obviously, getting superlinear speedup on a task much more expensive than incrementing an integer set off some warning lights. Had I made a mistake somewhere and modified more than just the data structure function calls? No, only the calls were replaced. Well, perhaps the data structure implementation for some reason optimises better? The optimisations do modify the programs somewhat, but there was no difference between the two programs that would imply the real program should be more efficient than the dummy. So what could be the issue?
Hardware and Memory
At this point, I guessed that there was some subtle hardware performance trick which doesn’t kick in when the dummy program runs. Cache thrashing? Nope, the dummy reduces the possibility of that. Branch misprediction? Nope, the dummy just removes a lot of those.
After some head scratching, I finally figured out the underlying reason, which lies within the simplified code snippet below. The code is run in parallel on all cores, and line_matches is not sharing any state across the threads. See if you can spot the reason before you read on!
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_arr = input->output_arr;
uint32_t contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        contained_lines++;
    }
}
output_arr[own_tid] = contained_lines;
There are several optimisations done in this piece of code, but the only one which affects performance negatively is a variable elimination: contained_lines was removed, and the operations were performed directly on output_arr[own_tid]. In effect, the resulting semantics were something like this:
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_ptr = &input->output_arr[own_tid];
*output_ptr = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        (*output_ptr)++;
    }
}
People who know how caches work on multicore processors will see what happens here: heavy cache invalidation, better known as false sharing. As cache lines on modern processors are either 64 or 128 bytes long, all the output values in the output array fit inside a single cache line. And since every core has its own L1 cache, a write to that cache line invalidates the other cores’ copies, which they must fetch again to keep the caches coherent. A single invalidation is not incredibly expensive, but in this case they accumulated and made the dummy slower than the real program.
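As an aside, a common general remedy for this kind of false sharing is to pad each per-thread slot so it occupies its own cache line. A rough sketch, assuming 64-byte lines and a made-up thread count:
#include <stdint.h>

#define CACHE_LINE_SIZE 64 // assumption: 64-byte lines (some CPUs use 128)
#define NUM_THREADS 8      // hypothetical thread count for this sketch

// Each counter gets a full cache line to itself, so a write from one
// thread never invalidates the line holding another thread's counter.
struct padded_counter {
    uint32_t count;
    uint8_t pad[CACHE_LINE_SIZE - sizeof(uint32_t)];
};

static struct padded_counter output_arr[NUM_THREADS];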
Volatility
So, how can we speed up the dummy program? Essentially, what you have to do to reduce the cache invalidations is to prevent the variable elimination. It is, in fact, possible to do this through malloc:
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
uint32_t *output_arr = input->output_arr;
uint32_t *contained_lines = malloc(sizeof(uint32_t));
*contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        (*contained_lines)++;
    }
}
output_arr[own_tid] = *contained_lines;
free(contained_lines);
Curiously, this runs faster than the original dummy. It isn’t the optimal solution, however: in this case, turning output_arr into a volatile variable solves the problem better. Even though volatile generally signals that the variable’s contents can be changed or read by others, it also prevents certain optimisations. This is generally what you want, but be a bit careful: data structure operations can, in theory, be optimised more heavily than volatile variables. As long as that doesn’t happen, volatile away!
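For reference, here is a sketch of how I read that fix: declaring the pointer as pointing to volatile data means the compiler must perform exactly the volatile accesses written, so it cannot fold contained_lines into repeated writes through output_arr inside the loop.
// values and variables set up here
const uint32_t own_tid = input->own_tid; // thread id
// pointer to volatile data: exactly one store to the shared array may
// be emitted, so the local counter survives optimisation
volatile uint32_t *output_arr = input->output_arr;
uint32_t contained_lines = 0;
for (uint32_t line_idx = from; line_idx < to; line_idx++) {
    // input arguments are set up here
    if (line_matches(input_arguments)) {
        contained_lines++;
    }
}
output_arr[own_tid] = contained_lines; // single volatile store per thread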