Avoiding Benchmarking Pitfalls with black_box in Rust
Nov 2022

When benchmarking short programs, you often run into two big problems that distort your final results: (1) hardware and operating systems are full of side effects that are neither transparent nor directly controllable, and (2) compilers can optimize in unpredictable ways, which forces you to inspect IR or assembly and to know your way around compiler intrinsics.

One such example happened while I was benchmarking a multithreaded queue. I chose my struct alignments in a way that would reduce cache coherency traffic, which should translate to a noticeable improvement in per-thread throughput on write-heavy workloads. However, I measured the exact opposite! This is a simplified version of the code (the part after '&' is just a bitmask for wrapping the index):
for i in 0..0xffff {
    *head = (*head + 1) & ((1 << C) - 1);
}
We're dereferencing a *mut, incrementing the value, and storing it back at the same address. We should therefore expect to see at least one ldr and one str instruction inside the loop. Generating the corresponding arm64 assembly with RUSTFLAGS="--emit asm" cargo bench --no-run yields:
        add     w9, w9, #1
        and     w9, w9, #0xffff
        subs    x10, x10, #1
        b.ne    LBB0_1
        str     w9, [x8]
Sneaky! The last line reveals the problem with my benchmark. The value lives in a register for the whole loop and is stored back to memory only once, after the loop has finished, which means there is barely any cache coherency traffic happening! Honestly, I would've expected LLVM to fold the loop away entirely and just add the iteration count (65535) to [x8] before masking, but to my surprise it keeps the loop around. We can fix this in native Rust by using a compiler hint called std::hint::black_box:
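for i in 0..0xffff {
    // Pass the pointer through black_box so the compiler has to assume
    // it may be observed or changed on every iteration.
    let head = black_box(head);
    *head = (*head + 1) & ((1 << C) - 1);
}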
According to the official Rust docs, black_box is "an identity function that hints to the compiler to be maximally pessimistic about what black_box could do". And to our satisfaction, we find our load and store instructions:
        ldr     x11, [sp]
        ldr     w12, [x11]
        add     w12, w12, #1
        and     w12, w12, #0xffff
        str     w12, [x11]
        str     x9, [sp, #24]
        subs    x8, x8, #1
        b.ne    LBB0_1
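To try this comparison outside of the queue, here is a minimal, self-contained sketch. It is not the original benchmark: it uses a safe `&mut u32` instead of the raw pointer, the function names `bump_plain` and `bump_black_box` are made up for illustration, and `C = 16` is an assumption inferred from the `#0xffff` mask in the assembly. It also assumes a toolchain where `std::hint::black_box` is available on stable (Rust 1.66 or later).

```rust
use std::hint::black_box;

// Assumed capacity exponent; 16 gives the 0xffff mask seen in the assembly.
const C: u32 = 16;
const MASK: u32 = (1 << C) - 1;

/// Plain loop: the optimizer is free to keep `*head` in a register
/// and store the final value once after the loop.
fn bump_plain(head: &mut u32) {
    for _ in 0..0xffff {
        *head = (*head + 1) & MASK;
    }
}

/// Same loop, but the reference goes through `black_box` on every iteration,
/// which hints that the pointee may be observed or modified elsewhere.
fn bump_black_box(head: &mut u32) {
    for _ in 0..0xffff {
        let head = black_box(&mut *head);
        *head = (*head + 1) & MASK;
    }
}

fn main() {
    let mut a = 0u32;
    let mut b = 0u32;
    bump_plain(&mut a);
    bump_black_box(&mut b);
    // Both variants compute the same value; only the generated code differs.
    assert_eq!(a, b);
    println!("plain = {a}, black_box = {b}");
}
```

Both functions return the same result, so the only way to see the difference is to compile in release mode and compare the assembly of the two loops, for example with the `--emit asm` flag mentioned earlier.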
There are some caveats though:
  • black_box does not guarantee anything: it is only advisory and provided on a best-effort basis. There is no LLVM intrinsic with hard guarantees behind it, so manual inspection of IR or assembly is still necessary.
  • At the time of writing, black_box is still experimental and awaits stabilization, part of which is possibly a name change.
  • Creating a version of black_box that gives strict guarantees would require a top-to-bottom rework, including patching backends to support these intrinsics.
You can find an interesting discussion about this function in the tracking issue on GitHub.
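Because black_box is only advisory, one alternative for this specific case (forcing a load and a store on every iteration) is to use volatile accesses. This is a different technique from the one used above, shown here only as a hedged sketch: `bump_volatile` is a hypothetical name, `C = 16` is again an assumption, and volatile accesses are neither atomic nor a synchronization primitive, so this is only suitable for single-threaded experiments on memory you own exclusively.

```rust
use std::ptr;

// Assumed capacity exponent, matching the 0xffff mask from the assembly.
const C: u32 = 16;
const MASK: u32 = (1 << C) - 1;

/// Increments `*head` with volatile reads and writes, which the compiler
/// is not allowed to elide or merge with other volatile accesses.
fn bump_volatile(head: &mut u32) {
    let p: *mut u32 = head;
    for _ in 0..0xffff {
        // SAFETY: `p` is derived from a live, exclusive `&mut u32`,
        // so it is valid, aligned, and not aliased during the loop.
        unsafe {
            let v = ptr::read_volatile(p);
            ptr::write_volatile(p, (v + 1) & MASK);
        }
    }
}

fn main() {
    let mut head = 0u32;
    bump_volatile(&mut head);
    println!("head = {head}");
}
```

As with black_box, it is still worth checking the emitted assembly to confirm that the loop body contains the loads and stores you intend to measure.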