I recently began writing a bounded, wait-free MPSC queue in Rust.
An often-cited piece of wisdom for ring-buffer-based implementations goes as follows:
Don't pack the tail or head indices of each queue together into a struct
without padding each index to a cache line. Otherwise, the cache-coherency
traffic due to false sharing will drastically reduce your performance!
In this article, I would like to find out the concrete performance
penalty of false sharing for my data structure. I'll be measuring
the effects on both ARM (Apple Silicon) and x86 (Intel/AMD) processors.
What is False Sharing?
Assume that we're operating in a shared-memory, multi-processor system,
and perform concurrent reads and writes to our main memory. In the
simplest case, a memory controller will determine a global order of
operations, and make any change to the shared memory immediately
visible to all other processors. In reality though, CPUs make use
of local cache hierarchies, causing possibly stale data to reside in
different cache domains.
Figure 1: A very(!) simplified sketch of a cache-coherent NUMA CPU.
Processor cores have their own L1 and L2 cache, followed by a
shared last-level cache (L3). Depending on the architecture,
the next-closest memory device may be a DRAM controller belonging
to a NUMA node, and then some type of interconnect fabric to
share memory between NUMA nodes as well as CPU sockets.
Without some global ordering guarantees, writing meaningful concurrent
algorithms becomes pretty much impossible.
To resolve this, CPUs employ a microarchitectural cache coherence
protocol that (1) propagates writes to all other processors and (2)
serializes writes to the same memory location, so that they are observed
in the same global order by every processor.
Most CPUs follow an invalidation-based protocol like
MESI.
That means that writing to an address invalidates every other processor's
cache line that contains this address's value, likely forcing the
affected processors to pull the latest value from a lower memory
level (last-level cache or main memory) on the next read. This can
strongly degrade performance, and can cause a poorly-implemented
multithreaded system to perform worse than its single-threaded
counterpart.
Note: Cache coherency is part of a CPU's memory model and is
always architecture- or ISA-specific. Furthermore, operating systems
and programming languages often define their own memory models and
coherence semantics to stay portable across a variety of hardware. As a
programmer, you need to be aware of which memory model you're
developing against. The
C/C++11 memory ordering specification
is a great starting point for this.
I also recommend brushing up on
evaluation ordering.
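For Rust (which this post uses), the atomic memory orderings mirror the C/C++11 model. Below is a minimal, illustrative sketch of a release/acquire pairing; the names publish and consume are made up for this example and are not taken from the queue implementation.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Writer: the Release store on READY makes the preceding DATA store
// visible to any thread that observes READY == true via an Acquire load.
fn publish(value: u64) {
    DATA.store(value, Ordering::Relaxed);
    READY.store(true, Ordering::Release);
}

// Reader: the Acquire load synchronizes-with the Release store above,
// so the subsequent load of DATA is guaranteed to see the published value.
fn consume() -> Option<u64> {
    if READY.load(Ordering::Acquire) {
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}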
Intuitively, it makes sense that lots of processors reading and writing
concurrently to the same address would cause a performance bottleneck.
This problem is known as true sharing, and is quite easy to identify.
False sharing on the other hand is more insidious. It's a type of
performance degradation where multiple processors write to
different
memory addresses concurrently, but still cause lots of cache coherency
traffic. How is that possible if they are writing to different memory
addresses? The reason is that the cache coherence protocol is coarse-grained and
invalidates
an entire cache line on any write, so that updating 8 bytes
of data (the size of an unsigned 64-bit integer) also invalidates the 56
neighboring bytes of a typical 64-byte cache line.
Figure 2: False sharing in action. Two threads update two
unrelated objects, but the addresses are close enough to map
into the same cache line. The cache coherence protocol will
invalidate the cache line in one of the two processors.
So even though 2 distinct objects are updated by two different
processors, we see a similar level of performance degradation
as with true sharing, as long as these objects map to the same
cache line.
The most egregious offenders are densely packed structs in which each
field is accessed by a different thread. This gives you a false
impression of data independence and parallelism.
#[repr(C)]
pub struct State {
    t1: u64, // only accessed by thread #1
    t2: u64, // only accessed by thread #2
    t3: u64, // only accessed by thread #3
}
Depending on the alignment of
State, all three values will likely
map to the same cache line. To avoid this, you generally align variables
to a cache line using either macros or padding fields. In the Linux kernel
for example, searching for the macro
__cacheline_aligned_in_smp yields 300
results. A snippet from
/bpf/ringbuf.c:
struct bpf_ringbuf {
    // ...
    spinlock_t spinlock ____cacheline_aligned_in_smp;
    atomic_t busy ____cacheline_aligned_in_smp;
    // ...
};
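In Rust, the same effect can be achieved with an alignment attribute or a padding wrapper. Here is a minimal sketch, assuming a 64-byte cache line; the crossbeam crate ships a more careful, per-architecture version of this idea as CachePadded.

// Aligns (and thereby pads) its contents to a 64-byte cache line, so
// that two neighboring CacheAligned<T> values never share a line.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

#[repr(C)]
pub struct PaddedState {
    t1: CacheAligned<u64>, // only accessed by thread #1
    t2: CacheAligned<u64>, // only accessed by thread #2
    t3: CacheAligned<u64>, // only accessed by thread #3
}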
By now it's pretty clear that we should definitely avoid false sharing;
all that's left is measuring the actual performance gain of cache
line alignment.
The Data Structure: MPSC Queue
To measure the impact of false sharing, we will perform experiments on
a wait-free, bounded, multi-producer single-consumer queue. It's a
data structure that allows multiple concurrent threads to push data
into a single data stream, which can be read by a consumer thread.
The way these concurrent write operations are merged into a single
consumable stream differs between implementations.
In our case, producers are pushing data into their own thread-local
FIFO ring buffer, which is regularly polled and emptied by the consumer
thread. When the queue is full, new data from producers are discarded
(partial write / fail on overflow according to
this classification).
Figure 3: A rough sketch of our MPSC queue. The red blocks
represent variable-sized byte chunks. My implementation was
actually designed for a multithreaded tracing and debugging
library for my RISC-V kernel.
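To make the overflow policy concrete, here is a toy, non-concurrent sketch of the partial-write behavior described above. This is not the wfmpsc API, just an illustration of the policy.

/// A toy fixed-capacity byte queue that only illustrates the
/// "partial write / fail on overflow" policy; the real ring buffer
/// additionally handles wrap-around and atomic head/tail offsets.
pub struct ToyQueue {
    buf: Vec<u8>,
    cap: usize,
}

impl ToyQueue {
    pub fn new(cap: usize) -> Self {
        Self { buf: Vec::with_capacity(cap), cap }
    }

    /// Copies as much of `data` as currently fits and returns the
    /// number of bytes accepted; the remainder is discarded.
    pub fn push(&mut self, data: &[u8]) -> usize {
        let free = self.cap - self.buf.len();
        let n = data.len().min(free);
        self.buf.extend_from_slice(&data[..n]);
        n
    }
}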
Now the memory layout. The data structure consists of (1) an array of
thread-local buffers, represented as a single contiguous backing array
and (2) an offset table that stores
head and tail offsets of each
individual queue. For the purposes of this blog post, the offset table
is our main concern: whenever the producers or the consumer push or
pop data from these queues, they perform atomic load/store operations
with release/acquire semantics on what are essentially pointers.
The producer
p is the only thread that is allowed to modify the p-th
head offset, using the
push operation. Similarly, the consumer
has exclusive write permission for any
tail via
pop.
This means we
have no competing store operations (and thus no atomic RMW sequences
like compare-and-swap or LL/SC).
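A minimal sketch of this single-writer pattern for one head/tail pair follows; the field and method names are illustrative and not the actual wfmpsc code.

use std::sync::atomic::{AtomicU16, Ordering};

/// Head and tail offsets of a single thread-local ring buffer.
/// Only producer p ever stores to `head`; only the consumer ever
/// stores to `tail`. No competing writers, hence no RMW needed.
pub struct Offsets {
    head: AtomicU16,
    tail: AtomicU16,
}

impl Offsets {
    /// Producer side: publish newly written bytes by advancing `head`.
    /// The Release store makes the buffer writes visible to the
    /// consumer's Acquire load below.
    pub fn publish(&self, new_head: u16) {
        self.head.store(new_head, Ordering::Release);
    }

    /// Consumer side: see how far the producer has written.
    pub fn readable_until(&self) -> u16 {
        self.head.load(Ordering::Acquire)
    }

    /// Consumer side: mark bytes as consumed by advancing `tail`.
    pub fn consume(&self, new_tail: u16) {
        self.tail.store(new_tail, Ordering::Release);
    }

    /// Producer side: see how much space the consumer has freed.
    pub fn writable_from(&self) -> u16 {
        self.tail.load(Ordering::Acquire)
    }
}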
Great! Except, as we've seen before,
independent store operations may still degrade performance due to
false sharing.
Figure 4: The same false sharing scenario, this time adopted for
our MPSC Queue. Producers and consumer can safely modify the
tail/head offsets, as they have exclusive write access to it.
However, if we pack these offsets into a dense structure,
we'll incur a cache coherency penalty.
Now that we've identified how false sharing affects us, let's revisit
the premise of this post:
Don't pack the tail or head indices of each queue together into a struct
without padding each index to a cache line. Otherwise, the cache-coherency
traffic due to false sharing will drastically reduce your performance!
With this advice in mind, we can consider three different ways of laying
out our offset table in memory: packed,
padded and a hybrid of those two.
Figure 5: Different versions of our offset table. We can pack
them tightly (1), add padding to align them to a cache line
(2), or only apply padding to heads (3). I was curious about
this last hybrid approach, because the consumer updates all
tails at the same time, and may be able to make batch updates
with 128-bit atomic stores, which only works when they're
packed in memory.
For now, the subject of our benchmarking suite will be the
packed and
hybrid versions.
The concrete benchmarking setup and workloads are described in detail
in the following section. Note that in order to accurately measure
false sharing, the heads and tails in the packed version each need
to fit into a single 64B cache line. This is why the benchmark
implementation is using 16-bit atomic integers for heads and tails,
allowing up to 32 concurrent producers.
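As a rough sketch, the two layouts under test could look like the following in Rust, assuming a 64-byte cache line and up to 32 producers; the type and field names are illustrative, not the actual wfmpsc definitions.

use std::sync::atomic::AtomicU16;

pub const MAX_PRODUCERS: usize = 32;

// Same 64-byte alignment wrapper as in the earlier sketch.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

/// (1) Packed: heads and tails live in two dense arrays.
/// 32 x 2 bytes = 64 bytes, so each array fits exactly one cache
/// line and every producer's store touches the same line.
#[repr(C)]
pub struct PackedOffsets {
    heads: [AtomicU16; MAX_PRODUCERS],
    tails: [AtomicU16; MAX_PRODUCERS],
}

/// (3) Hybrid: every producer-written head gets its own cache line,
/// while the consumer-written tails stay densely packed so the
/// consumer can still update them in bulk.
#[repr(C)]
pub struct HybridOffsets {
    heads: [CacheAligned<AtomicU16>; MAX_PRODUCERS],
    tails: [AtomicU16; MAX_PRODUCERS],
}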
Benchmarking Setup
Our workload allocates a queue, spawns p
producer threads (initially gated by a semaphore), and
lets producers push a total of 1Mb into a 15-bit wide queue. The
task is considered done once all producers have successfully completed
their (partial) write operations. I controlled for the following two parameters:
- p: number of producers, depending on CPU
- d: number of dummy instructions inserted between
queue ops, values = [0, 500]
The dummy instructions are a more fine-grained way to reduce queue
contention than putting threads to sleep, because (1) sleeping has a large
minimum overhead and (2) it hands control to the OS scheduler, which we
don't control, making our tests less deterministic. The payload for the
push operation is set to 1 byte, because we don't want memcpy to dominate our benchmark.
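A producer's inner loop might look roughly like the following sketch, with std::hint::black_box keeping the dummy work from being optimized away. This is illustrative only, not the actual benchmark code.

use std::hint::black_box;

/// Executes roughly `d` trivial instructions to space out queue ops.
#[inline(never)]
fn dummy_work(d: usize) {
    let mut x = 0u64;
    for i in 0..d {
        // black_box prevents the compiler from removing the loop.
        x = black_box(x.wrapping_add(i as u64));
    }
    black_box(x);
}

/// Pushes `ops` one-byte payloads, pausing for `d` dummy instructions
/// between queue operations. `push` stands in for the queue's push op
/// and returns the number of bytes accepted (partial write on overflow).
fn producer_loop<F: Fn(&[u8]) -> usize>(push: F, ops: usize, d: usize) {
    let payload = [0u8; 1]; // 1-byte payload so memcpy stays negligible
    for _ in 0..ops {
        push(&payload);
        dummy_work(d);
    }
}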
Results and Evaluation
I tested the benchmarking suite on a mix of laptop, desktop and server-grade hardware:
- Apple M1 Pro, 10 cores, bare-metal, macOS
- Intel i5-9600k, 6 cores, bare-metal, Windows
- AMD EPYC, 30 cores, KVM, Linux
- Intel Cascade Lake, 30 cores, KVM, Linux
[lscpu dump]
The code is written in Rust, and uses the
criterion
microbenchmarking framework. You can find the code and instructions to
run the benchmark under
foss.alic.dev/dist1ll/wfmpsc at commit
3fcbe5ed2b
(details are in BENCH.md, Linux-only).
All benchmarks were compiled and run under
rustc 1.68.0-nightly. The
generate_report.py script will output a
report.txt file containing measurements in milliseconds across a
wide variety of parameters. Note that running the entire suite can
take a very long time, because of the combinatorial explosion of
parameters, frequent recompilations, and the high sample count in
the criterion config.
Expectations: We expect
hybrid to outperform
packed in most cases, and the difference between them to
become more apparent as we increase the producer count.
The following sections contain figures and descriptions
for each testing platform.
Apple M1 Pro (aarch64, 10 cores, bare-metal, macOS)
Figure 6: Packed vs. hybrid multithreading performance for
two different contention scenarios. Tested on MacBook Pro M1
Pro 2021, 32GB RAM, macOS 12.6.
Intel i5-9600k (x86_64, 6 cores, bare-metal, Windows)
Figure 7: Same benchmark, run on a desktop PC with an overclocked
Intel i5-9600k@4GHz and 16GB RAM@3600MHz. While we have fewer
total cores than the M1, we observe a similar effect.
AMD EPYC Milan (x86_64, 28 cores, KVM, Linux)
Figure 8: Tested on AMD EPYC Milan on a C2D-highcpu instance @ GCP. lscpu
reports a single socket and a single NUMA node, with SMT disabled. Since
the EPYC 7B13 is a 16-core processor, that many physical cores can't come from
a single chip, so I'm going to assume that we're in fact running our benchmarks on two sockets.
Intel Cascade Lake (x86_64, 30 cores, KVM, Linux)
Figure 9: Tested on 2x Intel Xeon (2 sockets, 1 NUMA node each)
on a C2-highcpu instance @ GCP. A dump of lscpu can be found
here.
Overall, we can confirm that the
packed layout performs worse across
the board, and it generally degrades further with higher thread
count. I found that for the M1 Pro and i5-9600k, the level of simultaneous
cache line accesses is a crucial factor for overall throughput.
This makes sense intuitively: when cache line accesses are spread out,
CPUs have more time to complete their cache transactions while the
remaining instructions continue to execute in parallel. However, on
the server platforms I noticed some unusual behavior. Starting from
>15 producers, we can actually see very good
packed performance when the indices are highly contended.
Even more confusingly, once we add dummy instructions (thereby
relieving pressure on the same cache line), the
hybrid layout
starts to significantly outperform
packed. What is going on here?
Why does false sharing become
more apparent when we
reduce cache
contention?
The likely cause is that these servers
are configured as multi-socket, multi-NUMA machines, and that the specifics
of the CPU/socket interconnect (which can be intimidatingly complex)
cause this behavior.
Conclusion
We have seen that false sharing due to a suboptimal layout has a real,
non-negligible impact on performance for our wait-free data structure
(the quote from the beginning of this post rings true).
The packed layout performs worse than our
hybrid layout across all measured
parameters, with the exception of the dual-thread, single-producer
case. We also saw that the
hybrid layout scales much better with
the number of concurrent producers (which is one of the goals of a
performant MPSC queue).
However, while all measurements lean in favor of
hybrid,
the concrete comparative advantage heavily depends on the
CPU architecture, hardware config and cache line contention.
One surprising result on the server platforms is that by increasing
the frequency of cache line accesses to the maximum, we actually
reduce the relative penalty of false sharing, provided the accesses
span NUMA nodes and multiple sockets.
Further Reading
False sharing, and more generally cache coherence protocols, are
well-understood and thoroughly researched topics. I recommend the
following reading material as a starting point:
- Sorin, Daniel J., Mark D. Hill, and David A. Wood.
"A primer on memory consistency and cache coherence."
Synthesis lectures on computer architecture 6.3 (2011): 1-212.
[PDF]
- Hennessy, John L., and David A. Patterson.
Computer architecture: a quantitative approach. Elsevier, 2011.
[PDF]
- David, Tudor, Rachid Guerraoui, and Vasileios Trigonakis.
"Everything you always wanted to know about synchronization but were afraid to ask."
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 2013.
[PDF]
- Liu, Tongping, and Emery D. Berger. "Sheriff: precise detection and automatic mitigation of false sharing."
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications. 2011.