I recently began writing a bounded, wait-free, multi-producer
single-consumer (MPSC) queue.
An often-cited piece of wisdom for ring-buffer-based implementations goes as follows: don't pack the tail or head
indices of each queue together into a struct without padding each index
to a cache line. Otherwise, the cache-coherency traffic due to false
sharing will drastically reduce your performance!
In this article, I would like to find out the concrete performance
penalty of false sharing for my data structure. I'll be measuring
the effects on both ARM (Apple Silicon) and x86 (Intel/AMD) processors.
What is False Sharing?
Assume that we're operating in a shared-memory, multi-processor system
and performing concurrent reads and writes to main memory. In the
simplest case, a memory controller will determine a global order of
operations and make any change to the shared memory immediately
visible to all other processors. In reality though, CPUs make use
of local cache hierarchies, so possibly stale data can reside in
different cache domains.
Without some global ordering guarantees, writing meaningful concurrent
algorithms becomes pretty much impossible.
To resolve this, CPUs employ a microarchitectural cache coherence
protocol that (1) propagates writes to all other processors and (2)
serializes writes to the same memory location, so that they are observed
in the same global order by every processor.
Note: Cache coherency is part of a CPU's memory model and is
always architecture- or ISA-specific. Furthermore, operating systems
and programming languages may define their own memory models to abstract
over a variety of hardware. The
C/C++11
memory ordering specification
is one such example.
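For a concrete taste of such a language-level model, here is a minimal Rust sketch (Rust's atomics expose essentially the C/C++11 orderings): the Release store on the flag guarantees that a thread which Acquire-loads it as true also observes the earlier write to the data.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release: publishes the write above to any acquiring reader.
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        // Acquire: once READY reads true, DATA is guaranteed to be 42.
        while !READY.load(Ordering::Acquire) {}
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}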
Most CPUs follow an invalidation-based protocol like
MESI, usually
equipped with an additional state for making certain operations more
efficient via cache-to-cache transfers.
For these protocols, writing to an address generally invalidates all
other processors' cache lines that contained this address's value,
thereby forcing the affected processors to pull the latest value
from a lower memory level (last-level cache or main memory)
on the next read, in the worst case. This can
strongly degrade performance, and can cause a poorly-implemented
multithreaded system to perform worse than its single-threaded
counterpart. For a real example, refer to chapter 7.3 of the
AMD64 architecture manual, which covers their MOESDIF
(Modified, Owned, Exclusive, Shared,
Dirty, Invalid, Forward) cache
coherence protocol.
Update: As noted by David Chisnall, if the writing processor
did not perform a previous load on a cache line, cache coherency
protocols may sometimes employ remote stores instead of invalidation.
This means that the value is sent to the remote
processor over the interconnect, updating the value in-place.
Intuitively, it makes sense that lots of processors reading and writing
concurrently to the same address would cause a performance bottleneck.
This problem is known as true sharing, and is quite easy to identify.
False sharing on the other hand is more insidious. It's a type of
performance degradation where multiple processors write to different
memory addresses concurrently, but still cause lots of cache coherency
traffic. How is that possible if they are writing to different memory
addresses?
The reason is that the cache coherence protocol is coarse-grained and
invalidates an entire cache line on any write, so that updating 8 bytes
of data (the size of an unsigned 64-bit integer) also invalidates the 56
neighboring bytes on a typical 64-byte cache line.
So even though two distinct objects are updated by two different
processors, we see a similar level of performance degradation
as with true sharing, as long as those objects map to the same
cache line.
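To make this concrete, here is a small sketch that checks whether two adjacent values fall on the same cache line by masking their addresses. The 64-byte line size is an assumption (typical for x86; Apple Silicon uses 128-byte lines), so real code should query the target instead of hard-coding it.

const CACHE_LINE: usize = 64; // assumed line size, see caveat above

fn line_index<T>(p: *const T) -> usize {
    p as usize / CACHE_LINE
}

fn main() {
    let pair = (0u64, 0u64); // two adjacent 8-byte values
    let a = &pair.0 as *const u64;
    let b = &pair.1 as *const u64;
    // Adjacent fields almost always land on the same line, so a write
    // to one invalidates the other in every remote cache.
    println!("same line: {}", line_index(a) == line_index(b));
}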
The most egregious offenders are densely packed structs in which each
variable is accessed by a different thread. This creates an
illusion of data independence and parallelism.
#[repr(C)]
pub struct State {
    t1: u64, // only accessed by thread #1
    t2: u64, // only accessed by thread #2
    t3: u64, // only accessed by thread #3
}
Depending on the alignment of State, all three values will likely
map to the same cache line. To avoid this, you generally align variables
to a cache line using either macros or padding fields*.
(*): Padding fields like these are very common in multithreading libraries
and kernel code.
In the Linux kernel for example, this is done with the macro
__cacheline_aligned_in_smp.
You can see it used for the struct bpf_ringbuf
in /bpf/ringbuf.c.
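As a sketch of the padding approach in Rust (assuming a 64-byte line; crossbeam's crossbeam_utils::CachePadded does the same with per-target alignment values), an aligned wrapper gives every field its own cache line:

/// Aligns a value to an (assumed) 64-byte cache line.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

#[repr(C)]
pub struct State {
    t1: CacheAligned<u64>, // thread #1 now owns this entire line
    t2: CacheAligned<u64>, // only accessed by thread #2
    t3: CacheAligned<u64>, // only accessed by thread #3
}

The trade-off is size: State grows from 24 to 192 bytes, which is why this treatment is usually reserved for heavily contended hot fields.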
By now it's pretty clear that we should definitely avoid false sharing;
all that's left is measuring the actual performance gain of cache
line alignment.
The Data Structure: MPSC Queue
To measure the impact of false sharing, we will perform experiments on
a wait-free, bounded, multi-producer single-consumer queue. It's a
data structure that allows multiple concurrent threads to push data
into a single data stream, which can be read by a consumer thread.
The way these concurrent write operations are merged into a single
consumable stream differs between implementations.
In our case, each producer pushes data into its own thread-local
FIFO ring buffer, which is regularly polled and emptied by the consumer
thread. When the queue is full, new data from producers is discarded
(partial write / fail on overflow according to
this classification).
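To sketch the shape of such a per-producer ring buffer (all names are hypothetical and the synchronization is reduced to the essentials), the head and tail indices below are exactly the fields that the often-cited wisdom tells us to pad:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

#[repr(align(64))] // assumed 64-byte line, as before
struct CacheAligned<T>(T);

/// One producer's thread-local FIFO; N is assumed to be a power of two.
struct RingBuffer<T, const N: usize> {
    slots: [UnsafeCell<MaybeUninit<T>>; N],
    head: CacheAligned<AtomicUsize>, // read index, advanced by the consumer
    tail: CacheAligned<AtomicUsize>, // write index, advanced by the producer
}

impl<T, const N: usize> RingBuffer<T, N> {
    /// Wait-free push: discards instead of blocking when the buffer is full.
    fn try_push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == N {
            return Err(value); // full: fail on overflow, as described above
        }
        unsafe { (*self.slots[tail % N].get()).write(value) };
        // Release: the slot's contents become visible before the new tail.
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }
}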