Acquire-release (acq-rel) is a weaker memory ordering than sequential consistency (SC). This post shows, through litmus tests and compiled assembly, how acq-rel permits interleavings that SC forbids.

§Store buffer

A store buffer is a per-CPU buffer that holds pending stores before they are committed to main memory and visible to other CPUs. Within a CPU, operations follow program order, but store buffers can cause surprising behaviors in multi-core environments.

This is demonstrated using a litmus test.

C store-buffer

{
[x] = 0;
[y] = 0;
}

P0 (atomic_int* x, atomic_int* y) {
atomic_store_explicit(x, 1, memory_order_release);
r0 = atomic_load_explicit(y, memory_order_acquire);
}

P1 (atomic_int* x, atomic_int* y) {
atomic_store_explicit(y, 1, memory_order_release);
r0 = atomic_load_explicit(x, memory_order_acquire);
}

exists
( true
/\ 0:r0 == 0
/\ 1:r0 == 0
)

Running herd7 -c11 -cat rc11.cat storebuffer.litmus produces:

Ok
Witnesses
Positive: 1 Negative: 3

The result shows that the writes to x and y can sit in each CPU's store buffer, leaving them temporarily invisible to the other CPU, so both loads can read 0. This outcome is forbidden under SC — which can be verified by rerunning the same test with -cat sc.cat.

§Assembly

Let’s examine the compiled assembly on two architectures.

First, the C code:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

atomic_int x = 0, y = 0;
int r1 = -1, r2 = -1;

void* thread0(void* arg) {
atomic_store_explicit(&x, 1, memory_order_release);
r1 = atomic_load_explicit(&y, memory_order_acquire);
return NULL;
}

void* thread1(void* arg) {
atomic_store_explicit(&y, 1, memory_order_release);
r2 = atomic_load_explicit(&x, memory_order_acquire);
return NULL;
}

Since both threads follow the same logic, we only show one.

§x64

The corresponding assembly (with -O):

thread0:
mov dword ptr [rip + x], 1 <---------- store
mov eax, dword ptr [rip + y] <---------- load
mov dword ptr [rip + r1], eax
xor eax, eax
ret

On x64, acq-rel compiles down to regular mov instructions. That’s why acq-rel is often described as “free” on x64. But x64’s total store order (TSO) model still allows a store to be reordered after a later load, so the result r1 = r2 = 0 can still occur.

Using SC (memory_order_seq_cst), the compiler emits xchg, whose implicit lock semantics act as a full barrier: the store buffer is drained, making the store visible to all CPUs before the following load executes. This enforces sequential consistency.

thread0:
mov eax, 1
xchg dword ptr [rip + x], eax <----------- store + store-buffer flushing
mov eax, dword ptr [rip + y] <----------- load
mov dword ptr [rip + r1], eax
xor eax, eax
ret

§aarch64

On ARM64 (compiled with clang -O -march=armv8.6-a), acq-rel generates stlr (store-release) and ldapr (load-acquire, an RCpc instruction available from armv8.3 on — hence the -march flag). These instructions carry memory-ordering semantics and are more costly than plain str and ldr.

thread0:
mov w8, #1
adrp x9, x
add x9, x9, :lo12:x
stlr w8, [x9] <----------- store release
adrp x8, y
add x8, x8, :lo12:y
ldapr w8, [x8] <------------ load acquire
adrp x9, r1
mov x0, xzr
str w8, [x9, :lo12:r1]
ret

However, ldapr provides only RCpc (release-consistent, processor-consistent) ordering: it may still be reordered with earlier stlr stores. To close this gap and enforce SC semantics, the compiler emits ldar instead for memory_order_seq_cst; an ldar is never reordered before an earlier stlr.

thread0:
mov w8, #1
adrp x9, x
add x9, x9, :lo12:x
stlr w8, [x9] <------------- store sc
adrp x8, y
add x8, x8, :lo12:y
ldar w8, [x8] <-------------- load sc
adrp x9, r1
mov x0, xzr
str w8, [x9, :lo12:r1]
ret

§Summary

Acquire-release ordering allows certain instruction reorderings that sequential consistency does not. This difference shows up in:

  • Litmus tests: acq-rel permits outcomes that SC forbids
  • Assembly:
    • On x64: acq-rel uses mov, SC uses xchg
    • On aarch64: acq-rel uses stlr/ldapr, SC uses stlr/ldar

§References