Acquire-release (acq-rel) is a weaker memory ordering than sequential consistency (SC). This post shows, through litmus tests and compiled assembly, how acq-rel permits interleavings that SC forbids.

§Store buffer

A store buffer is a per-CPU buffer that holds pending stores before they are committed to main memory and visible to other CPUs. Within a CPU, operations follow program order, but store buffers can cause surprising behaviors in multi-core environments.

This is demonstrated using a litmus test.

C store-buffer

{
[x] = 0;
[y] = 0;
}

P0 (atomic_int* x, atomic_int* y) {
atomic_store_explicit(x, 1, memory_order_release);
r0 = atomic_load_explicit(y, memory_order_acquire);
}

P1 (atomic_int* x, atomic_int* y) {
atomic_store_explicit(y, 1, memory_order_release);
r0 = atomic_load_explicit(x, memory_order_acquire);
}

exists
( true
/\ 0:r0 == 0
/\ 1:r0 == 0
)

Running herd7 -c11 -cat rc11.cat storebuffer.litmus produces:

Ok
Witnesses
Positive: 1 Negative: 3

The result shows that the writes to x and y can sit in each CPU's store buffer, leaving them temporarily invisible to the other CPU, so both loads can read 0. This outcome is forbidden under SC — which can be verified by rerunning the same test with -cat sc.cat.

§Assembly

Let’s examine the compiled assembly on two architectures.

First, the C code:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

atomic_int x = 0, y = 0;
int r1 = -1, r2 = -1;

void* thread0(void* arg) {
atomic_store_explicit(&x, 1, memory_order_release);
r1 = atomic_load_explicit(&y, memory_order_acquire);
return NULL;
}

void* thread1(void* arg) {
atomic_store_explicit(&y, 1, memory_order_release);
r2 = atomic_load_explicit(&x, memory_order_acquire);
return NULL;
}

Since both threads follow the same logic, we only show one.

§x64

The corresponding assembly (with -O):

thread0:
mov dword ptr [rip + x], 1 <---------- store
mov eax, dword ptr [rip + y] <---------- load
mov dword ptr [rip + r1], eax
xor eax, eax
ret

On x64, acq-rel compiles down to regular mov instructions. That’s why acq-rel is often described as “free” on x64. But x64’s total store order (TSO) model still allows a store to be reordered after a later load, so the result r1 = r2 = 0 can still occur.

Using SC (memory_order_seq_cst), the compiler emits xchg, whose implicit lock semantics act as a full barrier: the store buffer is drained, making the store visible to all CPUs before the following load executes. This enforces sequential consistency.

thread0:
mov eax, 1
xchg dword ptr [rip + x], eax <----------- store + store-buffer flushing
mov eax, dword ptr [rip + y] <----------- load
mov dword ptr [rip + r1], eax
xor eax, eax
ret

§aarch64

On ARM64 (compiled with clang -O -march=armv8.6-a), acq-rel generates stlr (store-release) and ldapr (load-acquire, an RCpc instruction available from armv8.3 on — hence the -march flag). These instructions carry memory-ordering semantics and are more costly than plain str and ldr.

thread0:
mov w8, #1
adrp x9, x
add x9, x9, :lo12:x
stlr w8, [x9] <----------- store release
adrp x8, y
add x8, x8, :lo12:y
ldapr w8, [x8] <------------ load acquire
adrp x9, r1
mov x0, xzr
str w8, [x9, :lo12:r1]
ret

However, ldapr provides only RCpc (release-consistent, processor-consistent) ordering: it may still be reordered with earlier stlr stores. To close this gap and enforce SC semantics, the compiler emits ldar instead for memory_order_seq_cst; an ldar is never reordered before an earlier stlr.

thread0:
mov w8, #1
adrp x9, x
add x9, x9, :lo12:x
stlr w8, [x9] <------------- store sc
adrp x8, y
add x8, x8, :lo12:y
ldar w8, [x8] <-------------- load sc
adrp x9, r1
mov x0, xzr
str w8, [x9, :lo12:r1]
ret

§Summary

Acquire-release ordering allows certain instruction reorderings that sequential consistency does not. This difference shows up in:

  • Litmus tests: acq-rel permits outcomes that SC forbids
  • Assembly:
    • On x64: acq-rel uses mov, SC uses xchg
    • On aarch64: acq-rel uses stlr/ldapr, SC uses stlr/ldar

§References