

Semantics, languages and algorithms for multicore programming

Albert Cohen



Luc Maranget


Francesco Zappa Nardelli


## Vote: topics for my this lecture

1. Operational and axiomatic formalisation of x 86 -TSO
2. The Iwarx and stwcx Power instructions
3. Fence optimisations for $x 86$-TSO
4. The Java memory model
5. The C++11 memory model
6. Static and dynamic techniques for data-race detection (5)
7. The Linux memory model (?!)
8. Compiler correctness statements (compile non-determinism?) (5)
9. The Linux memory model (ahem, kinda)

## The Linux memory model

Facts:

- abstraction layer over hardware and compilers

- relied upon by kernel developers to write "portable kernel code"
- documented by a text file:
http://www.kernel.org/doc/Documentation/memory-barriers.txt


## The Linux memory model

## Facts:

- abstraction layer over hardware and compilers

- relied upon by kernel developers to write "portable kernel code"
- documented by a text file:
http://www.kernel.org/doc/Documentation/memory-barriers.txt

More facts:
I attempted to understand the doc, and exchanged a few email with Paul Mc Kenney. However I don't understand much...

In the next hour, let's go over the documentation together and see if we can make sense of it...

## The Linux memory model

Expected to account for all supported combinations of compiler and hardware memory model...

Linux kernel

Linux memory model


Hardware
alpha: Weak ordering. No dependency ordering. "Time does not go backwards" gives guarantees similar to Power/ARM A-cumulativity. Possibly B-cumulativity as well. I am not aware of formalization of this architecture's memory ordering other than Gharachorloo's PhD.
arm: You know at least as much as I do about this one.
avr32: Uniprocessor-only, kernel build failure for SMP.
blackfin: Uniprocessor-only to the best of my knowledge. There are rumored to be some experimental SMP systems that lack cache coherence, and are thus outside of the Linux kernel's remit. See for example: https://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:smp-like The system.h file flushes cache when a memory barrier is encountered, which is consistent with an attempt to run the Linux kernel on a non-cache-coherent system...
cris: Uniprocessor-only to the best of my knowledge. Though there appears to be recent addition of some SMP support. Its system.h file is consistent with full sequential consistency. Or extreme optimism on the part of the cris developers.
frv: Uniprocessor-only to the best of my knowledge.
h8300: Uniprocessor-only to the best of my knowledge. There is code in system.h that appears to be intended for SMP, but it looks to me like a (harmless) copy-paste error. Either that or SMP h8300 systems are sequentially consistent.
ia64: Total order of all release operations, which include the "mf" (memory fence) instruction. Memory fences cannot restore sequential consistency.
m32r. Uniprocessor-only to the best of my knowledge. However, there does appear to be some recent multiprocessor support. This is quite strange -- atomic instructions flush cache, but memory barriers are no-ops. Looks quite experimental.
$m 68 k$ : Uniprocessor-only to the best of my knowledge.
microblaze: Uniprocessor-only to the best of my knowledge. At least one SMP attempt: http://microblazesmp.blogspot.com/ Its system.h file looks uniprocessor-only.
mips: Multiprocessor. Old SGI MIPS systems were sequentially consistent. Newer systems used for network infrastructure are rumored to have weak memory models similar to Power and ARM. And its system.h file is consistent with a weak memory model.
mn10300: Recent SMP support which I know little about. The system.h file looks uniprocessor only, and contains comments on Intel, so copy-pasted from x86.
parisc: TSO, similar to $\times 86$.
powerpc: You know at least as much about this as I do.
s390: TSO, but with self-snooping of store buffer prohibited.
score: Uniprocessor-only to the best of my knowledge.
sh: Recent SMP support which I know little about. Its system.h file is consistent with weak memory ordering.
sparc: TSO, similar to $\times 86$. There is documentation about weaker memory models (PSO and RMO), but in practice the hardware is TSO.
tile: Recent SMP CPU which I know little about. Seems to be weakly ordered based on its system.h file.
um: Looks like an x86 knockoff judging by the system.h file.
unicore32: Uniprocessor-only to the best of my knowledge.
x86: You know this one at least as well as do I.
xtensa: Uniprocessor-only -- kernel build failure otherwise.

## The Linux memory model

My intuition:

Annoying facts:

## The Linux memory model

My intuition:
kinda of lowest common denominator between all hardware memory models of architectures Linux can be compiled to, taking into account also some common gcc optimisations, with some weirdnesses.

Annoying facts:
semantics of "read barriers" really weak, unclear how to formalise it compilation of barriers on Itanium looks broken -- hardware might exhibit behaviours prohibited by the MM.

## ...let's read the doc...

## The Linux memory model: macros

on x86:

```
#define mb() asm volatile("mfence":::"memory")
#define rmb() asm volatile("lfence":::"memory")
#define wmb() asm volatile("sfence" ::: "memory")
```

as far as we know, Ifence and sfence are noop in x86TSO

## on Power:

\#define mb() __asm___volatile_("sync" : : : "memory")
\#define rmb() _-asm___volatile__("sync" : : : "memory")
\#define wmb() _-asm___volatile_ ("sync" : : : "memory")
\#define read_barrier_depends() do $\}$ while(0)

Internship proposal on the fly...

## Sort out what the REAL Linux memory model is

Yes. Of course, if people come up with lots of situations where the more-complex programming model would help significantly, then it might be worth revisiting this.

Pros:
Challenging!


Can have a huge impact!
Collaboration with Paul Mc Kenney possible!


## 2. The C++11 memory model

a good example of an axiomatic memory model


## The C++11 memory model

1300 page prose specification defined by the ISO.
The design is a detailed compromise:
hardware/compiler implementability useful abstractions broad spectrum of programmers

Welcome to the official home of

# TSO IEC JTC1/SC22/WG21 - The C++ Standards Committee 

2011-09-15: standards I projects I papers I mailings $\mid$ internals $\mid$ meetings $\mid$ contacts

News 2011-09-11: The new C++ standard - C++11 - is published!

## The syntactic divide

// for regular programmers:
atomic_int $\mathrm{x}=0$;
x.store(1);
y = x.load();
// for experts:
x.store(2, memory_order);
y = x.load(memory_order);
atomic_thread_fence(memory_order);
where memory_order is one of the following:
mo_seq_cst mo_release mo_acquire
mo_acq_rel mo_consume mo_relaxed

## How may a program execute?

Two layer semantics:

1) an operational semantics processes programs, identifying memory actions, and constructs candidate executions (Eopsem);

$$
P \longrightarrow E_{1}, \ldots, E_{n}
$$

2) an axiomatic memory model judges Eopsem paired with a memory ordering $X$ witness

$$
\mathrm{E}_{\mathrm{i}} \longrightarrow \mathrm{X}_{\mathrm{i} 1}, \ldots, \mathrm{X}_{\mathrm{im}}
$$

3) searches the consistent executions for races and uncostrained reads
is there an $X_{i j}$ with a race?

## Relations

An Eopsem part containing:
sb sequenced before, program order
asw additional synchronizes with, inter-thread ordering

An $X_{\text {witness }}$ part containing:
if relates a write to any reads that take its value
sc a total order over mo_seq_cst and mutex actions
mo modification order, per location total order of writes

From these, compute synchronise-with (sw) and happens-before ( $h b$ ).

We ignore consume atomics, which enables us to live in a simplified model. Full details in Batty et al., POPL 11.

## Formally

```
cpp_memory_model_opsem (p : program) =
let pre_executions =
    {(Eopsem,Xwitness). opsem p Eopsem ^
        consistent execution (Eopsem, Xwitness)}
in
if \existsX \in pre executions.
    (indeterminate reads }\textrm{X}={})\textrm{V
    (unsequenced races }X={}) 
    (data races X = {})
then NoNE
else Some pre_executions
```


## A single-threaded example

1. sequenced before (sb) - given by opsem


## A single-threaded example

1. sequenced before (sb) - given by opsem
2. read-from (rf) - part of the witness
```
int main() {
    int x = 2;
    int y = 0;
    y = (x==x);
    return 0;
}
```



## A single-threaded ex. with undefined behaviour

An unsequenced race.


## A simple concurrent program

We will omit asw arrows whenever we are not interested in the initialisation.

$$
\text { rrows whenever } \begin{aligned}
& \mathrm{b}: \mathrm{W}_{\mathrm{na}} \mathrm{x}=3 \quad \mathrm{c}: \mathrm{R}_{\mathrm{nq}} \mathrm{x}=2 \\
& \text { ed in the initialisation. } \\
& \mathrm{d}: \mathrm{W}_{\mathrm{na}} \mathrm{y}=0
\end{aligned}
$$

$$
\begin{aligned}
& \text { int } y, x=2 \text {; } \\
& \mathrm{x}=3 \text {; } \\
& \mid y=(x==3) ;
\end{aligned}
$$

## Locks and unlocks

int $\mathrm{x}, \mathrm{r}$;
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $x=\ldots$ | $r=x ;$ |
| m.unlock ()$;$ |  |

1. the operational semantics defines the sb arrows


## Locks and unlocks

int $\mathrm{x}, \mathrm{r}$;
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $\mathrm{x}=\ldots$ | $\mathrm{r}=\mathrm{x} ;$ |
| $\mathrm{m} . \operatorname{unlock}() ;$ |  |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)


## Locks and unlocks

int $\mathrm{x}, \mathrm{r}$;
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $\mathrm{x}=\ldots$ | $\mathrm{r}=\mathrm{x} ;$ |
| $\mathrm{m} . \operatorname{unlock}() ;$ |  |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)
3. the sc order is included in the syncronised-with relation

f:U mutex

## Locks and unlocks

## simple-happens-before


int $x, r ;$
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $\mathrm{x}=\ldots$ |  |
| $\mathrm{m} . \operatorname{unlock}() ;$ | $\mathrm{r}=\mathrm{x} ;$ |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)
3. the sc order is included in the syncronised-with relation

f:U mutex
4. which in turn defines the happens-before relation...

## Happens before

The happens before relation is key to the model:

1. non-atomic loads read the most recent write in happens before. (This is unique in DRF programs)
2. the story is more complex for atomics, as we shall see.
3. data races are defined as an absence of happens before between conflicting actions.
simple-happens-before


f:U mutex

## A data race

asw

$$
\begin{aligned}
& \text { int } y, x=2 \text {; } \\
& x=3 ; \quad \mid y=(x==3) ;
\end{aligned}
$$

## A data race

$$
\begin{aligned}
& \text { int } y, x=2 \text {; } \\
& x=3 ; \quad \mid y=(x==3) ;
\end{aligned}
$$



## Data race definition

let data_races actions $h b=$
$\{(a, b) \mid \forall a \in$ actions $b \in$ actions
$\neg(a=b) \wedge$
same_location a $b \wedge$
(is_write $a \vee$ is_write $b$ ) $\wedge$
$\neg($ same_thread $a b) \wedge$
$\neg($ is_atomic_action $a \wedge$ is_atomic_action $b) \wedge$
$\neg((a, b) \in h b \vee(b, a) \in h b)\}$

Programs with a data race have undefined behaviour (DRF model).

## Simple concurrency: Dekker's example and SC

```
atomic_int x = 0;
atomic_int y = 0;
x.store(1, seq_cst); y.store(1, seq_cst);
y.load(seq_cst); x.load(seq_cst);
```



Why is this behaviour forbidden?

## Simple concurrency, Dekker's example and SC

```
atomic_int x = 0;
atomic_int y = 0;
x.store(1, seq_cst); y.store(1, seq_cst);
y.load(seq_cst); x.load(seq_cst);
```

$\mathrm{c}: \mathrm{W}_{\mathrm{sc}} \mathrm{y}=1$
$e: W_{s c} x=1$
$f: R_{s c} y=1$

The sc relation must define a total order over unlocks/locks and seq_cst accesses... sc is included in hb, an rf must respect hb.

## Expert concurrency: the release-acquire idiom

// sender
x = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x ;

Here we have an rf arrow beetwen a pair of release/acquire accesses.


## Expert concurrency: the release-acquire idiom

// sender
x = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x;

Here we have an rf arrow beetwen a pair of release/acquire accesses.

The rf arrow beetwen release/acquire accesses induces an sw arrow between those accesses.


## Expert concurrency: the release-acquire idiom

// sender
x = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x ;
Here we have an rf arrow beetwen a pair of release/acquire accesses.

The rf arrow beetwen release/acquire accesses
 induces an sw arrow between those accesses.

And in turn defines an hb constraint. simple-happens-before

$$
(\stackrel{\text { sequenced-before }}{\longrightarrow} \cup \stackrel{\text { synchronizes-with }}{\longrightarrow})^{+}
$$

## Relaxed writes

x.load(relaxed);
y.store(1, relaxed);

```
y.load(relaxed);
```

y.load(relaxed);
x.store(1, relaxed);

```
x.store(1, relaxed);
```

No data-races, no synchronisation cost, but weakly ordered.

## Relaxed writes, ctd.

```
atomic_int x = 0;
atomic_int y = 0;
```


$\mathrm{c}:$ Wrlx $\mathrm{x}=1 \quad \mathrm{~d}:$ Wrlx $\mathrm{y}=1$
rf

$\mathrm{f}: \operatorname{Rrlx} \mathrm{y}=0$
$\mathrm{g}:$ Rrlx $\mathrm{y}=1$

$h: \operatorname{Rrlx} x=0$

Again, no data-races, no synchronisation cost, but weakly ordered (IRIW).

## Expert concurrency: fences avoid excess sync.

// sender
$x=\ldots$
y.store(1, release);

```
// receiver
while (0 == y.load(acquire));
r = x;
```

// sender
x = ...
y.store(1, release);
// receiver while (0 == y.load(relaxed));
fence(acquire);
$r=x$;

## Expert concurrency: fences avoid excess sync.

```
// sender
x = ...
y.store(1, release);
```

// receiver
while (0 == y.load(relaxed));
fence(acquire);
r = x ;

Here we have an rf arrow beetwen a release write and a relaxed write.


## Expert concurrency: fences avoid excess sync.

```
// sender
x = ...
y.store(1, release);
```

```
// receiver
while (0 == y.load(relaxed));
fence(acquire);
r = x;
```

Here we have an rf arrow beetwen a release write and a relaxed write.

$\begin{array}{lll}\mathrm{d}: \mathrm{W}_{\text {rel }} \mathrm{y}=1 & \rightarrow & \mathrm{f}: \mathrm{F}_{\text {aca }} \\ \text { tions } & \mathrm{sw} & \mathrm{sb} \\ \text { te, adding }\end{array}$
$\mathrm{g}: \mathrm{R}_{\mathrm{na}} \mathrm{x}=1$

## Expert concurrency: fences avoid excess sync.

```
// sender
x = ...
y.store(1, release);
```

Here we have an rf arrow beetwen a release write and a relaxed write.

The acquire fence follows the $\mathrm{sb} / \mathrm{rf}$ relations looking for the corresponding release write, adding a sw arrow.

Happens-before follows as usual...

## Modification order

atomic_int x = 0; | x.store(1, relaxed); | $\begin{array}{l}\text { x.load(relaxed); } \\ \text { x.store(2, relaxed); }\end{array}$ |
| :--- | :--- |
| x.load(relaxed); |  |



Modification order is a total order over atomic writes of any memory order.

## Coherence and atomic reads

All forbidden:


CoRR
$a: W x=1$
hb
$\mathrm{b}: \mathrm{W} \mathrm{mo}$
$\mathrm{m}=2$
CoWW


CoRW

Idea: atomics cannot read from later writes in happens-before.

## Coherence and atomic reads

All forb


Idea: atomics cannot read from later writes in happens-before.

## The full model



## Is C++11 hopelessly complicated?

Programmers cannot be given this model.
However, with a formal definition, we can do proofs!

- Can we compile to $x 86 ?$

| Operation | x86 Implementation |
| :--- | :--- |
| load(non-seq_cst) | mov |
| load(seq_cst) | lock xadd(0) |
| store(non-seq_cst) | mov |
| store(seq_cst) | lock xchg |
| fence(non-seq_cst) | no-op |


| C $++0 x$ Operation | POWER Implementation |
| :--- | :--- |
| Non-atomic Load | ld |
| Load Relaxed | ld |
| Load Consume | ld (and preserve dependency) |
| Load Acquire | ld; cmp; bc; isync |
| Load Seq Cst | sync; ld; cmp; bc; isync |
| Non-atomic Store | st |
| Store Relaxed | st |
| Store Release | lwsync; st |
| Store Seq Cst | sync; st |

## Is C++11 hopelessly complicated?

Simplifications:
Full model: visible sequences of side effects are unneded (HOL4)
Derivative models:

- without consume, happens-before is transitive
- DRF programs using only seq_cst atomics are SC (false)

```
atomic_int x = 0;
atomic_int y = 0;
if (1 == x.load(seq_cst)) |if (1 == y.load(seq_cst))
    atomic_init(&y, 1);
    atomic_init(&x, 1);
```

atomic_init is a non-atomic write, and in C++11 they race.

## The current state of the standard

## Fixed:

- in some cases, happens-before was cyclic
- coherence
- seq_cst atomics were more broken


## Not fixed:

- self satisfying conditional

```
r1 = x.load(mo_relaxed); | r2 = y.load(mo_relaxed);
if (r1 == 42)
    y.store(r1, mo_relaxed);
r2 = y.load(mo_relaxed);
```



- seq_cst atomics are still not SC


# 3. Sketch of an operational formalisation of $x 86-T S O$ 

...starting with a formalisation of SC

## Separate language and memory semantics

```
class ArrayWrapper
    public:
        ArrayWrapper (int n)
            : _p_vals( new int[ n ] )
            , _size( n )
        {}
        // copy constructor
        ArrayWrapper (const ArrayWrapper& other)
            : _p_vals( new int[ other._size ] )
            , _size( other._size )
            for ( int i = 0; i < _size; ++i )
            { ( )
                _p_vals[ i ] = other._p_vals[ i ];
        }
        ~ArrayWrapper ()
            delete [] _P_vals;
        }
    private:
    int *_p_vals;
    int _size;
```

program
semantics defined via an LTS

memory semantics defined via an LTS

Labels for interaction:
$\mathrm{W}_{\mathrm{t}}[\mathrm{a}] \mathrm{v}$ : a write of value v to address a by thread t $R_{t}[a] v$ : a read of $v$ from a by $t$ by thread $t$ + other events for barriers and locked instructions

## Separate language and memory semantics

Separate language and state semantics proved to be a very good choice in many (unrelated) projects I worked on!

Labels for interaction:
$\mathrm{W}_{\mathrm{t}}[\mathrm{a}] \mathrm{v}$ : a write of value v to address a by thread t $R_{t}[a] v$ : a read of $v$ from a by $t$ by thread $t$

+ other events for barriers and locked instructions


## A tiny language



## What can a thread do in isolation?

$e \xrightarrow{l} e^{\prime} \quad e$ does $l$ to become $e^{\prime}$
$\xlongequal[{* x \xrightarrow{\mathrm{R} x=n}} n]{\text { READ }}$
$\stackrel{* x=n \xrightarrow{\mathrm{~W} x=n} n}{ } \quad$ WRITE
$\frac{e \xrightarrow{l} e^{\prime}}{* x=e \xrightarrow{l} * x=e^{\prime}}$
$\overline{n ; e \xrightarrow{\tau} e} \quad \mathrm{SEQ}$
$e_{1} \xrightarrow{l} e_{1}^{\prime}$
$\overline{e_{1} ; e_{2} \xrightarrow{l} e_{1}^{\prime} ; e_{2}}$

$$
\begin{array}{ll}
\frac{e_{1} \xrightarrow{l} e_{1}^{\prime}}{e_{1}+e_{2} \xrightarrow{l} e_{1}^{\prime}+e_{2}} & \text { PLUS_CONTEXT_1 } \\
\frac{e_{2} \xrightarrow{l} e_{2}^{\prime}}{n_{1}+e_{2} \xrightarrow{l} n_{1}+e_{2}^{\prime}} & \text { PLUS_CONTEXT_2 } \\
\frac{n=n_{1}+n_{2}}{n_{1}+n_{2} \xrightarrow{\tau} n} \quad \text { PLUS }
\end{array}
$$

Observe that we can read an arbitrary value from the memory.

## Lifting to processes

$p \xrightarrow{l_{t}} p^{\prime} \quad p$ does $l_{t}$ to become $p^{\prime}$
$\frac{e \xrightarrow{l} e^{\prime}}{t: e \xrightarrow{l_{t}} t: e^{\prime}} \quad$ THREAD
$\frac{p_{1} \xrightarrow{l_{t}} p_{1}^{\prime}}{p_{1}\left|p_{2} \xrightarrow{l_{t}} p_{1}^{\prime}\right| p_{2}}$
PAR_CONTEXT_LEFT
Actions are labelled by the thread that performed the action.

Free interleaving.
$\frac{p_{2} \xrightarrow{l_{t}} p_{2}^{\prime}}{p_{1}\left|p_{2} \xrightarrow{l_{t}} p_{1}\right| p_{2}^{\prime}}$

## PAR_CONTEXT_RIGHT

## A sequentially consistent memory

Take M to be a function from addresses to integers.

$$
M \xrightarrow{l} M^{\prime} \quad M \text { does } l \text { to become } M^{\prime}
$$

$$
\frac{M(x)=n}{M \xrightarrow{\mathrm{R} x=n} M} \quad \text { MREAD }
$$

$$
\overline{M \xrightarrow{\mathrm{~W} x=n} M \oplus(x \mapsto n)}
$$

Mwrite

## SC semantics: whole system transitions

$s^{l_{t}} s^{\prime} \quad s$ does $l_{t}$ to become $s^{\prime}$

$$
\begin{gathered}
p \xrightarrow{\mathrm{R}_{t} x=n} p^{\prime} \\
\left\langle\begin{array}{l}
M \xrightarrow{\mathrm{R} x=n} M^{\prime} \\
\langle p, M\rangle \xrightarrow{\mathrm{R}_{t} x=n}\left\langle p^{\prime}, M^{\prime}\right\rangle
\end{array} \quad\right. \text { SREAD } \\
\frac{p \xrightarrow{\mathrm{~W}_{t} x=n} p^{\prime}}{M \xrightarrow{\mathrm{~W} x=n} M^{\prime}} \quad \text { SWRITE } \\
\frac{p p, M\rangle \xrightarrow{\mathrm{W}_{t} x=n}\left\langle p^{\prime}, M^{\prime}\right\rangle}{\langle p, M\rangle \xrightarrow{\tau_{t}}\left\langle p^{\prime}, M\right\rangle} \quad \text { STAU }
\end{gathered}
$$

## SC semantics, example

All threads read and write the shared memory. Threads execute asynchronously,the semantics allows any interleaving of the thread transitions.


Each interleaving has a linear order of reads and writes to memory.
...now we just have to define a TSO memory...

## x86-TSO abstract machine



## x86-tso: a formalisation using an LTS

The machine state $s$ can be represented by a tuple ( $M, B, L$ ):
M : address -> value option
B : tid -> (address * value) list
L : tid option
where:
$M$ is the shared memory, mapping addresses to values
B gives the store buffer for each thread
L is the global machine lock indicating when a thread has exclusive access to memory (omitted in these slides)

## x86-tso abstract machine: selected transition rules

t is not blocked in machine state $\mathrm{s}=(\mathrm{M}, \mathrm{B}, \mathrm{L})$ if $[\ldots$ or] the lock is not held.
In buffer $\mathrm{B}(\mathrm{t})$ there are no pending writes for address x if there are no $(x, v)$ elements in $B(t)$.

RM: Read from memory
not_blocked $(s, t)$
$s . M(x)=v$
no_pending $(s . B(t), x)$
$s \xrightarrow{\mathrm{R}_{t} x=v} s$
Thread $t$ can read $v$ from memory at address $x$ if $t$ is
not blocked, the memory does contain $v$ at $x$, and
there are no writes to $x$ in $t$ 's store buffer.

## x86-tso abstract machine: selected transition rules

## RB: Read from write buffer

not_blocked $(s, t)$
$\exists b_{1} b_{2} . s \cdot B(t)=b_{1}++[(x, v)]++b_{2}$
no_pending $\left(b_{1}, x\right)$
$s \xrightarrow{\mathrm{R}_{t} x=v} s$
Thread $t$ can read $v$ from its store buffer for address $x$ if $t$ is not blocked and has $v$ as the newest write to $x$ in its buffer;

## x86-tso abstract machine: selected transition rules

WB: Write to write buffer

$$
s \xrightarrow{\mathrm{~W}_{t} x=v} s \oplus \llbracket B:=s . B \oplus(t \mapsto([(x, v)]++s . B(t))) \downarrow
$$

Thread $t$ can write $v$ to its store buffer for address $x$ at any time;

WM: Write from write buffer to memory

$$
\begin{aligned}
& \text { not_blocked }(s, t) \\
& s \cdot B(t)=b++[(x, v)]
\end{aligned}
$$

$$
\begin{aligned}
& s \xrightarrow{\tau_{t x=v}} \\
& s \oplus \backslash M:=s . M \oplus(x \mapsto v) \rrbracket \oplus \backslash B:=s . B \oplus(t \mapsto b) \rrbracket
\end{aligned}
$$

If $t$ is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in memory at the given address, without coordinating with any hardware thread

## 4. Veryfing fence elimination optimisations

aka reasoning on the $x 86 T S O$ operational memory model and compiler correctness

## CompCertTSO



CFG generation
[POPL 2011]

## CompCertTSO + fence optimisations



## Compilers are ideal for verification

source program (e.g., C)

Compilers are:

- Basic computing infrastructure
- Generally reliable, but nevertheless contain many bugs
e.g., Yang et al. [PLDI 2011] found 79 gcc \& 202 llvm bugs
- "Specifiable": compiler correctness = preservation of behaviours
- Interesting: naturally higher-order, involve clever algorithms
- Big, but modular


## Language semantics

The semantics of all the CompCertTSO languages is defined by:

- a type of programs, prg
- a type of states, states
- a set of initial states for each program, init $\in \operatorname{prg} \rightarrow \mathbb{P}($ states $)$
- a transition relation, $\quad \rightarrow \in \mathbb{P}($ states $\times$ event $) \times$ states $)$
call, return, fail, oom,

The visible behaviour of a program is defined by the external function calls (call) and returns (return), errors (fail), and running out of memory (oom).

## Traces

- Finite sequences of call \& return events ending with: end: successful termination, inftau: infinite execution that stops performing visible events oom: execution runs out of memory
- Infinite sequences of call \& return events;

$$
\begin{aligned}
\operatorname{traces}(p) \stackrel{\text { def }}{=} & \left\{\ell \cdot \operatorname{end} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\ell} s^{\prime} \wedge s^{\prime} \nrightarrow\right\} \\
& \cup\left\{\ell \cdot \operatorname{tr} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\ell \cdot \mathrm{fail}} s^{\prime}\right\} \\
& \cup\left\{\ell \cdot \operatorname{inftau} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \nRightarrow s^{\prime} \wedge \operatorname{inftau}\left(s^{\prime}\right)\right\} \\
& \cup\left\{\ell \cdot \operatorname{oom} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \stackrel{\ell}{\Rightarrow} s^{\prime}\right\} \\
& \cup\{t r \mid \exists s \in \operatorname{init}(p) . s \text { can do the infinite trace } \operatorname{tr}\}
\end{aligned}
$$

NB: Erroneous computations become undefined after the first error.

## Compiler correctness

source program (e.g., C)
Compiler
target program (e.g., x86)
traces(source_program) $\supseteq$ traces(target_program)
print "a" || print "b"
$\longrightarrow$ print "ab"
print "ab"
$\rightarrow$ print "a" || print "b"
fail
$\longrightarrow$ print "ab"
print "ab"


## Store buffering



## Store buffering



## Store buffering



## Store buffering



## Store buffering



## Store buffering



## Store buffering



## Store buffering + fences



## Store buffering + fences



## Store buffering + fences



## Store buffering + fences

MFENCE blocks until the thread buffer is empty


## Who inserts fences?

```
1. The programmer, explicitly. Example: Fraser's lockfree-lib:
/*
    * II. Memory barriers.
    * MB(): All preceding memory accesses must commit before any later accesses.
    *
    * If the compiler does not observe these barriers (but any sane compiler
    * will!), then VOLATILE should be defined as 'volatile'.
    */
#define MB() __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
```

2. The compiler, to implement a high-level memory model, e.g. SEQ_CST C++0x low-level atomics on x86:

Load SEQ_CST: MFENCE; MOV
Store SEQ_CST: MOV; MFENCE

## Fence instructions

1. Fences are necessary
to implement locks \& not fully-commutative linearizable objects (e.g., stacks, queues, sets, maps).
[Attiya et al., POPL 2011]
2. Fences can be expensive

## Redundant fences (1)

If we have two consecutive fence instructions, we can remove the latter:


The buffer is already empty when the second fence is executed.

Generalisation:
MFENCE
NON-WRITE INSTR
‥
NON-WRITE INSTR
MFENCE

$\xrightarrow{>} \quad$| MFENCE |
| :--- |
| NON-WRITE INSTR |
| $\cdots$ |
| NON-WRITE INSTR |
| NOP |

A fence is redundant if it always follows a previous fence or locked instruction in program order, and no memory store instructions are in between.

A forward data-flow problem over the boolean domain $\{\perp, \top\}$

Associate to each program point:
$\perp$ : along all execution paths there is an atomic instruction before the current program point, with no intervening writes;

T : otherwise.

$$
\mathcal{F} \mathcal{E}_{1}(n)= \begin{cases}\top & \text { if predecessors }(n)=\emptyset \\ \bigsqcup_{p \in \operatorname{predecessors}(n)} T_{1}\left(\operatorname{instr}(p), \mathcal{F E}_{1}(p)\right) & \text { otherwise }\end{cases}
$$

A fence is redundant if it always follows a previous fence or locked instruction in program order, and no memory store instructions are in between.


## Redundant fences (2)

If we have two consecutive fence instructions, we can remove the former:


Intuition: the visible effects initially published by the former fence, are now published by the latter, and nobody can tell the difference.

Generalisation:

| MFENCE |  | NOP |  |
| :--- | :--- | :--- | :--- |
| INSTRUCTION | 1 |  | INSTRUCTION |
| IN | 1 |  |  |
| $\cdots$ |  | $\cdots$ |  |
| INSTRUCTION |  | INSTRUCTION | $n$ |
| MFENCE |  | MFENCE |  |

## Redundant fences (2)

If there are reads in between the fences...

but

|  | Thread 0 | Thread 1 |  |
| :---: | :---: | :---: | :---: |
| $[\mathrm{x}]=[\mathrm{y}]=0$ | $\begin{aligned} & \text { MOV }[\mathrm{x}] \\ & \operatorname{NOP} \\ & \text { MOV EAX } \\ & \text { MFENCE } \end{aligned}$ | $\begin{aligned} & \operatorname{MOV}[\mathrm{y}] \\ & \operatorname{MFENCE} \\ & \operatorname{MOV} \text { EBX } \end{aligned} \leftarrow[\mathrm{x}]=10$ | $\begin{gathered} \mathrm{EAX}=\mathrm{EBX}=0 \\ \text { allowed } \end{gathered}$ |

## Redundant fences (2)

If there are reads in between the fences...
but

| Thread 0 | Thread 1 | $E A X=E B X=0$ <br> forbidden |
| :---: | :---: | :---: |
| $\left\lvert\, \begin{aligned} & \text { MOV }[\mathrm{x}] \leftarrow 1 \\ & \text { MFENCE } \\ & \text { MON FAX } \leftarrow \quad \leftarrow \mathrm{F} \end{aligned}\right.$ | $\left\lvert\, \begin{aligned} & \operatorname{MOV~[y]} \\ & \operatorname{MFENCE} \end{aligned}\right.$ |  |
| If there are read optimisa | in between, the is unsound. |  |
| $\left\lvert\, \begin{aligned} & \text { MOV }[\mathrm{x}] \\ & \mathrm{NOP} \\ & \mathrm{MOV} \text { EAX } \\ & \mathrm{MOV} \\ & \mathrm{MFENCE} \end{aligned}\right.$ | $\begin{aligned} & \operatorname{MOV}[\mathrm{y}] \\ & \operatorname{MFENCE} \\ & \operatorname{MOV} \text { EBX } \end{aligned} \leftarrow[\mathrm{x}]=10$ | $\begin{gathered} \mathrm{EAX}=\mathrm{EBX}=0 \\ \text { allowed } \end{gathered}$ |

## Redundant fences (2)

Swapping a STORE and a MFENCE is sound:

MFENCE; STORE
STORE; MFENCE

1. transformed program's behaviours $\subseteq$ source program's behaviours (source program might leave pending write in its buffer)
2. There is the new intermediate state if the buffer was initially non-empty, but this intermediate state is not observable.
(a local read is needed to access the local buffer)
Intuition: Iterate this swapping... later fence or locked instruction in program order, and no memory read instructions are in between.

A backward data-flow problem over

| $T_{2}($ nop, $\mathcal{E})$ | $=\mathcal{E}$ |
| :--- | :--- |
| $T_{2}(\operatorname{op}(o p, \vec{r}, r), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}($ load $(\kappa$, addr, $\vec{r}, r), \mathcal{E})$ | $=\top$ |
| $T_{2}(\operatorname{store}(\kappa$, addr,, src $), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}(\operatorname{call}($ sig, ros, args, res $), \mathcal{E})$ | $=\top$ |
| $T_{2}(\operatorname{cond}($ cond, args $), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}($ return $($ optarg $), \mathcal{E})$ | $=\top$ |
| $T_{2}($ threadcreate $($ optarg $), \mathcal{E})$ | $=\top$ |
| $T_{2}($ atomic $($ aop $, \vec{r}, r), \mathcal{E})$ | $=\perp$ |
| $T_{2}($ fence, $\mathcal{E})$ | $=\perp$ |

T : otherwise.
Associate to each program point:
$\perp$ : along all execution paths there is an atomic instruction after the current program point, with no intervening reads; the boolean domain $\{\perp, \top\}$

## FE1 and FE2 are both useful

Removed by FE1 but not FE2:

Removed by FE2 but not FE1:

```
MFENCE
```

```
MOV EAX <- [Y]
```

MOV EAX <- [Y]
MFENCE
MFENCE
MOV EBX <- [y]

```
MOV EBX <- [y]
```

MOV [x] <- 1
MFENCE
MOV [x] <- 2
MFENCE

## Informal correctness argument

Intuition: FE2 can be thought as iterating

and then applying
MFENCE; MFENCE NOP; MFENCE

This argument works for finite traces, but not for infinite traces as the later fence might never be executed:

| MFENCE; | NOP; |
| :--- | :--- |
| STORE; |  |
| WHILE $(1) ;$ | STORE; |
| MFENCE | WHILE (1); |
| MFENCE |  |

## Basic simulations

A pair of relations

$$
\sim \in \mathbb{P}(\text { src.states } \times \text { tgt.states }) \quad>\in \mathbb{P}(\text { tgt.states } \times \text { tgt.states })
$$

is a basic simulation for compile : src.prg $\rightarrow$ tgt.prg if:

```
sim_init: }\forallp\mp@subsup{p}{}{\prime}.\operatorname{compile}(p)=\mp@subsup{p}{}{\prime}\Longrightarrow\forallt\in\operatorname{init}(\mp@subsup{p}{}{\prime}).\existss\in\operatorname{init}(p).s~
sim_end:}\forallst.s~t\wedget\not-> _ \Longrightarrows\not-> _
sim_step:}\forallst\mp@subsup{t}{}{\prime}ev.s~t\wedget\xrightarrow{}{ev}\mp@subsup{t}{}{\prime}\wedgeev\not=0om
ll
```

Exhibiting a basic simulation implies:
traces $($ compile $(p)) \backslash\{t \cdot$ inftau $\mid t$ trace $\} \subseteq$ traces $(p)$
"simulation can stutter forever"

## Usual approach: measured simulations

Definition 2 (Measured sim.). A measured simulation is any basic simulation $(\sim,>)$ such that $>$ is well-founded.

Theorem 1. If there exists a measured simulation for the compilation function compile, then for all programs $p$, $\operatorname{traces}($ compile $(p)) \subseteq \operatorname{traces}(p)$.

## Simulation for FE2

$s \equiv i t$ iff thread $i$ of $s$ and $t$ have identical pc, local states and buffers
$s \sim i s^{\prime}$ iff thread $i$ of $s$ can execute zero or more nop, op, store and mfence instructions and end in the state $s^{\prime}$
$s \sim t$ iff

- t's CFG is the optimised version of s's CFG; and
$-s$ and $t$ have identical memories; and
- $\forall$ thread $i$, either $s \equiv i t$ or
the analysis for $i^{\prime} s \mathrm{pc}$ returned $\perp$ and $\exists s^{\prime}, s \sim \sim_{i} s^{\prime}$ and $s^{\prime} \equiv i t$
" $s$ is some instructions behind and can catch up"
Stutter condition:
$t>t^{\prime}$ iff $t \rightarrow t^{\prime}$ by a thread executing a nop, op, store or mfence ( and t's buffer being non-empty)


## Simulation for FE2


" $s$ is some instructions behind and can catch up"
Stutter condition:
$t>t^{\prime}$ iff $t \rightarrow t^{\prime}$ by a thread executing a nop, op, store or mfence ( and t's buffer being non-empty)

## Simulation for FE2

$S \equiv i t$ iff thread $i$ of $s$ and $t$ have identical nc. local states and buffers
$s \sim_{i} s^{\prime}$ iff th But if (1) all threads have non-empty buffers, and
(2) are stuck executing infinite loops, and
(3) no writes are ever propagated to memory,
$s \sim t$ iff then we can stutter forever.

- t's CFG
$-s$ and $t$
$-\forall$ thread


Solution 1: Assume this case never arises (fairness)
Solution 2: Do a case split.

- If this case does not arise, we are done.
- If it does, use a different (weaker) simulation to construct an infinite trace for the source


## Weaktau simulation

Definition 3 (Weaktau sim.). A weaktau simulation consists of a basic simulation $(\sim,>)$ with and an additional relation between source and target states, $\simeq \in \mathbb{P}$ (src.states $\times$ tgt.states $)$ satisfying the following properties:

```
sim_weaken : }\foralls,t.s~t\Longrightarrows\simeq
sim_wstep:}\forallst\mp@subsup{t}{}{\prime}.s\simeqt\wedget\xrightarrow{}{\tau}\mp@subsup{t}{}{\prime}\wedget>\mp@subsup{t}{}{\prime}
    (s\xrightarrow{}{\tau}*\xrightarrow{}{\mathrm{ fail}}\mp@subsup{)}{)}{\prime}-s\mathrm{ reaches a failure}
\vee (\exists\mp@subsup{s}{}{\prime}.s\xrightarrow{}{\tau}*\xrightarrow{}{\tau}\mp@subsup{s}{}{\prime}\wedge\mp@subsup{s}{}{\prime}\simeq\mp@subsup{t}{}{\prime}) -s does a matching step sequence.
```

Theorem 2. If there exists a weaktau-simulation $(\sim,>, \simeq)$ for the compilation function compile, then for all programs $p$, $\operatorname{traces}(\operatorname{compile}(p)) \subseteq \operatorname{traces}(p)$.

Remarks:

- Once the simulation game moves from $\sim$ to $\simeq$, stuttering is forbidden;
- Can view difference between $\sim$ and $\simeq$ as a boolean prophecy variable.


## Weaktau simulation for FE2

$s \sim t, t>t^{\prime}$ as before.
$s \simeq t$ iff

- t's CFG is the optimised version of s's CFG; and
$-\forall i, \exists s^{\prime}$ s.t. $s \sim i s^{\prime} \equiv i t$.
(i.e., same as $s \sim t$ except that the memories memories are unrelated.)


## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.

It would be nice to hoist those fences out of the loop.

## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.
 Do you perform PRE?

## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.

It would be nien...cont of the loop. Do you perform PRE?
...adding a fence is always safe...

## Partial redundancy elimination



## Partial redundancy elimination



A: a backward analysis returning $T$ if along some path after the current program point there is an atomic instruction with no intervening reads;

B: a forward analysis returning $\perp$ if along all paths to the current program point there is a fence with no later reads or atomic instructions.

Replace NOP with FENCE after conditionals if:

- B returns $\perp$
- A returns $\perp$
- A returns T on the other branch

Breturns $\perp$ :
a previous fence will be eliminated if we insert a fence at both branches of conditional nodes.
A returns $\perp$ :
the previous fence won't be removed by FE2.
A returns T on the other branch:
the other branch already makes the previous fence partially redundant.

B: a forward analysis returning $\perp$ if along all paths to the current program point there is a fence with no later reads or atomic instructions.

Replace NOP with FENCE after conditionals if:

- B returns $\perp$
- A returns $\perp$
- A returns $T$ on the other branch


## Evaluation of the optimisations

- Insert MFENCEs before every read (br), or after every write (aw).
- Count the MFENCE instructions in the generated code.

|  | br | br+FE1 | aw | aw+FE2 | aw+PRE+FE2 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Dekker | 3 | 2 | 5 | 4 | 4 |
| Bakery | 10 | 2 | 4 | 3 | 3 |
| Treiber | 5 | 2 | 3 | 1 | 1 |
| Fraser | 32 | 18 | 19 | 12 | 11 |
| TL2 | 166 | 95 | 101 | 68 | 68 |
| Genome | 133 | 79 | 62 | 41 | 41 |
| Labyrinth | 231 | 98 | 63 | 42 | 42 |
| SSCA | 1264 | 490 | 420 | 367 | 367 |

## Evaluation of the optimisations

- Insert MFENCEs before every read (br), or after every write (aw).
- Cour Important remark for your future work:

This is not a decent evaluation... we know nothing about real code, and the number of fences is not a good measure. But unclear how to do better.

Evaluation should be taken seriously by CS scientists!
http://evaluate.inf.usi.ch/

| Labyrinth | 231 | 98 | 63 | 42 | 42 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| SSCA | 1264 | 490 | 420 | 367 | 367 |

Conclusion

## Syllabus

In these lectures we have covered the hardware models of two modern computer architectures (x86 and Power/ARM - at least for a large subset of their instruction set).

We have seen how compiler optimisations can also break concurrent programs and the importance of defining the memory model of highlevel programming languages (and we have seen in detail the C++11 memory model).

We have also introduced some proof methods to reason about concurrency.

After these lectures, you might have the feeling that multicore programming is a mess and things can't just work.


The memory models of modern hardware are better understood.

Programming languages attempt to specify and implement reasonable memory models.

Researchers and programmers are now interested in these problems.

The memory models of modern hardware are better understood.

## Still, many open problems...

problems.



All these lectures are based on work done with/by my colleagues. Thank you!


## And thank you all for attending these lectures!

Please, fill the course evaluation form. It is vital feedback to make a better course next year.


