## Semantics, languages and algorithms

 for multicore programmingAlbert Cohen



Luc Maranget


Francesco Zappa Nardelli


## Vote: topics for my next lecture

1. The Iwarx and stwcx Power instructions
2. Hunting compiler concurrency bugs8
3. Operational and axiomatic formalisation of $x 86-T S O$ ..... 2
4. Fence optimisations for x86-TSO5. The Java memory model6. The C11/C++11 memory model7
5. Static and dynamic techniques for data-race detection ..... 6
6. What about the Linux kernel ..... 3
7. A word on techniques for data-race detection

## Recall: Data-race freedom

Definition [data-race-freedom]: A program (traceset) is data-race free if none of its executions has two adjacent conflicting actions from different threads.

Equivalently, a program is data-race free if in all its executions all pairs of conflicting actions are ordered by happens-before.

Two conflicting accesses

A racy program

| Thread 0 | Thread 1 |
| :---: | :---: |
| $* y=1$ | if $* x==1$ |
| $* x=1$ | then print $* y$ |

not related by happens before.


## Recall: Happens-before

Definition [program order]: program order, $<_{\text {po, }}$, is a total order over the actions of the same thread in an interleaving.

Definition [synchronises with]: in an interleaving I, index i synchroniseswith index $\mathrm{j}, \mathrm{i}<$ sw j , if $\mathrm{i}<\mathrm{j}$ and $A(\mathrm{l})=\mathrm{U}$ (unlock), $\mathrm{A}(\mathrm{l})=\mathrm{L}$ (lock).

Definition [happens-before]: Happens-before is the transitive closure of program order and synchronises with.

## Examples of happens before

| Thread 0 | Thread 1 |
| :--- | :--- |
| $* y=1$ | $\operatorname{lock}() ;$ |
| $\operatorname{lock}() ;$ |  |
| $* x=1$ |  |
| tmp $=* x ;$ |  |
| unlock ()$;$ | if tmp $=1$ <br> then print *y |



## Data race detection: dynamic approaches

Modern high-performance dynamic race detectors are based either on:
happens-before ordering
reconstruct happens-before order
in the current execution report a race if two conflicting accesses are not related by hb
lockset computation
records which locks protect every memory access
report a race if intersection of all locksets for a variable is empty
popularised by Eraser (Savage et al.) '97
no false positives
drawback: misses races occurring on rare executions
can detect races not observed in the execution being monitored
drawback: unsound (false positives)

## Examples of lockset computation

| $\operatorname{lock}(b)$ | $\operatorname{lock}(a)$ |
| :--- | :--- |
| $\operatorname{lock}(a)$ | $x=2$ |
| $x=1$ | unlock(a) |
| unlock(a) |  |

$$
1: L(b) ; 1: L(a) ; 1: W x 1 ; 1: U(a) ; 2: L(a) ; 2: W x 2 ; 2: U(a)
$$

locks
$\mathrm{C}(\mathrm{x})$

1:b, a 2: a

C(x):

$$
x: a, b
$$

$$
x: a
$$

lockset for $x$ non-empty at the end, no data-race

| $\operatorname{lock}(b)$ | $\operatorname{lock}(c)$ |
| :--- | :--- |
| $\operatorname{lock}(a)$ | $x=2$ |
| $x=1$ | unlock(c) |
| unlock(a) |  |

$$
1: \mathrm{L}(\mathrm{~b}) ; 1: \mathrm{L}(\mathrm{a}) ; 1: \mathrm{Wx} 1 ; 1: \mathrm{U}(\mathrm{a}) ; 2: \mathrm{L}(\mathrm{c}) ; 2: \mathrm{Wx} 2 ; 2: \mathrm{U}(\mathrm{c})
$$

C(x):

$$
x: a, b
$$

x:empty
lockset for $x$ empty at the end, possible data-race

## lockset vs happens-before

$$
\begin{array}{l|l}
\mathrm{y}=1 & \operatorname{lock}(\mathrm{a}) \\
\operatorname{lock}(\mathrm{a}) & \mathrm{x}=2 \\
\mathrm{x}=1 & \mathrm{unlock}(\mathrm{a}) \\
\operatorname{unlock}(\mathrm{a}) & \mathrm{y}=2
\end{array}
$$



## lockset vs happens-before

$$
\begin{array}{l|l}
\mathrm{y}=1 & \operatorname{lock}(\mathrm{a}) \\
\operatorname{lock}(\mathrm{a}) & \mathrm{x}=2 \\
\mathrm{x}=1 & \mathrm{unlock}(\mathrm{a}) \\
\text { unlock(a) } & \mathrm{y}=2
\end{array}
$$

This program has a race on $y$


## lockset vs happens-before

$$
\begin{array}{l|l}
\mathrm{y}=1 & \operatorname{lock}(\mathrm{a}) \\
\operatorname{lock}(\mathrm{a}) & \mathrm{x}=2 \\
\mathrm{x}=1 & \text { unlock(a) } \\
\text { unlock(a) } & \mathrm{y}=2
\end{array}
$$

This program has a race on $y$

If only the execution below is observed:

happens-before computation does not report a race.
Lockset computation detects instead that accesses to y are unprotected and reports a possible race.

## lockset vs happens-before (2)

$$
\begin{array}{l|l}
\end{array}
$$

## lockset vs happens-before (2)



## lockset vs happens-before (2)



Happens-before computation will not report a race
(no matter which execution is observed)

Since accesses to y are unprotected, locksets computation reports a false positive.

## Data race detection

Modern high-performance dynamic race detectors are based either on:

## happens-before ordering <br> lockset computation

reconstruct happens-before order
in the current execution
report a race if intersection if two
conflicting accesses are not related
records which locks protect every memory access
report a race if intersection of all locksets for a variable is empty
popularised by Eraser (Savage et al.) '97
sound
drawback: misses races occurring on rare executions
can detect races not observed in the execution being monitored
drawback: unsound (false positives)

## Data race detection



## Data race detection: static approaches

Run a bunch of static analysis for inferring locksets.

Hard:

- aliasing on memory locations
- lock pointers
- must account all language features

Also done via fancy effect type-systems.



## 2. The C++11 memory model

a good example of an axiomatic memory model


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
b $=42$; printf("\%d\n", b);
\}

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}
;
\}

Thread 2
$b=42 ;$
printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s; for ( $s=0 ; \mathrm{s}!=4 ; \mathrm{s}++$ ) \{ if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
b $=42$; printf("\%d\n", b);

## -

\}

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

## ;

\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
b $=42$; printf("\%d\n", b);
\}
Thread 1 returns without modifying $b$.

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
b $=42$; printf("\%d\n", b);

## ;

\}
Thread 1 returns without modifying $b$.

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

## ;

\}

Thread 2
b $=42$; printf("\%d\n", b);
gcc 4.7 -O2
...sometimes we get 0 on the screen

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
}
```


## gcc 4.7 -O2

```
int s;
for ( \(s=0 ; \mathrm{s}!=4 ; \mathrm{s}++\) ) \{
    if ( \(a==1\) )
    return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

```
movl a(%rip), %edx # load a into edx
```

movl a(%rip), %edx \# load a into edx
movl b(%rip), %eax \# load b into eax
movl b(%rip), %eax \# load b into eax
testl %edx, %edx \# if a!=0
testl %edx, %edx \# if a!=0
jne .L2 \# jump to .L2
jne .L2 \# jump to .L2
movl \$0, b(%rip)
movl \$0, b(%rip)
ret
ret
.L2:
.L2:
movl %eax, b(%rip) \# store eax into b
movl %eax, b(%rip) \# store eax into b
xorl %eax, %eax \# store 0 into eax
xorl %eax, %eax \# store 0 into eax
ret \# return

```
ret # return
```


## gcc 4.7 -O2

```
int s;
for ( \(s=0 ; \mathrm{s}!=4 ; \mathrm{s}++\) ) \{
    if ( \(a==1\) )
        return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

The outer loop can be (and is) optimised away

```
mov1 a(%rlp), %edx # road a lnto edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2 # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret # return
```

```
```

int s;

```
```

int s;
for ( $s=0 ; \mathrm{s}!=4 ; \mathrm{s}++$ ) \{
for ( $s=0 ; \mathrm{s}!=4 ; \mathrm{s}++$ ) \{
if ( $a==1$ )
if ( $a==1$ )
return NULL;

```
    return NULL;
```

```
    for ( \(b=0 ; b>=26 ;++b\) )
```

    for ( \(b=0 ; b>=26 ;++b\) )
        ;
        ;
    \}

```
\}
```

gcc 4.7 - O 2

```
movl a(%rip), %edx # load a into edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret # return
```

gcc 4.7-O2

```
int s;
for ( \(s=0 ; \mathrm{s}!=4 ; \mathrm{s}++\) ) \{
    if ( \(a==1\) )
    return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

| movl a(\%rip), \%edx | \# load a into edx |
| :--- | :--- |
| movl b(\%rip), \%eax | \# load b into eax |
| testl \%edx, \%edx | \# if a!=0 |
| jne .L2 | \# jump to .L2 |
| movl $\$ 0, b(\% r i p)$ |  |
| ret |  |
| .L2: |  |
| movl \%eax, b(\%rip) \# store eax into b <br> xorl \%eax, \%eax \# store 0 into eax <br> ret  |  |

gcc 4.7-O2

```
int s;
for ( \(s=0 ; \mathrm{s}!=4 ; \mathrm{s}++\) ) \{
    if ( \(a==1\) )
    return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

```
movl a(%rip), %edx # load a into edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2 # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret # return
```


## gcc 4.7 - O2

```
int s;
for ( \(s=0 ; \mathrm{s}!=4 ; \mathrm{s}++\) ) \{
    if ( \(a==1\) )
                                    return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

```
movl a(%rip), %edx # load a into edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret # return
```


## gcc 4.7 - O2

```
int s;
for (s=0; s!=4; s++) \{
    if ( \(a==1\) )
    return NULL;
    for ( \(b=0 ; b>=26 ;++b\) )
        ;
\}
```

```
movl a(%rip), %edx # load a into edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret
# return
```


## The compiled code saves and restores $b$

## Correct in a sequential setting.

## What about concurrency?

```
movl a(%rip), %edx # load a into edx
movl b(%rip), %eax # load b into eax
testl %edx, %edx # if a!=0
jne .L2 # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip) # store eax into b
xorl %eax, %eax # store 0 into eax
ret # return
```


## The C++11 memory model

1300 page prose specification defined by the ISO.
The design is a detailed compromise:
hardware/compiler implementability
useful abstractions
broad spectrum of programmers

Welcome to the official home of

# ITA IEC JTC1/SC22/WG21 - The C++ Standards Committee 

2011-09-15: standards I projects I papers I mailings I internals I meetings I contacts

News 2011-09-11: The new C++ standard - C++11 - is published!

## The syntactic divide

// for regular programmers:
atomic_int $\mathrm{x}=0$;
x.store(1);
y $=x . l o a d() ;$
// for experts:
x.store(2, memory_order);
$\mathrm{y}=\mathrm{x} . \mathrm{load}_{\left(m e m o r y \_o r d e r\right) ; ~}$
atomic_thread_fence(memory_order);
where memory_order is one of the following:
mo_seq_cst mo_release mo_acquire
mo_acq_rel mo_consume mo_relaxed

## How may a program execute?

Two layer semantics:

1) a denotational semantics processes programs, identifying memory actions, and constructs candidate executions (Eopsem);

$$
P \longrightarrow E_{1}, \ldots, E_{n}
$$

2) an axiomatic memory model judges Eopsem paired with a memory ordering Xwitness

$$
\mathrm{E}_{\mathrm{i}} \longrightarrow \mathrm{X}_{\mathrm{i} 1}, \ldots, X_{\mathrm{im}}
$$

3) searches the consistent executions for races and uncostrained reads is there an $\mathrm{X}_{\mathrm{ij}}$ with a race?

## Relations

An Eopsem part containing:
sb sequenced before, program order
asw additional synchronizes with, inter-thread ordering

An $X_{\text {witness }}$ part containing:
rf relates a write to any reads that take its value
sc a total order over mo_seq_cst and mutex actions
mo modification order, per location total order of writes

From these, compute synchronise-with (sw) and happens-before ( $h b$ ).

We ignore consume atomics, which enables us to live in a simplified model. Full details in Batty et al., POPL 11.

## Formally

```
cpp_memory_model_opsem (p : program) =
let pre_executions =
    {(Eopsem,Xwitness). opsem p Eopsem ^
        consistent_execution (Eopsem, Xwitness)}
in
if \existsX \in pre_executions.
        (indeterminate_reads X = {}) v
        (unsequenced_races }X={}) 
        (data_races X = {})
then NoNE
else Some pre_executions
```


## A single-threaded example

1. sequenced before (sb) - given by opsem


## A single-threaded example

1. sequenced before (sb) - given by opsem
2. read-from (rf) - part of the witness
```
int main() {
    int x = 2;
    int y = 0;
    y = (x==x);
    return 0;
}
```



## A single-threaded ex. with undefined behaviour

An unsequenced race.


## A simple concurrent program

We will omit asw arrows whenever we are not interested in the initialisation.

$$
\text { rrows whenever } \begin{aligned}
& \mathrm{b}: \mathrm{W}_{\mathrm{na}} \mathrm{x}=3 \quad \mathrm{c}: \mathrm{R}_{\mathrm{nq}} \mathrm{x}=2 \\
& \text { ed in the initialisation. } \\
& \mathrm{d}: \mathrm{W}_{\mathrm{na}} \mathrm{y}=0
\end{aligned}
$$

$$
\begin{aligned}
& \text { int } y, x=2 \text {; } \\
& \mathrm{x}=3 \text {; } \\
& \mid y=(x==3) ;
\end{aligned}
$$

## Locks and unlocks

int $x, r ;$
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $x=\ldots$ | $r=x ;$ |
| m.unlock ()$;$ |  |

1. the operational semantics defines the sb arrows


## Locks and unlocks

int $\mathrm{x}, \mathrm{r}$;
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $\mathrm{x}=\ldots$ | $\mathrm{r}=\mathrm{x} ;$ |
| $\mathrm{m} . \operatorname{unlock}() ;$ |  |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)


## Locks and unlocks

int $\mathrm{x}, \mathrm{r}$;
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $\mathrm{x}=\ldots$ | $\mathrm{r}=\mathrm{x} ;$ |
| $\mathrm{m} . \operatorname{unlock}() ;$ |  |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)
3. the sc order is included in the syncronised-with relation

f:U mutex

## Locks and unlocks

## simple-happens-before


int $x, r ;$
mutex m;

| m. $\operatorname{lock}() ;$ | m. $\operatorname{lock}() ;$ |
| :--- | :--- |
| $x=\ldots$ | $r=x ;$ |
| m.unlock ()$;$ |  |

1. the operational semantics defines the sb arrows
2. guess an sc order on Unlock/Lock actions (part of the witness)
3. the sc order is included in the syncronised-with relation

f:U mutex
4. which in turn defines the happens-before relation...

## Happens before

The happens before relation is key to the model:

1. non-atomic loads read the most recent write in happens before. (This is unique in DRF programs)
2. the story is more complex for atomics, as we shall see.
3. data races are defined as an absence of happens before between conflicting actions.
simple-happens-before


f:U mutex

## A data race

$$
\begin{aligned}
& \text { int } y, x=2 ; \\
& x=3 ; \\
& a: W_{n a} x=2
\end{aligned}
$$

## A data race

$$
\begin{aligned}
& \text { int } y, x=2 \text {; } \\
& x=3 ; \quad \mid y=(x==3) ;
\end{aligned}
$$

$$
\mathrm{d}: \mathrm{W}_{\mathrm{na}} \mathrm{y}=0
$$

Here we have two conflicting accesses not related by happens-before.

## Data race definition

let data_races actions $h b=$
$\{(a, b) \mid \forall a \in$ actions $b \in$ actions
$\neg(a=b) \wedge$
same_location a $b \wedge$
(is_write $a \vee$ is_write $b$ ) $\wedge$
$\neg($ same_thread $a b) \wedge$
$\neg($ is_atomic_action $a \wedge$ is_atomic_action $b) \wedge$
$\neg((a, b) \in h b \vee(b, a) \in h b)\}$

Programs with a data race have undefined behaviour (DRF model).

## Simple concurrency: Dekker's example and SC

```
atomic_int x = 0;
atomic_int y = 0;
x.store(1, seq_cst); y.store(1, seq_cst);
y.load(seq_cst); x.load(seq_cst);
```



Why is this behaviour forbidden?

## Simple concurrency, Dekker's example and SC

```
atomic_int x = 0;
atomic_int y = 0;
x.store(1, seq_cst); y.store(1, seq_cst);
y.load(seq_cst); x.load(seq_cst);
```



The sc relation must define a total order over unlocks/locks and seq_cst accesses... sc is included in hb, an rf must respect hb.

## Expert concurrency: the release-acquire idiom

// sender
X = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x ;

Here we have an rf arrow between a pair of release/acquire accesses.


## Expert concurrency: the release-acquire idiom

// sender
x = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x ;

Here we have an rf arrow between a pair of release/acquire accesses.

The rf arrow between release/acquire accesses induces an sw arrow between those accesses.


## Expert concurrency: the release-acquire idiom

// sender
x = ...
y.store(1, release);
// receiver
while (0 == y.load(acquire));
r = x ;
Here we have an rf arrow between a pair of release/acquire accesses.

The rf arrow between release/acquire accesses
 induces an sw arrow between those accesses.

And in turn defines an hb constraint. $\xrightarrow{\text { simple-happens-before }}=$

$$
(\xrightarrow{\text { sequenced-before }} \cup \xrightarrow{\text { synchronizes-with }})^{+}
$$

## Relaxed writes

x.load(relaxed);
y.store(1, relaxed);

```
y.load(relaxed);
```

y.load(relaxed);
x.store(1, relaxed);

```
x.store(1, relaxed);
```

No data-races, no synchronisation cost, but weakly ordered.

## Relaxed writes, ctd.

```
atomic_int x = 0;
atomic_int y = 0;
```


$\mathrm{c}:$ Wrlx $\mathrm{x}=1 \quad \mathrm{~d}:$ Wrlx $\mathrm{y}=1$
rf

$\mathrm{f}: \operatorname{Rrlx} \mathrm{y}=0$

Again, no data-races, no synchronisation cost, but weakly ordered (IRIW).

## Expert concurrency: fences avoid excess sync.

// sender
$\mathrm{x}=$...
y.store(1, release);

```
// receiver
while (0 == y.load(acquire));
r = x;
```

// sender
x = ...
y.store(1, release);
// receiver while (0 == y.load(relaxed));
fence(acquire);
$r=x$;

## Expert concurrency: fences avoid excess sync.

```
// sender
x = ...
y.store(1, release);
```

// receiver

```
while (0 == y.load(relaxed));
```

fence(acquire);
r = x ;

Here we have an rf arrow between a release write and a relaxed write.


## Expert concurrency: fences avoid excess sync.

// sender<br>x = ...<br>y.store(1, release);

```
// receiver
while (0 == y.load(relaxed));
fence(acquire);
r = x;
```

Here we have an rf arrow between a release write and a relaxed write.

$\begin{array}{lll}\mathrm{d}: \mathrm{W}_{\text {rel }} \mathrm{y}=1 & \rightarrow & \mathrm{f}: \mathrm{F}_{\text {acq }} \\ \text { tw } \\ \text { tions } & \mathrm{sw} & \mathrm{sb}\end{array}$
$\mathrm{g}: \mathrm{R}_{\mathrm{na}} \mathrm{x}=1$

## Expert concurrency: fences avoid excess sync.

```
// sender
x = ...
y.store(1, release);
```

```
// receiver
while (0 == y.load(relaxed));
fence(acquire);
r = x;
```

Here we have an rf arrow between a release write and a relaxed write.

The acquire fence follows the $\mathrm{sb} / \mathrm{rf}$ relations looking for the corresponding release write, adding a sw arrow.


Happens-before follows as usual...

## Modification order (aka coherence)

atomic_int $x=0$; | x.store $(1, ~ r e l a x e d) ; ~$ |
| :--- | :--- | \left\lvert\, \(\begin{aligned} \& x.load(relaxed); <br>

\& x. store(2, relaxed);\end{aligned}\right. \| $$
\begin{aligned} & \text { x.load(relaxed) } ;\end{aligned}
$$\)


Modification order is a total order over atomic writes of any memory order.

## Coherence and atomic reads

All forbidden:


CoRR
$a: W x=1$
hb
$\mathrm{b}: \mathrm{W} \mathrm{mo}$
$\mathrm{m}=2$
CoWW


CoRW

Idea: atomics cannot read from later writes in happens-before.

## Coherence and atomic reads

All forb
A pair $E_{\text {opsem }}, X_{\text {witness }}$ (a pre-execution) defines a consistent execution when it satisfies the constraints we have sketched on $\mathrm{hb} / \mathrm{rf} / \mathrm{mo}$ and is race-free.
$\mathrm{b}: \mathrm{W} \mathrm{x}=2$
CoWW
$\mathrm{d}: \mathrm{W} \mathrm{x}=2$
CoRW

Idea: atomics cannot read from later writes in happens-before.

## The full model

|  |  |  |  |
| :---: | :---: | :---: | :---: |
| $\square x^{\text {a }}$ |  |  |  |
| Sto－（ta） |  | mismen |  |
|  | ， | 20am ${ }^{\text {a }}$ | visible＿sequence＿of＿side＿effects＿tail $=$ visible＿sequence＿of＿side＿effects＿tail vsse＿head $b=$ $\{c$ ．vsse＿head $\xrightarrow{\text { madifcation－ordee }} c \wedge$ |
| 4－1 | man | $\begin{aligned} & \text { s＿release } a_{\text {red }} \wedge( \\ & \left(b=a_{\text {red }}\right) \vee \\ & \text {（rs＿element } \\ & a_{\text {rel }} \\ & b \wedge a_{\text {red }} \xrightarrow{\text { modification－crder }}\end{aligned} b$ |  |
|  |  |  |  |
|  |  | $1{ }^{\text {a }}$ |  |
|  |  |  |  |
| $4{ }^{\text {a }}$ |  |  | divmen |
| $4{ }^{4}$ |  | （resememit |  |
|  | ， |  |  |
|  |  |  | 何 |
| demp |  |  | 边 |
|  | 边 |  |  |
|  | and |  |  |
|  | memen | （ex）maxy |  |
| － | masme | 隹 | 边 |
|  | ， | － | 边 |
|  |  | mammex | Exiche |
|  | 边 | ＝20，mestuv |  |
|  | Namer |  | ate |
|  | Some |  | A0， |
| andemen | （tamer |  |  |
|  |  |  | ， |
| No．apem | mamber mato | 为 | 为 |
| and | Humatimatis |  | $\underset{y y y y}{ }$ |
|  |  |  | （exase） |
| 隹 |  | ata | 为 |
| （ ammut |  |  | （2ase） |
|  |  |  |  |
| 䢒 |  | Aramy |  |
|  |  | 边 |  |
|  | 隹 |  |  |
|  | 14 | 边 | 边 |
|  |  |  |  |
| 为 | min |  | inmen |
|  | mimex |  | In mexmen |
| 5mem | mixmbun | mameme |  |
|  | dimems |  | $\pm \pm$ mex |
|  |  | Ameme |  |
|  |  | anmex |  |
|  |  | 边 |  |
|  |  | 为 | data＿races $=$ data＿races $=\{(a, b)$ ． $(a \neq b) \wedge$ same＿location $a b \wedge($ is＿write $a \vee$ is＿write $b) \wedge$ $\neg$ same＿thread $a b \wedge$ |
|  |  | minm | 边 |
|  |  | 2mase |  |
|  |  | 0 |  |
| 成 |  |  | 隹 |
|  | （＊30．4．1：21 Effects：Releases the calling threads ownership of the mutex．＊） | 为 |  |
|  | （e） |  |  |
|  | 为 |  | \％om |

## Is C++11 hopelessly complicated?

Programmers cannot be given this model.
However, with a formal definition, we can do proofs! For instance:

- Can we compile to $x 86 ?$
- Can we compile to Power?

| Operation | x86 Implementation |
| :--- | :--- |
| load(non-seq_cst) | mov |
| load(seq_cst) | lock xadd(0) |
| store(non-seq_cst) | mov |
| store(seq_cst) | lock xchg |
| fence(non-seq_cst) | no-op |


| C $++0 x$ Operation | POWER Implementation |
| :--- | :--- |
| Non-atomic Load | ld |
| Load Relaxed | ld |
| Load Consume | ld (and preserve dependency) |
| Load Acquire | ld; cmp; bc; isync |
| Load Seq Cst | sync; ld cmp; bc; isync |
| Non-atomic Store | st |
| Store Relaxed | st |
| Store Release | lwsync; st |
| Store Seq Cst | sync; st |

## Is C++11 hopelessly complicated?

Simplifications:
Full model: visible sequences of side effects are unneeded (HOL4)
Derivative models:

- without consume, happens-before is transitive


## The current state of the standard

## Fixed:

- in some cases, happens-before was cyclic
- coherence
- seq_cst atomics were more broken


## Not fixed:

- out of thin air reads (and self satisfying conditionals)
- seq_cst atomics do not guarantee SC


3. Hunting compiler concurrency bugs


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
$\mathrm{b}=42$; printf("\%d\n", b);
;
\}

Since Thread 1 does not update $b$, program is data-race free (DRF)

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
$\mathrm{b}=42$; printf("\%d\n", b);
;
\}

Since Thread 1 does not update $b$, program is data-race free (DRF)
DRF programs must only exhibit sequentially consistent behaviours
C11/C++11 standard

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1
int $s ;$
for ( $s=0 ; s!=4 ; s++$ ) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )

Thread 2
b $=42$; printf("\%d\n", b);
;
\}

Since Thread 1 does not update $b$, program is data-race free (DRF)
DRF programs must only exhibit sequentially consistent behaviours
C11/C++11 standard
This program MUST only print 42.

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1
int s;
for (s=0; s!=4; s++) \{
if ( $a==1$ )
return NULL;
for ( $b=0 ; b>=26 ;++b$ )
\}

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1

| movl | a(\%rip),\%edx |
| :--- | :--- |
| movl | $b(\% r i p), \% e a x$ |
| testl | \%edx, \%edx |
| jne | .$L 2$ |
| movl | $\$ 0, b(\% r i p)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

.L2:
movl \%eax, b(\%rip)
xorl \%eax, \%eax
ret

Thread 2
b $=42$; printf("\%d\n", b);

## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0
\end{aligned}
$$

Thread 1

| movl | $a(\% r i p), \% e d x$ |
| :--- | :--- |
| movl | $b(\% r i p), \% e a x$ |
| testl | \%edx, \%edx |
| jne | .L2 |
| movl | $\$ 0, b(\% r i p)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

movl b(\%rip),\%eax
testl \%edx, \%edx
jne .L2
movl \$0, b(\%rip)
.L2:
xorl \%eax, \%eax
ret

- Read a (1) into edx


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1

| movl | $a(\% r i p), \% e d x$ |
| :--- | :--- |
| movl | $b(\% r i p), \% e a x$ |
| testl | \%edx, \%edx |
| jne | .L2 |
| movl | $\$ 0, b(\% r i p)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

b $=42$; printf("\%d\n", b);

- Read a (1) into edx
- Read b (0) into eax


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1

| movl | $a(\%$ rip $), \% e d x$ |
| :--- | :--- |
| movl | $b(\% r i p), \% e a x$ |
| testl | \%edx, \%edx |
| jne | L2 |
| movl | $\$ 0, b(\% r i p)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

movl b(\%rip),\%eax testl \%edx, \%edx jne .L2
movl \$0, b(\%rip) ret
.L2:
movl \%eax, b(\%rip)
xorl \%eax, \%eax ret

Thread 2

```
b = 42; printf("\%d\n", b);
```

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1

| movl | $a(\%$ rip $), \% e d x$ |
| :--- | :--- |
| movl | b(\%rip),\%eax |
| testl | \%edx, \%edx |
| jne | L2 |
| movl | $\$ 0, b(\%$ rip $)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

    movl b(\%rip),\%eax
    testl \%edx, \%edx
    jne .L2
    movl \$0, b(\%rip)
    .L2:
movl \%eax, b(\%rip)
xorl \%eax, \%eax
ret

Thread 2
b $=42$; printf("\%d\n", b);

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b


## Shared memory

$$
\begin{aligned}
& \text { int } a=1 ; \\
& \text { int } b=0 ;
\end{aligned}
$$

Thread 1

| movl | $a(\%$ rip $), \%$ edx |
| :--- | :--- |
| movl | b(\%rip),\%eax |
| testl | \%edx, \%edx |
| jne | L2 |
| movl | $\$ 0, b(\% r i p)$ |
| ret |  |
| L2: |  |
| movl | \%eax, b(\%rip) |
| xorl | \%eax, \%eax |
| ret |  |

    movl b(\%rip),\%eax
    testl \%edx, \%edx
    jne .L2
    movl \$0, b(\%rip)
    .L2:
movl \%eax, b(\%rip)
xorl \%eax, \%eax
ret

Thread 2
b $=42$;
printf("\%d\n", b);

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b
- Print b... 0 is printed

The compiled code saves and restores $b$

## Correct in a sequential setting

Introduces unexpected behaviours
in some concurrent context
ret
.L2:
movl \%eax, b(\%rip)
xorl \%eax, \%eax ret

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b
- Print b... 0 is printed

The compiled code saves and restores $b$

## Correct in a sequential setting

## Introduces unexpected behaviours in some concurrent context

## This is a concurrency compiler bug

mov \%eax, b(\%rıp)
xorl \%eax, \%eax ret

- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b
- Print b... 0 is printed

Compiler testing: state of the art
Yang, Chen, Eide, Regehr - PLDI 2011


## Compiler testing: state of the art

Yang, Chen, Eide, Regehr - PLDI 2011

## Random

Reported hundreds of bugs
on various versions of gcc, clang and other compilers


## Compiler testing: state of the art

Yang, Chen, Eide, Regehr - PLDI 2011

## Random

Reported hundreds of bugs


## Hunting concurrency compiler bugs?

## How to deal with non-determinism?

How to generate non-racy interesting programs?
How to capture all the behaviours of concurrent programs?
A compiler can optimise away behaviours:
how to test for correctness?
limit case: two compilers generate correct code with disjoint final states

## Idea

C/C++ compilers support separate compilation
Functions can be called in arbitrary non-racy concurrent contexts

## $\downarrow$

C/C++ compilers can only apply transformations sound with respect to an arbitrary non-racy concurrent context

## Hunt concurrency compiler bugs

$=$
search for transformations of sequential code not sound in an arbitrary non-racy context

## Random

optimising compiler under test reference semantics REFERENCE MEMORY TRACE
int $a=1 ; \quad$ int $s ;$
int $b=0$;
for (s=0; s!=4; s++) \{
if ( $a==1$ ) return NULL;
for ( $b=0 ; b>=26 ;++b$ ) ;
\}


Load a 1


Load a 1
Load b 0
Store b 0

## Cannot match some events $\longrightarrow$ detect compiler bug



Load a 1


Load a 1
Load b 0
Store b 0
Load b 0
Store b 0

## Contributions

Sound optimisations in the C11/C++11 memory model extending Sevcik's work on an idealised DRF model - PLDI 11

A tool to hunt concurrency bugs in C and $\mathrm{C}++$ compilers

Interaction with GCC developers

## Sound Optimisations

## in the C11/C++11 Memory Model

## Example: loop invariant code motion



## Example: loop invariant code motion

Compiler Writer


Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

Semanticist


## Example: loop invariant code motion

Compiler Writer


Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

Semanticist


```
for (int \(i=0 ; i<2 ; i++)\{\)
    \(z=i ;\)
    \(x[i]+=y+1\);
\}
```


## Example: loop invariant code motion

Compiler Writer


Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

```
tmp = y+1 ;
for (int i=0; i<2; i++) {
    z = i;
    x[i] +=tmp ;
}
```


## Example: loop invariant code motion

Compiler Writer


Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

## Semanticist



Elimination of run-time events Reordering of run-time events Introduction of run-time events

Operations on sets of events

```
tmp = y+1 ;
for (int i=0; i<2; i++) {
    z = i;
    x[i] +=tmp ;
}
```


## Example: loop invariant code motion

Compiler Writer


Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

## Semanticist



Elimination of run-time events Reordering of run-time events Introduction of run-time events

Operations on sets of events

```
tmp = y+1 ;
for (int i=0; i<2; i++) \{
    z = i;
    x[i] +=tmp ;
\}
```

Store z 0
Load y 42
Store x[0] 43
Store z 1
Load y 42
Store x[1] 43

## Example: loop invariant code motion

## Compiler Writer



Sophisticated program analyses Fancy algorithms
Source code or IR
Operations on AST

```
tmp = y+1 ;
for (int i=0; i<2; i++) {
    z = i;
    x[i] += tmp ;
}
```

Semanticist


Elimination of run-time events Reordering of run-time events Introduction of run-time events

Operations on sets of events

```
Load y 42
    Store z 0
Store x[0] 43
Store z 1
Store x[1] 43
```


## Elimination of overwritten writes



## Elimination of overwritten writes



# Under which conditions is it correct to eliminate the first store? 

What is the semantics of
C11/C++11 concurrent code?

## The C11/C++11 memory model

C11/C++11 are based on the DRF approach:

- racy code is undefined
- race-free code must exhibit only sequentially consistent behaviours
- main synchronisation mechanism: lock/unlock

Escape mechanism for experts, low-level atomics:
— races allowed

- attributes on accesses specify their semantics:

MO_SEQ_CST MO_RELEASE/MO_ACQUIRE MO_RELAXED

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0 ;
$$

Thread 1
$g=42$;
f.store(1,M0_RELEASE);

Thread 2
while (f.load(MO_ACQUIRE)==0); printf ("\%d",g)

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0 ;
$$

Thread 1
$g=42$;
f.store(1,M0_RELEASE);

Thread 2
while (f.load(MO_ACQUIRE)==0); printf ("\%d",g)

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0 ;
$$

Thread 1
$g=42$;
f.store(1,M0_RELEASE);

Thread 2
while (f.load(MO_ACQUIRE)==0); printf ("\%d",g)

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0 ;
$$

Thread 1
$g=42$;
f.store(1,M0_RELEASE);

Thread 2
while (f.load(MO_ACQUIRE)==0); printf ("\%d",g)

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0 ;
$$

Thread 1
$g=42 ;$
f.store(1,MO_RELEASE);

Thread 2
sync while (f.load(MO_ACQUIRE)==0);
printf ("\%d",g)

## MO_RELEASE / MO_ACQUIRE

$$
g=0 ; \text { atomic } f=0
$$

Thread 1


The release/acquire synchronisation guarantees that:

- the program is DRF
- 42 is printed at the end of the execution

Remark: unlock $\simeq$ release, lock $\simeq$ acquire.

## Same-thread release/acquire pairs

A same-thread release-acquire pair is a pair of a release action followed by an acquire action in program order.

An action is a release if it is a possible source of a synchronisation unlock mutex, release or seq_cst atomic write

An action is an acquire if it is a possible target of a synchronisation lock mutex, acquire or seq_cst atomic read

## Elimination of overwritten writes



It is safe to eliminate the first store if there are:

1. no intervening accesses to $g$
2. no intervening same-thread release-acquire pairs

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$

Thread 1
$\mathrm{g}=1$;
f1.store(1,RELEASE);
while(f2.load(ACQUIRE)==0);
g = 2;

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$



## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$


f1. $\operatorname{store(1,RELEASE);~}$
while $(f 2 . l o a d(A C Q U I R E)==0) ; ~ s a m e-t h r e a d ~ r e l e a s e-a c q u i r e ~ p a i r ~$ g = 2;

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$

Thread 1
g = 1;
f1.store(1,RELEASE); while(f2.load(ACQUIRE)==0); g = 2;

Thread 2
while(f1.load(ACQUIRE)==0); printf("\%d", g); f2.store(1,RELEASE);

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$

Thread 1
Thread 2
$g=1 ; \quad$ sync while(f1. load(ACQUIRE)==0);
f1.store(1,RELEASE); $\xrightarrow[\text { printf("\%d", g); }]{ }$
while(f2. load(ACQUIRE) $==0) ;{\underset{s i n c}{ }}$ f2.store(1,RELEASE);
g = 2;
Thread 2 is non-racy

## The soundness condition

## Shared memory

$$
g=0 ; \text { atomic f1 }=f 2=0
$$

Thread 1
Thread 2
$g=1 ; \longleftrightarrow$ sync while(f1. load(ACQUIRE)==0);


g = 2;
Thread 2 is non-racy
The program should only print 1

## The soundness condition

Shared memory
$g=0 ;$ atomic $f 1=f 2=0 ;$

Thread 1
f1. store(1, RELEASE) $\xrightarrow{\text { sync }}$ while(f1.load(ACQUIRE)==0);解
while(f2. load(ACQUIRE)==0); $\operatorname{singc}$ f2. store(1,RELEASE);
g = 2;

Thread 2 is non-racy
The program should only print 1
If we perform overwritten write elimination it prints 0

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$

Thread 1
$g=1 ; \quad$ sync while(f1. load(ACQUIRE)==0);
f1. store(1,RELEASE);
while(f2.load(ACQUIRE)==0);
g = 2;

Thread 2 printf("\%d", g); f2.store(1,RELEASE);

## The soundness condition

$$
\begin{gathered}
\text { Shared memory } \\
g=0 ; \text { atomic } f 1=f 2=0 ;
\end{gathered}
$$

Thread 1
g = 1;
f1.store(1,RELEASE);
$g=2 ;$

Thread 2
sync while(f1.load(ACQUIRE)==0); printf("\%d", g); f2.store(1,RELEASE);

## The soundness condition

Shared memory

$$
g=0 ; \text { atomic f1 }=f 2=0
$$

Thread 1
Thread 2
$g=1 ; \quad$ sync while(f1. load(ACQUIRE)==0);
f1.store(1,RELEASE); $\quad$ printf("\%d", g);
$g=2 ;$ datarace f2.store(1,RELEASE);

If only a release (or acquire) is present, then all discriminating contexts are racy.
It is sound to optimise the overwritten write.

## Eliminations: bestiary

| Store $\mathrm{g} \mathrm{V}_{1}$ | Read g v | Store g v | Read g v | Store $\mathrm{g} \mathrm{V}_{1}$ |
| :---: | :---: | :---: | :---: | :---: |
| $\text { sb } \downarrow$ | $\mathrm{sb} \downarrow$ | $\mathrm{sb} \downarrow$ | $\text { sb } \downarrow$ | sb $\downarrow$ |
| no access to g | no access to g | no access to g | no access to g | no access to g |
| no rel/acq pair | no rel/acq pair | no rel/acq pair | no rel/acq pair | no rel/acq pair |
| $\text { sb } \downarrow$ | sb | $\text { sb } \downarrow$ | $\mathrm{sb} \downarrow$ | $\mathrm{sb} \downarrow$ |
| Store g V 2 | Read g v | Read g v | Store g v | Store $\mathrm{g} \mathrm{V}_{1}$ |

Overwritten-Write Read-after-Read Read-after-Write Write-after-Read Write-after-Write

Reads which are not used (via data or control dependencies) to decide a write or synchronisation event are also eliminable (irrelevant reads).

## Theorem

## Soundness proved w.r.t. Batty et al. formalisation of the C11/C++11 memory model (POPL 11)



Overwritten-Write Read-after-Read Read-after-Write Write-after-Read Write-after-Write

Reads which are not used (via data or control dependencies) to decide a write or synchronisation event are also eliminable (irrelevant reads).

## Reorderings and introductions

Correctness criterion for reordering events:
— different addresses

- no synchronisations in-between

Roach-motel reordering (reordering across locks) not observed in practice

Read introductions observed in practice (gcc, clang).
Introduction of eliminable reads proved correct. Introduction of irrelevant reads does not introduce new behaviours, but cannot be proved correct in a DRF model.

## The CMMTEST Tool

## Random

reference semantics


REFERENCE MEMORY TRACE
optimising
compiler
under test
optimising
compiler
under test
optimising
compiler
under test


MEMORY TRACE

# only transformations sound in any 

 concurrent (non-racy) context?





## Subtleties:

- dependencies between eliminable events
- some optimisations (e.g. merging of accesses) cannot be expressed in the C11/C++11 formalisation
- the tool also ensures that the compilation of atomic accesses is preserved by the optimiser


## OCaml tool

1. analyse the traces to detect eliminable actions
2. match reference and optimised traces

## Interaction with GCC developers

## 1. Some GCC bugs

## Some concurrency compiler bugs found in the latest version of GCC.

Store introductions performed by loop invariant motion or if-conversion optimisations.

All promptly fixed.
Remark: these bugs break the Posix thread model too.

## 2. Checking compiler invariants

GCC internal invariant: never reorder with an atomic access
Baked this invariant into the tool and found a counterexample...
...not a bug, but fixed anyway

```
atomic_uint a;
int32_t g1, g2;
```

```
int main (int, char *[]) {
    a.load() & a.load ();
    g2 = g1 != 0;
}
```

| ALoad | a | 0 |  | Load | g1 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ALoad | a | 0 |  | ALoad | a |  |
| Load | g1 | 0 |  | ALoad | a |  |
| Store | g2 | 0 |  | Store | g2 |  |

## 3. Detecting unexpected behaviours

uint16_t g


If $g$ is initialised with 0, a load gets replaced by a store:


The introduced store cannot be observed by a non-racy context.
Still, arguable if a compiler should do this or not.

## 4. Out of thin-air reads

## Memory access synchronisation

$$
x=y=0
$$

Thread 1
Thread 2

| $y=1$ | if | (x.load |
| :---: | :---: | :---: |
| x.store(1,MO_RELEASE) | $r 2=y$ |  |

## Memory access synchronisation

$$
x=y=0
$$

Thread 1
Thread 2


Non-atomic loads must return the most recent write in the happens-before order

## Understanding MO_RELAXED

$$
x=y=0
$$

Thread 1
Thread 2

$$
\begin{array}{l|l}
\mathrm{y}=1 & \text { if }(\mathrm{x} . \operatorname{load}(\mathrm{MO} \text { RELAXED })==1) \\
(1, \text { MO_RELAXED }) & \mathrm{r} 2=\mathrm{y}
\end{array}
$$

## Understanding MO_RELAXED

$$
x=y=0
$$

Thread 1
Thread 2

$$
\begin{array}{c|c}
y=1 \mathrm{if}\left(\mathrm{x} . \operatorname{load}\left(\mathrm{MO} \_ \text {RELAXED }\right)==1\right) \\
\text { x.store(1,MO_RELAXED) } & \mathrm{r} 2=\mathrm{y}
\end{array}
$$

## DATA RACE

Two conflicting accesses not related by happens-before

## Understanding MO_RELAXED

$$
x=y=0
$$

Thread 1
Thread 2

| y.store(1,MO_RELAXED) | if (x.load(MO_RELAXED) |
| :---: | :---: |
| x.store(1,MO_RELAXED) | y.load(MO_RELAXED) |

## WELL DEFINED

$$
\text { but } \mathrm{r} 2=0 \text { is possible }
$$

## Understanding MO_RELAXED

$$
x=y=0
$$

Thread 1
Thread 2

$$
\begin{array}{c|c}
\text { y.store(1,MO_RELAXED) } & \text { if (x.load(MO_RELAXED) == } 1) \\
\text { x.store(1,MO_RELAXED) } & \text { r2 = y.load(MO_RELAXED) }
\end{array}
$$

Allow a RELAXED load to see any store that:

- does not happens-after it
- is not hidden by an intervening store hb-ordered between them


## Intuition

the compiler (or hardware) can reorder independent accesses

$$
x=y=0
$$

Thread 1
Thread 2

```
Y.store(1,MO_RELAXED) if (x.load(MO_RELAXED) == 1)
x.store(1,MO_RELAXED) r2 = y.load(MO_RELAXED)
```

Allow a RELAXED load to see any store that:

- does not happens-after it
- is not hidden by an intervening store hb-ordered between them


# Shorthand <br> from now on, all the memory accesses are atomic with MO_RELAXED semantics 

## Out-of-thin-air

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{aligned}
& r 1=x \\
& y=r 1
\end{aligned}
$$

$$
\begin{aligned}
& r 2=y \\
& x=42
\end{aligned}
$$

## Out-of-thin-air

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{aligned}
& r 1=x \\
& y=r 1
\end{aligned}
$$

$$
\begin{aligned}
& r 2=y \\
& x=42
\end{aligned}
$$

$$
r 1=r 2=42
$$

is a valid execution.


## Intuition

the compiler (or hardware) can reorder independent accesses

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{aligned}
& r 1=x \\
& y=r 1
\end{aligned}
$$

$$
\begin{aligned}
& r 2=y \\
& x=42
\end{aligned}
$$

$$
r 1=r 2=42
$$

is a valid execution.


## Out-of-hin-air reads

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{aligned}
& r 1=x \\
& y=r 1
\end{aligned}
$$

$$
\begin{aligned}
& r 2=y \\
& x=r 2
\end{aligned}
$$

## Out-of-thin-air reads

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{array}{l|l}
r 1=x & r 2=y \\
y=r 1 & x=r 2
\end{array}
$$

$$
r 1=r 2=42
$$

is also an allowed execution

the value 42 appears out-of-thin-air

Thread 1

$$
x=y=0
$$

Thread 2

$$
\begin{aligned}
& r 1=x \\
& y=r 1
\end{aligned}
$$

$$
\begin{aligned}
& \mathrm{r} 2=\mathrm{y} \\
& \mathrm{x}=\mathrm{r} 2
\end{aligned}
$$

$$
r 1=r 2=42
$$

is also an allowed execution


## Speculation can justify out-of-thin-air reads

If the compiler states that x is likely to hold $42 \ldots$

$$
\begin{aligned}
& \mathrm{y}:=42 \\
& \mathrm{r} 1:=\mathrm{x} \\
& \text { if }(\mathrm{r} 1 \quad!=42) \mathrm{y}:=\mathrm{r} 1 \text {; } \\
& \text { print } \mathrm{r} 1
\end{aligned}
$$

| initially $\mathrm{x}=\mathrm{y}=0$ |  |
| :--- | :--- |
| $\mathrm{r} 1:=\mathrm{y}$ | $\mathrm{r} 2:=\mathrm{y}$ |
| $\mathrm{y}: \mathrm{F}$ r1 | $\mathrm{x}:=\mathrm{r} 2$ |
| Print r 1 | print r 2 |

## Speculation can justify out-of-thin-air reads

If the compiler states that x is likely to hold $42 \ldots$

```
y := 42
r1 := x
if (r1 != 42) y := r1;
print r1
```

| initially $\mathrm{x}=\mathrm{y}=0$ |  |
| :--- | :--- |
| $\mathrm{r} 1:=\mathrm{y}$ | $\mathrm{r} 2:=\mathrm{y}$ |
| $\mathrm{y}: \mathrm{F}$ r1 | $\mathrm{x}:=\mathrm{r} 2$ |
| Print r 1 | print r 2 |

It does not happen in practice... even if it might!

```
struct foo {
    atomic<struct foo *> next;
}
struct foo *a;
```

Thread 1

$$
\begin{aligned}
& \text { r1 = a->next } \\
& \text { r1->next = a }
\end{aligned}
$$



```
struct foo {
    atomic<struct foo *> next;
}
struct foo *a;
```

Thread 1

$$
\begin{aligned}
& \text { r1 = a->next } \\
& \text { r1->next = a }
\end{aligned}
$$



```
struct foo {
    atomic<struct foo *> next;
}
struct foo *a, *b;
```

Thread 1
Thread 2

$$
\begin{aligned}
& r 1=a->\text { next } \\
& \text { r1->next }=a
\end{aligned}
$$

r2 = b->next
r2->next $=\mathrm{b}$

```
struct foo {
    atomic<struct foo *> next;
}
struct foo *a, *b;
```

Thread 1

$$
\begin{array}{l|l}
\mathrm{r} 1=\mathrm{a}->\text { next } & \mathrm{r} 2=\mathrm{b}->\text { next } \\
\mathrm{r} 1->\text { next }=\mathrm{a} & \mathrm{r} 2->\text { next }=\mathrm{b}
\end{array}
$$

If a and b initially reference disjoint data-structures we expect a and b to remain disjoint

```
struct foo {
    atomic<struct foo *> next;
}
struct foo *a, *b;
```

Thread 1
Thread 2

$$
\begin{aligned}
& r 1=a->\text { next } \\
& \text { r1->next }=a
\end{aligned}
$$


b


If the compiler speculates $\mathrm{r} 1=\mathrm{b}$ and $\mathrm{r} 2=\mathrm{a}$, then
the store $\mathrm{r} 1->$ next $=$ a justifies $\mathrm{r} 2=\mathrm{b}->$ next assigning $\mathrm{r} 2=\mathrm{a}$ (and symmetrically to justify r1=b)

Thread 1
Thread 2

$$
\begin{aligned}
& r 1=a->\text { next } \\
& r 1->\text { next }=a
\end{aligned}
$$

r2 = b->next
r2->next $=b$

b


If the compiler speculates $\mathrm{r} 1=\mathrm{b}$ and $\mathrm{r} 2=\mathrm{a}$, then
the store $\mathrm{r} 1->$ next $=$ a justifies $\mathrm{r} 2=\mathrm{b}->$ next assigning $\mathrm{r} 2=\mathrm{a}$ (and symmetrically to justify r1=b)

Thread 1
Thread 2

$$
\begin{aligned}
& r 1=a->\text { next } \\
& r 1->\text { next }=a
\end{aligned}
$$

$$
\begin{aligned}
& r 2=b->\text { next } \\
& \text { r2->next }=b
\end{aligned}
$$



If the compiler speculates $\mathrm{r} 1=\mathrm{b}$ and $\mathrm{r} 2=\mathrm{a}$, then
the store $r 1->$ next $=a$ justifies $r 2=b->$ next assigning $r 2=a$ (and symmetrically to justify r1=b)

## Break our basic intuitions about memory and sharing!




Common compiler optimisations are unsound in Cl 1

$$
x=y=a=0
$$

| if (x.load (rlx)==42) | if (y.load (rlx)==42) | $\mathrm{a}=1$ |
| :---: | :---: | :---: |
| y.write(42,rlx) | if ( $\mathrm{a}==1$ ) |  |
|  | x.write(42,rlx) |  |

$$
x=y=a=0
$$

if | $(x . \operatorname{load}(r l x)==42)$ |  |
| :---: | :---: |
| $y \cdot w r i t e ~$ | $(42, r l x)$ |\(\left|\begin{array}{c}if(y \cdot \operatorname{load}(r l x)==42) <br>

if(a==1) <br>
x.write(42, r l x)\end{array}\right| a=1\)

## Remark 1

This code is not racy!
There is no consistent execution in which the read of a occurs.

$$
x=y=a=0
$$

if | $(x . \operatorname{load}(r l x)==42)$ |
| :---: |
| $y \cdot w r i t e ~$ |
| $(42, r l x)$ |\(\left|\begin{array}{c}if(y \cdot \operatorname{load}(r l x)==42) <br>

if(a==1) <br>
x.write(42, r l x)\end{array}\right| a=1\)

Remark 2

$$
a=1 \wedge x=y=0
$$

is the only possible final state

$$
x=y=a=0
$$

if | $(x . \operatorname{load}(r l x)==42)$ |  |
| :---: | :---: |
| $y \cdot w r i t e ~$ | $(42, r l x)$ |\(\left|\begin{array}{c}if(y \cdot \operatorname{load}(r l x)==42) <br>

if(a==1) <br>
x.write(42, r l x)\end{array}\right| a=1\)

## Consider sequentialisation:

$C \| D \Rightarrow C ; D$
(ought to be correct)

$$
x=y=a=0
$$

| $\begin{gathered} \text { if }(x . l o a d(r l x)==42) \\ y . \operatorname{write}(42, r l x) \end{gathered}$ | ```if (y.load(rlx)==42) if (a==1) x.write(42,rlx)``` | $a=1$ |
| :---: | :---: | :---: |
|  | $\mathrm{a}=1$ |  |
| $\begin{gathered} \text { if }(x . \operatorname{load}(r l x)==42) \\ y \cdot \operatorname{write}(42, r l x) \end{gathered}$ | ```if \((y . \operatorname{load}(r l x)==42)\) if ( \(a==1\) ) x.write(42,rlx)``` |  |

$$
x=y=a=0
$$

$$
\begin{array}{c|c}
\text { if }(\mathrm{x} . \operatorname{load}(\mathrm{rlx})==42) & \text { if }(\mathrm{y} \cdot \operatorname{load}(\mathrm{rlx})==42) \\
\mathrm{y} \cdot \mathrm{write}(42, r l \mathrm{x}) & \text { if }(\mathrm{a}==1) \\
& \text { x.write }(42, r l x)
\end{array}
$$

$$
a=1
$$

$$
x=y=a=0
$$

|  | $\mathrm{a}=1$ |
| :---: | :---: |
| if (x.load (rlx)==42) | if (y.load (rlx)==42) |
| y.write(42,rlx) | if ( $\mathrm{a}==1$ ) |
|  | x.write( $42, r l x$ ) |



$$
a=1
$$

$$
x=y=42
$$

$$
x=y=a=0
$$

|  | $\mathrm{a}=1$ |
| :---: | :---: |
| if (x.load (rlx)==42) | if (y.load (rlx)==42) |
| y.write(42,rlx) | if ( $\mathrm{a}==1$ ) |
|  | x.write(42,rlx) |

## Break common source-to-source

(or LLVM IR - to - LLVM IR)

## compiler optimisations

including expression linearisation and roach-motel reorderings

## Are there any solutions?



| Thread 0 | Thread 1 |
| :---: | :---: |
| $\mathrm{r} 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=42$ |



| Thread 0 | Thread 1 |
| :--- | :--- |
| $\mathrm{r} 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=\mathrm{r} 2$ |



| Thread 0 | Thread 1 |
| :--- | :--- |
| $\mathrm{r} 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=42$ |



$$
r 1=r 2=42 . \quad \text { Can you spot the difference? }
$$

| Thread 0 | Thread 1 |
| :--- | :--- |
| $r 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=\mathrm{r} 2$ |



| Thread 0 | Thread 1 |
| :--- | :--- |
| $\mathrm{r} 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=42$ |



The "bad" example has a cycle of dependencies.

| Thread 0 | Thread 1 |
| :--- | :--- |
| $r 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=\mathrm{r} 2$ |



## Solution 1.

## Prohibit executions with dependency cycles

The "bad" example has a cycle of dependencies.

| Thread 0 | Thread 1 |
| :--- | :--- |
| $r 1=\mathrm{x}$ | $\mathrm{r} 2=\mathrm{y}$ |
| $\mathrm{y}=\mathrm{r} 1$ | $\mathrm{x}=\mathrm{r} 2$ |



## Compiler writers

## do not want to track all dependencies

## Compiler writers

 do not want to track all dependencies$$
\begin{aligned}
& \text { if }(x) \\
& \quad \mathrm{a}[\mathrm{i}++]=1 \text {; } \\
& \text { else } \\
& \quad \mathrm{a}[\mathrm{i}++]=2 ;
\end{aligned}
$$

## Does the store to $i$ depend on the load of $x$ ?

## Solution 2. Brute force

## Disallow cycles altogether


acyclic $(\mathrm{hb} \cup\{(a, b) \mid r f(b)=a\})$

## Allows all source-to-source optimisations

(except for r/w reordering on atomics)
but expensive on ARM and GPUs
Disallow cycles altogether

acyclic $(\mathrm{hb} \cup\{(a, b) \mid r f(b)=a\})$

## Solution 3. less brute force

## Allow cycles but make this racy

 by allowing a to read 1

## Efficient implementation of atomics on ARM/GPUs

## but all R/W reorderings are unsound

## Allow cycles but make this racy

$$
\text { by allowing a to read } 1
$$

if | $(x . \operatorname{load}(r l x)==42)$ |  |
| :---: | :---: |
| $y \cdot w r i t e ~$ | $(42, r l x)$ |\(\left|\begin{array}{c}if(y \cdot \operatorname{load}(r l x)==42) <br>

if(a==1) <br>
x . w r i t e(42, r l x)\end{array}\right| a=1\)

## State of the art

"Implementations should ensure
that no "out-of-thin-air" values are computed that circularly depend on their own computation."

Current proposal for C14

## Conclusion

## Syllabus

In these lectures we have covered the hardware models of two modern computer architectures (x86 and Power/ARM - at least for a large subset of their instruction set).

We have seen how compiler optimisations can also break concurrent programs and the importance of defining the memory model of highlevel programming languages.

We have also introduced some proof methods to reason about concurrency.

After these lectures, you might have the feeling that multicore programming is a mess and things can't just work.


The memory models of modern hardware are better understood.

Programming languages attempt to specify and implement reasonable memory models.

Researchers and programmers are now interested in these problems.

The memory models of modern hardware are better understood.

## Still, many open problems...

problems.

The memory models of modern hardware are better understood.

Still, many research opportunities! els.
problems.


All these lectures are based on work done with/by my colleagues. Thank you!


## And thank you all for attending these lectures!

Please, fill the course evaluation form, that's important to make a better course next year.



## 4. Sketch of an operational formalisation of x86-TSO

...starting with a formalisation of SC

## Separate language and memory semantics

```
class ArrayWrapper
    public:
        ArrayWrapper (int n)
            : _p_vals( new int[ n ] )
            , _size(n )
        {)
        // copy constructor
        ArrayWrapper (const ArrayWrapper& other)
            : p_vals( new int[ other._size ] )
            , _size( other._size )
            for ( int i = 0; i < _size; ++i )
            i ( i ( )
                _p_vals[ i ] = other._p_vals[ i ];
        }
        ~ArrayWrapper ()
            delete [] _P_vals;
        }
    private:
    int * p_vals;
    int _size;
```

program
semantics defined via an LTS

memory
semantics defined via an LTS
$\mathrm{W}_{\mathrm{t}}[\mathrm{a}] \mathrm{v}$ : a write of value v to address a by thread t
Labels for interaction: $\quad R_{t}[a] v$ : a read of $v$ from $a$ by $t$ by thread $t$ + other events for barriers and locked instructions

## Separate language and memory semantics

Separate language and state semantics proved to be a very good choice in many (unrelated) projects I worked on!

Labels for interaction:
$\mathrm{W}_{\mathrm{t}}[\mathrm{a}] \mathrm{v}$ : a write of value v to address a by thread t $R_{t}[a] v$ : a read of $v$ from a by $t$ by thread $t$

+ other events for barriers and locked instructions


## A tiny language



## What can a thread do in isolation?

$e \xrightarrow{l} e^{\prime} \quad e$ does $l$ to become $e^{\prime}$
$\xlongequal[{* x \xrightarrow{\mathrm{R} x=n}} n]{\text { READ }}$
$\underset{* x=n \xrightarrow{\mathrm{~W} x=n} n}{ } \quad$ WRITE
$\frac{e \xrightarrow{l} e^{\prime}}{* x=e \xrightarrow{l} * x=e^{\prime}}$
$\overline{n ; e \xrightarrow{\tau} e} \quad \mathrm{SEQ}$
$\frac{e_{1} \xrightarrow{l} e_{1}^{\prime}}{e_{1} ; e_{2} \xrightarrow{l} e_{1}^{\prime} ; e_{2}}$

$$
\begin{array}{ll}
\frac{e_{1} \xrightarrow{l} e_{1}^{\prime}}{e_{1}+e_{2} \xrightarrow{l} e_{1}^{\prime}+e_{2}} & \text { PLUS_CONTEXT_1 } \\
\frac{e_{2} \xrightarrow{l} e_{2}^{\prime}}{n_{1}+e_{2} \xrightarrow{l} n_{1}+e_{2}^{\prime}} & \text { PLUS_CONTEXT_2 } \\
\frac{n=n_{1}+n_{2}}{n_{1}+n_{2} \xrightarrow{\tau} n} \quad \text { PLUS }
\end{array}
$$

Observe that we can read an arbitrary value from the memory.

## Example

Show that the expression:

$$
(* x=* y) ; * x
$$

can perform the following trace:

$$
(* x=* y) ; * x \xrightarrow{\mathrm{R} y=7} \xrightarrow{\mathrm{~W} x=7} \xrightarrow{\tau} \xrightarrow{\mathrm{R} x=9} 9
$$

## Lifting to processes

$p \xrightarrow{l_{t}} p^{\prime} \quad p$ does $l_{t}$ to become $p^{\prime}$
$\frac{e \xrightarrow{l} e^{\prime}}{t: e \xrightarrow{l_{t}} t: e^{\prime}} \quad$ THREAD
$\frac{p_{1} \xrightarrow{l_{t}} p_{1}^{\prime}}{p_{1}\left|p_{2} \xrightarrow{l_{t}} p_{1}^{\prime}\right| p_{2}}$
PAR_CONTEXT_LEFT
Actions are labelled by the thread that performed the action.

Free interleaving.
$\frac{p_{2} \xrightarrow{l_{t}} p_{2}^{\prime}}{p_{1}\left|p_{2} \xrightarrow{l_{t}} p_{1}\right| p_{2}^{\prime}}$

## PAR_CONTEXT_RIGHT

## A sequentially consistent memory

Take M to be a function from addresses to integers.

$$
M \xrightarrow{l} M^{\prime} \quad M \text { does } l \text { to become } M^{\prime}
$$

$$
\frac{M(x)=n}{M \xrightarrow{\mathrm{R} x=n} M} \quad \text { MREAD }
$$

$$
\overline{M \xrightarrow{\mathrm{~W} x=n} M \oplus(x \mapsto n)}
$$

Mwrite

## SC semantics: whole system transitions

$$
s \xrightarrow{l_{t}} s^{\prime} \quad s \text { does } l_{t} \text { to become } s^{\prime}
$$

$$
\begin{gathered}
p \xrightarrow{\mathrm{R}_{t} x=n} p^{\prime} \\
M \xrightarrow{\mathrm{R} x=n} M^{\prime} \\
\langle p, M\rangle \xrightarrow{\mathrm{R}_{t} x=n}\left\langle p^{\prime}, M^{\prime}\right\rangle
\end{gathered} \quad \text { SREAD }
$$

Synchronising between the processes and the memory.

## SC semantics, example

All threads read and write the shared memory. Threads execute asynchronously,the semantics allows any interleaving of the thread transitions.


Each interleaving has a linear order of reads and writes to memory.
...now we just have to define a TSO memory...

## A sequentially consistent memory

Take M to be a function from addresses to integers.
$M \xrightarrow{l} M^{\prime} \quad M$ does $l$ to become $M^{\prime}$

$$
\frac{M(x)=n}{M \xrightarrow{\mathrm{R} x=n} M} \quad \text { MREAD }
$$

$$
M \xrightarrow{\mathrm{~W} x=n} M \oplus(x \mapsto n)
$$

Mwrite

## x86-TSO abstract machine



## x86-TSO abstract machine

- The store buffers are FIFO. A reading thread must read its most recent buffered write, if there is one, to that address; otherwise reads are satisfied from shared memory.
- To execute a LOCK'd instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock. While the lock is held by one thread, no other thread can read.
- A buffered write from a thread can propagate to the shared memory at any time except when some other thread holds the lock.


## x86-tso: a formalisation using an LTS

The machine state $s$ can be represented by a tuple ( $M, B, L$ ):
M : address -> value option
B : tid -> (address * value) list
L : tid option
where:
$M$ is the shared memory, mapping addresses to values
B gives the store buffer for each thread
L is the global machine lock indicating when a thread has exclusive access to memory (omitted in these slides)

## x86-tso abstract machine: selected transition rules

t is not blocked in machine state $\mathrm{s}=(\mathrm{M}, \mathrm{B}, \mathrm{L})$ if $[\ldots$ or $]$ the lock is not held.
In buffer $\mathrm{B}(\mathrm{t})$ there are no pending writes for address x if there are no $(x, v)$ elements in $B(t)$.

RM: Read from memory not_blocked $(s, t)$
$s . M(x)=v$
no_pending $(s . B(t), x)$
$s \xrightarrow{\mathrm{R}_{t} x=v} s$
Thread $t$ can read $v$ from memory at address $x$ if $t$ is
not blocked, the memory does contain $v$ at $x$, and there are no writes to $x$ in $t$ 's store buffer.
x86-tso abstract machine: selected transition rules

RB: Read from write buffer
not_blocked $(s, t)$
$\exists b_{1} b_{2} . s \cdot B(t)=b_{1}++[(x, v)]++b_{2}$
no_pending $\left(b_{1}, x\right)$
$s \xrightarrow{\mathrm{R}_{t} x=v} s$
Thread $t$ can read $v$ from its store buffer for address $x$ if $t$ is not blocked and has $v$ as the newest write to $x$ in its buffer;

## x86-tso abstract machine: selected transition rules

WB: Write to write buffer

$$
s \quad \xrightarrow{\mathrm{~W}_{t} x=v} s \oplus \llbracket B:=s . B \oplus(t \mapsto([(x, v)]++s . B(t))) \downarrow
$$

Thread $t$ can write $v$ to its store buffer for address $x$ at any time;

WM: Write from write buffer to memory

$$
\begin{aligned}
& \text { not_blocked }(s, t) \\
& s . B(t)=b++[(x, v)]
\end{aligned}
$$

$$
\begin{aligned}
& s \xrightarrow{\tau_{t x=v}} \\
& s \oplus \backslash M:=s . M \oplus(x \mapsto v) \rrbracket \oplus \backslash B:=s . B \oplus(t \mapsto b) \rrbracket
\end{aligned}
$$

If $t$ is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in memory at the given address, without coordinating with any hardware thread


## 5. Veryfing fence elimination optimisations

aka reasoning on the x86TSO operational memory model and compiler correctness

## CompCertTSO



CFG generation
[POPL 2011]

## CompCertTSO + fence optimisations



## Compilers are ideal for verification

source program (e.g., C)

Compilers are:

- Basic computing infrastructure
- Generally reliable, but nevertheless contain many bugs
e.g., Yang et al. [PLDI 2011] found 79 gcc \& 202 llvm bugs
- "Specifiable": compiler correctness = preservation of behaviours
- Interesting: naturally higher-order, involve clever algorithms
- Big, but modular


## Language semantics

The semantics of all the CompCertTSO languages is defined by:

- a type of programs, prg
- a type of states, states
- a set of initial states for each program, init $\in \operatorname{prg} \rightarrow \mathbb{P}($ states $)$
- a transition relation, $\quad \rightarrow \in \mathbb{P}($ states $\times$ event $) \times$ states $)$
call, return, fail, oom,

The visible behaviour of a program is defined by the external function calls (call) and returns (return), errors (fail), and running out of memory (oom).

## Traces

- Finite sequences of call \& return events ending with: end: successful termination, inftau: infinite execution that stops performing visible events oom: execution runs out of memory
- Infinite sequences of call \& return events;

$$
\begin{aligned}
\operatorname{traces}(p) \stackrel{\text { def }}{=} & \left\{\ell \cdot \operatorname{end} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\ell} s^{\prime} \wedge s^{\prime} \nrightarrow\right\} \\
& \cup\left\{\ell \cdot \operatorname{tr} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\ell \cdot \text { fail }} s^{\prime}\right\} \\
& \cup\left\{\ell \cdot \operatorname{inftau} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\Rightarrow} s^{\prime} \wedge \operatorname{inftau}\left(s^{\prime}\right)\right\} \\
& \cup\left\{\ell \cdot \operatorname{oom} \mid \exists s \in \operatorname{init}(p) . \exists s^{\prime} . s \xlongequal{\Rightarrow} s^{\prime}\right\} \\
& \cup\{t r \mid \exists s \in \operatorname{init}(p) . s \text { can do the infinite trace } \operatorname{tr}\}
\end{aligned}
$$

NB: Erroneous computations become undefined after the first error.

## Compiler correctness

source program (e.g., C)
Compiler target program (e.g., x86)
traces(source_program) $\supseteq$ traces(target_program)
print "a" || print "b" $\longrightarrow$ print "ab"
print "ab"
print "a" || print "b"
fail
$\longrightarrow$ print "ab"
print "ab"


## Fence instructions prevent hardware reorderings

E.g., on x86-TSO:


## Who inserts fences?

```
1. The programmer, explicitly. Example: Fraser's lockfree-lib:
/*
    * II. Memory barriers.
    * MB(): All preceding memory accesses must commit before any later accesses.
    *
    * If the compiler does not observe these barriers (but any sane compiler
    * will!), then VOLATILE should be defined as 'volatile'.
    */
#define MB() __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
```

2. The compiler, to implement a high-level memory model, e.g. SEQ_CST C++0x low-level atomics on x86:

Load SEQ_CST: MFENCE; MOV
Store SEQ_CST: MOV; MFENCE

## Fence instructions

1. Fences are necessary
to implement locks \& not fully-commutative linearizable objects (e.g., stacks, queues, sets, maps).
[Attiya et al., POPL 2011]
2. Fences can be expensive

## Redundant fences (1)

If we have two consecutive fence instructions, we can remove the latter:
MFENCE

MFENCE $\quad$| MFENCE |
| :--- |
| NOP |

The buffer is already empty when the second fence is executed.

Generalisation:

| MFENCE | MFENCE |  |
| :--- | :--- | :--- | :--- |
| NON-WRITE INSTR | NON-WRITE INSTR |  |
| ‥ | $\ldots$ |  |
| NON-WRITE INSTR |  | NON-WRITE INSTR |
| MFENCE | NOP |  |

A fence is redundant if it always follows a previous fence or locked instruction in program order, and no memory store instructions are in between.

A forward data-flow problem over the boolean domain $\{\perp, \top\}$

Associate to each program point:
$\perp$ : along all execution paths there is an atomic instruction before the current program point, with no intervening writes;

T : otherwise.

$$
\mathcal{F} \mathcal{E}_{1}(n)= \begin{cases}\top & \text { if predecessors }(n)=\emptyset \\ \bigsqcup_{p \in \operatorname{predecessors}(n)} T_{1}\left(\operatorname{instr}(p), \mathcal{F E}_{1}(p)\right) & \text { otherwise }\end{cases}
$$

A fence is redundant if it always follows a previous fence or locked instruction in program order, and no memory store instructions are in between.


## Redundant fences (2)

If we have two consecutive fence instructions, we can remove the former:
MFENCE

MFENCE $\quad$| NOP |
| :--- |
| MFENCE |

Intuition: the visible effects initially published by the former fence, are now published by the latter, and nobody can tell the difference.

Generalisation:

| MFENCE |  | NOP |  |
| :--- | :--- | :--- | :--- |
| INSTRUCTION | 1 |  | INSTRUCTION |
| IN | 1 |  |  |
| $\cdots$ |  | $\cdots$ |  |
| INSTRUCTION |  | INSTRUCTION | $n$ |
| MFENCE |  | MFENCE |  |

## Redundant fences (2)

If there are reads in between the fences...

but

|  | Thread 0 | Thread 1 |  |
| :---: | :---: | :---: | :---: |
| $[\mathrm{x}]=[\mathrm{y}]=0$ | $\begin{aligned} & \text { MOV }[\mathrm{x}] ~ \leftarrow ~ \\ & \operatorname{NOP} \\ & \text { MOV EAX } \leftarrow \mathrm{y}] \\ & \operatorname{MFENCE} \end{aligned}$ | $\begin{aligned} & \operatorname{MOV}[\mathrm{y}] \\ & \operatorname{MFENCE} \\ & \operatorname{MOV} \text { EBX } \end{aligned} \leftarrow[\mathrm{x}]=10$ | $\begin{gathered} \mathrm{EAX}=\mathrm{EBX}=0 \\ \text { allowed } \end{gathered}$ |

## Redundant fences (2)

If there are reads in between the fences...


## Redundant fences (2)

Swapping a STORE and a MFENCE is sound:

MFENCE; STORE
STORE; MFENCE

1. transformed program's behaviours $\subseteq$ source program's behaviours (source program might leave pending write in its buffer)
2. There is the new intermediate state if the buffer was initially non-empty, but this intermediate state is not observable.
(a local read is needed to access the local buffer)
Intuition: Iterate this swapping... later fence or locked instruction in program order, and no memory read instructions are in between.

A backward data-flow problem over

| $T_{2}($ nop, $\mathcal{E})$ | $=\mathcal{E}$ |
| :--- | :--- |
| $T_{2}(\operatorname{op}(o p, \vec{r}, r), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}(\operatorname{load}(\kappa$, addr, $\vec{r}, r), \mathcal{E})$ | $=\top$ |
| $T_{2}(\operatorname{store}(\kappa$, addr, $\vec{r}$, src $), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}(\operatorname{call}($ sig, ros, args, res $), \mathcal{E})$ | $=\top$ |
| $T_{2}(\operatorname{cond}($ cond, args $), \mathcal{E})$ | $=\mathcal{E}$ |
| $T_{2}($ return $($ optarg $), \mathcal{E})$ | $=\mathrm{\top}$ |
| $T_{2}($ threadcreate $($ optarg $), \mathcal{E})$ | $=\top$ |
| $T_{2}($ atomic $($ aop $, \vec{r}, r), \mathcal{E})$ | $=\perp$ |
| $T_{2}($ fence, $\mathcal{E})$ | $=\perp$ |

T : otherwise.
Associate to each program point:
$\perp$ : along all execution paths there is an atomic instruction after the current program point, with no intervening reads;

$$
\mathcal{F} \mathcal{E}_{2}(n)=\left\{\begin{array}{l}
\top \\
\bigsqcup_{s \in \operatorname{successors}(n)} T_{2}\left(\text { instr }(s), \mathcal{F} \mathcal{E}_{2}(s)\right)
\end{array}\right.
$$

$$
\text { if successors }(n)=\emptyset
$$

otherwise

## FE1 and FE2 are both useful

Removed by FE1 but not FE2:

Removed by FE2 but not FE1:

```
MFENCE
```

```
MOV EAX <- [Y]
```

MOV EAX <- [Y]
MFENCE
MFENCE
MOV EBX <- [y]

```
MOV EBX <- [y]
```

MOV [x] <- 1
MFENCE
MOV [x] <- 2
MFENCE

## Informal correctness argument

Intuition: FE2 can be thought as iterating

and then applying
MFENCE; MFENCE NOP; MFENCE

This argument works for finite traces, but not for infinite traces as the later fence might never be executed:

| MFENCE; | NOP; |
| :--- | :--- |
| STORE; |  |
| WHILE $(1) ;$ | STORE; |
| MFENCE | WHILE (1); |
| MFENCE |  |

## Basic simulations

A pair of relations

$$
\sim \in \mathbb{P}(\text { src.states } \times \text { tgt.states }) \quad>\in \mathbb{P}(\text { tgt.states } \times \text { tgt.states })
$$

is a basic simulation for compile : src.prg $\rightarrow$ tgt.prg if:

```
sim_init: }\forallp\mp@subsup{p}{}{\prime}.\operatorname{compile}(p)=\mp@subsup{p}{}{\prime}\Longrightarrow\forallt\in\operatorname{init}(\mp@subsup{p}{}{\prime}).\existss\in\operatorname{init}(p).s~
sim_end: }\forallst.s~t\wedget\not->-\Longrightarrows\not-> -
sim_step:}\forallst\mp@subsup{t}{}{\prime}ev.s~t\wedget\xrightarrow{}{ev}\mp@subsup{t}{}{\prime}\wedgeev\not=0om
ll
```

Exhibiting a basic simulation implies:
traces $($ compile $(p)) \backslash\{t \cdot$ inftau $\mid t$ trace $\} \subseteq$ traces $(p)$
"simulation can stutter forever"

## Usual approach: measured simulations

Definition 2 (Measured sim.). A measured simulation is any basic simulation $(\sim,>)$ such that $>$ is well-founded.

Theorem 1. If there exists a measured simulation for the compilation function compile, then for all programs $p$, $\operatorname{traces}(\operatorname{compile}(p)) \subseteq \operatorname{traces}(p)$.

## Simulation for FE2

$s \equiv i t$ iff thread $i$ of $s$ and $t$ have identical pc, local states and buffers
$s \sim i s^{\prime}$ iff thread $i$ of $s$ can execute zero or more nop, op, store and mfence instructions and end in the state $s^{\prime}$
$s \sim t$ iff

- t's CFG is the optimised version of s's CFG; and
$-s$ and $t$ have identical memories; and
- $\forall$ thread $i$, either $s \equiv i t$ or
the analysis for $i^{\prime} s \mathrm{pc}$ returned $\perp$ and $\exists s^{\prime}, s \sim \sim_{i} s^{\prime}$ and $s^{\prime} \equiv i t$
" $s$ is some instructions behind and can catch up"
Stutter condition:
$t>t^{\prime}$ iff $t \rightarrow t^{\prime}$ by a thread executing a nop, op, store or mfence ( and t's buffer being non-empty)


## Simulation for FE2


" $s$ is some instructions behind and can catch up"
Stutter condition:
$t>t^{\prime}$ iff $t \rightarrow t^{\prime}$ by a thread executing a nop, op, store or mfence ( and t's buffer being non-empty)

## Simulation for FE2



## Weaktau simulation

Definition 3 (Weaktau sim.). A weaktau simulation consists of a basic simulation $(\sim,>)$ with and an additional relation between source and target states, $\simeq \in \mathbb{P}$ (src.states $\times$ tgt.states $)$ satisfying the following properties:

```
sim_weaken: }\foralls,t.s~t\Longrightarrows\simeq
sim_wstep:}\forallst\mp@subsup{t}{}{\prime}.s\simeqt\wedget\xrightarrow{}{\tau}\mp@subsup{t}{}{\prime}\wedget>\mp@subsup{t}{}{\prime}
    (s\xrightarrow{->}{\tau}\xrightarrow{}{\mathrm{ fail }}\mathrm{ -) - -s reaches a failure}
\vee ( \exists s ^ { \prime } . s \xrightarrow { \tau } * \xrightarrow { \tau } s ^ { \prime } \wedge s ^ { \prime } \simeq t ^ { \prime } ) - s \text { does a matching step sequence.}
```

Theorem 2. If there exists a weaktau-simulation ( $\sim,>, \simeq$ ) for the compilation function compile, then for all programs $p$, $\operatorname{traces}(\operatorname{compile}(p)) \subseteq \operatorname{traces}(p)$.

Remarks:

- Once the simulation game moves from $\sim$ to $\simeq$, stuttering is forbidden;
- Can view difference between ~ and $\simeq$ as a boolean prophecy variable.


## Weaktau simulation for FE2

$s \sim t, t>t^{\prime}$ as before.
$s \simeq t$ iff

- t's CFG is the optimised version of s's CFG; and
$-\forall i, \exists s^{\prime}$ s.t. $s \sim i s^{\prime} \equiv i t$.
(i.e., same as $s \sim t$ except that the memories memories are unrelated.)


## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.

It would be nice to hoist those fences out of the loop.

## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.
 Do you perform PRE?

## A closer look at the RTL



Patterns like that on the left are common.

FE1 and FE2 do not optimise these patterns.

It would be nice...acout of the loop. Do you perform PRE?
...adding a fence is always safe...

## Partial redundancy elimination



## Evaluation of the optimisations

- Insert MFENCEs before every read (br), or after every write (aw).
- Count the MFENCE instructions in the generated code.

|  | br | br+FE1 | aw | aw+FE2 | aw+PRE+FE2 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Dekker | 3 | 2 | 5 | 4 | 4 |
| Bakery | 10 | 2 | 4 | 3 | 3 |
| Treiber | 5 | 2 | 3 | 1 | 1 |
| Fraser | 32 | 18 | 19 | 12 | 11 |
| TL2 | 166 | 95 | 101 | 68 | 68 |
| Genome | 133 | 79 | 62 | 41 | 41 |
| Labyrinth | 231 | 98 | 63 | 42 | 42 |
| SSCA | 1264 | 490 | 420 | 367 | 367 |

## Evaluation of the optimisations

- Insert MFENCEs before every read (br), or after every write (aw).


