JADE ALGLAVE, Microsoft Research and University College London
DANIEL KROENING, University of Oxford
VINCENT NIMAL, Microsoft Research
DANIEL POETZL, University of Oxford
Modern architectures rely on memory fences to prevent undesired weakenings of memory consistency. As the fences’ semantics may be subtle, the automation of their placement is highly desirable. But precise methods for restoring consistency do not scale to deployed systems code. We choose to trade some precision for genuine scalability: our technique is suitable for large code bases. We implement it in our new musketeer tool, and report experiments on more than 700 executables from packages found in Debian GNU/Linux 7.1, including memcached with about 10,000 LoC.
1. INTRODUCTION
Concurrent programs are hard to design and implement, especially when running on multiprocessor architectures. Multiprocessors implement weak memory models, which feature e.g. instruction reordering and store buffering (both appearing on x86), or store atomicity relaxation (a particularity of Power and ARM). Hence, multiprocessors allow more behaviours than Lamport's Sequential Consistency (SC) [Lamport 1979], a theoretical model where the execution of a program corresponds to an interleaving of the operations executed by the different threads. This has a dramatic effect on programmers, most of whom learned to program with SC.

Fortunately, architectures provide special fence (or barrier) instructions to prevent certain behaviours. Yet both the questions of where and how to insert fences are contentious, as fences are architecture-specific and expensive in terms of runtime. Attempts at automatically placing fences include Visual Studio 2013, which offers an option to guarantee acquire/release semantics (we study the performance impact of this policy in Section 2). The C++11 standard provides an elaborate API for inter-thread communication, giving the programmer some control over which fences are used, and where. But the use of such APIs might be a hard task, even for expert programmers. For example, Norris and Demsky [2013] reported a bug found in a published C11 implementation of a work-stealing queue.

We address here the question of how to synthesise fences, i.e. automatically place them in a program to enforce robustness/stability [Bouajjani et al. 2011; Alglave and Maranget 2011], which implies SC. This should lighten the programmer's burden. The fence synthesis tool needs to be based on a precise model of weak memory. In verification, models commonly adopt an operational style, where an execution is an interleaving of transitions accessing the memory (as in SC). To address weaker architectures, the models are augmented with buffers and queues that implement the features of the
This work is supported by SRC 2269.002, EPSRC H017585/1 and ERC 280053.
hardware. Similarly, a good fraction of the fence synthesis methods, e.g. Linden and Wolper [2013], Kuperstein et al. [2010], Kuperstein et al. [2011], Liu et al. [2012], Abdulla et al. [2013], and Bouajjani et al. [2013] rely on operational models to describe executions of programs.
Challenges. Methods using operational models inherit the limitations of methods based on interleavings, e.g. the "severely limited scalability", as Liu et al. [2012] put it. Indeed, none of them scale to programs with more than a few hundred lines of code, due to the very large number of executions a program can have. Another impediment to scalability is that these methods establish if there is a need for fences by exploring the executions of a program one by one.

Finally, considering models à la Power or ARM makes the problem significantly more difficult. Intel x86 offers only one fence (mfence), but Power offers a variety of synchronisation mechanisms: fences (e.g. sync and lwsync) and dependencies (address, data, or control). This diversity makes the optimisation more subtle: one cannot simply minimise the number of fences, but rather has to consider the costs of the different synchronisation mechanisms; for instance, it might be cheaper to use one full fence than four dependencies.
Our approach. We tackle these challenges with a static approach. Our choice of model almost mandates this approach: we rely on the axiomatic semantics of Alglave et al. [2010]. We feel that an axiomatic semantics is an invitation to build abstract objects that embrace all the executions of a program. Previous works, e.g. [Shasha and Snir 1988; Alglave and Maranget 2011; Bouajjani et al. 2011; Bouajjani et al. 2013], show that weak memory behaviours boil down to the presence of certain cycles, called critical cycles, in the executions of the program. A critical cycle essentially represents a minimal violation of SC, and thus indicates where to place fences to restore SC. We detect these cycles statically, by exploring an over-approximation of the executions of the program.
Contributions. We describe below the contributions of our paper:
— A self-contained introduction to axiomatic memory models, including a detailed account of the special shapes of critical cycles (Section 4).
— A fence inference approach, based on finding critical cycles in the abstract event graph (aeg) of a program (Section 5), and then computing via a novel integer linear programming formulation a minimal set of fences to guarantee sequential consistency (Section 6). The approach takes into account the different costs of fences, and is sound for a wide range of architectures, including x86-TSO, Power, and ARM.
— The first formal description of the construction of aegs (Section 5.4), and a correctness proof showing that the aeg does capture all potential executions of the analysed program (Section 5.5). This includes a description of how to correctly use overapproximate points-to information during the construction of the aeg. The aeg abstraction is not specific to fence insertion and can also be used for other program analysis tasks.
— A formalisation of the generation of the event structures and candidate executions of a program in the framework of Alglave et al. [2010] (Section 5.5). This has in previous work only been treated informally.
— An implementation of our approach in the new tool musketeer and an evaluation and comparison of our tool to others (Sections 2 and 7). Our evaluation on both classic examples (such as Dekker's algorithm) and large real-world programs from the Debian GNU/Linux distribution (such as memcached which has about 10,000 LoC) shows that our method achieves good precision and scales well.
[Figure 1: bar charts of the overhead (in %) in user time of the fencing strategies M, P, V, E, and H over the unfenced program, for four configurations: (a) stack on x86, (b) stack on ARM, (c) queue on x86, (d) queue on ARM; V is marked N/A on x86.]
Fig. 1. Overheads for the different fencing strategies.
       stack on x86        stack on ARM        queue on x86        queue on ARM
(M)    [10.059; 10.089]    [11.950; 11.973]    [12.296; 12.328]    [21.419; 21.452]
(P)    [10.371; 10.400]    [12.608; 12.638]    [13.206; 13.241]    [21.818; 21.850]
(V)    N/A                 [12.294; 12.318]    N/A                 [22.219; 22.255]
(E)    [10.989; 11.010]    [13.214; 13.259]    [13.357; 13.390]    [22.099; 22.128]
(H)    [12.788; 12.838]    [14.574; 14.587]    [15.686; 15.723]    [24.983; 25.013]
Fig. 2. Confidence intervals for the mean execution times (in sec) for the data structure experiments.
and 4 GB of RAM, and on an ARMv7 (32-bit) Samsung Exynos 4412 with 4 cores at 1.6 GHz and 2 GB of RAM. For each program version, Figure 1 gives the mean overhead w.r.t. the unfenced program. We give the overhead (in %) in user time (as given by Linux time), i.e. the time spent by the program in user mode on the CPU. Amongst the approaches that guarantee SC (i.e. all but V), the best results were achieved with our tool musketeer.

We checked the statistical significance of the execution time improvement of our method over the existing methods by computing and comparing the confidence intervals for the mean execution times. The sample size is N = 100 and the confidence level is 1 − α = 95%. The confidence intervals are given in Figure 2. If the confidence intervals for two methods are non-overlapping, we can conclude that the difference between the means is statistically significant.

As discussed later in Section 5.1, the fence insertion approaches compared in this section analyse C programs while assuming a straightforward compilation scheme to assembly in which accesses are not reordered or otherwise optimised by the compiler. Thus, sound results are only achieved when using compilation settings that guarantee these properties (e.g., gcc -O0). Nevertheless, we also compared the approaches when compiling with -O1 to get an estimate of how the different approaches would fare when allowing more compiler optimisations. We observed that on x86 the runtime decreased between 1% and 9%. On ARM the runtime decreased between 3% and 31%. The relative performance of the different approaches remained the same as with -O0, i.e., the best runtime was achieved with musketeer (M) while the approach H (fence after every access to static or heap memory) was slowest. We give the corresponding data online at http://www.cprover.org/wmm/musketeer.
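The interval computation itself is standard: with N = 100 samples, a 95% confidence interval for the mean follows from the sample mean and sample variance using the normal approximation. The following C sketch is our own illustration and not part of musketeer:

    #include <math.h>

    /* 95% confidence interval for the mean of n measured times
       (normal approximation, z = 1.96; adequate for n = 100). */
    void conf_interval(const double *times, int n, double *lo, double *hi)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++)
            mean += times[i];
        mean /= n;
        for (int i = 0; i < n; i++)
            var += (times[i] - mean) * (times[i] - mean);
        var /= (n - 1);                       /* unbiased sample variance   */
        double half = 1.96 * sqrt(var / n);   /* half-width of the interval */
        *lo = mean - half;
        *hi = mean + half;
    }

Two methods are then deemed significantly different when the resulting intervals, as in Figure 2, do not overlap.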
authors                      tool        model style    objective
Abdulla et al. [2013]        memorax     operational    reachability
Alglave et al. [2010]        offence     axiomatic      SC
Bouajjani et al. [2013]      trencher    operational    SC
Fang et al. [2003]           pensieve    axiomatic      SC
Kuperstein et al. [2010]     fender      operational    reachability
Kuperstein et al. [2011]     blender     operational    reachability
Linden and Wolper [2013]     remmex      operational    reachability
Liu et al. [2012]            dfence      operational    specification
Sura et al. [2005]           pensieve    axiomatic      SC
Abdulla et al. [2015]        persist     operational    persistence

Fig. 3. Overview of existing fence synthesis tools.
3. RELATED WORK
The work of Shasha and Snir [1988] is a foundation for much of the field of fence synthesis. Most of the work cited below inherits their notions of delay and critical cycle. A delay is a pair of instructions in a thread that can be reordered by the underlying architecture. A critical cycle essentially represents a minimal violation of sequential consistency.

Figure 3 classifies the methods that we compare to w.r.t. their style of model (operational or axiomatic). The table further indicates the objective of the fence insertion procedure: enforcing SC, preventing reachability of error states (i.e. ensuring safety properties), or other specifications (such as enforcing given orderings of memory accesses).

We report on our experimental comparison to these tools in Section 7. We now summarise the fence synthesis methods per style. We write TSO for Total Store Order, implemented in Sparc TSO [SPARC 1994] and Intel x86 [Owens et al. 2009]. We write PSO for Partial Store Order and RMO for Relaxed Memory Order, two other Sparc architectures. We write Power for IBM Power [Power 2009].
3.1. Operational models
Linden and Wolper [2013] explore all executions (using what they call automata acceleration) to simulate the reorderings occurring under TSO and PSO. Abdulla et al. [2013] couple predicate abstraction for TSO with a counterexample-guided strategy. They check if an error state is reachable; if so, they calculate what they call the maximal permissive sets of fences that forbid this error state. Their method guarantees that the fences they find are necessary, i.e., removing a fence from the set would make the error state reachable again. A precise method for PSO is presented by Abdulla et al. [2015].

Kuperstein et al. [2010] explore all executions for TSO, PSO and a subset of RMO, and along the way build constraints encoding reorderings leading to error states. The fences can be derived from the set of constraints at the error states. The same authors [Kuperstein et al. 2011] improve this exploration under TSO and PSO using an abstract interpretation they call partial coherence abstraction, relaxing the order in the write buffers after a certain bound, thus reducing the state space to explore. Meshman et al. [2014] synthesise fences for infinite-state algorithms to satisfy safety specifications under TSO and PSO. The approach works by refinement propagation: they successively refine the set of inferred fences by combining abstraction refinements of the analysed program. Liu et al. [2012] offer a dynamic synthesis approach for TSO and PSO, enumerating the possible sets of fences to prevent an execution picked dynamically from reaching an error state.
4. AXIOMATIC MEMORY MODELS
Weak memory effects can occur as follows: a thread sends a write to a store buffer, then to a cache, and finally to memory. While the write transits through buffers and caches, reads can occur before the written value is available in memory to all threads. To describe such situations, we use the framework of Alglave et al. [2010], embracing in particular SC, Sun TSO (i.e. the x86 model [Owens et al. 2009]), Power, and ARM. In this framework, a memory model is specified as a predicate on candidate executions. The predicate indicates whether a candidate execution is allowed (i.e. may occur) or disallowed (i.e. cannot occur) on the respective architecture. A candidate execution is represented as a directed graph. The nodes are memory events (reads or writes), and the edges indicate certain relations between the events. For example, a read-from (rf) edge from a write to a read indicates that the read takes its value from that write.

We illustrate this framework in the next section using a litmus test (Figure 4). A litmus test is a short concurrent program together with a condition on its final state. The given litmus test consists of two threads, which access shared variables x and y. The shared variables are assumed to be initialized to zero at the beginning. The given condition holds for an execution in which load (c) reads value 1 from y, and load (d) reads value 0 from x. Whether the given litmus test has an execution that can end up in this final state depends on the memory model. For example, the given outcome can occur on Power but not on TSO. Thus, for a given architecture, a set of litmus tests together with the information of whether the given outcome can occur on any execution characterises the architecture.
4.1. Basics
We next describe how the set of candidate executions of a program is defined. A candidate execution is obtained by first generating an event structure. An event structure E ≜ (E, po) is a set of memory events E together with the program order relation po.¹ An event is a read from memory or a write to memory, consisting of an identifier, a direction (R for read or W for write), a memory address (represented by a variable name) and a value. The program order po is a per-thread total order over E. An event structure represents an execution of the program, assuming the shared reads can return arbitrary values.

For example, Figure 5a gives an event structure associated with the litmus test in Figure 4. A store instruction (e.g. x ← 1 on T0) gives rise to a write event (e.g. (a)Wx1), and a load instruction (e.g. r1 ← y on T1) gives rise to a read event (e.g. (c)Ry1). In this particular event structure, we have assumed that the load (c) on T1 read value 1, and the load (d) on T1 read value 0, but any value for the loads (c) and (d) would give rise to a valid event structure.

An event structure can be completed to a candidate execution by adding an execution witness X ≜ (co, rf, fr). An execution witness represents the communication between the threads and consists of the three relations co, rf, and fr. The coherence relation co is a per-address total order on write events, and models the memory coherence widely assumed by modern architectures. It links a write w to any write w′ to the same address that hits the memory after w. The read-from relation rf links a write w to a read r such that r reads the value written by w. The fr relation is defined in terms of rf and co (hence we say it is a derived relation). A read r is in fr with a write w if the write w′ from which r reads hits the memory before w. Formally, we have: (r, w) ∈ fr ⟺ ∃w′. (w′, r) ∈ rf ∧ (w′, w) ∈ co.
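Since fr is derived, it can be computed directly from rf and co as the relational composition rf⁻¹; co. A minimal sketch over adjacency matrices, our own illustration rather than the paper's implementation:

    #include <stdbool.h>

    #define MAX_EVENTS 64

    /* Relations over event identifiers 0..n-1, as adjacency matrices. */
    typedef bool rel[MAX_EVENTS][MAX_EVENTS];

    /* fr = rf⁻¹ ; co : a read r is in fr with a write w iff there is a
       write w2 with (w2, r) in rf and (w2, w) in co. */
    void derive_fr(int n, rel rf, rel co, rel fr)
    {
        for (int r = 0; r < n; r++)
            for (int w = 0; w < n; w++) {
                fr[r][w] = false;
                for (int w2 = 0; w2 < n; w2++)
                    if (rf[w2][r] && co[w2][w])
                        fr[r][w] = true;
            }
    }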
¹ Our notion of event structures differs from the one previously introduced by Winskel [1986]. Winskel's event structures also contain a conflict relation in addition to a set of events and a partial order over them.
mp
T0                T1
(a) x ← 1         (c) r1 ← y
(b) y ← 1         (d) r2 ← x
Final state? r1=1 ∧ r2=0

Fig. 4. Message Passing (mp).
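For concreteness, the mp test of Figure 4 can be written as the following C program (a sketch of ours using POSIX threads, not taken from the paper). The final state r1=1 ∧ r2=0 is forbidden on SC but, as discussed above, can be observed on Power:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared variables, initialised to zero as assumed by the litmus test. */
    volatile int x = 0, y = 0;
    int r1, r2;

    void *t0(void *arg)            /* T0: (a) x <- 1 ; (b) y <- 1 */
    {
        (void)arg;
        x = 1;
        y = 1;
        return NULL;
    }

    void *t1(void *arg)            /* T1: (c) r1 <- y ; (d) r2 <- x */
    {
        (void)arg;
        r1 = y;
        r2 = x;
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* The condition of Figure 4: forbidden on SC, observable e.g. on Power. */
        printf("r1=%d r2=%d%s\n", r1, r2,
               (r1 == 1 && r2 == 0) ? "  <- weak outcome" : "");
        return 0;
    }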
[Figure 5: two graphs over the events (a)Wx, (b)Wy, (c)Ry, (d)Rx. Panel (a), the event structure, contains only the po edges from (a) to (b) and from (c) to (d); panel (b), the candidate execution, additionally contains an rf edge from (b) to (c) and an fr edge from (d) to (a).]
(a) event structure    (b) candidate execution
Fig. 5. Event structure and candidate execution.
Figure 5b shows the event structure of Figure 5a completed to a candidate execution. A candidate execution is uniquely identified by the event structure E and execution witness X. Inserting fences into a program does not change the set of its candidate executions. For example, if we placed a fence between the two stores of T0 in Figure 4, the litmus test would still have the same set of candidate executions. However, the fences do affect which of those candidate executions are possible on a given architecture.

Not all event structures can be completed to a candidate execution. For example, had we assumed that the first read in Figure 4 reads value 2, then there would be no execution witness such that the read can be matched up via rf with a corresponding write writing the same value (as there is no instruction writing value 2).

As we have mentioned earlier, a memory model is specified as a predicate on candidate executions. Such a predicate is typically formulated as an acyclicity condition on candidate executions. For example, a candidate execution (E, X) is allowed (i.e. possible) on SC if and only if acyclic(po ∪ co ∪ rf ∪ fr). This means that a candidate execution is not possible on SC if it contains at least one cycle formed of edges from po, co, rf, and fr. Consider for example the candidate execution in Figure 5b. This execution is not possible on SC as it has a cycle.
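The SC condition above is just an acyclicity test on the union of the four relations. A minimal sketch of such a test (our own illustration, using a DFS over an adjacency matrix that already holds the union of po, co, rf, and fr):

    #include <stdbool.h>

    #define MAX_EVENTS 64

    /* DFS colours: 0 = unvisited, 1 = on the current path, 2 = done. */
    static int colour[MAX_EVENTS];

    static bool has_cycle_from(bool edge[][MAX_EVENTS], int n, int u)
    {
        colour[u] = 1;
        for (int v = 0; v < n; v++) {
            if (!edge[u][v]) continue;
            if (colour[v] == 1) return true;             /* back edge: cycle */
            if (colour[v] == 0 && has_cycle_from(edge, n, v)) return true;
        }
        colour[u] = 2;
        return false;
    }

    /* acyclic(po ∪ co ∪ rf ∪ fr): edge[u][v] holds iff (u, v) is in the union. */
    bool sc_allows(bool edge[][MAX_EVENTS], int n)
    {
        for (int i = 0; i < n; i++) colour[i] = 0;
        for (int i = 0; i < n; i++)
            if (colour[i] == 0 && has_cycle_from(edge, n, i))
                return false;        /* a cycle exists: forbidden on SC */
        return true;
    }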
4.2. Minimal cycles
Any execution that has a cycle (i.e., a cycle formed of edges in po ∪ rf ∪ co ∪ fr) also has a minimal cycle. Given a candidate execution (E, X), a minimal cycle is a cycle such that
(MC1) per thread, there are at most two accesses, and the accesses are adjacent in the cycle; and
(MC2) for a memory location ℓ, there are at most three accesses to ℓ along the cycle, and the accesses are adjacent in the cycle.
The reason for (MC1) is that the po relation is transitive. That is, given a cycle with more than two accesses for a thread, the po edge from the first to the last access (according to po) forms a chord in the cycle. This chord can be used to bypass the other accesses of that thread, yielding a shorter cycle.
        SC        x86       Power
poWR    always    mfence    sync
poWW    always    always    sync or lwsync
poRW    always    always    sync or lwsync or dp
poRR    always    always    sync or lwsync or dp or branch;isync

Fig. 7. ppo and fences per architecture.
Gharachorloo 1995]), we relax the external read-from rfe, and call the corresponding write non-atomic. This is the main particularity of Power and ARM, and cannot happen on TSO/x86. Some program-order pairs may be relaxed as well (e.g. write-read pairs on x86), i.e. only a subset of po is guaranteed to occur in order. We call this subset the preserved program order, ppo. When a relation is not relaxed on a given architecture, we call it safe.

Figure 7 summarises ppo per architecture. The columns are architectures, e.g. x86, and the lines are relations, e.g. poWR. We write e.g. poWR for the program order between a write and a read. We write "always" when the relation is in the ppo of the architecture: e.g. poWR is in the ppo of SC. When we write something else, typically the name of a fence, e.g. mfence, the relation is not in the ppo of the architecture (e.g. poWR is not in the ppo of x86), and the fence can restore the ordering: e.g. mfence maintains write-read pairs in program order.

Following Alglave et al. [2010], the relation fence (with fence ⊆ po, for some concrete architecture-specific fence) induced by a fence is non-cumulative when it only orders certain pairs of events surrounding the fence. The relation fence is cumulative when it additionally makes writes atomic, e.g. by flushing caches. In our model, this amounts to making sequences of external read-from and fences (rfe; fence or fence; rfe) safe, but rfe alone would not be safe. In Figure 4, placing a cumulative fence between the two writes on T0 will not only prevent their reordering, but also enforce an ordering between the write (a) on T0 and the read (c) on T1, which reads from T0 (in Figure 5b).
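Read as a lookup, Figure 7 maps a program-order pair and an architecture to the synchronisation (if any) that restores the ordering. A small sketch of such a lookup (our own encoding; the strings follow Figure 7):

    typedef enum { PO_WR, PO_WW, PO_RW, PO_RR } po_pair;
    typedef enum { ARCH_SC, ARCH_X86, ARCH_POWER } arch;

    /* Which mechanism keeps the pair in order, following Figure 7;
       "always" means the pair is already in the ppo of the architecture. */
    const char *restore(arch a, po_pair p)
    {
        if (a == ARCH_SC)  return "always";
        if (a == ARCH_X86) return (p == PO_WR) ? "mfence" : "always";
        switch (p) {                      /* Power */
        case PO_WR: return "sync";
        case PO_WW: return "sync or lwsync";
        case PO_RW: return "sync or lwsync or dp";
        default:    return "sync or lwsync or dp or branch;isync";
        }
    }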
Architectures. An architecture A determines the relations safe (i.e., not relaxed) on A. We always consider the coherence co, the from-read relation fr and the fence relations to be safe. SC relaxes nothing, i.e. also rf and po are safe. For example, TSO authorises the reordering of write-read pairs (relaxing po edges from a write event to a read event, i.e. poWR) and store buffering (relaxing rfi edges). Thus, the TSO memory model can be phrased as acyclic((po \ poWR) ∪ mfence ∪ co ∪ rfe ∪ fr). We refer to Alglave et al. [2014] for a description of the Power memory model.

All models we handle satisfy the SC per location property. That is, the edges that are part of cycles that consist of events that access only a single memory location are never relaxed. Formally, we have acyclic(po-loc ∪ co ∪ rf ∪ fr), with po-loc restricted to the po edges between events that access the same memory location. This property models the memory coherence provided by modern architectures. We illustrate it with a litmus test in Figure 8. The two threads access the memory location x, which is assumed to be 0 at the beginning. The condition models whether it is possible for T0 to first read the new value 1, and then read the old value 0. The corresponding candidate execution is depicted on the right. The execution has a cycle that consists of only one memory location. Therefore, this execution is not possible as the edges in such cycles are never relaxed.
corr
T0                T1
(a) r1 ← x        (c) x ← 1
(b) r2 ← x
Final state? r1=1 ∧ r2=0

[Candidate execution: events (a)Rx, (b)Rx, (c)Wx, with po from (a) to (b), rf from (c) to (a), and fr from (b) to (c).]

Fig. 8. Read-read coherence.
4.4. Critical cycles
Following [Shasha and Snir 1988; Alglave and Maranget 2011], for an architecture A, a delay is a po or rf edge that is not safe (i.e. is relaxed) on A. A candidate execution (E, X) is valid on A yet not on SC iff
(DC1) it contains at least one cycle that contains a delay, and
(DC2) all cycles in it contain a delay.
If there were a cycle that did not contain a delay, then the execution would be invalid both on SC and A. If there were no cycles at all in the execution, then the execution would be valid on both SC and A.

To enforce SC on a weaker architecture A, we need to insert memory fences into the program so as to disallow the candidate executions that satisfy properties DC1 and DC2. That is, we need to insert fences such that for each such candidate execution at least one cycle is not relaxed. It is not necessary to disallow all cycles in a candidate execution, as ensuring that one cycle is not relaxed is sufficient to disallow an execution.
Critical cycles. Any candidate execution that satisfies properties DC1 and DC2 has a cycle, and thus also has a minimal cycle (see Section 4.2). By DC2, the minimal cycle also contains a delay. We refer to such cycles as critical cycles. Formally, a critical cycle for architecture A is a cycle which has the following characteristics:
(CS1) the cycle contains at least one delay for A;
(CS2) per thread, there are at most two accesses, the accesses are adjacent in the cycle, and the accesses are to different memory locations; and
(CS3) for a memory location ℓ, there are at most three accesses to ℓ along the cycle, the accesses are adjacent in the cycle, and the accesses are from different threads.
Thus, a critical cycle is a minimal cycle for which it additionally holds that (a) it has at least one delay, (b) the accesses of a single thread are to different memory locations, and (c) the accesses to a given location ℓ are from different threads. In fact, together with the properties MC1 and MC2 of minimal cycles, property (a) implies properties (b) and (c), as we show in the next two paragraphs.

To see that (b) holds, assume we have a minimal cycle for which (a) holds but not (b). Thus, there is a thread in the cycle for which its two accesses are to the same location. Then either (i) there is no additional access in the cycle, or (ii) there is an additional access in the cycle. (i) We have a cycle of length 2. This cycle must involve the two accesses (which are to the same memory location). Such cycles are never relaxed due to memory coherence. The cycle thus does not contain a delay. But this contradicts the initial assumption that the cycle has a delay. Therefore, (b) must hold. (ii) By MC2, this location can occur at most three times in the cycle. The third access must be by a different thread. Since communication edges are always between events operating on the same memory location, it follows that we have a cycle of length 3 that mentions only this single location; again, such a cycle is never relaxed due to memory coherence, contradicting (a).
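To make the conditions CS1, CS2, and CS3 concrete, the following C sketch (our own illustration, not musketeer's implementation) checks them on a cycle given as an array of accesses, where the edge leaving access i goes to access (i+1) mod n and delay edges are assumed to be pre-marked on the accesses they leave:

    #include <stdbool.h>

    /* One access on a cycle c[0..n-1]; the cycle's edges are c[i] -> c[(i+1) % n]. */
    typedef struct {
        int  thread;    /* issuing thread                                    */
        int  addr;      /* memory location accessed                          */
        bool delay;     /* true iff the edge leaving this access is a delay  */
    } access;

    /* Are the sorted cycle positions idx[0..k-1] consecutive modulo n,
       i.e. adjacent in the cycle? */
    static bool adjacent(const int *idx, int k, int n)
    {
        if (k <= 1 || k == n) return true;
        int gaps = 0;
        for (int j = 0; j < k; j++)
            if ((idx[j] + 1) % n != idx[(j + 1) % k]) gaps++;
        return gaps == 1;
    }

    bool is_critical(const access *c, int n)
    {
        /* CS1: the cycle contains at least one delay. */
        bool has_delay = false;
        for (int i = 0; i < n; i++) has_delay |= c[i].delay;
        if (!has_delay) return false;

        /* CS2: per thread, at most two accesses, adjacent, to different locations. */
        for (int i = 0; i < n; i++) {
            int idx[3], k = 0, count = 0;
            for (int j = 0; j < n; j++)
                if (c[j].thread == c[i].thread) { if (k < 3) idx[k++] = j; count++; }
            if (count > 2) return false;
            if (count == 2 && (!adjacent(idx, 2, n) || c[idx[0]].addr == c[idx[1]].addr))
                return false;
        }

        /* CS3: per location, at most three accesses, adjacent, from different threads. */
        for (int i = 0; i < n; i++) {
            int idx[4], k = 0, count = 0;
            for (int j = 0; j < n; j++)
                if (c[j].addr == c[i].addr) { if (k < 4) idx[k++] = j; count++; }
            if (count > 3) return false;
            if (!adjacent(idx, count, n)) return false;
            for (int a = 0; a < count; a++)
                for (int b = a + 1; b < count; b++)
                    if (c[idx[a]].thread == c[idx[b]].thread) return false;
        }
        return true;
    }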
framework. A goto-program is a sequence of goto-instructions, and closely mirrors the C program from which it was generated. We refer to http://www.cprover.org/goto-cc for further details.

The C program in Figure 9 features two threads which can interfere. The first thread writes the argument "input" to x, then randomly writes 1 to y or reads z, and then writes 1 to x. The second thread successively reads y, z and x. In the corresponding goto-program, the if-else structure has been transformed into a guard with the condition of the if followed by a goto construct. From the goto-program, we then compute an abstract event graph (aeg), given in Figure 10(a). The events a, b1, b2, and c (resp. d, e, and f) correspond to thread 1 (resp. thread 2) in Figure 9. We only consider accesses to shared variables, and ignore the local variables. We finally explore the aeg to find the potential critical cycles.

An aeg represents all the candidate executions of a program (in the sense of Section 4). Figure 10(b) and (c) give two executions associated with the aeg given in Figure 10(a). For readability, the transitive po edges have been omitted (e.g. between the two events d′ and f′). The concrete events that occur in an execution are shown in bold. In an aeg, the events do not have concrete values, whereas in an execution they do. Also, an aeg merely indicates that two accesses to the same variable could form a data race (see the competing pairs (cmp) relation in Figure 10(a), which is a symmetric relation), whereas an execution has oriented relations (e.g. indicating the write that a read takes its value from, see e.g. the rf arrow in Figure 10(b) and (c)). The execution in Figure 10(b) has a critical cycle (with respect to e.g. Power) between the events a′, b′2, d′, and f′. The execution in Figure 10(c) does not have a critical cycle.

We build an aeg essentially as in [Alglave et al. 2013]. However, our goal and theirs differ: they instrument an input program to reuse SC verification tools to perform weak memory verification, whereas we are interested in automatic fence placement. Moreover, the work of [Alglave et al. 2013] did not present a semantics of goto-programs in terms of aegs, or a proof that the aeg does encompass all potential executions of the program, both of which we do in this section.
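Figure 9 itself is not reproduced in this excerpt; the following C fragment is our own sketch, consistent with the description above (the comments refer to the abstract events of Figure 10(a)):

    #include <stdlib.h>

    int x, y, z;                         /* shared variables                  */

    void thread_1(int input)
    {
        x = input;                       /* (a)  Wx                           */
        if (rand() % 2)
            y = 1;                       /* (b1) Wy                           */
        else {
            int r = z;                   /* (b2) Rz                           */
            (void)r;
        }
        x = 1;                           /* (c)  Wx                           */
    }

    void thread_2(void)
    {
        int r1 = y;                      /* (d)  Ry                           */
        int r2 = z;                      /* (e)  Rz                           */
        int r3 = x;                      /* (f)  Rx                           */
        (void)r1; (void)r2; (void)r3;
    }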
5.2. Points-to information
In the previous section, we have denoted abstract events as, e.g., (a)Wx, with x being an address specifier denoting a shared variable in the program. However, shared memory accesses are often performed via pointer expressions. We thus use a pointer analysis to compute which memory locations an expression in the program (such as a[i+1] or *p) might access. The pointer analysis we use is a standard concurrent points-to analysis that we have shown to be sound for our weak memory models in earlier work [Alglave et al. 2011].² The analysis computes for each memory access expression in the goto-program an abstraction of the set of memory locations potentially accessed.

The result of the pointer analysis is for each memory access expression either a set of address specifiers {s1, ..., sn} (denoting that the expression might access any of the memory locations associated with the specifiers, where no si is the special specifier), or a singleton set containing the special address specifier (denoting that the expression might access any memory location). An address specifier si other than the special one might refer to a single memory location or a region of memory locations. A specifier that refers to a region of memory is for example returned for accesses to arrays. That is, expressions a[i] and a[j] accessing
² As pointed out by a reviewer, for our fence insertion approach it may be sufficient to use a pointer analysis that is sound for SC but not for weaker models. While this is not the case for all memory models that could be expressed in the framework of Alglave et al. [2010], it may hold for the models we consider in this paper. A proof of this conjecture remains as future work.
[Figure 10: (a) the aeg of the program in Figure 9, over the abstract events (a)Wx, (b1)Wy, (b2)Rz, (c)Wx, (d)Ry, (e)Rz, (f)Rx, with pos edges within each thread and cmp edges (a, f), (b1, d), and (c, f); (b) an execution with a critical cycle (po, rf, fr, and co edges over the primed events); (c) an execution without a critical cycle.]
Fig. 10. The aeg corresponding to the program in Figure 9 and two executions corresponding to it.
a global array a, with i ≠ j, would both be mapped to the same specifier a, denoting the region of memory associated with the array a.

We say that a concrete memory location m and an address specifier s are compatible, written comp(m, s), if m is in the set of memory locations abstracted over by s. For example, comp(m, s) holds for any memory location m when s is the special specifier. As another example, if m refers to a location in array a, and specifier s represents that array, then comp(m, s). Given two address specifiers s1, s2, we similarly write comp(s1, s2) when the intersection between the set of memory locations abstracted by s1 and the set of memory locations abstracted by s2 is non-empty. For example, comp(s1, s2) holds for any address specifier s1 when s2 is the special specifier.

Consequently, instead of being associated with a concrete memory location, abstract events have a single address specifier si (which can also be the special specifier). Thus, if the pointer analysis yields a set of address specifiers {s1, ..., sn} for an expression, the aeg will contain a static event for each. We provide more details in the following sections.
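A sketch of the compatibility checks (our own encoding, not the paper's: we model the special "any location" specifier by a flag and a region by a half-open address range, with size 1 for a scalar):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* An address specifier: either the special "any location" specifier,
       or a region [base, base + size) of memory locations. */
    typedef struct {
        bool      any;     /* the special specifier */
        uintptr_t base;
        size_t    size;
    } specifier;

    /* comp(m, s): does specifier s cover the concrete location m? */
    bool comp_loc(uintptr_t m, specifier s)
    {
        return s.any || (m >= s.base && m < s.base + s.size);
    }

    /* comp(s1, s2): do the location sets of s1 and s2 overlap? */
    bool comp_spec(specifier s1, specifier s2)
    {
        if (s1.any || s2.any) return true;
        return s1.base < s2.base + s2.size && s2.base < s1.base + s1.size;
    }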
5.3. Formal Definition of Abstract Event Graphs
Given a goto-program (such as the one on the right of Figure 9), we build an aeg ≜ (Es, pos, cmp), where Es is the set of abstract events, pos is the static program order, and cmp are the competing pairs. Given an aeg G, we write respectively G.Es, G.pos and G.cmp for the abstract events, the static program order and the competing pairs of G. The aeg for the program on the right of Figure 9 is given in Figure 10(a).
Abstract events. An abstract event represents all events with the same program point, direction (write or read), and compatible memory location. An abstract event consists of first a unique identifier, then the direction specifier (W or R), and then the address specifier. In Figure 10(a), (a)Wx abstracts the events (a′)Wx1 and (a′′)Wx2 in the executions of Figure 10(b) and (c). Moreover, for example, a static event (a)W whose address specifier is the special specifier would also abstract the two events, as the special specifier is compatible with any memory location. We write addr(e) for the address specifier of an abstract event e.
Static program order. The static program order relation pos abstracts all the (dynamic) po edges that connect two events in program order and that cannot be decomposed as a succession of po edges in this execution. We write pos+ (resp. pos*) for the transitive (resp. reflexive-transitive) closure of this relation. We also write begin(pos) and end(pos) to denote respectively the sets of the first and last abstract events of pos. That is, if we imagine the pos relation as a directed graph, then begin(pos) contains the abstract events in pos that do not have incoming edges, and end(pos) contains the abstract events that do not have outgoing edges.
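As an illustration (our own data layout, not musketeer's), an aeg and the begin(pos)/end(pos) helpers could be represented as follows, with the address specifier reduced to an integer for brevity:

    #include <stdbool.h>

    #define MAX_AE 128

    typedef struct {
        char dir;                           /* 'W' or 'R' (or a fence marker) */
        int  addr;                          /* address specifier              */
    } abstract_event;

    typedef struct {
        int n;                              /* number of abstract events      */
        abstract_event ev[MAX_AE];          /* Es                             */
        bool pos[MAX_AE][MAX_AE];           /* static program order           */
        bool cmp[MAX_AE][MAX_AE];           /* competing pairs (symmetric)    */
    } aeg;

    /* begin(pos): abstract events with no incoming pos edge. */
    int begin_pos(const aeg *g, int *out)
    {
        int k = 0;
        for (int v = 0; v < g->n; v++) {
            bool incoming = false;
            for (int u = 0; u < g->n; u++) incoming |= g->pos[u][v];
            if (!incoming) out[k++] = v;
        }
        return k;
    }

    /* end(pos): abstract events with no outgoing pos edge. */
    int end_pos(const aeg *g, int *out)
    {
        int k = 0;
        for (int u = 0; u < g->n; u++) {
            bool outgoing = false;
            for (int v = 0; v < g->n; v++) outgoing |= g->pos[u][v];
            if (!outgoing) out[k++] = u;
        }
        return k;
    }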
³ We denote the function composition operator by ◦.
[Figure 12: the aeg transformers for (6) assume/assert/skip, which are bypassed like skip; (7) atomic sections atomic begin; i1; atomic end; i2, where a full fence f is added in pos after the beginning of the section and another before its end; (8) thread creation start thread th; i, where the abstract events and static program order of the thread body and of the continuation are combined, and cmp edges are added between the body's events and the events of interfering threads; and (9) end of thread, which yields the empty aeg.]
Fig. 12. Operations to create the aeg of a goto-program (continued).
Competing pairs. The external communications coe ∪ rfe ∪ fre are over-approximated by the competing pairs relation cmp. In Figure 10(a), the cmp edges (a, f), (b1, d), and (c, f) abstract in particular the fre edges (f′, c′) and (f′, a′), and the rfe edge (b′1, d′) in Figure 10(b). We do not need to represent internal communications as they are already covered by pos+. The cmp construction is similar to the first steps of static data race detection (see e.g. [Kahlon et al. 2009, Sec. 5]), where statements involved in write-read or write-write communications are collected. As further work, we could reduce the set of competing pairs using a synchronisation analysis, as in e.g. [Sura et al. 2005]. If we assume the correctness of locks for example, some threads might never interfere.
C fragment:
    int local1 = x;
    local1 = local1 ^ local1;
    int local2 = *(&y + local1);

ARM assembly:
    mov   r, #x
    ldr   r1, [r, #0]
    eors  r1, r1, r1
    mov   r2, #y
    ldr   r3, [r2, r1]

Fig. 13. A fragment of a C program with two shared variables x and y, and a possible straightforward translation to ARM assembly.
Fences. In the aeg, we encode memory fences as special abstract events, i.e., as nodes in the graph. We write f for a full fence (e.g., mfence in x86, sync in Power, dmb in ARM), lwf for a lightweight fence (e.g., lwsync in Power), and cf for a control fence (e.g. isync in Power, isb in ARM). In [Alglave et al. 2010], fences are modelled as a relation fenced between concrete events. We did not use this approach, since a fence can then correspond to several edges and we would need an additional relation over our abstract event relations to keep track of the placement of fences. The effect of fences is interpreted during the cycle search in the aeg.

Dependencies form a relation between abstract events that is maintained outside of the aeg. They are calculated in musketeer from the input program on the C level.⁴ Dependencies relate two accesses to shared memory at the assembly level via registers. As we analyse C programs and we make no assumption regarding the machine or the compiler used, neither the assembly translation nor the use of registers between shared memory accesses is unique or provided to us. For example, there is a dependency between x and y in the C fragment presented in Figure 13, since a straightforward translation to assembly could generate a dependency. For example, in ARM assembly, we can translate this C program to the assembly code in Figure 13, where a dependency by address between x and y was intentionally placed via the register r1. r1 always holds 0 after the interpretation of the exclusive disjunction on the value of x, and this 0 is added to the pointer to y before being dereferenced in register r3. Processors ignore that the value is always 0 and enforce a dependency. Compiler optimisations can, however, remove these dependencies. In the tool musketeer, we provide the option --no-dependencies which safely ignores all these calculated dependencies in the aeg. Under this option, dependencies or fences might be spuriously inserted in places where an actual dependency already exists. The same consideration applies to dependency generation, as we will treat in detail in Subsection 7.1.
5.4. Constructing Abstract Event Graphs from C Programs
We define a semantics of goto-programs in terms of abstract events, static program order and competing pairs. We give this semantics below by means of a case split on the type of the goto-instructions. Each of these cases is accompanied in Figures 11 and 12 by a graphical representation summarising the aeg construction on the right-hand side, and a formal definition of the semantics on the left-hand side. We assume that forward jumps are unconditional and that backward jumps are conditional. Conditional forward jumps can be "simulated" by those two instructions. The construction of the aeg from a goto-program is implemented by means of a case split over the type of goto-instruction. The algorithm is outlined in Figure 15 (left side). We give further details about the algorithm in Section 5.7.
⁴ We keep track of the relations between local and shared variables per thread to calculate a dependency relation.
Assumption, assertion, skip. Similarly to the guarded statement, as we cannot evaluate the condition, we abstract the assumptions and assertions by bypassing them. They are thus handled the same as the skip statement.
Atomic sections. The atomic sections in a goto-program model idealised atomic sections without having to rely on the correctness of their implementation. In the CProver framework, we use the constructs __CPROVER_atomic_begin and __CPROVER_atomic_end. Atomic sections are used in many theoretical concurrency and verification works. For example, we use them (see Section 7) for copying data to atomic structures, as in e.g. our implementation of the Chase-Lev queue [Chase and Lev 2005], or for implementing compare-and-swaps, as in e.g. our implementation of Michael and Scott's queue [Michael and Scott 1996].

In this work we overapproximate atomic sections by only considering the effect on memory ordering of entering and leaving a critical section. We do not model the atomicity aspect of atomic sections. Also handling this aspect could improve performance and precision, as fewer (spurious) interferences would have to be considered. We model atomic sections by placing two full fences, written f in Figure 12, right after the beginning of the section and just before the end of the section.
Construction of cmp. We also compute the competing pairs that abstract external communications between threads. For each abstract event with address specifier s, we augment the cmp relation with pairs made of this abstract event and abstract events from an interfering thread with a compatible address specifier s′ (i.e. the sets of memory locations represented by s and s′ overlap). One of the two accesses needs to be a write. These cmp edges abstract the relations coe, fre, and rfe. In Figure 12, we use this operation to construct the cmp edges when a new interfering thread is spawned, that is, when the goto-instruction start thread is met. On two sets of abstract events A and B, the operation yields {(a, b) ∈ A × B | comp(addr(a), addr(b)) ∧ (write(a) ∨ write(b))}. We write ¯∅ for the triple (∅, ∅, ∅) representing the empty aeg.
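A self-contained sketch of this cmp construction (our own encoding; address specifiers are treated as atoms, with a distinguished value standing for the special "any location" specifier):

    #include <stdbool.h>

    #define MAX_AE  128
    #define ANY_LOC (-1)     /* our encoding of the special "any location" specifier */

    typedef struct { int id; int addr; bool is_write; } abs_event;

    static bool comp(int s1, int s2)   /* compatibility of two address specifiers */
    {
        return s1 == ANY_LOC || s2 == ANY_LOC || s1 == s2;
    }

    /* cmp construction at thread creation: a pair of events, one from the new
       thread's set A and one from an interfering thread's set B, competes if
       their specifiers are compatible and at least one of the two is a write. */
    void add_cmp(const abs_event *A, int nA, const abs_event *B, int nB,
                 bool cmp_rel[MAX_AE][MAX_AE])
    {
        for (int i = 0; i < nA; i++)
            for (int j = 0; j < nB; j++)
                if (comp(A[i].addr, B[j].addr) && (A[i].is_write || B[j].is_write)) {
                    cmp_rel[A[i].id][B[j].id] = true;   /* cmp is symmetric */
                    cmp_rel[B[j].id][A[i].id] = true;
                }
    }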
5.5. Correctness of the aeg construction
In this section, we explain why the aeg constructed with the transformers of Figures 11 and 12 captures all the candidate executions of a program. As a consequence, if we search for cycles in the aeg we can find all corresponding cycles that might occur in a concrete execution.

Given a goto-program, we want to show that for any event structure that could be derived from it and for any candidate execution valid for this event structure, the sets of events (E) and all the relations (po, rf, fr, co) are contained (in a sense defined in the following paragraph) in their static counterparts Es, pos+, and cmp. We need here the transitive closure of pos as the po relation of a concrete execution is transitive, but the static program order relation pos of an aeg is not transitive.

We refer in the following to Es as the set of static events, and to E as the set of dynamic events. Similarly, we refer to pos as the static program order, and to po as the dynamic program order. The reason why we choose the terms static and dynamic is that Es, pos, and cmp are computed by our static analysis, whereas E, po, rf, fr, and co refer to the events and relations that represent a concrete candidate execution. We further use the symbols se and srel to refer to a set of static events and a static relation, respectively.

To map the static events to dynamic events, and relations over static events to relations over dynamic events, we define γ : ℘(Es × Es) → ℘(E × E), which concretises a static relation as the union of all the dynamic relations that it could correspond to, and γe : ℘(Es) → ℘(E), which concretises a set of static events as the union of all the sets of dynamic events that they could correspond to.⁵
[Figure 14: a C program is compiled by goto-cc to a goto-program; the goto-program is mapped, as its semantics, to a set of candidate executions (exec, Section 4) and, syntactically, to an aeg (Section 5); the soundness relation between the aeg and the candidate executions is what remains to prove.]
Fig. 14. From C programs to aegs and candidate executions.
These two functions are a formalisation of how we interpret aegs.
More formally, we define γe(se) ≜ {e′ | ∃e ∈ se s.t. comp(addr(e), addr(e′)) ∧ dir(e) = dir(e′) ∧ origin(e) = origin(e′)}, where origin(a) returns the syntactical object in the source code from which either a concrete or abstract event was extracted.⁶ In this definition e′ is a concrete event whereas e is a static event. The function comp(addr(e), addr(e′)) indicates whether the memory address accessed by e′ and the address specifier of e are compatible (i.e. whether the result addr(e) computed by the pointer analysis includes the memory address addr(e′); see Section 5.2 for details). The function dir() (for direction) indicates whether the static event or event given is a read or a write. Thus, dir(e) = dir(e′) requires that the static event and the concrete event are either both reads or both writes. We use γe to define γ: γ(srel) ≜ {(c1, c2) | ∃(s1, s2) ∈ srel s.t. (c1, c2) ∈ γe({s1}) × γe({s2})}. In this definition c1, c2 are concrete events and s1, s2 are static events. We next show a lemma about γe and γ that we will use later on.
LEMMA 5.1. Let E1 ⊆ γe(Es,1) and E2 ⊆ γe(Es,2). Then E1 × E2 ⊆ γ(Es,1 × Es,2).
PROOF. Let (c1, c2) ∈ E1 × E2. Then c1 ∈ γe(Es,1) and c2 ∈ γe(Es,2). Using the definition of γe, there are s1 ∈ Es,1 and s2 ∈ Es,2 such that c1 ∈ γe({s1}) and c2 ∈ γe({s2}). Then, since (s1, s2) ∈ Es,1 × Es,2, it follows by the definition of γ that (c1, c2) ∈ γ(Es,1 × Es,2).
Figure 14 shows the relationship between goto-programs, candidate executions, and aegs. In our proof, we show that for a given program, any execution of that program has its events and relations contained in the γe and γ images of the corresponding aeg.
⁵ We usually use γe with a singleton set (containing one static event) as an argument.
⁶ The function origin is dual to evts, which returns events given an expression or instruction.