Execution Model

To better understand memory_order, we first introduce the C++ abstract machine and the execution model.

Abstract Machine

  • C++ programs describe operations that are performed on the abstract machine. C++ implementations define the characteristics of the abstract machine and translate operations on the abstract machine into operations on the system.

Storage Model

  • Storage is flat; there is no notion of hierarchy (caches, etc.). Objects reside in storage at a single memory location.
  • Some objects may not have a unique memory location:
    • Eligible empty base classes
    • Objects marked [[no_unique_address]]
  • An implementation is allowed to store two objects at the same machine address or not store an object at all.
  • An object cannot have more than one memory location.
  • Every thread in a program can potentially access every object and function in a program.

Threads of Execution

  • A thread of execution is a single flow of control in a program which evaluates a function call; threads may run concurrently.
  • Variables with static storage duration are initialized as a consequence of program initiation.
  • Variables with thread storage duration are initialized as a consequence of thread execution.

Expression

  • An expression is a sequence of operators and operands that specifies a computation.
  • Subexpressions are a part of a larger expression.
  • Full expressions are not subexpressions.
  • Full expressions may include subexpressions that are not lexically part of the expression.

Evaluations

  • Evaluation of an expression includes value computations and the initiation of side effects.
  • Side effects change the environment:
    • Reading a volatile object or modifying any object.
    • Calling a library I/O function.
    • Calling a function that does any of the above.
  • Value computations are pure and have no observable effect.
  • Completion of the execution of an evaluation does not imply completion of its side effects.

Sequenced-Before

  • Given any two evaluations A and B within the same thread of execution, if A is sequenced before B, then the execution of A shall precede the execution of B.
  • Each full expression is sequenced before the next full expression in program order.
  • If A and B are indeterminately sequenced, then either A is sequenced before B or B is sequenced before A, but it is unspecified which. I.e., A and B are not interleaved.
  • If A and B are unsequenced, then A is not sequenced before B and B is not sequenced before A. I.e., A and B may be interleaved.

Statements

  • Statements are compositions of full expressions.
  • When calling a function…
    • Every evaluation within the function and every evaluation not within the function are indeterminately sequenced.
    • The expression designating the function is sequenced before the argument expressions.
    • Each argument expression is indeterminately sequenced with all other argument expressions.
    • Every expression in the body of the function is sequenced after the expression designating the function and every argument expression of the function.

Initializer Lists

  • Each element of a brace initializer is sequenced before all subsequent elements.

Synchronizes With

  • Two library operations A and B may be related by the synchronizes-with relation.
  • Asymmetric: A synchronizes with B does not imply that B synchronizes with A.
  • Given any two evaluations A and B, A happens before B if:
    • A is sequenced before B, or
    • A synchronizes with B, or
    • For some evaluation X, A happens before X and X happens before B.

Execution Steps

  • The evaluations executed by threads are delineated by execution steps.

  • An execution step is:

    • Termination of the thread.
    • An access of a volatile object.
    • Completion of:
      • A library I/O function call.
      • A synchronization operation.
      • An atomic operation.
  • Implementations may assume that all threads will eventually perform an execution step.

  • Infinite loops that have no observable effects are undefined behavior.

Forward Progress

Light-Weight Execution Agents [p0027r1]

  • Some atomic operations may fail spuriously due to interference from other threads.

  • Implementations are encouraged, but not required, to prevent spurious failures from indefinitely delaying progress.

  • Block: Wait for a condition to be satisfied before continuing execution.

  • Blocking library functions are considered to continuously execute execution steps while waiting for their condition to be satisfied. Thus, blocking makes progress.

  • Forward progress guarantees that something observable eventually happens.

  • There are three forward progress guarantees:

    • Concurrent forward progress
      • The thread will make progress, regardless of whether other threads are making progress.
      • Examples:
        • Preemptive OS thread scheduling
        • Unbounded thread pool
    • Parallel forward progress
      • Once the thread has executed its first execution step, the thread will make progress.
      • Examples:
        • Run-to-completion user-space tasking
        • Threads on modern NVIDIA GPUs
    • Weakly parallel forward progress
      • The thread is not guaranteed to make progress.
      • Examples:
        • Non-preemptive OS thread scheduling
        • Suspendable user-space tasking
          • Work-stealing task schedulers
          • Fibers
        • Thread-less asynchrony
          • Lazy execution
          • C++ coroutines
        • Threads on legacy GPUs

Boost Blocking

  • Block on a thread with weaker forward progress while preserving the calling thread’s forward progress.
  • When a thread P boost-blocks on a set S of other threads, the forward progress guarantee of at least one of the threads in S is temporarily upgraded to P's forward progress guarantee; this repeats until the blocking condition is satisfied.
  • Boost blocking ensures that your child threads make progress, not your siblings.

Memory Order

A memory consistency model describes which memory-operation behaviors are correct during program execution. Memory operations consist of reads and writes, and each operation is bounded by two points in time: issue and response. Assuming no pipelining (i.e., instructions within a single processor execute in order), if the system has $N$ processors and processor $i$ issues $s_i$ memory operations (reads/writes), then there are $\frac{(\sum_{i=1}^{N}s_i)!}{\prod_{i=1}^{N}s_i!}$ distinct interleavings of the instruction sequences; the memory consistency model specifies which of these execution orders are legal.
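A quick sanity check of this count (my own worked example):

```latex
% Two processors, each issuing one memory operation (N = 2, s_1 = s_2 = 1):
\frac{(s_1 + s_2)!}{s_1!\,s_2!} = \frac{2!}{1!\cdot 1!} = 2
% i.e., either P_1's operation executes first, or P_2's does.
```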

The memory order discussed here is the C++-language-level memory model. Like the execution model, it is an abstraction layer that hides the differing memory models of the underlying microarchitectures. The paper on the mathematical model of C++ concurrency gives precise definitions of the terms used when discussing concurrency models.

Ordering Constraints

There are three kinds of ordering constraints in C++:

  • Sequential consistency: memory_order_seq_cst
  • Acquire-release: memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel
  • Relaxed: memory_order_relaxed

Sequential Consistency

Sequential consistency guarantees that the operations of all threads follow a single "global clock".

Acquire-Release Semantics

Under acquire-release semantics, by contrast, there is no global synchronization between threads; synchronization occurs only between atomic operations on the same atomic variable. Concretely, a write to an atomic variable in one thread synchronizes with a read of that same atomic variable in another thread. Acquire-release semantics rest on one basic idea: a release operation and an acquire operation that synchronize on the same atomic variable establish an ordering constraint: no read or write that precedes the release operation may be reordered after it, and no read or write that follows the acquire operation may be reordered before it.

Transitivity

As noted above, a release operation and an acquire operation that synchronize on the same atomic variable establish an ordering constraint, so if different threads operate on the same atomic variable we can use this mechanism to synchronize them. But what if the threads that need to synchronize share no atomic variable? We can exploit the transitivity of acquire-release and synchronize through an intermediate thread. Let's look at an example (the code that runs the three functions on three separate threads is omitted).

In the code above:

  • Blue arrows denote sequenced-before relations: operations within one thread execute in the order in which they appear in the source.
  • Red arrows denote synchronizes-with relations, which come from the acquire-release semantics on atomic variables shared between threads. Take the first and second threads: their synchronizes-with relation guarantees that as long as dataProduced.store(true, std::memory_order_release) takes effect before dataProduced.load(std::memory_order_acquire), the effects of all operations before the store are visible to all operations after the load.

Note: we must put dataProduced.load(std::memory_order_acquire) inside a while loop to ensure the acquire operation really occurs after the release operation, i.e., to make the synchronizes-with condition hold; only then does the transitivity take effect.

Together these relations guarantee a happens-before relation between mySharedWork = {1, 0, 3} and mySharedWork[1] = 2.

Release Sequences

The standard's description:

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation

  • is performed by the same thread that performed A, or
  • is an atomic read-modify-write operation.

Let's look at an example:

#include <atomic>
#include <thread>
#include <iostream>
#include <mutex>

std::atomic<int> atom{0};
int somethingShared{0};

using namespace std::chrono_literals;

void writeShared(){
    somethingShared = 2011;
    atom.store(2, std::memory_order_release);
}

void readShared(){
    while ( !(atom.fetch_sub(1, std::memory_order_acquire) > 0) ){
        std::this_thread::sleep_for(100ms);
    }

    std::cout << "somethingShared: " << somethingShared << std::endl;
}

int main(){

    std::thread t1(writeShared);    
    std::thread t2(readShared);
    std::thread t3(readShared);

    t1.join();
    t2.join();
    t3.join();

    std::cout << "atom: " << atom << std::endl;
    std::cout << std::endl;

}

In this example, threads t1 and t2 obtain a synchronizes-with relation through the acquire-release semantics on atom and can therefore synchronize. fetch_sub is a read-modify-write operation, and we only made its read an acquire; we never made its write a release. But thanks to the release-sequence rule, t3's acquire operation, which reads the value written by t2's read-modify-write, still synchronizes with t1's release store, so the final output of atom is always 0.

On std::memory_order_consume

Using std::memory_order_consume is not recommended: C++17 already planned to revise its semantics, and currently no compiler implements it.

Fences

C++ supports two kinds of fences:

  • std::atomic_thread_fence: synchronizes accesses to the same memory across different threads.
  • std::atomic_signal_fence: synchronizes a signal handler with other code running on the same thread.

A memory fence prevents certain reads and writes, in particular relative orders, from being reordered across the fence.

Thread Fences

There are three kinds of thread fences:

  • Full fence: std::atomic_thread_fence(std::memory_order_seq_cst) or std::atomic_thread_fence(std::memory_order_acq_rel)
  • Acquire fence: std::atomic_thread_fence(std::memory_order_acquire)
  • Release fence: std::atomic_thread_fence(std::memory_order_release)

When placed between two operations (read-read, read-write, write-read, or write-write), they have the following effects:

An acquire fence prevents reads before the fence from being reordered with reads and writes after the fence.

A release fence prevents writes after the fence from being reordered with reads and writes before the fence. (A full fence combines both constraints.)

Here is an example using a thread fence:

In this example, the acquire and release fences prevent the atomic and non-atomic operations from being reordered across them. The while loop in consumer() ensures that the acquire fence synchronizes-with the release fence in producer(), which in turn guarantees that the effects of all non-atomic operations and relaxed atomic operations before the release fence are visible to the code after the acquire fence.

Comparison with Synchronization via Atomic Variables

  • Fences give stronger ordering guarantees than atomic operations. An acquire operation on an atomic variable only guarantees that reads and writes after it are not reordered before it; an acquire fence additionally guarantees that reads before the fence are not reordered after it. This is why the atomic variable can be read with a plain relaxed load ahead of the acquire fence.
  • A fence is more expensive than an atomic operation with the corresponding memory order.

Moreover, as the example above shows, even with fences we still need an atomic variable to ensure that the acquire fence executes after the release fence. Indeed, the standard describes it this way:

[33.5.11 Fences] A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

Signal Fences

std::atomic_signal_fence establishes synchronization for non-atomic and relaxed atomic accesses between a thread and a signal handler executing on that same thread. Here is an example.

#include <atomic>
#include <cassert>
#include <csignal>

std::atomic<bool> a{false};
std::atomic<bool> b{false};

extern "C" void handler(int){
    if(a.load(std::memory_order_relaxed)){
        std::atomic_signal_fence(std::memory_order_acquire);
        assert(b.load(std::memory_order_relaxed));
    }
}

int main(){
    std::signal(SIGTERM, handler);
    b.store(true, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_release);
    a.store(true, std::memory_order_relaxed);
}

Because the two std::atomic_signal_fence calls establish an acquire-release barrier between them, the assert can never fire.

Effect of Atomic Operations on the Generated CPU Instructions

On the x86 and amd64 architectures, std::atomic_thread_fence with any ordering other than std::memory_order::seq_cst does not affect the CPU instructions generated; it only constrains the compiler's handling of the code.

For the actual effect of the C++11 atomic operations and fences with each memory_order, both on compiler code handling and on the CPU instructions generated for the various instruction set architectures, see "C/C++11 mappings to processors". For example, on x86-64 the compiler simply elides acquire-release fences, because the x86-64 hardware memory model is strong enough that the reorderings acquire-release fencing is meant to prevent cannot occur.