I was in the process of writing an explanation for a TRB patch. The changes in question touched TRB's concurrency mechanics, which I thought I had already properly taken into account, but then realized this might not really have been the case. So it was time for some TRB homework: The Threads.
The summary of my study below is based mostly on these sources:
- A C/C++ threads intro
- An article about memory reordering
- A follow-up article about memory reordering
- Cooperating sequential processes, by E. W. Dijkstra
- GCC manual
I also found Ada's approach to concurrency [1] an interesting and contrasting read.
Anyway...
The next time I see multi-threaded code, I will remember these four things:
I. Compiler and CPU do not always follow the programmer-specified execution order
The compiler may optimize/reorder code, making it vastly different from what the programmer specified.
The CPU may also reorder code, again, making it different from what the programmer (and compiler!) specified.
The only guiding principle is that the reorganized program's effect must be exactly the same as what the programmer-specified variant would produce. But only in the single-threaded case!
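For illustration, a minimal sketch of the kind of freedom this rule leaves open (variable and function names are mine, not from the sources): the two stores below are independent, so a single-threaded observer can't tell in which order they happen -- but a second thread can.

int A = 0, B = 0;

void writer(void) /* thread 1 */
{
    A = 1; /* these two stores are independent... */
    B = 1; /* ...so the compiler or CPU may swap them */
}

void reader(void) /* thread 2 */
{
    int b = B; /* may observe b == 1... */
    int a = A; /* ...and yet a == 0, if the stores were swapped */
}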
II. Nothing must be assumed about individual threads' relative execution speeds
Concurrently running threads' instructions interleave, but there is practically no guarantee at all as to how. If point I above didn't exist, the only guarantee would be that each thread's instructions execute in the specified order -- for that thread only.
III. Compiler and CPU memory barriers are required for thread synchronization
Compilers and CPUs provide [2] memory "barriers" or "fences" for the programmer to control the said reordering mechanisms. These are markers, or machine instructions, placed at points in the code across which no reordering may happen, if that reordering would change the running program's memory state compared to what it would've been without any reordering.
In practice, certain implicit compiler barriers also exist, such as an external call. The compiler cannot reliably predict the external function's side effects, so it must sync up with the memory state of the programmer-specified program before the call, and may resume its improv-reordering solo only after the external call has returned.
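A minimal sketch of such an implicit barrier (the function name is hypothetical): the compiler cannot see into external_log(), and for all it knows the callee may read or write the globals, so it cannot move the store to data below the call, nor the store to flag above it.

extern void external_log(void); /* defined in another translation unit */

int data, flag;

void producer(void)
{
    data = 42;
    external_log(); /* opaque external call: an implicit compiler barrier */
    flag = 1;
}

Note this constrains only the compiler; on architectures with weaker memory ordering, the CPU may still reorder the two stores, as point I warns.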
IV. Usually, simple reads or writes of primitive data types are atomic
From multi-threaded programming's point of view, neither C++ nor GCC, AFAIK, guarantees atomic handling [3] of any piece of data, not even of a bool.
CPU manufacturers, on the other hand, specify [4] which of their CPUs' memory operations are atomic. In x86_64 CPUs, for example, reads and writes up to 64 bits are atomic, if properly aligned.
In practice, reads and writes of the "usual primitive" C++ variables [5] just happen to be atomic -- each individual read or write, and only that. For example, incrementing a variable is likely not atomic, because it's not a simple assignment or a value read, but a read-modify-write.
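To see what non-atomic increments do in practice, here's a minimal sketch of my own (not from the sources above): two threads each increment a shared counter a million times with no synchronization. Since cnt++ is a load, an add, and a store, the threads' read-modify-write sequences interleave and updates get lost; the final count usually lands well below two million. (Compile with -O0, or the optimizer may restructure the loop.)

#include <pthread.h>
#include <stdio.h>

unsigned int cnt = 0;

void *worker(void *param)
{
    for (int i = 0; i < 1000000; i++)
        cnt++; /* load, add, store -- not atomic */
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%u\n", cnt); /* expected 2000000; typically prints less */
    return 0;
}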
A couple of practical experiments
A C++ piece, compiled with -O2 optimization:
#include <stdio.h>

int main()
{
    unsigned int cnt = 0;
    for (int i=0; i<0xFFFFFF; i++)
        cnt++;
    printf("%d\n", cnt);
    return 0;
}
...and it can result in the following machine code, as it did in my case:
...
0000000000400450 <main>:
  400450:  48 83 ec 08             sub    $0x8,%rsp
  400454:  ba ff ff ff 00          mov    $0xffffff,%edx
  400459:  be f4 05 40 00          mov    $0x4005f4,%esi
  40045e:  bf 01 00 00 00          mov    $0x1,%edi
  400463:  31 c0                   xor    %eax,%eax
  400465:  e8 c6 ff ff ff          callq  400430 <__printf_chk@plt>
...
Notice that there's no loop at all! Here the compiler is not merely reordering, but entirely omitting -- point I's compiler in full swing.
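One way to confirm it's the optimizer at work (my own aside, not from the sources): qualify the counter as volatile, which obliges the compiler to perform every load and store, and the loop should reappear in the -O2 output.

volatile unsigned int cnt = 0; /* each cnt++ must now actually load and store */

Note that volatile only suppresses these compiler optimizations; it provides no atomicity and no CPU barrier, so it is not a thread-synchronization tool.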
I also tried out a CPU reordering demonstration, which I copied, with some changes, below:
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

// Set one of these to 1 to prevent CPU reordering
#define USE_CPU_FENCE 0
#define USE_SINGLE_HW_THREAD 0

#if USE_SINGLE_HW_THREAD
#include <sched.h>
#endif


sem_t beginSema1, beginSema2, endSema;
int X, Y, r1, r2;

inline void fence()
{
#if USE_CPU_FENCE
    // Prevent CPU (and compiler) reordering across this barrier/fence
    asm volatile("mfence" ::: "memory");
#else
    // Prevent just the compiler reordering across this barrier/fence
    asm volatile("" ::: "memory");
#endif
}

void *thread1Func(void *param)
{
    for (;;)
    {
        sem_wait(&beginSema1); // Wait for signal

        // ----- THE TRANSACTION! -----
        X = 1;
        fence();
        r1 = Y;

        sem_post(&endSema); // Notify transaction complete
    }
}

void *thread2Func(void *param)
{
    for (;;)
    {
        sem_wait(&beginSema2); // Wait for signal

        // ----- THE TRANSACTION! -----
        Y = 1;
        fence();
        r2 = X;

        sem_post(&endSema); // Notify transaction complete
    }
}

int main()
{
    // Initialize the semaphores
    sem_init(&beginSema1, 0, 0);
    sem_init(&beginSema2, 0, 0);
    sem_init(&endSema, 0, 0);

    // Spawn the threads
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread1Func, NULL);
    pthread_create(&thread2, NULL, thread2Func, NULL);

#if USE_SINGLE_HW_THREAD
    // Force thread affinities to the same cpu core.
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);
    pthread_setaffinity_np(thread1, sizeof(cpu_set_t), &cpus);
    pthread_setaffinity_np(thread2, sizeof(cpu_set_t), &cpus);
#endif

    // Repeat the experiment ad infinitum
    int detected = 0;
    for (int iterations = 1; ; iterations++)
    {
        // Reset X and Y
        X = 0;
        Y = 0;
        // Signal both threads
        sem_post(&beginSema1);
        sem_post(&beginSema2);

        // Wait for both threads
        sem_wait(&endSema);
        sem_wait(&endSema);

        // Check if there was a simultaneous reorder
        if (r1 == 0 && r2 == 0)
        {
            detected++;
            printf("%d reorders detected after %d iterations\n",
                   detected, iterations);
        }
    }
    return 0; // Never returns
}
On my machine, it behaved exactly as advertised. I also removed the randomization [6], and it still behaved the same. So, again, point I's CPU in full swing.
TRB locks
Underneath TRB's critical section construct is found the Linux libc pthread library. When a critical section starts, pthread_mutex_lock() is executed; equally, when a critical section ends, pthread_mutex_unlock() is called. A pthread_mutex_trylock() variation also exists.
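For a rough idea of how such a critical-section construct maps onto these calls, here is a minimal RAII sketch in the C++ style -- the class names are mine, not TRB's actual ones:

#include <pthread.h>

// Illustrative names only -- not TRB's actual classes.
class Mutex
{
    pthread_mutex_t m;
public:
    Mutex()        { pthread_mutex_init(&m, NULL); }
    ~Mutex()       { pthread_mutex_destroy(&m); }
    void lock()    { pthread_mutex_lock(&m); }
    void unlock()  { pthread_mutex_unlock(&m); }
    bool trylock() { return pthread_mutex_trylock(&m) == 0; }
};

class ScopedLock
{
    Mutex &mu;
public:
    ScopedLock(Mutex &m) : mu(m) { mu.lock(); }   // critical section starts
    ~ScopedLock()                { mu.unlock(); } // critical section ends
};

With this shape, entering a scope acquires the lock, and leaving the scope -- by any path -- releases it.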
I dug into TRB's musl libc, looking for clues as to whether pthread_mutex_lock() or pthread_mutex_unlock() include a CPU memory barrier, and what such a barrier would look like. I found a Linux kernel "futex" syscall, which I further dug into, but could only find suggestive material, and couldn't properly verify the case. The closest I got was that the futex syscall, for x86_64, may contain CPU instructions like:
- LOCK XCHGL
- LOCK XADDL
- LOCK ORL
- LOCK ANDL
- LOCK XORL
Because the pthread lock calls are external calls, that alone likely makes them implicit compiler memory barriers.
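As an aside of mine (not from the sources): on x86_64, a LOCK-prefixed read-modify-write like the ones listed above is both atomic and acts as a full CPU memory barrier. One way to get such an instruction from C is GCC's legacy __sync builtins; a minimal sketch:

#include <stdio.h>

int counter = 0;

int main()
{
    /* On x86_64, GCC typically compiles this to a LOCK XADD --
       an atomic increment that is also a full memory barrier. */
    __sync_fetch_and_add(&counter, 1);
    printf("%d\n", counter);
    return 0;
}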
Finally
Another type of concurrency issue...
- See "Tasks and Synchronization" [↩]
- Or at least, should provide [↩]
- Atomicity matters in that sense, that, for example, you only want to read either: 1) an old value of an integer, or 2) a new value, but not 3) a whatever-bit-mixture of both, because the new value's write was in-progress at the moment of the read. [↩]
- Again, AFAIK. [↩]
- bool, char, int, long: signed/unsigned; float, double [↩]
- Shown in the original version [↩]
At some point I did some digging through TRB's getdata weirdness, which got me thinking that TRB oughta come with at least two pieces of documentation in order to be considered "complete".
One of them would be a documentation of the memory/concurrency model, which is, I believe, what you're considering. Still, there are a few interesting unanswered questions here, in my opinion: what dynamic data handling is TRB performing at run-time? what portion of it is assumed to be thread-local and what portion is considered "shared" between threads? what synchronization primitives are used for protection, and how? Concurrency is a bitch, what with data races, lock ordering and indefinite postponement issues, so maybe there are a few ways to get TRB to hang merely through bad synchronization design. Maybe; I haven't really looked, to be honest.
The other would be a line-by-line annotation of the code with what it actually does, especially with respect to assumptions in the previous paragraph. It's a ton of work, but I for one see no other way to guard TRB against future exploits. The good part is that it only needs to be done once, and it's done for as long as Bitcoin remains a thing.
Comment by spyked — 2021-08-22 @ 20:20:58
@spyked
Despite the article name suggesting it'd be highly TRB-specific, I ended up with a pretty generic summary about concurrency. That's because I didn't have a clear picture even of the concurrency basics in my head; I was discovering them while trying to finish a patch. I might continue on the topic some other day, in the form of another article, or even a book-style chapter/specification. Time will tell...
Re your TRB documentation points: I think they were well worth mentioning in this context, and I find the presented open questions entirely justified!
Comment by cgra — 2021-08-27 @ 8:08:03