cgra's

2021-12-01

TRB Defect Exhibition - OOM and Other Spam

Filed under: TRB — cgra @ 15:15:18

Table of contents:


This is a continuation of the two previous articles (here and here) in a series presenting some potentially annoying TRB defects.

These are all the issues I'm currently aware of and believe worth mentioning separately, but I have no reason to believe there aren't more to be found.

We'll take a look at the std::vector specifics first, because they become important when figuring out how TRB occupies memory.

1 TRB std::vector’s memory footprint

Because TRB is built with a specific toolchain, we are, in theory, able to say exactly how std::vector works in TRB's case, even if no particular C++ standard, book, or compiler documentation ever completely specified it.

Even so, telling with 100% accuracy what this accumulated pile of computing history does is not an easy task, and I'm not attempting it here. Instead, I offer a hopefully reasonable explanation for the behavior I observed but couldn't find an exact "paper specification" of.

C++ std::vector has two separate, similar-sounding concepts: size and capacity. Capacity is how much contiguous memory the vector occupies, i.e. reserves, while size is how much of that reserved capacity is currently in use. Size cannot exceed capacity. Both size and capacity are measured in items: for example, if the item size is 8 bytes, a vector capacity of 10 means 10*8=80 bytes of occupied memory.
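A minimal illustration of the units involved (my own sketch, not TRB code; it assumes the libstdc++ behavior of reserving exactly the requested capacity):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A vector of 8-byte items with room reserved for 10 of them, but only
// one in use: capacity 10, size 1, occupied memory 10 * 8 = 80 bytes.
std::vector<int64_t> make_demo_vector() {
    std::vector<int64_t> v;
    v.reserve(10);   // capacity: 10 items (libstdc++ reserves exactly n)
    v.push_back(42); // size: 1 item; capacity unchanged
    return v;
}

// Occupied memory in bytes: capacity times the item size.
std::size_t reserved_bytes(const std::vector<int64_t>& v) {
    return v.capacity() * sizeof(int64_t);
}
```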

The std::vector abstraction allows inserting new items even when the vector is already at full capacity. When this happens, the capacity is automatically doubled: a new memory block twice as big is allocated, the old data is copied there, and the original block is deallocated. The exception is an insertion that would more than double the old capacity; then the new capacity will be the old capacity plus the increase. Note that the peak memory consumption during a capacity increase is therefore the old capacity plus the new capacity!
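To illustrate, a sketch of the growth rule (my own helper functions, not TRB code; this assumes the doubling behavior of the GCC libstdc++ TRB builds against -- other library implementations may grow differently):

```cpp
#include <cstddef>
#include <vector>

// Growing a full vector by one item: capacity doubles. During the
// reallocation, peak memory is old + new, i.e. 3x the old reservation.
std::size_t capacity_after_one_more(std::size_t old_cap) {
    std::vector<char> v(old_cap);  // size == capacity == old_cap
    v.push_back('x');              // forces a reallocation
    return v.capacity();
}

// Inserting n items at once: new capacity is max(2 * old, old + n),
// i.e. old + n when the insertion alone more than doubles the vector.
std::size_t capacity_after_insert(std::size_t old_cap, std::size_t n) {
    std::vector<char> v(old_cap);
    v.insert(v.end(), n, 'x');
    return v.capacity();
}
```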

An interesting consequence is that if a vector ends up any larger than 1/3 of the available memory, it cannot grow in capacity anymore: a doubling peaks at old + 2*old = 3 times the current capacity, which no longer fits. Every capacity-increasing vector operation will then throw an allocation error.

Finally, std::vector's capacity doesn't decrease automatically, or in any clean way. The abstraction provides multiple operations that sound like they'd reduce the object's memory footprint: clear(), erase(), resize(). In reality, they only affect the size.
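A quick demonstration that these operations leave the reservation untouched (my own sketch, not TRB code):

```cpp
#include <cstddef>
#include <vector>

// clear() drops the size to zero, but the 1000-item reservation stays.
std::size_t capacity_after_clear() {
    std::vector<int> v(1000);
    v.clear();            // size -> 0
    return v.capacity();  // still 1000
}

// A shrinking resize() behaves the same: only the size changes.
std::size_t capacity_after_resize_down() {
    std::vector<int> v(1000);
    v.resize(1);          // size -> 1
    return v.capacity();  // still 1000
}
```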

However, capacity reduction may not be that interesting to an ideal TRB, because everything can either have a hard upper boundary (mempool, block size, tx size, peer count, etc.), or grows at a predictable rate (block count). When you know how much memory will ever be needed at most, you just keep that much always available, and are done with it.

Clues for the std::vector behavior can be found in TRB’s build directory: build/buildroot-2015.05/output/build/host-gcc-final-4.9.2/libstdc++-v3/include/bits/, in files: stl_vector.h and vector.tcc.

A handful of std::vector code shortcuts:

Other sources used:

  • The C++ Programming Language, 3rd edition, by Bjarne Stroustrup
  • The C++ Programming Language, 4th edition, by Bjarne Stroustrup
  • ISO IEC 14882:1998, Programming languages -- C++ (The C++98 standard)

1.1 Other TRB STL containers

I haven't thoroughly checked or tested, but to my understanding, the following C++ STL items in TRB don't have an underlying capacity-doubling property similar to std::vector's:

  • std::multimap,
  • std::multiset,
  • std::map,
  • std::list,
  • std::deque, or
  • std::set.

2 vRecv burst processing defect

This defect has two key factors. The first is that there are amplifying messages an attacking peer can send us, which require far more memory and/or CPU time to process than their size would suggest is reasonable. Such messages are: getdata, getblocks, and getheaders. Possibly addr, inv, and tx belong in the same category too, but with much more limited effect.

The other key factor is that, for maximum impact, the attacker can fill up the victim node's vRecv while the node is busy processing something else1. When the victim node resumes processing the attacking peer's messages, it will process the whole vRecv in one go, before anything else happens. If vRecv is filled with the previously mentioned amplifying messages, the combined impact can be massive. The default vRecv upper limit is 10e6 bytes.

Let’s take a closer look at each message type, and start with a familiar one.

getdata
Roughly speaking, receiving one getdata of 10 block hashes may force TRB to read up to ~10e6 bytes from disk and store them into the variable vSend. The message itself is 24 + 1 + 10 * (4 + 32) = 385 bytes, so the amplification would be 10e6 / 385 ~ 26000. The required CPU time will also be significant2.
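The arithmetic above, sketched out (my own helper names, not TRB code; the worst case assumes each hash pulls a ~1e6-byte block off disk into vSend):

```cpp
#include <cstddef>

// Message size: 24-byte header, 1-byte compact count (valid for fewer
// than 253 entries), then a 4-byte type + 32-byte hash per entry.
std::size_t getdata_message_bytes(std::size_t n_hashes) {
    return 24 + 1 + n_hashes * (4 + 32);
}

// Amplification: bytes loaded into vSend per byte of request,
// assuming every hash names a full ~1e6-byte block.
std::size_t getdata_amplification(std::size_t n_hashes) {
    const std::size_t block_bytes = 1000000;
    return n_hashes * block_bytes / getdata_message_bytes(n_hashes);
}
```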

While the patch obey_sendbuffersize prevents a single getdata message from causing more than the -maxsendbuffer amount of memory load, a bursted series of 10-block getdata’s should bypass the limit without a problem.

Besides block hashes, transaction hashes should also work, though it's harder to achieve as high an amplification. If large transactions don't show up naturally as often as needed, the anyone-can-spend angle may come into play: the attacker can first create and feed the victim node a fresh and beefy anyone-can-spend transaction (without a fee cost), and then bomb it with getdata messages requesting that same transaction multiple times per message.

getblocks
In my case, TRB will process this purpose-built getblocks message in ~140 ms3, but the message is only 861 bytes small. The message is as follows:

Message header 24 bytes
Locator version 4 bytes
Locator size 1 byte
Fake hashes 25 * 32 bytes
Stop hash 32 bytes

Fake hashes and stop hash have the same purpose: None of them should actually exist. I used null hashes explicitly.

The message makes TRB think the block-requesting peer is at height 22397, and 100% in a losing branch. In response, TRB packs an inventory of the first 22397 blocks from the main chain, to be advertised as a cure for the poor, lost peer.

Each message will cause a load of 22397 * (4 + 32) = 806292 bytes (see vInventoryToSend). The amplification would be, in the worst case4, 4 * 806292 / 861 ~ 3700.
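The message size from the table, and the resulting inventory load, sketched (my own helper names, not TRB code):

```cpp
#include <cstddef>

// The purpose-built getblocks message: header + locator version +
// locator size + fake hashes + stop hash.
std::size_t getblocks_message_bytes(std::size_t n_fake_hashes) {
    return 24 + 4 + 1 + n_fake_hashes * 32 + 32;
}

// Response load: one 36-byte inventory entry (4-byte type + 32-byte
// hash) per advertised block, queued into vInventoryToSend.
std::size_t inventory_bytes(std::size_t n_blocks) {
    return n_blocks * (4 + 32);
}
```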

getheaders
This is similar to getblocks, but more severe. While getblocks loads CInv objects into vInventoryToSend, getheaders instead loads CBlock objects (without transactions) into vSend. The attack message is almost exactly the same as for getblocks, with only two differences: the header obviously has a different command in it ("getheaders" != "getblocks"), and it has 30 fake hashes instead of 25. Therefore, this message weighs 1021 bytes.

This many hashes is enough to make the victim node believe the attacking peer is up to 1050585 blocks up a losing branch, and needs to be corrected with the entire main chain, which at the time of writing is ~700000 items. Every item is 81 bytes in size5, so the memory amplification would be 700000 * 81 / 1021 ~ 55500.

On my computer, processing each of these messages takes ~300 ms6.

inv
An inventory of 50000 new items takes 24 + 3 + 50000 * (4 + 32) = 1800027 bytes of space. Processing such a message will reserve 50000 * (3 * (4 + 32) + 2 * 8) = 6200000 bytes of memory7, which makes an amplification of 6200000 / 1800027 ~ 3.44.

addr
Sending TRB a 10-address message will trigger the node's address relaying function, which increases the memory amplification by two more copies of each address. Additionally, each address is stored into the global mapAddresses (first an AddAddress() call, then insert), and into the attacking peer's setAddrKnown. This makes an amplification of ~4. (See also peer addresses in detail.)

tx
Perhaps a bit more interesting than inv or addr, because the memory amplification is a function of the connected peer count. Besides adding the transaction hash to the attacking peer's setInventoryKnown, a transaction not seen before will be queued for relay to the other peers as well8. If a node has 125 simultaneously connected peers (the default maximum for TRB), the multiplier will be ~124.

A likely key to make this attack work is the anyone-can-spend transaction spam.

3 Peer-to-peer protocol state

The state TRB keeps track of per peer is found in class CNode. A number of these state variables can be remotely made to occupy more RAM than is reasonable.

The rest of the state data is common to all peers. Whether a variable is peer-specific or common to all peers will be mentioned separately below, for each discussed variable.

3.1 High-impact variables

vNodesDisconnected (declaration, patch article)
Every disconnected peer is put on hold in this dumpster for 15 minutes, before finally being garbage-collected. This enables an attacker to temporarily waste the target node's memory just by repeatedly connecting and disconnecting. The effect can be amplified by filling any peer-specific variables before disconnecting, creating larger garbage to collect.

This variable is common to all peers. In theory, the vector capacity doubling issue also applies.

mapAlreadyAskedFor (declaration)
Inventory items from all peers, referring to supposed actual items we don't have. Every inventory item weighs 32+4 bytes. The life-cycle of each inventory item is as follows:

  • One added on every received inv item that is new and interesting (AskFor() call, then insert)
  • When a transaction is accepted to the mempool, its hash is removed from this collection
  • When a block is accepted, its hash is removed from this collection

Bogus entries never get removed, which makes this an easy spam target: an attacker that keeps sending garbage inventory adverts will eventually fill the victim node's entire memory.

For completeness, even normal operation may, in theory, leave dangling transaction entries: if a specific transaction-including block was processed between a tx inv and a tx message, the transaction never got included into the mempool, but was already advertised to us.

This variable is common to all peers.

setInventoryKnown (declaration)
A supposed spam guard for TRB itself. The problem is that this variable can only grow for as long as the peer stays connected.

This is a set of inventory items that we know the peer in question knows of. Records are added when the peer advertises items (blocks and transactions) to us, or when we advertise items to him. Records are only removed when the peer disconnects and the whole peer data eventually gets garbage-collected.

While even in normal operation this variable is "designed" to grow indefinitely, without an upper boundary, there's also nothing preventing peers from sending us bogus inventory. That's how this variable provides another easy way to make TRB run out of memory.

This variable is peer-specific.

vInventoryToSend (declaration)
A collection of inventory items queued for being sent to the peer. These could be:

The most troublesome are the hash responses to getblocks requests. See the bursted getblocks exploit.

What’s more, depending on the underlying vector capacity of the vInventoryToSend, the direct+indirect memory footprint of this collection will peak at 3 to 4 times the contained items’ actual size.

This variable is peer-specific. The vector capacity doubling issue applies.

3.2 Peer addresses

Nodes advertise other nodes' network addresses to each other. An address is a bunch of constant bytes, a timestamp indicating currentness or age, an IP address, and a 16-bit port number. The total size of one peer address is 30 bytes.

Uniqueness of a peer address is determined by both the IP and the port value, which makes the nominally valid address space huge (equality operator, GetKey()): 2^(32+16) total peer addresses to spam with, of which only a handful are considered invalid by TRB.

Every peer address a node is advertised, that is at most 10 minutes old, and whose containing addr message was at most 10 items in size, is further queued for relaying to up to two other peers. The addresses are relayed to only one peer per ThreadMessageHandler2 round9, which takes 100 ms plus a variable processing time. On relay, one peer's vAddrToSend queue is completely dumped into its vSend (push 1, push 2).

The peer address relaying target is randomized10 per target peer / relayed IP pair / 24h period. From the attacker's perspective this means that if a certain IP address relays to peer A, there are 65535 addresses in total that can be deterministically aimed at the same peer A -- just take every possible port number for that same IP address. This makes a 1966050-byte payload of somewhat aimable peer address spam.
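The address-space and payload arithmetic, sketched (my own helper names, not TRB code):

```cpp
#include <cstdint>

// Nominal peer address space: every IP/port combination is unique,
// so 2^(32+16) = 2^48 distinct addresses exist to spam with.
uint64_t address_space() {
    return 1ULL << (32 + 16);
}

// Aimable spam payload: 65535 ports for one IP whose relay target is
// known, times the 30-byte serialized address size.
uint64_t aimable_spam_bytes() {
    return 65535ULL * 30;
}
```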

Garbage peer addresses get cleaned up once older than ~two weeks. The cleanup routine triggers every 10 minutes, and each time runs for 20 seconds at most. Given that the attacker can set the advertised addresses' age himself, this could make another way to arrange a buffer-filling window for the burst processing defect exploit.

3.2.1 Peer address attacks

Forced peer disconnect
It appears possible to force the victim node's other peers to disconnect, by sending multiple full-buffer batches of young garbage peer addresses, eventually causing a "vSend flood control disconnect" on the other peers. As a preparation stage, or possibly as a continuous operation, the attacker may:

  • Create as many connections to the peer as possible.
  • Test a sufficient set of individual IP addresses to see which are relayed back to the attacker's peer connections, and which are not. The ones that do not relay back must relay to the other peers, and therefore each provides an aimable spam unit.

Out-of-memory
By simply advertising young garbage peer addresses to a node, the attacker has ~two weeks to fill mapAddresses (a variable common to all peers), or until the memory runs out. If the attacking peer remains connected the whole time, setAddrKnown (a peer-specific variable) will also keep filling for up to 24 hours, depending on the time of day the victim node was started at (see nLastRebroadcast11). Up to two other peers' setAddrKnown's will also fill per each garbage address, once addresses are being relayed.

Not to forget that an actively updated copy of mapAddresses also resides on disk.

Burst-amplified out-of-memory
Doing the above with possibly multiple attacking peer connections, in maximal full-buffer bursts, could provide a further vAddrToSend (peer-specific variable) amplification. In addition, the vector capacity doubling issue applies, and there's also a temporary memory peak when vAddrToSend is dumped into vSend.

Self-propagating, network-wide address spam
Because TRB will soon relay the peer addresses it was fed by the attacker, and given that the garbage addresses stay around for ~two weeks, this appears to be an automatic spam machine targeting the whole TRB network, with a mere single victim node as an entry point.

3.3 Lower-impact, amplifying variables

These variables provide a simple way to create large, dead peer connections when exploiting the vNodesDisconnected dumpster vulnerability. The larger the dead connections, the higher the probability of the victim node running out of memory within the 15 min dumpster cooldown.

The dumpster-amplifying variables work equally well for amplifying a coordinated sybil attack, where the attacker utilizes every peer slot the target TRB node still has available, to simultaneously cause as high a memory load as possible, using multiple peer connections.

All variables mentioned here are peer-specific.

strSubVer
Represents the agent string of the peer, such as "/therealbitcoin.org:0.9.99.99/", or "/Satoshi:0.16.3/".

TRB doesn’t place an explicit limit on its length, and while the value isn’t used for anything, it’s nevertheless kept in memory until the peer connection is closed, and garbage-collected.

In practice, the agent string length is limited by whatever other, more general limits first affect the incoming version message. Reception of messages is by default limited to ~10e6 bytes at once, per peer, so a new peer could, if it so wished, greet us with a ~10 MB user agent string!

mapAskFor
A collection of peer-advertised inventory, each item consisting of a type code and a hash. Every item we've been advertised, and not yet received a corresponding actual item for, is kept in RAM for up to two minutes.

The peer may choose to dump on us up to 50000 items per inventory message, and can always send each unique item (or list) twice.

vRecv
The underlying type is CDataStream, which is a "stream-like" wrapper around a std::vector. Because of the internal std::vector capacity doubling mechanism, this variable may eventually become up to twice12 the user's parameter-specified size.

vSend
Similar to vRecv above, this may also eventually become up to twice13 the user's parameter-specified size.

Because the vSend flood control check will not interrupt an ongoing vRecv processing run, even though it does disconnect the peer, vSend may grow as large as a full vRecv input can generate data. This is not exactly "low-impact", but it's specific to the burst processing defect. Otherwise vSend should remain low in impact.

There is also a vSend-specific peculiarity: every time the buffer size grows past a multiple of 2^32 bytes, CNode::nMessageStart will wrap around and make checksumming pushed messages 4GB heavier (see here and here).

3.4 Peer statistics

For completeness, if an attacker could spoof his IP address without limits, there are three variables to fill with new peer IP addresses, or timestamps:

  • setBanned On every peer ban, a 4-byte IP address and a timestamp (8 bytes) is added.
  • setKnown On every new, unique (since restart) peer IP, a 4-byte IP address is added.
  • vTimeOffsets On every new, unique (since restart) peer IP, an 8-byte timestamp is added. The vector capacity doubling issue applies. Also related to debug.log spamming is how the time offsets' log print grows linearly with the seen unique peer IP count.

4 Spamming debug.log

There are possibly other ways to do this, too, but here is an example.

When a message header (empty payload) with a 12-character command and a 10-digit size is sent to TRB, text like the following will be printed to the log:

11/27/21 16:14:09 CMessageHeader::IsValid() : (burnburnburn, 2147483648 bytes) nMessageSize > MAX_SIZE
11/27/21 16:14:09 

PROCESSMESSAGE: ERRORS IN HEADER burnburnburn

The print size is 171 bytes, and the header size is 24 bytes. When sending such headers in a continuous stream, this should make roughly a disk spam amplification of 171/24 = 7.125.

  1. Such as 1) other peer’s messages, 2) cleaning up garbage peer addresses, 3) possibly devise your own tactic []
  2. See also vSend, CPU time will dramatically increase every 4GB step []
  3. Naturally, this depends on the hardware []
  4. See vInventoryToSend for impact multiplier []
  5. See CBlock IMPLEMENT_SERIALIZE, notice that a 1-byte vtx size is still serialized, when vtx is empty []
  6. See also vSend, CPU time will dramatically increase every 4GB step []
  7. 1: Add to setInventoryKnown. AskFor(inv). 2: AskFor() mapAlreadyAskedFor insert, 3: AskFor() mapAskFor insert []
  8. First RelayMessage(), then RelayInventory(), then PushInventory() for every peer []
  9. pnodeTrickle selected and passed, addr's conditionally pushed []
  10. Expect some slow read ahead []
  11. This may, or may not also prove helpful in figuring out the address cleanup cycle’s timing []
  12. + another 64kB, but with eventual flood disconnect []
  13. + whatever the size of the bite the OS eats up the send buffer in, which may not be very interesting... []

2021-11-12

TRB Defect Exhibition - Two DoS Classics

Filed under: TRB — cgra @ 13:13:02

The denial-of-service classics for TRB presented here are:

  • the unbounded mempool and
  • the blockchain checkpoints aging over time, which makes it possible to spam a node with bogus blocks, using increasingly powerful, but affordable, second-hand mining hardware.

I consider them classics, because I can't imagine they weren't widely known already long ago. Why I bothered to mention them anyway is that I'm in the process of enumerating a bunch of other TRB DoS issues, too, with similar effects.

On the other hand, I haven't fully tested either of these theories in practice, so it's always possible I'm imagining things... Please correct me if you see false claims.

Unconfirmed transactions, aka mempool

TRB's mempool keeps unconfirmed transactions in two sets. The relay set contains all the transactions the node has received from other nodes (or from the node operator) within the last 15 minutes. These transactions are relayed to all the other peers. Variables mapRelay and vRelayExpiration implement the relay set (declarations, writes, 'getdata' response filtering).
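A minimal model of the relay set's bookkeeping (my own simplification, not TRB's actual code -- TRB keys mapRelay by CInv and stores serialized transactions):

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>

// A map from inventory hash to transaction, plus an expiration queue
// that drops entries 15 minutes after insertion, in arrival order.
struct RelaySet {
    std::map<std::string, std::string> mapRelay;
    std::deque<std::pair<int64_t, std::string>> vRelayExpiration;

    void Add(int64_t now, const std::string& hash, const std::string& tx) {
        Expire(now);
        mapRelay[hash] = tx;
        vRelayExpiration.push_back({now + 15 * 60, hash});
    }
    // Drop every entry whose 15-minute lease has run out.
    void Expire(int64_t now) {
        while (!vRelayExpiration.empty() && vRelayExpiration.front().first < now) {
            mapRelay.erase(vRelayExpiration.front().second);
            vRelayExpiration.pop_front();
        }
    }
};
```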

The other set is the mining set. It originally made sense for the TRB operator to mine blocks himself, so these are the longer-kept transactions, beyond the relay set, that the node no longer bothers relaying, but keeps for putting into new blocks. Variables mapTransactions and mapNextTx implement the mining set (declaration, 11 other mapTransactions occurrences in main.cpp).

The specific feature that mapNextTx exists for is permanently disabled. Therefore, mapNextTx is an easily removable, unnecessary piece (declaration, ptxOld always NULL (another dependent construct, also easy to remove), write occurrence 1, write occurrence 2).

While the relay set shrinks over time, transactions are also dropped from both sets when they show up in newly accepted, best-chain blocks.

TRB's mempool will grow for as long as valid transactions keep coming in and don't get accepted into new blocks fast enough, or until the node runs out of memory.

Finally, the anyone-can-spend transactions (SegWit, Taproot(?)) might make a particularly easy tool for generating bogus transactions to fill TRB mempool with.

Spam blocks

mapBlockIndex keeps an in-RAM index of downloaded, valid blocks, losing chains included (declaration, index accepted blocks, ~50 total occurrences).

The memory cost of the index is possibly well over 108 bytes1 per entry (entry type).

TRB implements hard-coded checkpoints that freeze the older end of the main chain, so that no branch originating from the frozen side can be valid anymore; such branches are automatically rejected.

Because the last TRB checkpoint is so old (last checkpoint, validation by checkpoint), and the mining difficulty back then was much lower, today's second-hand mining hardware is turning into a low-cost tool for generating spam blocks starting right after the last checkpoint.

The attacker could choose anything between empty and full bogus blocks, depending on the proportions in which he wants the victim's RAM and disk to be filled.

Generating spam blocks, an example

The following method assumes an applicable, 14Th miner. An Antminer S9 could be such a miner, but I don’t know for sure.

First, get the miner and figure out how to use it, if not already familiar.

Start building on block height 169342. The next block, your first spam block, will be the last block of the same difficulty adjustment period the last checkpoint was in. The timestamp of your spam block 169343 should be at least 8 weeks later than that of the first block in the same difficulty adjustment period, the block at height 167328.

The point of this is to immediately gain the maximum drop in difficulty in our new-born spam branch. The starting difficulty bits are 0x1a0c309c, and after the above timestamp manipulation, they would become 0x1a30c270. The second spam block, at height 169344, will have 1/4 the difficulty of spam block 169343.
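The bits arithmetic can be checked mechanically. A sketch of the compact-bits manipulation (my own illustration, not TRB code; it only holds while the quadrupled mantissa stays below the sign bit):

```cpp
#include <cstdint>

// Compact difficulty "bits": the high byte is a base-256 exponent, the
// low three bytes a mantissa; target = mantissa * 256^(exponent - 3).
// A maximal downward retarget multiplies the target by 4 (i.e. divides
// difficulty by 4), which -- as long as mantissa * 4 stays below
// 0x800000 -- is just mantissa * 4 with the exponent unchanged.
uint32_t quadruple_target(uint32_t bits) {
    uint32_t exponent = bits >> 24;
    uint32_t mantissa = bits & 0x00ffffff;
    return (exponent << 24) | (mantissa * 4);
}
```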

Block 169343 should finish within 20 to 32 minutes of mining. Then create 2016 lower-difficulty blocks continuing the new spam chain.

Mining the lower-difficulty blocks should finish in ~7.34 days. The timestamp of spam block 171359 should, again, be at least 8 weeks later than the timestamp of block 169344, the first of the same difficulty adjustment period.

Repeat this difficulty reduction step as many times as necessary.

After the difficulty is brought down to a comfortable level, you can simply start mining competing blocks, all of the same height and time, and the stream of spam blocks will just keep flowing, and will never reach the current timestamp of the victim node, which otherwise could end the spam game.

  1. bnChainWork’s variable-length, big-integer contents were not taken into account, only the constant-length bits found inside the OpenSSL bignum_st typedef. []

2021-08-26

TRB OOM Patch - no_peer_dumpster

Filed under: TRB — cgra @ 14:14:15

A strong warning

What follows, should be considered EXPERIMENTAL, because:

  1. I could be wrong,
  2. brain fart this thick is difficult to breathe,
  3. nobody else looked yet, and
  4. it's not been extensively tested.

Let us proceed. We will first analyze the current situation, then have a look at the proposed solution.

The reader may first want to have a careful look at the patch contents below this article, then resume here to read the whole description.

Current situation

TRB has a bunch of (here, here, and here) out-of-memory DoS issues. This article attempts to address one specific issue in that group: the vNodesDisconnected dumpster, where every disconnected CNode is put on hold for 15 minutes after being disconnected, before actually being deallocated.

The dumpster mechanism is bad, because:

  • It's not necessary (elaborated later), and
  • It enables an attacker to waste the target node's RAM just by repeatedly connecting and disconnecting.

Brain Farts of Satoshi, episode ∞ - Ref Counting

Tied to the dumpster mechanism are1: CNode's AddRef(), Release(), GetRefCount(), nRefCount, and nReleaseTime.

AddRef() and Release() calls should go in pairs; otherwise nRefCount could remain positive indefinitely. In such a case, the node would never get removed from vNodesDisconnected and deallocated.

These refs are adjusted as follows:

  • Increment and decrement for each peer, when beginning and ending a message processing loop in ThreadMessageHandler2() thread.
  • Increment and decrement for each peer, when beginning and ending a send()/recv() handling loop in ThreadSocketHandler2() thread.
  • Increment on peer's connect-time, when peer gets added to vNodes. Both, inbound and outbound.
  • Decrement when disconnected node is moved from vNodes to vNodesDisconnected.
  • Increment in a corner case where simultaneously connecting inbound, and outbound, to the same peer, resulting in sharing a CNode object without separate inbound/outbound connections. See next.
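A minimal model of the scheme these adjustments implement (my own simplification, not actual TRB code):

```cpp
// AddRef()/Release() must pair up; the only productive reader is the
// socket thread's deletion check asking whether the count is zero.
struct Node {
    int nRefCount = 0;
    void AddRef()  { ++nRefCount; }
    void Release() { --nRefCount; }
    int  GetRefCount() const { return nRefCount; }
};

// The deletion condition: one unbalanced AddRef() keeps this false
// forever, and the node is never deallocated.
bool safe_to_delete(const Node& n) { return n.GetRefCount() <= 0; }
```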

In ConnectNode(), an unbalanced AddRef() is theoretically possible: if a new inbound peer connected after the FindNode() call in OpenNetworkConnection() and before the FindNode() call in ConnectNode(), and also happened to be the same peer our node was about to connect to (i.e. ConnectNode()'s parameter addrConnect). In this case, the inbound and outbound connection attempts would end up peculiarly sharing the same CNode object.

Otherwise, starting from the last FindNode() call, nothing prevents a peer from creating multiple connections to our node, and if we already had an outbound connection to the peer, we could end up with separate inbound and outbound connections to the same peer (unlike above, not a shared CNode object).

In the dumpster filling section there is an expression that never triggers. The ref count increments for new connections (inbound, outbound) are only balanced by their equivalent decrements after that expression triggers, which is impossible!

Therefore, the ref counts are only ever productively read by one check: whether to already delete a disconnected node, or not.

Ref counting is mostly pointless

Because the ref count reading happens in the ThreadSocketHandler2() thread, it's entirely pointless to adjust ref counts in the send()/recv() handler (increment, decrement). The ref count read can never happen at the same time -- both are at different positions in the same thread.

Adjusting the ref count when a peer is being added to, or removed from, vNodes (new inbound, new outbound; dumpster load) is similarly pointless, but serves one distinct purpose: preventing a theoretical risk of a dangling pointer dereference on two lines:

  • Patch line 57, a redundant timestamp assignment, and
  • patch line 177, an outbound-node-flag assignment, which could easily reside in a thread-safe position.

The last remaining, and within the scope of this patch fundamentally the only, useful place to adjust the ref count is ThreadMessageHandler2()'s loop. It signals to the deletion code in the ThreadSocketHandler2() thread whether the message handler thread is still actively handling the peer data, or not.

The 15 min dumpster cooldown of disconnected nodes adds nothing to the ref counting but a loose, extra protection against the marginal dangling pointer dereference risk in OpenNetworkConnection(), at patch line 177: the second FindNode() call could find an existing peer, which ThreadSocketHandler2() could delete immediately after -- leaving us a dangling pnode pointer.

Proposed changes

Removals: redundant outbound validation, unbalanced ref count increment, redundant assignment

Repeatedly validating outbound connections (patch lines 164, 165, 28, and 32) is useless, especially for the FindNode() check, because the acceptance of inbound connections from a single peer IP address is not limited to one.

The AddRef() at the second FindNode() position is not balanced anywhere, and could cause a stale peer to stick around till node restart.

Also, the assignment at patch line 57 is already done in the CNode constructor, so it is redundant. Besides, it marginally risks a dangling pointer dereference, as previously noted.

For these reasons we may remove lines from ConnectNode(), marked by patch lines from 28 to 41, and line 57.

Simplified outbound connection init

We may simplify the OpenNetworkConnection() function to return void, instead of bool, because the return value isn't used anywhere, and it further helps cleaning up the whole function.

The outbound-node-flag assignment from OpenNetworkConnection() should be moved to happen before the new peer gets added to vNodes in ConnectNode(), to avoid one of the marginal risks of a dangling pointer dereference noted earlier.

ConnectNode() would convey its meaning more accurately if it didn't return a CNode pointer. Specifically, we don't want to do anything with a node object outside the cs_vNodes/ref count guard, to avoid further dangling pointers. So we change the return type to void.

ConnectNode() parameter nTimeout is always 0, so we bolt the value down.

These changes are highlighted by patch lines: 16, 17, 25, 26, 53, 56-62, 155-180, 218 and 219.

Reduce the ref counting to required minimum

Because the AddRef() / Release() / GetRefCount() only make sense in coordinating the stale peer deletion moment, between ThreadMessageHandler2() and ThreadSocketHandler2(), we remove all the other uses:

AddRef()'s nTimeout parameter is never used, so we may remove it.

The nReleaseTime will not be used, and must go (patch lines 90, 227, 235 and 250-255). We previously fixed the corner case it was originally loosely protecting -- nReleaseTime won't be missed.

Because ref count is never 0 at deletion check time, we may remove the expression that never triggers.

Tow away the dumpster

The vNodesDisconnected can go entirely. We will keep the stale nodes in vNodes instead, and they won't stay there for long.

None of the TRY_CRITICAL_BLOCK's were necessary, because:

  • The whole dumpster-loading and dumpster-unloading section is surrounded by the cs_vNodes lock
  • Either an outer cs_vNodes has us covered where the other threads may hold these locks (ThreadRPCServer() and ThreadBitcoinMiner(), which are only concerned with relaying either the user's transactions or freshly self-mined blocks), or
  • ThreadMessageHandler2() is either sleeping, waiting for cs_vNodes, or signalling its own use with an increased ref count

We keep the zero ref count condition, under which the final deallocation may happen, like before.

The outbound/inbound flag check is pointless, because no node may show up here that has both flags false. The only three places where the CNode constructor is called are:

The 1) removal from vNodes, and 2) CNode deallocation, will happen together, under the required conditions. Closing the TCP socket and other cleanup should happen without delay, even if the final deletion has to wait until ThreadMessageHandler2() is done with its current iteration.

This whole dumpster hauling change is highlighted by the patch lines 70, and 78-118.

Finally, ThreadMessageHandler2() and ThreadSocketHandler2() could drift out of sync for a while, so that the deletion section doesn't immediately coincide with the message handler's sleeping period. For this reason, we're probably better off not letting already-disconnected nodes get unnecessarily locked down by the ref count guard (patch lines 188-198).
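The resulting coordination can be modeled in isolation. This is my own sketch, not TRB code: std::mutex stands in for the cs_vNodes critical section, the names mirror the patch, and the logic is simplified down to the ref-count handshake between the two threads.

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

// Minimal model: the message handler takes a ref-guarded snapshot of the
// live nodes; the socket handler deletes a disconnected node only once
// its ref count has dropped to zero.
struct CNode {
    int  nRefCount   = 0;
    bool fDisconnect = false;
    void AddRef()  { nRefCount++; }
    void Release() { nRefCount--; }
    int  GetRefCount() const { return nRefCount; }
};

std::mutex cs_vNodes;           // stands in for TRB's CRITICAL_BLOCK(cs_vNodes)
std::vector<CNode*> vNodes;

// ThreadMessageHandler2() side: snapshot live nodes, guarded by ref count.
std::vector<CNode*> TakeGuardedCopy() {
    std::lock_guard<std::mutex> lock(cs_vNodes);
    std::vector<CNode*> vNodesCopy;
    for (CNode* pnode : vNodes)
        if (!pnode->fDisconnect) {
            pnode->AddRef();               // guard against premature delete
            vNodesCopy.push_back(pnode);
        }
    return vNodesCopy;
}

void ReleaseGuardedCopy(std::vector<CNode*>& vNodesCopy) {
    std::lock_guard<std::mutex> lock(cs_vNodes);
    for (CNode* pnode : vNodesCopy)
        pnode->Release();                  // now safe to delete, if must
}

// ThreadSocketHandler2() side: delete only unreferenced, disconnected nodes.
void ReapDisconnected() {
    std::lock_guard<std::mutex> lock(cs_vNodes);
    for (CNode* pnode : std::vector<CNode*>(vNodes))
        if (pnode->fDisconnect && pnode->GetRefCount() <= 0) {
            vNodes.erase(std::remove(vNodes.begin(), vNodes.end(), pnode),
                         vNodes.end());
            delete pnode;
        }
}
```

A disconnected node held in a guarded copy survives ReapDisconnected() until the copy is released, which is the whole point of keeping the ref counter at all.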

In conclusion

Simple as that! Right?

The patch

The patch file and my corresponding signature.

1 diff -uNr a/bitcoin/manifest b/bitcoin/manifest
2 --- a/bitcoin/manifest 1b686bcb52a6a5df9ee7db45315e32fd9b90ebff8783cde2aec664538b6722a972be2c060c4dc97eb5138f454413a2df670ed361b120bfa43acba686aeb9a54f
3 +++ b/bitcoin/manifest 2e02a09e85de234b0523bb1b280cd7838355b5a2fe20984c4f295327fe31980303ea83ad4b84b995e424bbf24184f1b8fb4ead67b516eae0ba852f992ad358ce
4 @@ -32,3 +32,4 @@
5 616451 mod6_phexdigit_fix mod6 Adds missing comma to separate values in the phexdigit array in util.cpp.
6 617254 mod6_excise_hash_truncation mod6 Regrind of ben_vulpes original; Removes truncation of hashes printed to TRB log file
7 617255 mod6_whogaveblox mod6 Regrind of asciilifeform original; Record the origin of every incoming candidate block (whether accepted or rejected)
8 +697539 cgra_no_peer_dumpster cgra Don't store stale peers; Trim down 'ref counter'
9 diff -uNr a/bitcoin/src/net.cpp b/bitcoin/src/net.cpp
10 --- a/bitcoin/src/net.cpp 6d7c7634cce09792942fc69cf945b64e8f8e77b496ad6bbaed12dccbc5e13c31fc2f1162735cae7951893fb6b3635955703059b1e8ba1607cc483c494c0a126c
11 +++ b/bitcoin/src/net.cpp f35e0d35dad6b268d20b025c3d49875c8e8c78f254caa20b2c88952f1719b83216d9cf7a6c99eb4045387d4abe27bf2c12bebbb189d719c61d39074e9c7705bb
12 @@ -18,7 +18,7 @@
13 void ThreadMessageHandler2(void* parg);
14 void ThreadSocketHandler2(void* parg);
15 void ThreadOpenConnections2(void* parg);
16 -bool OpenNetworkConnection(const CAddress& addrConnect);
17 +void OpenNetworkConnection(const CAddress& addrConnect);
18
19
20
21 @@ -446,22 +446,8 @@
22 return NULL;
23 }
24
25 -CNode* ConnectNode(CAddress addrConnect, int64 nTimeout)
26 +void ConnectNode(CAddress addrConnect)
27 {
28 - if (addrConnect.ip == addrLocalHost.ip)
29 - return NULL;
30 -
31 - // Look for an existing connection
32 - CNode* pnode = FindNode(addrConnect.ip);
33 - if (pnode)
34 - {
35 - if (nTimeout != 0)
36 - pnode->AddRef(nTimeout);
37 - else
38 - pnode->AddRef();
39 - return pnode;
40 - }
41 -
42 /// debug print
43 printf("trying connection %s lastseen=%.1fhrs lasttry=%.1fhrs\n",
44 addrConnect.ToString().c_str(),
45 @@ -484,19 +470,9 @@
46
47 // Add node
48 CNode* pnode = new CNode(hSocket, addrConnect, false);
49 - if (nTimeout != 0)
50 - pnode->AddRef(nTimeout);
51 - else
52 - pnode->AddRef();
53 + pnode->fNetworkNode = true;
54 CRITICAL_BLOCK(cs_vNodes)
55 vNodes.push_back(pnode);
56 -
57 - pnode->nTimeConnected = GetTime();
58 - return pnode;
59 - }
60 - else
61 - {
62 - return NULL;
63 }
64 }
65
66 @@ -604,7 +580,6 @@
67 void ThreadSocketHandler2(void* parg)
68 {
69 printf("ThreadSocketHandler started\n");
70 - list vNodesDisconnected;
71 int nPrevNodeCount = 0;
72
73 loop
74 @@ -618,40 +593,19 @@
75 vector vNodesCopy = vNodes;
76 BOOST_FOREACH(CNode* pnode, vNodesCopy)
77 {
78 - if (pnode->fDisconnect ||
79 - (pnode->GetRefCount() <= 0 && pnode->vRecv.empty() && pnode->vSend.empty()))
80 + if (pnode->fDisconnect)
81 {
82 - // remove from vNodes
83 - vNodes.erase(remove(vNodes.begin(), vNodes.end(), pnode), vNodes.end());
84 -
85 // close socket and cleanup
86 pnode->CloseSocketDisconnect();
87 pnode->Cleanup();
88
89 - // hold in disconnected pool until all refs are released
90 - pnode->nReleaseTime = max(pnode->nReleaseTime, GetTime() + 15 * 60);
91 - if (pnode->fNetworkNode || pnode->fInbound)
92 - pnode->Release();
93 - vNodesDisconnected.push_back(pnode);
94 - }
95 - }
96 -
97 - // Delete disconnected nodes
98 - list vNodesDisconnectedCopy = vNodesDisconnected;
99 - BOOST_FOREACH(CNode* pnode, vNodesDisconnectedCopy)
100 - {
101 - // wait until threads are done using it
102 - if (pnode->GetRefCount() <= 0)
103 - {
104 - bool fDelete = false;
105 - TRY_CRITICAL_BLOCK(pnode->cs_vSend)
106 - TRY_CRITICAL_BLOCK(pnode->cs_vRecv)
107 - TRY_CRITICAL_BLOCK(pnode->cs_mapRequests)
108 - TRY_CRITICAL_BLOCK(pnode->cs_inventory)
109 - fDelete = true;
110 - if (fDelete)
111 + // is ThreadMessageHandler2() also done with this node?
112 + if (pnode->GetRefCount() <= 0)
113 {
114 - vNodesDisconnected.remove(pnode);
115 + // remove from vNodes
116 + vNodes.erase(
117 + remove(vNodes.begin(), vNodes.end(), pnode),
118 + vNodes.end());
119 delete pnode;
120 }
121 }
122 @@ -754,7 +708,6 @@
123 {
124 printf("accepted connection %s\n", addr.ToString().c_str());
125 CNode* pnode = new CNode(hSocket, addr, true);
126 - pnode->AddRef();
127 CRITICAL_BLOCK(cs_vNodes)
128 vNodes.push_back(pnode);
129 }
130 @@ -768,8 +721,6 @@
131 CRITICAL_BLOCK(cs_vNodes)
132 {
133 vNodesCopy = vNodes;
134 - BOOST_FOREACH(CNode* pnode, vNodesCopy)
135 - pnode->AddRef();
136 }
137 BOOST_FOREACH(CNode* pnode, vNodesCopy)
138 {
139 @@ -886,11 +837,6 @@
140 }
141 }
142 }
143 - CRITICAL_BLOCK(cs_vNodes)
144 - {
145 - BOOST_FOREACH(CNode* pnode, vNodesCopy)
146 - pnode->Release();
147 - }
148
149 Sleep(10);
150 }
151 @@ -1055,27 +1001,22 @@
152 }
153 }
154
155 -bool OpenNetworkConnection(const CAddress& addrConnect)
156 +void OpenNetworkConnection(const CAddress& addrConnect)
157 {
158 //
159 // Initiate outbound network connection
160 //
161 if (fShutdown)
162 - return false;
163 + return;
164 if (addrConnect.ip == addrLocalHost.ip || !addrConnect.IsIPv4() ||
165 FindNode(addrConnect.ip) || CNode::IsBanned(addrConnect.ip))
166 - return false;
167 + return;
168
169 vnThreadsRunning[1]--;
170 - CNode* pnode = ConnectNode(addrConnect);
171 + ConnectNode(addrConnect);
172 vnThreadsRunning[1]++;
173 if (fShutdown)
174 - return false;
175 - if (!pnode)
176 - return false;
177 - pnode->fNetworkNode = true;
178 -
179 - return true;
180 + return;
181 }
182
183
184 @@ -1113,9 +1054,14 @@
185 vector vNodesCopy;
186 CRITICAL_BLOCK(cs_vNodes)
187 {
188 - vNodesCopy = vNodes;
189 - BOOST_FOREACH(CNode* pnode, vNodesCopy)
190 - pnode->AddRef();
191 + BOOST_FOREACH(CNode* pnode, vNodes)
192 + {
193 + if (!pnode->fDisconnect)
194 + {
195 + pnode->AddRef(); // guard against premature CNode delete
196 + vNodesCopy.push_back(pnode);
197 + }
198 + }
199 }
200
201 // Poll the connected nodes for messages
202 @@ -1140,7 +1086,7 @@
203 CRITICAL_BLOCK(cs_vNodes)
204 {
205 BOOST_FOREACH(CNode* pnode, vNodesCopy)
206 - pnode->Release();
207 + pnode->Release(); // now safe to delete CNode, if must
208 }
209
210 // Wait and allow messages to bunch up.
211 diff -uNr a/bitcoin/src/net.h b/bitcoin/src/net.h
212 --- a/bitcoin/src/net.h 492c9cc92a504bb8174d75fafcbee6980986182a459efc9bfa1d64766320d98ba2fa971d78d00a777c6cc50f82a5d424997927378e99738b1b3b550bdaa727f7
213 +++ b/bitcoin/src/net.h 48c582f558747cae0d621146500651c3cdbf770154dec5b528e87f6340c08620cf02a705d6b3703d36e28deca5d62faa73b1eb9a7e9bc45c203091d7078ebcd7
214 @@ -31,7 +31,7 @@
215 bool AddAddress(CAddress addr, int64 nTimePenalty=0, CAddrDB *pAddrDB=NULL);
216 void AddressCurrentlyConnected(const CAddress& addr);
217 CNode* FindNode(unsigned int ip);
218 -CNode* ConnectNode(CAddress addrConnect, int64 nTimeout=0);
219 +void ConnectNode(CAddress addrConnect);
220 void AbandonRequests(void (*fn)(void*, CDataStream&), void* param1);
221 bool AnySubscribed(unsigned int nChannel);
222 void MapPort(bool fMapPort);
223 @@ -126,7 +126,6 @@
224 int nMisbehavior;
225
226 public:
227 - int64 nReleaseTime;
228 std::map mapRequests;
229 CCriticalSection cs_mapRequests;
230 uint256 hashContinue;
231 @@ -176,7 +175,6 @@
232 fSuccessfullyConnected = false;
233 fDisconnect = false;
234 nRefCount = 0;
235 - nReleaseTime = 0;
236 hashContinue = 0;
237 nStartingHeight = -1;
238 fGetAddr = false;
239 @@ -205,16 +203,12 @@
240
241 int GetRefCount()
242 {
243 - return std::max(nRefCount, 0) + (GetTime() < nReleaseTime ? 1 : 0);
244 + return nRefCount;
245 }
246
247 - CNode* AddRef(int64 nTimeout=0)
248 + void AddRef()
249 {
250 - if (nTimeout != 0)
251 - nReleaseTime = std::max(nReleaseTime, GetTime() + nTimeout);
252 - else
253 - nRefCount++;
254 - return this;
255 + nRefCount++;
256 }
257
258 void Release()
259
  1. See bitcoin/src/net.h, declaration of class CNode
  2. See bitcoin/src/net.cpp

2021-08-22

TRB Homework, Part N: Threads

Filed under: TRB — cgra @ 11:11:20

I was in the process of writing an explanation for a TRB patch. The changes in question touched TRB's concurrency mechanics, which I thought I had already properly taken into account, but then realized this might not really have been the case. So it was time for some TRB homework: The Threads.

The summary of my study below is based mostly on these sources:

I also found Ada's approach to concurrency1 an interesting, and contrasting, read.

Anyway...

The next time I see multi-threaded code, I will remember these four things:

I. Compiler and CPU do not always follow the programmer-specified execution order

The compiler may optimize/reorder code, making it vastly different from what the programmer specified.

The CPU may also reorder code, again, making it different from what the programmer (and compiler!) specified.

The only guiding principle is that the reorganized program's effect must be exactly the same as what the programmer-specified variant would result in. But this holds only in the single-threaded case!

II. Nothing must be assumed of an individual thread's relative execution speed

Concurrently running threads' instructions interleave, but there's roughly no guarantee at all of how, exactly. If point I above didn't exist, the only guarantee would be that each thread's instructions execute in the specified order -- for that thread only.

III. Compiler and CPU memory barriers are required for thread synchronization

Compilers and CPUs provide2 memory "barriers" or "fences" for the programmer to control the said reordering mechanisms. These are markers, or machine instructions, placed in code where no reordering across that point should happen, if it would change the running program's current memory state -- compared to what it would've been without any reordering.

In practice, certain implicit compiler barriers also exist, such as an external call. The compiler cannot reliably predict the external function's side effects, so it must sync up with the memory state of the programmer-specified program once before resuming its improv-reordering solo after the external call has returned.

IV. Usually, simple reads or writes of primitive data types are atomic

From multi-threaded programming's point of view, neither C++ nor GCC, AFAIK, guarantees atomic handling3 for any piece of data, not even for a bool.

CPU manufacturers, on the other hand, specify4 which of their CPUs' memory operations are atomic. In x86_64 CPUs, for example, reads and writes up to 64 bits are atomic, if properly aligned.

In practice, handling of the "usual primitive" C++ variables5 just happens to be atomic for each individual read or write -- and only for that. For example, incrementing a variable is likely not atomic, because it's not a simple assignment or a value read.

A couple of practical experiments

A C++ piece, compiled with -O2 optimization:

#include <stdio.h>

int main()
{
    unsigned int cnt = 0;
    for (int i=0; i<0xFFFFFF; i++)
        cnt++;
    printf("%d\n", cnt);
    return 0;
}

...and it could result in the following machine code, like in my case:

...
0000000000400450 <main>:
  400450: 48 83 ec 08             sub    $0x8,%rsp
  400454: ba ff ff ff 00          mov    $0xffffff,%edx
  400459: be f4 05 40 00          mov    $0x4005f4,%esi
  40045e: bf 01 00 00 00          mov    $0x1,%edi
  400463: 31 c0                   xor    %eax,%eax
  400465: e8 c6 ff ff ff          callq  400430 <__printf_chk@plt>
...

Notice that there's no loop at all! Here the compiler is not merely reordering, but entirely omitting -- the point I's compiler in full swing.

I also tried out a CPU reordering demonstration, which I copied with some changes, below:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

// Set one of these to 1 to prevent CPU reordering
#define USE_CPU_FENCE 0
#define USE_SINGLE_HW_THREAD 0

#if USE_SINGLE_HW_THREAD
#include <sched.h>
#endif


sem_t beginSema1, beginSema2, endSema;
int X, Y, r1, r2;

inline void fence()
{
#if USE_CPU_FENCE
    // Prevent CPU (and compiler) reordering across this barrier/fence
    asm volatile("mfence" ::: "memory");
#else
    // Prevent just the compiler reordering across this barrier/fence
    asm volatile("" ::: "memory");
#endif
}

void *thread1Func(void *param)
{
    for (;;)
    {
        sem_wait(&beginSema1); // Wait for signal

        // ----- THE TRANSACTION! -----
        X = 1;
        fence();
        r1 = Y;

        sem_post(&endSema); // Notify transaction complete
    }
}

void *thread2Func(void *param)
{
    for (;;)
    {
        sem_wait(&beginSema2); // Wait for signal

        // ----- THE TRANSACTION! -----
        Y = 1;
        fence();
        r2 = X;

        sem_post(&endSema); // Notify transaction complete
    }
}

int main()
{
    // Initialize the semaphores
    sem_init(&beginSema1, 0, 0);
    sem_init(&beginSema2, 0, 0);
    sem_init(&endSema, 0, 0);

    // Spawn the threads
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread1Func, NULL);
    pthread_create(&thread2, NULL, thread2Func, NULL);

#if USE_SINGLE_HW_THREAD
    // Force thread affinities to the same cpu core.
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);
    pthread_setaffinity_np(thread1, sizeof(cpu_set_t), &cpus);
    pthread_setaffinity_np(thread2, sizeof(cpu_set_t), &cpus);
#endif

    // Repeat the experiment ad infinitum
    int detected = 0;
    for (int iterations = 1; ; iterations++)
    {
        // Reset X and Y
        X = 0;
        Y = 0;
        // Signal both threads
        sem_post(&beginSema1);
        sem_post(&beginSema2);

        // Wait for both threads
        sem_wait(&endSema);
        sem_wait(&endSema);

        // Check if there was a simultaneous reorder
        if (r1 == 0 && r2 == 0)
        {
            detected++;
            printf("%d reorders detected after %d iterations\n",
                   detected, iterations);
        }
    }
    return 0; // Never returns
}

On my machine, it behaved exactly as advertised. I also removed the randomization6, and it still behaved the same. So, again, the point I's CPU in full swing.

TRB locks

Underneath TRB's critical section construct lies the Linux libc pthread library. When a critical section starts, pthread_mutex_lock() is executed; equally, when a critical section ends, pthread_mutex_unlock() is called. A pthread_mutex_trylock() variation also exists.
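As a rough model (my own sketch, not TRB's exact macro expansion, and the names are mine), the critical section construct boils down to a scoped lock object whose constructor and destructor bracket the section with those pthread calls:

```cpp
#include <pthread.h>

// Illustrative scoped lock: locks on construction, unlocks on destruction.
class CScopedLock
{
    pthread_mutex_t* pm;
public:
    explicit CScopedLock(pthread_mutex_t* m) : pm(m) { pthread_mutex_lock(pm); }
    ~CScopedLock() { pthread_mutex_unlock(pm); }
};

pthread_mutex_t cs_counter = PTHREAD_MUTEX_INITIALIZER;
int nCounter = 0;

void Increment()
{
    CScopedLock lock(&cs_counter); // pthread_mutex_lock() on entry
    nCounter++;                    // the protected section
}                                  // pthread_mutex_unlock() on scope exit
```

The scope-exit unlock is what makes the construct robust against early returns out of the middle of a critical section.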

I dug into TRB's musl libc, looking for clues as to whether pthread_mutex_lock() or pthread_mutex_unlock() includes a CPU memory barrier, and what such a barrier would look like. I found the Linux kernel "futex" syscall, which I dug into further, but could only find suggestive material, and couldn't properly verify the case. The closest I got was that the futex syscall, for x86_64, may contain CPU instructions like:

  • LOCK XCHGL
  • LOCK XADDL
  • LOCK ORL
  • LOCK ANDL
  • LOCK XORL

Because the pthread lock calls are external calls, that alone likely makes them implicit compiler memory barriers.

Finally

Another type of concurrency issue...

  1. See "Tasks and Synchronization"
  2. Or at least, should provide
  3. Atomicity matters in the sense that, for example, you only want to read either: 1) an old value of an integer, or 2) a new value, but not 3) a whatever-bit-mixture of both, because the new value's write was in progress at the moment of the read.
  4. Again, AFAIK.
  5. bool, char, int, long: signed/unsigned; float, double
  6. Shown in the original version
