What follows, should be considered EXPERIMENTAL, because:
- I could be wrong,
- brain fart this thick is difficult to breathe,
- nobody else looked yet, and
- it's not been extensively tested.
Let us proceed. We will first analyze the current situation, then have a look at the proposed solution.
The reader may first want to have a careful look at the patch contents below this article, then resume here to read the whole description.
Current situation
TRB has a bunch of (here, here, and here) out-of-memory DoS issues. This article will attempt to address one specific issue in that group, which is the vNodesDisconnected
dumpster, where every disconnected CNode
is put on hold for 15 minutes, after being disconnected, and before actually deallocated.
The dumpster mechanism is bad, because:
- It's not necessary (elaborated later), and
- It enables an attacker to waste the target node's RAM just by repeatedly connecting and disconnecting.
Brain Farts of Satoshi, episode ∞ - Ref Counting
Tied to the dumpster mechanism are1: CNode
's AddRef()
, Release()
, GetRefCount()
, nRefCount
, and nReleaseTime
.
AddRef()
and Release()
calls should go in pairs, otherwise nRefCount
could remain positive, indefinitely. In such cases, the node wouldn't ever get removed from vNodesDisconnected
, and deallocated.
These refs are adjusted as follows:
- Increment and decrement for each peer, when beginning and ending a message processing loop in
ThreadMessageHandler2()
thread. - Increment and decrement for each peer, when beginning and ending a
send()
/recv()
handling loop inThreadSocketHandler2()
thread. - Increment on peer's connect-time, when peer gets added to
vNodes
. Both, inbound and outbound. - Decrement when disconnected node is moved from
vNodes
tovNodesDisconnected
. - Increment in a corner case where simultaneously connecting inbound, and outbound, to the same peer, resulting in sharing a
CNode
object without separate inbound/outbound connections. See next.
In ConnectNode()
, an unbalanced AddRef()
is theoretically possible: If new inbound peer was connected after the FindNode()
call in OpenNetworkConnection()
, and before the FindNode()
call in ConnectNode()
, and also happened to be the same peer our node was about to connect to (ie. ConnectNode()
's parameter addrConnect
). Here, the inbound and outbound connection attempts would end up peculiarly sharing the same CNode
object.
Otherwise, starting from the last FindNode()
call, nothing prevents a peer from creating multiple connections to our node, and if we had an outbound connection already to the peer, we could end up with separate inbound and outbound connections to the same (unlike above, not a shared CNode
object).
In the dumpster filling section is an expression that never triggers. The ref count increments for new connections (inbound, outbound) are only balanced with their equivalent decrements after triggering the expression, which is impossible!
Therefore, ref counts are only ever productively read by the check: whether to already delete a disconnected node, or not.
Ref counting is mostly pointless
Because the ref count reading happens in the ThreadSocketHandler2()
thread, it's entirely pointless to adjust ref counts in the send()
/recv
handler (increment, decrement). The ref count reading will never happen at the same time -- both are in different positions of the same thread.
Adjusting the ref count, when peer is being added to, or removed from vNodes
(new inbound, new outbound; dumpster load), is similarly pointless, but serves one distinct purpose: Prevent a theoretical risk of a dangling pointer dereference on two rows:
- Patch line 57, a redundant timestamp assignment, and
- patch line 177, an outbound-node-flag assignment, which could easily reside in a thread-safe position.
The last remaining, and within the scope of this patch, fundamentally the only, useful place to adjust the ref count is the ThreadMessageHandler2()
's loop. It signals the deletion code in ThreadSocketHandler2()
thread that whether the message handler thread is still actively handling the peer data, or not.
The 15 min dumpster cooldown of disconnected nodes doesn't add to the ref counting anything else, but a loose, extra protection against the marginal dangling pointer dereference risk in OpenNetworkConnection()
, at patch line 177: The second FindNode()
call could find an existing peer, which the ThreadSocketHandler2()
could delete immediately after - now we have a dangling pnode
pointer.
Proposed changes
Removals: redundant outbound validation, unbalanced ref count increment, redundant assignment
Repeatedly validating outbound connections (patch lines 164, 165, 28, and 32) is useless, and for the FindNode()
check especially, because the acceptance of inbound connections from a single peer IP address is not limited to one only.
The AddRef() by the second FindNode()
position is not balanced anywhere, and could cause a stale peer to stick till node restart.
Also, assignment at patch line 57 is already done in CNode
constructor, so is redundant. And besides, it marginally risks a dangling pointer dereference, as previously noted.
For these reasons we may remove lines from ConnectNode()
, marked by patch lines from 28 to 41, and line 57.
Simplified outbound connection init
We may simplify the OpenNetworkConnection()
function to return void
, instead of bool
, because the return value isn't used anywhere, and it further helps cleaning up the whole function.
The outbound-node-flag assignment from OpenNetworkConnection()
should be moved to happen before the new peer gets added to vNodes
in ConnectNode()
, to avoid one of the marginal risks of a dangling pointer dereference noted earlier.
ConnectNode()
would convey it's meaning more accurately, if it didn't return a CNode
pointer. Specifically, we don't want to do anything with a node object, when outside of cs_vNodes
/ref count guard, to avoid further dangling pointers. So we change the return type to void
.
ConnectNode()
parameter nTimeout
is always 0, so we bolt the value down.
These changes are highlighted by patch lines: 16, 17, 25, 26, 53, 56-62, 155-180, 218 and 219.
Reduce the ref counting to required minimum
Because the AddRef()
/ Release()
/ GetRefCount()
only make sense in coordinating the stale peer deletion moment, between ThreadMessageHandler2()
and ThreadSocketHandler2()
, we remove all the other uses:
- From
ConnectNode()
outbound connection initialization, - From
ThreadSocketHandler2()
send()
/recv()
loop (before loop, after loop), and - From
ThreadSocketHandler2()
inbound connection initialization - From
ThreadSocketHandler2()
dumpster-loading connection ref count decrement
AddRef()
's nTimeout
parameter is never used, we may remove it.
The nReleaseTime
will not be used, and must go (patch lines 90, 227, 235 and 250-255). We previously fixed the corner case it was loosely originally protecting -- nReleaseTime
won't bed missed.
Because ref count is never 0 at deletion check time, we may remove the expression that never triggers.
Tow away the dumpster
The vNodesDisconnected
can go entirely. We will keep the stale nodes in vNodes
instead, and they won't stay there for long.
None of the TRY_CRITICAL_BLOCK
's were necessary, because:
- The whole dumpster-loading and dumpster-unloading section is surrounded by the
cs_vNodes
lock - Either an outer
cs_vNodes
has us covered, where the other threads may hold these locks (ThreadRCPServer()
andThreadBitcoinMiner()
, which only are concerned with relaying either user's transactions, or freshly self-mined blocks), or ThreadMessageHandler2()
is either sleeping, waiting forcs_vNodes
, or signalling own use with an increased ref count
We keep the zero ref count condition, under which the final deallocation may happen, like before.
The outbound/inbound flag check is pointless, because no node may show up here that has both flags false. The only three places where CNode
constructor is called, are:
- New inbound peer connection, where
fInbound
will become true, - new outbound peer connection, where
fNetworkNode
will become true, and - a global
pnodeLocalHost
object allocation inStartNode()
2, which is a redundant object, doing nothing of importance.
The 1) removal from vNodes
, and 2) CNode
deallocation, will happen together, under required conditions. Closing the TCP socket and other cleanup should happen without delay, even if the final deletion had to wait until ThreadMessageHandler2()
is done with it's current iteration.
This whole dumpster hauling change is highlighted by the patch lines 70, and 78-118.
Finally, ThreadMessageHandler2()
and ThreadSocketHandler2()
threads could remain off-sync for a while so that the deletion section didn't immediately coincide with the message handler's sleeping period. For this reason, we're probably better off not letting already disconnected nodes to unnecessarily get locked down by the ref count guard (patch lines 188-198).
In conclusion
Simple as that! Right?
The patch
The patch file and my corresponding signature.
1 | diff -uNr a/bitcoin/manifest b/bitcoin/manifest |
2 | |
3 | |
4 | |
5 | 616451 mod6_phexdigit_fix mod6 Adds missing comma to separate values in the phexdigit array in util.cpp. |
6 | 617254 mod6_excise_hash_truncation mod6 Regrind of ben_vulpes original; Removes truncation of hashes printed to TRB log file |
7 | 617255 mod6_whogaveblox mod6 Regrind of asciilifeform original; Record the origin of every incoming candidate block (whether accepted or rejected) |
8 | +697539 cgra_no_peer_dumpster cgra Don't store stale peers; Trim down 'ref counter' |
9 | diff -uNr a/bitcoin/src/net.cpp b/bitcoin/src/net.cpp |
10 | |
11 | |
12 | |
13 | void ThreadMessageHandler2(void* parg); |
14 | void ThreadSocketHandler2(void* parg); |
15 | void ThreadOpenConnections2(void* parg); |
16 | -bool OpenNetworkConnection(const CAddress& addrConnect); |
17 | +void OpenNetworkConnection(const CAddress& addrConnect); |
18 | |
19 | |
20 | |
21 | |
22 | return NULL; |
23 | } |
24 | |
25 | -CNode* ConnectNode(CAddress addrConnect, int64 nTimeout) |
26 | +void ConnectNode(CAddress addrConnect) |
27 | { |
28 | - if (addrConnect.ip == addrLocalHost.ip) |
29 | - return NULL; |
30 | - |
31 | - // Look for an existing connection |
32 | - CNode* pnode = FindNode(addrConnect.ip); |
33 | - if (pnode) |
34 | - { |
35 | - if (nTimeout != 0) |
36 | - pnode->AddRef(nTimeout); |
37 | - else |
38 | - pnode->AddRef(); |
39 | - return pnode; |
40 | - } |
41 | - |
42 | /// debug print |
43 | printf("trying connection %s lastseen=%.1fhrs lasttry=%.1fhrs\n", |
44 | addrConnect.ToString().c_str(), |
45 | |
46 | |
47 | // Add node |
48 | CNode* pnode = new CNode(hSocket, addrConnect, false); |
49 | - if (nTimeout != 0) |
50 | - pnode->AddRef(nTimeout); |
51 | - else |
52 | - pnode->AddRef(); |
53 | + pnode->fNetworkNode = true; |
54 | CRITICAL_BLOCK(cs_vNodes) |
55 | vNodes.push_back(pnode); |
56 | - |
57 | - pnode->nTimeConnected = GetTime(); |
58 | - return pnode; |
59 | - } |
60 | - else |
61 | - { |
62 | - return NULL; |
63 | } |
64 | } |
65 | |
66 | |
67 | void ThreadSocketHandler2(void* parg) |
68 | { |
69 | printf("ThreadSocketHandler started\n"); |
70 | - list |
71 | int nPrevNodeCount = 0; |
72 | |
73 | loop |
74 | |
75 | vector |
76 | BOOST_FOREACH(CNode* pnode, vNodesCopy) |
77 | { |
78 | - if (pnode->fDisconnect || |
79 | - (pnode->GetRefCount() <= 0 && pnode->vRecv.empty() && pnode->vSend.empty())) |
80 | + if (pnode->fDisconnect) |
81 | { |
82 | - // remove from vNodes |
83 | - vNodes.erase(remove(vNodes.begin(), vNodes.end(), pnode), vNodes.end()); |
84 | - |
85 | // close socket and cleanup |
86 | pnode->CloseSocketDisconnect(); |
87 | pnode->Cleanup(); |
88 | |
89 | - // hold in disconnected pool until all refs are released |
90 | - pnode->nReleaseTime = max(pnode->nReleaseTime, GetTime() + 15 * 60); |
91 | - if (pnode->fNetworkNode || pnode->fInbound) |
92 | - pnode->Release(); |
93 | - vNodesDisconnected.push_back(pnode); |
94 | - } |
95 | - } |
96 | - |
97 | - // Delete disconnected nodes |
98 | - list |
99 | - BOOST_FOREACH(CNode* pnode, vNodesDisconnectedCopy) |
100 | - { |
101 | - // wait until threads are done using it |
102 | - if (pnode->GetRefCount() <= 0) |
103 | - { |
104 | - bool fDelete = false; |
105 | - TRY_CRITICAL_BLOCK(pnode->cs_vSend) |
106 | - TRY_CRITICAL_BLOCK(pnode->cs_vRecv) |
107 | - TRY_CRITICAL_BLOCK(pnode->cs_mapRequests) |
108 | - TRY_CRITICAL_BLOCK(pnode->cs_inventory) |
109 | - fDelete = true; |
110 | - if (fDelete) |
111 | + // is ThreadMessageHandler2() also done with this node? |
112 | + if (pnode->GetRefCount() <= 0) |
113 | { |
114 | - vNodesDisconnected.remove(pnode); |
115 | + // remove from vNodes |
116 | + vNodes.erase( |
117 | + remove(vNodes.begin(), vNodes.end(), pnode), |
118 | + vNodes.end()); |
119 | delete pnode; |
120 | } |
121 | } |
122 | |
123 | { |
124 | printf("accepted connection %s\n", addr.ToString().c_str()); |
125 | CNode* pnode = new CNode(hSocket, addr, true); |
126 | - pnode->AddRef(); |
127 | CRITICAL_BLOCK(cs_vNodes) |
128 | vNodes.push_back(pnode); |
129 | } |
130 | |
131 | CRITICAL_BLOCK(cs_vNodes) |
132 | { |
133 | vNodesCopy = vNodes; |
134 | - BOOST_FOREACH(CNode* pnode, vNodesCopy) |
135 | - pnode->AddRef(); |
136 | } |
137 | BOOST_FOREACH(CNode* pnode, vNodesCopy) |
138 | { |
139 | |
140 | } |
141 | } |
142 | } |
143 | - CRITICAL_BLOCK(cs_vNodes) |
144 | - { |
145 | - BOOST_FOREACH(CNode* pnode, vNodesCopy) |
146 | - pnode->Release(); |
147 | - } |
148 | |
149 | Sleep(10); |
150 | } |
151 | |
152 | } |
153 | } |
154 | |
155 | -bool OpenNetworkConnection(const CAddress& addrConnect) |
156 | +void OpenNetworkConnection(const CAddress& addrConnect) |
157 | { |
158 | // |
159 | // Initiate outbound network connection |
160 | // |
161 | if (fShutdown) |
162 | - return false; |
163 | + return; |
164 | if (addrConnect.ip == addrLocalHost.ip || !addrConnect.IsIPv4() || |
165 | FindNode(addrConnect.ip) || CNode::IsBanned(addrConnect.ip)) |
166 | - return false; |
167 | + return; |
168 | |
169 | vnThreadsRunning[1]--; |
170 | - CNode* pnode = ConnectNode(addrConnect); |
171 | + ConnectNode(addrConnect); |
172 | vnThreadsRunning[1]++; |
173 | if (fShutdown) |
174 | - return false; |
175 | - if (!pnode) |
176 | - return false; |
177 | - pnode->fNetworkNode = true; |
178 | - |
179 | - return true; |
180 | + return; |
181 | } |
182 | |
183 | |
184 | |
185 | vector |
186 | CRITICAL_BLOCK(cs_vNodes) |
187 | { |
188 | - vNodesCopy = vNodes; |
189 | - BOOST_FOREACH(CNode* pnode, vNodesCopy) |
190 | - pnode->AddRef(); |
191 | + BOOST_FOREACH(CNode* pnode, vNodes) |
192 | + { |
193 | + if (!pnode->fDisconnect) |
194 | + { |
195 | + pnode->AddRef(); // guard against premature CNode delete |
196 | + vNodesCopy.push_back(pnode); |
197 | + } |
198 | + } |
199 | } |
200 | |
201 | // Poll the connected nodes for messages |
202 | |
203 | CRITICAL_BLOCK(cs_vNodes) |
204 | { |
205 | BOOST_FOREACH(CNode* pnode, vNodesCopy) |
206 | - pnode->Release(); |
207 | + pnode->Release(); // now safe to delete CNode, if must |
208 | } |
209 | |
210 | // Wait and allow messages to bunch up. |
211 | diff -uNr a/bitcoin/src/net.h b/bitcoin/src/net.h |
212 | |
213 | |
214 | |
215 | bool AddAddress(CAddress addr, int64 nTimePenalty=0, CAddrDB *pAddrDB=NULL); |
216 | void AddressCurrentlyConnected(const CAddress& addr); |
217 | CNode* FindNode(unsigned int ip); |
218 | -CNode* ConnectNode(CAddress addrConnect, int64 nTimeout=0); |
219 | +void ConnectNode(CAddress addrConnect); |
220 | void AbandonRequests(void (*fn)(void*, CDataStream&), void* param1); |
221 | bool AnySubscribed(unsigned int nChannel); |
222 | void MapPort(bool fMapPort); |
223 | |
224 | int nMisbehavior; |
225 | |
226 | public: |
227 | - int64 nReleaseTime; |
228 | std::map |
229 | CCriticalSection cs_mapRequests; |
230 | uint256 hashContinue; |
231 | |
232 | fSuccessfullyConnected = false; |
233 | fDisconnect = false; |
234 | nRefCount = 0; |
235 | - nReleaseTime = 0; |
236 | hashContinue = 0; |
237 | nStartingHeight = -1; |
238 | fGetAddr = false; |
239 | |
240 | |
241 | int GetRefCount() |
242 | { |
243 | - return std::max(nRefCount, 0) + (GetTime() < nReleaseTime ? 1 : 0); |
244 | + return nRefCount; |
245 | } |
246 | |
247 | - CNode* AddRef(int64 nTimeout=0) |
248 | + void AddRef() |
249 | { |
250 | - if (nTimeout != 0) |
251 | - nReleaseTime = std::max(nReleaseTime, GetTime() + nTimeout); |
252 | - else |
253 | - nRefCount++; |
254 | - return this; |
255 | + nRefCount++; |
256 | } |
257 | |
258 | void Release() |
259 |