Bug fix for obscure shutdown sequence race conditions

2009-12-23

Charles Dickens said it best in A Tale of Two Cities; “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way”.

In short - debugging complex race conditions can be an emotional roller-coaster…

Last week some developers at a client of mine reported a shutdown problem during stress testing their IVR system. The problem was in the API I’d developed for them using The Server Framework. In certain situations they were getting a pure virtual function call error from the connection manager during shutdown. Since I have a policy of ’no known bugs’ where fixing bugs is the most important work that I can be doing (see The Joel Test, item 5) I immediately switched from what I was doing to looking at this problem. At this point I was convinced that it was a simple issue and that it wouldn’t take more than a couple of hours at most to resolve.

At first I thought it was a simple case of failing to wait for the connection manager to stop before allowing its destruction to complete; since the normal model for developing a custom asynchronous client is to derive from the standard CStreamSocketConnectionManager class and also from the IStreamSocketConnectionManagerCallback interface and pass yourself into your connection manager base as the callback interface you must be certain that you wait for the connection manager to shutdown completely in your destructor otherwise the destructor will allow you (and the callback interface) to be destroyed before the connection manager stops during its destruction; in this case there’s a race condition between your destruction and that of your base class…

Anyway, it wasn’t that. But luckily my client’s developers had managed to work out a sequence of breakpoints that made reproduction of the problem easier. I managed to reproduce the problem and the purecall WAS due to the race condition that I detailed above but only because the call that was trying to wait for the connection manager base class to cleanly shut down was throwing an exception during its operation and this meant that we weren’t waiting for the shutdown to complete before the destructor was called… The exception that was being thrown was due to us attempting to increment a socket’s reference count from 0 so that we could dispatch the socket to the I/O pool to provide an asynchronous ‘connection closed’ notification during when we aborted the connection.

During the shutdown of a connection manager object we have to abort all outstanding connections. This is essential as we cannot allow connections to outlive the connection manager that is responsible for them or we end up with connections generating events and calling back into a connection manager that no longer exists. The abort code simply asks the socket allocator to walk the list of active sockets and abort each one. It was during this traversal of the active sockets that the exception was being generated.

The client didn’t have any active socket connections at the time of shutdown; at the point where the shutdown code was being called we had completed all of our connections and released all of our sockets. Unfortunately, the race condition that was causing the problem was actually between the socket’s final call to release reducing the socket’s reference count to zero and removing it from the socket allocators active list and our abort code iterating that list. This final release was happening on a thread on the I/O pool as a read was completed in error due to socket itself being shutdown; whilst this was happening the main thread was happily walking along the socket allocator’s active list and aborting each connection.

Of course my first reaction was to jump into the middle of the Kübler Ross model and try and bargain my way out of the problem. I had skipped through stage 1; denial as the client clearly had a reproducible bug. I’m getting better at skipping stage 2; anger, and I was quite clearly to blame as they’d reduced the issue down to only using code that I’d developed for them… Given my incomplete understanding of the actual problem at that time, I tried to solve the obvious issue. Thus I wasted a day with attempts to allow the abort code to legally increase the reference count from zero in this ‘special’ situation… The reason I took this somewhat strange approach is, I think, because there’s already a situation where this kind of strangeness goes on and it’s directly related to the final release of a socket. During socket destruction we often need to close the socket, closing the socket generates a close notification and this close notification results in the socket being passed back into user code. At the point where we pass the socket back into user code we have a socket with a reference count of zero. Since we’re passing the socket into user code and the user might want to do something with it (a common requirement is to dispatch the close event to a thread pool for processing) you MUST be able to manipulate the reference count of the socket inside the notification callback. To enable this the final release of a socket can legally reanimate the socket if it has just closed the socket by incrementing its reference count from zero to one before calling the notification callback. The socket destruction process then puts itself on hold, and when the socket is released to zero next time the socket is already closed and so can’t be reanimated again and finally gets destroyed. This has been in place for a long time now and works well. Unfortunately no other code can work this magic. I hadn’t fully understood and accepted that at the time; or I had considered it but the changes always seemed so simple that they would obviously solve my problem… I cycled through stages 1-4 with a detour into “elation” every time I was convinced that I’d solved the problem.

Unfortunately my periods of elation were short lived; they were merely small periods of “The best of times” during what was proving to be the age of foolishness… Luckily for me the client’s hardware reproduced the problems much more reliably than my main development machine and so none of this nonsense got into production.

The problem is that the change DID solve the problem of an exception being thrown when the socket had its reference count increased during the dispatch of the close notification but that simply shifted the problem to an access violation when the socket was then accessed again after the dispatch. I’d spent almost a day stuck in a rut trying to solve the symptom rather than discovering the actual problem.

At this point I decided to call it a day. I’d been thrashing around the problem for most of the day and into the night; the client was under pressure because they only had a small window on Saturday when they could put a fix to the problem live at their client’s site. I was tired, they were tired and it was clear to everyone that I wasn’t actually getting anywhere. Before going to bed I sat down once more with the code and tried to look at it with fresh eyes; I didn’t expect to find anything but I knew that if I didn’t at least try then I would be waking up in the night trying to solve the problem. I convinced myself that I’d worked out the (startlingly obvious) underlying reason for the problem and went to bed.

Looking at the problem with fresh eyes in the morning finally brought me the realisation of what was actually wrong (the analysis I’d come up with before bed was correct although slightly incomplete), I also realised that I’d seen the symptoms before…

Whilst ’no known bugs’ is a very important thing to me, and the approach has worked very well over the years, occasionally there are things that show up once in a blue moon and that are nigh on impossible to get a reproducible bug from. I’d recently seen a couple of instances of the “incrementing a reference count from 0 during shutdown” problem in automated test runs on one of my build machines (but only one). The problem showed up during the shutdown of a multiple client test harness program that I run as part of my build tests for The Server Framework examples. I have around 55 example servers that ship as part of the licensed framework and most of these have the client test run against them during the build, each of these is built on between two and three build machines, with different operating systems, different numbers of CPU cores, using 5 different compilers and 5 different build configurations. All of the source is rebuilt at least 3 or 4 times on the build machines prior to a new release (mainly as last minute changes sneak in) and considerably more often during normal development. In all of that time I’ve seen this issue only a couple of times. It was hard to reproduce and only one build machine seemed to cause it to occur at all. I’d previously thought that it was due to another bug in the multiple client application but that had been fixed a couple of releases ago…

Anyway. The root cause of the problem was the rather obvious fact that the code that iterates the socket allocators list of active sockets and calls abort on them is doing so without owning a reference to the socket that it’s working with. The socket allocator or connection manager can’t own a reference to the socket as that would screw with the whole way that the reference counting scheme works. Although I had locked the socket allocator’s active socket list, so the destruction of a socket couldn’t actually complete whilst the abort was working and although I had locked the socket’s own lock before manipulating it there was nothing to stop a socket getting to the point where it was very very nearly destroyed (a reference count of zero with the thread blocked in the final stages of destruction on the socket’s lock. As soon as the abort completed the socket would be destroyed. Trying to be clever and increment the reference count from zero so that I could post a notification in this situation was doomed from the very beginning.

At this point I should, perhaps, point out that there are race conditions and there are race conditions… Sometimes it doesn’t matter which order things complete in and it’s quite OK not to know. There are ways of manipulating some parts of an object’s state without needing to lock the entire object; a subset of state is manipulated atomically in one thread and another subset is manipulated atomically in another. In these situations it’s often unknown which thread might complete its work first; it doesn’t matter as long as all combinations of completion order result in an object that is in a stable state. Since each thread is manipulating its subset of state in an atomic fashion this is the case.

There’s deliberately no locking in the socket’s reference counting operations; they use InterlockedIncrement() and InterlockedDecrement() to manipulate the reference count atomically and when the reference hits zero we know that nobody else is holding a reference so nobody else will be manipulating the socket so there’s no need to lock it during much of its destruction. Since our abort code doesn’t own a reference to the socket there’s a race condition between the last reference that’s held being released and our abort completing. No amount of locking will help us work around this without locking on entry to Release() and that is something that is just not going to happen and something that would simply change the problem to one of deadlock if it did happen; the final release would lock the socket lock and then the allocator lock, the connection manager shutdown abort would lock the allocator lock and then the socket lock… So there’s always going to be a race on a zero reference if you’re trying to manipulate a socket for which you don’t own a reference to… The problem here is not the race itself but the fact that you shouldn’t be manipulating a socket that you don’t own a reference to.

At this point I designed a solution to the problem, implemented it, tested it to death on my main development machine and the build machine that had shown up the problem in the past and sent it to the client for testing with a note saying that whilst this fixed the race condition problem in the client the knock on effects would need to be looked at with regards to the server; and, indeed, every single system built on The Server Framework, it was very much a breaking change.

The fix itself was pretty simple. Since a shutdown induced abort was different to an abort that was caused by someone who actually owned a reference to a socket it needed to be treated differently. During the shutdown abort we couldn’t do anything that required us to manipulate the reference count. This meant that we couldn’t post a close notification. Since lots of code is written on the basis that close notifications ALWAYS occur for any socket that has had an OnConnectionEstablished() notification (see the “lifecycle of a steam socket connection”) this could cause resource leaks in user code. Because of that I realised that I could flag the fact that the notification hadn’t been sent and send it if we ever attempted to close the socket in the future (as would happen when the final release happened on the socket). Due to how the locking around socket closure worked this was safe with no race conditions. The notification of closure would be delayed but you’d get it… At this point I spotted a situation where I could send the notification safely at a point prior to socket final release in some situations: If a read was pending on the socket when the socket was closed then the read returned with zero bytes read; much like when a client close occurs. With a few simple changes I could spot this situation and issue the correct close notification at this point. Only sockets without reads pending when the socket was aborted would suffer from a close notification that was delayed until just before the socket was destroyed.

This worked fine in the client code in question. The client was happy with the fix and it passed their tests and the pressure was off; at last. We were all back in the “spring of hope”. I switched to working out the ramifications of the change for their server software and for other framework examples. I was in stage 5 of the Kübler Ross model; acceptance and, interestingly, from this point on I stayed there…

Unfortunately not sending a close notification directly from the socket abort meant that one particularly common style of server design was now broken. Often you find that you need to hold on to sockets within the server for some reason; usually this is because you need to allow one connection to talk to another or perhaps it’s because the server generates data of its own to send on the connections. There’s a whole lump of code in the framework that makes this easier to do but unfortunately the change meant that all of this was broken…

Servers that hold references to a socket tend to release those internal references when the socket is closed. Client closure and connection reset errors are handled with the appropriate notifications but a server generated abort is handled by the OnConnectionClosed() event. Since we now send our close notification for shutdown aborts when the socket is finally released we’ll never get that event (or release the socket) for sockets which are being held on to by the server and which will be released when OnConnectionClosed() is called. What’s more, given the way I did send the notification if a read was pending at the time that the abort occurred I knew that this was a new intermittent, hard to track down, bug in the making; every so often servers that used to work fine would now hang during shutdown as a reference to a socket would never be released.

I spent a while looking at alternatives but most smacked of being a return to denial; the code didn’t own a socket reference so I couldn’t send a notification… I moved on and added a new callback which was called when an ‘abort all connections’ event had occurred and which told you of the number of connections that had been aborted and for which a close notification might be delayed. With this in place I could fix up the client’s server code to release the socket references that it held. The problem was fixed, for this client at least, but I’d yet to fully appreciate the full ramifications for other common server designs. I merged the client changes into a test branch of my main framework and picked a ‘complex server’ at random; I deliberately didn’t pick one that I knew held on to connections in the same way that the client’s server had. The OpenSSL server and its associated test client use most of the important parts of the framework and as I got these working I began to realise that the removal of this notification was a breaking change too far. Firstly there was nothing to cause a compiler error if you were about to fall foul of the problem; sure I could rename the notification so that you had to work out why I’d renamed it and perhaps you’d read the release notes and comments and perhaps you’d understand the problem and perhaps you’d fix your code but more than likely you’d do a search and replace for the old name with the new name and move on with a potentially broken system. Removing or delaying the notification was not acceptable; too much code relied on it.

I went back to the original problem and spent some time thinking. Then I stepped away from it for a while and waited to see if the thinking would result in anything…

As stated above, the root problem is that: During shutdown the connection manager potentially calls abort on a socket for which it doesn’t own a reference. Since it doesn’t own a reference it can’t legally manipulate the reference count though the way we lock the socket allocator and the socket itself allows it to safely manipulate the socket whilst it performs the abort.

The solution that requires the drastic breaking change assumes the worst; we don’t have a reference and we can’t get one because the reference count is not locked during manipulation and therefore there’s no safe way to increment it to “steal” a reference. What I really need is an InterlockedAddIfNotCompare() function that could compare the reference count to zero and add one if the compare failed. Unfortunately such a thing doesn’t exist. However due to how both the abort and the socket closure code that needs to be called during a socket’s destruction both lock the socket and test it for validity I realised that if I was actually inside the code that needed to steal a reference I could be sure that I could increment the reference count, test it to see if I’d just incremented it from zero and if so decrement it back to zero all without affecting the code that requires the reference to be zero during socket destruction… If I could steal a reference in this way then the whole problem of not owning a reference went away; since stealing the reference meant obtaining a reference to the socket only if the existing reference count was not zero, once I’d stolen the reference the potential race with other reference owners went away as the reference that I’d obtained was just as valid as anyone else’s. What’s more, in the situations where delaying the close notification would be a problem I was guaranteed to be able to steal a reference since this problem was due to a socket reference being held that needed the close notification to release it… In the situations where I couldn’t steal a reference I knew that we were currently in a race with the socket’s final destruction and, due to the locking and socket state checks that guaranteed we were ahead of the socket destruction code, that destruction code would issue the delayed notification shortly after we finished. This left me without the massively breaking change and with a solution that I was happy was correct; though one which had generated copious code comments to warn me against changing anything that the code relied on.

By this stage I had some unit tests in place. These tested the notification generation during the shutdown abort and helped document the behaviour. Unfortunately the original issue was not something that I could get into a unit test as it relied on a race condition that was too hard to manipulate within a test; sometimes you have to accept that some things are just untestable with reasonable resources.

Unfortunately this left me with one outstanding issue… The way that the socket allocator handles walking the list of active sockets deals with the fact that the socket we’re currently aborting may end up removed from the list before we step on to the next element in the list. This list doesn’t use the STL and is a custom invasive list so that all we need to do to guard against the node that’s being processed being deleted during processing is to make sure that we have already retrieved the ’next’ node before we process the current node. Unfortunately this doesn’t help if processing the current node causes the next node to become invalid. Certain styles of server can have one socket directly holding onto another socket so that when one socket is released it releases the socket it’s holding on to (gateway servers, or proxies might be designed like this). A fairly contrived test, whereby socket A holds a reference to socket B and socket A is released first during a shutdown abort causing the release of socket B can clearly demonstrate the potential problem. The list traversal jumps off into hyper-space as the next node is invalidated before we step to it. Locking doesn’t help here as everything is happening on the same thread and so we simply re-enter any locks recursively; using non re-entrant locks would simply deadlock us on ourselves.

Moving the reference stealing into the allocator’s abort loop and only processing sockets that can have a reference stolen fixes this final problem. By stealing the reference we know that the socket that’s being worked on cannot be deleted whilst we’re working on it so we don’t need to do anything clever to allow for sockets that release themselves and/or other sockets. However we now no longer abort connections that we are unable to steal a reference from. By definition a socket from which we’re unable to steal a reference is one that is in the later stages of being closed and released as such it’s probably better to allow it to close naturally rather than aborting it at this point. The revised design appears to be safe from problems caused by the race condition on socket destruction and perform better than the original design by allowing sockets that are closing anyway to continue to close naturally rather than being aborted at the last minute.

Getting from the original problem report to a satisfactory solution took some time, the journey was interesting but I went through a whole host of emotional states as I initially rushed to fix the symptoms without taking the time to fully understand the underlying problem. Once the pressure was off things went much more smoothly as I started to view each attempt at a fix as just that, an attempt. I didn’t feel pressured by the client even though they had a deadline, I was pressuring myself because I knew that they were under pressure and I was embarrassed to have found a bug in my code…

These fixes will be included in release 6.2 which currently doesn’t have a release date but which should be available in the first quarter of 2010. If you think you need them sooner then get in touch.