Server shutdown in 5.2.1 and 5.2.2

2008-07-23

There was a change in release 5.2.1 of The Server Framework which has caused some issues with clean shutdown. This issue is also present in 5.2.2.

Prior to 5.2.1 the CSocketServer, or the equivalent object that did the bulk of the work with regards to connections, could be destroyed whilst sockets that it managed were still in existence. This wasn’t usually a problem, but it meant that it was possible for a socket to make a callback into code that no longer existed, which is a Bad Thing. This situation has never shown up in real life, but when I added continuous integration and continuous testing to my build process I started seeing the issue in occasional test failures. To prevent this, a waitable counter is now used and the destructor of the socket server waits for all of its sockets to be released before returning. This means that the kind of failure that I was seeing (purecalls into callback objects that no longer exist, made by sockets that managed to outlive their creator) can’t happen any more.
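
To make the idea concrete, here is a minimal sketch of a destructor that waits on a counter. It is not the framework's code: WaitableCounter, SketchServer, OnSocketCreated, OnSocketReleased and DestructorWaitsForRelease are names invented for this example, and it is built on standard C++ primitives rather than whatever the framework uses internally.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical waitable counter: counts outstanding sockets and lets a
// caller block until the count drops back to zero.
class WaitableCounter
{
public:
   void Increment()
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      ++m_count;
   }

   void Decrement()
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      assert(m_count > 0);
      if (--m_count == 0)
      {
         // Notify while holding the lock so that a waiter cannot destroy
         // this object before the notification has completed.
         m_zero.notify_all();
      }
   }

   void WaitForZero()
   {
      std::unique_lock<std::mutex> lock(m_mutex);
      m_zero.wait(lock, [this] { return m_count == 0; });
   }

private:
   std::mutex m_mutex;
   std::condition_variable m_zero;
   long m_count = 0;
};

// Hypothetical stand-in for the socket server.
class SketchServer
{
public:
   // Does not return until every socket has been released, so no socket
   // can call back into a server that has already been destroyed.
   ~SketchServer() { m_sockets.WaitForZero(); }

   void OnSocketCreated() { m_sockets.Increment(); }
   void OnSocketReleased() { m_sockets.Decrement(); }

private:
   WaitableCounter m_sockets;
};

// Usage: returns true if the destructor only returned after the release.
bool DestructorWaitsForRelease()
{
   std::atomic<bool> released{false};
   std::thread releaser;
   {
      SketchServer server;
      server.OnSocketCreated();
      releaser = std::thread([&] {
         std::this_thread::sleep_for(std::chrono::milliseconds(50));
         released = true;
         server.OnSocketReleased();
      });
   } // ~SketchServer blocks here until the socket has been released
   const bool waited = released.load();
   releaser.join();
   return waited;
}
```

The flip side of this design is visible in the sketch: if OnSocketReleased is never called for some socket, the destructor never returns.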

Unfortunately, servers that use one of our thread pool designs can hang during shutdown if they have active connections at the time that the shutdown is initiated. The problem is this: there may be items in the thread pool’s queue, and these most likely include sockets. A reference to each socket is taken when it is added to the queue and is not released until the work item has been processed. When the thread pool is shut down, work items may still be in the queue; these are now inaccessible, and the socket references that they hold will never be released. The server object will now wait forever for those sockets to be released.
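
The stranded reference can be shown in a few lines. This is only an illustration under assumptions: Socket, WorkItem and ReferencesStrandedInQueue are made-up names, and std::shared_ptr stands in for the framework's own reference counting.

```cpp
#include <cassert>
#include <memory>
#include <queue>

// Hypothetical stand-in for a framework socket.
struct Socket {};

// A queued work item holds a reference for as long as it sits in the queue.
struct WorkItem
{
   std::shared_ptr<Socket> socket;
};

// Returns the number of references that remain once everyone except the
// abandoned queue has let go of the socket.
long ReferencesStrandedInQueue()
{
   std::queue<WorkItem> queue;

   auto socket = std::make_shared<Socket>();
   std::weak_ptr<Socket> observer = socket;

   queue.push(WorkItem{socket}); // reference taken when the item is queued
   socket.reset();               // the rest of the server releases its reference

   // Imagine the pool's threads have exited at this point: nothing will
   // ever pop the item, so the reference it holds is never released.
   assert(queue.size() == 1);
   return observer.use_count();
}
```

In this sketch the count stays at one for as long as the queue exists, which is the condition that leaves a server waiting for its sockets blocked indefinitely.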

In one case of this problem that I’ve encountered in the wild, the fix was to flush the socket allocator during the shutdown sequence. This forcibly releases all sockets, which prevents the hang, but it’s less than ideal, so I’m adding some code to the thread pool implementation so that it calls a user-defined function for each item that is left in the queue after the pool has been shut down. This function can then be used to release any resources cleanly.
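
A rough sketch of what such a hook could look like follows. None of this is the framework's actual interface: SketchThreadPool, the onUnprocessed callback, the manually reference-counted Socket and RefsLeftAfterShutdown are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical manually reference-counted socket.
struct Socket
{
   std::atomic<long> refs{0};
   void AddRef() { ++refs; }
   void Release() { assert(refs.load() > 0); --refs; }
};

// Hypothetical pool; Enqueue must not be called once Shutdown has started.
class SketchThreadPool
{
public:
   using Handler = std::function<void(Socket *)>;

   SketchThreadPool(std::size_t numThreads, Handler process, Handler onUnprocessed)
      : m_process(std::move(process)), m_onUnprocessed(std::move(onUnprocessed))
   {
      for (std::size_t i = 0; i < numThreads; ++i)
      {
         m_threads.emplace_back([this] { Run(); });
      }
   }

   ~SketchThreadPool() { Shutdown(); }

   void Enqueue(Socket *pSocket)
   {
      pSocket->AddRef(); // reference held while the item is queued
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_queue.push_back(pSocket);
      }
      m_wake.notify_one();
   }

   void Shutdown()
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_stopping = true;
      }
      m_wake.notify_all();
      for (auto &thread : m_threads) { thread.join(); }
      m_threads.clear();

      // All threads have exited, so the queue can be walked without a lock.
      // Hand each leftover item to the user so that it can be cleaned up.
      for (Socket *pSocket : m_queue) { m_onUnprocessed(pSocket); }
      m_queue.clear();
   }

private:
   void Run()
   {
      for (;;)
      {
         Socket *pSocket = nullptr;
         {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_wake.wait(lock, [this] { return m_stopping || !m_queue.empty(); });
            if (m_stopping) { return; } // exit promptly; queued items stay behind
            pSocket = m_queue.front();
            m_queue.pop_front();
         }
         m_process(pSocket);
         pSocket->Release(); // normal path: release once processing is done
      }
   }

   Handler m_process;
   Handler m_onUnprocessed;
   std::mutex m_mutex;
   std::condition_variable m_wake;
   std::deque<Socket *> m_queue;
   bool m_stopping = false;
   std::vector<std::thread> m_threads;
};

// Usage: every queued reference is released, whether or not it was processed.
long RefsLeftAfterShutdown()
{
   Socket socket;
   SketchThreadPool pool(
      2,
      [](Socket *) { /* pretend to do some work */ },
      [](Socket *pSocket) { pSocket->Release(); });
   for (int i = 0; i < 1000; ++i) { pool.Enqueue(&socket); }
   pool.Shutdown();
   return socket.refs.load();
}
```

However many items the two threads get through before the shutdown, each queued reference is released exactly once, either on the normal path or by the callback, so the count ends at zero and nothing is left for a waiting server to block on.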

This won’t help if you opt to forcibly terminate the threads in the pool, but once you do that you’re on your own anyway, as who knows what state the locks in your process are in…