There will be a 5.2.3, or documenting leads to refactoring...

2008-07-29

I started to document part of The Server Framework’s behaviour as I expect that a client will be asking questions about it in the near future. Whilst writing the documents I found myself writing this:

“It’s pretty easy to deadlock The Server Framework in complex servers if you don’t abide by the rules. Unfortunately, the rules weren’t documented until now and, although I knew them, it’s probably more accurate to say “I knew of them”. In more complex server developments I often went through a phase where I’d put some code in, in the wrong place, and deadlock the server under certain circumstances.”

The situation was never that bad but it existed (and I went as far as writing a tool to help me protect against it) and as I sat down to write the rules I decided that it wasn’t enough just to document the bad code it should be fixed. So I fixed it.

The reason that it was easy to deadlock The Server Framework is that each socket has a lock to protect its internal state and keep it sane in the presence of multiple threads accessing it at the same time. This is a Good Thing. However, often, in more complex servers, the business logic of the server also manages locks. Due to the way the callbacks work there are two directions you can call into the framework. The first is the obvious one. You get an event callback and you make a call on the supplied socket, or the server or connection manager from within this callback. The second is that you take and hold a reference to a socket; by calling AddRef() and then call into the socket at a later time, perhaps because another connection has done something that your held socket needs to know about or perhaps because of an external event. Either way if you have your own lock, A, and the socket has its own lock, B, then when you call into the framework because of an external event you may lock A and then B and when the framework calls into you it may cause you to lock B and then A. From such lock inversions are deadlocks born.

The fix is logically easy; if no framework locks are held when the framework calls into user code then it doesn’t matter what the user code does with its own locks. It turns out that over the years since The Server Framework diverged from the free version of the framework the locking in The Server Framework has already improved over the free version; locks were held during fewer callbacks. However there were still a few key callbacks which held locks and occasionally caused problems in some server designs. This situation has been fixed in what will be release 5.2.3.

I’ve also added a todo item, to go through the code and check these situations in the rest of my library code. Right now I know there’s a potential issue with the timer queue if it’s used “incorrectly”. However, rather than documenting the correct, and narrow path to safe usage it seems better to refactor away the complexity and make the document easier, and less embarrassing, to write.