I always regret leaving the perfmon code out

I had one of those “Doh!” moments yesterday. In summary, always put the performance monitoring code in early, looking at a program’s vital signs as a jiggly graph can show up all kinds of unexpected things…

The work on the gateway server that I’m currently building for a client has been going well but yesterday I had a weirdness develop. Very rarely a test would fail because the server would fail to respond to a request. It was hard to reproduce and it was one of those bugs that seemed to flit in and out of existence to the point that I wasn’t really sure if it was in the test code or the code under test…

I decided to leave that particular area of testing for a while; I was getting nowhere and needed to change what I was thinking about. Instead I looked at the list of open issues and started working down it. 1) Add support for the client’s MSMQ-based error reporting system. 2) Add appropriate performance counters, etc. I put in the perfmon code and activated a set of appropriate counters: number of IO threads, number of IO threads active, number of inbound connections, number of outbound connections, etc, etc.

I fired up perfmon and ran the tests to make sure the lines jiggled as expected, the bug happened, and there on the perfmon display was a flatline; I wish perfmon had a ‘beep, beep, beep’ sound (think ER) because if it did I would have had a continuous tone as the number of active IO threads flatlined at two even though there were no active client connections…

The active IO threads counter is incremented right after GetQueuedCompletionStatus() is called to remove a work item from the IOCP and decremented when the work item is complete, just before the thread calls GQCS() again (there’s a sketch of the loop after the list below). As such, it’s pretty much the heartbeat of the server; as data flows through the server an appropriate number of threads run and the line jumps up and down from 0 to N. It’s fine for it to flatline at 0 when there’s no work to do but it’s definitely not fine for it to stick at any other value; that tends to indicate one of three things:

  1. There’s an exception handler missing at the outermost level of the ‘do the work’ function and something has gone wrong and blasted right out through the worker thread’s work item processing loop and killed the thread whilst it was working. This one is unlikely; it’s the kind of problem that was fixed once and now stays fixed as the fix is in the library code…

  2. There’s a logic bug that has created an endless loop in the code. The worker thread is spinning, burning CPU and never going to finish. This one’s easy to spot without the perfmon life signs chart. My dev box makes a whining noise as the fans increase in speed and the CPU meter notches up to 25% and stays there (dual hyper-threading CPUs give four logical processors, so a spinning thread tends to take one of them to max and keep it there; two spinning threads and we hit 50%, etc.).

  3. There’s a deadlock.
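
Going back to the counter itself for a moment, the loop I’m describing looks something like the sketch below. It’s a simplification rather than the actual server code; ProcessWorkItem() and the counter name are placeholders and error handling is elided, but GetQueuedCompletionStatus() and the Interlocked calls are the real Win32 APIs.

    #include <windows.h>

    void ProcessWorkItem(ULONG_PTR completionKey, DWORD numberOfBytes, OVERLAPPED *pOverlapped);   // placeholder

    volatile LONG g_activeIOThreads = 0;   // the value exposed as the 'IO threads active' counter

    unsigned int __stdcall IOWorkerThread(void *pParam)
    {
       HANDLE hIOCP = static_cast<HANDLE>(pParam);

       DWORD numberOfBytes = 0;
       ULONG_PTR completionKey = 0;
       OVERLAPPED *pOverlapped = 0;

       while (::GetQueuedCompletionStatus(hIOCP, &numberOfBytes, &completionKey, &pOverlapped, INFINITE))
       {
          if (0 == completionKey)
          {
             break;                                       // shutdown signal posted to the IOCP
          }

          ::InterlockedIncrement(&g_activeIOThreads);     // 'active' from the moment we have a work item...

          ProcessWorkItem(completionKey, numberOfBytes, pOverlapped);

          ::InterlockedDecrement(&g_activeIOThreads);     // ...until just before we call GQCS() again
       }

       return 0;
    }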

I’d seen the deadlock scenario once before. It was textbook stuff, two threads acquiring multiple locks where one of the threads held lock A and one held lock B and they both needed to hold both… Only neither thread should be holding a lock that the other needed and the order of lock acquisition would be the same in each thread and… Hmm. Breaking into the debugger at this point would show me where each thread was blocked and expanding the watch on the critical sections concerned would show me the owning threads, etc… But there shouldn’t be a problem here…
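
The textbook version of that, just to be explicit about the pattern I was looking for, is the sketch below; the lock and thread names are made up and the InitializeCriticalSection() calls are omitted, it’s purely the shape that matters.

    #include <windows.h>

    CRITICAL_SECTION lockA;   // assume both are initialised elsewhere
    CRITICAL_SECTION lockB;

    DWORD WINAPI Thread1(void * /*pParam*/)
    {
       ::EnterCriticalSection(&lockA);
       ::EnterCriticalSection(&lockB);     // blocks here if Thread2 already holds lockB...

       // ... do some work ...

       ::LeaveCriticalSection(&lockB);
       ::LeaveCriticalSection(&lockA);

       return 0;
    }

    DWORD WINAPI Thread2(void * /*pParam*/)
    {
       ::EnterCriticalSection(&lockB);     // acquires in the opposite order...
       ::EnterCriticalSection(&lockA);     // ...and blocks here whilst holding lockB; classic deadlock

       // ... do some work ...

       ::LeaveCriticalSection(&lockA);
       ::LeaveCriticalSection(&lockB);

       return 0;
    }

Make both threads take lockA before lockB and the problem goes away, which is exactly why the fact that my code did acquire its locks in a consistent order made the flatline so confusing…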

A quick glance at the server’s main showed that I’d seen this particular issue before; it was an ‘inappropriate optimisation’ thing and it was a two-minute fix…

Each connection to the server uses a lock, a critical section. Since we would like to support a lot of connections without necessarily using an equal number of critical sections, we trade a little potential lock contention for reduced resource usage and better scalability: connections can share critical sections with other connections. The sharing of the critical sections is done via a factory interface, and is tunable. To acquire a lock the connection passes a key (its this pointer) to the factory and the factory gives it a lock. The implementation of the shared lock factory that we’re currently using creates a pool of locks of a fixed (tunable) size and, when asked for a lock, it hashes the key that it’s given and picks a lock out of the pool. This, in itself, doesn’t cause a problem. Connections don’t access the locks of other connections, or cause those locks to become locked during the work that happens in a connection, so there’s no scope for deadlocking. Testing that statement is easy: create a lock factory with a pool of size 1 so that all connections share a single lock; lock contention is high but the code runs and doesn’t deadlock.
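
The factory is something along the lines of the sketch below; the class name and interface are simplified placeholders rather than the real factory interface, but it shows the idea.

    #include <windows.h>
    #include <vector>
    #include <cstddef>

    class CSharedCriticalSectionFactory
    {
       public :

          explicit CSharedCriticalSectionFactory(size_t poolSize)
             : m_locks(poolSize)
          {
             for (size_t i = 0; i < m_locks.size(); ++i)
             {
                ::InitializeCriticalSection(&m_locks[i]);
             }
          }

          ~CSharedCriticalSectionFactory()
          {
             for (size_t i = 0; i < m_locks.size(); ++i)
             {
                ::DeleteCriticalSection(&m_locks[i]);
             }
          }

          // Hash the key (typically the caller's 'this' pointer) and hand back one
          // of the locks from the fixed-size pool; unrelated keys may share a lock.
          CRITICAL_SECTION &GetLock(const void *pKey)
          {
             const size_t hash = reinterpret_cast<size_t>(pKey) / sizeof(void *);

             return m_locks[hash % m_locks.size()];
          }

       private :

          std::vector<CRITICAL_SECTION> m_locks;
    };

Create one with a pool size of 1 and you get the configuration mentioned above: every connection shares the same critical section, contention is as high as it can get, but nothing ever needs more than one lock from the pool so there’s no way to deadlock.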

The problem is that a recent addition to the code means that a connection uses a processor and a processor needs a lock. In main we were passing the same shared lock factory to the connection factory and the processor factory… Connections can lock their locks and then try and lock their processor’s lock; if the two locks come from the same pool then another connection could be doing the same thing. They’d be acquiring the locks in the same order so it wouldn’t look like any deadlocking could occur but… If the keys used to obtain those locks happened to hash in such a way that connection A was using lock 1 and processor A was using lock 2 and connection B was using lock 2 and processor B was using lock 1 then we have the classic deadlock all over again…
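
In terms of the factory sketch above, the problem boils down to the pattern below; the function and parameter names are made up, it’s the lock acquisition that matters.

    // Both keys resolve to locks from the SAME pool; the acquisition order looks
    // consistent (connection lock first, then processor lock) but the hashing can
    // cross the pairs over: connection A -> lock 1, processor A -> lock 2 whilst
    // connection B -> lock 2, processor B -> lock 1.

    void DoConnectionWork(
       CSharedCriticalSectionFactory &lockFactory,
       const void *pConnectionKey,              // the connection's 'this' pointer
       const void *pProcessorKey)               // its processor's 'this' pointer
    {
       CRITICAL_SECTION &connectionLock = lockFactory.GetLock(pConnectionKey);
       CRITICAL_SECTION &processorLock = lockFactory.GetLock(pProcessorKey);

       ::EnterCriticalSection(&connectionLock);   // connection A holds lock 1, connection B holds lock 2

       ::EnterCriticalSection(&processorLock);    // A now waits on lock 2, B waits on lock 1; deadlock

       // ... do the work ...

       ::LeaveCriticalSection(&processorLock);
       ::LeaveCriticalSection(&connectionLock);
    }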

The solution is to make sure that when you use a lock factory like this you never try and acquire two locks from the same factory at the same time (or, if you really have to, that you acquire them in an order based on the locks themselves rather than on what they happen to protect)… The creation of a second lock factory in main fixed the deadlock; the connection locks and the processor locks now come from separate pools…
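
The fix, in terms of the same sketch, is simply for main to give the connection factory and the processor factory their own lock factories, so the two locks that end up being held together can never come from the same pool (the pool size of 97 is just an example of the tunable value).

    CSharedCriticalSectionFactory g_connectionLockFactory(97);   // connection locks only
    CSharedCriticalSectionFactory g_processorLockFactory(97);    // processor locks only

    void DoConnectionWork(
       const void *pConnectionKey,
       const void *pProcessorKey)
    {
       CRITICAL_SECTION &connectionLock = g_connectionLockFactory.GetLock(pConnectionKey);
       CRITICAL_SECTION &processorLock = g_processorLockFactory.GetLock(pProcessorKey);

       ::EnterCriticalSection(&connectionLock);   // everyone takes a lock from the connection pool first...

       ::EnterCriticalSection(&processorLock);    // ...and one from the processor pool second, so the pairs can't cross

       // ... do the work ...

       ::LeaveCriticalSection(&processorLock);
       ::LeaveCriticalSection(&connectionLock);
    }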

If I’d had the perfmon code in the application from the start I would have recognised this particular bug straight away. I’d seen the life signs of another server do the same thing and debugged the problem before. Note to self: putting the perfmon code in should be right at the top of the list when starting a new project…