Deadlock detection tool updates

When I came back from skiing in Colorado I had a bug report from a client and it took me a fair while to isolate the problem for them. The report suggested that a server that I’d built for them a while back was deadlocking in certain situations. The server had always had deadlock problems due to a couple of poor design decisions that I made early on, and I’d built a tool to help me remove the potential for deadlock from it by analysing its lock acquisition patterns; the code was too complex for me to do this analysis just by looking at it. Although I should rewrite the poorly designed section of code, I don’t have the time to do so (and it only causes problems when I make changes to the code, which doesn’t happen very often, and those problems can be caught by running my tool). My tool monitors lock usage whilst the program is running and then reports on any deadlocks that are possible, even if they don’t actually happen.
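
To give a rough idea of what analysing lock acquisition patterns involves, here’s a minimal sketch of the kind of lock order tracking that makes “possible but never seen” deadlocks detectable: record an edge from each lock a thread already holds to each lock it then acquires, and report when those edges form a cycle. The names and structure below are purely illustrative and aren’t the tool’s actual code; integer lock ids stand in for whatever really identifies a lock.

    // Minimal sketch of "potential deadlock" detection from observed lock
    // acquisition order. Illustrative only.
    #include <iostream>
    #include <map>
    #include <set>

    class LockOrderGraph
    {
       public :

          // Record that a thread acquired newLock whilst already holding heldLock.
          void RecordEdge(int heldLock, int newLock)
          {
             m_edges[heldLock].insert(newLock);

             // If there's already a path from newLock back to heldLock then the two
             // locks are taken in inconsistent orders somewhere; a deadlock is
             // possible even if it never actually happens during the run.
             if (PathExists(newLock, heldLock))
             {
                std::cout << "Potential deadlock between lock " << heldLock
                          << " and lock " << newLock << std::endl;
             }
          }

       private :

          bool PathExists(int from, int to) const
          {
             std::set<int> visited;
             return Search(from, to, visited);
          }

          bool Search(int from, int to, std::set<int> &visited) const
          {
             if (from == to) return true;
             if (!visited.insert(from).second) return false;

             const auto it = m_edges.find(from);
             if (it == m_edges.end()) return false;

             for (const int next : it->second)
             {
                if (Search(next, to, visited)) return true;
             }
             return false;
          }

          std::map<int, std::set<int>> m_edges;
    };

    int main()
    {
       LockOrderGraph graph;

       graph.RecordEdge(1, 2);   // thread A acquires lock 2 whilst holding lock 1
       graph.RecordEdge(2, 1);   // thread B acquires lock 1 whilst holding lock 2 - reported
    }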

Anyway, this week I needed to run up the tool to track down this newly reported problem and I found that I didn’t have a build that compiled…

Note: the deadlock detector mentioned in this blog post is now available for download from www.lockexplorer.com.

I’d spent a lot of time working on the deadlock detection tool and getting the engine working correctly. It’s basically a custom debugger that uses API hooking (IAT patching) to do the lock monitoring. After getting the deadlock detection tool working nicely I decided to add a GUI to it (“working nicely” meant that it did the detection I wanted, but the reports were a horrible mass of output rather than a nice clicky, drill-downable GUI). Whilst working on the GUI I decided that I needed to step away from the complexity of the deadlock detection tool and instead build the bulk of the GUI for a simpler tool; this allowed me to build the various dialogs that I would need for the deadlock tool and also to explore the other issues that come from plugging a GUI onto the debug engine.
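
For anyone who hasn’t come across IAT patching: the idea is to locate the import address table entry for the API you’re interested in, in the module that calls it, and redirect it to your own function. The sketch below shows the general shape, assuming it runs inside the target process (in practice from an injected DLL); it’s illustrative only, with error handling and unpatching omitted, and the PatchIAT() name is made up rather than taken from the tool.

    // A sketch of IAT patching: find the import address table entry for a
    // function in a given module and redirect it to a hook.
    #include <windows.h>
    #include <string.h>

    // Returns the original function pointer, or 0 if the import wasn't found.
    PROC PatchIAT(HMODULE module, const char *dllName, const char *functionName, PROC hook)
    {
       const PROC original = ::GetProcAddress(::GetModuleHandleA(dllName), functionName);

       BYTE *base = reinterpret_cast<BYTE *>(module);

       const IMAGE_DOS_HEADER *dosHeader = reinterpret_cast<const IMAGE_DOS_HEADER *>(base);
       const IMAGE_NT_HEADERS *ntHeaders = reinterpret_cast<const IMAGE_NT_HEADERS *>(base + dosHeader->e_lfanew);

       const IMAGE_DATA_DIRECTORY &importDir =
          ntHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];

       const IMAGE_IMPORT_DESCRIPTOR *imports =
          reinterpret_cast<const IMAGE_IMPORT_DESCRIPTOR *>(base + importDir.VirtualAddress);

       for (; imports->Name != 0; ++imports)
       {
          const char *name = reinterpret_cast<const char *>(base + imports->Name);

          if (0 != _stricmp(name, dllName))
          {
             continue;
          }

          // Walk this DLL's import address table looking for the entry that
          // currently points at the function we want to hook.
          IMAGE_THUNK_DATA *thunk = reinterpret_cast<IMAGE_THUNK_DATA *>(base + imports->FirstThunk);

          for (; thunk->u1.Function != 0; ++thunk)
          {
             if (reinterpret_cast<PROC>(thunk->u1.Function) == original)
             {
                DWORD oldProtect = 0;

                ::VirtualProtect(&thunk->u1.Function, sizeof(thunk->u1.Function), PAGE_READWRITE, &oldProtect);

                thunk->u1.Function = reinterpret_cast<ULONG_PTR>(hook);

                ::VirtualProtect(&thunk->u1.Function, sizeof(thunk->u1.Function), oldProtect, &oldProtect);

                return original;
             }
          }
       }

       return 0;
    }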

At that time I had another problem that looked suited to a custom debugger with API hooking, so I wrote a new tool that allowed me to control the way an application viewed time. Given the way I structure my code (most of it ends up being pushed into libraries for reuse), it was easy for me to build this and other tools from the same codebase. This new time shifting tool was considerably simpler and I slapped a simple GUI onto it. The GUI components were coming on but I still didn’t have everything that I needed to put the GUI on the deadlock tool.
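
The time shifting idea is simple once the hooking is in place: redirect the timing APIs that the application calls to functions that report adjusted values. Here’s a minimal sketch of a hook for GetTickCount() that makes time appear to pass faster; the scale factor, the names and the choice of hooking only GetTickCount() are my own illustrative assumptions rather than details of the actual tool.

    // An illustrative hook for GetTickCount() that makes time appear to pass
    // ten times faster. In practice the hook would be installed via IAT
    // patching as above; this sketch ignores thread safety and wrap-around.
    #include <windows.h>

    namespace
    {
       const DWORD s_scale = 10;

       DWORD s_baseTime = 0;

       DWORD WINAPI HookedGetTickCount()
       {
          if (s_baseTime == 0)
          {
             s_baseTime = ::GetTickCount();
          }

          const DWORD elapsed = ::GetTickCount() - s_baseTime;

          // Report the real start time plus a scaled elapsed time so that the
          // application sees time passing more quickly than it really is.
          return s_baseTime + (elapsed * s_scale);
       }
    }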

I found myself bogged down in the GUI and noticed that I was making less and less progress. I decided to put a halt to the GUI for a while and, instead, work on some of the meatier items from my list of things to do. Top of the list was making the debug engine work with .Net (CLR) processes. At that time my debug engine simply used CreateProcess() to create a suspended process, injected a DLL, hooked the APIs that it wanted to hook and then resumed the process. This worked fine for “normal” exes and failed dismally with .Net exes. For some strange reason CreateProcess() appears to fail to suspend the correct threads in a .Net process and a supposedly suspended process simply runs…
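
The original approach was roughly the following shape: create the process suspended, inject the monitoring DLL and patch the IATs, then resume. This is a simplified sketch with the injection step reduced to a placeholder; the names are illustrative rather than the engine’s actual code.

    // Rough shape of the original (pre debug API) approach.
    #include <windows.h>

    void InjectHookDll(HANDLE /*process*/)
    {
       // Placeholder: inject the monitoring DLL and patch the IATs of interest.
    }

    bool LaunchAndHook(const char *exePath)
    {
       STARTUPINFOA si = { sizeof(si) };
       PROCESS_INFORMATION pi = { 0 };

       // Create the process with its initial thread suspended...
       if (!::CreateProcessA(exePath, 0, 0, 0, FALSE, CREATE_SUSPENDED, 0, 0, &si, &pi))
       {
          return false;
       }

       // ...hook whilst it's (supposedly) not running...
       InjectHookDll(pi.hProcess);

       // ...and then let it go with the hooks in place.
       ::ResumeThread(pi.hThread);

       ::CloseHandle(pi.hThread);
       ::CloseHandle(pi.hProcess);

       return true;
    }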

The solution was to switch to using the Win32 Debugging API, which gave me enough control over the process to work out how to correctly start and safely suspend a .Net process. Once I was using the Win32 Debugging API I decided that I may as well cross another item off my list by supporting multiple process debugging; my debug engine could now debug a process that started a child process and debug the child process at the same time.
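
For reference, the basic shape of a Win32 Debugging API loop that also follows child processes looks something like this (DEBUG_PROCESS rather than DEBUG_ONLY_THIS_PROCESS). It’s a minimal sketch; the real engine obviously does far more with each event.

    // A minimal Win32 debug event loop that follows child processes too.
    #include <windows.h>
    #include <iostream>

    void DebugLoop(const char *exePath)
    {
       STARTUPINFOA si = { sizeof(si) };
       PROCESS_INFORMATION pi = { 0 };

       if (!::CreateProcessA(exePath, 0, 0, 0, FALSE, DEBUG_PROCESS, 0, 0, &si, &pi))
       {
          return;
       }

       bool done = false;

       while (!done)
       {
          DEBUG_EVENT event = { 0 };

          if (!::WaitForDebugEvent(&event, INFINITE))
          {
             break;
          }

          DWORD continueStatus = DBG_CONTINUE;

          switch (event.dwDebugEventCode)
          {
             case CREATE_PROCESS_DEBUG_EVENT :
                // Both the initial process and any child processes arrive here.
                std::cout << "process created: " << event.dwProcessId << std::endl;
                ::CloseHandle(event.u.CreateProcessInfo.hFile);
                break;

             case EXIT_PROCESS_DEBUG_EVENT :
                // Stop when the initial process exits; a real engine would track
                // every process that it's attached to.
                done = (event.dwProcessId == pi.dwProcessId);
                break;

             case LOAD_DLL_DEBUG_EVENT :
                ::CloseHandle(event.u.LoadDll.hFile);
                break;

             case EXCEPTION_DEBUG_EVENT :
                // Pass exceptions other than the initial breakpoint back to the
                // debuggee rather than swallowing them.
                if (event.u.Exception.ExceptionRecord.ExceptionCode != EXCEPTION_BREAKPOINT)
                {
                   continueStatus = DBG_EXCEPTION_NOT_HANDLED;
                }
                break;

             default :
                break;
          }

          ::ContinueDebugEvent(event.dwProcessId, event.dwThreadId, continueStatus);
       }

       ::CloseHandle(pi.hThread);
       ::CloseHandle(pi.hProcess);
    }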

Then Christmas arrived and I was caught up finishing some work for a client; I integrated the new debug engine into the time shifter tool and updated the GUI; then we skied some and then my world ended.

So there I was, not really in the mood to code and with a sudden need for a working version of the deadlock tool. Due to some sloppiness on my part, I didn’t have one (I’d not bothered to tag the code of the last working build) and getting one meant either restoring a backup or knocking another item off my list by updating the deadlock detection tool to work with the new debug engine. I decided that I needed to get back into coding so opted for the latter and spent two days getting my head back into the code and updating the deadlock detection tool to work in the same way as the time shifting tool; they both now use the new multi-process, CLR-friendly debug engine.

The tool found the problem by accident. The bug that the client was reporting wasn’t actually a deadlock, but my thrash testing of their server under the deadlock detection tool managed to reproduce the problem and the tool captured some useful information that helped me track it down.

The problem turned out to be a subtle race condition. The server is a gateway; it accepts incoming connections and routes them to an outbound connection. The race condition was that the “connector” object could be deleted due to a connection closure whilst it was still being used by the data flow. The solution was to reference count the connector so that, although the connection closure would release a reference, the data flow would still hold a reference and destruction would be delayed until the connector was no longer in use.
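
The shape of the fix looks something like the sketch below, which uses std::shared_ptr to stand in for the reference counting; the class and function names are illustrative and the real connector, closure and data flow code are far more involved.

    // A sketch of delaying destruction via reference counting.
    #include <iostream>
    #include <memory>

    class Connector
    {
       public :

          ~Connector()
          {
             std::cout << "connector destroyed" << std::endl;
          }

          void RouteData()
          {
             std::cout << "routing data" << std::endl;
          }
    };

    int main()
    {
       // The data flow path takes its own reference before it uses the connector...
       std::shared_ptr<Connector> dataFlowReference;

       {
          std::shared_ptr<Connector> connectionReference = std::make_shared<Connector>();

          dataFlowReference = connectionReference;

          // ...so when the connection closure releases its reference, here, the
          // connector isn't destroyed...
       }

       // ...and the data flow can finish with it safely; destruction is delayed
       // until the last reference goes away.
       dataFlowReference->RouteData();
    }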

Locating the problem identified an issue with the deadlock detection tool: it uses masses of memory when running for a long period on processes that perform lots of manipulations on lots of locks. I’m collecting all of the data (in all of the formats) as the manipulation happens. This means storing some of the information multiple times - I track details of which threads do what to each lock, I track the order in which each thread acquires lock sequences, I track everything a particular thread does with any lock, I track all of the operations that occur on any thread in the order that they happen, etc. All of this is useful information but much of it could be worked out after the event; at present we could derive much of it, at the point that we want to create the report, from a single list of operations… When displaying information in a GUI some of the data we currently store needs to be derived for display, but not all of it, and not all of the time; and even when it does, the data structures used for holding the data for display will probably duplicate much of what we’re already collecting.
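
The alternative would be to collect a single ordered list of operations and derive the per-thread and per-lock views only when a report (or the GUI) actually needs them; something along these lines, with purely illustrative structures:

    // A sketch of deriving views on demand from one list of lock operations.
    #include <map>
    #include <string>
    #include <vector>

    struct LockOperation
    {
       int threadId;
       int lockId;
       std::string operation;                 // "acquire", "release", etc.
    };

    typedef std::vector<LockOperation> OperationList;

    // Derive the "everything this thread did" view when a report needs it.
    std::map<int, OperationList> GroupByThread(const OperationList &operations)
    {
       std::map<int, OperationList> byThread;

       for (const LockOperation &op : operations)
       {
          byThread[op.threadId].push_back(op);
       }

       return byThread;
    }

    // Derive the "everything that happened to this lock" view when needed.
    std::map<int, OperationList> GroupByLock(const OperationList &operations)
    {
       std::map<int, OperationList> byLock;

       for (const LockOperation &op : operations)
       {
          byLock[op.lockId].push_back(op);
       }

       return byLock;
    }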

There was no real plan for how I’d collect the data; it just grew as I added more interesting reporting options. I’d decided not to summarise as I went along as this restricted the ways that I could slice and dice the data, but now I was doing the opposite and collecting all manner of pre-sliced data… Now that I know I’m collecting too much I need to go back and revisit the data collection and work out the minimum that I require…

But before I do that, I think I’ll explore the changes required to support multiple process deadlock detection; that is, add support for named mutexes, etc. Looks like I’m getting my head back into coding again…