Walking the call stack

Ned Batchelder has written about the code he uses to get a call stack out of a Windows program (thanks for the link, Barry). I’ve added a snippet of the code I use as a comment to his post.

Note: the deadlock detector mentioned in this blog post is now available for download from www.lockexplorer.com.

I started looking into Windows call stacks a while ago when I was working on my deadlock detection tool. What surprised me was how easy it was to get a call stack once you understood the DbgHelp (Debug Help Library) API. There are lots of examples on the web of how to use the API, but my requirements were a little different to most, as I wanted to collect the call stack in one place and decode it into stack frames somewhere else…

The tool spots potential deadlocks in running code by looking at the order in which locks are taken out by different threads in the program. It also reports on concurrency issues and contention and allows you to see exactly how the program under test uses its locks. It’s a useful tool and one that has easily earned back its development time for me by allowing me to locate some deadlocks that hadn’t yet deadlocked… These bugs were in the code, lurking, and with the aid of the tool I nailed them before they bit.
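To make the lock ordering idea concrete, here is a minimal sketch of one common way to spot this kind of problem: record an edge every time a thread takes one lock while holding another, then look for a cycle in the resulting graph. This is an illustration only and not necessarily how the tool itself works; LockId, LockOrderGraph and its member functions are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical lock identifier; a real tool might use the lock's address.
using LockId = int;

class LockOrderGraph {
public:
    // Record that a thread acquired `lock` while already holding `held`.
    // Each held lock gains an edge "held -> lock".
    void onAcquire(const std::vector<LockId>& held, LockId lock) {
        for (LockId h : held) {
            if (h != lock) edges_[h].insert(lock);
        }
    }

    // True if the recorded orderings contain a cycle, i.e. some locks
    // have been taken in conflicting orders and could deadlock.
    bool hasPotentialDeadlock() const {
        std::set<LockId> done;
        for (const auto& entry : edges_) {
            std::set<LockId> onPath;
            if (findsCycle(entry.first, onPath, done)) return true;
        }
        return false;
    }

private:
    // Depth-first search; `onPath` holds the locks on the current path,
    // `done` holds locks already proven to be cycle free.
    bool findsCycle(LockId node, std::set<LockId>& onPath,
                    std::set<LockId>& done) const {
        if (onPath.count(node)) return true;
        if (done.count(node)) return false;
        onPath.insert(node);
        auto it = edges_.find(node);
        if (it != edges_.end()) {
            for (LockId next : it->second) {
                if (findsCycle(next, onPath, done)) return true;
            }
        }
        onPath.erase(node);
        done.insert(node);
        return false;
    }

    std::map<LockId, std::set<LockId>> edges_;
};
```

If one thread takes lock 1 and then lock 2, and another takes lock 2 and then lock 1, the graph gains the edges 1 -> 2 and 2 -> 1, and hasPotentialDeadlock() reports the conflict even if the two threads never actually collided during the run.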

Most examples of using DbgHelp assume that you want to walk the stack using StackWalk64() and decode the resulting STACKFRAME64 structure there and then. I didn’t want that, for many reasons, so I used the web samples to understand the problem before writing my own implementation that uses the DbgHelp API in the way I needed. My code is more flexible than most of the samples I found, as it can grab the stack frames in one call and then process them, along with the corresponding PDB files, at another time to produce the call stack data for later display.
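The split between grabbing and decoding can be sketched in portable C++ without touching the Windows API at all. The sketch below is not my actual code; it assumes the raw return addresses have already been collected (on Windows, by a stack walk), and it uses a plain in-memory symbol table as a stand-in for the symbol data that would really come from the PDB files. RawFrame, LoadedModule, SymbolTable, toRawFrames, toHex and decode are all hypothetical names. Storing module-relative offsets rather than absolute addresses is a design assumption made here so that the frames still mean something in another process.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Raw data recorded at capture time: cheap to collect and to persist.
struct RawFrame {
    std::string module;    // module the address falls in
    std::uint64_t offset;  // address minus the module's load address
};

// Where a module was loaded in the captured process (hypothetical type).
struct LoadedModule {
    std::string name;
    std::uint64_t base;
    std::uint64_t size;
};

// Capture step: turn absolute addresses into module-relative frames.
// `addresses` stands in for whatever the stack walk returned.
std::vector<RawFrame> toRawFrames(const std::vector<std::uint64_t>& addresses,
                                  const std::vector<LoadedModule>& modules) {
    std::vector<RawFrame> frames;
    for (std::uint64_t address : addresses) {
        RawFrame frame{"<unknown>", address};
        for (const LoadedModule& m : modules) {
            if (address >= m.base && address - m.base < m.size) {
                frame = RawFrame{m.name, address - m.base};
                break;
            }
        }
        frames.push_back(frame);
    }
    return frames;
}

// Stand-in for symbol data: per module, function start offset -> name.
using SymbolTable = std::map<std::string, std::map<std::uint64_t, std::string>>;

std::string toHex(std::uint64_t value) {
    std::ostringstream out;
    out << std::hex << value;
    return out.str();
}

// Decode step: can run later, and in a different process, because it
// needs only the persisted frames and the symbol data.
std::vector<std::string> decode(const std::vector<RawFrame>& frames,
                                const SymbolTable& symbols) {
    std::vector<std::string> lines;
    for (const RawFrame& f : frames) {
        auto mod = symbols.find(f.module);
        if (mod != symbols.end()) {
            // Find the last function that starts at or before the offset.
            auto next = mod->second.upper_bound(f.offset);
            if (next != mod->second.begin()) {
                --next;
                lines.push_back(f.module + "!" + next->second + "+0x" +
                                toHex(f.offset - next->first));
                continue;
            }
        }
        lines.push_back(f.module + "+0x" + toHex(f.offset));
    }
    return lines;
}
```

The point of the sketch is that nothing in decode() depends on the process that produced the frames, so the expensive symbol lookup can happen whenever, and wherever, it is convenient.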

CallStack

There’s not a great deal of time between the grabbing and the decoding, but there is some, and the decoding occurs in a different process. As usual, for me at least, getting the display to work “just right” took longer than grabbing, processing and persisting the call stack data…