Practical Testing: 20 - Mind the gap

Back in 2004, I wrote a series of articles called “Practical Testing” where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuilt the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns, all of which have been supported by the original unit tests and many of which have led to the development of more tests.

But… you need to be aware of the tests that you haven’t written… You may think you’re testing the code completely but often you’re not testing as well as you might think. Code coverage metrics don’t really help either: no matter how many tests you write and how complete your coverage is, there’s no way to be sure that you’re actually covering everything. 100% code coverage might give you a warm feeling, but unless you have 100% concurrent code permutation coverage that warm feeling is misplaced.

In part 18 I redesigned the timer queue to allow for lock-free timer dispatch; that is, we no longer hold an internal lock when calling back into user code. This removed the potential for deadlocking user code due to unexpected lock inversions, but it complicated the timer queue implementation because the internals of the queue needed to remain sane when timer manipulation functions interrupted the unlocked portion of the timer dispatch code.

I added some tests to prove that these changes worked; unfortunately there’s a race condition in the code and the tests fail to cause it to occur… Were we to examine the tests we’d see that we were covering all of the code, but not all of the permutations.

In the internal timer data’s HandleTimeout() call we do this:

void CCallbackTimerQueueBase::TimerData::HandleTimeout()
{
   OnTimer(m_timedout);
  
   m_timedout.Clear();
}

This is where we call into user code, so this happens whilst the queue is unlocked, and it must be safe for other timer manipulation calls to occur on the timer queue whilst this code executes. Unfortunately DestroyTimer() uses the fact that the m_timedout data is valid to decide whether to actually delete the timer data during the call to DestroyTimer() or to simply mark the timer data to be deleted when the timer handling is complete. Since we unset the data that DestroyTimer() uses to make this decision whilst we’re unlocked, there’s a race condition. If we unset m_timedout before a call to DestroyTimer() checks the data for this timer then DestroyTimer() will think that the timer is not currently in the process of being timed out and will delete the timer data. When the timeout handling completes we may also delete the data or, worse, when we try to access the data it will already have been deleted.

Our existing tests for this kind of interruption check that it’s OK to call the various timer manipulation functions during timer dispatch, and they cause the interruption to occur between two calls, one of which is locked (and so atomic) and one of which is not. The bug shows up if we interrupt the timer dispatch after the actual dispatch to user code is complete but before we’ve re-entered the locked code in the timer queue to finish the dispatch. Adding some more tests that interrupt the timer dispatch at this point shows the fault.

Of course the unlocked timer dispatch could be interrupted at any point in its execution; luckily for us there’s only one point where that would cause us problems, and that’s the point where we update the value of m_timedout.pTimer from a valid pointer to 0. The window of opportunity for the race condition runs from after we’ve updated the pointer to before we re-acquire the lock. Our new tests therefore test the right thing…

And now on to fixing the problem…

The problem occurs because we’re using m_timedout.pTimer for two purposes. The first requires that it be updated immediately after the user’s timer code has executed, so that we prevent OnTimer() being called multiple times. The second requires that it only be updated whilst the lock is held, as we also use the value to determine whether the timer is currently being processed. The solution is to add a flag to the timer data which states that the timer is currently being processed. This flag is only ever updated whilst the lock is held, which removes the race condition. The unlocked update of m_timedout.pTimer is now safe, as DestroyTimer() uses the new flag to determine whether it should delete the timer data or not…

There’s still a potential race condition that would allow OnTimer() to be called multiple times if multiple threads were able to call HandleTimeout() concurrently for the same timer, but since you’d have to jump through quite a few hoops to make that happen I think we’ll leave it for now.

Code is here and the new rules from last time apply. Note that this fix is included in release 5.2.3 of The Server Framework.