Practical Testing: 7 - Fixing the tick count wrap bug

Previously on Practical Testing: After far too much work we finally got to the point where we had a test for the tick count wrap bug. Now that we have a failing test we can fix the bug.

At long last we have a test for GetTickCount() wrapping from a big number to 0 every 49.7 days. The test is controllable and we can run the test at any time, not just when the PC in question has been operating for 49.7 days. The test currently fails because when we schedule timers that expire after the wrap point we end up with an absolute timeout time which is, possibly, less than other timers that are currently set but that actually expire before the new timer; the timers end up out of sequence and don’t go off in the expected order.

Imagine that GetTickCount() had a range of just 1000; the count ran from 0 to 1000 and then wrapped back to 0. If ’now’ was 900 and we set a timer for 101 ticks in the future the absolute timeout of the timer would be 900 + 101 = 1001 which would have wrapped to 1. When we insert the absolute time 1 into our sorted list of timers we would find that it’s likely to be the lowest time we have, so the timer would be inserted at the front of the queue. If we had no other timers in the queue then the timer would simply go off too early; the next time we checked for timeouts we’d see that we had a now of ‘900’ and a timer expiry of 1 and expire the timer. If we already had another timer in the queue for 990, this new timer would go off before it rather than after it; etc.

So, our first problem is to recognise the wrap situation and make sure the timers stay in the correct order. We can do this by inserting a dummy node in the timer list at the point where the tick count wrap occurs and then make sure that when we have a timer that has wrapped we insert it after this dummy node rather than before; in essence, we split the list into two lists, both sorted in ascending order, one with timeouts that occur before the wrap around and one with timeouts that occur after. This fixes our example above as follows; when setting the timer that has wrapped to 1 we notice that it has wrapped and so insert it after the dummy node, our list looks like this; 990, dummy, 1. The dummy node is treated as if it has an absolute timeout value of 1000 (the maximum tick count in our example) and so the timer expiry will wait 90 ticks for the 990 timer, expire it and then wait a further 10 ticks for the dummy, recognise it’s a dummy, do nothing, and then wait a further 1 tick for the next real timer…

The remaining problems are recognising that the timer has wrapped and dealing with the dummy in the timeout code. The first issue is relatively easy to deal with, if the sum of the timeout and ’now’ is less than ’now’ then the absolute timeout has wrapped around. The second problem is slightly more complex; when handling timeouts, if the node appears to have timed out but it is our dummy node then we need to remove it from the list and move it to the end of the list again before continuing to process timers that may have expired.

The tests pass, but I’m not especially happy with the resulting code. As I’ve mentioned a couple of times now, if I had developed the code after I had been bitten by the testing bug then I expect it would have been simpler and easier to test… In the next posting we’ll begin to address the complexity issues in the current version of the code base in a rather extreme manner; we’ll start to write a new version of the code from scratch in a test driven style.

Code is here. Same rules as before.