Practical Testing: 16 - Fixing a timeout bug

| 5 Comments

Back in 2004, I wrote a series of articles called "Practical Testing" where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Then, in 2005, I adjusted the code to be more scalable and I showed how the tests that had originally been written helped when code needed to be changed for performance purposes. Finally I uploaded a test utility program that I'd been working on, TickShifter, that allowed you to run a program and take control of how the GetTickCount() API operated within the program. The idea was that you could control time from outside the program to enable you to easily test edge conditions.

Time passed...

Recently I've had a bug reported against the timer queue code that was developed in the testing articles. You can find the bug report comment here. I'd like to thank the commenter for taking the time to report the bug in such a thorough manner; it made it much easier for me to validate the problem and craft a test that proved its existence.

The bug is as follows: If the tick count has wrapped but no timers have fired since it wrapped and you then add a new timer the two queues that are used to manage wrapped timers have not been adjusted to allow for the fact that the tick count has wrapped. This means that the new timer is added to the wrong queue and the queues are then not swapped and timers expire in the wrong order.

It's a nice edge case bug and it's one that was not tested for in the original test harness. The first thing that I did was set out to write a test to reproduce the bug. The commenter had used TickShifter to force the situation; which is what it's for! but for development and regression purposes it's better to have tests in the test harness that makes sure the bug stays fixed.

The first new test TestTickCountWrap2() sets up the environment exactly as the bug report stated. Take a look at the code for full details. First we set the tick count to be 1000ms before roll-over. Next we set a timer for 2000ms time, i.e. 1000ms after the tick count rolls over to 0. Next we set the tick count to 0, i.e. the point at which it rolls over. We then set another timer for 10000ms time. At this point we should have two timers set; the first will expire in 1000ms and the second in 10000ms. This is the point where the original code had a bug. Next we set the tick count to 1000. At this time the first timer should go off and we check that it does. We also verify that there are 9000ms until the next timeout. Finally we set the tick count to 10000 and verify that the second timer expires correctly.

Of course, when run with the code from part 15 this test fails for exactly the reason that the bug report stated. The problem was that we used two queues for the timers, one for timers before the "wrap point" and one for timers after. The only point that we ever switched the queues over was when a timer expired. At that point if we knew we had a timer that had expired and the current queue didn't contain any timers then we knew that the tick count had wrapped and that the timer we wanted was in the other queue. Of course, looking at this now, written down like that, it's quite obvious to see the flaws. It doesn't matter how many tests you write, if you don't write the correct tests then your code can still have bugs!

A solution to this bug is to keep a track of the tick count when we set a 'wrapped' timer. Then, when checking for timeouts, if the current tick count is less than the tick count when we last set a wrapped time we know that the count has wrapped and we can adjust the queues accordingly. The problem then is that we need to make sure the queues are correct when we set new timers as well as when existing timers expire. The simplest fix seems to be to cause a call to SetTimer() to first check for the expiry of existing timers. This means that the code that is used to check for a wrap when a timer expires is also executed before adding new timers. Hopefully, this means that the queues will always remain correct.

The second new test expands on the first to ensure that a timer that has expired when SetTimer() is called is correctly processed.

Code is here but new rules apply: The code will build with VC6, VS.Net 2002, VS.Net 2003, VS 2005 and VS 2008. The code builds as x86 or x64 with VS 2005 and 2008. The code will build with either the standard STL that comes with Visual Studio or with a version of STL Port. The code uses precompiled headers the right way so that you can build with precompiled headers for speed or build without them to ensure minimal code coupling. The code can also compile on VC6 with or without the platform SDK being installed. The various options are all controlled from the "Admin" project; edit Admin.h to change things...

5 Comments

Excellent post! I saw that comment about "The code uses precompiled headers the right way" and it intrigued me. I'm not familiar with a process that works the way you describe. Any chance you could go into more detail (or point me somewhere that does)?
Thanks!

I knew I should have searched your archives first :)
Thanks for the pointers.

Hi Len,

I found the bug in your code while writing my own timer queue based on your ideas. ;-)
My reason to write it myself was that the windows timer queue does not use a scalable algorithm in versions before Windows Vista.
After posting the bug report, my solution was to implement a GetTickCount64 on top of GetTickCount which is practically impossible to overflow. It overflows every 584942417 years if i calculated it correctly.
With this solution the code gets very straightforward.

Cunning. I'll probably switch to using GetTickCount64() on vista or later platforms and I may then take your approach and roll my own for earlier platforms. As you say it would simplify the code somewhat.

Leave a comment