Practical Testing: 29 - Fixing the timer wheel

Previously on “Practical Testing”… I’m writing a timer wheel which matches the interface used by my timer queue. This new implementation is designed for a particular usage scenario with the intention of trading space for speed and improving performance of some reliable UDP code. The last entry completed the development of the timer wheel. This time we fix a couple of the bugs that I’ve discovered since I started to integrate the code with the system that it was developed for.

I try to be pragmatic with my testing. I know that I could write tests for the rest of my life and never prove that the code under test was 100% correct and so I try and write the smallest number of tests that give the largest amount of validation and confidence in the code. Because of this I quite expect to discover bugs in my code and find that I need to add a new test that reproduces the bug before I then fix the bug and the test passes.

That said, I still should have included a test for these particular bugs in my original test suite for the timer wheel, it’s a fairly obvious hole in the testing. Both bugs are fairly serious, and fairly simple. Luckily once I integrated the code into the real application the problems showed up quickly and regularly; this meant that it was easy to track them down and once I had an idea of what the likely problems were it was easy to put together some tests that showed the problems.

The first bug is in how we handle calculating the next timeout when that timer has wrapped and is before our current position in the timer wheel rather than after it. None of the existing tests put the code in this position, but real usage caused it almost immediately.

Timer Wheel 4

The calculation for this situation was wrong, it was this:

nextTimeout = static_cast<Milliseconds>((
   (m_pFirstTimerSetHint > m_pNow ?
      (m_pFirstTimerSetHint - m_pNow) :
      (m_pNow - m_pFirstTimerSetHint)) + 1) * m_timerGranularity);

and it should be this:

nextTimeout = static_cast<Milliseconds>((
   (m_pFirstTimerSetHint >= m_pNow ?
      (m_pFirstTimerSetHint - m_pNow) :
      (m_pTimersEnd - m_pNow + m_pFirstTimerSetHint - m_pTimersStart)) + 1) * m_timerGranularity);

The original calculation caused the timer wheel to report an incorrect next timeout which caused timeout handling to stall until our first timer was larger than “now” again….

The second bug is slightly less serious but occurs if there are no timers set and we’re using the HandleTimeouts() call for timeout processing. The timer wheel’s view of the current time is updated during timer processing. If this there are no timers set then the timer processing loop is skipped inside of HandleTimeouts() and the wheel’s view of the current time begins to lag. This progressively reduces the value of the timeout that you can set with SetTimer(). The fix is to have SetTimer() reset the wheel if no timers are currently set. In this situation it’s safe to set the wheel to its initial state before setting a new timer. The fix is pretty simple, we just add this:

Milliseconds CCallbackTimerWheel::CalculateTimeout(
   const Milliseconds timeout)
{
   const Milliseconds now = m_tickCountProvider.GetTickCount();
  
   if (m_numTimersSet == 0)
   {
      m_currentTime = now;
  
      m_pNow = m_pTimersStart;
   }
  
   const Milliseconds actualTimeout = timeout + (now - m_currentTime);
  
   if (actualTimeout > m_maximumTimeout)
   {
      throw CException(
         _T("CCallbackTimerWheel::CalculateTimeout()"),
         _T("Timeout is too long. Max is: ") +
         ToString(m_maximumTimeout) +
         _T(" tried to set: ") + ToString(actualTimeout) +
         _T(" (") + ToString(timeout) + _T(")"));
   }
  
   return actualTimeout;
}

Where the fix is the code inside of the if (m_numTimersSet == 0) block.

I’ve also renamed the #define that’s used to enable monitoring; the previous name seemed a little back to front…

The code can be found here and the previous rules apply.