Practical Testing: 35 - Heap corruption

2016-11-09

Page content

Previously on “Practical Testing”… I’ve just fixed a new bug in the timer queue and in doing so I updated the code used in the blog posts to the latest version that ships with The Server Framework. This code included two fixes that had been fixed some time ago and which I hadn’t documented here. They also lacked unit tests… In this episode I find and fix the first of these issues by writing a unit test that triggers the issue.

This bug is another edge case that isn’t used that often by most of the code that uses the timer queue. The queue supports timers that can be set and reset a number of times via an interface that allows you to create a timer, set/cancel it and then destroy the timer. A less used interface allows you to create ‘fire and forget’ timers that can only be set once and that clean themselves up. Hardly any code that I write uses this interface but it’s there for backwards compatibility and the code required to support it is limited to the call that sets the initial timer, as the clean up is done by code that’s shared with timers that are deleted during their own timeout processing.

This bug also only affects the Timer Wheel implementation which has a considerably smaller set of users and a considerably narrower use case. There’s test coverage for “one shot” timers but only for timers that are processed whilst the timer queue’s internal lock is held. There is no test for a “one shot” timer for queues that dispatch without holding a lock. The code for lock-free dispatch is significantly different to the code for dispatch whilst holding a lock and that’s where the bug is.

The bug was originally found because it causes heap corruption due to a double delete. The first thing I’ll do is add a test for lock-free timer dispatch of “one shot” timers. This clearly demonstrates the bug when run in release mode on Windows 10 as the heap alerts us to the problem.

Adding a new test

The new test is similar to TestOneShotTimer() in TCallbackTimerQueueTestBase, which contains tests that run for all timer queue implementations. The TestOneShotTimer() test does everything needed to show that a “one shot” timer works but it only tests the code that dispatches timers using HandleTimeouts() to dispatch the timers. We need a test that uses the “lock-free” BeginTimeoutHandling(), HandleTimeout(), EndTimeoutHandling() sequence. That code looks something like this:

template <class Q, class T, class P>
void TCallbackTimerQueueTestBase<Q, T, P>::TestOneShotTimer()
{
   JetByteTools::Win32::Mock::CMockTimerQueueMonitor monitor;

   P tickProvider;

   tickProvider.logTickCount = false;

   {
      Q timerQueue(monitor, tickProvider);

      CheckConstructionResults(tickProvider);

      THROW_ON_FAILURE_EX(INFINITE == timerQueue.GetNextTimeout());

      tickProvider.CheckNoResults();

      CLoggingCallbackTimer timer;

      const Milliseconds timeout = 100;

      timerQueue.SetTimer(timer, timeout, 1);

      CheckTickProviderSetTimerResults(tickProvider);

      const Milliseconds expectedTimeout = CalculateExpectedTimeout(timeout);

      THROW_IF_NOT_EQUAL_EX(expectedTimeout, timerQueue.GetNextTimeout());

      tickProvider.CheckResult(_T("|GetTickCount|"));

      timerQueue.HandleTimeouts();

      tickProvider.CheckResult(_T("|GetTickCount|"));

      timer.CheckNoResults();

      THROW_IF_NOT_EQUAL_EX(expectedTimeout, timerQueue.GetNextTimeout());

      tickProvider.CheckResult(_T("|GetTickCount|"));

      tickProvider.SetTickCount(expectedTimeout);

      THROW_ON_FAILURE_EX(0 == timerQueue.GetNextTimeout());

      tickProvider.CheckResult(_T("|GetTickCount|"));

      timerQueue.HandleTimeouts();

      tickProvider.CheckResult(_T("|GetTickCount|"));

      timer.CheckResult(_T("|OnTimer: 1|"));

      THROW_ON_FAILURE_EX(INFINITE == timerQueue.GetNextTimeout());

      tickProvider.CheckNoResults();
   }

   THROW_ON_FAILURE_EX(true == monitor.NoTimersAreActive());
}

When this test is run on the CCallbackTimerWheel class, on Windows 10 in Visual Studio 2015, it drops us into the debugger in this code in free_base.cpp

// This function implements the logic of free().  It is called directly by the
// free() function in the Release CRT, and it is called by the debug heap in the
// Debug CRT.
extern "C" void __cdecl _free_base(void* const block)
{
    if (block == nullptr)
    {
        return;
    }
    
    if (!HeapFree(select_heap(block), 0, block))
    {
        errno = __acrt_errno_from_os_error(GetLastError());
    }
}

With the execution broken at the call to HeapFree(). Looking back up the callstack we can see that we’re in the destructor of the timer wheel as our new test exits successfully.

Working out what’s going on…

The destructor is cleaning up any active timers. It’s looping over the active timer handles and deleting the timer data. The thing is, there are no active timers at this point in this test as the timer was a “one shot” timer and it cleaned itself up after OnTimer() was called. Setting a breakpoint in the destructor, just before the delete, and running the test again shows that there IS an entry in the active handles map when there shouldn’t be. Looking at the code in HandleTimeout() shows that this “one shot” timer is deleted there but, unlike in HandleTimeouts() this line is missing:

         m_activeHandles.Erase(pDeadTimer);

Adding that correctly fixes the bug and prevents the double delete and the subsequent heap corruption.

Why doesn’t deleting a timer also break?

Q: The code that cleans up the timer after processing its timeout is the same code that cleans up a timer that is deleted during timeout handling. Why doesn’t that use case also suffer from this bug?

A: When you delete a timer you immediately remove the handle from the active handles collection. This is so that you can’t delete a timer and then continue to use the timer handle. So the handle removal after timeout isn’t needed for deleted handles to timers, only for “one shot” timers.

The code is here on GitHub and new rules apply.