Practical Testing: 23 - Another new approach: timer wheels

| 3 Comments

The most recent articles in the "Practical Testing" series have been discussing the performance of the timer queue that we have built. As I hinted when I first brought up the performance issues, the particular use case that I have which is causing problems might well be more efficiently dealt with using a different (more specialised and less general purpose) design.

The timer queue has adequate performance for general purpose use and can handle timers set within a range of 0ms to 49.7 days with a theoretical granularity of 1ms. It achieves this by using a balanced search tree to store the timers by absolute timeout. The performance of setting a timer is O(log n) due to the tree insertion required. Cancelling a timer is O(1) since we keep an iterator to where the timer was inserted and thus can navigate straight to it to cancel it. Timer expiry is also an O(log n) operation due to the tree lookup. Due to the use of the standard program heap the worst case contention of the queue is C(tq)+(C(tn-tq+1)+C(ts-tq+1)+C(tn-tq+1)) (see here for details of my crazy Big C notation for describing contention).

The more specialist use case is for driving reliable UDP protocols. This kind of work generally requires timers per connection for retransmission and data flow pacing. The timeouts tend to be short and the timers tend to expire rather than be reset without expiring. The range of timeouts is generally quite small; 0ms - 30seconds for the ENet system I'm working on. I'm currently looking at improving performance of the timer system for this kind of scenario and to do so requires that timer insertion speed be improved (so we can set timers more quickly), timer expiry speed be improved (so we can process timers faster) and contention be reduced, ideally tending towards C(tq) where we have contention only between users of the timer queue and not between any thread in the process.

As I have already mentioned the use of STL containers means that I'm doing more work than is strictly necessary when manipulating the timer queue (including dynamic memory allocation and release during timer insertion and removal). One way of improving contention is to switch to using custom STL allocators so that only the users of the queue ever access the allocators that we use for the queue. Another is to write a custom, invasive, balanced search tree that does not need to use dynamic allocation.

A third solution would be to use a simpler data structure. Our requirement is simply to store timers in order of timeout. Rather than using a complex tree structure we could use a simple sorted list. Unfortunately timer insertion would then rise to O(n) as we would need to traverse the list to locate the correct spot to insert our new timer. Cancellation can stay O(1) if we use our invasive CNodeList and timer handling becomes O(1) because we will always work from the head of the list when expiring timers. The usage pattern of the reliable retransmission means that we'll be inserting timers over the whole of our possible range, so the O(n) insertion would really bite us.

In a classic trade off between memory usage and performance we could use an array and have lots of wasted space in it. Setting a timer becomes O(1), you simply index directly into the array at the correct location. Cancellation and timer processing are also O(1) and there's no dynamic memory allocation required for insertion and removal so the worst case contention is C(tq). Such a structure is called a timer wheel due to the fact that the array is viewed as a circular buffer and timers are inserted with timeouts relative to a 'now' point on the wheel.

The amount of memory used can be reduced by reducing the granularity at which you can set your timers. For example, a timer wheel with a range of 0-30seconds and a granularity of 1ms requires 30,000 elements in the array, if you reduce the granularity to 15ms (which is pretty much the best you can get from GetTickCount() anyway), then the array size is reduced to a more manageable 2,000 elements. Given that the array is an array of pointers we're looking at 8kB on an x86 and 16kB on x64. Each array element points to either null if no timer is set or to the first timer in a doubly linked list of timers at this time. The list is invasive with the links being part of the data that is stored in the list. Insertion into the list is a case of simply pushing a new node onto the front of the existing list, cancellation is easy as the list is doubly linked and the node contains the links. Thus most timer manipulation becomes simply adjusting pointers.

TimerWheel-1.png

The wheel in the diagram above has a granularity of 5ms and has timers set at 30 and 50. The wheel is defined by two pointers, one to the start of it and one to one element beyond the end.

TimerWheel-2.png

This diagram clearly shows the circular nature of the array. This is just before we expire the 30ms timer. Note that the next timer is due in 20ms.

My implementation of a timer wheel is made easier by the fact that I have a set of tests that target the interface to which I wish to conform to. To start with I'll implement a basic timer wheel that allows us to create, set and cancel timers but that doesn't deal with any of the complexity of expiring timers. Also all of the nice and implied or explicit implementation details will be left out. Don't worry, once we write the tests for these pieces of functionality it'll be obvious where we're failing.

Creation and destruction of the timer wheel are pretty straight forward. We have an array of pointers to create, the size of which is based on the maximum timeout that we can set and the granularity of the timers that can be set. Destruction is similar to the timer queue in that we iterate any existing timers and clean them up. Timer creation is very similar to our timer queue as we dynamically allocate the timer data and insert it into a map for validation and clean up purposes. The timers themselves are, at present at least, quite simple. a link for the next timer in the list, a link to the previous timer and the timer and user data. Setting a timer simply involves validating it, locating the correct index into the timer wheel array and then adding the timer to the list of timers at that point in the array.

bool CCallbackTimerWheel::SetTimer(
   const Handle &handle,
   Timer &timer,
   const Milliseconds timeout,
   const UserData userData)
{
   if (timeout > m_maximumTimeout)
   {
      throw CException(
         _T("CCallbackTimerWheel::SetTimer()"), 
         _T("Timeout is too long. Max is: ") + ToString(m_maximumTimeout) + _T(" tried to set: ") + ToString(timeout));
   }
  
   TimerData &data = ValidateHandle(handle);
  
   const bool wasSet = data.CancelTimer();
  
   data.UpdateData(timer, userData);
  
   InsertTimer(timeout, data);
  
   return wasSet;
}
  
void CCallbackTimerWheel::InsertTimer(
   const Milliseconds timeout,
   TimerData &data)
{
   const size_t timerOffset = timeout / m_timerGranularity;
  
   TimerData **ppTimer = GetTimerAtOffset(timerOffset);
  
   data.SetTimer(ppTimer, *ppTimer);
}
  
void CCallbackTimerWheel::TimerData::SetTimer(
   TimerData **ppPrevious,
   TimerData *pNext)
{
   if (m_ppPrevious)
   {
      throw CException(
         _T("CCallbackTimerWheel::TimerData::SetTimer()"),
         _T("Internal Error: Timer is already set"));
   }
  
   m_ppPrevious = ppPrevious;
  
   m_pNext = pNext;
  
   if (m_pNext)
   {
      m_pNext->m_ppPrevious = &m_pNext;
   }
  
   *ppPrevious = this;
}

I'm using a pointer to the previous pointer rather than a pointer to the previous node as it makes things slightly simpler; honest...

With just enough code to get the first set of tests to run I have enough to get some initial performance figures out of the new timer system. Timer creation is about the same as with the queue, but that's expected as the code is almost identical; the contention for creation and destruction are also the same as for the queue and thus could also be improved with custom allocators and private heaps. The performance tests for SetTimer() show a dramatic improvement. On my test machine I get figures of around 4ms to set a single timer 100,000 times against 90ms for the queue and similar improvements in the other two performance tests for SetTimer(). What's even better is that SetTimer() would have a contention of C(t-queue) as we no longer have to do any of the dynamic allocation that was going on with the timer queue's STL manipulation.

Right now we're left with a failing test which points the way for what we need to do next which is deal with being able to process these timers when they time out, but before I look at that I think it's about time that I take a good hard look at the duplication in the tests. We're testing an interface with three implementations and we should have a single set of tests which does that and then have some implementation specific tests as well if we feel we need them. Having one set of duplicate test code for the Ex version of the queue was wrong but I could just about live with it, having another duplicate set for the timer wheel is just something I'm not prepared to put up with unless it's simply not possible to remove the duplication.

The code can be found here and the previous rules apply.

3 Comments

Interesting turn of events :)

I've suspected this for a while, that once you know more about the usage scenarios for your app, it makes sense to go back and revise the underlying data structures to better suit those patterns.

The old adage that 'the STL will always do it better than you could yourself' is _probably_ true for a general-purpose data structure, but the trade-offs you can make in a special-purpose data structure is likely to simplify design and improve performance.

Thanks for a nice report!

This particular scenario is definitely an edge case for the original design and I'm not too surprised that I could do better. Both the client and the scenario are pushing performance quite hard.

I agree that most of the time the STL is great and it gets the job done faster and more reliably than building your own containers etc, and replacing it when you know it's become one of the hot spots isn't as easy as it could be. BUT I think in certain situations it's worth doing; I do feel that I have to justify why I'm doing it though :) It will be interesting to see what gains I get if I replace the STL in the timer queue with a custom map...

We use a similar timing wheel at http://code.google.com/p/simulationcraft/ , which is at its core a event-driven simulator.

- Wheel length and granularity are both of the form 2^n, to allow using bitshift instead of division by the wheelmask when finding the time slice to insert.
- We use a singly-linked list with the list pointers stored on the timers themselves. This is important to reduce memory allocation ( even if its just a pointer ) when inserting a timer. Using a std::list results in massive allocation and is thus not suited for this task.
- We reuse timers if possible. This means that even though we need many thousands of timers for a whole iteration of the simulation, there are maximally eg.

Your article is very interesting to read and describes exactly the problems one faces when trying to create a high speed and performanant timer/event system. Thanks for sharing!

Leave a comment