Practical Testing: 30 - Reducing contention

Previously on “Practical Testing”… I’ve been looking at the performance of the timer system that I developed and have built a more specialised, higher performance timer system which is more suitable for some high performance reliable UDP work that I’m doing. Whilst developing the new timer wheel I began to consider the thread contention issues that the timer system faced and came up with a notation for talking about contention (see here). Both the general purpose timer queue and the new timer wheel suffered from more potential thread contention than they needed to because the STL containers that I am using a) require memory allocations on insertion and removal and b) use the program heap for those memory allocations. This converted the contention for the timer system from contention between the number of threads accessing the timer system to contention between the number of threads accessing the program heap…

I mentioned a while back that a custom STL allocator would be one way to reduce the thread contention; the allocator could use a private heap that only the timer system used and so the potential contention during memory allocation and release would be reduced to the potential contention for the timer object itself. Today I’ll present the results of switching to a private heap using a custom STL allocator for the STL collections that I use.

I went looking for information about writing my own STL allocator and ended up with some useful code from Pete Isensee. I then boiled this down to something that fitted with my development style and that supported allocations using HeapAlloc(). Actually using the allocator was trivial, if slightly messy.
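The shape of such an allocator is roughly the following sketch; the class name is mine, not the real `CAlloc`, and `std::malloc`/`std::free` stand in for `HeapAlloc()`/`HeapFree()` on a private heap handle so that the sketch stays portable:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <new>

// Hypothetical minimal C++11-style allocator in the spirit of CAlloc.
// The real class holds a handle to a private Win32 heap and calls
// HeapAlloc(m_heap, 0, n)/HeapFree(m_heap, 0, p); malloc/free stand in here.
template <typename T>
struct PrivateHeapAlloc
{
   typedef T value_type;

   PrivateHeapAlloc() {}

   template <typename U>
   PrivateHeapAlloc(const PrivateHeapAlloc<U> &) {}

   T *allocate(std::size_t n)
   {
      // Real version: HeapAlloc(m_heap, 0, n * sizeof(T))
      void *p = std::malloc(n * sizeof(T));

      if (!p) throw std::bad_alloc();

      return static_cast<T *>(p);
   }

   void deallocate(T *p, std::size_t)
   {
      // Real version: HeapFree(m_heap, 0, p)
      std::free(p);
   }
};

template <typename T, typename U>
bool operator==(const PrivateHeapAlloc<T> &, const PrivateHeapAlloc<U> &) { return true; }

template <typename T, typename U>
bool operator!=(const PrivateHeapAlloc<T> &, const PrivateHeapAlloc<U> &) { return false; }
```

C++11's `std::allocator_traits` fills in the rest of the allocator requirements, which is why the sketch gets away with so little boilerplate compared to a fully hand-written pre-C++11 allocator.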

typedef std::deque<TimerData *> Timers;
 
typedef std::pair<size_t, Timers> TimersAtThisTime;
 
typedef std::map<ULONGLONG, TimersAtThisTime *> TimerQueue;
 
typedef std::pair<TimerQueue::iterator, size_t> TimerLocation;
 
typedef std::map<TimerData *, TimerLocation> HandleMap;

became

typedef std::deque<TimerData *, CAlloc<TimerData *> > Timers;
 
typedef std::pair<size_t, Timers> TimersAtThisTime;
 
typedef std::map<ULONGLONG, TimersAtThisTime *,
   std::less<ULONGLONG>,
   CAlloc<std::pair<ULONGLONG, TimersAtThisTime *> > > TimerQueue;
 
typedef std::pair<TimerQueue::iterator, size_t> TimerLocation;
 
typedef std::map<TimerData *, TimerLocation,
   std::less<TimerData *>,
   CAlloc<std::pair<TimerData *, TimerLocation> > > HandleMap;

And I had to add some allocators and a private heap to the class.

CSmartHeapHandle m_heap;
 
CAlloc<TimerData *> m_timersAllocator;
 
CAlloc<std::pair<ULONGLONG, TimersAtThisTime *> > m_timerQueueAllocator;
 
CAlloc<std::pair<TimerData *, TimerLocation> > m_handleMapAllocator;

And then adjust the constructors to make use of all of this:

CCallbackTimerQueueBase::CCallbackTimerQueueBase()
   :  m_heap(::HeapCreate(HEAP_NO_SERIALIZE, 0, 0)),
      m_timersAllocator(m_heap),
      m_timerQueueAllocator(m_heap),
      m_handleMapAllocator(m_heap),
      m_queue(std::less<ULONGLONG>(), m_timerQueueAllocator),
      m_handleMap(std::less<TimerData *>(), m_handleMapAllocator),
      m_monitor(s_monitor),
      m_maxTimeout(s_timeoutMax),
      m_handlingTimeouts(InvalidTimeoutHandleValue)
{
   if (!m_heap.IsValid())
   {
      throw CException(_T("CCallbackTimerQueueBase::CCallbackTimerQueueBase()"), _T("Failed to create private heap"));
   }
}

Note that since we are taking responsibility for locking around access to the heap we can tell HeapCreate() not to bother locking internally with the HEAP_NO_SERIALIZE flag.

Unfortunately this makes performance worse, though arguably it has reduced contention. The problem is that HeapAlloc() isn’t as efficient as the standard implementation of new, so whilst we’ve reduced contention we’ve also reduced overall performance. Not good.

I did some research on high performance memory allocators and decided that PTMalloc was a good fit for what I needed. PTMalloc supports separate heaps via what it terms “malloc spaces”, or mspaces, and it supports multi-threaded use where you’re responsible for the locking. I wrapped the code in one of my library projects and created some helper code so that it integrated more easily with the rest of my code.
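The mspace API, as the wrapper uses it, looks roughly like this. The real `create_mspace`, `mspace_malloc`, `mspace_free` and `destroy_mspace` functions come from dlmalloc/PTMalloc's `malloc.c` built with `MSPACES` defined; the trivial stand-ins below exist only so the sketch compiles and runs on its own:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-ins for the real dlmalloc/PTMalloc mspace API; they only mimic
// the signatures so the usage below can be demonstrated portably.
typedef void *mspace;

static mspace create_mspace(std::size_t /*capacity*/, int /*locked*/)
{
   return reinterpret_cast<mspace>(1);        // real: builds a private heap
}

static void *mspace_malloc(mspace, std::size_t n) { return std::malloc(n); }

static void mspace_free(mspace, void *p) { std::free(p); }

static std::size_t destroy_mspace(mspace) { return 0; }

// Demonstrates the life cycle of a private malloc space.
int PrivateSpaceExample()
{
   // locked = 0: no internal locking; the caller (here, the timer queue's
   // own lock) is responsible for serialising access - the same trade as
   // HEAP_NO_SERIALIZE with HeapCreate().
   mspace heap = create_mspace(0, 0);

   void *p = mspace_malloc(heap, 64);   // allocate from the private space

   const int ok = (p != 0);

   mspace_free(heap, p);                // release back to the same space

   destroy_mspace(heap);                // tear the whole space down at once

   return ok;
}
```

Because each mspace is independent, destroying one releases everything allocated from it in a single step and never touches the program heap.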

A new STL allocator implementation can then allocate and deallocate from a PTMalloc mspace rather than from a heap created with HeapCreate(). The results were good: faster than the original new implementation and, due to the private heap, the contention of the timer queue was reduced to C(n threads using the queue).

The STL allocator isn’t the only place where dynamic memory is being allocated, though; we’re also allocating a timer handle when we create a timer, and in the timer queue we need to allocate the various structures that help us build and manage our queue. If these continue to use the standard program heap then our worst possible contention is still C(n threads using the program heap) rather than C(n threads using the queue).

Providing custom allocation and deallocation code, that uses the PTMalloc private heap, for the other memory allocations deals with both the contention and boosts performance.

In the case of the TimerData object we can add a placement new implementation for it that uses our private heap. For the simpler memory objects we allocate and construct them manually using the private heap’s allocator.
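A sketch of what that placement-style operator new might look like; `PrivateHeap` and the `TimerData` members shown are illustrative stand-ins, with `std::malloc`/`std::free` again standing in for the mspace calls:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative private heap; the real code allocates from the PTMalloc
// mspace via mspace_malloc()/mspace_free().
struct PrivateHeap
{
   void *Allocate(std::size_t n) { return std::malloc(n); }

   void Free(void *p) { std::free(p); }
};

struct TimerData
{
   unsigned long long timeout = 0;   // illustrative member

   // Placement form of operator new: allocate this object from the
   // supplied private heap rather than from the program heap.
   static void *operator new(std::size_t n, PrivateHeap &heap)
   {
      void *p = heap.Allocate(n);

      if (!p) throw std::bad_alloc();

      return p;
   }

   // Matching delete; the compiler calls this automatically only if the
   // constructor throws during a new (heap) TimerData expression.
   static void operator delete(void *p, PrivateHeap &heap)
   {
      heap.Free(p);
   }

   // Normal delete path; here it must match the malloc stand-in above.
   static void operator delete(void *p)
   {
      std::free(p);
   }
};
```

Construction then reads `new (heap) TimerData` and a plain `delete` releases the object; the important detail is providing the matching placement delete so that a throwing constructor doesn't leak.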

On my system the performance tests show some pretty nice improvements for the timer queues. Every operation is faster. Timer creation is down from 60ms per 100,000 to 40ms. Setting timers down from 130ms to 100ms (again per 100,000) and timer handling down a little.

Of course the allocation and STL allocator changes can also be applied to the timer wheel, albeit with lesser results, as the only STL collection used there is for timer handle validation and the only dynamically allocated data is the timer handle.

The code for the STL allocator using HeapAlloc() can be found here.

The code for the STL allocator using PTMalloc can be found here.

And the code for all allocations using PTMalloc can be found here.

Please note that the previous rules apply.

There’s still scope for improvement. As I’ve mentioned before (here and here), the STL containers are not intrusive and so memory must be allocated for each item placed in them. Old school intrusive containers wouldn’t require memory allocation and release at all and so should improve performance somewhat. What’s more, a custom designed intrusive multi-map could allow for the “remove all entries that match this key” operation which I’m currently fudging using more dynamically allocated structures…
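To illustrate the point, a minimal intrusive list sketch (the names are mine, not from the timer code): the link fields live inside the timer node itself, so insertion and removal touch only pointers and allocate nothing at all.

```cpp
#include <cassert>

// The node embeds its own links; placing one in a container costs no
// memory allocation because the storage is already part of the node.
struct TimerNode
{
   TimerNode *prev = nullptr;
   TimerNode *next = nullptr;

   unsigned long long timeout = 0;   // illustrative payload
};

// Circular doubly-linked list with a sentinel head; insert and remove
// are O(1) pointer swaps with no calls into any allocator.
struct IntrusiveList
{
   TimerNode head;   // sentinel, never holds a timer

   IntrusiveList() { head.prev = head.next = &head; }

   void push_back(TimerNode &n)
   {
      n.prev = head.prev;
      n.next = &head;
      head.prev->next = &n;
      head.prev = &n;
   }

   // Unlinking needs only the node itself, not the list - handy for
   // cancelling a timer given just its handle.
   static void remove(TimerNode &n)
   {
      n.prev->next = n.next;
      n.next->prev = n.prev;
      n.prev = n.next = nullptr;
   }

   bool empty() const { return head.next == &head; }
};
```

A custom intrusive multi-map would build on the same idea, and removal given just the node is exactly what the “remove all entries that match this key” operation wants.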