Activatable Object

| 0 Comments

I've written an article for Overload, one of the the ACCU's journals. It's based on my Efficient Multi-Threading blog post from a few weeks ago. Chris Oldwood mentioned to me about how the object described in Efficient Multi-Threading was similar to an Active Object which steals a calling thread to do its work rather than using one of its own and I agreed, he then suggested that I write it up for the ACCU journal.

You can read the article here.

Latest release of The Server Framework: 6.6.2

| 0 Comments
Version 6.6.2 of The Server Framework was released today.

This release is a bug fix and internal feature release which results in lots of change to the OpenSSL, SChannel, Compression and TCP flow control filters. See the release notes here, for full details of all changes.

All clients using earlier versions of both the OpenSSL Option Pack and the SChannel Option Pack are advised to upgrade to this release.

Bug fixes:

  • Bug fixes and potential fixes to all code that used function level non-trivial static objects. These have all been replaced with alternative designs as they were not thread safe and could, in very unlikely scenarios, cause problems. See here for more details.
  • Bug fix for JetByteTools::Win32::IQueueTimers::SetTimerWithRefCountedUserData(). We now wrap the SetTimer() call in a try/catch block and Release() the reference we created with AddRef() if there's an exception.
  • Bug fix to JetByteTools::IO::CBufferChain::AddDataToBufferChain() to allow for passing in an empty list of buffers to add.
  • Bug fix to JetByteTools::IO::CBufferChain::ConsoliateData() to ensure that buffers that we attempt to consolidate are writeable.
  • Bug fix in JetByteTools::Socket::TConnectionManagerBase::ReleaseSocket() to ensure exceptions are trapped appropriately.
  • Bug fix in JetByteTools::Service::CShutdownHandler::Run(). Changed the way we respond to shutdown and pause events (if enabled) so that we only respond to each event once rather than repeatedly routing shutdown events to the service whilst the event is set. This was causing non-paged pool exhaustion in some situations when the service failed to shutdown.

Additions:

  • Added JetByteTools::Win32::CRecursiveDirectorySearch.
  • Added JetByteTools::Win32::CRegistryKey::TryDeleteValue().
  • Added JetByteTools::Win32::TReusableIdManager::TryFree().
  • Added JetByteTools::Win32::TThreadSafeReusableIdManager::TryFree().
  • Added JetByteTools::Win32::CRingBuffer::GetSize().
  • Added JetByteTools::Win32::CActivatableObject. See here for details.
  • Added JetByteTools::Win32::CalculateRequiredPrecision() which takes a double value and returns the optimum precision required to format the value for display.
  • Added Intrusive containers. A set, map and multi-map which can be used to replace STL containers with ones which do not need to do memory operations during insertion and removal.
  • Added JetByteTools::IO::IBufferBase::GetSpace() which returns the available space in a buffer, size - used.
  • Added JetByteTools::IO::IBuffer::GetTotalSpace() which returns all available space in a buffer, this is GetSpace() plus any space that is currently unused at the front of the buffer due to calls to Consume().
  • Added JetByteTools::IO::IBuffer::RemoveSpaceAtFront() which compacts a buffer by removing any unused space at the front and copying any data back towards the front of the buffer.
  • Added JetByteTools::IO::CSequencedBufferCollection which is a collection based on JetByteTools::Win32::TIntrusiveRedBlackTree which stores the buffers in sequence number order.
  • Added JetByteTools::Socket::CFilterDataBase a base class for filter data that is based on JetByteTools::Win32::CActivatableObject.
  • Added JetByteTools::Socket::CWriteOnlyFilterData and JetByteTools::Socket::CReadOnlyFilterData as base classes for filters which only work on one side of the connection.
  • Added JetByteTools::Socket::CStreamSocketConnectionFilterBase a base class for filters which use JetByteTools::Socket::CFilterDataBase.
  • Added lots of functionality to JetByteTools::Socket::CFilteringStreamSocketConnectionManagerBase to allow for more complete filtering.
  • Added a Datagram flow control filter, JetByteTools::Socket::CFlowControlDatagramSocketConnectionFilter, see here for more details.

Changes:

  • Completely redesigned the OpenSSL and SChannel filters so that they use the new JetByteTools::Socket::CFilterDataBase filter data base classes.
  • Changed all versions of JetByteTools::Win32::ToString() which take a double or long double to also accept a precision value. This value defaults to 6, which is the default precision used if precision isn't specified.
  • Changed JetByteTools::Win32::CreateDirectoryIfRequired() to return a bool which indicates if the directory was created.
  • Changed JetByteTools::Win32::CreateDirectoriesIfRequired() to return a count of the number of directories created.
  • Changed JetByteTools::Win32::CThreadPool::OnThreadPoolThreadStopped() so that it calls JetByteTools::Win32::IMonitorThreadPool::OnThreadPoolThreadStopped() before deleting the thread and decrementing the counter. This makes it possible to clean up the monitor when the final thread is deleted (and the wait on the counter returns).
  • Changed JetByteTools::Win32::TLockableObject so that it only uses a SRWL if we're building for Windows 7 or later. This is so that the TryEnter() API is available for the SRWL.
  • We now use JetByteTools::Win32::CombinePath() where possible rather than combining paths by hand.
  • Changed JetbyteTools::Win32::CCallbackTimerQueue and JetbyteTools::Win32::CCallbackTimerWheel to use the new intrusive containers.
  • Changed JetByteTools::IO::IBuffer::MakeSpaceAtFront() so that it takes an optional value which indicates how much space must be available at the rear of the buffer. This allows us to be able to create space at the front only if there's enough space in the buffer for the space we need at the front and rear of the buffer.
  • Changed JetByteTools::IO::CBufferChainStoresNulls so that you can pass it an instance of JetByteTools::IO::IAllocateBuffer in its constructor and it can then use that to allocate a new buffer if the number of 'null buffers' stored becomes too great. This effectively removes the limit on the number of 'null buffers' that can be stored between non-null buffers.
  • Changed JetByteTools::IO::CSortedBufferChain so that it caches whether you can get the next buffer or not. This reduces the work that needs to be done to find out if you can get the next buffer.
  • Changed the implementation of JetByteTools::Socket::CCompressingStreamSocketConnectionFilter so that it uses the new filter base classes.
  • Changed the implementation of JetByteTools::Socket::CFlowControlStreamSocketConnectionFilter so that it uses the new filter base classes.

UDP flow control and asynchronous writes

| 0 Comments

I don't believe that UDP should require any flow control in the sending application. After all, it's unreliable and it should be quite OK for any stage of the route from one peer to another to decide to drop a datagram for any reason. However, it seems that, on Window's at least, no datagrams will be dropped between the application and the network interface card (NIC) driver, no matter how heavily you load the system.

Unfortunately most NIC drivers also prefer not to drop datagrams, even if they're overloaded (see here for details of how UDP checksum offloading can cause a NIC to make non-paged pool usage considerably higher). This can lead to situations where a user mode application can bring a box down due to non-paged pool exhaustion simply by sending as many datagrams as it can as fast as it can. It's likely that it's actually poorly implemented device drivers that are at fault here; by failing to gracefully handle situations where non-paged pool allocations fail, but it's the application that is putting these drivers into a situation where they could fail in such a catastrophic manner.

Since the NIC driver and the operating system will not drop datagrams it's down to the application itself to do so if it senses that it's overloading the NIC. I've recently added code to The Server Framework to allow you to configure behaviour like this so that an application can prevent itself from exhausting non-paged pool due to pending datagram writes.

Efficient Multi-Threading

| 0 Comments

Performance is always important for users of The Server Framework and I often spend time profiling the code and thinking about ways to improve performance. Hardware has changed considerably since I first designed The Server Framework back in 2001 and some things that worked well enough back then are now serious impediments to further performance gains. That's not to say that the performance of The Server Framework today is bad, it's not, it's just that in some situations and on some hardware it could be even better.

One of the things that I changed last year was how we dispatch operations on a connection. Before the changes multiple I/O threads could block each other whilst operations were dispatched for a single connection. This was unnecessary and could reduce the throughput on all connections when one connection had several operation completions pending. I fixed this by adding a per-connection operation queue and ensuring that only one I/O thread at a time was ever processing operations for a given connection; other I/O threads would simply add operations for that connection to its operation queue and only process them if no other thread was already processing for that connection.

Whilst those changes deal with serialising multiple concurrent I/O completion events on a connection there's still the potential for multiple threads accessing a given connection's state if non-I/O threads are issuing read and write operations on the connection. Not all connections need to worry about the interaction of a read request and a completion from a previous request, but things like SSL engines may care. We have several areas in The Server Framework, and more in complex server's built on top of the framework, where it's important that only a single operation is being processed at a time. Locks are currently used to ensure that only a single operation is processed at a time, but this blocks other threads. What's more most of the code that cares about this also cares about calling out to user callback interfaces without holding locks. Holding locks when calling back into user code is an easy way to create unexpected lock inversions which can deadlock a system.

Efficient-MultiThreading-1.png

In the diagram above, given the threads 1, 2, 3 & 4 with four work items A, B, C & D for a single object, the threads are serialised and block until each can process its own operation on the object. This is bad for performance, bad for contention and bad for locality of reference as the data object can be bounced back and forth between different CPU caches.

The code that I describe in the rest of this article forms the basis of a new "single threaded processor" object which enables efficient command queuing from multiple threads where commands are only ever processed on one thread at a time, minimal thread blocking occurs, no locks are held whilst processing and an object tends to stay on the same thread whilst there's work to be done.

There are a couple of undocumented Visual Studio compiler switches which can be useful occasionally:

  • /d1reportSingleClassLayout - which produces a dump of the in memory layout of a given class
  • /d1reportAllClassLayout - which does the same for ALL classes

See here and here for more details.

And if you liked that, you might find this collection of debugging tricks interesting.

Intrusive C++ containers

| 0 Comments

Recently, whilst continuing to improve the performance of various aspects of The Server Framework, I reached a point where I found I needed to replace some STL containers with intrusive containers which didn't need to perform memory allocations during data insertion and removal. I had been toying with the idea of implementing custom containers for some time but have only recently had a pressing need for them.

The STL provides some marvellously usable code which has changed the way people view container classes in C++. The result is that you hardly ever need to write your own containers as the STL provides containers with known performance characteristics which, thanks to templates, work with any data. The downside of this is that you hardly ever need to write your own containers and so you get out of the habit of doing so...

Practical Testing: 33 - Intrusive multi-map.

| 0 Comments

Previously on "Practical Testing"... I'm in the process of replacing STL containers with custom intrusive containers in the timer system that I have been developing in this series of articles. The idea is that the intrusive containers do not require memory operations for insertion or deletion as the book-keeping data required to store the data in the container has been added to the data explicitly. This reduces the potential contention between threads in the application and, hopefully, improves overall performance.

In the last instalment I showed how my usage of std::set in the timer wheel code could be replaced by an intrusive set. This time I'll also replace the std::set and std::map from the timer queue. These are quite major internal changes but since we have a large set of unit tests it's reasonably easy to make these changes without having to worry about breaking the external interface.

Practical Testing: 32 - Intrusive containers.

| 0 Comments

Back in 2004, I wrote a series of articles called "Practical Testing" where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns all of which have been supported by the original unit tests and many of which have led to the development of more tests.

In 2010 I needed to improve the performance of the code for a specific use case. The resulting changes saw the creation of a Timer Wheel version of the code which has the same interface and uses the same tests. I also ruminated on contention, invented a crazy notation to help me talk about contention and experimented with custom STL allocators and private heaps in an attempt to reduce it. In the end I decided that the best approach was probably to use custom containers rather than the STL. The custom containers would be intrusive (or invasive as I called it at the time) and would avoid the need for memory allocation on insertion and removal of items.

I didn't have a pressing need to improve the performance further and so the intrusive container idea slipped down my list of things to do. Recently I've had a need for an intrusive map for another part of The Server Framework and have implemented some custom containers, set, map, multi-map, based on an intrusive red-black tree that I've written. It then seemed appropriate to integrate them into the timer queue and see what difference they make. That's what I'm doing in this instalment. As usual the existing tests support these, quite major, internal changes to the code.
Today I discovered that C++ scoped static initialisation (function level) in Visual Studio is not done in a thread safe manner. It's the kind of thing that I should have already known but I guess I assumed that since namespace level static initialisation was safe so was function level static initialisation. Unfortunately it's not. If you read the MSDN documentation in a particular way you could decide that the docs say that it's not; but it's imprecise and unclear and I doubt that I did read the MSDN documentation anyway. Looking at the generated code from Visual Studio 2013 clearly shows that there's no thread safety involved.

 test        byte ptr ds:[5C87D0h],1  
 jne         JetByteTools::IO::CBufferChainStoresNulls::InternalAddToFront+77h (0473967h)  
 or          dword ptr ds:[5C87D0h],1  
 xor         eax,eax  
 mov         dword ptr [esp+1Ch],eax  
 push        2Dh  
 push        56DFE8h  
 mov         ecx,5C87B4h  
 mov         dword ptr ds:[5C87CCh],7  
 mov         dword ptr ds:[005C87C8h],eax  
 mov         word ptr ds:[005C87B8h],ax  
 call        std::basic_string,std::allocator >::assign (0404130h)  
 push        55C8B0h  
 call        atexit (047E190h)  
 add         esp,4  
 mov         dword ptr [esp+1Ch],0FFFFFFFFh  

void CBufferChainStoresNulls::InternalAddToFront(
 mov         edi,dword ptr [esi+8]  

The code in bold is the one time static initialisation test, check a value and if set jump past the initialisation, else set the value. Clearly multiple threads could pass through this check at the same time and both end up creating and setting the value to the static variable. I was aware of this but had dismissed it as a minor issue which would, at worst, result in a little extra work and an extra copy of an object which would be cleaned up at program exit. There's a far worse problem though; a second thread could get to the test just after the first thread had set the 'constructed' flag but before it had actually completed constructing the variable. The second thread could then use the variable in an partially constructed state. Raymond Chen goes gives more detail on the issue here.

I'm not sure how I got into the habit of using this code pattern but I seem to have used it mostly as an attempt a micro optimisation where constant strings were required. I'm currently scanning my source code trees to remove all traces of it. There are very few situations where it might have been a valid choice and so far it's easy to remove.

I'm also not really sure how I'd convinced myself that block scoped static constant variables were OK. The fact that I knew they were lazily constructed at first use and the fact that I knew that I was using them as a micro optimisation means that I must have known that there was no compiler injected synchronisation. I guess I just assumed that "magic happens here"... It's one of those head slapping moments where the issue is SO obvious once you know about it. Anyway, I live and learn.

I guess I've been very lucky (or unlucky, you decide) up until now. Most of the scenarios where I've used this pattern have been where the object construction was obviously fast enough that I never had two threads in exactly the right places at exactly the right time. So it's a standard latent race condition... It finally came to my attention because I was using the "micro optimisation" to create some strings in a function that was being called by multiple threads. My build machine's unit tests were failing, rarely and strangely. I was getting an access violation in code that looked completely fine. Eventually I tracked the problem down by turning off compiler optimisation and generating a mini dump file when I converted the SEH exception for the access violation into a C++ exception... It then took a little while to work out exactly what the problem was.

Luckily I already compile with /Wall and with 'warnings as errors'. Visual Studio's Warning C4640 will locate the issue for me and cause my builds to fail until I fix it. All I need to do is remove the warning suppression #pragma that I have for that warning in my global warning suppression header file... Whilst I'm at it, I guess I should take a good look at the other warnings that are being suppressed...

TIME_WAIT perfmon counters

| 0 Comments
I've built a small Windows Service which exposes perfmon counters to track sockets in TIME_WAIT state. It can be downloaded from the links later in this post.

Back in 2011 I was helping a client look for issues in their systems caused by having too many sockets in a TIME_WAIT state (see here for why this can be a problem). This was affecting their connectivity. Rather surprisingly there seemed to be no way to track the number of sockets in TIME_WAIT using perfmon as there didn't seem to be a counter exposed. A counter would have been useful so that we could track the TIME_WAIT; connections over time along with all of the other metrics that we track for their system. Anyway, I put together a quick and dirty tool for the client, this worked like a special version of netstat which totalled up the number of sockets in each state by polling the system using GetExtendedTcpTable() (see here).

At the time I suggested that it wouldn't take much to build a small service which did this polling and exposed the results as a perfmon counter so that we COULD track this metric in the usual way. Well, I've finally got around to doing that (thanks to the encouragement of another client who had a similar issue). The result is TCPStatsPerfCounters which is a service that you can install and which provides counters for all of the states in the TCP table for both IPv4 and IPv6; ESTABLISHED, TIME_WAIT, CLOSE_WAIT, etc. There's an x86 and and x64 build available and the services come as a single exe which automatically deploy and install the required counter dlls when you run them with /install.

  • Download the x86 version of TCPStatsPerfCounters from here.
  • Download the x64 version of TCPStatsPerfCounters from here.
Note that the x64 version will install an x86 and an x64 counter dll for maximum interoperability. The x86 version only installs an x86 counter dll and so will not integrate with perfmon on an x64 box. You should only use the x86 version on x86 machines.

Usage

  • Unzip. You will end up with either TCPStatsPerfCounters.exe or TCPStatsPerfCounters64.exe. All of the rest of the instructions apply equally to either version of the program.
  • Run TCPStatsPerfCounters.exe with the /? command line switch to see help.
  • Run TCPStatsPerfCounters.exe with the /install command line switch to install the service. You may be prompted to elevate your credentials if you are not an admin.
  • To run as a service, open the services applet and locate the "JetByte TCP Stats Perfmon Counters" service, start it and, optionally set its start-up type to automatic.
  • To run TCPStatsPerfCounters.exe as an exe rather than as a service you still need to have installed it as a service but you can run from the command line with the /run command line switch.
  • You can get TCPStatsPerfCounters.exe to produce a log file - this may be useful if you want to track socket states outside of perfmon. Either /install or /run with the additional /createLog command line switch. By default the log is created in the same directory as the exe, you can change this with the addition of the /logPath command line switch.
  • By default the program polls the system every second, you can change this by using the /poll command line switch (again supply this either with /install or /run. The polling interval is specified in seconds.

Counters

There are counters for all of the socket states for both IPv4 and IPv6. Sockets transition through some of the states very quickly and so you may not see many of the states. There are also maximum counters for TIME_WAIT which show the highest value seen. In addition there are counters for the MaxUserPort and TcpTimedWaitDelay registry keys. I found it useful to have these values clearly visible for situations where machines happened to be configured in an unusual way and nobody had thought to check the registry. These values are set once on program start up and will be zero if the registry key is not set and the operating system's default value is in operation.

About this Blog

I usually write about C++ development on Windows platforms, but I often ramble on about other less technical stuff...

This page contains recent content. Look in the archives to find all content.

I have other blogs...

Subscribe to feed The Server Framework - high performance server development
Subscribe to feed Lock Explorer - deadlock detection and multi-threaded performance tools
Subscribe to feed l'Hexapod - embedded electronics and robotics
Subscribe to feed MegèveSki - skiing