Bug hunting

I’ve just spent a day tracking down a bug in a pending release of The Server Framework. It was an interesting, and actually quite enjoyable, journey but one that I shouldn’t have had to make. The bug was caused by a new Windows API call being inserted between an existing API call and the call to GetLastError() that retrieves its error code on failure. The new call overwrote the last error value and this confused the error handling code for the original call. This was a bit of a “schoolboy error”, made all the worse by the fact that the code in question had a comment which clearly explained why nothing should be placed between the API call and the call to GetLastError().
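
To make the failure mode concrete, here’s a minimal sketch of the general pattern. This isn’t the framework’s code; SomeOtherApiCall() is a hypothetical stand-in for any Win32 call that touches the thread’s last error value as a side effect:

   #include <windows.h>
   #include <cstdio>

   // Hypothetical stand-in for "some other API call"; here it clobbers the
   // thread's last error value explicitly, which many real calls do as a
   // side effect of succeeding.
   static void SomeOtherApiCall()
   {
      ::SetLastError(ERROR_SUCCESS);
   }

   int main()
   {
      // Force a failure so that there's a meaningful last error value to read.
      const HANDLE h = ::CreateFileA(
         "Z:\\no\\such\\file", GENERIC_READ, 0, NULL, OPEN_EXISTING, 0, NULL);

      if (h == INVALID_HANDLE_VALUE)
      {
         SomeOtherApiCall();                        // BROKEN: inserted between the two...

         const DWORD lastError = ::GetLastError();  // ...so this reads 0, not the real error

         printf("last error = %lu\n", lastError);
      }

      return 0;
   }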

So, what went wrong?

This bug has severely delayed the 6.6.3 release of The Server Framework, mainly because it caused many of my “black-box” server tests to fail spectacularly, with all manner of symptoms and with crash dumps indicating invalid memory accesses. This, in turn, caused several days of head scratching which used up the time I’d allocated to pushing the release out. Then client work took over and the release had to wait, as I had no time slots large enough to allocate to finding and fixing the bug.

Annoyingly, all of the unit tests passed. The code only failed under stress, because the code path that contained the bug wasn’t exercised by the unit tests. This is why I have the “black-box” server tests as part of my build and release process, but they don’t get run as often as they should as I don’t practice continuous integration, for various reasons. CI would have spotted the problem as soon as it was introduced into the codebase, and it would then have been easy to triage the changes and to locate and fix the issue.

The release in question had always been a collection of unimportant minor changes, one that had been allowed to accumulate fixes over time and that had no real plan to it. All of the changes had been merged in from client fixes, and so all of them were already active in the field to various degrees. It should have been easy to push the ‘build, test and release’ button at any time and push the release out. However, when I did so, the lack of CI meant that this was the first time that the code had been stress tested and, obviously, the clients who had the broken fix weren’t putting the code into the same situations that my stress tests create.

The code change that caused the issue was made in response to this issue, relating to multiple threads calling WSASend() or WSARecv() at the same time. The solution was to add some per-connection locking around the issuing of WSASend() and WSARecv() calls. Since there were several places in the code where this change was required it became a bit of a ‘cut and paste’ fix, with the same pattern being applied to both the TCP and UDP code in multiple places.

One example of the change is this:

   const int result = ::WSARecv(
      socket.GetSocket(),
      buffer.GetWSABUF(),
      buffer.GetWSABUFCount(),
      &dwNumBytes,
      &dwFlags,
      NULL,             // THIS makes us a sync read
      NULL);

   // In case the recv call fails, we have to obtain the last error before we unlock
   // the lock as otherwise the action of unlocking the lock will reset the last error
   // value...

   const DWORD lastError = ::WSAGetLastError();

   #if (JETBYTE_TRACE_STREAM_SOCKET_READ_TO_DEBUG_LOG == 1)
   JetByteTools::Win32::OutputEx(_T("SOCKET SyncRead Completed -") + socket.GetIdentifier());
   #endif

   if (pLock)
   {
      pLock->Leave();
   }

Which was changed to something like this:

   JetByteTools::Win32::ICriticalSection::ConditionalOwner lock(socket.GetLock());

   const int result = ::WSARecv(
      socket.GetSocket(),
      buffer.GetWSABUF(),
      buffer.GetWSABUFCount(),
      &dwNumBytes,
      &dwFlags,
      NULL,             // THIS makes us a sync read
      NULL);

   lock.Leave();

   // In case the recv call fails, we have to obtain the last error before we unlock
   // the lock as otherwise the action of unlocking the lock will reset the last error
   // value...

   const DWORD lastError = ::WSAGetLastError();

   #if (JETBYTE_TRACE_STREAM_SOCKET_READ_TO_DEBUG_LOG == 1)
   JetByteTools::Win32::OutputEx(_T("SOCKET SyncRead Completed -") + socket.GetIdentifier());
   #endif

   if (pLock)
   {
      pLock->Leave();
   }

As you can see, I completely ignored the comment which told me not to do what I was doing and why… The corrected change is something like this:

   const bool needToLockAroundRecv = (pLock == 0);

   JetByteTools::Win32::ICriticalSection::ConditionalOwner lock(socket.GetLock(), needToLockAroundRecv);

   const int result = ::WSARecv(
      socket.GetSocket(),
      buffer.GetWSABUF(),
      buffer.GetWSABUFCount(),
      &dwNumBytes,
      &dwFlags,
      NULL,             // THIS makes us a sync read
      NULL);

   // In case the recv call fails, we have to obtain the last error before we unlock
   // the lock as otherwise the action of unlocking the lock will reset the last error
   // value...

   const DWORD lastError = ::WSAGetLastError();

   if (needToLockAroundRecv)
   {
      lock.Leave();
   }

   #if (JETBYTE_TRACE_STREAM_SOCKET_READ_TO_DEBUG_LOG == 1)
   JetByteTools::Win32::OutputEx(_T("SOCKET SyncRead Completed -") + socket.GetIdentifier());
   #endif

   if (pLock)
   {
      pLock->Leave();
   }

A better change, and one that is already present in my “this changes everything” 7.0 release, is to wrap the call to WSARecv() in a function which returns the last error value, rather than the two-stage result and last error shown above. The change in 7.0 makes it impossible to separate the two calls at their point of use.
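
The 7.0 code isn’t shown here, but a sketch of the idea might look something like this; SyncRecv() is a hypothetical name, not the actual 7.0 API:

   #include <winsock2.h>

   // Hypothetical wrapper in the spirit of the 7.0 change; the last error
   // value is captured inside the wrapper, immediately after the API call,
   // so callers can no longer accidentally insert code between the two.
   static DWORD SyncRecv(
      SOCKET s,
      LPWSABUF pBuffers,
      DWORD bufferCount,
      DWORD &numberOfBytes,
      DWORD &flags)
   {
      const int result = ::WSARecv(
         s,
         pBuffers,
         bufferCount,
         &numberOfBytes,
         &flags,
         NULL,             // no OVERLAPPED - a sync read
         NULL);

      // ERROR_SUCCESS (0) on success, the error code otherwise.
      return (result == SOCKET_ERROR ? ::WSAGetLastError() : ERROR_SUCCESS);
   }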

The result of overwriting the last error value was that the subsequent error handling code treated ‘normal’ pending I/O results as errors, and decremented the socket and buffer references because it assumed that the I/O operation would never complete. When the I/O operation did complete, the buffer and socket references were decremented again, and this eventually led to objects being accessed after they had been destroyed. Of course this was a nice little race condition and so was difficult to discover.
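
A simplified sketch of that failure mode, again not the framework’s code; Counted and its Release() are hypothetical stand-ins for the framework’s reference counted socket and buffer objects:

   #include <winsock2.h>

   // Hypothetical stand-in for a reference counted object; the final
   // release destroys the object.
   struct Counted
   {
      long refs;

      void Release()
      {
         --refs;
      }
   };

   void HandleRecvResult(
      int result,
      DWORD lastError,
      Counted &socket,
      Counted &buffer)
   {
      if (result == SOCKET_ERROR && lastError != WSA_IO_PENDING)
      {
         // Treated as a real failure: no completion will ever arrive, so
         // the references taken out for the I/O are released here. With
         // the clobbered error value of 0 this branch also fired for
         // pending I/O, and the completion later released the same
         // references a second time - hence the use after free.
         socket.Release();
         buffer.Release();
      }
   }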

My unit tests, though extensive, obviously don’t force a situation where the I/O operation executes asynchronously, and with modern hardware this probably gets harder and harder to achieve without API hooking to massage API return values. Once again this is not an issue in 7.0, where the whole network API layer is replaceable, and therefore mockable, which means that all API call results can be simulated.
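
Again the 7.0 code isn’t shown here, but a sketch of that kind of replaceable API layer might look something like this; the interface and class names are hypothetical:

   #include <winsock2.h>

   // Production code calls through this interface rather than calling
   // WSARecv() directly...
   class IWinsockApi
   {
   public:
      virtual DWORD SyncRecv(
         SOCKET s,
         LPWSABUF pBuffers,
         DWORD bufferCount,
         DWORD &numberOfBytes,
         DWORD &flags) = 0;

   protected:
      ~IWinsockApi() {}
   };

   // ...so that a test double can return any result that the tests need,
   // including failures that are hard to arrange with a real socket.
   class MockWinsockApi : public IWinsockApi
   {
   public:
      virtual DWORD SyncRecv(
         SOCKET /*s*/,
         LPWSABUF /*pBuffers*/,
         DWORD /*bufferCount*/,
         DWORD &numberOfBytes,
         DWORD &flags)
      {
         numberOfBytes = 0;
         flags = 0;

         return WSAECONNRESET;   // simulate a connection reset on every read
      }
   };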

Tracking down the issue

It took a while to work out what was going on. Initially the symptoms were just that the server or client test would fail, with crash dumps caused by invalid memory accesses. Examining the dumps clearly showed that the buffer and socket references were screwed up. Enabling the internal debug reference count tracking helped somewhat, but the logs that it produces (stack traces for every reference change on every socket and every buffer) are hard to deal with. The logs don’t get used very often, but when they do they’re important and useful. I had got into the habit of post processing them manually, and that finally proved to be enough of a pain for me to build a log post processor that extracts and formats the ref counting messages for easier consumption.
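
A minimal sketch of that kind of post processing might look something like this; the “REF:” marker and default file name are assumptions, the framework’s actual log format differs:

   #include <fstream>
   #include <iostream>
   #include <string>

   // Pull just the reference counting messages out of a debug log so that
   // they can be read without the surrounding noise.
   int main(int argc, char *argv[])
   {
      std::ifstream log(argc > 1 ? argv[1] : "server.log");

      std::string line;

      while (std::getline(log, line))
      {
         if (line.find("REF:") != std::string::npos)
         {
            std::cout << line << "\n";
         }
      }

      return 0;
   }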

The next problem was not believing what the logs were telling me. I looked at the code and couldn’t see the problem; I ignored the comment, obviously. I’m generally one to believe that comments tend to be of little use and that the code should be trusted over the comments, but I should probably have read this one and either deleted it, if it was incorrect, or acted on it. To be fair, the comment only appears in 2 of the 8 places where the problem could manifest (the fix touched several different API calls, in both the TCP and UDP code, and the comment was only necessary for TCP reads). The change was probably introduced incorrectly first into the code areas that didn’t have the comment, and then the ‘cut and paste’ fix mentality took over; by the time I applied the fix to this location I was on auto-pilot.

When I finally stopped tweaking the log analyser and decided that what it was telling me WAS the truth, I added a ‘note’ (basically a printf) to the ref count trace to tell me which error was causing the references to be released. I finally noticed that the last error was 0 and realised that it must be being corrupted somehow.

Improving my process

I think the biggest improvement that I could make is to run my full build, test and release cycle after every check-in to development branches like this one; to use it as a real CI system, in other words. This would have caught the issue early and saved me heaps of time.

I should also read my comments more.

The CI system is currently problematic as I still use my hacked together version of CruiseControl.NET, which has always been a creaky solution, mainly due to the impedance mismatch between CC.Net’s ‘thread per job’ design and the number of projects that I need to build. I build all of my libraries for all supported compilers and all supported platforms and run all the unit tests, then I do the same for all of the server and client examples. This is a large number of projects, and at present I manually configure and batch them to get it all done. This manual step is fine when I’m doing it once or twice a release, but it’s not acceptable for CI. Also, having to have the three build servers up and running 24/7 is less than ideal (in summer) as they’re under my desk and they warm the room…

The answer, I expect, is some cloud based CI system…