TCP/IP Server Failures

2005-11-04

One of the good things about the server performance testing that I’ve been doing recently is that I have been able to witness some rather rare failure conditions in action and see how The Server Framework handles them.

When using IO Completion Ports for asynchronous IO you get notifications when reads and writes complete. Read completions are obviously important events; they’re where you receive incoming data. Write completions, on the other hand, are often viewed as less important. All that a write completion tells you is that the TCP/IP stack has taken ownership of your data. It doesn’t guarantee that the data has been sent or that it has been received by the other party; you need to rely on your own protocol to transmit that kind of information. The current framework design provides virtual functions to deal with read and write completions. All servers need to provide an implementation for ReadCompleted() but most servers can pretty much ignore WriteCompleted(). The default implementation of WriteCompleted() used to look like this:

void CStreamSocketConnectionManagerCallbacks::WriteCompleted(
   IStreamSocket *pSocket,
   IBuffer *pBuffer)
{
   // Derived class overrides this to deal with write completions
  
   // The check below has never failed in production code.
  
   if (pBuffer->GetUsed() != pBuffer->GetWSABUF()->len)
   {
      OnError(pSocket, 
         _T("CStreamSocketConnectionManager::WriteCompleted")
         _T(" - Socket write where not all data was written - expected: ") + 
         ToString(pBuffer->GetWSABUF()->len) + 
         _T(" sent:") + ToString(pBuffer->GetUsed()));
   }
}

After running my test tool in such a way that the server is pushed to exhaust the non-paged pool or exceed the I/O page lock limit, I have now actually seen the check above fail. The comment now reads:

   // The check below has never failed in production code
   // but can do so if you run out of non-paged pool at the right point...
   // If you only have a single write pending you can resync and resend based
   // on what you actually managed to send...

Obviously what you can actually do at this point will depend on your server design and the protocol that it’s talking. All of the failures that I’ve seen so far have been total failures; that is, the amount of data sent has been 0. However, my buffer sizes have been smaller than the operating system’s page size. If your send buffers are multiple pages in size then I guess you could, perhaps, see partial failures where only part of the buffer is sent (more testing required here). Since you have the data buffer and details of how much of it was actually sent, you could attempt to resend the data (though trying straight away would probably be a bad idea, as the resource shortage that caused the failure may not have cleared). I’m pretty sure that this failure is due to non-paged pool exhaustion, but without seeing the code inside the TCP/IP stack I can’t be sure.
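As a rough illustration of the resend idea, here is a minimal sketch of working out which part of a buffer still needs sending after a short write. It is plain C++ rather than framework code: UnsentRange, GetUnsentRange() and CopyUnsentBytes() are hypothetical names, and a std::string stands in for the framework's buffer. In terms of the check shown earlier, "expected" corresponds to GetWSABUF()->len and "sent" to GetUsed().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical description of the part of a write that was not sent.
struct UnsentRange
{
   std::size_t offset;   // index of the first byte that was not sent
   std::size_t length;   // number of bytes that still need sending
};

// Given the size of the write that was issued and the byte count reported
// by the write completion, work out what is left to send.
inline UnsentRange GetUnsentRange(std::size_t expected, std::size_t sent)
{
   assert(sent <= expected);   // a completion cannot report more than was asked for

   return UnsentRange{ sent, expected - sent };
}

// Copy the unsent tail so that it can be queued for a later retry.
// A total failure (sent == 0) returns the whole buffer.
inline std::string CopyUnsentBytes(const std::string &data, std::size_t sent)
{
   const UnsentRange range = GetUnsentRange(data.size(), sent);

   return data.substr(range.offset, range.length);
}
```

A total failure, which is all that I have seen so far, yields the whole buffer; a partial failure would yield just the tail. This only makes sense if you have a single write pending on the connection, and when to retry is a separate decision that depends on your server design.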

The other failure that I hadn’t seen before was due to exceeding the “I/O page lock limit”. The design of The Server Framework means that all IO operations are performed on the IO pool threads. That is, the IO threads issue the actual read and write calls as well as handling the completions. I do this because it simplifies the use of the framework (as I mentioned in the original article, by doing this we avoid having to worry about thread termination affecting outstanding IO requests). The downside of this design is that there’s a slight performance hit due to the extra trip through the IOCP (although we optimise this away if we know we’re already on an IO thread), and that if the actual calls to WSASend() or WSARecv() fail then we have to report that failure in a slightly convoluted manner. Up until this recent testing exercise these calls hadn’t failed. The actual code looks something like this:

   if (SOCKET_ERROR == ::WSASend(
      pSocket->GetSocket(),
      pBuffer->GetWSABUF(), 
      1, 
      &dwSendNumBytes,
      dwFlags,
      static_cast<OVERLAPPED*>(pBuffer), 
      NULL))
   {
      DWORD lastError = ::WSAGetLastError();
  
      if (ERROR_IO_PENDING != lastError)
      {
         pSocket->OnConnectionError(WriteError, pBuffer, lastError);
           
         pSocket->WriteCompleted();  // this pending write will never complete...
  
         pSocket->Release();
         pBuffer->Release();
      }
   }

Failures get passed through to the socket’s OnConnectionError(), which routes them back to the derived server or connection manager object where you can handle the failure. At present both read errors and write errors are routed to the same handler, with an enum used to differentiate them. I don’t currently pass the dwSendNumBytes parameter through as I assume that a failure is always total (that’s now on my list of things to check). Again, what you can do to recover depends on your server design and the protocol you’re using. One thing to be aware of is that if a read fails in this way, and your design means that you only have a single read pending on each socket at any one time, then you no longer have a read pending on this socket. If, as is quite usual in servers that I design, your connection is being held open purely by the fact that the pending read holds the socket’s reference count above zero, then your connection will close.
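To make the reference counting point concrete, here is a small model of the behaviour described above. It is a sketch in plain C++ and not the framework's API: Connection, IoOperation, PostRead(), OnIoFailed() and ConnectionSurvivesFailedRead() are all hypothetical names, and the only assumption carried over from the framework is that a pending read holds a reference which is released when the read fails.

```cpp
#include <cassert>

// Hypothetical equivalent of the enum that tells the shared error handler
// whether a read or a write failed.
enum class IoOperation { Read, Write };

// Hypothetical model of a reference counted connection.
class Connection
{
public:
   void AddRef() { ++m_refs; }

   void Release()
   {
      assert(m_refs > 0);

      if (--m_refs == 0)
      {
         m_open = false;   // the last reference has gone, so the connection closes
      }
   }

   // Each pending operation holds a reference for as long as it is outstanding.
   void PostRead() { AddRef(); ++m_pendingReads; }
   void PostWrite() { AddRef(); }

   // Called when the read or write call itself fails, so that the
   // operation will never complete through the IOCP.
   void OnIoFailed(IoOperation operation)
   {
      if (operation == IoOperation::Read)
      {
         --m_pendingReads;   // there is now one fewer read outstanding
      }

      Release();   // drop the reference that the failed operation held
   }

   bool IsOpen() const { return m_open; }
   int PendingReads() const { return m_pendingReads; }

private:
   int m_refs = 0;
   int m_pendingReads = 0;
   bool m_open = true;
};

// A connection with a single pending read; does it survive that read failing?
inline bool ConnectionSurvivesFailedRead(bool handlerKeepsReference)
{
   Connection connection;

   connection.PostRead();   // the only thing holding the connection open

   if (handlerKeepsReference)
   {
      connection.AddRef();   // for example, held while a retry is scheduled
   }

   connection.OnIoFailed(IoOperation::Read);

   return connection.IsOpen();
}
```

In this model ConnectionSurvivesFailedRead(false) returns false, which is the situation described above: the failed read was the only thing keeping the connection alive. If the error handler wants the connection to survive it has to hold a reference of its own, or get another read pending, before the failed operation's reference is released.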