TCP/IP Server Failures

One of the good things about the server performance testing that I've been doing recently is that I have been able to witness some rather rare failure conditions in action and see how The Server Framework handles them.

When using IO Completion Ports for asynchronous IO you get notifications when reads and writes complete. Read completions are obviously important events; they are where you receive incoming data. Write completions, on the other hand, are often viewed as less important. All that a write completion tells you is that the TCP/IP stack has taken ownership of your data. It doesn't guarantee that the data has been sent or that it has been received by the other party; you need to rely on your own protocol to transmit that kind of information. The current framework design provides virtual functions to deal with read and write completions. All servers need to provide an implementation for ReadCompleted() but most servers can pretty much ignore WriteCompleted(). The default implementation of WriteCompleted() used to look like this:

void CStreamSocketConnectionManagerCallbacks::WriteCompleted(
   IStreamSocket *pSocket,
   IBuffer *pBuffer)
{
   // Derived class overrides this to deal with write completions
  
   // The check below has never failed in production code.
  
   if (pBuffer->GetUsed() != pBuffer->GetWSABUF()->len)
   {
      OnError(pSocket, 
         _T("CStreamSocketConnectionManager::WriteCompleted")
         _T(" - Socket write where not all data was written - expected: ") + 
         ToString(pBuffer->GetWSABUF()->len) + 
         _T(" sent:") + ToString(pBuffer->GetUsed()));
   }
}

After running my test tool in such a way that the server is pushed to exhaust the non-paged pool or exceed the I/O page lock limit, I have now actually seen the check above fail. The comment now reads:
   // The check below has never failed in production code
   // but can do so if you run out of non-paged pool at the right point...
   // If you only have a single write pending you can resync and resend based
   // on what you actually managed to send...

Obviously, what you can actually do at this point will depend on your server design and the protocol that it's talking. All of the failures that I've seen so far have been total failures. That is, the amount of data sent has been 0, but then my buffer sizes have been smaller than the operating system's page size. If your send buffers are multiple pages in size then I guess you could, perhaps, see partial failures where only part of the buffer is sent (more testing is required here). Since you have the data buffer and details of how much of the buffer was actually sent, you could attempt to resend the data (though trying straight away would probably be a bad idea). I'm pretty sure that this failure is due to non-paged pool exhaustion, but without seeing the code inside the TCP/IP stack I can't be sure.
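The resend idea can be sketched as a small calculation of which part of the buffer is still unsent. This is only an illustration, not framework code: `UnsentRange` and `UnsentAfterWrite()` are hypothetical names, and it assumes that only a single write is pending on the connection, so that resending the tail cannot reorder the stream.

```cpp
#include <cstddef>

// Hypothetical helper types, not part of The Server Framework.
struct UnsentRange
{
   std::size_t offset;   // index of the first unsent byte in the original buffer
   std::size_t length;   // number of bytes that still need to be sent
};

// Given how many bytes the write asked for and how many the completion
// reported as sent, work out what is left. A length of 0 means the write
// completed in full. A total failure (0 bytes sent) yields the whole buffer.
inline UnsentRange UnsentAfterWrite(
   const std::size_t bytesRequested,
   const std::size_t bytesSent)
{
   if (bytesSent >= bytesRequested)
   {
      return UnsentRange{ bytesRequested, 0 };
   }

   return UnsentRange{ bytesSent, bytesRequested - bytesSent };
}
```

A total failure of a 1024 byte write gives `{0, 1024}`, while a two page write that managed only its first page gives `{4096, 4096}`. When to retry is a separate decision that depends on the server and the protocol.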

The other failure that I hadn't seen before was due to exceeding the "I/O page lock limit". The design of The Server Framework means that all IO operations are performed on the IO pool threads. That is, the IO threads issue the actual read and write calls as well as handling the completions. I do this because it simplifies the use of the framework (as I mentioned in the original article, by doing this we avoid having to worry about thread termination affecting outstanding IO requests). The downside of this design is twofold: there's a slight performance hit due to the extra trip through the IOCP (although we optimise this away if we know we're already on an IO thread), and if the actual calls to WSASend() or WSARecv() fail then we have to report that failure in a slightly convoluted manner. Up until this recent testing exercise these calls hadn't failed. The actual code looks something like this:

   if (SOCKET_ERROR == ::WSASend(
      pSocket->GetSocket(),
      pBuffer->GetWSABUF(), 
      1, 
      &dwSendNumBytes,
      dwFlags,
      static_cast<OVERLAPPED*>(pBuffer), 
      NULL))
   {
      DWORD lastError = ::WSAGetLastError();
  
      if (ERROR_IO_PENDING != lastError)
      {
         pSocket->OnConnectionError(WriteError, pBuffer, lastError);
           
         pSocket->WriteCompleted();  // this pending write will never complete...
  
         pSocket->Release();
         pBuffer->Release();
      }
   }

Failures get passed to the socket's OnConnectionError(), which routes them back to the derived server or connection manager object where you can handle the failure. At present both read errors and write errors are routed to the same handler, with an enum used to differentiate them. I don't currently pass the dwSendNumBytes parameter through as I assume that a failure is always total (that's now on my list of things to check). Again, what you can do to recover depends on your server design and the protocol you're using. One thing to be aware of is what happens if a read fails in this way. If your design means that you only have a single read pending on each socket at any one time, then you now do not have a read pending on this socket. If, as is quite usual in servers that I design, your connection is being held open purely by the fact that the pending read keeps the socket's reference count above zero, then your connection will close.
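The reference counting consequence is easier to see in a small model. The sketch below is not the framework's socket class: `ModelSocket` and its members are hypothetical, it is single threaded, and a `bool` stands in for whether the underlying read call succeeded. It only shows how a failed read can drop the last reference.

```cpp
#include <cassert>

// Hypothetical single-threaded model of a reference counted socket whose
// connection stays open only while something holds a reference to it.
class ModelSocket
{
   public :

      void AddRef() { ++m_refs; }

      void Release()
      {
         assert(m_refs > 0);

         if (--m_refs == 0)
         {
            m_open = false;   // last reference gone, connection closes
         }
      }

      // Models issuing an overlapped read. The pending read owns a reference;
      // if the call itself fails there will be no completion, so the
      // reference is released immediately.
      bool IssueRead(const bool callSucceeds)
      {
         AddRef();

         if (!callSucceeds)
         {
            Release();
         }

         return callSucceeds;
      }

      bool IsOpen() const { return m_open; }

   private :

      int m_refs = 1;       // the creator's reference
      bool m_open = true;
};
```

If the creator issues a successful read and then releases its own reference, the pending read keeps the connection open. When that read completes and the next `IssueRead(false)` fails, releasing the completed read's reference takes the count to zero and the connection closes.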

5 Comments

Len - some interesting articles/information here! Thanks for that.

I'm wondering, when you hit the NP pool limits in your testing, was that visible in an accurate fashion in Task Manager (i.e. if you use Task Manager -> View, Select Columns, NP Pool, and watch the NP Pool for the server process)? In other words, was the NP Pool usage for the process in Task Manager hitting 256 MB, or at least a hundred megs?

(Note - I'm not using your code here - just using your articles to learn, since I'm building my own IOCP framework for personal use in apps.)


Not sure if you'll read this comment on a 4 month old post, but then you did end up replying to tons of comments on your earlier IOCP code posts, so it's worth a try :)

The numbers were fairly accurate when I hit the NP limits, but I found that I tended to hit the locked pages limit more often and there are no numbers that you can use to detect that.

Right, and WSAENOBUFS is the error common to both, so you can't tell which is which.

I'm now beginning to think I'm running into locked page limit errors.

Interesting stuff :)

Indeed.

You probably are, and there's no way to tell the difference. There's also no real way to determine if you're about to hit the locked page limit, because it's system wide and not process wide, and I can't think of a reasonable way to work out how the system stands with regard to locked pages...

My solution was to allow the server to limit the number of connections programmatically and then, if you own the box where the server runs, you can configure it to be safe. I then allowed each 'server' (i.e. listening port) within a process to share this limit so that you can control multi-host and/or multi-port servers... Seems to work, but it's all a bit theoretical as my clients tend not to run into these limits on their production boxes anyway.
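A shared limit of that kind could look something like the sketch below. It is not the framework's implementation: `ConnectionLimiter` and its members are hypothetical names, and the only assumption is that every listening port in the process holds a reference to the same limiter, asks it before accepting a connection and tells it when a connection closes.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical process-wide connection limit shared by several listeners.
class ConnectionLimiter
{
   public :

      explicit ConnectionLimiter(const std::size_t maxConnections)
         :  m_max(maxConnections),
            m_current(0)
      {
      }

      // Returns true and takes a slot if the limit has not been reached.
      bool TryAcquire()
      {
         std::size_t current = m_current.load();

         while (current < m_max)
         {
            // On failure 'current' is reloaded and the limit is rechecked.
            if (m_current.compare_exchange_weak(current, current + 1))
            {
               return true;
            }
         }

         return false;
      }

      // Call once for each connection that closes after a successful TryAcquire().
      void Release() { m_current.fetch_sub(1); }

      std::size_t InUse() const { return m_current.load(); }

   private :

      const std::size_t m_max;
      std::atomic<std::size_t> m_current;
};
```

With a limit of 2, the third `TryAcquire()` fails until one of the first two connections calls `Release()`. The right value for the limit is something you would find by testing on the box in question.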

Wondering how best to minimize one's chances of hitting the locked page limit. Is there anything else one can do apart from having WSARecv do zero-byte receives (and then fetching the data via non-overlapped WSARecv calls when the zero-byte completion notification arrives)?

As far as debugging goes, though, one can try to test specifically for NP Pool limits: if operations fail with WSAENOBUFS, spawn a thread that just calls WSASocket 500 times or so. That should cause allocation of 1 MB of NPP; if all those calls succeed, then you probably have a locked page issue.
