Being wrong.

2007-10-12

This week I’ve spent some of the time being wrong, which has proven useful as I’ve learned quite a lot and clarified my understanding of a situation. It all began when I had a bug report from a client who claimed that an obscure internal part of The Server Framework wasn’t working as he expected it to…

The report said that the m_outstandingWrites variable of the TAsyncSocket sometimes ‘stays above 2’. This counter is used internally to determine if there are writes that the user of the framework has issued on a connection that have not yet been passed down to the TCP stack for processing. Its sole reason for existence is to try and reduce the chance that you will lose any data that you intend to send if you issue a write call that is directly followed by a shutdown of the sending side of the socket connection. Since the framework will marshal all read and write requests onto its own threads before issuing them, a shutdown that is issued directly after a write on a thread that isn’t one of the framework’s I/O threads is quite likely to shut the connection down before the write is even attempted… The counter prevents this.
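To make the idea concrete, here is a minimal sketch of how a counter like this can hold back a shutdown until the queued writes have been handed to the stack. This is not the framework's actual code; the class name DeferredShutdown and all of its members are hypothetical, and the sketch is single-threaded for clarity (a real version would need atomics or a lock, since the write request and the write issue happen on different threads).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical illustration only, not part of The Server Framework.
// Tracks writes that have been requested but not yet issued to the TCP
// stack, and delays a send-side shutdown until that count reaches zero.
class DeferredShutdown
{
public:
    explicit DeferredShutdown(std::function<void()> shutdownSend)
        : m_shutdownSend(std::move(shutdownSend)) {}

    // The user called write; the request is queued for an I/O thread.
    void WriteRequested() { ++m_unissuedWrites; }

    // An I/O thread has passed the write down to the TCP stack.
    void WriteIssued()
    {
        assert(m_unissuedWrites > 0);
        if (--m_unissuedWrites == 0 && m_shutdownRequested) {
            DoShutdown();
        }
    }

    // The user asked to shut down the sending side of the connection.
    void RequestShutdown()
    {
        m_shutdownRequested = true;
        if (m_unissuedWrites == 0) {
            DoShutdown();
        }
        // Otherwise the last WriteIssued() call performs the shutdown.
    }

private:
    void DoShutdown()
    {
        if (!m_shutdownDone) {
            m_shutdownDone = true;
            m_shutdownSend();
        }
    }

    std::function<void()> m_shutdownSend;
    std::size_t m_unissuedWrites = 0;
    bool m_shutdownRequested = false;
    bool m_shutdownDone = false;
};
```

With this in place, a write followed immediately by a shutdown request results in the shutdown callback running only after WriteIssued() has been called for that write.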

The client had hijacked this counter to mean something else to them and it wasn’t doing what they wanted it to. They were peeking at the counter from time to time to determine if they wanted to send more data or if they wanted to queue the data in the server in a priority queue system. If the counter was over 2 then they queued data; if the counter was less than 2 then they sent data. Their problem was that on some “slow” connections, when they were sending lots of data, the counter was getting ‘stuck’ at 2 and their sending was stalling. We still haven’t got to the bottom of this problem; it only manifests on one of their test machines and we’re still gathering data… However, my initial reaction was “you shouldn’t be doing that”, and that reaction was only half right.

I stand by the half that I’ve always thought was right, that is, hijacking this particular part of the internals of the framework is not the best way to achieve their goal. What they should be doing is providing an implementation of OnWriteCompleted() and maintaining their own sending state counters or whatever within that. By hijacking an internal part of the framework that is clearly not designed for external use they are exposing themselves to changes in the framework’s internals and relying on undocumented behaviour…

The part that I was wrong about was then going on and saying that they shouldn’t be managing their flow control in this way at all, but that they should, instead, be adjusting their application level protocol to incorporate some form of explicit flow control. It’s strange that I took this particular viewpoint as I’ve written servers that drive their sending off of the completion of previous writes in much the same way that this client wanted to do. The thing is that I never had problems with these servers and here, in the depths of a bug report that I couldn’t really fully understand I decided that the approach that the client was taking was to blame for the problems they were having… It’s not my code, it’s your design… Hmm…

But it’s not their design and, perhaps, if they had been doing this in the way that I would have done it, i.e. by extending the framework at the point that it was intended to be extended for this kind of thing, then I wouldn’t have jumped to the wrong conclusions so quickly. Anyway, I did some googling and asked some questions and changed my mind. What the client is doing is good, and we need to work out why it’s not working.

So, what exactly are they doing by controlling the rate at which they issue send requests by the rate at which those send requests complete?

OnWriteCompleted() is called when an overlapped write completes. This happens when the data in the buffer that you have provided is either sent “onto the wire” or copied into buffers within the TCP stack. All that it means is that you can clean up the buffers that you were using for the send operation. It does NOT mean that the data that you have sent has reached the other end of the connection or anything else like that. The data is now in the hands of the TCP stack, which has taken responsibility for delivering it to the other end of the connection; once it’s there, the other end’s TCP stack will take responsibility for delivering the data to the application that is using the socket.

TCP/IP implements a system whereby there’s only allowed to be a certain amount of data ‘in flight’ between one end of the connection and the other at a particular time. This system is called the TCP receive window, or TCP window, and it’s described in detail here and here. Essentially, the local TCP stack can only have a certain amount of data ‘outstanding’ and once the limit is reached it doesn’t send any more data. This limit is negotiable and adjustable during the lifetime of a connection. As the local stack receives the ACKs that the remote TCP stack sends for the data that it has received, it can send more data. A window size of one “packet’s worth” of data would mean that the local stack could only have one “packet’s worth” of data en route to the remote end at a time; only when the remote end had received the data and the ACK had made its way back could the local end send more. Obviously the window size is usually much bigger.
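A bit of arithmetic shows why the window matters for a streaming server: since at most one window’s worth of data can be in flight per round trip, the window size divided by the round trip time gives a rough upper bound on throughput. The sketch below works that out; the function names are my own, and the 65535 byte window, 1460 byte segment size and 100ms round trip used in the note afterwards are illustrative values rather than measurements from the client’s system.

```cpp
#include <cassert>
#include <cstdint>

// Rough upper bound on throughput for a given window and round trip time:
// at most one full window can be sent and acknowledged per round trip.
std::uint64_t MaxBytesPerSecond(std::uint64_t windowBytes, std::uint64_t roundTripMs)
{
    assert(roundTripMs > 0);
    return windowBytes * 1000 / roundTripMs;
}

// Number of full-sized segments that fit in the window at once.
std::uint64_t FullSegmentsInFlight(std::uint64_t windowBytes, std::uint64_t segmentBytes)
{
    assert(segmentBytes > 0);
    return windowBytes / segmentBytes;
}
```

For example, MaxBytesPerSecond(65535, 100) gives 655350 bytes per second, and FullSegmentsInFlight(65535, 1460) gives 44 segments, so on a slow or high-latency link the window, not the server, quickly becomes the limiting factor.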

Once the TCP stack has reached the amount of data that it’s allowed to send, it can buffer send requests in data buffers in the TCP stack. Once it reaches the limit of data that it can buffer, your overlapped write requests will stop completing (the stack can’t complete the request) and your data buffers may then be “locked” and used directly by the stack, rather than having it copy your data into its buffers when space becomes available; though obviously this is an implementation detail that a) I might be wrong about and b) could change. There’s a limit on the number of pages that the system can have locked, so if you keep issuing overlapped writes when the stack has used up all of its buffer space then you run the risk of getting a WSAENOBUFS error from your write calls and putting the system in a state where other processes will also be suffering from the same problem (it’s a system wide limit)…

So, now that that’s clear, it should be fairly obvious that if you want to use TCP’s buffering and receive window to implement your flow control, for example if you are streaming data to a client, then you can use OnWriteCompleted() to implement this. The trick is in controlling the number of outstanding writes that you allow (to avoid locking too many pages). What you should do is keep track of how many writes you have outstanding (and you should do this yourself rather than trying to hijack the framework’s own counter for this!) and pause your data flow on the connection when you have reached a preconfigured number of outstanding writes. Once you’re at this point you can use the completion of your outstanding writes to drive the sending of more data. If you need to fine tune this then you can manipulate the connection’s TCP send buffer size (the amount of buffering that the TCP stack will do for you) by calling JetByteTools::Win32::IStreamSocket::SetSendBufferSize().
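The counting approach described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not framework code: WriteThrottle and its members are hypothetical names, the issueWrite callback stands in for whatever call actually issues the write on the connection, and the sketch assumes that callback never calls back into the throttle synchronously (it is invoked while the lock is held so that data goes out in the order it was queued).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper, not part of The Server Framework. Caps the number of
// writes in flight on one connection and queues anything beyond the cap.
class WriteThrottle
{
public:
    using IssueWrite = std::function<void(const std::string &)>;

    WriteThrottle(std::size_t maxOutstanding, IssueWrite issueWrite)
        : m_maxOutstanding(maxOutstanding), m_issueWrite(std::move(issueWrite)) {}

    // Call this instead of writing to the connection directly.
    void Send(std::string data)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_outstanding >= m_maxOutstanding) {
            m_pending.push_back(std::move(data));   // at the cap: hold it back
            return;
        }
        ++m_outstanding;
        m_issueWrite(data);   // assumed not to re-enter this object
    }

    // Call this from your OnWriteCompleted() handler.
    void OnWriteCompleted()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        assert(m_outstanding > 0);
        if (m_pending.empty()) {
            --m_outstanding;   // nothing waiting: just free the slot
            return;
        }
        // Hand the freed slot straight to the oldest queued write, so the
        // outstanding count stays the same.
        std::string next = std::move(m_pending.front());
        m_pending.pop_front();
        m_issueWrite(next);
    }

    std::size_t Outstanding() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_outstanding;
    }

    std::size_t Pending() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_pending.size();
    }

private:
    mutable std::mutex m_mutex;
    const std::size_t m_maxOutstanding;
    IssueWrite m_issueWrite;
    std::size_t m_outstanding = 0;
    std::deque<std::string> m_pending;
};
```

With a cap of 2, three calls to Send() issue the first two writes and queue the third; the first completion then issues the queued write. A priority queue like the one the client wanted could replace the simple deque here, since the throttle only cares about when a slot becomes free.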

So, in summary, regulating your data flow using TCP’s own windowing flow control is a perfectly valid method, and the way you do this using The Server Framework is to put your code in OnWriteCompleted()…

Now, to find and fix the problem that the client seems to be experiencing…