WebSockets is a stream, not a message based protocol...

As I mentioned here, the WebSockets protocol is, at this point, a bit of a mess due to the evolution of the protocol and the fact that it's being pulled in various directions by various interested parties. I'm just ranting about some of the things that I find annoying...

The first thing to realise about the WebSockets protocol is that it isn't really message-based at all, despite what the RFC claims.

   Clients and servers, after a successful handshake, transfer data back
   and forth in conceptual units referred to in this specification as
   "messages".  A message is a complete unit of data at an application
   level, with the expectation that many or most applications
   implementing this protocol (such as web user agents) provide APIs in
   terms of sending and receiving messages.  The WebSocket message does
   not necessarily correspond to a particular network layer framing, as
   a fragmented message may be coalesced, or vice versa, e.g. by an
   intermediary.
and...

   The WebSocket protocol uses this framing so that specifications that
   use the WebSocket protocol can expose such connections using an
   event-based mechanism instead of requiring users of those
   specifications to implement buffering and piecing together of
   messages manually.
These passages suggest that a message-based, event-driven design which presents complete messages to the application layer would be sensible. Unfortunately, once you realise exactly how a message is made up, it becomes impossible to provide an interface which ONLY delivers messages as complete units to the application layer.
WebSocket messages consist of one or more frames. A frame can either carry a complete message or just a fragment of one. Messages themselves have no length indication built into the protocol; only frames do. A frame can have a payload of up to 9,223,372,036,854,775,807 bytes (the protocol allows for a 63-bit length indicator) and finally...

   The primary purpose of fragmentation is to allow sending a message
   that is of unknown size when the message is started without having to
   buffer that message.  If messages couldn't be fragmented, then an
   endpoint would have to buffer the entire message so its length could
   be counted before first byte is sent.  With fragmentation, a server
   or intermediary may choose a reasonable size buffer, and when the
   buffer is full write a fragment to the network.
So a single WebSocket "message" can consist of an unlimited number of 9,223,372,036,854,775,807 byte fragments. This makes it impossible for a general purpose WebSocket protocol parser to present only complete messages to the application layer without the application needing to do some form of "buffering and piecing together of messages manually". At best, a general purpose parser could present WebSocket data as a 'sequence of streams', given that each "message" is in fact simply a potentially infinite stream of bytes with a message terminator (the FIN bit in the frame header) at the end. It could do this by passing the application layer an interface that allowed the application to pull data from the WebSocket "message" until it was complete, and that's less than ideal if you're used to working with asynchronous, push APIs, or trying to avoid unnecessary memory copies...
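
To make that concrete, here's roughly what the framing gives a parser to work with. This is a minimal sketch, assuming nothing but the frame layout described in the RFC; the type and function names are invented for illustration. Note that the only length you ever learn is the current frame's payload length; nothing tells you how big the "message" will be.

   #include <cstddef>
   #include <cstdint>

   // Illustrative only; the type and function names are invented.
   struct FrameHeader
   {
      bool fin;                  // set on the final fragment of a message
      uint8_t opcode;            // 0x0 = continuation, 0x1 = text, 0x2 = binary, ...
      bool masked;               // client to server frames must be masked
      uint64_t payloadLength;    // THIS frame's length; says nothing about the message
      size_t headerLength;       // bytes of input consumed by the header
   };

   // Returns false if more data is needed before the header can be decoded.
   bool ParseFrameHeader(const uint8_t *pData, size_t length, FrameHeader &header)
   {
      if (length < 2) return false;

      header.fin = (pData[0] & 0x80) != 0;
      header.opcode = static_cast<uint8_t>(pData[0] & 0x0F);
      header.masked = (pData[1] & 0x80) != 0;

      uint64_t payloadLength = pData[1] & 0x7F;
      size_t offset = 2;

      if (payloadLength == 126)        // 16-bit extended length follows
      {
         if (length < offset + 2) return false;
         payloadLength = (static_cast<uint64_t>(pData[2]) << 8) | pData[3];
         offset += 2;
      }
      else if (payloadLength == 127)   // 64-bit extended length follows (63 usable bits)
      {
         if (length < offset + 8) return false;
         payloadLength = 0;
         for (size_t i = 0; i != 8; ++i)
         {
            payloadLength = (payloadLength << 8) | pData[offset + i];
         }
         offset += 8;
      }

      if (header.masked)               // a 4 byte masking key follows the length
      {
         if (length < offset + 4) return false;
         offset += 4;
      }

      header.payloadLength = payloadLength;
      header.headerLength = offset;

      return true;
   }

Everything above the framing, that is, accumulating these payloads into a "message", is left to whoever calls this.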

Even if the maximum frame size were reduced, as some propose, the problem would still be present because a single message can consist of an unlimited number of fragments. Likewise, a protocol parser cannot take the easy route and simply disallow fragmented frames, since the RFC states that...

   o  Clients and servers MUST support receiving both fragmented and
      unfragmented messages.

   o  An intermediary MUST NOT change the fragmentation of a message if
      any reserved bit values are used and the meaning of these values
      is not known to the intermediary.

   o  An intermediary MUST NOT change the fragmentation of any message
      in the context of a connection where extensions have been
      negotiated and the intermediary is not aware of the semantics of
      the negotiated extensions. 
Which means that although the application you've written may send and receive WebSocket messages of an application-restricted maximum size, you may still find that you receive fragments because an intermediary has decided to fragment your frames. Unless, of course, you subvert the protocol's extension functionality by negotiating an "x-{My own private GUID}" extension between your client and server, which would neatly prevent any intermediaries (except ones that you'd written yourself) from changing the fragmentation of your frames... Then, of course, an intermediary may simply decide to remove the client's request for your unknown extension from the initial handshake to prevent it being negotiated. Or, perhaps more likely, close your connection with a 1004 close code...
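
For example, the client's opening handshake might carry such a private extension like this; the GUID is made up and the handshake is abbreviated:

   GET /chat HTTP/1.1
   Host: server.example.com
   Upgrade: websocket
   Connection: Upgrade
   Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
   Sec-WebSocket-Version: 13
   Sec-WebSocket-Extensions: x-3f2a9c1e-7b4d-4e8a-9f01-2c5d6e7f8a9b

An intermediary that honours the rules quoted above can no longer change the fragmentation of your frames, since it can't know what your extension means; one that strips or rejects the unknown extension is exactly the failure mode described above.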

There's resistance to proposals to allow the maximum frame size to be negotiated during the handshake phase, and yet there's a standard close code for "frame too big"... Should an application just guess how large it's allowed to go?

The view of some on the discussion list seems to be that "A server (or client) which exposes the frames as its primary API is doing it wrong.", but it seems to me that to write a flexible, general purpose protocol parser which can be used by both push and pull APIs you have no option but to expose details of the message framing. The reason for this is that a general purpose parser cannot buffer complete messages, or even complete frames, and so must deliver the data either as a stream of bytes at the application level or as a sequence of partial frames, letting the application decide how to accumulate the frames into messages.

By hiding all of the framing from the application, you prevent the application developer from taking advantage of the knowledge that he has of the structure of his particular messages. This is especially useful with asynchronous APIs where the application might be pushed buffers of data as the data arrives - which is how most of The Server Framework happens to operate and how I/O Completion Port centric designs tend to work. If an application works in terms of, say, messages that can be at most 4096 bytes, and is dealing with buffers that can contain complete messages, then it could use the details of the data framing to efficiently accumulate the data into a single buffer and dispatch the complete message for processing when it receives the final frame. The alternative is to add complexity to the protocol parser by allowing it to accumulate 'messages' up to a configurable size and present complete messages via one callback and incomplete messages via another, or to provide only a stream based pull API which requires the application to needlessly copy data from the protocol parser's buffers into its own.
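
A sketch of that frame-by-frame accumulation, assuming a push-style parser which hands the application each frame's decoded payload along with a 'final fragment' flag; the names here are invented for illustration and this isn't The Server Framework's actual API:

   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <stdexcept>

   class MessageAccumulator
   {
      public:

         static const size_t MaxMessageSize = 4096;    // application imposed limit

         // Called by the parser with the decoded payload of (part of) a frame;
         // isFinalFragment is true when this data completes a message, that is,
         // it is the end of a frame that had the FIN bit set.
         void OnFrameData(const uint8_t *pData, size_t length, bool isFinalFragment)
         {
            if (m_used + length > MaxMessageSize)
            {
               // The peer, or an intermediary, sent more than we allow; nothing
               // in the protocol let us tell it our limit in advance.
               throw std::length_error("message too big");
            }

            memcpy(m_buffer + m_used, pData, length);
            m_used += length;

            if (isFinalFragment)
            {
               OnCompleteMessage(m_buffer, m_used);     // dispatch the whole message

               m_used = 0;
            }
         }

      private:

         void OnCompleteMessage(const uint8_t *pData, size_t length)
         {
            // process the complete message...
            (void)pData; (void)length;
         }

         uint8_t m_buffer[MaxMessageSize];
         size_t m_used = 0;
   };

No data is copied out of the parser's buffers until the framing has already been stripped, and the complete message is dispatched from a single contiguous buffer.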

The 63-bit fragment size, and fragmentation in general, appear to come from a requirement for streaming data from one end of the connection to the other, see here, where a Unixy design idea of simply telling the application to "read x amount of data from this file handle" seems perfectly sensible... Of course this design fails as soon as you need to send a stream that's bigger than the 63-bit frame size allows, and it also fails if the frame is fragmented, as then the API becomes "read x amount of data from this file handle, but you're not done yet, wait and I'll call you again with more"... At which point, given the possibility of intermediaries fragmenting your large frame down to, let's be generous, 1024 byte fragments anyway, you may as well simply limit the maximum size of frames to something more manageable... But I suppose "nobody will ever need more than 63 bits of data length"...

Unfortunately, large frame sizes also open the protocol up to lazy application design. Let's say we're sending a file: we open the file, read some of it, send a single frame header that declares the total file length and then simply start sending data. Cool, we don't need to worry about the protocol any more, no need to build messages, just pull the data from disk and send it to the other side. This works fine until you have a read failure, or any other reason to terminate the connection. Since you're in the middle of sending a single huge frame you can't send an application level frame that informs the other side of the problem. You can't even send a WebSocket close frame to shut down the connection cleanly; all you can do is abort the connection...
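
Compare that with a design that fragments the file as it goes; the connection and file interfaces below are invented for the sketch, but the essential point stands: between fragments the framing is in a consistent state, so a failure can still be reported cleanly.

   #include <cstddef>
   #include <cstdint>

   // Invented interfaces; just enough for the sketch.

   enum class Opcode { Continuation = 0x0, Text = 0x1, Binary = 0x2 };

   struct WebSocketConnection
   {
      void SendFrame(const uint8_t *pData, size_t length, Opcode opcode, bool fin);
      void SendClose(int closeCode);
   };

   struct FileReader
   {
      bool Read(uint8_t *pBuffer, size_t bufferSize, size_t &bytesRead);
      bool AtEnd() const;
   };

   void SendFile(WebSocketConnection &connection, FileReader &file)
   {
      uint8_t buffer[4096];

      bool firstFragment = true;

      for (;;)
      {
         size_t bytesRead = 0;

         if (!file.Read(buffer, sizeof(buffer), bytesRead))
         {
            // We're between frames, so we can still report the problem and
            // close cleanly; the close code used here is illustrative.
            connection.SendClose(1011);
            return;
         }

         const bool finalFragment = file.AtEnd();

         // The first fragment carries the real opcode, the rest are
         // continuation frames; FIN is only set on the last one.
         connection.SendFrame(
            buffer,
            bytesRead,
            firstFragment ? Opcode::Binary : Opcode::Continuation,
            finalFragment);

         firstFragment = false;

         if (finalFragment)
         {
            break;
         }
      }
   }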

So, WebSockets presents a sequence of potentially infinite byte streams, each with a termination indicator (the FIN bit in the frame header), and not a message-based interface as you might initially believe. Given that a general purpose protocol handler can only work in terms of partial frames, we effectively have a stream-based protocol with lots of added complexity to provide the illusion of a message-based protocol; one that can actually only ever be dealt with as a stream of bytes.

8 Comments

I spent some time thinking about this before posting, and you're wrong.

Who cares what the RFC says is a MUST? If someone starts sending my server a sequence of 1 byte fragments of a message, I'm going to drop them, RFC be damned.

It's a perverse reading of the RFC to say that you can't drop obviously malicious data.

If I have a chat protocol, I'll establish a maximum message size, and drop any messages that exceed that size, regardless of the fragmentation. Maximum message size can be established out-of-band, just as the protocol for the contents are established out-of-band.

I maintain a websockets server implementation, and it has the following configurables:

1) maximum message size, 2) maximum fragments per message, and 3) minimum transfer rate for a message.

These are all necessary to prevent DDoS attacks and testing shows that no real-world clients have a problem with these.
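
For illustration, such limits might look something like this; the names and values are made up:

   #include <cstddef>

   // Illustrative limits for a WebSocket server; names and values are made up.
   struct WebSocketLimits
   {
      size_t maxMessageSize;           // cap on the reassembled message size
      size_t maxFragmentsPerMessage;   // cap on the number of fragments used
      size_t minBytesPerSecond;        // drop connections that dribble data
   };

   static const WebSocketLimits defaultLimits = { 1024 * 1024, 1024, 100 };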

You've very neatly pointed out part of the problem with the way the RFC is worded; everyone will have their own idea how best to fix it. This will likely lead to interoperability issues and confusion in the long term.

In practice I expect that your 'fixes' and the RFC's wording won't cause that many issues as I would expect that initially most client code will be produced by the same people producing the server code; and by this I mean the code that uses the WebSocket layer rather than code that implements it. As such the client developer knows the limitations and foibles of their server side implementation.

Your maximum fragments per message setting could cause you problems if an intermediary that changes fragmentation ever gets in the data path between your clients and servers; but I guess you can tune that when you find it an issue.

"Your maximum fragments per message setting could cause you problems if an intermediary that changes fragmentation ever gets in the data path between your clients and servers; but I guess you can tune that when you find it an issue."

Exactly, it's tunable. If small maximums were mentioned in the RFC, then it wouldn't be tunable. Also, I currently keep it set fairly high; if an intermediary wants to split a message up into single-byte fragments, then I consider it a hostile intermediary, not something I need to interop with.

Plenty of HTTP servers implement workarounds, absent from the HTTP RFC, that would prevent an RFC-compliant but not well-behaved HTTP client from successfully connecting.

The dirty truth of any networking protocol is that servers (clients) will interop with the clients (servers) they are tested with and few others, regardless of what the RFC describes.

If you are using the browser, a frame can be at most 2^53 = 9,007,199,254,740,992 bytes due to the limit on exactly representable integer values in JavaScript [1]. That's a factor of about 1024 smaller. Okay, it doesn't make a big difference in practice.

[1] http://stackoverflow.com/a/7179733/605890
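
The limit falls out of JavaScript numbers being IEEE 754 doubles, which above 2^53 can no longer represent every integer exactly. A quick illustration (in C++, where doubles behave the same way):

   #include <cstdio>

   int main()
   {
      const double limit = 9007199254740992.0;   // 2^53

      printf("%.0f\n", limit);       // prints 9007199254740992
      printf("%.0f\n", limit + 1);   // also prints 9007199254740992;
                                     // 2^53 + 1 is not representable

      return 0;
   }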

I came across your post after googling for "websocket message fragmentation". The reason for this was that I am implementing a small RPC style library using Google protocol buffers in C#. Looking at the ClientWebSocket API in .NET I saw that the WebSocketReceiveResult class had an "EndOfMessage" property and I just thought "What the h... is this?".

I found your post quite enlightening, and your conclusions unavoidable.

For me the focus should be on the message aspect as you can always implement an "infinite stream of messages" on top of that in the application layer when needed. A maximum message size for a given connection should thus be part of the protocol.

Well, this won't happen, I guess.

How strange that such obvious specification errors were not caught early on.

I think the problem was that the specification was very much a design by committee thing. There were lots of different people pulling in quite different directions and there was a distinct lack of requirements traceability.

Add to this the desire to achieve consensus and move on rather than "discuss" points for ever and it was pretty inevitable that the result would be a bit of a mess.

I'm pretty happy with the API that I eventually came up with to deal with this; in most situations users of my framework can work in terms of complete messages but if they need messages that are too big to buffer, or have messages that can be processed in pieces as they arrive then they can work that way as well.
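
Purely for illustration, one possible shape for that kind of dual API (not the framework's actual interface) might be:

   #include <cstddef>
   #include <cstdint>

   // Invented names, for illustration only. Messages that fit within a
   // configured limit arrive complete; anything bigger arrives in pieces.
   class IWebSocketMessageHandler
   {
      public:

         // Called when a complete message fitted within the configured limit.
         virtual void OnMessage(const uint8_t *pData, size_t length) = 0;

         // Called repeatedly for messages that are too big to buffer;
         // isFinal is true for the last piece of a message.
         virtual void OnMessageData(const uint8_t *pData, size_t length, bool isFinal) = 0;

      protected:

         ~IWebSocketMessageHandler() {}
   };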

I think the WebSocket protocol specifies that you should implement such a message limit in your client app:

10.4. Implementation-Specific Limits


Implementations which have implementation- and/or platform-specific
limitations regarding the frame size or total message size after
reassembly from multiple frames MUST protect themselves against
exceeding those limits. (For example, a malicious endpoint can try
to exhaust its peer's memory or mount a denial of service attack by
sending either a single big frame (e.g. of size 2**60), or by sending
a long stream of small frames which are a part of a fragmented
message.) Such an implementation SHOULD impose limit on frame sizes
and the total message size after reassembly from multiple frames.

Yes, it does now; though such a suggestion is pretty redundant really as of course implementations should do that. This was only added after the draft that I was talking about in this blog posting.

It's just a pity that there's no way to communicate those limits to peers using the protocol itself.
