Strangely fatal UDP issue on Windows...

UPDATED: 23 August 2021 see here

One of my clients runs game servers on the cloud. They have an awful lot of them and have run them for a long time. Every so often they have problems with DDOS attacks on their servers. They have upstream DDOS protection from their hosting providers, but it takes a while to recognise the attacks and so there’s usually a period when the servers are vulnerable. Recently we’ve seen a DDOS on UDP that caused them problems. Initially we thought that a recent code release had made the servers less stable; it had, but only in the sense that some tracing code had made the UDP recv path fractionally less efficient and this had exposed a problem that had been in the server all along. After spending a while in triage with the updated code we narrowed it down to two small code changes and removed them. Since the removal of these changes made no sense in terms of the problem we were seeing, I suggested adding a small, harmless delay in the UDP recv path. Immediately the problem was back.

This morning I managed to reduce the problem to some code that didn’t include their server at all. The simplest synchronous UDP receiver and a simple UDP traffic generator, running on a machine using either a real IP or localhost, can demonstrate the problem. If the receiver isn’t running and the load generator is pushing datagrams into a black hole then there’s no problem. If the receiver is running fast enough then there’s no problem. If the receiver has a delay in the receive loop then non-paged pool memory starts to grow at a rather fantastic rate.
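For reference, a minimal sketch of the kind of delayed receiver I mean is below. It isn’t the actual test code; the port number, buffer size and the length of the Sleep() are arbitrary values chosen for illustration, but a plain blocking recvfrom() loop with a short pause in it is all that’s needed.

```cpp
// A minimal, blocking, UDP receiver with a deliberate delay between recv
// calls. Port 5050, the buffer size and the 1ms Sleep() are arbitrary
// values chosen for this sketch.
#include <winsock2.h>
#include <ws2tcpip.h>
#include <windows.h>

#pragma comment(lib, "ws2_32.lib")

int main()
{
   WSADATA wsaData;

   if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
   {
      return 1;
   }

   SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

   sockaddr_in addr {};

   addr.sin_family = AF_INET;
   addr.sin_port = htons(5050);
   addr.sin_addr.s_addr = htonl(INADDR_ANY);

   if (bind(s, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == SOCKET_ERROR)
   {
      return 1;
   }

   char buffer[2048];

   for (;;)
   {
      sockaddr_in from {};

      int fromLen = sizeof(from);

      if (recvfrom(s, buffer, sizeof(buffer), 0, reinterpret_cast<sockaddr *>(&from), &fromLen) == SOCKET_ERROR)
      {
         break;
      }

      // The "small, harmless, delay" - just enough to let the sender
      // get ahead of the receiver.
      Sleep(1);
   }

   closesocket(s);

   WSACleanup();

   return 0;
}
```

The traffic generator is nothing clever either; a tight sendto() loop pushing datagrams at the receiver’s port is enough to make non-paged pool climb while the receiver is sleeping.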

More worrying is that if you bind a UDP socket to a port, do NOT issue a recv call, and a sender sends a stream of datagrams to that port, then non-paged pool usage grows. Once you start to issue recv calls the non-paged pool usage goes down as you chew through the queued datagrams, and as long as you can read the datagrams faster than the sender can send them there’s no problem… I really should run a test to destruction with this, but there’s no sign of it slowing down and I really don’t want to make my development box blue screen this morning.
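Again, a sketch rather than the real test code: bind a socket, never read from it, and blast datagrams at it from a tight send loop in the same process. The loopback address, port and payload size below are arbitrary values for illustration.

```cpp
// A minimal version of the "bind but never recv" scenario: one socket bound
// to a port that is never read from, and a tight sendto() loop pushing
// datagrams at it. Loopback, port 5050 and the 512 byte payload are
// arbitrary sketch values.
#include <winsock2.h>
#include <ws2tcpip.h>

#pragma comment(lib, "ws2_32.lib")

int main()
{
   WSADATA wsaData;

   if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
   {
      return 1;
   }

   sockaddr_in addr {};

   addr.sin_family = AF_INET;
   addr.sin_port = htons(5050);
   addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

   // The 'victim' socket; bound to the port but never read from.
   SOCKET victim = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

   if (bind(victim, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == SOCKET_ERROR)
   {
      return 1;
   }

   // The sender; nothing ever drains the victim's receive queue, so the
   // queued datagrams just pile up.
   SOCKET sender = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

   char payload[512] = {};

   for (;;)
   {
      sendto(sender, payload, sizeof(payload), 0, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
   }

   return 0;
}
```

Watching non-paged pool in Task Manager or perfmon while this runs shows the growth; start issuing recvfrom() calls on the bound socket and the usage drops back as the queued datagrams are consumed.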

I expect that this is some “new-fangled nonsense” in the Windows network stack. Probably a service that’s doing “something useful” and doesn’t throw datagrams away when its source is providing them faster than its sink can consume them. This seems to be a common problem; there’s a real reluctance to throw network data away when the consumer can’t keep up. I’ve seen it before with network drivers that consume massive amounts of non-paged pool when they fail to process outbound offloaded CRC calculations fast enough, and with network drivers that have ‘flow control’ options enabled, which means they try to slow the data flow down when the peer can’t keep up with them. At some point a decision needs to be made to throw data away, but nobody ever seems to want to be the one to make that decision. Instead non-paged pool becomes exhausted and boxes blue screen. At least with these NIC-related issues you have a chance to exercise some control in your sending code: you can generally spot that the NIC is slowing down, even if you’re using asynchronous sends, and you can decide to reduce the amount you’re sending. I’m pretty sure that, for once, it’s not network drivers this time; the problem is clearly demonstrated using localhost with no network needing to be present. However, network driver issues are usually easy to work around: they tend to be send related, where we can make code changes, or the driver is a component that’s easy to swap out for another brand…

So, for now, we have a cunning plan to fix the DDOS, a potential support call to Microsoft, and some long-term grunt work disabling various network-related services to see if we can work out where the problem is. Hopefully it’s something obvious that I’m missing, or just me being an idiot.