WinRM/WinRS job memory limits?

2010-11-04

I’ve had one of those days. In fact I’ve had one of those days for a couple of days this week…

It started when I decided to improve the profiling that I was doing on a new memory allocator for The Server Framework by extending the perfmon and WinRS based distributed server testing that I wrote about a while back. This allows me to create a set of perfmon logs, start the server that I want to collect data about and then start a remote client to stress that server. Some simple batch files make it extensible and repeatable, and once I get a handle on manipulating the output data automatically it’ll be a really useful system.

The allocator that I’ve been working on is a pooling, per thread, allocator that uses thread local storage to avoid locking and allows other threads to steal from your pools or give to your pools to deal with the fact that sometimes one thread does all the allocation and others do all the deallocation. It works pretty well but could do with some tuning and so I wanted to put it through its paces in a real server with repeatable testing.
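As a rough illustration of the idea (this is not the actual Server Framework allocator), a per-thread pool can be sketched in C++ along these lines. It assumes fixed-size blocks, and it swaps direct stealing between threads for a simpler locked shared pool that takes surplus blocks and feeds threads that have run dry. The names poolAllocate, poolRelease, kBlockSize and kMaxLocal are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical sizes, chosen for the example only.
const std::size_t kBlockSize = 4096; // every block is the same size
const std::size_t kMaxLocal = 4;     // blocks a thread keeps before sharing

// Locked pool that takes surplus blocks and feeds threads that have run dry.
struct SharedPool {
    std::mutex lock;
    std::vector<void*> blocks;
    ~SharedPool() { for (void* b : blocks) std::free(b); }
};

SharedPool& sharedPool() {
    static SharedPool pool;
    return pool;
}

// Per-thread cache; only the owning thread touches it, so no lock is needed.
struct LocalCache {
    std::vector<void*> blocks;
    ~LocalCache() {
        if (blocks.empty()) return;
        // Thread exit: hand cached blocks to the shared pool for other threads.
        SharedPool& shared = sharedPool();
        std::lock_guard<std::mutex> guard(shared.lock);
        shared.blocks.insert(shared.blocks.end(), blocks.begin(), blocks.end());
    }
};

LocalCache& localCache() {
    thread_local LocalCache cache;
    return cache;
}

void* poolAllocate() {
    std::vector<void*>& mine = localCache().blocks;
    if (!mine.empty()) {              // fast path: no locking at all
        void* block = mine.back();
        mine.pop_back();
        return block;
    }
    {
        SharedPool& shared = sharedPool();
        std::lock_guard<std::mutex> guard(shared.lock);
        if (!shared.blocks.empty()) { // take a block another thread gave up
            void* block = shared.blocks.back();
            shared.blocks.pop_back();
            return block;
        }
    }
    return std::malloc(kBlockSize);   // both pools empty: go to the heap
}

void poolRelease(void* block) {
    assert(block != nullptr);
    std::vector<void*>& mine = localCache().blocks;
    if (mine.size() < kMaxLocal) {    // fast path: keep it for this thread
        mine.push_back(block);
        return;
    }
    // This thread frees more than it allocates: give the surplus away.
    SharedPool& shared = sharedPool();
    std::lock_guard<std::mutex> guard(shared.lock);
    shared.blocks.push_back(block);
}
```

The shared pool is what copes with the lopsided case: a thread that only deallocates fills its small local cache and then pushes everything else to the shared pool, where the thread that only allocates can pick it up again.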

I decided to use one of the OpenSSL server examples as my test server since it did lots of stuff and put the allocator through its paces with a number of threads. The first problem was that I discovered a memory leak in the OpenSSL code. This was located and fixed pretty quickly using the buffer reference tracking debugging code that I built into The Server Framework a while ago to help a customer. A couple of new #define values in Config.h and each of the leaked buffers produced a set of call stacks around each reference count change. The problem was introduced when I redesigned the filtering code for 6.2. It looks like none of the customers that are using this code have put the 6.2 update into production yet, as I’ve had no bug reports about it.

The memory leak cost me half a day of profiling as it took me a while to realise that the reason the allocators were not performing as I’d expect was that many buffers were never being released back to the pool. Of course I initially blamed the new allocator for the problem… Once that was sorted I realised that I wasn’t loading the server enough to stress the allocators and so I set about investigating why I was only getting a 100Mb link from a 1Gb network card; a loose cable, would you believe!

Finally I adjusted my remote client to create 6000 connections sending data, which should have used around 25% of the 1Gb bandwidth and which, I felt, was a good starting point for the stress test. The test ran fine when started manually on the client machine and failed randomly when run remotely from WinRS. At first I thought it was network related, which seemed unlikely, but the test client was failing in strange ways and seemed to hang. After debugging the client I discovered that OpenSSL was unable to allocate memory during connection establishment. This caused my wrapper code to throw an exception, which left the internals of the client in a bit of a mess and caused the client to hang during shutdown - it was waiting for sockets that would never be released because the exception had left some references unaccounted for.
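This kind of bug is easy to show in miniature. The sketch below is not the client's real code: Socket, SocketRef, establish and tryEstablish are hypothetical stand-ins, and the exception type is an assumption. It shows the general shape of the problem and one common cure, which is to tie each reference to a scope so that an exception thrown part way through connection establishment still gives the reference back.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a reference-counted socket.
struct Socket {
    int refs = 1; // the owner's reference
    void addRef() { ++refs; }
    void release() { assert(refs > 0); --refs; }
};

// Takes a reference on construction and gives it back on destruction,
// whether the scope is left normally or by an exception.
class SocketRef {
public:
    explicit SocketRef(Socket& socket) : socket_(socket) { socket_.addRef(); }
    ~SocketRef() { socket_.release(); }
    SocketRef(const SocketRef&) = delete;
    SocketRef& operator=(const SocketRef&) = delete;
private:
    Socket& socket_;
};

// Hypothetical connection step. The flag simulates an allocation failure
// inside the SSL layer; std::runtime_error is an assumed exception type.
void establish(Socket& socket, bool allocationFails) {
    SocketRef ref(socket); // reference held for the duration of the call
    if (allocationFails) {
        throw std::runtime_error("allocation failed during handshake");
    }
    // ... the rest of connection establishment would go here ...
}

// Returns true if the connection was established, false if it failed.
bool tryEstablish(Socket& socket, bool allocationFails) {
    try {
        establish(socket, allocationFails);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

With a manual addRef() and release() pair in establish() the throw would skip the release() and leave the count one too high, which is exactly the sort of thing that leaves a shutdown waiting for a socket that never goes away.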

Once that problem was fixed (and that fix will find its way into 6.3.1 as well) I could finally run the client from WinRS and watch it fail cleanly when I stepped the number of connections from 2500 to 3000. The program was suffering from memory allocation failures - which it now dealt with correctly and shut down once the connections that had been established completed…

I did some googling and found nothing about any limits that WinRM/WinRS might impose on the processes that they run. I asked on ServerFault, but as yet have had no replies. I stopped and played with my son and ate and watched TV…

Job objects, I thought… WinRM must be running the processes that it launches under a job object (as any sensible process-launching process should). Job objects can be set to limit the amount of memory that a process can commit. This would give the behaviour that I was seeing.
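One way to test this theory from inside the client would be to ask Windows whether the process is in a job and, if so, whether that job carries a per-process memory limit. The following is a minimal sketch rather than something I have run under WinRS: JobReport and queryOwnJob are hypothetical names, and the non-Windows branch is only there so that the file compiles elsewhere. The Windows calls used are IsProcessInJob and QueryInformationJobObject, the latter with a null job handle so that it queries the job of the calling process.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical result type for the example.
struct JobReport {
    bool inJob = false;                 // process is running inside a job
    bool hasProcessMemoryLimit = false; // job limits per-process commit
    std::size_t processMemoryLimit = 0; // bytes; valid only if the flag is set
};

JobReport queryOwnJob() {
    JobReport report;
#ifdef _WIN32
    BOOL inJob = FALSE;
    // A null job handle asks "is this process in any job at all?"
    if (!IsProcessInJob(GetCurrentProcess(), nullptr, &inJob) || !inJob) {
        return report;
    }
    report.inJob = true;

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {};
    // A null job handle here means "the job of the calling process".
    if (QueryInformationJobObject(nullptr, JobObjectExtendedLimitInformation,
                                  &info, sizeof(info), nullptr) &&
        (info.BasicLimitInformation.LimitFlags &
         JOB_OBJECT_LIMIT_PROCESS_MEMORY)) {
        report.hasProcessMemoryLimit = true;
        report.processMemoryLimit = info.ProcessMemoryLimit;
    }
#endif
    // On other platforms there are no job objects, so the report stays empty.
    assert(report.inJob || !report.hasProcessMemoryLimit);
    return report;
}
```

Logging the result at client start-up, once when run locally and once when run through WinRS, would show whether the two environments really differ in the way that I suspect.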

I haven’t continued my investigation yet as I’ve shut down the machines in question for the night. I expect that Process Explorer will show me that WinRM is using jobs. I expect a simple “allocate all memory” program will show me that when run using WinRS it fails at a number far, far smaller than when run locally on the machine. If I’m lucky I’ll find some documentation on how to change the limit in WinRM. If I’m not, I guess I need to write a replacement.
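The “allocate all memory” program could be as small as the sketch below. It is an assumption about how such a probe might look, not a tool that I have already written; probeCommitLimit is a made-up name and the cap parameter is only there so that the probe can be stopped short of exhausting a machine with no limit in place. Each chunk is written to so that the memory is really committed, and the chunks are chained through their own first bytes so that the bookkeeping needs no extra allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Allocates chunkBytes at a time until malloc fails or capBytes is reached,
// then frees everything and returns the number of bytes it managed to get.
std::size_t probeCommitLimit(std::size_t chunkBytes, std::size_t capBytes) {
    struct Chunk { Chunk* next; };
    assert(chunkBytes >= sizeof(Chunk));

    Chunk* head = nullptr;
    std::size_t total = 0;
    while (capBytes - total >= chunkBytes) {
        void* memory = std::malloc(chunkBytes);
        if (memory == nullptr) {
            break; // this is the limit we are looking for
        }
        // Touch every byte so the pages are backed by real memory.
        std::memset(memory, 0xAB, chunkBytes);
        // Chain the chunk through its own storage; no side allocations.
        Chunk* chunk = static_cast<Chunk*>(memory);
        chunk->next = head;
        head = chunk;
        total += chunkBytes;
    }

    while (head != nullptr) {
        Chunk* next = head->next;
        std::free(head);
        head = next;
    }
    return total;
}
```

Calling something like probeCommitLimit(1 << 20, cap) with a generous cap, printing the result, and running the program once locally and once through WinRS should make any difference in the limit obvious.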

Still, the good news is that I’ve found and fixed a couple of bugs before my customers had problems with them. I’ve also decided that it would be quite handy to have a simple process launcher that could run a given process under a job with a specific memory limit, so that I could check out how things behave in low memory situations. Finally, I could write another simple process launcher that uses jobs to monitor the maximum memory allocated during runs of my black box server tests; this would help me catch future memory leaks during the testing that happens as part of my release process.