A bug has been discovered in Release 6.7 in the code that deals with TCP socket writes that involve more than a single buffer. These 'multi-buffer writes' are writes that involves either a buffer chain or a block of data passed as a pointer and a length where the length exceeds the size of the buffer allocator that the connection is using.
The bug prevents the 'multi-buffer write' from being executed as a single atomic write at the network layer and so can cause corruption of a TCP data stream if multiple sockets are writing to the same connection concurrently.
The bug is due to the removal in 6.7 of the code required to support Windows XP. In Windows XP we needed to include sequence numbers in write operations to allow for the way we always marshalled all I/O operations from the calling thread to an I/O thread to prevent I/O cancellation due to thread termination. This write sequencing code had the side effect of also protecting 'multi-buffer writes' from being interrupted by other writes.
The fix does not require the reintroduction of write sequencing but, instead, issues a single scatter/gather style write for the entire buffer chain. This is both efficient and correct.
A related bug also affects atomicity of 'multi-buffer writes' into filter layers, such as the SSL code. Similar fixes have been applied here.
The bug is fixed in Release 6.8 which will be released later today.
I've been trying various static analysis tools on the C++ code of The Server Framework. So far I'm using Resharper C++ and the Gimpel PC-Lint Plus Beta on a regular basis and I've now added CppDepend into the loop.
Full disclosure. I have been given a CppDepend license.
As I've said before, whilst CppDepend is easy to get hold of, easy to install and "just works" I don't find it that useful. I can certainly remember large enterprise clients where this kind of tool would be invaluable for management level analysis of large codebases but for a small development team of competent people it's less immediately useful. That said, I've found several warnings that it produces to be helpful and so I've been running it alongside the other tools as it fills some gaps.
Since I'm in the process of dropping support for several old compilers I can finally begin to move the codebase forward to slightly more modern C++. All of the tools help with this and CppDepend has some 'modernise C++' checks that I'm finding useful.
I like the fact that I can run CppDepend as a stand alone GUI. I prefer this method to using fully integrated Visual Studio extensions.
I like the idea of the regression reports but haven't actually set up a baseline report and run them...
I'm not sure that I would purchase a license for this tool but I know clients that could benefit from using it.
Previously on "Practical Testing"... Having just resurrected this series of blog posts about testing a non-trivial piece of real-world C++ code I've fixed a few bugs and done a bit of refactoring. There's one more step required to bring the code in this article right up to date.
The timer queue that is the focus of these blog posts is part of The Server Framework. This body of code has been around since 2001 and has evolved to support new platforms and compilers. One of the things that I do from time to time is remove support for old platforms and compilers. This allows me to start using exciting new C++ features (around 10 years after they were first 'new') and it means that code that is present just to work around issues in old platforms can be removed. The timer queue has quite a bit of code that needn't be there now that Windows XP has passed away.
In 2008, in episode 17, I added support for
GetTickCount64() which removed some of the complexity from timer management. We kept the old version around as you needed to have Windows Vista or later to use that API call. Now that I no longer support Windows XP every supported platform has
GetTickCount64() and there's no reason to support the XP version. Removing that removes quite a bit of complexity from both the internals and the public API; you no longer have to decide which version to use as there is only one!
As the recent bug fixes have shown, there are also two versions of timer dispatch. The original version which holds a lock whilst calling into user code during timer dispatch and the improved version that doesn't. Holding locks whilst calling back into user code via callback interfaces is a surefire way to create lock inversions and deadlock. The old dispatch method was originally kept because I was unsure of the performance characteristics of the new way. The improved version was added in 2008 in episode 18 and every piece of development that I've done since has used the new dispatch method. The performance difference, if present, is not relevant and the removal of the ability for code to take part in lock inversions is important for library code. So the original method of dispatch should be removed.
The removal of this functionality massively simplifies the code and the tests. 60 tests can be removed which is almost a third. Most of the code that was moved into the common base class for the two versions of the queues can be moved into the 'Ex' version with
GetTickCount64() support. I expect that I should, eventually, rename the
CCallbackTimerQueueEx class to
CallbackTimerQueue, but not yet. There are some systems that use this class explicitly rather than via
CThreadedCallbackTimerQueue and the enum that we're removing in this release. It will be easier to introduce another breaking change in a future release as it means that fixing the breaking changes due to this release is slightly easier and more obvious and the fix when I do change the name will be a simple rename in the code that uses the old class...
Previously on "Practical Testing"... I've just fixed a new bug in the timer queue and in doing so I updated the code used in the blog posts to the latest version that ships with The Server Framework. This code included two fixes that had been fixed some time ago and which I hadn't documented here. They also lacked unit tests... Last time, I wrote tests for, and fixed, the first bug. This time I fix the final bug.
This bug is in the "lock-free" timeout handling code and it would cause the threaded version of the timer queue to do a small amount of unnecessary work but otherwise work correctly. The problem is that we use a
ULONG_PTR value as an opaque handle when processing timeouts using the "lock-free"
EndTimeoutHandling() sequence and this opaque handle has a sentinel value of
0 that is used to indicate that there are no timeouts to handle. The code that we use to generate the timeout handle in
BeginTimeoutHandling() is a simple
::InterlockedIncrement() and, if enough timeout handles are generated, this will wrap to
0 and return the sentinel value when it should be returning a non-sentinel value.
Previously on "Practical Testing"... I've just fixed a new bug in the timer queue and in doing so I updated the code used in the blog posts to the latest version that ships with The Server Framework. This code included two fixes that had been fixed some time ago and which I hadn't documented here. They also lacked unit tests... In this episode I find and fix the first of these issues by writing a unit test that triggers the issue.
This bug is another edge case that isn't used that often by most of the code that uses the timer queue. The queue supports timers that can be set and reset a number of times via an interface that allows you to create a timer, set/cancel it and then destroy the timer. A less used interface allows you to create 'fire and forget' timers that can only be set once and that clean themselves up. Hardly any code that I write uses this interface but it's there for backwards compatibility and the code required to support it is limited to the call that sets the initial timer, as the cleanup is done by code that's shared with timers that are deleted during their own timeout processing.
This bug also only affects the Timer Wheel implementation which has a considerably smaller set of users and a considerably narrower use case. There's test coverage for "one shot" timers but only for timers that are processed whilst the timer queue's internal lock is held. There is no test for a "one shot" timer for queues that dispatch without holding a lock. The code for lock-free dispatch is significantly different to the code for dispatch whilst holding a lock and that's where the bug is.
The bug was originally found because it causes heap corruption due to a double delete. The first thing I'll do is add a test for lock-free timer dispatch of "one shot" timers. This clearly demonstrates the bug when run in release mode on Windows 10 as the heap alerts us to the problem.
TL;DR I'd like to like it. It does some good things but it also gets in my way and slows me down.
ReSharper is a Visual Studio addin. In general I don't like addins but this comes from my years working short contracts and multiple clients where it was easiest to be at home in a clean Visual Studio installation as no two clients would have the same addins installed. ReSharper's adding intercepts lots of standard Visual Studio functionality and I'm sure that I could get used to it but intellisense and stuff is fractionally slower than without it installed (and yes, I do have a machine that well exceeds the basic hardware requirements). It also does some things just differently enough that it means I automatically do the wrong thing and then have to redo the right thing. This interrupts my flow. I'm sure that if I spent enough time with it I'd get used to it but I'm not sure I will spend enough time with it.
Right now I'm most interested in the inspection side of ReSharper. Inspecting a solution takes quite a while, which is to be expected. Some of the coding violations that it reports are useful, some less so, some inappropriate, but that's always the way with these tools. It seems to be difficult to suppress violations. I like the way PC-Lint allows for inline suppression comments (which I hate, but use anyway) and external suppression files. I work on lots of different codebases and common suppression rules are a must and these MUST be project/work item specific.
I've had really good support from JetBrains. The addin popped up a message box telling me I was half way through my trial and asking me how I was getting on. I responded with some of my gripes and had an email within a day from a helpful support person with some work arounds and explanations.
I'm still using it, but I'm not sure I'll continue with it after the trial completes.
Back in 2004, I wrote a series of articles called "Practical Testing" where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns all of which have been supported by the original unit tests and many of which have led to the development of more tests.
In the previous update in July 2014 I added custom intrusive containers to replace the stl containers and reduce memory allocation contention and improve performance. Since then the code has been in regular use by clients and users of The Server Framework and has stayed pretty stable.
Recently a client has upgraded some rather old code to a new version of The Server Framework and triggered a bug that has lain dormant in the timer queue since release 6.6 of The Server Framework when I added some new locking classes. Most clients haven't triggered this bug because they don't use the code in this particular way.
So, in this installment I detail the bug and present some tests which reveal it. The fix is pretty simple, we simply have to change the lock type used by the threaded timer queue so that it allows reentrant locking.