May 06, 2008

More on the CLR startup change

Last week I mentioned that some of my tests for my Win32 Debug API class had suddenly started failing. It seems that I was right and the changes are due to some .NET fixes that have been rolled out recently. The code runs and the tests pass if I run on a clean install of Vista x64 in a VM, and fail on my day-to-day development box.

It seems that my plan to "stick a breakpoint in mscoree.dll's _CorExeMain()" wasn't such a good idea after all. With the new updates installed, and with an x64 process using the Win32 debug API to run a CLR 1.0 or 1.1 app (x86), the breakpoint in mscoree.dll's _CorExeMain() never gets hit. Luckily, switching to a breakpoint in mscorwks.dll's _CorExeMain() instead seems to work on both x86 and x64, running either x86 or x64 (where appropriate) CLR apps, and on both a clean install of Vista and on patched systems. So now, if the app we're launching is a CLR app, we ignore the start address entirely and use mscorwks.dll's _CorExeMain() as our start address. This seems to give a reliable way to halt a CLR app after it has started up, when it's in a stable state, and before it starts to do anything. Just what I need to inject my code.


Posted by Len at 03:37 PM | Comments (0) | Categories: Debugging Tools, Testing

Interesting blog

I found JP's blog whilst googling for some information on RVA forwarders in PE files. There's lots of good stuff there about reasonably low level Windows development, debugging, testing and API hooking. Recommended if you like the kind of stuff that I write about.

In addition to the blog postings JP has produced cfix a unit testing framework for C++. I haven't had a chance to look at it too deeply yet, but the documentation looks good and the source is available from SourceForge under the GPL.


Posted by Len at 12:30 PM | Comments (1) | Categories: Geek Speak

May 01, 2008

WOW64 Win32 DebugAPI CLR application startup change

Back in October 2007 I summarised my findings from getting my Win32 DebugAPI based debug engine working on x64. One of the strange things that I found at the time was this:

When running a CLR app under the Win32 debug interface you only ever seem to hit the native entry point if you're running under WOW64. In all other situations you don't hit the native entry point ever. If you rely on it to pause your debug tools once the process is completely loaded and ready to roll then you need to stick a break point in _CorExeMain in mscoree.dll. What's more, if you're on x64 you might not even be able to access the native entry point's memory...

Well, that seems to have changed... Upon running up my "Debug Tools" test harness a couple of days ago I found I had some test failures when launching CLR 1.0 apps for debug from a Win32 debugger running on an x64 system. On my system only CLR 2.0 apps run as native x64, so, in effect, the Win32 debugger was launching a Win32 CLR application whilst running under the WOW64 layer. The behaviour now seems to be identical to running a Win32 CLR application from a Win32 debugger on an x86 system; which, I suppose, is good. The downside is that I've no idea when this change was rolled out and I now have no sure-fire way of building a VM box with the old style behaviour to see if I can write some code that works with both fixed and unfixed CLR start up semantics. I guess I can try a clean install of Vista x64...


Posted by Len at 04:43 PM | Comments (0) | Categories: Debugging Tools, Geek Speak, Testing

April 28, 2008

Socket connection termination

I've been putting together a sample server for a client that shows how to cleanly terminate a socket connection. This should have been a simple thing to do, but in doing so I've discovered some gnarliness within the framework and the result has been some new TODO items for the 5.3 release...

When you have an active TCP/IP connection that you wish to terminate cleanly you need to initiate a TCP/IP protocol level shutdown sequence by calling shutdown(). This sends the appropriate packets between the two TCP/IP stacks (server and client) and terminates the connection. Once this is done you can close the socket by calling closesocket(); this cleans up the resources used by the socket (and associated data structures) within your program. Closing the socket without initiating the protocol level shutdown sequence implicitly triggers the shutdown sequence. This is explained here "Graceful shutdown, linger options, and socket closure".
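As a rough sketch of that sequence, and not code from the framework, the following uses the POSIX socket calls on a local socket pair so that it can run without a network. The helper name SendThenShutdown is made up for illustration. On Winsock the equivalent calls would be shutdown() with SD_SEND and then closesocket() on a real TCP connection; the ordering is the point here, not the transport.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Illustrative helper, not part of the framework: sends 'message', performs
// the protocol level shutdown and only then closes the descriptors.
// Returns whatever the peer read before it saw the zero byte read.
std::string SendThenShutdown(const std::string &message)
{
   int fds[2];
   if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
   {
      return std::string();
   }
   const int sender = fds[0];
   const int peer = fds[1];

   // Keep the message small; nothing is reading yet, so it must fit in the
   // socket buffer for this single threaded sketch not to block.
   const ssize_t sent = send(sender, message.data(), message.size(), 0);
   assert(sent == static_cast<ssize_t>(message.size()));
   (void)sent;

   shutdown(sender, SHUT_WR);   // "I have no more data to send"

   std::string received;
   char buffer[64];
   ssize_t bytes = 0;
   while ((bytes = recv(peer, buffer, sizeof(buffer), 0)) > 0)
   {
      received.append(buffer, static_cast<std::size_t>(bytes));
   }
   // recv() returning 0 is the orderly 'peer has shut down' indication.

   close(peer);
   close(sender);               // resource cleanup; the shutdown is already done
   return received;
}
```

The peer gets all of the data and then a read of 0 bytes, which is the same signal that a server sees as a 'client close' on a pending read.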

Simple servers written using our server framework tend to operate as follows: When an incoming connection is detected an asynchronous read is issued, this increments a reference count on our socket class. When a read completes the last thing that happens before the function returns to the calling code within the framework is that a new read is issued. If the client closes the connection the pending read within the server returns with 0 bytes read, this is interpreted as a 'client close' and no further reads can be issued on the socket. This, eventually, causes the reference count on the socket to fall to 0 and the socket gets cleaned up. Part of that clean up involves calling closesocket(). If the server wants to terminate the connection then it calls Shutdown(ShutdownSend) on its socket to indicate to the client that it has no more data to send and this eventually results in the client shutting down its socket and the server socket cleanup sequence that I described earlier.

Due to the way the server is designed, there's some 'clever stuff' in there to make sure that if you have several writes pending but not yet issued by the framework then the call to shutdown() occurs after the last write has actually been passed off to the TCP/IP stack.

The socket class also exposes a Close() method which calls closesocket() on the socket directly; that is it doesn't do 'clever stuff' to deal with outbound data that is 'in flight' within the framework. You probably don't want to call Close() unless you don't care if the data gets to the other end or not; or if you know that there's no data 'in flight'.

It gets more complex...

Due either to my misreading of the docs for closesocket() or to the fact that they were originally less clear and have since been clarified, it was my belief that a graceful shutdown using closesocket() would block. Since one of the most important design decisions of the framework is that work done on the I/O threads should not block, the default behaviour for the automatic socket closure that happens when a socket is being cleaned up is for the close to be a 'hard' or 'abortive' close. That is, we deliberately choose not to linger. Because this isn't always what you want (no kidding!) there's some code in there that allows you to intercept the default behaviour and, potentially, call CloseSocket() yourself or marshal the CloseSocket() call off to your own threads so that it can block them instead. However, graceful shutdowns that occur due to closesocket() do not block, so, it seems, most of that code isn't really needed...

Some of the example servers in the 5.2.1 release get this wrong. It doesn't cause them to lose data, since they're not doing anything that complex, but a more complex server that has been modelled on one of the examples may have problems. If you've been having this problem then I'm sure you'd have contacted me already, but, if not, do get in touch and I'll help sort things out for you.

So, in summary, at present, in version 5.2.1 of the framework or before, you should generally be calling Shutdown() to terminate your connections and the framework will deal with the resource cleanup and eventual call of closesocket() itself. You can call Close() but you shouldn't do that unless you KNOW that there cannot be any data 'in flight' that the server has sent but that the client might not have received, OR you don't care if the data gets to the client.

This will become nicer in 5.3, I hope. I plan to make "standard" connection termination easier to manage and provide access to the, currently private, AbortiveClose() method on the socket class; this sets the socket's linger options in such a way that the socket is closed immediately and all pending data is discarded. What's especially useful is that this also sends a RST (reset) on the TCP/IP connection and this closes the connection without putting the closer into the TIME_WAIT state; which is useful sometimes.
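For reference, the usual way to get that behaviour from the sockets API is to turn lingering on with a timeout of zero before closing. The sketch below is an assumption about the general technique, not the framework's AbortiveClose(); the function names are invented, and it uses the POSIX calls. With Winsock the same SO_LINGER option is set via setsockopt() and the socket is then closed with closesocket().

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative only; this is not the framework's AbortiveClose().
// Turns lingering on with a zero timeout so that the next close discards
// unsent data and, for TCP, resets the connection instead of closing it
// gracefully.
bool SetAbortiveLinger(int socketHandle)
{
   struct linger options;
   options.l_onoff = 1;    // linger enabled...
   options.l_linger = 0;   // ...for zero seconds
   return setsockopt(socketHandle, SOL_SOCKET, SO_LINGER,
                     &options, sizeof(options)) == 0;
}

// Usage: set the option, read it back to confirm it stuck, then close.
bool AbortiveCloseExample()
{
   const int s = socket(AF_INET, SOCK_STREAM, 0);
   if (s < 0)
   {
      return false;
   }
   bool ok = SetAbortiveLinger(s);

   struct linger readBack;
   socklen_t length = sizeof(readBack);
   ok = ok
      && getsockopt(s, SOL_SOCKET, SO_LINGER, &readBack, &length) == 0
      && readBack.l_onoff != 0
      && readBack.l_linger == 0;

   close(s);
   return ok;
}
```

The example socket is never connected, so nothing is actually reset here; it only shows the option being applied before the close.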


Posted by Len at 06:05 PM | Comments (2) | Categories: Socket Servers

April 24, 2008

Spam gone...

The endless torrent of bounce messages began again yesterday evening. Once again it was around one email every 2 minutes or so. I turned on my laptop this morning expecting a few thousand emails to download and only a third of them to be correctly classified as spam by Outlook... There were a few, but, probably, under 10. There was no other spam either... Two legitimate email messages... It was, well, rather strange.

Of course this didn't seem right. I sent myself a test email and that worked. I checked the webmail interface and the mailbox was really empty. I bothered the guy who runs my mail hosting via messenger and he explained that he'd changed the SMTP server last night. He now uses qpsmtpd and it has a plugin that checks emails for known spam URLs and filters these spam messages out.

I'm still not convinced... So far most of the legitimate email that I should get on a daily basis is arriving OK (newsgroup notifications, NAS alerts, etc.) but email from one of my NAS devices doesn't seem to be getting through... And if that's not working, who else is having problems?

Overall the lack of spam is nice, if a little weird and ever so slightly retro. Assuming it is actually working correctly then I think it's a great improvement. However, I can't help feeling slightly cut off from the 'heart beat' of the internet.

If you send an email to my jetbyte account and don't get a reply then you could try sending to my gmail account, which is the obvious address, or leave a comment here... Fingers crossed you won't need to...


Posted by Len at 08:52 AM | Comments (2) | Categories: Geek Speak

April 21, 2008

Spam problems

This morning a spammer somewhere seems to have used my main email address as the return address on a whole bunch of random spam that has been sent out from all over the place. As such I have around 3000 undeliverable mail responses flowing into my inbox. No doubt this will now have knock-on effects with ISPs who use DNSBL type systems as my domain is being used by spammers again.

What's the best way of dealing with this kind of problem?


Posted by Len at 08:14 AM | Comments (3) | Categories: General

April 17, 2008

Comments, captcha and blacklist...

I've turned the blacklist back on. I turned it off yesterday and have had a couple of spam comments get through. The blacklist itself doesn't always catch the spam comments but it does give me a one click method of removing them. With it turned off I lose the easy removal.

If your comment is refused you should get a message telling you why; the reason is logged but, unfortunately, the full comment text isn't. The best approach if you have a legitimate comment that you can't post is to either email me, or leave a simple comment that explains that you can't comment ;) (I know...) Anyway, if you do that then I can remove the offending, overzealous, blacklist entry and post your comment for you...


Posted by Len at 05:40 PM | Comments (0) | Categories: General

April 15, 2008

What would I do??

There's an entry over on the Dr. Dobbs blog about testing and how you make sure that your tests are testing the right thing; effectively, who tests the test. There's a question at the end "What do you do?" and I think my rather pithy, I've had some wine, answer is, "I think harder".

The poster laments the fact that if you're doing TDD then the test fails first and then you write the code and then it works and therefore you know the test is testing the correct thing but if you have existing code then, well, it doesn't work that way. It only doesn't work that way if you're being lazy.

The example given is that an already developed component that has tests is now made multi-threaded. The poster decided that simply running all the existing tests in parallel would test the component for thread-safety. And was then surprised to find that it didn't due to how the tests all tested their own, isolated, instance of the component...

Hmm. Personally I feel that if you're trying to write a test to prove that something is thread safe then you need to write a test that deliberately puts the thing under test in a situation where the lack of thread safety shows itself. You don't just assume that running existing tests together and having them work will somehow prove something... Writing tests for multi-threaded code is hard. You need to think about it. Abdicating thinking and then somehow pushing this failure back onto the tests themselves is, er, rather crap.
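A minimal sketch of the kind of test that this implies, assuming a made-up Counter class as the component under test: every thread works on the same instance, so a missing lock has a chance to show up as lost updates. The names here are invented for illustration and are not from the Dr. Dobbs article.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical component under test; the lock is what makes it thread safe.
class Counter
{
public:
   void Increment()
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      ++m_value;
   }

   long Value() const
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      return m_value;
   }

private:
   mutable std::mutex m_mutex;
   long m_value = 0;
};

// Every thread hammers the SAME instance; giving each thread its own
// instance would exercise nothing that is shared.
long HammerSharedCounter(int threadCount, int incrementsPerThread)
{
   assert(threadCount > 0 && incrementsPerThread >= 0);

   Counter shared;
   std::vector<std::thread> threads;
   for (int t = 0; t < threadCount; ++t)
   {
      threads.emplace_back([&shared, incrementsPerThread] {
         for (int i = 0; i < incrementsPerThread; ++i)
         {
            shared.Increment();
         }
      });
   }
   for (std::thread &thread : threads)
   {
      thread.join();
   }
   return shared.Value();
}
```

A test would then assert that HammerSharedCounter(4, 10000) returns 40000. A pass doesn't prove thread safety, but with the lock removed a test like this can at least fail, which running isolated tests in parallel never will.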

As I've said before, the tests act as scaffolding for the code and the code acts as tests for the tests. If either is wrong or is changed so that previously held beliefs are no longer true then the tests fail. You don't write tests for tests; you write tests for code and if either disagrees with what should happen then the test fails. It's like aircraft having multiple redundant systems; they should agree, or there's a fault.

The original poster's problem is that he didn't actually bother to write a test for the situation that he wanted to test. I think he should be asking "who tests the tester"...


Posted by Len at 10:24 PM | Comments (0) | Categories: Geek Speak, Testing

April 14, 2008

PQR - A Simple Design Pattern for Multicore Enterprise Applications

There's an interesting article over on the Dr. Dobbs Code Talk blog; PQR - A Simple Design Pattern for Multicore Enterprise Applications. It documents a design that I'm pretty familiar with and one which has worked pretty well for me in the past (this project was built in this way, for example).

My variation on this idea is that it all tends to be in one process. Work items are passed from one 'processor' to another via queues and each processor can run multiple threads to process multiple work items in parallel. In simple systems you end up with a "pipeline" and work items flow from one end to another; more complex systems may be modelled as networks of processors. You can tune the system by adjusting the number of threads in each processor's thread pool and can also do things like having different processors run at different thread priorities (if you really want to). Since a work item is only ever being accessed by a single processor at a time, the data in the work item doesn't need any locking. If a processor needs to access data which can be shared (either by instances of a processor or by different processors) then normal locking is required, but the situations where locking IS needed are greatly reduced.
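A small sketch of that shape, under the assumption of two processors joined by queues: the first squares each work item on a configurable number of threads and the second sums the results on a single thread. WorkQueue and RunPipeline are hypothetical names, and a real system would carry richer work items than an int.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical queue that links two processors. Only the queue is locked;
// a work item is owned by one processor at a time and needs no lock itself.
class WorkQueue
{
public:
   void Push(int item)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_items.push(item);
      }
      m_ready.notify_one();
   }

   // Blocks until an item is available. Returns false once the queue has
   // been closed and drained.
   bool Pop(int &item)
   {
      std::unique_lock<std::mutex> lock(m_mutex);
      m_ready.wait(lock, [this] { return m_closed || !m_items.empty(); });
      if (m_items.empty())
      {
         return false;
      }
      item = m_items.front();
      m_items.pop();
      return true;
   }

   void Close()
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_closed = true;
      }
      m_ready.notify_all();
   }

private:
   std::mutex m_mutex;
   std::condition_variable m_ready;
   std::queue<int> m_items;
   bool m_closed = false;
};

// Two stage pipeline: 'squarerThreads' threads square each item and a
// single thread sums the squares.
long long RunPipeline(const std::vector<int> &input, std::size_t squarerThreads)
{
   assert(squarerThreads > 0);

   WorkQueue toSquare;
   WorkQueue toSum;
   long long total = 0;

   std::vector<std::thread> squarers;
   for (std::size_t i = 0; i < squarerThreads; ++i)
   {
      squarers.emplace_back([&toSquare, &toSum] {
         int item = 0;
         while (toSquare.Pop(item))
         {
            toSum.Push(item * item);
         }
      });
   }

   // 'total' is touched only by this thread until it has been joined.
   std::thread summer([&toSum, &total] {
      int item = 0;
      while (toSum.Pop(item))
      {
         total += item;
      }
   });

   for (int item : input)
   {
      toSquare.Push(item);
   }
   toSquare.Close();
   for (std::thread &thread : squarers)
   {
      thread.join();
   }
   toSum.Close();   // safe now that no squarer can push any more work
   summer.join();
   return total;
}
```

Tuning then amounts to changing squarerThreads per stage, and the queue depths are the obvious things to expose as counters.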

I find it interesting that the Dr. Dobbs article points out that 'careful measurement is required'. I agree, this is one of those situations where it's vitally important to include performance monitoring (via perfmon counters?) from the outset. Unless you can see how many threads are active at each stage in the pipeline and how many work items are in each of the queues then you simply cannot tune the system in a meaningful manner.


Posted by Len at 05:44 PM | Comments (6) | Categories: Geek Speak

April 09, 2008

Practical Testing: 17 - A whole new approach

The comments to my last practical testing entry got me thinking. The commenter who had located the bug in part 15, which was fixed in part 16, suggested a new approach to the problem and I've been investigating it.

The suggestion is, essentially, to use a timer with a longer range before roll-over rather than GetTickCount() with its 49.7 day roll-over. In Vista and later we could just use GetTickCount64() but on earlier platforms that's not available to us. My commenter's solution was to build a GetTickCount64() on top of GetTickCount() and use that. Given that adjusting the code for Vista support via the real GetTickCount64() was on my list of things to do, I decided to also take a look at the potential of the hybrid approach suggested by my commenter.

Switching to using a greater range means that we can remove much of the complexity which was there to protect us from the rollover as this will now only occur after around 584942417.4 years of machine up-time rather than after 49.7 days...
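Both figures fall straight out of the size of the counter, since the tick count is in milliseconds. The sketch below just does the division; the function names are invented, and the 64-bit figure assumes years of 365 days, which is what reproduces the number quoted above.

```cpp
#include <cassert>
#include <cmath>

const double kMillisPerDay = 1000.0 * 60.0 * 60.0 * 24.0;

// Days until a 32-bit millisecond counter rolls over (2^32 ms).
double RollOverDays32()
{
   return std::ldexp(1.0, 32) / kMillisPerDay;
}

// Years (of 365 days) until a 64-bit millisecond counter rolls over (2^64 ms).
double RollOverYears64()
{
   return std::ldexp(1.0, 64) / (kMillisPerDay * 365.0);
}
```

These come out at roughly 49.7 days and 584942417.4 years respectively.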

In the zip file that accompanies this article there are two timer queues under test. The first, CCallbackTimerQueue, uses a hybrid GetTickCount64() implementation that will work on any platform as it uses GetTickCount() to do the work and the timer queue manages the upper 32-bits itself. The second, CCallbackTimerQueueEx, uses the real GetTickCount64() call and will only run on Windows Vista or later platforms. You can build for pre-Vista systems by editing the Admin\TargetWindowsVersion.h header file and adjusting the values for NTDDI_VERSION and _WIN32_WINNT.

The native Vista version of the code is the simplest so I'll discuss that first. There are several additional issues that need to be dealt with if we are building our own GetTickCount64() and these get in the way of the simpler code...

The first thing to note, of course, is that I had the tests that were written for the previous versions of the code, and these made it easier for me to change the internals. I did this before in part 15 when I fiddled around with the internals to make the code more scalable. The presence of the tests makes this kind of change quite fun; I can concentrate on hacking away at the old design and know that if I change some functionality that is covered by my tests then I should find out as soon as I run the tests.

Looking at the header for CCallbackTimerQueueEx, the first thing that you'll notice is that I've removed a couple of constructors; there's now no need to allow the user to tune the maximum timeout allowed. Next you'll see that the actual data structures used for the queue have been simplified; we only need one queue now rather than two and the timers are keyed by ULONGLONG rather than Millisecond (DWORD). There are fewer helper functions and we use an instance of IProvideTickCount64 rather than IProvideTickCount.

Looking at the code itself, I've hardcoded the maximum timeout to one less than INFINITE, which gives us the whole usable range of a DWORD for timeouts. I don't see any advantage in expanding the length of the timeouts that you can set to be ULONGLONGs as 49.7 days should be long enough for anyone ;) and, if it isn't, the user can set another timer when that one expires and build a longer timeout using the current implementation. Since all of the multiple queue stuff can go, setting timers is now simpler and we can go back to the functionality from part 15 where calling SetTimer() does NOT cause timed out timers to be handled automatically (I was never really comfortable with that change anyway!). InsertTimer() is simpler as we're only ever dealing with a single timer queue and, rather than all the complexity that we had before for dealing with a timer that spans a rollover, we can now simply disallow timers that do that; I don't feel too bad about doing this as I think it's reasonable to specify that the code doesn't support setting timers that cross a 584942417.4 years rollover point. GetNextTimeout() is now massively simplified as all it needs to do is look at the timeout value and compare it with now to see if it has expired. And that's it.
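To make the shape of the simplified queue concrete, here is a cut-down sketch and not the code from the zip file. SimpleTimerQueue, kInfinite and the integer timer id are all invented for illustration; the caller passes in the 64-bit "now", and returning kInfinite when no timers are set is an assumption about how an empty queue is reported.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <utility>

const std::uint32_t kInfinite = 0xFFFFFFFFu;   // mirrors the Win32 INFINITE value

class SimpleTimerQueue
{
public:
   // Returns false if the timeout is out of range or if the timer would
   // cross the 64-bit roll-over point, which is simply disallowed.
   bool SetTimer(std::uint64_t now, std::uint32_t timeoutMillis, int timerId)
   {
      if (timeoutMillis >= kInfinite)
      {
         return false;   // the maximum timeout is INFINITE - 1
      }
      if (now > std::numeric_limits<std::uint64_t>::max() - timeoutMillis)
      {
         return false;
      }
      m_timers.insert(std::make_pair(now + timeoutMillis, timerId));
      return true;
   }

   // Milliseconds until the earliest timer is due, 0 if it is already due
   // and kInfinite if there are no timers. Assumes 'now' never goes backwards.
   std::uint32_t GetNextTimeout(std::uint64_t now) const
   {
      if (m_timers.empty())
      {
         return kInfinite;
      }
      const std::uint64_t due = m_timers.begin()->first;
      return due <= now ? 0 : static_cast<std::uint32_t>(due - now);
   }

private:
   // A single queue, ordered by absolute 64-bit expiry time.
   std::multimap<std::uint64_t, int> m_timers;
};
```

With one ordered container keyed on a value that never wraps in practice, the earliest timer is always the first element and no queue swapping is needed.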

CCallbackTimerQueue is more complex, but not massively so. The complexity arises due to how I maintain the high 32-bits of the 64-bit counter. Since the code works in terms of the 32-bit counter value returned by GetTickCount() and we know that this wraps every 49.7 days, I figure we can spot the wrap (now is less than the last time we checked) and use the event to increment the high 32-bit counter. The only potential risk is that we don't spot the wrap; that is, we don't call GetTickCount() for 49.7 days and the counter wraps and then becomes more than the last time we called GetTickCount(). To prevent this unlikely situation, the timer queue sets its own internal maintenance timer for the 32-bit counter roll over point. All this timer does is go off and reset itself but, I think, this is enough to cause GetTickCount() to be called often enough to prevent any problems...
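The wrap detection can be sketched in a few lines. This is an illustration of the idea rather than the class from the download; TickCount64 is an invented name and the 32-bit tick value is passed in by the caller, where the real code would obtain it from its tick count provider.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative hybrid counter: the caller supplies the current 32-bit tick
// count and gets back a 64-bit value whose high half counts the wraps.
class TickCount64
{
public:
   explicit TickCount64(std::uint32_t initialTicks)
      : m_lastTicks(initialTicks), m_high(0)
   {
   }

   std::uint64_t Update(std::uint32_t nowTicks)
   {
      if (nowTicks < m_lastTicks)
      {
         ++m_high;   // 'now' is less than last time, so the counter wrapped
      }
      m_lastTicks = nowTicks;
      return (static_cast<std::uint64_t>(m_high) << 32) | nowTicks;
   }

private:
   std::uint32_t m_lastTicks;
   std::uint32_t m_high;
};
```

This only works if Update() is called at least once in every 49.7 day period, which is exactly what the internal maintenance timer is there to guarantee.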

The tests need to change a little due to the way that SetTimer() no longer implies HandleTimeouts() and because of the internal maintenance timer that is set upon construction.

The duplication in the code bothers me, so I expect the next instalment will deal with that, and any bugs that people report!

Code is here and the new rules apply.

Note: This release has been rushed; I haven't had a chance to check any of the builds except the VS 2008 and VS 2005 ones. I'll check the rest and fix any problems when I get back from Zermatt.


Posted by Len at 06:54 PM | Comments (0) | Categories: Source Code, Testing

April 07, 2008

It seems I'm not the only one...

It seems I'm not the only one to make mistakes with GetTickCount() based timer code, see: System.Threading.Timer fires immediately when specifying a large value for due time.


Posted by Len at 07:05 PM | Comments (0) | Categories: Geek Speak

April 04, 2008

Practical Testing: 16 - Fixing a timeout bug

Back in 2004, I wrote a series of articles called "Practical Testing" where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuilt the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Then, in 2005, I adjusted the code to be more scalable and I showed how the tests that had originally been written helped when code needed to be changed for performance purposes. Finally I uploaded a test utility program that I'd been working on, TickShifter, that allowed you to run a program and take control of how the GetTickCount() API operated within the program. The idea was that you could control time from outside the program to enable you to easily test edge conditions.

Time passed...

Recently I've had a bug reported against the timer queue code that was developed in the testing articles. You can find the bug report comment here. I'd like to thank the commenter for taking the time to report the bug in such a thorough manner; it made it much easier for me to validate the problem and craft a test that proved its existence.

The bug is as follows: If the tick count has wrapped but no timers have fired since it wrapped and you then add a new timer, the two queues that are used to manage wrapped timers have not been adjusted to allow for the fact that the tick count has wrapped. This means that the new timer is added to the wrong queue, the queues are then not swapped, and timers expire in the wrong order.

It's a nice edge case bug and it's one that was not tested for in the original test harness. The first thing that I did was set out to write a test to reproduce the bug. The commenter had used TickShifter to force the situation (which is what it's for!) but for development and regression purposes it's better to have tests in the test harness that make sure the bug stays fixed.

The first new test TestTickCountWrap2() sets up the environment exactly as the bug report stated. Take a look at the code for full details. First we set the tick count to be 1000ms before roll-over. Next we set a timer for 2000ms time, i.e. 1000ms after the tick count rolls over to 0. Next we set the tick count to 0, i.e. the point at which it rolls over. We then set another timer for 10000ms time. At this point we should have two timers set; the first will expire in 1000ms and the second in 10000ms. This is the point where the original code had a bug. Next we set the tick count to 1000. At this time the first timer should go off and we check that it does. We also verify that there are 9000ms until the next timeout. Finally we set the tick count to 10000 and verify that the second timer expires correctly.
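The arithmetic behind this scenario is ordinary unsigned 32-bit wrap-around, which a small helper can make explicit. TimerCrossesRollOver is an invented name and is not part of the test harness; it simply shows why the first timer lands at tick 1000 on the far side of the roll-over while the second lands at tick 10000 without wrapping.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper: computes the absolute tick at which a timer is due
// and reports whether that tick lies beyond the 32-bit roll-over.
bool TimerCrossesRollOver(
   std::uint32_t now,
   std::uint32_t timeoutMillis,
   std::uint32_t &dueTick)
{
   dueTick = now + timeoutMillis;   // unsigned arithmetic wraps modulo 2^32
   return dueTick < now;            // a smaller value than 'now' means it wrapped
}
```

With the tick count 1000ms before roll-over (0xFFFFFC18) a 2000ms timer wraps and is due at tick 1000; with the tick count at 0 a 10000ms timer does not wrap and is due at tick 10000.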

Of course, when run with the code from part 15 this test fails for exactly the reason that the bug report stated. The problem was that we used two queues for the timers, one for timers before the "wrap point" and one for timers after. The only point that we ever switched the queues over was when a timer expired. At that point if we knew we had a timer that had expired and the current queue didn't contain any timers then we knew that the tick count had wrapped and that the timer we wanted was in the other queue. Of course, looking at this now, written down like that, it's quite obvious to see the flaws. It doesn't matter how many tests you write, if you don't write the correct tests then your code can still have bugs!

A solution to this bug is to keep track of the tick count when we set a 'wrapped' timer. Then, when checking for timeouts, if the current tick count is less than the tick count when we last set a wrapped timer we know that the count has wrapped and we can adjust the queues accordingly. The problem then is that we need to make sure the queues are correct when we set new timers as well as when existing timers expire. The simplest fix seems to be to cause a call to SetTimer() to first check for the expiry of existing timers. This means that the code that is used to check for a wrap when a timer expires is also executed before adding new timers. Hopefully, this means that the queues will always remain correct.
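One way to read that description is as a tiny piece of state that both SetTimer() and the timeout check consult. The sketch below is a model of the idea under that reading, not the fix as shipped in the download; WrapTracker and its method names are invented.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of the fix, not the code from the download.
class WrapTracker
{
public:
   // Call when a timer that expires after the roll-over is queued.
   void OnWrappedTimerSet(std::uint32_t now)
   {
      m_wrappedTimerPending = true;
      m_tickWhenSet = now;
   }

   // Call from both the SetTimer() path and the timeout check. Returns true
   // once per roll-over, which is the point at which the queues are swapped.
   bool ShouldSwapQueues(std::uint32_t now)
   {
      if (m_wrappedTimerPending && now < m_tickWhenSet)
      {
         m_wrappedTimerPending = false;
         return true;
      }
      return false;
   }

private:
   bool m_wrappedTimerPending = false;
   std::uint32_t m_tickWhenSet = 0;
};
```

Because the check no longer depends on a timer having expired, a timer added just after the wrap goes into the correct queue.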

The second new test expands on the first to ensure that a timer that has expired when SetTimer() is called is correctly processed.

Code is here but new rules apply: The code will build with VC6, VS.Net 2002, VS.Net 2003, VS 2005 and VS 2008. The code builds as x86 or x64 with VS 2005 and 2008. The code will build with either the standard STL that comes with Visual Studio or with a version of STL Port. The code uses precompiled headers the right way so that you can build with precompiled headers for speed or build without them to ensure minimal code coupling. The code can also compile on VC6 with or without the platform SDK being installed. The various options are all controlled from the "Admin" project; edit Admin.h to change things...


Posted by Len at 10:28 AM | Comments (5) | Categories: Source Code, Testing

March 22, 2008

And then there were three...

This morning a new ReadyNAS NV+ unit arrived, so now I have three. I didn't have a sudden change of heart about buying a development box; the power supply in my office NAS died on Thursday afternoon and buying a new, bare, enclosure was the quickest way to get my existing disks back online.

However, once I get the faulty unit returned from repair I'll have a unit that I could develop on...


Posted by Len at 08:53 AM | Comments (0) | Categories: Geek Speak

March 18, 2008

ReadyNAS development...

As I mentioned a while back I'm using a pair of ReadyNAS NV+ RAID systems as my on-site data store and off-site backup. These are both working well and I'm pleased with the solution. After a few hiccups and delays due to the Netgear takeover of Infrant it seems that the firmware is developing nicely again and the latest thing that I've discovered is that you can now develop and deploy your own code to the devices (if you're brave enough!). See here for details. Given that I'd need to purchase a third unit as a development and testing box, and given that I'm currently really busy with client work, I don't expect that I'll actually be doing any development on the NAS in the near future but, hopefully, someone may now make it possible to run a CVS server on it...


Posted by Len at 09:22 AM | Comments (0) | Categories: Geek Speak

Bug in timer queue code

Whilst I've been away I've had a bug report for the TDD timer queue code that's available here. The report is completely correct and the bug could result in a timer being scheduled out of sequence if it's scheduled around the point when GetTickCount() wraps. I've coded up a fix but I need to write it up and post it. It may take me a while to do this as I have a lot going on in my life at present; if you need the fix sooner then drop me a mail.

Updated: 4th April - Fixed code now available here.


Posted by Len at 09:15 AM | Comments (0) | Categories: Geek Speak