11th October is Thanksgiving Day in Canada

2004-10-08

I’ve had an interesting week. I’m back with the Refactoring Project and, although things were looking up last time I was here, they’ve managed to adopt some bad habits in my absence. The latest build that’s live with users isn’t tagged in CVS; we have 66 other builds that are tagged and the ability to rebuild an arbitrary release has helped on numerous occasions, but this time they decided not to bother. But worse than that, they haven’t been running the tests. Monday is Thanksgiving Day in Canada; we found that out because they didn’t run the tests…

The first thing I do each time I return to this client is pull out the latest source tree and run all the tests. It’s a “one click” affair, so I do it whilst I’m catching up on the team’s email, getting my first coffee and generally settling in. This week I ran the tests and a test in the FX library failed. I went and took a look at the test and found that it was one of the tests that helped to make sure we didn’t introduce subtle little date-related bugs that only show up when there’s the right combination of currencies and bank holidays in play… As usual, the test involved Canadian Dollars (CAD is a bit of a special case in the FX world as its spot day is 1 day earlier than most other currencies).
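To give a flavour of the kind of test involved, here is a minimal sketch in Python. It is not the client's code: the `HOLIDAYS` and `SPOT_LAG` tables, the function names and the holiday-skipping rule are all assumptions made for illustration, and real spot date conventions have more special cases than this.

```python
from datetime import date, timedelta

# Hypothetical holiday calendars keyed by currency code (illustrative data only).
HOLIDAYS = {
    "CAD": {date(2004, 10, 11)},  # Canadian Thanksgiving, 2004
}

# Assumed settlement lags in business days; pairs not listed default to 2.
SPOT_LAG = {frozenset({"USD", "CAD"}): 1}


def is_business_day(day, currencies):
    """True if 'day' is a weekday and not a holiday for any currency given."""
    if day.weekday() >= 5:  # Saturday or Sunday
        return False
    return all(day not in HOLIDAYS.get(ccy, set()) for ccy in currencies)


def spot_date(trade_date, base, quote):
    """Step forward from the trade date, counting only good business days."""
    currencies = (base, quote)
    remaining = SPOT_LAG.get(frozenset(currencies), 2)
    day = trade_date
    while remaining > 0:
        day += timedelta(days=1)
        if is_business_day(day, currencies):
            remaining -= 1
    return day


def test_cad_spot_skips_thanksgiving():
    # Traded on Friday 8th October: the weekend and Monday's holiday are skipped.
    assert spot_date(date(2004, 10, 8), "USD", "CAD") == date(2004, 10, 12)


def test_cad_spot_is_one_day_earlier():
    # Traded on Wednesday 6th October, with no holidays in play.
    assert spot_date(date(2004, 10, 6), "USD", "CAD") == date(2004, 10, 7)
    assert spot_date(date(2004, 10, 6), "EUR", "USD") == date(2004, 10, 8)
```

The point of a test like this is that the holiday case only differs from the everyday case a few times a year, so nothing in daily use will flag a regression until the holiday actually arrives.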

Due to a lack of understanding when the code was originally put together, the domain model is a little bit crap; not crap enough to force allocation of the time required for a de-crapping phase, but crap enough to make seemingly simple changes harder to do than they should be. We’d spent a lot of time reducing the crap and it’s still less than ideal, but that’s life. Anyway, we had tests to prevent regressions and they’d been pretty successful. The business users were starting to trust the system and were no longer afraid of US holidays… So I was a bit disappointed to see that the tests were no longer being maintained; and I said as much to the developers, pointing out that we either have a broken system or a broken test and, whichever it is, it needs fixing. The response I got from the developer who had made the breaking change was that he thought the tests were a good idea, he appreciated why they’d been created, but he was far too busy to use them; besides, his change had been live for some time now and “nobody had complained”… He ignored me when I asked him how many CAD holidays there had been between him putting his change live and now…

I decided to take a closer look at the change and, after a while, concluded that chances were good that the test was still correct and the system was broken. I pointed this out to him and I explained how we could write a test for the use case that his change fixed so that we could then adjust his change until both tests passed. He liked the idea, but decided not to bother; he was too busy with other (untested) stuff that was taking him far longer than he’d originally expected…

The interesting thing about this kind of situation is that it’s the kind of bug that could potentially cost the users a lot of money… This was the reason that we’d built the tests in the first place; we were scared of making a simple mistake with big consequences… The new guy didn’t know enough to be scared enough and had broken the system in such a way that there was a time-bomb ticking… If he’d run the tests after making his change he would have been alerted to the problem straight away; as it was, the change was in production and he’d been blissfully unaware of the bug that would, and did, surface today.

Still, this is why I’m here. I escalated the problem and started to write the test that would help us to rework the breaking change into a proper fix. We warned the users about the bug and have a patch ready to go live tonight. If we had caught it a day earlier the users wouldn’t have had to know about it at all and we’d have finally passed their first US holiday without unexpected weirdness… I guess they’ll have to wait for 11th November for that now. What’s especially annoying about this is that it was all avoidable. They had tests, and the tests broke in the right place when this change was made; it’s just that nobody bothered to run them…

Software development is about discipline and detail; code quality starts to decay as soon as developers forget this. All code decays, but tests can help to make this decay obvious earlier. However, for tests to be any use at all they need to be run! This client could do all kinds of technical things to try and prevent this from happening again but deep down the problem is more one of developer education than automation. The client needs to educate their junior developers so that they don’t actively seek to avoid the processes that they have in place around their development activities… Or, perhaps, they should just let the programmer responsible for the bug deal with the irate users; sending them to explain to the head of the desk why the numbers are wrong, again, usually has the desired effect of focussing the mind on the consequences of their actions…