Wednesday 1 September 2010

Brittle tests, Broken Windows and Boiled Frog Syndrome

It's shockingly easy to find yourself getting used to kicking a build server when the tests fail. You know the tests all passed on your local machine, and the last time the build failed, they all passed on everybody's machine. So, you trigger another run, and they pass on the build server. You accept this and move on, because, after all, you are busy, and there are stories to be done.

We've all been there, and it doesn't really hurt anybody, right? Sadly, I'm going to share the story of how it easily gets out of hand with a real example from my current project ...

It's a pretty big project. Over the last 12 months or so, we've created a complex intertwining system of 4 apps that fit into the existing system. The core app is a bit of an impressively large beast now, but we've been following good practices, and quite a lot of the code is pretty good. There's some rubbish in there, but we've kept it under control, and we're doing a pretty good job of removing the rubbish whenever we find it. Test coverage is good - we're using TDD and being pretty good at sticking to it, although we're human, so occasionally we screw up. The current count is 1214 "unit" tests, 2 suites of functional tests (the full app is used, 197 and 209 tests respectively) and some performance tests, which involve deploying the whole shebang into a representative environment and thrashing it.

That all sounds pretty good, but there's one minor problem. Running the full set of functional tests takes the build server (a VM, so not exactly high spec, and one that shares its resources) nearly 10 minutes. Since the app is designed to be high performance, there's a great deal of asynchronicity going on, which means that sometimes tests fail because the timing is out, particularly when resources are tight.
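For what it's worth, one common way to make tests like these less sensitive to timing is to poll for the expected result against a generous deadline, rather than sleeping for a fixed time and hoping the work has finished. This is a minimal Python sketch of that general technique, not the fix from this project: `wait_until` and the toy test are hypothetical names invented for illustration, and the timer just stands in for real asynchronous work.

```python
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it is truthy or `timeout` seconds have passed.

    Returns True if the condition was met, False if the deadline expired.
    (Hypothetical helper, written for this example.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def test_async_result_eventually_arrives():
    results = []
    # Stand-in for the asynchronous work under test: appends after 0.2s.
    worker = threading.Timer(0.2, lambda: results.append("done"))
    worker.start()
    try:
        # The timeout is an upper bound, not a fixed wait: on a fast box the
        # test finishes in about 0.2s, on a starved VM it has up to 5s.
        assert wait_until(lambda: results == ["done"], timeout=5.0)
    finally:
        worker.cancel()
        worker.join()
```

The point of the design is that a slow build server makes the test slower instead of making it fail, while a genuinely broken feature still fails once the deadline passes.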

So what happens? Well, occasionally the build fails. You check the failure reason, verify it's just a timing issue, kick the build off again and go back to what you were doing. Ad nauseam. Until you get out of the habit of checking the reason. Little by little, you expect the build to fail once, or twice, on each commit. Eventually, you get to the point where somebody is prepared to kick the build 10 times to get it passing. Even if there's a legitimate reason for it to be failing.

If you're lucky, somebody goes on holiday. Or somebody new joins the team. They look at this build that's now failing more often than not and ask just how the hell we got here. Suddenly, because they've not been exposed to it constantly for the last 2 months, it's not a hidden pain they just accept, it's intolerable. This is a good thing. They're going to make damn sure you sort the problem out. Do so. It may take a pair 2 days to sort it out. So what? It's costing you hours every single week until you resolve it. It may take the whole team a week. You still have to do it. Stop the line. You know the drill, you've heard the buzzwords. Don't suffer an unstable build. Analyse the problem and fix it, that's what you do. There is no excuse, there is always a way.

Then learn from it. Better yet, learn from me, or somebody else who's fucked it up. But we don't learn well from other people's mistakes. In which case, try to use our experience to help you spot your own before it gets too bad. Don't sit there in water that's gradually getting hotter and hotter. Spot the problem early, while it's still relatively easy to fix, before you've pissed away countless hours waiting for the build to go green so you can go home ...