Wednesday, 1 September 2010

Brittle tests, Broken Windows and Boiled Frog Syndrome

It's shockingly easy to find yourself getting used to kicking a build server when the tests fail. You know the tests all passed on your local machine, and the last time the build failed, they all passed on everybody's machine. So, you trigger another run, and they pass on the build server. You accept this and move on, because, after all, you are busy, and there are stories to be done.

We've all been there, and it doesn't really hurt anybody, right? Sadly, I'm going to share the story of how it easily gets out of hand with a real example from my current project ...

It's a pretty big project. Over the last 12 months or so, we've created a complex intertwining system of 4 apps that fit into the existing system. The core app is a bit of an impressively large beast now, but we've been following good practices, and quite a lot of the code is pretty good. There's some rubbish in there, but we've kept it under control, and we're doing a pretty good job of removing the rubbish whenever we find it. Test coverage is good - we're using TDD and being pretty good at sticking to it, although we're human, so occasionally we screw up. The current count is 1214 "unit" tests, 2 suites of functional tests (the full app is used, 197 and 209 tests respectively) and some performance tests, which involve deploying the whole shebang into a representative environment and thrashing it.

That all sounds pretty good, but there's one minor problem. Running the full set of functional tests takes the build server (a vm, so not exactly high spec, and shares its resources) nearly 10 minutes. Since the app is designed to be high performance, there's a great  deal of asynchronicity going on, which means that sometimes tests fail because the timing is out, particularly when resources are tight.

So what happens? Well, occasionally the build fails. You check the failure reason, verify it's just a timing issue, kick the build off again and go back to what you were doing. Ad nauseum. Until you get out of the habit of checking the reason. Little by little, you expect the build to fail once, or twice, on each commit. Eventually, you get to the point where somebody is prepared to kick the build 10 times to get it passing. Even if there's a legitimate reason for it to be failing.

If you're lucky, somebody goes on holiday. Or somebody new joins the team. They look at this build that's now failing more often than not and ask just how the hell we got here. Suddenly, because they've not been exposed to it constantly for the last 2 months, it's not a hidden pain they just accept, it's intolerable. This is a good thing. They're going to make damn sure you sort the problem out. Do so. It may take a pair 2 days to sort it out, so what. It's costing you hours every single week until you resolve it. It may take the whole team a week. You still have to do it. Stop the line. You know the drill, you've heard the buzz words. Don't suffer an unstable build. Analyse the problem and fix it, that's what you do. There is no excuse, here is always a way.

Then learn from it. Better yet, learn from me, or somebody else who's fucked it up. But we don't learn well from other people's mistakes. In which case, try to use our experience to help you spot your own before it gets too bad. Don't sit there in water that's gradually getting hotter and hotter, spot the problem early, while it's still relatively easy to fix, before you pissed away countless hours waiting for the build to go green so you can go home ...

Thursday, 29 July 2010

Why we test the simple stuff ...

One of the more popular arguments trotted out when you're trying to persuade somebody to give TDD a go is that the really simple stuff doesn't need testing. I can sometimes almost be persuaded that simple getters don't need testing, but frankly, if you don't have a test to drive them out (directly or indirectly) why are you implementing them at all? The fallacy that there is stuff so simple you don't need to bother testing it, because it's so trivial that nobody could possibly make a mistake doing it is proved wrong on a daily (if not hourly) basis on any project with more than two developers.

We found an interesting cock up this morning, in a class implementing AbstractConfigurationEntry, which I've talked about before. Clearly whoever implemented it (and yes, we know the culprit) forgot completely to write a test for it.

Spot the stupid mistake:

Yes, he really did create a new String array and then forget to put any values in it ...

Annoyingly, this is in library code, so took one of our pairs about 30 minutes longer than it should have to track down a bizarrely unexpected NullPointerException. Obviously the problem couldn't be in the library, since that's used all over the place, so they must have been making a stupid mistake themselves, right?

Sort your tests out, or pay for it.

Tuesday, 6 July 2010

Complete bafflement

I'm just completely baffled by this piece of code I just found in one of our classes:

Seriously, what the fuck is that about? How is that clearer, cleaner, better or more reliable than just making the three calls in a row? I can understand trying to refactor it out into something more clever and more expressive, but this is just wordy crap that obscures what you're doing.

Some days, I despair ...

Thursday, 1 July 2010

Getting rid of those damn globals

 I've just (well, yesterday) finished a nice little piece of refactoring which I felt was an interesting problem. I enjoyed it anyway, so I'll share it.

We have a number of home grown library packages that we use fairly extensively around our applications. One of the more useful ones is a configuration library. It has a handy little abstract class, AbstractConfiguration, that's designed to read and validate a config file based on the various configuration items you need.

That isn't terribly clear, but our tests should document it.

Oh dear, that's hardly documenting it. In fact, I'd go so far as to say that's making it less clear. Oh well, let's start out by making our test document how you'd be expected to implement an AbstractConfiguration by including one in the test class:

And then change the tests so that they test the implementation. Here's how the change affects the first two tests:

Anyway, as you can now see, we implement an AbstractConfiguration by extending it, and defining a number of AbstractConfigurationEntries, which we create getters for. The entries themselves contain validation rules (I could, for example, have shown you an IntegerRangeConfigurationEntry or a PositiveDoubleArrayConfigurationEntry - you get the picture ...) and those entries are validated on construction. If any validations fail, an exception is thrown with a meaningful message describing all the problems found. We find it handy.

The problem I ran into was that, I haven't shown you this yet, but the list of entries is a static variable on the abstract class. Since any mandatory fields that are missing will cause validation to fail, you can't have two different implementations in the same JVM. This hasn't proved a problem before - how many configurations does one app need? - but I now needed two. I could, I suspect, have worked around the issue, but that didn't appeal. There was a nasty smell hanging over the library.

First step, writing a test. This was hardly taxing, first I need another implementation in my test:

Followed by a test case:

Simple enough, and the test fails nicely. Ok, so what does AbstractConfiguration actually look like?

I've not shown you the detail of how it initialises and validates the entries, as there's already too much code in this post, but it ought to be fairly obvious.

Since the problem is obviously that entries is static, we can fix it by removing it's static nature and finding some other way to populate it. This obviously means that define can't add the entry being defined to the list, making it a pointless method. We'll deprecate it so we don't piss the rest of the teams using this off too much, and remove it in version 1.2

So, how to build entries dynamically? Well, we could take a step back, towards the 0.1 version, and build a list in a static block and pass that in to the constructor, but we moved away from that for a reason - it was rubbish. Since this isn't a class that you're going to be creating all over the place, why don't we use reflection? All of our entries are already constants, it's not terribly hard to get at them via reflection, we can just call this.getClass().getDeclaredFields() to get only the fields declared by our implementation, check to see if they're an AbstractConfigurationEntry, and if so, add them into our fields member. All of which can happily be done on construction. Giving us a new populateEntries method:

And there you have it. One less smell in our code base, and one more developer feeling slightly smug.