Wednesday 1 September 2010

Brittle tests, Broken Windows and Boiled Frog Syndrome

It's shockingly easy to find yourself getting used to kicking a build server when the tests fail. You know the tests all passed on your local machine, and the last time the build failed, they all passed on everybody's machine. So, you trigger another run, and they pass on the build server. You accept this and move on, because, after all, you are busy, and there are stories to be done.

We've all been there, and it doesn't really hurt anybody, right? Sadly, I'm going to share the story of how it easily gets out of hand with a real example from my current project ...

It's a pretty big project. Over the last 12 months or so, we've created a complex intertwining system of 4 apps that fit into the existing system. The core app is a bit of an impressively large beast now, but we've been following good practices, and quite a lot of the code is pretty good. There's some rubbish in there, but we've kept it under control, and we're doing a pretty good job of removing the rubbish whenever we find it. Test coverage is good - we're using TDD and sticking to it pretty well, although we're human, so occasionally we screw up. The current count is 1214 "unit" tests, 2 suites of functional tests that exercise the full app (197 and 209 tests respectively) and some performance tests, which involve deploying the whole shebang into a representative environment and thrashing it.

That all sounds pretty good, but there's one minor problem. Running the full set of functional tests takes the build server (a VM, so not exactly high spec, and one that shares its resources) nearly 10 minutes. Since the app is designed to be high performance, there's a great deal of asynchronicity going on, which means that tests sometimes fail simply because the timing is out, particularly when resources are tight.
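
To give a flavour of what that means in practice, here's a deliberately simplified sketch - not our actual code, and the names and numbers are invented - of the kind of asynchronous check that bites you. Assert too early and the test fails whenever the box is loaded; poll with a timeout and it's far more forgiving, at the cost of waiting longer when something genuinely is broken.

    import static org.junit.Assert.assertEquals;

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.junit.Test;

    // Illustrative only: a stand-in for the kind of asynchronous check our
    // functional tests make, with invented names and numbers.
    public class QueueDrainTest {

        private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        @Test
        public void workerDrainsTheQueue() throws Exception {
            for (int i = 0; i < 1000; i++) {
                queue.put("message-" + i);
            }
            worker.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            queue.take(); // pretend to process each message
                        }
                    } catch (InterruptedException done) {
                        // shut down quietly
                    }
                }
            });

            // Brittle: a fixed sleep is at the mercy of whatever else the
            // build server happens to be doing at the time.
            //   Thread.sleep(2000);
            //   assertEquals(0, queue.size());

            // Less brittle: poll for the expected state, up to a generous timeout.
            long deadline = System.currentTimeMillis() + 30000;
            while (!queue.isEmpty() && System.currentTimeMillis() < deadline) {
                Thread.sleep(100);
            }
            assertEquals("queue did not drain in time", 0, queue.size());
            worker.shutdownNow();
        }
    }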

So what happens? Well, occasionally the build fails. You check the failure reason, verify it's just a timing issue, kick the build off again and go back to what you were doing. Ad nauseam. Until you get out of the habit of checking the reason. Little by little, you come to expect the build to fail once, or twice, on each commit. Eventually, you get to the point where somebody is prepared to kick the build 10 times to get it passing - even if there's a legitimate reason for it to be failing.

If you're lucky, somebody goes on holiday. Or somebody new joins the team. They look at this build that's now failing more often than not and ask just how the hell we got here. Suddenly, because they haven't been exposed to it constantly for the last 2 months, it's not a hidden pain they just accept, it's intolerable. This is a good thing. They're going to make damn sure you sort the problem out. Do so. It may take a pair two days to sort it out; so what? It's costing you hours every single week until you resolve it. It may take the whole team a week. You still have to do it. Stop the line. You know the drill, you've heard the buzzwords. Don't suffer an unstable build. Analyse the problem and fix it, that's what you do. There is no excuse; there is always a way.

Then learn from it. Better yet, learn from me, or somebody else who's already fucked it up. We don't learn well from other people's mistakes, though, so at least use our experience to help you spot your own before it gets too bad. Don't sit there in water that's gradually getting hotter and hotter; spot the problem early, while it's still relatively easy to fix, before you've pissed away countless hours waiting for the build to go green so you can go home ...

5 comments:

  1. > That all sounds pretty good, but there's one minor problem. Running the full set of functional tests takes the build ...** nearly 10 minutes.

    I've noticed you comment about build times in the past. Do you run your builds in parallel? Surely unit tests don't ALL depend on each other's completions.

    As a matter of principle, I try to get my makefiles to be able to stand a ``make -j''. That way I ensure I understand (and have encoded) many of the dependencies within my code.

    I very rarely run an unbounded ``make -j'' except to ensure it works. But ``make -j 4'' is significantly* faster than a make.

    Don't know how many (concurrent) threads you have available to you in your build server; but parallel running will allow you to leverage disk wait times etc.

    * Where significantly is just less than four times



    ** I know I've elided: ``the build server (a vm, so not exactly high spec, and shares its resources)'', but I do so because I want to make a point of it.

  2. As a rule of thumb, unit tests on different classes are safely independent, and can happily be run in parallel. Ours, though, are quick _enough_ not to need it. If that changes, they can pretty easily be made parallel, since we use maven and junit.

    Functional tests, which is what I'm talking about here, are almost always much, much slower - they certainly are in this case. And they're testing the whole application. Since we're talking about an application that's mostly about moving messages around a system, it's very sensitive to other things happening at the same time. Or rather, it _isn't_, but the tests _are_, since invariably we're making sure a message gets to the right place, or that the right thing happens if we deliberately do something wrong.

    We could certainly run up multiple servers and parallelise the tests that way, but that's a rather monstrous task. It would have highlighted the cause of the brittleness even more, though, I suspect, since a core part of the problem eventually turned out to be that the tests were sending in messages with an expiry time that was in the past (BST/UTC implicit conversion), and a slow-running test (on a busy build server rather than a dedicated dev's laptop) was likely to cause some of the messages to expire during the test run.

    Given the extreme frequency the message expiry job was running at (so we could test _it_ without waiting an age), that caused all sorts of excitement in tests that would put 1400 messages in a queue, for example. (You'd get 400 messages into a test, then discover there were no messages left - except that's not what you saw, of course; you saw messages trickling too slowly from one queue to another to keep up with the super-fast worker, since somebody had screwed up the log level for message expiry ...)
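
    For what it's worth, the shape of that expiry bug was roughly the following - a simplified, illustrative sketch rather than the real code, but it shows how a BST/UTC mix-up lands an expiry time an hour in the past:

        import java.text.SimpleDateFormat;
        import java.util.Date;
        import java.util.TimeZone;

        // Simplified illustration of the BST/UTC trap; not the project's real code.
        public class ExpiryMixUp {

            public static void main(String[] args) throws Exception {
                long now = System.currentTimeMillis();

                // Intended: the message should expire 30 seconds from now.
                long intendedExpiry = now + 30000;

                // The trap: format "now" as UTC...
                SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                utc.setTimeZone(TimeZone.getTimeZone("UTC"));
                String wallClock = utc.format(new Date(now));

                // ...then parse the same string back as London local time. During
                // British Summer Time (UTC+1) that silently shifts the value an
                // hour into the past, so "now + 30 seconds" is already expired.
                SimpleDateFormat london = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                london.setTimeZone(TimeZone.getTimeZone("Europe/London"));
                long brokenExpiry = london.parse(wallClock).getTime() + 30000;

                System.out.println("intended expiry: " + new Date(intendedExpiry));
                System.out.println("broken expiry:   " + new Date(brokenExpiry));
                System.out.println("already expired? " + (brokenExpiry < now));
            }
        }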

  3. Results of unscientific timing test on one of my components. 8 Processor machine. As expected, all bounded parallel builds are faster than the serial build. Build is simple enough that unbounded parallel build takes not much longer than bounded parallel build.

    > make clean
    > time make

    real 0m40.44s
    user 0m34.07s
    sys 0m4.70s

    > make clean
    > time make -j

    real 0m19.27s
    user 0m37.61s
    sys 0m6.43s

    > make clean
    > time make -j 4

    real 0m17.33s
    user 0m35.05s
    sys 0m5.48s

    > make clean
    > time make -j 8

    real 0m17.40s
    user 0m35.19s
    sys 0m5.72s

    > make clean
    > time make -j 12

    real 0m17.40s
    user 0m35.29s
    sys 0m5.68s

  4. Considered putting the logging itself on a queue?

    It's lossy, but we sometimes log and debug via a limited-length queue (especially when stress testing). At the cost of possibly losing a (yet another, duplicate) log message, you get a good view of what's going on.
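
    A rough sketch of the sort of thing I mean - purely illustrative, and not tied to any particular logging framework:

        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        // Illustrative sketch of lossy, bounded-queue logging.
        // Usage: new LossyLogger().log("whatever you'd normally log")
        public class LossyLogger {

            private final BlockingQueue<String> buffer = new ArrayBlockingQueue<String>(1000);

            public LossyLogger() {
                Thread writer = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                // The actual I/O happens off the hot path.
                                System.out.println(buffer.take());
                            }
                        } catch (InterruptedException stop) {
                            // shut down quietly
                        }
                    }
                });
                writer.setDaemon(true);
                writer.start();
            }

            public void log(String message) {
                // offer() never blocks: if the buffer is full the message is
                // simply dropped, rather than slowing down the code under test.
                buffer.offer(message);
            }
        }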

    (oh, could you take captcha off my posts, please)

  5. The logging wasn't a problem; it was that some fool had decided nobody needed to see messages expiring, and logged them at debug level.

    Captcha is an all-or-nothing thing, as far as I can tell. If you can tell me how to disable it for you and not others, I'm happy to. I may try with it off anyway, we'll see.
