Friday, June 08, 2007

Latent bug syndrome

Today I have fallen foul of what I call the "Latent Bug Syndrome". This will hopefully help people understand just how complex software engineering is, and why bugs occur in software despite the best efforts of all involved. Let me explain.

Let's say you have a system like the following.

[Diagram: System A]

where System A represents a fairly complex, fully featured, fully tested product that exposes certain functionality to its end users. System A can be operating perfectly (or at least for the overwhelming majority of scenarios covered by the exposed feature set), yet there may still be a "Latent Bug" hidden deep within its underlying logic. The end user never sees it, because the exposed feature set never allows the system to get into the state that would trigger the bug. As a really contrived example, let's take the classic divide by zero problem:

 

public static double DoCalc(double val)
{
    // No check on val: the caller is trusted never to pass 0.
    return 42 / val;
}

This function behaves as expected for every input except val = 0. With double operands the division does not actually throw; it quietly returns infinity, which then corrupts whatever consumes the result (the integer equivalent would throw a divide by zero exception). If the exposed feature set only allows the user to select numbers from a drop down box that contains the numbers 1, 2, and 3, then every possible test that a tester can do will NEVER produce an error. Keep in mind that this is possibly the simplest example one can find; in real life the scenarios are far more convoluted. The wrong conclusion to draw from this is that System A has no bugs.
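To make the point concrete, here is a minimal sketch in Java, assuming a hypothetical SystemA class that holds a Java rendering of the function above (the class names and the demo harness are mine, not part of any real system). Every value the drop down allows gives a sensible answer, while the one value the UI can never produce does not. Note that in Java, dividing a double by zero yields Infinity rather than throwing, which arguably makes the bug even more latent.

```java
// Hypothetical stand-in for System A; the names are illustrative only.
class SystemA {
    static double doCalc(double val) {
        // The int literal 42 is promoted to double, so this is floating-point division.
        return 42 / val;
    }
}

class LatentBugDemo {
    public static void main(String[] args) {
        // The only values the drop down box allows: every one of these "tests" passes.
        for (double val : new double[] {1, 2, 3}) {
            double result = SystemA.doCalc(val);
            System.out.println(val + " -> " + result + " (finite: " + Double.isFinite(result) + ")");
        }

        // The state the exposed feature set never reaches.
        double hidden = SystemA.doCalc(0);
        System.out.println("0 -> " + hidden + " (finite: " + Double.isFinite(hidden) + ")");
    }
}
```

A tester limited to the drop down would only ever see the first three lines of output.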

Now let's suppose that System A is augmented with System B as shown below.

[Diagram: System A augmented with System B]

Now System B has the potential to produce conditions in System A that not only were never tested, but were never even thought of by the original developers and testers of System A. In my contrived example, a user input into System B may very well cause 0 to be passed as the val parameter. The thing is that the developers (and testers) of System B are usually not the same people as the developers (and testers) of System A. They often don't have access to the source code, and even if they do, they don't have time to follow every code path through the interactions with System A, accounting for every possible state that System A could be in, to find such a bug.
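As a rough sketch of how this could happen, suppose System B adds a feature where the user types in two readings and their difference is handed to System A. The feature, the class names LegacyCalc and SystemB, and the method ratioForReadings are all invented for illustration. Nothing in System B looks wrong in isolation, yet two equal readings steer System A into the state its own UI could never reach.

```java
// Hypothetical stand-in for System A, unchanged and still "fully tested".
class LegacyCalc {
    static double doCalc(double val) {
        return 42 / val;
    }
}

// Hypothetical System B: a new entry point with free-form user input.
class SystemB {
    static double ratioForReadings(double first, double second) {
        // System B's developers assume System A copes with any number.
        return LegacyCalc.doCalc(first - second);
    }

    public static void main(String[] args) {
        System.out.println(ratioForReadings(5, 3)); // difference is 2, result is fine
        System.out.println(ratioForReadings(4, 4)); // difference is 0, result is Infinity
    }
}
```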

If the testers of System A are doing a good job, they may actually pick up the bug, and then of course the blame game starts. Who is responsible for this bug, who's going to fix it, why wasn't this bug picked up before, etc.

So this happened to me today. Fortunately the bug never made it to a production system (and no it wasn't the divide by zero bug suggested above, it was infinitely more intricate), but it did make me aware that I had made some assumptions about System A that turned out to be less than 100% accurate.

 

So how can one guard against the Latent Bug Syndrome?

 

System A

  • Thorough unit tests of all public (and sometimes even private) methods exposed by System A will catch a vast number of these bugs.
  • Better documentation of publicly exposed methods helps developers integrating with your system understand what assumptions you are making about the input to a method, and what state you are expecting the system to be in.
  • Test Driven Development could actually go a long way towards eradicating this altogether, because in TDD you only write code to satisfy the tests, nothing less, nothing more, so you don't get the situation where there is code that under ALL conditions tested is never executed. However, I have yet to see a company embrace TDD to that extent, as the sheer time involved is usually far too much to justify commercially.
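The first two points can be sketched together in Java. This is only one possible shape, assuming a hypothetical GuardedCalc class: the assumption about the input is written into the method's documentation and enforced in code, and a few boundary checks exercise values the UI would never send. Plain checks stand in for a real unit testing framework here to keep the example self-contained.

```java
// Hypothetical hardened version of System A's method; names are illustrative.
class GuardedCalc {
    /**
     * Divides 42 by val.
     * Assumption made explicit: val must be finite and non-zero.
     *
     * @throws IllegalArgumentException if val is zero, NaN or infinite
     */
    static double doCalc(double val) {
        if (val == 0 || !Double.isFinite(val)) {
            throw new IllegalArgumentException("val must be finite and non-zero, got " + val);
        }
        return 42 / val;
    }
}

class GuardedCalcChecks {
    // Returns true if doCalc refuses the given input.
    static boolean rejects(double val) {
        try {
            GuardedCalc.doCalc(val);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Values the drop down allows.
        System.out.println("accepts 1: " + !rejects(1));
        System.out.println("accepts 3: " + !rejects(3));
        // Boundary values that no UI-driven test would ever reach.
        System.out.println("rejects 0: " + rejects(0));
        System.out.println("rejects NaN: " + rejects(Double.NaN));
    }
}
```

Failing loudly at the boundary means an integrator finds out about a broken assumption at the call site, rather than several layers later when an infinite value surfaces somewhere unrelated.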

System B

  • Developers and testers of System B need to have a good understanding of how System A works, what it was intended to do, and how they are changing the way in which System A is being called. Understanding this may help you know where problems are likely to occur.
  • Know what your assumptions about System A are... and constantly question them.
  • Again, unit tests are your friend.
  • Wrapping your interaction with System A is also a good way of being able to respond to changes in System A if they occur.
  • Thorough end user testing (there is NO substitute).
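The wrapping idea might look something like the following Java sketch. ThirdPartyCalc stands in for System A and SystemAFacade is a hypothetical wrapper owned by System B; both names are assumptions for illustration. The wrapper is the single place where System B's assumptions about System A are written down, checked on the way in, and questioned again on the way out.

```java
import java.util.OptionalDouble;

// Hypothetical stand-in for System A, which System B cannot change.
class ThirdPartyCalc {
    static double doCalc(double val) {
        return 42 / val;
    }
}

// Hypothetical wrapper: the only code in System B that calls System A directly.
class SystemAFacade {
    static OptionalDouble safeCalc(double val) {
        // Assumption about System A's input, checked before the call.
        if (val == 0 || !Double.isFinite(val)) {
            return OptionalDouble.empty();
        }
        double result = ThirdPartyCalc.doCalc(val);
        // Question the output as well, in case System A behaves differently than assumed.
        return Double.isFinite(result) ? OptionalDouble.of(result) : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        System.out.println(safeCalc(2)); // holds a value
        System.out.println(safeCalc(0)); // empty: the caller must decide what to do
    }
}
```

If System A later changes its behaviour, only the wrapper needs to be revisited, and the unit tests around it are the natural place to record each assumption.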
