A framework for productive bug diagnosis

Nov 18, 2016

Diagnosing bugs is a difficult activity to plan. Until the cause of a bug is known, the total time needed to fix it can’t easily be estimated. You can set estimates based on past behaviour, but ultimately it’s how a bug is investigated that reduces its overall cost to the project schedule. What I’ve shared below is the framework I use on a daily basis, and which you may or may not already follow consciously. I’ve also added a few anecdotes from personal experience along the way.

Start with the facts

It may sound obvious, but any diagnosis has to start with the known. Gather as much information about the symptoms of the problem from the end user, or the actor if it’s another system, to establish the pattern of requests made by that client. Then go looking for related errors. In a lot of cases developers know from the symptoms which log files or databases to check to confirm what they suspect to be the cause. But many a time I’ve seen developers simply start looking where they are most familiar (for example, a database administrator checking the database to see what errors have been logged) and then try to relate what they find to the bug. It’s an easy way to become sidetracked, and it’s amazing how often it happens.

So, rather than jumping in to wherever you could check for an error, first determine where it makes sense to look, based on the reported symptoms. In one example I helped investigate a number of years back, the underlying problem turned out to be that the maximum number of connections on a network switch had been exceeded, so it was dropping new connections. Looking at the traffic at the application servers behind it revealed nothing, except that we had less traffic than we expected. So, if in doubt, start at the very front of your network infrastructure: check network switch logs and HTTP server access logs, and work your way through to confirm what traffic actually reached each part of the application architecture, what errors were happening, and whether they are related or not.
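As a rough illustration of what “work your way through” looks like in practice, here’s a minimal sketch (not from the original investigation) that summarises an HTTP access log: it counts requests per client and per status code, so you can see what actually arrived at that tier. The log path and the combined log format are assumptions; adapt them to your own servers.

```python
# Minimal sketch: summarise what traffic reached this tier from an access log.
# ACCESS_LOG is a hypothetical path; LINE_RE assumes the common/combined format.
import re
from collections import Counter

ACCESS_LOG = "/var/log/httpd/access_log"  # hypothetical path, adjust as needed
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def summarise(path):
    per_ip = Counter()
    per_status = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = LINE_RE.match(line)
            if not match:
                continue  # skip lines that don't look like access log entries
            per_ip[match.group("ip")] += 1
            per_status[match.group("status")] += 1
    return per_ip, per_status

if __name__ == "__main__":
    per_ip, per_status = summarise(ACCESS_LOG)
    print("Requests by status code:", dict(per_status))
    print("Top clients:", per_ip.most_common(5))
```

Comparing counts like these across tiers (switch, HTTP server, application server) quickly shows where requests are going missing.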

Spotting an individual user’s set of steps among concurrent user access logs can also be pretty difficult, as you’re looking through different logs on different servers, so it’s useful to have other diagnostics to hand too. When I helped investigate performance issues caused by users uploading large images for a previous client, the technique I shared in Making Error logging work harder to save you time really paid dividends in tying errors that occurred lower in the application to the request that was received at the top of the architecture.
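For illustration only (this is a sketch of the general idea, not necessarily the exact technique from that post), one way to do this in Python is to tag every log line with a per-request correlation ID generated at the top of the stack:

```python
# Sketch: attach a per-request correlation ID to every log line so errors deep
# in the stack can be tied back to the request that entered at the top.
# All names here are illustrative, not taken from the linked article.
import contextvars
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current request ID into every log record."""
    def filter(self, record):
        record.request_id = request_id.get()
        return True

logging.basicConfig(
    format="%(asctime)s %(request_id)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
logger.addFilter(RequestIdFilter())

def handle_request(payload):
    # Generate (or read from an incoming header) the ID at the top of the stack.
    request_id.set(uuid.uuid4().hex[:8])
    logger.info("request received: %s", payload)
    try:
        process(payload)
    except Exception:
        logger.exception("request failed")  # carries the same request ID

def process(payload):
    logger.info("processing %s", payload)
    raise ValueError("simulated failure deep in the stack")

if __name__ == "__main__":
    handle_request({"image": "large.png"})
```

With the same ID on every line, a single grep ties the error at the bottom of the stack to the request logged at the front.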

Hypothesise

Once you start to build up knowledge of the requests and operations performed across the architecture, you can hypothesise about the underlying cause. Sometimes possible hypotheses are obvious from the known facts, but the less specific and repeatable the problem, the harder this tends to be. From experience, the key is identifying scenarios in which the problem can be reliably repeated 100% of the time. This may all sound obvious, but I’ve experienced so many occasions where the pressure to establish the underlying issue is so great that the team starts grasping at possible explanations, risking taking the investigation down tangents. Slow responses may be related to a server or database bottleneck, but just because that’s been the cause before doesn’t mean it’s the issue this time. So take a step back, look at where the requests went, what else was going on at the time in that part of the architecture, and what the coded logic is, and establish some possible scenarios based on the known facts.

Prove or otherwise

If you have a suspected cause in mind, define and run a test to validate or disprove it. Again, if the known facts are very specific and the hypothesis follows quickly, this step is often done without realising it. It’s when the problem is tougher to solve that the value of this step is really understood, and it may be used several times over. Once you have a proven cause, you can progress to fixing it.
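As a hypothetical example, suppose the suspected cause is that image uploads over a certain size are rejected. The names and limit below are made up for illustration, but they show the shape of a test that either proves the hypothesis or sends you back a step:

```python
# Hypothetical sketch: turn a hypothesis into a repeatable test.
# upload_image and MAX_UPLOAD_BYTES are stand-ins, not real project code.
import unittest

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # the suspected limit under test

def upload_image(data: bytes) -> bool:
    """Stand-in for the real upload path; returns False when the payload is rejected."""
    return len(data) <= MAX_UPLOAD_BYTES

class UploadSizeHypothesisTest(unittest.TestCase):
    def test_upload_just_under_suspected_limit_succeeds(self):
        self.assertTrue(upload_image(b"x" * (MAX_UPLOAD_BYTES - 1)))

    def test_upload_over_suspected_limit_fails(self):
        # If this fails reliably in the same way as the reported bug, the
        # hypothesis is proven; if not, it's disproved and we go back a step.
        self.assertFalse(upload_image(b"x" * (MAX_UPLOAD_BYTES + 1)))

if __name__ == "__main__":
    unittest.main()
```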

The framework I’ve described above may sound obvious for most of the bugs you’ve fixed quickly. But every once in a while there’s a real tester (one that can take days to resolve) where the above really comes into its own. It’s worth being conscious of the approach you take for every bug you investigate, so that you can tweak it for greater productivity next time. We’d love to hear your thoughts on this topic. Contact Us and tell us about your experiences.


To receive more free, regular productivity-related tips, subscribe now!