I work for a software product company. We have large enterprise customers who implement our product, and we provide support to them. For example, if there is a defect, we provide patches, etc. In other words, it is a fairly typical setup.

Recently, a ticket was assigned to me regarding an exception a customer found in a log file, related to concurrent database access in a clustered implementation of our product. So this customer's specific configuration may well be critical to the occurrence of this bug. All we got from the customer was their log file.

The approach I proposed to my team was to attempt to reproduce the bug in a configuration similar to the customer's and obtain a comparable log. However, they disagree with my approach, saying that I don't need to reproduce the bug, as doing so is overly time-consuming and would require simulating a server cluster on VMs. My team suggests I simply "follow the code" to see where the thread- and/or transaction-unsafe code is and make the change working from a simple local development setup, which is not a clustered environment like the one where the bug occurred.

To me, working out of an abstract blueprint (program code) rather than a tangible, visible manifestation (runtime reproduction) seems difficult, so I wanted to ask a general question:

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

Or:

If I am a senior developer, should I be able to read multithreaded code and create a mental picture of what it does in all use case scenarios rather than needing to run the application, test different use case scenarios hands-on, and step through the code line by line? Or am I a poor developer for demanding that kind of work environment?

Is debugging for sissies?

In my opinion, any fix submitted in response to an incident ticket should be tested in an environment simulating the original as closely as possible. How else can you know that it will really remedy the issue? It is like releasing a new model of a vehicle without crash-testing it with a dummy to demonstrate that the air bags indeed work.

Last but not least, if you agree with me:

How should I talk with my team to convince them that my approach is reasonable, conservative and more bulletproof?

Sometimes it makes no sense to insist on reproducing when you've got a log with a stack trace. Some concurrency bugs in Java are just like that; actually, the easiest ones are when you get a log with an NPE and a stack trace pointing to a line that "apparently" uses some object created with new. And these bugs are not guaranteed to be reliably reproducible, in accordance with the Java Memory Model specification.
– gnat, Oct 9 '13 at 20:38

Do you want the "correct" answer -- you must reproduce every bug so you know it's fixed -- or the "keep the customer paying us $$" answer -- sometimes you don't have the time and resources to do so, and your boss expects you to use your expertise to make a good effort to fix it anyway?
– Michael Edenfield, Oct 9 '13 at 20:45

Surprised that the community here is in accord with you. Frankly, I'm completely in agreement with your teammates. Sometimes, especially when concerning bugs in race conditions, it makes much more sense and is much more efficient to simply follow the code than to spend a ton of time creating a test environment that might not even expose the problem. If you can't find anything by tracing the code, then sure, see if it makes sense to expend the effort to create a test environment, but it's a bad allocation of time to begin by creating the test environment.
– Ben Lee, Oct 14 '13 at 19:02

14 Answers

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

You should give it your best effort. I know that sometimes there are conditions and environments that are so complex they can't be reproduced exactly, but you should certainly try if you can.

If you never reproduced the bug and saw it for yourself, how can you be 100% certain that you really fixed it? Maybe your proposed fix introduces some other subtle bug that won't manifest unless you actually try to reproduce the original defect.

If I am a senior developer, should I be able to read (multithreaded) code and create a mental picture of what it does in all use case scenarios rather than require to run the application, test different use case scenarios hands on, and step through the code line by line? Or am I a poor developer for demanding that kind of work environment? Is debugging for sissies?

I would not trust someone who runs the code "in their head", if that's their only approach. It's a good place to start. Reproducing the bug, and fixing it and then demonstrating that the solution prevents the bug from reoccurring - that is where it should end.

How should I talk with my team to convince them that my approach is reasonable, conservative and more bulletproof?

Because if they never reproduced the bug, they can't know for certain that it is fixed. And if the customer comes back and complains that the bug is still there, that is not a Good Thing. After all, they are paying you big $$$ (I assume) to deal with this problem.

If you fail to fix the problem properly, you've broken faith with the customer (to some degree) and if there are competitors in your market, they may not remain your customer.

"Reproducing the bug, and fixing it and then demonstrating that the solution prevents the bug from reoccurring - that is where it should end." -- my point exactly
– amphibient, Oct 9 '13 at 18:13

"Because if they never reproduced the bug, they can't know for certain that it is fixed." Amen...
– Marjan Venema, Oct 10 '13 at 6:24

I'd also like to add to this answer that since you don't have this configuration, your company should figure out whether it is even a supported configuration. If your company is going to formally support such configurations, you really ought to have a similarly configured environment just for your QA work. That certainly adds expense, and that's why the company should decide which configurations of its product to support.
– Andy, Oct 10 '13 at 12:58

There should be a cost/benefit argument here. If it takes weeks to reproduce, the value of reproduction is probably low due to not tackling other issues. If it takes seconds to reproduce, the value of reproduction is probably high, due to the certainty of the fix. The decision should attempt to balance this, a blanket "should" or "shouldn't" statement is useless.
– orip, Oct 14 '13 at 23:11

@orip: Cost/benefit analysis also needs to take the customer into account: Does the cost of ignoring the customer with the possible risk of losing the account (and possibly losing other customers because of what they hear from this original customer, or if they are also experiencing the bug but have yet to report it formally) outweigh the cost of developer time spent reproducing and fixing the bug?
– FrustratedWithFormsDesigner, Oct 16 '13 at 17:56

How do they intend to verify that the bug in question was fixed? Do they want to ship untested code to the user and let them figure it out? Any test setup that was never shown to reproduce the error can't be relied upon to show absence of the error. You certainly don't need to reproduce the entire client environment, but you do need enough to reproduce the error.

I don't think it is unreasonable to attempt to reproduce every bug before fixing it. However, if you attempt to reproduce it and can't, then it becomes more of a business decision as to whether or not blind patches are a good idea.

I agree, however if a bug is found by review, it can provide critical information needed to reproduce it. You then can reproduce it, and prove the fix is correct...
– mattnz, Oct 9 '13 at 19:20

If you're able to find a multi-threaded race condition by code inspection, you should be able to reproduce it consistently by modifying the code with additional locking statements that force the threads to start/stop in a sequence that triggers it, e.g. thread 1 starts up and pauses; thread 2 starts up and pauses; 1 begins using the shared object and pauses; 2 modifies the shared object and pauses; 1 tries to use the shared object and barfs. The biggest issue with this sort of approach is that while it's something you can demonstrate in a debugger, it's not suited for adding to an automated test suite. BTDT-GTTS.
– Dan Neely, Oct 10 '13 at 12:47
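The choreography Dan Neely describes can be sketched with latches. The `resource` field and class name here are hypothetical, but the pattern (pause thread 1 right after its check, let thread 2 clobber the shared object, then resume thread 1) reproduces the failure on every run:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical check-then-act race: one thread nulls a shared resource
// between another thread's null check and its use of the resource.
public class RaceRepro {
    static StringBuilder resource = new StringBuilder("data");
    static final CountDownLatch checked = new CountDownLatch(1);
    static final CountDownLatch cleared = new CountDownLatch(1);
    static volatile Throwable caught;

    public static void main(String[] args) throws InterruptedException {
        Thread user = new Thread(() -> {
            try {
                if (resource != null) {       // step 1: check passes
                    checked.countDown();      // let the closer thread run
                    cleared.await();          // pause at the worst moment
                    resource.append("x");     // step 3: act -> NPE every time
                }
            } catch (Throwable t) {
                caught = t;                   // capture the reproduced failure
            }
        });
        Thread closer = new Thread(() -> {
            try {
                checked.await();              // wait until the check has passed
                resource = null;              // step 2: clear the shared object
                cleared.countDown();          // resume the user thread
            } catch (InterruptedException ignored) { }
        });
        user.start();
        closer.start();
        user.join();
        closer.join();
        // prints: reproduced: NullPointerException
        System.out.println("reproduced: " + caught.getClass().getSimpleName());
    }
}
```

The latches establish happens-before, so the NPE is deterministic, which is exactly the demonstrate-it-in-a-debugger shape described above; because the latches are test scaffolding woven into the production code path, it is awkward to keep in an automated suite.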

@DanNeely: If one thread writes a value to an array and then stores a reference into a field, and another thread reads that field and accesses the corresponding array element, how would one reproduce bugs which might occur if the JIT moves the write-reference operation before the write-element one?
– supercat, Jan 29 at 22:35
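You generally cannot force the JIT to perform that reordering on demand, which is supercat's point: the bug may be unreproducible in practice yet real under the Java Memory Model. What you can do is make the reordering illegal. In this sketch (the `Holder` class is hypothetical), marking the shared reference `volatile` turns the publication into a release/acquire pair, so the array-element write is guaranteed to be visible to the reader:

```java
public class SafePublication {
    static class Holder {
        int[] data = new int[1];   // non-final on purpose, mirroring the hazard
    }

    // Without 'volatile' here, the JMM permits the reference store to be
    // reordered before the array-element write, so a reader could see 0.
    // With 'volatile', everything written before the store is visible to
    // any thread that subsequently reads the reference.
    static volatile Holder shared;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            Holder h = new Holder();
            h.data[0] = 42;        // element write first...
            shared = h;            // ...then the (volatile) reference store
        });
        Thread reader = new Thread(() -> {
            Holder h;
            while ((h = shared) == null) { /* spin until published */ }
            System.out.println(h.data[0]);   // prints 42, guaranteed by volatile
        });
        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```

This is a case where the fix is justified by reasoning from the specification rather than by a reproduction: no test can prove the racy version safe.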

Ideally, you want to be able to reproduce each bug so that, at the very least, you can test that it's been fixed.

But... that may not always be feasible or even physically possible, especially with 'enterprise'-type software where each installation is unique. There's also the cost/benefit evaluation. A couple of hours of looking over code and making a few educated guesses about a non-critical problem may cost far less than having a technical support team spend weeks trying to set up and duplicate a customer's environment exactly in hopes of being able to duplicate the problem. Back when I worked in the 'Enterprise' world, we would often just fly coders out and have them fix bugs on site, because there was no way to duplicate the customer's setup.

So, duplicate when you can, but if you can't, then harness your knowledge of the system, and try to identify the culprit in code.

I don't think you should make reproducing the error a requirement for looking at the bug. There are, as you've mentioned, several ways to debug the issue, and you should use all of them. You should count yourself lucky that they were able to give you a log file! If you or someone at your company is able to reproduce the bug, great! If not, you should still attempt to parse the logs and find the circumstances under which the error occurred. It may be possible, as your colleagues suggested, to read the code, figure out under what conditions the bug could happen, then attempt to recreate the scenario yourself.

However, don't release the actual fix untested. Any change you make should go through the standard dev, QA testing, and integration testing routine. It may prove difficult to test - you mentioned multithreaded code, which is notoriously hard to debug. This is where I agree with your approach to create a test configuration or environment. If you have found a problem in the code, you should find it much simpler to create the environment, reproduce the issue, and test the fix.

To me, this is less a debugging issue and more of a customer service issue. You've received a bug report from a customer; you have a responsibility to do due diligence to find their issue and fix it.

"However, don't release the actual fix untested." How? If he cannot reproduce the conditions that caused the bug, how will he reproduce them to test the fix? Also, I wouldn't assume the OP didn't make his best effort.
– user61852, Oct 9 '13 at 17:34

"If you have found a problem in the code, you should find it much simpler to create the environment, reproduce the issue, and test the fix." I read the OP's question to be, "Should I require all bug reports to have a repro case before attempting to diagnose the problem?" No, you shouldn't.
– Michael K, Oct 9 '13 at 17:42

I would expect most of the testing to be regression testing of existing features.
– Michael Durrant, Oct 9 '13 at 18:09

@MichaelK: Your answer seems to conflict with itself. If you don't determine what the steps are to reproduce the bug, how will you ever know what your test cases should be? You might not always need to reproduce the bugs yourself, but most of those cases will occur when the steps to reproduce are already known. If all you have is a log file with no known steps, then you have no test cases to QA with.
– Ellesedil, Oct 9 '13 at 18:38

I think what he's saying is, you don't necessarily have to reproduce the issue in order to investigate a fix for it. And assuming you track it down and find a fix, you'll then know the conditions to set up on the test server to reproduce. At that point you'd even know how to set up the previous code - set it up, verify that it's reproducible, deploy the fix, verify that it's fixed.
– GalacticCowboy, Oct 9 '13 at 19:21

In my opinion, as the decision maker, you must be able to justify your position. If the goal of the third-line support department is to fix bugs in the shortest time frame at a cost acceptable to the client, then any approach must serve that goal. Furthermore, if an approach can be shown to give the fastest expected results, there should be no problem convincing the team.

Having worked in support, I have always reasonably expected the client to be able to give some "script" of actions they performed to consistently reproduce the bug and if not consistently then candidate examples which have produced the bug.

If I were new to the system and had no background with the code, my first steps would be to attempt to identify the possible sources of the error. It may be that the logging is insufficient to identify a candidate code. Depending on the client, I might be inclined to give them a debug version in order that they might be able to give you back log files which give further clues as to the position of the offending code.

If I am able to quickly identify the code block, then visual mapping of the flow may be enough to spot the problem. If not, then unit-test-based simulation may be enough. It may be that setting up a client-replicating environment takes less time, especially if the problem is highly reproducible.

I think you may find that your approach should be a combination of the proposed solutions and that knowing when to quit one and move on to the next is key to getting the job done efficiently.

I am quite sure the team will agree that if there is a chance their approach will find the bug quicker, then giving it a suitable time frame to prove itself will not add much to the overall time it takes to fix the bug, whichever route you ultimately take.

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

I say yes, with some caveats.

I think it's okay to read through the code and try to find places that look like they may be problematic. Create a patch and send it to the client to see if that resolves the problem. If this approach continues to fail, then you may need to investigate other options. Just remember that while you might be addressing a bug, it might not be the bug that was reported.

If you can't reproduce it within reason, and you can't find any red flags in the code, then it may require some closer coordination with the customer. I've flown out to customer sites before to do on site debugging. It's not the best dev environment, but sometimes if the problem is environmental, then finding the exact cause is going to be easiest when you can reproduce it consistently.

I've been on the customer side of the table in this scenario. I was working at a US government office that used an incredibly large Oracle database cluster (several terabytes of data and processing millions of records a day).

We ran into a strange problem that was very easy for us to reproduce. We reported the bug to Oracle and went back and forth with them for weeks, sending them logs. They said they weren't able to reproduce the problem, but sent us a few patches they hoped might address it. None of them did.

They eventually flew out a couple of developers to our location to debug the issue on site. And that was when the root cause of the bug was found and a later patch correctly addressed the problem.

If you're not positive about the problem, you can't be positive about the solution. Knowing how to reproduce the problem reliably in at least one test case situation allows you to prove that you know how to cause the error, and therefore also allows you to prove on the flip side that the problem has been solved, due to the subsequent lack of error in the same test case after applying the fix.

That said, race conditions, concurrency issues and other "non-deterministic" bugs are among the hardest for a developer to pin down in this manner, because they occur infrequently, on a system with higher load and more complexity than any one developer's copy of the program, and they disappear when the task is re-run on the same system at a later time.

More often than not, what originally looks like a random bug ends up having a deterministic cause that results in the bug being deterministically reproducible once you know how. The ones that defy this, the true Heisenbugs (seemingly random bugs that disappear when attempting to test for them in a sterile, monitored environment), are 99.9% timing-related, and once you understand that, your way forward becomes more clear; scan for things that could fail if something else were to get a word in edgewise during the code's execution, and when you find such a vulnerability, attempt to exploit it in a test to see if it exhibits the behavior you're trying to reproduce.

A significant amount of in-depth code inspection is typically called for in these situations; you have to look at the code, abandoning any preconceived notions of how the code is supposed to behave, and imagine scenarios in which it could fail in the way your client has observed. For each scenario, try to develop a test that could be run efficiently within your current automated testing environment (that is, without needing a new VM stack just for this one test), that would prove or disprove that the code behaves as you expected (which, depending on what you expected, would prove or disprove that this code is a possible cause of the clients' problems). This is the scientific method for software engineers; observe, hypothesize, test, reflect, repeat.
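As a concrete, hypothetical instance of "attempt to exploit it in a test": suppose inspection turns up an unsynchronized counter increment. A cheap stress test, runnable in an ordinary test environment with no VM cluster, will usually exhibit the lost-update failure on the racy counter, while the atomic replacement passes deterministically:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterStressTest {
    static int racy = 0;                                // the suspect pattern
    static final AtomicInteger fixed = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        final int THREADS = 4, PER_THREAD = 100_000;
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) {
                    racy++;                    // unsynchronized read-modify-write
                    fixed.incrementAndGet();   // the proposed fix, atomic
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        int expected = THREADS * PER_THREAD;
        // 'racy' usually falls short of 'expected' (lost updates);
        // 'fixed' equals it on every run.
        System.out.println("racy=" + racy
                + " fixed=" + fixed.get() + " expected=" + expected);
    }
}
```

Note the asymmetry: a failing run proves the code can lose updates, but a passing run proves nothing, which is why such stress tests complement the code inspection rather than replace it.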

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

No, it very definitely isn't. That would be a stupid policy.

The problem I see with your question and your proposal is that they fail to make a distinction between

bug reports

failures

errors

A bug report is communication about a bug. It tells you somebody thinks something is wrong. It may or may not be specific about what is supposed to be wrong.

A bug report is evidence of a failure.

A failure is an incident of something going wrong. A specific malfunction, but not necessarily with any clues as to what may have caused it.

A failure is caused by an error.

An error is a cause of failures; something that can (in principle) be changed in order to prevent the failures it causes from occurring in the future.

Sometimes, when a bug is reported, the cause is immediately clear. In such a case, reproducing the bug would be nonsensical. At other times, the cause isn't clear at all: the bug report doesn't describe any particular failure, or it does but the failure is such that it doesn't provide a clue as to what might be the cause. In such cases, I feel your advice is justified - but not always: one doesn't insist on crashing a second $370 million space rocket before accepting to investigate what caused the first one to crash (a particular error in the control software).

And there are also all sorts of cases in between; for instance, if a bug report does not prove, but only suggests, that a potential problem you were already aware of might play a role, this might be enough incentive for you to take a closer look at it.

So while insisting on reproducibility is wise for the tougher cases, it is unwise to enforce it as a strict policy.

As with everything else in software development, the correct answer is a compromise.

In theory, you should never try to fix a bug if you cannot prove that it exists. Doing so may cause you to make unnecessary changes to your code that don't ultimately solve anything. And proving it means reproducing it first, then creating and applying a fix, then demonstrating that it no longer happens. Your gut here is steering you in the right direction -- if you want to be confident that you've resolved your customer's problem you need to know what caused it in the first place.

In practice, that is not always possible. Perhaps the bug only occurs on large clusters with dozens of users simultaneously accessing your code. Perhaps there is a specific combination of data operations on specific sets of data that triggers the bug, and you have no idea what that is. Perhaps your customer ran the program interactively non-stop for hundreds of hours before the bug manifested.

In any of those cases, there's a strong chance that your department is not going to have the time or money to reproduce the bug before you start work. In many cases, reading the code will point you, the developer, to the bug far more directly. Once you've diagnosed the problem, you may be able to go back and reproduce it. It's not ideal, but at the same time, part of your job as a senior developer is knowing how to read and interpret code, partly to locate exactly these kinds of buried bugs.

In my opinion, you are focusing on the wrong part of the question. What if you ultimately cannot reproduce the bug in question? Nothing is more frustrating to a customer than hearing "yeah, we know you crashed the program, but we can't reproduce it, so it's not a bug." When your customer hears this, they interpret it as "we know our software is buggy, but we can't be bothered to track down and fix the bugs, so just cross your fingers." Is it better to close a reported bug as "not reproducible", or to close it as "not reproducible, but we have made some reasonable changes to try to improve stability"?

Reading the question, I don't see any fundamental opposition between your position and your team's.

Yes, you should give your best effort to reproduce the problem occurring in the client setting. But best effort means that you should define some time box for the attempt, and there may not be enough data in the log to actually reproduce the problem.

If so, everything depends on the relationship with this customer. It can range from getting nothing more from them, to being able to send a developer on site with diagnostic tools and permission to run them on the failing system. Usually we are somewhere in between, and if the initial data is not enough, there are ways to get more.

Yes, a senior developer should be able to read the code and is likely to find the cause of the problem by following the log content. Indeed, it is often possible to write a unit test that exhibits the problem after carefully reading the code.

Succeeding in writing such a unit test is nearly as good as reproducing the breaking functional environment. Of course, this method is not a guarantee that you will find anything either. Understanding the exact sequence of events leading to failure in multi-threaded software can be really hard from reading the code alone, and the ability to debug live is likely to become critical.

In summary, I would try both approaches simultaneously and ask either for a live system exhibiting the problem (and showing that it is fixed afterward) or for a unit test that breaks on the problem (and likewise shows it passes after the fix).

Trying to just fix the code and send it into the wild does indeed look very risky. In some similar cases that occurred to me (where we failed to reproduce the defect internally), I made clear that if a fix went into the wild and failed to resolve the customer's problem, or had any other unexpected negative consequences, the person who proposed it would have to help the support team find the actual problem, including dealing with the customer if necessary.

Very good question! My opinion is that if you can't reproduce the problem, then you can't say with 100% certainty that the fix you made will:

a) actually fix the issue, and
b) not create another bug

There are times when a bug occurs, I fix it, and I don't bother to test it because I know, 100% for sure, that it works. But until our QA department says that it's working, I still consider it possible that a bug is present... or that a new bug was created by the fix.

If you can't reproduce the bug and then install the new version and confirm that it is fixed then you can't, with 100% certainty, say that the bug is gone.

I tried for a few minutes to think of an analogy to help you explain to others but nothing really came to mind. A vasectomy is a funny example but it's not the same situation :-)

Suppose, e.g., that one receives a report that a program occasionally formats some decimal numbers incorrectly when installed on a French version of Windows; a search for culture-setting code reveals a method which saves the current thread culture and sets it to InvariantCulture within a CompareExchange loop, but resets it afterward [such that if the CompareExchange fails the first time, the "saved" culture variable will get overwritten]. Reproducing the circumstances of failure would be hard, but the code is clearly wrong and could cause the indicated problem.
– supercat, Jan 28 at 17:55

In such a case, would it be necessary to reproduce the failure, or would the fact that the code in question would clearly be capable of causing failures like the indicated one be sufficient if one inspects the code for any other places where similar failure modes could occur?
– supercat, Jan 28 at 17:58

Well, that's the whole "it depends on the situation" argument. If it was a mission-critical, life-or-death system, or the customer expected that kind of testing, then yes, make a best effort at reproducing the issue and test. I have had to download code to a customer's machine so I could debug, because we could not reproduce an issue on our test servers. It was some sort of Windows security issue. Created a fix and everyone is happy. It's difficult if setting up the test environment is harder than fixing the bug. Then you can ask the customer. Most of the time they are OK with testing it themselves.
– Jaydel Gluckie, Jan 29 at 22:12

With suspected threading problems, even if one can manage to jinx things in such a way as to force things to happen at precisely the "wrong" time, is there any way to really know whether the problem you reproduced is the same one observed by the customer? If code has a defect such that things happening with a certain timing would cause a failure, and it is at least theoretically possible for such timing to occur, I would think the code should be fixed whether or not one can jinx a test environment to make the requisite timings occur. In many such situations...
– supercat, Jan 29 at 22:25

...testing and production environments are apt to have sufficient timing differences that judging whether particular bad timings can actually occur is apt to be extremely difficult and not terribly informative. What's important is to examine places which could be potentially timing-sensitive to ensure that they aren't, since tests for timing sensitivity are prone to have a lot of false negatives.
– supercat, Jan 29 at 22:30

Is it reasonable to insist on reproducing every defect and debug it before diagnosing and fixing it?

I'd not spend too much time trying to reproduce it. That looks like a synchronization problem, and those are more often found by reasoning (starting from logs like the one you have, to pinpoint the subsystem in which the issue occurs) than by finding a way to reproduce it and attacking it with a debugger. In my experience, merely reducing the optimization level of the code, or activating additional instrumentation, can add enough delay, or the missing synchronization, to prevent the bug from manifesting at all.

Yes, if you don't have a way to reproduce the bug, you won't be able to be sure that you fixed it. But if your customer can't give you a way to reproduce it, you may also end up chasing something similar with the same symptom but a different root cause.

Both activities (code review and testing) are necessary, neither are sufficient.

You could spend months constructing experiments trying to repro the bug, and never get anywhere if you hadn't looked at the code and formed a hypothesis to narrow the search space. You might blow months gazing into your navel trying to visualize a bug in the code, and might even think you've found it once, twice, three times, only to have the increasingly impatient customer say, "No, the bug is still there."

Some developers are relatively better at one activity (code review vs constructing tests) than the other. A perfect manager weighs these strengths when assigning bugs. A team approach may be even more fruitful.

Ultimately, there may not be enough information to repro the bug, and you have to let it marinate for a while, hoping another customer will find a similar problem and give you more insight into the configuration issue. If the customer who saw the bug really wants it fixed, they will work with you to collect more information. If this problem only ever arose one time, it's probably not a high-priority bug, even if the customer is important. Sometimes not working a bug is smarter than blowing man-hours flailing around looking for a really obscure defect with not enough information.