Archive for the ‘learning’ Category

After reading How many bugs are left? I was intrigued by the use of the Lincoln Index to estimate the number of bugs residing in a solution. But after reading the blog post I was bit baffled that the conclusion didn’t pick up on what was really reflected in the data.

In the blog post there are 2 examples concerning 2 QAs, A and B, finding 20 and 30 bugs respectively in the each case. The real difference is the overlap.

In the first example there is only 1 bug in the overlap, and the Lincoln Index is then 20*30/1 = 600 - in total 49 bugs found

In the second example there are 18 bugs in the overlap, making the Lincoln Index 20*30/18 = 33.3 - in total 32 bugs found

The probability that a QA finds a bug is then:

QA A

QA B

Total

Example 1

20/600 Â = .03

30/600 Â = .05

49/600 Â = .08

Example 2

20/33.3 = .6

30/33.3 = .9

32/33.3 = .96

While this is an example of the method it tells me something not mentioned in the blog post: The bugs in Example 2 must have been extremely obvious making it questionable whether the trials are independent.

Another thing, while it may seem like overkill to have 2 QAs in the 2nd example, it seems too little to be worth the effort in the first example, but we really should have 3 QAs in both cases.

There is nothing indicating the size of the example solutions - which is part why the example is good, and part why I was a bit skeptical at first. There is no right answer for the examples, but if the Lincoln Index are to be considered sufficient estimates on the number of bugs in the systems, then what should we do?

Starting with Example 2 we have found almost all the bugs, and hopefully the fixes will not introduce new ones. There is a good probability that the remaining bugs will be fixed when the code base is fixed - after all 33.3 bugs in a code base is not a lot (depending on the size of the code base itself naturally).

Examining Example 1 we have a different problem. We have discovered approximately 1/12th of the bugs, and we have an estimated 600 bugs in the system. It would seem that we are in dire need for some sort of assistance. Possibly rework of the system as well.

That is, if Example 1 is 200 kloc with 3 bugs/kloc, and Example 2 is 666 lines with 50 bugs/kloc, then Example 1 is 300 times the lines, but only 18 times the bugs - in which case it is a rather small amount of bugs even at 600. Example 2 though should really clean up the mess.

If it is the opposite, that is Example 1 12,000 lines at 50 bug/kloc, and Example 2 11,111 lines at 3 bug/kloc, then the number of lines are almost the same, yet the number of bugs is 18 times higher. In this case Example 1 is truly in dire needs of some help.

Alternative Analysis

These speculations are really afterthoughts on the blog’s content. My real beef was with the Lincoln Index itself - it degenerates at a 0 overlap, basically saying that if two observers examine the same area, they must find some of the same elements. That is a natural assumption if observers are stringent and actually look at the same things. Seeing some of the Escape Room issues where contestants overlook the obvious it would seem that for a software solution there would be several opportunities for QAs to overlook something the developers already overlooked.

While there are some suggestions on improving the Lincoln Index in case the overlap is less than 10, e.g. Bailey (1952) suggesting N = A*(B+1)/(C+1), which would lead Example 1 to 310 bugs instead of the 600. My idea was to turn to the German Tank Problem and estimate the number of bugs from the Bayesian credibility score.

By applying our own serial number system to the bugs (tracking ID) we aren’t really playing into the correct scenario, but bear with me. The maximum serial number we see is thus the total number of unique bugs found. We only have 2 observations - one from each QA.

Only having 2 observations mean the mean, Âµ, is infinite. We should have at least 4 observations to come up with a mean and standard deviation.

We can still try to make a credible guess. Given at least 2 observations, the credibility that the number of bugs is equal to n, is:

0

if n < m

k-1/k * C(m-1,k-1)/C(n,k-1)

if n >= m

m = number of distinct bugs found in the k observations

As k is 2 in our case, the formula simplifies into:

0

if n < m

(m-1)/(n*(n-1))

if n >= m

The credibility that we have more than n bugs is:

1

if n < m

C(m-1, k-1)/C(n, k-1)

if n >= m

Again with k = 2 this simplifies into:

1

if n < m

(m-1)/n

if n >= m

This latter formula means that if we want to be 95% confident in the number of bugs, n, then 5% risk that N > n: .05 = (m-1)/n <=> n = (m-1)/0.05 = 20*(m-1)

Running the examples under the German Tank Problem setting we get:

Example 1: A = 20, B = 30, C = 1, m = A+B-C = 49

Number of bugs at 95% confidence: 20*(49-1) = 960

pA Â Â = 20/960 = 0.02

pB Â Â = 30/960 = 0.03

total = 49/960 = 0.05

Example 2: A = 20, B = 30, C = 18, m = A+B-C = 32

Number of bugs at 95% confidence: 20*(32-1) = 620

pA Â Â = 20/620 = 0.03

pB Â Â = 30/620 = 0.05

total = 32/620 = 0.05

We see that we have a lot more bugs than our previous estimates, but the QAs probability of finding bugs are almost the same (below 5%) for both examples, and we have an estimated 5% of the total amount of bugs.

Looking at the accumulated credibility score, we can see that it grows rapidly, then slows down, perhaps an 80% confidence is sufficient. In this case .2 = (m-1)/n <=> n = 5*(m-1), this is a quarter of the 95% confidence numbers.

Example 1: A = 20, B = 30, C = 1, m = A+B-C = 49

Number of bugs at 80% confidence: 5*(49-1) = 240

pA Â Â = 20/240 = 0.08

pB Â Â = 30/240 = 0.13

total = 49/240 = 0.20

Example 2: A = 20, B = 30, C = 18, m = A+B-C = 32

Number of bugs at 80% confidence: 5*(32-1) = 155

pA Â Â = 20/155 = 0.13

pB Â Â = 30/155 = 0.19

total = 32/155 = 0.20

This is certainly better for Example 1 both with regards to the 95% confidence, but also with regards to the Lincoln Index - even the improved estimate.

Conclusion

I didn’t know about the Lincoln Index, so I learned something new today - that is always good. The original application to estimate the number of bugs in total seems good, at least better than disregarding data from the trenches.

John D. Cook suggests calibrating through experiments. This blog post has been a thought experiment on some of the deliveries presented by the data and an unrealistic application of the German Tank Problem - the odds of getting the “tanks” in sequence diminishes quickly, thus improvements can be applied to the m estimate.

Cutting the confidence level from 95% to 80% may seem drastic - and it is - as it cuts 75% off of the number of expected bugs, but for thought experiments it may be good enough.

QAs are valuable, and there is value in having several (at least 2, but 4 is better) to test a product.

While it is easy to point out when people are getting things wrong - IMHO or in your opinion - it may serve a greater purpose to examine why things may go so utterly wrong as they often do, especially when we’re speaking about software development.

Software development is mostly about communication. Whether it is communicating with a programmer to make what you want, or it is telling a project manager to get them to tell a programmer what you want - it is in any case a matter of communicating vision to understanding.

So let us try to map out the different possibilities when facing a decision - or what may seem clear to you, but isn’t for at least one of the links in the development chain.

binary tree

I have chosen a binary tree to depict the decision “right” or “wrong”. While normal interpretation of such a tree is that it is a 50/50 split, let us not make such a hasty assumption - at least we - as developers - should be better than a 50% guess at understanding customer requirements.

In the binary tree above there are only 4 decisions Â which has to be right. If we simplify the model to have a fixed probability, p, that we make the right decision, we can use Bernoulli’s binomial distribution to determine the odds of making s successes in as many trials. In this case the binomial distribution deteriorates into a simple power function, ps.

Given either p or s we can calculate the other if we want at least a 50% chance of ending with a right solution.

That is, if we have an almost unheard of quality for understanding customer communication, then at a bit more than 200,000 decisions, the solution has a 50/50 chance of hitting the anticipated solution.

If we want to be 90% sure, then we cannot make more than 30988 decisions with 6-sigma understanding.

So, let us try the other way around - we would like to know with a sufficiently high confidence that our project meets our expectations, let us say 90% sure. We have identified 10,000 key decisions. How good must the communication then be?

s log(p) = log(x) <=> p = exp(log(x)/s)

p = exp(log(.90)/10000) Â = 0.999989

Which means we need 6-sigma communication to achieve this goal.

On top of all this, then the calculations are assuming that the customer knows and communicates exactly what he or she wants, and that all decision points are uncovered and communicated at the same high level.

The only immediate sane solution to improving the odds is to reduce the scope drastically. It may sound silly, but more having to fulfill 10 things right becomes daunting for most of us. In the binary tree above, we would need to have 2 (10+1) -1= 2047 nodes - the sheer size of such a tree should be sufficient to deter anyone wanting more than 10 decisions.

Reduce scope. Improve communication by shortening the feedback loop.

Naturally, we could reduce scope right down until a single decision - but that would quickly throw us off balance, as a single point makes it impossible to determine direction.

When trying to understand a new concept the important thing to understand is not what the concept is, but why it exists. Thereby getting to the essence of the thing in itself.

This is probably why the 5 Whys is an important tool for root cause analysis and incident investigation albeit it doesn’t fit all purposes. But if it is a sequence of burrowing down to the core of an issue, then it is probably one of the better methods of examining unknown processes.

As in the story about the newlywed couple. One evening, the husband noticed than when his wife began to prepare a roast beef for dinner she cut off both ends of the meat before placing it in the roasting pan. He asked her why she did that. “I don’t know,” she said. “That’s the way my mother always did it.” The next time they went to the home of the wife’s parents, he told his mother-in-law about the roast beef and asked her why she cut off the ends of the meat. “Well, that’s the way my mother always did it” was her reply.

He decided that he had to get to the bottom of this mystery. So when he went with his wife to visit her grandparents, he talked to his grandmother-in-law. He said, “Your daughter and granddaughter both cut off the ends of the meat when they fix roast beef and they say, ‘That’s the way my mother always did it.’ How about you? Why do you cut the meat in this way?” Without hesitation the grandmother replied, “Oh, that’s because my roaster was too small and the only way I could get the meat to fit in it was to cut off the ends.” (I’ve heard it before, but the only text I could find was from The Everlasting Tradition on Google Books)

If you don’t know the root cause you may end up doing unnecessary work at best, but most likely limiting, and in worst case counterproductive and wasteful work.

Don’t ask people what they want or do, but why they want or do it. It’s just as Henry Ford said: “If I had asked people what they wanted, they would have said faster horses.” They would have asked for faster horses, because horses was something they knew about, and faster or stronger would make transportation better.

In the same vein, it is just as important to learn the reason behind, when embarking on a new project with unknown entities. In particular when starting on new software project, and especially for project managers on both sides of the table. You need to know what to deliver to be able to deliver it in the first place, you can’t tell a developer what you need, if you don’t know what it is, and you cannot accept or test the thing if you don’t know how it should behave.

If a feature has to be cut it is paramount that you can argue why that doesn’t impair the end product too much.

If a feature can be implemented in multiple ways, then the simpler should be opted for. If you don’t know the essence of the feature, you don’t know the feasible ways, and you may choose a too simple solution - these are the solutions which seems to almost work.

Going back to Ford’s quote, it is important that you know what to abstract and how to abstract it, e.g. “faster horses” to “faster means of transportation” and not “faster animals” - that would lead to trying to hitch a cheetah or a bear to a buggy.

As the character Forrest Gump is accustomed to say: “Stupid is as stupid does.” - if we don’t know better, then we do stupid things. If you know why you do things, you may have a chance not to act stupid.

When knowing why as opposed to just what, then you are closer to the Ha step of Shu Ha Ri, because you already know the mechanics, and you are armed with the path. You may not know which quantum leaps you have to make to diverge to another stable level, but at least you know whether a path is perpendicular to the current flow or perhaps an ever so slightly diverging path.

On a much more pragmatic level, it is better to know why a certain color or method is chosen, especially when the time to change it comes around. Which is why the “why” is a much better comment for source code than the “what” - which should be evident by the code itself. And if you have complete memory of the history of changes, you can check if we’re going in circles.