Vincent, your arguments are decent, but applying them to the population in general is not useful.

Look at it this way: most of us think "common sense" is a thought shared by the population in general. In statistical terms, this would mean your thoughts are very close to the mean - and in the worst case, no further out than 1 std dev.

A more rigorous approach would be to assume one's thoughts are 5 or 6 std devs from the mean and then look for data that indicates the thought is indeed "common sense".

Your arguments have a built-in assumption that the developers really, really care about doing good work - whatever that means. If mgm't doesn't define what really good work is, then folks are left to their own devices. Consequently, everyone does good work, regardless of the "real" quality.

Therefore, it's incumbent upon mgm't to define good work. Right now, all they know (and all we are willing to help them with) is: schedule, bug count, cost & features. After that, it doesn't matter - we all do good work! My self esteem is growing by the minute!

Is it possible for Cadillac to build a quality car w/ inferior parts? Does each component manager ignore the internals of their parts and rely on the good graces of the engineers to "do the right thing"? Would you like medical software or your IRA to be built in this manner?

First of all let me thank you for keeping this conversation/debate open and useful by demonstrating good faith: actually experimenting with one of the metrics in question before expressing an opinion. What a concept :-).

I will try to follow your good example and REALLY try to understand your position better. To keep things simple, I am going to focus on a specific metric (I will use CRAP since we recently discussed it and we've both been using it on our code), but the concepts should apply to most other software metrics.
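For reference, here is a minimal sketch of how the score is computed, assuming the formula published for crap4j (which combines a method's cyclomatic complexity with its test coverage percentage):

```python
def crap_score(complexity: int, coverage: float) -> float:
    """CRAP(m) = comp(m)^2 * (1 - cov(m)/100)^3 + comp(m),
    assuming the published crap4j formula; coverage is 0-100."""
    return complexity ** 2 * (1 - coverage / 100) ** 3 + complexity

# A complex, untested method scores high (crappy)...
print(crap_score(10, 0))    # 110.0
# ...while full coverage reduces the score to the complexity alone.
print(crap_score(10, 100))  # 10.0
```

The single number clearly discards a lot (which methods, why they are complex), which is exactly the information-loss point at issue.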

Your argument - and please correct me if I paraphrase it incorrectly - is that metrics such as CRAP oversimplify to such an extent that the metric number by itself carries no actionable information at the 'manager level'. Since it takes a developer to go and look at the details before deciding if any justifiable action should be taken, the metric should not be reported above the developer level.

If that's the case, I believe we are more in agreement than you think. I agree that a metric such as CRAP is much more useful to, and actionable by, developers than managers. CRAP is actually meant to be used primarily by developers (hence the IDE plug-in). I also agree that, in mapping the code into a single digit, there is an unavoidable loss of information. But I don't agree that such a number does not provide any useful information to a manager. As a matter of fact, I don't think you believe something that black-and-white (i.e. that it's completely useless) either, because you say:

> ...but the fact is that the original metric
> is just a flag...

I think I see an opportunity for reaching some sort of agreement because a 'flag' is not completely useless. I need to do a lot more thinking about it, but I am willing to explore the possibility that metric may be too strong a word for most things we call metrics. Perhaps they are just flags; warning signs, binary indicators that something is potentially wrong and that someone should go take a closer look.

Maybe these metrics will end up being used more like medical tests. For example, if someone has high cholesterol, it doesn't mean that they are "bad" and will die of a heart attack, but it does mean to watch out for other things. Likewise, if your code has a bad "cholesterol metric", maybe it still works fine: it sells great, has an acceptable number of bugs, and customers are happy.

But when marketing wants to add a "sit around on your butt all day" feature and integrate it into the new Web 3.0 "smoke and eat donuts" XML standard, it will be more dangerous to port that code, and the manager should expect it to take more time, require more testers, etc...

Another, more relevant example: I frankly never worry about cyclic coupling, because, in my experience, it's best to bundle everything into a very small number of big .jar files anyway (see Note a below). So, other than a few basic divisions, there's really no "value" in working very hard to layerize the code. So, I'm guessing that my code has high cyclic coupling. If management wants or needs to split it up into many smaller jars, that's a red flag.

(Note a)

One project at our company had ~150 distributable jars. I guess the idea was to have simple updates, versions, etc. Also, not all were required for all applications.

But it's literally impossible to manage this. For example, if there are 3 versions of each, plus a "null version" (not having that jar at all), you have 4^150 possibilities - which I think is more than the estimated number of particles in the universe.
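The arithmetic holds up; a quick check (assuming the commonly quoted estimate of roughly 10^80 particles in the observable universe):

```python
# 150 jars, each in one of 3 versions or absent entirely: 4 choices each.
configurations = 4 ** 150
particles = 10 ** 80  # commonly quoted estimate for the observable universe

print(configurations > particles)  # True
print(len(str(configurations)))    # 91, i.e. configurations ~ 2 x 10^90
```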

> Your argument - and please correct me if I paraphrase it
> incorrectly - is that metrics such as CRAP oversimplify to
> such an extent that the metric number by itself carries
> no actionable information at the 'manager level'. Since it
> takes a developer to go and look at the details before
> deciding if any justifiable action should be taken, the
> metric should not be reported above the developer level.

Yes indeed. That's a very good summation of my argument.

> If that's the case, I believe we are more in agreement
> than you think. I agree that a metric such as CRAP is
> much more useful to, and actionable by, developers than
> managers. CRAP is actually meant to be used primarily by
> developers (hence the IDE plug-in). I also agree that, in
> mapping the code into a single digit, there is an
> unavoidable loss of information.

You're right. We broadly agree about the tool, what it does and how it can be used by a developer.

> But I don't agree that such a number does not provide any
> useful information to a manager. As a matter of fact, I
> don't think you believe something that black-and-white
> (i.e. that it's completely useless) either, because you say:
>
> > ...but the fact is that the original metric
> > is just a flag...

Yes, here is the nub of our disagreement. I've just delivered a small application as part of a two-man development team, reporting to a team leader and thence - as required - to various managers (the development/delivery manager, the manager of the maintenance team that will be looking after the code into the future and, not least, the manager of the people that will be using the code (oh, and the intranet manager that will be hosting the code)).

In all these cases, none of these managers has any interest in the source code or any particular property of that code. They do, however, want to know that the code works, is compatible with existing systems, is written in a company-approved language/framework that the maintenance team is familiar with, and is in the company CVS repository.

To be assured on these fronts, all the managers delegate getting assurance on all these questions to their teams, and those team members come (at various stages of the project) and eyeball what we're delivering (UI, system interfaces or code, as appropriate). Those people may also (potentially) look at metrics, but - to date - none has. They will then report back to their managers on whether or not our code is fit for purpose, but it's unlikely that the report would carry specific metrics on the source code. (Well, that's not entirely true. They do seem to have an unnatural preoccupation with code 'volume' here, e.g. "Application X takes up 3.56Mb of disk space on the server", etc.)

> I think I see an opportunity for reaching some sort of
> agreement because a 'flag' is not completely useless.

OK, agreed that it's "not completely useless" for managers (provided you agree that that's a synonym of "not very useful"). :)

> I need to do a lot more thinking about it, but I am
> willing to explore the possibility that metric may be too
> strong a word for most things we call metrics. Perhaps
> they are just flags; warning signs, binary indicators
> that something is potentially wrong and that someone
> should go take a closer look.
>
> Are we moving closer Vincent?

In all likelihood. I think the problem is not so much in the mechanics of generating a metric (where you appear to be well grounded) but in determining just what exactly is the best way to "summarise" a block of code, what it is that needs to be said about the code (and to whom), and how much or how little information needs to be provided.

> I am not arguing that all software should be held to the
> same standard wrt any given metric. For example, I expect
> medical applications to be more thoroughly tested than,
> say, a video game.

Actually, in practice, it would surprise me if this were generally true. At least console video games get very thorough testing for their domain; the general rule for allowing a release used to be 100 hours of testing without finding any flaw, *after* all flaws found in previous testing had been fixed. I suspect that medical software that isn't used in life-and-death situations sees significantly less quality assurance.

This illustrates a point: It is very hard to accurately define what domains it would be reasonable to use what metrics over. "Medical software" isn't a single domain - there's a large difference between a reporting app in psychiatry, where a crash is just an annoyance wasting a bit of a doctor's time, and the embedded software running on a pacemaker, where a crash can kill the user.

It is even different in different parts of the same app. In the reporting app, recording data may be critical - it's non-fixable - so that part of the code may need to be simple, well tested, well documented, etc. On the other hand, a graphical version of a textual report generated by the tool may be called up once a fortnight and save a user 10 minutes. Here, the code is non-critical - it can be complex and not have automatic tests. This may be OK, in the same way that the lack of automatic tests for video games is OK, because the code ends up with very little maintenance done, and manual inspection of the output being OK and manual testing finding that the code runs without crashes/memory leaks can be enough.

So, I don't think we need standard metrics. What I think we would probably benefit from is good metric tools and practitioners with knowledge of metrics, so the developers themselves can use metrics to find out what to clean up - possibly most when picking up a new codebase.

I would like to question the idea of comparing two completely different applications. There should be some reason to do that, and it changes from case to case. And this reason should dictate the choice of metrics. So, no, we can't have a single universal metric, though we can use a combination of several metrics to emphasize or support our conclusions.

On the other hand, comparing people is much more compelling for managers :) So, for me the real question sounds like "is there a way to compare two developers working on different projects?"

I am totally convinced that it is possible, but again, it shouldn't narrow down to some universal metric. Rather, everybody chooses what the most important indicators are for him or her.

> > I am not arguing that all software should be held to the
> > same standard wrt any given metric. For example, I expect
> > medical applications to be more thoroughly tested than,
> > say, a video game.
>
> Actually, in practice, it would surprise me if this were
> generally true. At least console video games get very
> thorough testing for their domain; the general rule for
> allowing a release used to be 100 hours of testing without
> finding any flaw, *after* all flaws found in previous
> testing had been fixed. I suspect that medical software
> that isn't used in life-and-death situations sees
> significantly less quality assurance.

Of course software for game consoles has several things that make it easier to test well (and more vital as well):

1) It's a closed environment; there's no myriad of different hardware/software combinations it runs on top of (and concurrently with).

2) It's distributed on a medium that makes updating it to fix bugs impossible (or nearly so). ROM cartridges and CDs/DVDs need to be physically replaced for every customer; you can't just send customers a patch file as a download (modern consoles with embedded hard disks make this somewhat possible, but they're a very recent development).

Medical software (at least most of it), by contrast, runs on PCs, each of which has potentially different hardware and other software (including operating system patch levels) installed. At the same time, sending the users a patch via email or a download link is easy and cheap.

It's therefore (despite the seemingly more important problem domain in medicine) more important economically for the manufacturer to get game console software correct out of the box than medical software (as long as no patient dies, of course, in which case the liability claims can run into astronomical figures).

> Medical software (at least most of it), by contrast, runs
> on PCs, each of which has potentially different hardware
> and other software (including operating system patch
> levels) installed. At the same time, sending the users a
> patch via email or a download link is easy and cheap.
>
> It's therefore (despite the seemingly more important
> problem domain in medicine) more important economically
> for the manufacturer to get game console software correct
> out of the box than medical software (as long as no
> patient dies, of course, in which case the liability
> claims can run into astronomical figures).

What you say is correct but I'd like to add a little to it. Medical software is broad. Focusing exclusively on the laboratory system, the components were: embedded software in the lab equipment that took the actual readings, a "collector" package that ran on a UNIX server, an HL7 message translator running on a UNIX server, a transaction server that took translated HL7 messages, transformed them per physician's rules, and loaded them into the patient chart database, and the client piece running on PCs. Everything except the embedded software had the luxury of patching.

As far as testing goes, something as simple as changing the measurement unit in the patient chart system for a particular lab test, per physician request, involved a full regression test of the transaction server and client pieces. Every possible transaction was retested along with all known error conditions and the server's ability to handle them. Physicians' training can prevent data errors from being catastrophic, but physicians can't catch everything - especially not when in hour 25 of a 32 hour shift. In a hospital, information in medical systems CANNOT be wrong. The motivation for developers isn't financial (liability), it's watching life and death play out.

I think the biggest problem with software metrics is that we don't have any.

Consider "coverage" for example. What does "coverage" actually measure? We know how to compute coverage (for simplicity, let's count the percentage of statements tested), but that's just a count. What's the meaning behind this count?
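To make the "just a count" point concrete, here is a toy sketch (the names are mine, not from any particular coverage tool) of what statement coverage actually computes:

```python
def statement_coverage(executed_lines: set, all_statements: set) -> float:
    """Percentage of statements hit at least once during the test run."""
    return 100 * len(executed_lines & all_statements) / len(all_statements)

# Ten statements in the unit under test; a test run executed seven of them.
all_statements = set(range(1, 11))
executed_lines = {1, 2, 3, 4, 5, 6, 7}
print(statement_coverage(executed_lines, all_statements))  # 70.0

# The count says nothing about whether those seven statements were
# checked by meaningful assertions -- that is the open question.
```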

In most fields, measurement starts from an attribute (aka a construct), something we want to measure. For example, we might want to measure productivity, quality, intelligence, aptitude, maintainability, scalability, thoroughness of testing, reliability--these are attributes.

Given an attribute, we use a measuring instrument of some sort to map the "value" of the attribute to a number. The instrument is easy to identify in some cases--think of rulers and voltmeters. Some instruments are more complex--for example, intelligence tests. Some instruments require multiple readings under different circumstances--for example, we might try to measure how loud a sound is, to you, by having you compare it to dozens of other sounds, indicating for each comparison which sound was louder. (If you wear glasses, you've gone through this type of measurement of subjective visual clarity.)

The reading from the instrument is the value that software engineers call "the metric." (There are varying uses of the word "metric"--see wikipedia http://en.wikipedia.org/wiki/Metric)

In most fields that use measurement, the fundamental question is whether the instrument you are using actually measures the construct (or attribute) that you think you are measuring. That concern is called "construct validity."

If you search the ACM's Guide to the Computing Literature (which indexes ACM, IEEE and many other computing publications), there are only 490 papers / books / etc. that include the phrase "construct validity" out of 1,095,884 references searched. There are 48,721 references that refer to "metrics" (only 490 of them mention the "construct validity" of these "metrics"). I read most of the available ACM-Guide-listed papers that mentioned "construct validity" a few years ago (Cem Kaner & Walter P. Bond, "Software engineering metrics: What do they measure and how do we know?" 10th International Software Metrics Symposium (Metrics 2004), Chicago, IL, September 14-16, 2004, http://www.kaner.com/pdfs/metrics2004.pdf) -- of those, most were discussions of social science issues (business, economics, psychology) rather than what we would normally think of as software metrics.

The problem with taking "measurements" when you don't have a clear idea of what attribute you are trying to measure is that you are likely to come up with very precise measurements of something other than the attribute you have sort-of in mind. Consider an example. Suppose you wanted to measure aptitude for algebra. We sample the population and discover a strong correlation between height and the ability to answer algebra questions in a written test. People who measure between 5" and 30" tall (who are, coincidentally, very young and they don't yet know how to read) are particularly algebra-challenged. What are we really measuring?
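The height/algebra example is easy to reproduce with made-up data (every number below is invented purely for illustration): age drives both height and algebra score, so height "predicts" algebra while measuring no aptitude at all.

```python
import random

random.seed(0)

# Simulated population: age (in years) drives both variables.
ages = [random.uniform(1, 40) for _ in range(1000)]
# Height in inches, growing until ~18, then flat (plus noise).
heights = [10 + 2.2 * min(a, 18) + random.gauss(0, 3) for a in ages]
# Algebra score: zero before ~8, then rising to a cap (plus noise).
scores = [0 if a < 8 else min(100, 12 * (a - 7)) + random.gauss(0, 5)
          for a in ages]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A strong positive correlation -- yet height is measuring age,
# not aptitude for algebra.
print(round(pearson(heights, scores), 2))
```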

When people tell me that you can measure the complexity of a program by counting how many IF statements it has (McCabe's metric), I wonder whether they have a clue about the meaning of complexity.
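For what it's worth, McCabe's number for a single routine is usually computed as decision points plus one; a rough sketch using Python's ast module (counting only the common branching constructs, which is a simplification of the full E - N + 2P graph definition):

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Decision points + 1, counting common branching constructs only.
    A simplification of McCabe's graph-based definition."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, (ast.If, ast.For, ast.While,
                                      ast.ExceptHandler, ast.BoolOp))
                    for node in ast.walk(tree))
    return decisions + 1

code = '''
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"
'''
# Two decision points, so the metric is 3 -- yet the number says nothing
# about naming, coupling, or whether the logic matches its intent.
print(cyclomatic_complexity(code))  # 3
```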

When people tell me you can measure how thoroughly a program has been tested by computing the percentage of statements tested, I wonder if they have a clue about testing. See "Software negligence and testing coverage." (Keynote address) Software Testing, Analysis & Review Conference, Orlando, FL, p. 313, May 16, 1996. http://www.kaner.com/pdfs/negligence_and_testing_coverage.pdf

There is a lot of propaganda about measurement, starting with the fairy tale that "you can't manage what you don't measure." (Of course we can. We do it all the time.)

Much of this propaganda is moralistic in tone or derisive. People get bullied by this and they conform by using measurement systems that they don't understand. (In many cases, that perhaps no one understands.) The result is predictable. You can blame individual managers. I blame the theorists and consultants who push unvalidated "metrics" on the field. People trust us. When we put defective tools in the hands of executives and managers, it's like putting a loaded gun in the hands of a three-year old and later saying, "guns don't kill people, people kill people." By all means, blame the victim.

Capers Jones wrote in one of his books (maybe many of his books) that 95% of software companies don't use software metrics. Most of the times I hear this quoted, the writer or speaker goes on to deride the laziness and immaturity of our field. Mature, sensible people would be in the 5%, not the great unwashed 95% that won't keep their records.

My experience is a little different. I'm a professor now, but I did a lot of consulting in Sili Valley. I went to company after company that didn't have software measurement systems. But when I talked to their managers / executives, they told me that they had tried a software measurement system, in this company or a previous one. Many of these folks had been involved in MANY software measurement systems. But they had abandoned them. Not because they were too hard, too time consuming, too difficult -- but because, time after time, they did more harm than good. It's one thing to pay a lot of money for something that gives valuable information. It's another thing to pay for golden bullets if all you're going to use them for is shooting holes in your own foot.

It takes years of work to develop valid measurement systems. We are impatient. In our impatience, we too often fund people (some of them charlatans) who push unvalidated tools instead of investing in longer term research that might provide much more useful answers in the future.

> It takes years of work to develop valid measurement
> systems. We are impatient. In our impatience, we too often
> fund people (some of them charlatans) who push unvalidated
> tools instead of investing in longer term research that
> might provide much more useful answers in the future.

Well said, I think. Do you have any suggestions for less-suckful metrics in the software world?

There is a theme that I am reading in many posts here that seems something like "metrics are imperfect and flawed therefore they are useless." I suspect that this idea says much more about the speaker than metrics per se. Many people appear to be drawn to software development because of comfort with black and white, clear specific realities. The irony is that software development embraces a number of chaotic non-predictable processes and is much less black and white than people think.

Yet - how do you compare systems without metrics? A recent conversation:

ME: So, can you give me an estimate of how large system X is?
DEV: It's large.
ME: How big is large?
DEV: Um, I don't know exactly.
ME: Order of magnitude? Are we talking about thousands, tens of thousands, even hundreds of thousands of classes?

I don't understand how someone can work on a system without knowing its size, or whether the size is increasing or decreasing.
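Of course, even "size" is several different counts; a toy sketch over an invented stand-in codebase (file contents are made up for illustration):

```python
import re

# Stand-in "codebase": file name -> source text, invented for illustration.
codebase = {
    "model.py": "class User:\n    pass\n\nclass Order:\n    pass\n",
    "util.py":  "def helper():\n    return 42\n",
}

def size_report(files: dict) -> dict:
    """Three different answers to 'how large is the system?'"""
    all_source = "".join(files.values())
    return {
        "files":   len(files),
        "lines":   sum(src.count("\n") for src in files.values()),
        "classes": len(re.findall(r"^class\s", all_source, re.MULTILINE)),
    }

print(size_report(codebase))  # {'files': 2, 'lines': 7, 'classes': 2}
```

Even here the counts give different pictures of "large", which is why the question needs a unit before it means anything.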

> There is a theme that I am reading in many posts here that
> seems something like "metrics are imperfect and flawed
> therefore they are useless." I suspect that this idea says
> much more about the speaker than metrics per se. Many
> people appear to be drawn to software development because
> of comfort with black and white, clear specific realities.
> The irony is that software development embraces a number
> of chaotic non-predictable processes and is much less
> black and white than people think.
>
> Yet - how do you compare systems without metrics? A recent
> conversation:
>
> ME: So, can you give me an estimate of how large system X is?
> DEV: It's large.
> ME: How big is large?
> DEV: Um, I don't know exactly.
> ME: Order of magnitude? Are we talking about thousands,
> tens of thousands, even hundreds of thousands of classes?
>
> I don't understand how someone can work on a system
> without knowing its size, or whether the size is
> increasing or decreasing.

In the interest of being a pedantic shmuck, if my system has no classes, is it then small and never growing?

Your token DEV's response should have been to ask "large how?". Lines of code? Cyclomatic complexity? Class count? Do you want just the classes we wrote or all classes that might be called by the program because it runs on the .NET framework? If the system consists of components in a variety of languages, how do you count the parts that may not have any classes?

What does 'large' in and of itself tell you anyway? I currently work on some systems that are large by any measure, but I don't worry too much about the actual size because the pieces are well thought out and pretty well put together, and updates and changes are pretty easy. I've worked on small systems (again, by just about any measure) that have made me want to cry because they were fragile and horrible to update.

I think most people's issue with metrics is that they attempt to take something that is, as you say, chaotic, and reduce it to something very black and white. I don't get how you can take something that has been worked on for months or years, run it through some sort of crank, get a number, and have that number carry any real meaning. I think most people imagine the following exchange when it comes to the use of metrics:

Manager: What is the WizzleWub count of the DungBomb project?

Dev: 17

Manager: We were hoping for at least 19. I need to see you in my office...

I can only speak for myself, but my issue with metrics isn't whether they are flawed, but when they are used by people that don't really know what they represent to make some absolute evaluation of a system. That leads to trouble in most cases. And then instead of making the system better (by fixing problems, adding features, etc.) you are more worried about bringing the WizzleWub count up.

The ultimate measure of any software system is: does it do what it was intended to do? Unfortunately, there isn't any single number or metric that will tell you that.

> Manager: What is the WizzleWub count of the DungBomb
> project?
>
> Dev: 17
>
> Manager: We were hoping for at least 19. I need to see you
> in my office...

hm, seems the right way to check up on something is basically to have, in effect, a whole bunch of metric values. I mean, if you dig into the code and then make judgments based on experience, you are basically deciding on metric values in your head.

so if the "dashboard" which previously only showed the single WizzleWub value instead also showed umpteen other values, and then even synthesized some sum-up values out of those, would that seem any more reasonable to those who distrust metrics so much?

(metrics seem great to me, yet i also completely agree that any tool can be abused. so if my boss is a jerk who uses nothing but WizzleWub... sucks to be me, for sure.)

> so if the "dashboard" which previously only showed the
> single WizzleWub value instead also showed umpteen other
> values, and then even synthesized some sum-up values out
> of those, would that seem any more reasonable to those who
> distrust metrics so much?

That seems very reasonable. It also seems like a lot of work. And it requires experience and first-hand knowledge in the trenches to do that well. Two criteria that make it likely not to be adopted by any big organization. Better to have lots of simple, barely meaningful measures that are easy to document and generate so as to get your desired CMMI level certification.

And it would require a manager to let go a little bit. If you already have such a manager that can do this, then this isn't a problem. If you have a manager that you already have issues with, this is yet one more weapon they can use to bludgeon you with their ignorance and stupidity. Up until the last job I had which I left in September, I have been blessed with good managers throughout my career. Some of them required metrics but they were nothing more than a tool. Sometimes they were misapplied but we were able to take a step back and say "ok, that's interesting information, but it doesn't make much sense or isn't telling us anything useful" and we would change it.

The only metric I really care about is open bug count. If it is going down, that's good. If it is going up, that's bad. I don't mind my manager holding that against me as long as the source of the defects is kept in mind. There is nothing so frustrating come review time as being penalized for inheriting an old, buggy system. Nothing like having your bug count triple for reasons way beyond your control and then getting hurt for it. Getting punished for other people's sins is no fun. And I've had that happen a couple of times. It stinks.

I think coming from first principles is one good way to solve problems (it worked for Einstein). So, having a clear idea of what you are trying to measure is important.

As to the ACM search, I am curious how the synonym searches for construct validity worked out. It could very well be that people are describing the same concept with different terms; it happens all the time. How did the other 48,000 papers check out? I am certain you did good research, and that you just abbreviated this description to make your point. It would be interesting to hear more about how you measured the presence or absence of 'construct validity' in the actual approaches taken in all these papers.

Another interesting thing you said: on coverage. You ask what it means. If you wanted a clearer answer of what it measures, I recommend an interesting survey paper by Hong Zhu, Software Test Adequacy Criteria (it's in the ACM dl), that examined most of the up-to-that-point work on testing adequacy criteria. It seems quite appropriate given that coverage is one adequacy criterion that could be measured. There are many criteria, like def-use paths, state coverage, and so on and on and on.

It seems that the whole point with metrics is to put them into context, understand the narrow story they tell about the system being measured and then make intelligent decisions. To throw complexity or coverage out completely seems to insist that since we have no perfect answers we should give up and go home.

Your statement about complexity was a further curiosity. I think you made a slight equivocation. When someone tells you about the complexity as measured by the decision points, I hope it is understood by both of you that you are using jargon. "Complexity" in this instance only references McCabe's work. And hopefully, you both realize that within that context it is a measure (or a metric) for an aspect of the system that seems to be somewhat correlated with defect density (check McCabe's 96 NIST report where he points to a couple of projects that saw a correlation.) Based on that context, a complexity score is possibly a useful thing to know and to use for improving the software.

Even if it isn't correlated in perfect lock-step all the time, anyone who has written any substantial software knows the anguish of maintaining really ugly large methods/functions. McCabe is trying to measure something we know is there. Is his measure complete? No. Is it sufficient? No. Is it useful? Yes. It seems to be supported in the studies too. If you have references for field studies that contradict the 96 report, please post them.

Later you say, "When we try to manage anything on the basis of measurements that have not been carefully validated, we are likely to create side effects of measurement... There is a lot of propaganda about measurement, starting with the fairy tale that 'you can't manage what you don't measure.' (Of course we can. We do it all the time.)"

So, this seems to contradict itself. If I understand the aphorism about managing and measuring correctly - admittedly I haven't heard Tom DeMarco say it personally - what I take it to mean is that there is an implied "good" after the word "managing". That is, he was saying we cannot do a good job managing without measuring. On the other hand, I agree that people manage (with no "good" after it) all the time without measuring. You might say that managing without measuring is a derivative form of managing on the basis of measurements that have not been carefully validated.

While we are clearing things up: I got the point you are trying to make, but what do you mean when you quote "guns don't kill people, people kill people" and then add "By all means, blame the victim"? Where in there is the victim blamed?

Traditionally, that argument means that the guns should not be outlawed, but that criminals should be jailed. It's an argument by gun owners to keep their guns legally. How is it blaming the victims?

On that point, you say "putting defective tools in the hands of managers". Let's not forget putting tools that require expertise in the hands of managers. That is just as likely to blow up in somebody's face, and is arguably the current state of affairs.

As to your summary point, I think we agree. It takes a lot of thinking to do metrics right. Most people get them wrong. We should spend tons of money on research that validates metrics. (I am willing to co-write a grant to study crap4j if anyone is game.)

What I disagree with is a perception that metrics are not useful, that we are managing just fine without them, and that because some people misuse them (over and over again no less) that nobody should use them without exorbitant expenditures of time and money. It sounds a lot like trying to ignore the problem.

We must keep trying to improve our measures by studying them, by validating them, and by improving them based on that study. And without a doubt, it requires a coherent approach, and a clear understanding of what is being measured -- whether we call it construct validity or something else.