02:00 May-23-2003 Re: Some textbook or information about Probability of Detection
: Dear all, could anybody tell me where I can get some textbook (reference book) or other information about the issue of Probability of Detection?

If you are interested in methodologies for determining POD, some useful documents are:

(1) AGARD-LS-190, which is sort of a preview of
(2) USAF MIL-HDBK-1823 (United States Dept. of Defense Military Handbook)

02:16 May-23-2003 Re: Some textbook or information about Probability of Detection
Hopefully this won't sound too self-serving, but NTIAC has some POD publications, such as the POD for NDE book or the NDE Databook (http://www.ntiac.com/databook.html), which has some 400 or so POD curves. If I remember correctly, the Databook on CD includes the POD curves as Excel files, meaning you could use them as templates of a kind to build your own POD curves if you were so inclined.

Take a look at our order form at http://www.ntiac.com/pubs/order.html for info on pricing and availability.

09:32 May-24-2003 Re: Some textbook or information about Probability of Detection
Hi Jacky, there was a booklet published entitled "Practical Sensitivity Limits of Production NDT". It was published in 1975 by Boeing, about 165 pages.

It was reproduced by the National Technical Information Service of the US Department of Commerce. I guess that in the intervening 28 years organisations have moved on; I don't know if you can still get a copy nowadays.

01:50 Jul-01-2003 Re: Some textbook or information about Probability of Detection

Textbooks and reference books are the wrong place to look for an understanding of the issue of Probability of Detection. Without exception, so far as I know, all of them incorporate a false assumption: that defect detection tests preserve probability theory. For a discussion of this topic, see http://www.ndt.net/article/v04n05/oldberg/oldberg.htm .

09:44 Oct-30-2003 Dave's Advice on Protocols for Research on NDT Reliability
If you are interested in methodologies for determining POD, some useful documents are:
(1) AGARD-LS-190, which is sort of a preview of
(2) USAF MIL-HDBK-1823 (United States Dept. of Defense Military Handbook)
and the set of:
(3) Floyd Spencer, Giancarlo Borgonovi, Dennis Roach, Don Schurman, Ron Smith, "Reliability Assessment at Airline Inspection Facilities Volume I: A Generic Protocol for Inspection Reliability Experiments", DOT/FAA/CT-92/12, I, March 1993.
(4) Floyd Spencer, Giancarlo Borgonovi, Dennis Roach, Don Schurman, Ron Smith, "Reliability Assessment at Airline Inspection Facilities Volume II: Protocol for an Eddy Current Inspection Reliability Experiment", DOT/FAA/CT-92/12, II, May 1993.
(5) Floyd Spencer, Don Schurman, "Reliability Assessment at Airline Inspection Facilities Volume III: Results of an Eddy Current Inspection Reliability Experiment", DOT/FAA/CT-92/12, III, May 1995.
The above are geared to the aerospace industry, and are essentially accepted practice within it.
The July 2001 issue of Materials Evaluation also has a number of useful articles, of broader perspective than the how-to material above.
If you have more specific questions, I'm sure you can get many responses on this forum.
Dave

Warning: Do not trust the documents cited by Dave. The research protocols that are advocated by these documents violate the axiom of probability theory that is known as Unit Measure. The penalty for deciding to follow one of these protocols is to have your findings empirically invalidated from the start.

By the way, I believe Dave is correct in saying that these protocols are accepted practice in the aerospace industry. Similar protocols are accepted practice in the nuclear industry. That these protocols are accepted practice and that the resulting findings are empirically invalidated means that we have a crisis.

If the existence of this crisis is not readily apparent, a portion of the blame lies with authors of advice on research protocols who, like Dave and the authors of the works that he cites, have failed to deal with prior findings in the literature.

In particular, my 1995 paper "Erratic Measure" (http://www.ndt.net/article/v04n05/oldberg/oldberg.htm) antedates Dave's posting and all of the works that Dave cites, and it exposes the problem of Unit Measure violations; yet neither Dave nor any of the works he cites comes to grips with this problem or references my paper.

05:25 Dec-08-2003 Re: Dave's Advice on Protocols for Research on NDT Reliability Rejoining this discussion, a little late, but late is better than never. When we left off, I think Terry and I had agreed that he had a valid objection to the flaw counting procedure used in the experiment Terry references in his paper (see http://www.ndt.net/article/v04n05/oldberg/oldberg.htm ). I think where we disagree is in the validity of the counting procedure used in many other experiments.

Let me propose a virtual experiment. If I have a set of widgets and ask an inspector to provide a flawed/not-flawed decision for each widget, in my opinion this experiment satisfies probability theory and we can use the MIL-HDBK-1823 methods to assess the results. The inspection result for each widget is either flawed or not flawed, and upon verification I can assign a ground truth evaluation of flawed or not flawed. Therefore, the inspector results are either a hit, a miss, a false call, or a correct no call. These results are mutually exclusive and exhaustive, which obviates the objection Terry has in his paper to the quoted study.
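The widget experiment above can be sketched as a simple confusion-matrix tally. All data here are hypothetical, invented purely to illustrate the four mutually exclusive outcomes:

```python
# Tally hit / miss / false call / correct no call for a widget experiment.
# Each widget gets exactly one inspection call and one ground-truth state,
# so the four outcomes are mutually exclusive and exhaustive.

def tally(calls, truth):
    """calls, truth: lists of booleans (True = flawed), one entry per widget."""
    counts = {"hit": 0, "miss": 0, "false call": 0, "correct no call": 0}
    for c, t in zip(calls, truth):
        if t and c:
            counts["hit"] += 1
        elif t and not c:
            counts["miss"] += 1
        elif not t and c:
            counts["false call"] += 1
        else:
            counts["correct no call"] += 1
    return counts

# Hypothetical results for 8 widgets:
calls = [True, True, False, False, True, False, True, False]
truth = [True, False, True, False, True, False, True, True]
counts = tally(calls, truth)
print(counts)  # the four counts sum to the number of widgets

pod = counts["hit"] / (counts["hit"] + counts["miss"])
pfc = counts["false call"] / (counts["false call"] + counts["correct no call"])
print(f"POD = {pod:.2f}, P(false call) = {pfc:.2f}")
```

Because the four outcomes partition the widgets, both a POD and a probability of false call can be estimated from the same experiment, which is exactly the property being claimed.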

In my opinion, as long as my experiment satisfies this fundamental requirement, I have avoided the mistakes Terry describes in his paper. I am not aware that any results our group has published have violated this.

Terry also alludes to another issue with this methodology, but I do not know what it is, and so I can not provide my opinion on it.

I do think there are shortcomings to MIL-HDBK-1823. The handbook does not explicitly address false calls. The methodology assumes one can reduce cracks to a one-dimensional measure, while many flaws (corrosion, impact damage, delamination) are not easily characterized in one dimension. Our research group continues to develop more sophisticated approaches to NDI reliability, as do others. There are industries that do not subscribe to this methodology; the European program ENIQ is a prime example. However, I do believe there are many occasions where the MIL-HDBK-1823 methodology works, and we will continue to use it.

03:34 Jan-10-2004 Re: Dave's Advice on Protocols for Research on NDT Reliability Dave:

Thanks for sticking with the conversation.

I agree, with certain caveats, that your "widgets" methodology is exemplary in its conformity to probability theory and normal statistics. However, we seem to disagree on the content of MIL-HDBK-1823. Also at issue is DOT/FAA/CT-92/12 which you recommended in your posting of May 23, 2003.

The widgets methodology works because it meets most of the following requirements on a methodology that is free from statistical quirks:
1) Events that are certain to occur (e.g., false calls) are one-to-one with sampling units (the widgets)
2) The sampling units do not intersect
3) The union of the sampling units is the inspected material
4) Because 2) and 3) are true, the sampling units are a partition of the inspected material
5) A definitive test is identified which determines the true state (the "ground truth" in your nomenclature) of the sampling units
6) A sample is drawn for definitive testing via random sampling
7) If probability estimates are extracted from artificial samples, it is possible, in principle, to confirm the estimates via measurements in the field
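Requirements 2) through 4) can be checked mechanically if the sampling units are modeled as sets of locations. The following is a hypothetical sketch, not part of any cited protocol; the "material" and "units" here are invented toy data:

```python
# Check that a set of proposed sampling units forms a partition of the
# inspected material: the units must be pairwise disjoint (Requirement 2)
# and their union must equal the whole material (Requirement 3), which
# together give Requirement 4.

def is_partition(units, material):
    """units: list of sets of locations; material: set of all locations."""
    seen = set()
    for u in units:
        if seen & u:           # Requirement 2: units must not intersect
            return False
        seen |= u
    return seen == material    # Requirement 3: union covers the material

material = set(range(10))                       # toy "inspected material"
widgets = [{0, 1, 2}, {3, 4}, {5, 6, 7, 8, 9}]  # widgets: a valid partition
flaws = [{2, 3}, {7}]                           # flaws alone: not a partition

print(is_partition(widgets, material))  # True
print(is_partition(flaws, material))    # False: unflawed material is uncovered
```

The second case is the nub of the critique above: a set of flaws by itself leaves most of the material outside any sampling unit, so it cannot be a partition.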

It is easy to compare a proposed methodology against these requirements. My understanding is that neither MIL-HDBK-1823 nor DOT/FAA/CT-92/12 stands up to this kind of scrutiny.

I lost my electronic copy of MIL-HDBK-1823 in a hard disk crash, so I will have to rely on memory about what it says. What I recall is that it is consistent with the U.S. military's doctrine of the past 30 years on how to inspect aircraft. This doctrine states that the reliability of a method of NDT is to be characterized by a function that maps flaw size to a Probability of Detection. The sampling units are not identified but, from the context, may be inferred to be flaws. The set of flaws is not, however, a partition of the inspected material, and it follows directly that Requirement 4) is violated. When a true positive is certain to occur, it may have a one-to-many relationship with flaws, and it follows that Requirement 1) may be violated. Samples are not drawn randomly, and it follows that Requirement 6) is violated. The function that maps flaw size to POD is usually extracted from artificial samples, and a method for verifying that these POD estimates apply in the field is not identified, so Requirement 7) is violated. In some cases, the sampling units may intersect, and it follows that, in these cases, Requirement 2) is violated.
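For readers unfamiliar with the flaw-size-to-POD mapping being discussed, such curves are commonly obtained by regression on hit/miss data. The sketch below uses synthetic data and a crude maximum-likelihood fit of a log-logistic model; MIL-HDBK-1823's actual procedure is more elaborate (it also prescribes confidence bounds), so treat this only as an illustration of the general idea:

```python
import math

# Minimal hit/miss POD curve: fit POD(a) = 1 / (1 + exp(-(b0 + b1*ln a)))
# by gradient ascent on the Bernoulli log-likelihood. Synthetic data.

sizes = [0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # flaw sizes (e.g. mm)
hits  = [0,   0,   1,   0,   1,   1,   1,   1  ]  # 1 = detected

b0, b1 = 0.0, 0.0
for _ in range(5000):
    g0 = g1 = 0.0
    for a, y in zip(sizes, hits):
        x = math.log(a)
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # gradient of log-likelihood w.r.t. b0
        g1 += (y - p) * x    # gradient w.r.t. b1
    b0 += 0.1 * g0
    b1 += 0.1 * g1

def pod(a):
    """Fitted probability of detection for a flaw of size a."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * math.log(a))))

print(f"POD(1 mm) = {pod(1.0):.2f}, POD(5 mm) = {pod(5.0):.2f}")
```

Note that the fit silently takes flaws as the sampling units, which is precisely the assumption the surrounding discussion calls into question.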

I'm a bit fuzzy on DOT/FAA/CT-92/12, as I've not read it in a number of months. Please correct me if I mischaracterize it.

My recollection is that it is like MIL-HDBK-1823 in generating a function that maps flaw size to POD, and that the sampling units underlying the POD are implied to be flaws. It differs in generating a Probability of False Call.

A number of years ago, I called one of the authors of the final report of the U.S. Federal Aviation Administration's study on Aging Aircraft, because the report left the nature of the sampling units unclear. The purpose of the study was to characterize the reliability with which rivet joints were inspected by eddy current testing. I gathered from the report that the inspections were carried out with the purpose of detecting flaws.

I believe my informant was a member of the group at Sandia Laboratories in Albuquerque that authored DOT/FAA/CT-92/12. If so, what he told me is revealing about DOT/FAA/CT-92/12. What he told me was that the sampling units were rivet joints.

If the rivet joints were defined concretely, say, by defining them as annuli with specified outer radii, if the tests were configured to classify each annulus as flawed or unflawed, if a definitive test were defined that revealed their true state, and if the remaining Requirements were satisfied, then all would have been well. However, this could not be true.

One inconsistency was that a flaw was not an annulus. Another was that the union of the set of flaws and the set of unflawed annuli was not a partition of the inspected material. I won't go into the rest of the many ways in which the 7 Requirements on a quirk-free methodology were violated by the methodology.

This left, as a mystery, how the researchers extracted the probabilities from the data. I can only speculate about the answer. My speculation is that the "probability of false call" which they reported is actually the probability that an unflawed annulus is detected and the "probability of detection" is actually the probability that a flawed annulus is detected. Remaining mysteries include: 1) what was the criterion by which an annulus was declared as "detected" when the requirement on the inspectors was not to classify annuli but to detect flaws? 2) As annuli generally contained more than one flaw, which flaw's length was mapped to the POD?

All of these quirks in the FAA's research would have been eliminated if the FAA, the airlines and the inspection agencies had gotten together beforehand on a redefinition of the tests such that they supported your "widgets" methodology and my 7 requirements for quirk-free inspection reliability research. I don't think this was possible then, and I will be (pleasantly) surprised if you convince me it is happening now. Seemingly, the NDT specialists of the world have been in the grip of an orthodoxy that says the purpose of NDT is to detect flaws, and they have preferred to produce statistically flawed research (with the flaws often papered over) rather than abandon the orthodoxy.

Terry


02:16 Jan-21-2004 NDT Reliability methodology
My apologies to the rest of the readers; I have tried to clarify the following conversation by prefacing paragraphs with the author. This is my response (prefaced by "DF reply:") to Terry Oldberg's posting of 10 Jan 2004 (prefaced by "TO:"). I have changed the subject heading to better reflect our subject, I hope.

TO: I agree, with certain caveats, that your "widgets" methodology is exemplary in its conformity to probability theory and normal statistics. However, we seem to disagree on the content of MIL-HDBK-1823. Also at issue is DOT/FAA/CT-92/12 which you recommended in your posting of May 23, 2003.

TO: The widgets methodology works because it meets most of the following requirements on a methodology that is free from statistical quirks:
1) Events that are certain to occur (e.g., false calls) are one-to-one with sampling units (the widgets)
2) The sampling units do not intersect
3) The union of the sampling units is the inspected material
4) Because 2) and 3) are true, the sampling units are a partition of the inspected material
5) A definitive test is identified which determines the true state (the "ground truth" in your nomenclature) of the sampling units
6) A sample is drawn for definitive testing via random sampling
7) If probability estimates are extracted from artificial samples, it is possible, in principle, to confirm the estimates via measurements in the field

DF reply: in regards to your point 6), I would be reluctant to accept anything other than full tear down of all test components for most POD studies. There are some exceptions.

DF reply: in regards to your point 7) one may perform a study only applicable within laboratory conditions, and it would then be wrong to apply to field conditions, but this does not invalidate the study. We have used this sort of study to pre-select techniques for further development.

TO: It is easy to compare a proposed methodology against these requirements. My understanding is that neither MIL-HDBK-1823 nor DOT/FAA/CT-92/12 stands up to this kind of scrutiny.

TO: I lost my electronic copy of MIL-HDBK-1823 in a hard disk crash so will have to rely on memory about what it says. What I recall is that it is consistent with the U.S. military's doctrine of the past 30 years on how to inspect aircraft. This doctrine states that the reliability of a method of NDT is to be characterized by a function that maps flaw size to a Probability of Detection. The sampling units are not identified but, from the context, may be inferred to be flaws. (cut here)

DF reply: I think this is where we are going to disagree. I think it is possible to do things wrong within MIL-HDBK-1823 or the FAA guidelines, but I do not think it is inevitable. My previous widgets example is something that I think follows the MIL-HDBK-1823/FAA guidelines. More below

TO: I believe my informant was a member of the group at Sandia Laboratories in Albuquerque that authored DOT/FAA/CT-92/12. If so, what he told me is revealing about DOT/FAA/CT-92/12. What he told me was that the sampling units were rivet joints.

TO: If the rivet joints were defined concretely, say, by defining them as annuli with specified outer radii, if the tests were configured to classify each annulus as flawed or unflawed, if a definitive test were defined that revealed their true state, and if the remaining Requirements were satisfied, then all would have been well. However, this could not be true.

TO: One inconsistency was that a flaw was not an annulus. Another was that the union of the set of flaws and the set of unflawed annuli was not a partition of the inspected material. I won't go into the rest of the many ways in which the 7 Requirements on a quirk-free methodology were violated by the methodology.

DF reply: In the FAA rivet study, one could state the inspection problem as: is there a fatigue crack within this radius from the fastener hole? The inspection result is either yes or no. The result can then be confirmed by teardown (or by knowledge of which holes had flaws manufactured in them), and the yes or no is either a true positive, false positive, true negative, or false negative; I think we then have a valid statistical partition.

TO: This left, as a mystery, how the researchers extracted the probabilities from the data. I can only speculate about the answer. My speculation is that the "probability of false call" which they reported is actually the probability that an unflawed annulus is detected and the "probability of detection" is actually the probability that a flawed annulus is detected. Remaining mysteries include: 1) what was the criterion by which an annulus was declared as "detected" when the requirement on the inspectors was not to classify annuli but to detect flaws? 2) As annuli generally contained more than one flaw, which length was it that was mapped to the POD?

DF reply: I do not see how your point 1) works; I do not see the importance of this difference in terminology between an annulus being declared as detected flawed and a fatigue crack being found within the annulus. I agree that number 2) is a problem, unless one can register the NDI signal in space to the crack location. This is something which must be accounted for in experiment design. In our published bolthole eddy current work, for example, probe positioning was recorded so we could correlate each signal to a particular fatigue crack.

DF reply: I do not think we are really too far apart here. I also have concerns over the use of words like flaw and defect, I think ASNT has it right when it uses discontinuity. When I give lectures on this topic, I discuss these definitions and how they can be misleading. Poor choice of vocabulary, whether intentional or not, often leads to misunderstandings.

DF reply: I do agree that the study discussed in Terry's original paper on this topic, Erratic Measure (search on ndt.net to find it), was a flawed study. I don't think it is representative. Most of the problems I have seen with reliability studies are one of two issues: 1. unrepresentative specimens (e.g., EDM notches); 2. use of POD data outside its realm of applicability.

DF reply: Is anyone else on this forum interested in this discussion between Terry and me?

Thank you for remaining in this dialogue. In the following paragraphs, I respond to your most recent posting.

>DF reply: In the FAA rivet study, one could state the inspection problem as is there a fatigue crack within this radius from the fastener hole. The inspection result is either yes or no. The result can then be confirmed by teardown (or by knowledge of which holes had flaws manufactured in them), and the yes or no is either true positive, false positive, true negative, false negative; and I think we then have a valid statistical partition.

TO response: The FAA rivet study contains an inconsistency that you do not address. The study's report provides a function that maps crack size to POD. This implies that the sampling units underlying the POD are cracks and not cracked annuli. However, the union of the cracks and the uncracked annuli is not a partition of the inspected material. That this union is not such a partition violates probability theory empirically.

>DF reply: I do not see how your point 1) works, I do not see the importance of this difference in terminology between annulus being declared as detected flawed, and fatigue crack found within the annulus.

TO response: For the benefit of others, I'll explain that my "point 1)" referenced the FAA rivet study. I meant it to ask the question "...what was the criterion by which an annulus was declared as 'detected' when the requirement on the inspectors was not to classify annuli as cracked or uncracked but rather to detect cracks?" The answer is not obvious. For example, if an inspector "found" a crack that was not present but did not find cracks that were present in the same annulus, was this annulus classified as a true positive or as a false negative? The report of the FAA study doesn't tell us, as I recall.

>DF reply: I also have concerns over the use of words like flaw and defect, I think ASNT has it right when it uses discontinuity. When I give lectures on this topic, I discuss these definitions and how they can be misleading. Poor choice of vocabulary, whether intentional or not, often leads to misunderstandings.

TO response: Misuse of statistical terms such as "probability," "population," "sampling unit," and "sample" is an additional, semantic problem of NDT.

>DF reply: I do agree that the study discussed in Terrys original paper on this topic, Erratic Measure (search on ndt.net to find it), was a flawed study. I don't think it is representative. Most of the problems I have seen with reliability studies are one of two issues: 1. unrepresentative specimens (ie. EDM notches) 2. use of POD data outside its realm of applicability.

TO response: I'll address my experience with the representativeness of the study that you've referenced as the "flawed study" by providing autobiographical information. In 1982, I accepted the job of organizing R&D on the inspection of pressurized water reactor steam generator tubes. I soon found that forces beyond my control made it impossible to perform the job I had accepted.

The inspections were performed under the ASME Boiler and Pressure Vessel Code. Test procedures that resulted from the ASME's rules did not define sampling units. Thus, though it was possible to develop inspection technologies, one's ability to measure their reliability was crippled.

By 1984, the research branch of the U.S. Nuclear Regulatory Commission (NRC) was planning the study which you have referenced as the "flawed study", and I was sitting on its steering committee. Thus, I informed NRC staff members, plus others in the nuclear power industry, of the significance of the ill-defined sampling units for the study. They ignored me.

This response was consistent with thinking of the time. The thinking was that the structure of the tests of NDT presented no barriers to NDT reliability studies.

In 1991, I addressed this thinking in a paper that was published in "Materials Evaluation." The paper pointed out that, if they were defined, the sampling units of NDT would be a partition of the material that was inspected; discontinuities didn't qualify. People ignored my paper.

Several years later, Jean Perdijon of France's Commissariat a l'Energie Atomique published a paper on a similar topic in the same journal. He opined that the sampling units of NDT should be "equal volume elements." That Perdijon's "elements" were a partition of the inspected material demonstrated that his thinking was similar to mine.

Still trying to draw people's attention to the fact that the sampling units were ill-defined, I wrote to individuals and groups in the nuclear and aviation industries to ask them to identify the sampling units. I received only one reply that addressed the issue, when an ASME committee mailed me a photomicrograph of a crack and the surrounding material. Did they mean that a) the crack was an example of a sampling unit or b) the crack plus the surrounding material was an example? Was the union of this committee's sampling units a partition of the inspected material? Could one of their sampling units contain a number of discontinuities other than one? All of these matters were important, and none of them were resolved by their communication with me. I attempted further communication, but they did not respond to it.

In 1994, in preparation for the paper that was to be entitled "Erratic Measure," I skimmed the contents of Stanford University's library on NDT, going back to about 1945. Some of the generalizations that emerged from this research were:
a) so far as I could determine, nobody had clearly identified the sampling units of NDT
b) few studies had been performed on NDT's reliability
c) in virtually every case, reports of studies of NDT's reliability presented functions that mapped the size of a discontinuity to its "Probability of Detection" or "POD"
d) a "false call" was said to have occurred when an indication was not proximate to a discontinuity
e) a probability of false call was not extracted from the data or reported
f) protocols for NDT reliability research published in the Metals Handbook, the ASNT Handbook and the draft of a U.S. Military Standard assumed probability theory but did not ensure that it was preserved; in fact, through omissions, they positively led people astray.

In the fall of 1994, Ron Christensen and I submitted "Erratic Measure" to an ASME conference. The paper pointed out that empirical violations of probability theory occurred if sampling units were not one-to-one with events that were certain to occur, such as false positives. We demonstrated the existence of these violations in the "flawed study."

One of the paper's findings was that the "probability of detection" of the "flawed study" violated probability theory due to a one-to-many relationship between the certain event of a true positive and discontinuities. If one used best case analysis, the POD for large flaws was 1; if one used worst case analysis, it was 0. In virtually every case, engineers use worst case analysis, for they wish to ensure the robustness of their designs. The use of best case analysis ensures a fragile design. However, in reporting the results of the "flawed study", the NRC had used best case analysis. This meant that large discontinuities could go undetected even though the NRC's "probability" of detection was 1 for them.
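The gap between best-case and worst-case counting under a one-to-many relationship can be shown numerically. The counts below are hypothetical, invented to illustrate the mechanism; they are not data from the NRC study:

```python
# One-to-many counting quirk: one inspection indication on a region that
# contains several discontinuities. Best-case analysis credits every
# discontinuity in the region as detected; worst-case analysis credits
# at most one per indication. Hypothetical numbers for illustration.

regions = [
    {"flaws": 3, "indication": True},   # one call, three flaws in region
    {"flaws": 2, "indication": True},
    {"flaws": 1, "indication": False},
]

total_flaws = sum(r["flaws"] for r in regions)
best = sum(r["flaws"] for r in regions if r["indication"])   # all credited
worst = sum(1 for r in regions if r["indication"])           # one per call

print(f"best-case POD  = {best}/{total_flaws} = {best / total_flaws:.2f}")
print(f"worst-case POD = {worst}/{total_flaws} = {worst / total_flaws:.2f}")
```

The same raw data yield very different "probabilities" depending on the counting convention, which is why the choice between best-case and worst-case analysis matters so much in the argument above.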

As soon as the ASME's referees approved "Erratic Measure" for publication, I forwarded a copy to the NRC's research director, recommending action on the situation it had revealed. He did not respond to me. I heard later that someone from his organization had contacted the ASME conference committee to ask them to quash publication of "Erratic Measure."

The conference committee responded by bringing an academic statistician on board, as a referee. He read "Erratic Measure" and, I was told, judged it "excellent." The conference committee reacted by rejecting the NRC's request.

Late in 1994, I presented "Erratic Measure" to an audience of NDT specialists who were gathered at a meeting of the ASME's NDE Engineering Division in Houston. At the end of my talk, a member of my audience arose. He said he was affiliated with one of the U.S. National Laboratories and that he was encountering phenomena, of the types I had described, in his own laboratory. He pointed out that most of us had gotten to the meeting in aircraft and stated that the methods by which these aircraft were inspected contained inconsistencies of the type I had just revealed. He expressed fear of flying home because of them.

At the break, a man approached me with the information that he had recently retired from the ASME Section XI Committee. He suggested that I talk to the Committee about the problems described in "Erratic Measure."

The Committee was scheduled to meet near my home 9 months later, so I wrote to offer to present "Erratic Measure" to them at this meeting. When the Committee did not respond for many months, I called its Secretary. He apologized for the delay, said I would be welcome to attend the meeting and said I would be welcome to discuss any topic other than "Erratic Measure." This and other responses from the Committee obviated the possibility of discourse with them.

At about the same time, I contacted the FAA, whose study on rivet joint inspection has been discussed earlier. They told me that, as a matter of courtesy, they sometimes responded to reports of problems in airliner technology but that they were lukewarm about meeting because useful results seldom came from this type of meeting. They added that they had a policy against providing compensation for consultants like me. This effectively thwarted the possibility of a dialogue.

Shortly after the ASME published "Erratic Measure," Business Week magazine printed about a million copies of an article about it. However, so far as I could determine, the publicity did not embarrass anyone in the NDT community into taking action on the inconsistencies that were revealed by the paper.

About 4 years ago, ndt.net republished "Erratic Measure" on the Web. At the time, I had been involved in the issue that was raised by the paper for 16 years. So far as I could tell, nobody had taken action to either a) refute the contentions of "Erratic Measure" or b) deal with its findings in a constructive manner. In particular, it appeared that nuclear reactors were inspected, and studies of the reliability of NDT in these reactors were performed, in much the same way as before my involvement.

To break the apparent logjam, I began to use the forum of ndt.net to point out inconsistencies in what people were saying. The results of this dialogue are available in ndt.net's archives. To generalize, a refutation of "Erratic Measure" did not emerge from this vetting of its contentions.

A couple of years ago, Materials Evaluation published an article by an author who was identified as an expert on NDT reliability. The expert recommended use of the protocol in MIL-HDBK-1823.

This document is a rare book. When I was finally able to put my hands on a copy, I found that a) it was published years later than "Erratic Measure", b) it did not reference "Erratic Measure", and c) it dealt with none of the issues raised by "Erratic Measure." When I read DOT/FAA/CT-92/12, I found that it had these same characteristics.

To summarize my autobiographical remarks, "Erratic Measure" was accepted for publication within the peer review system more than 8 years ago. It survived a challenge from a powerful foe. Post publication, it has been vetted in public for 8 years, 4 of them in the eminently accessible location of an Internet Web site. Despite this scrutiny, it has not been refuted. Therefore, it seems to me that action on its findings is long overdue.

Some of the events that should take place immediately are:
a) Existing protocols for the performance of NDT reliability studies should be scrapped
b) Protocols should be written that admit the errors of the past and guarantee they will not recur
c) A Commission should be formed to review the reliability research of the past, identify whatever errors are present in it, bury the research that cannot be salvaged, and salvage as much as possible from the rest of the data
d) The results of the Commission's findings should be publicized with enough fanfare to ensure that nobody is deceived by faulty research of the past
e) For the future, all of the tests of NDT should include a complete specification of how their false negative and false positive error probabilities shall be measured or, if probability theory cannot be preserved, present the non-probability-theory-based decision theory that replaces probabilistic decision theory.

This brings me to your remarks on the MIL-HDBK-1823 and DOT/FAA/CT-92/12 protocols. You say "I think it is possible to do things wrong within MIL-HDBK-1823 or the FAA guidelines, but I do not think it is inevitable." I say we need protocols that make the kinds of error we've been discussing impossible. We must not place our society in the position of making important decisions on information that is faulty or misleading.

I was told about this discussion by a colleague from Airbus France, who asked several people in our European aeronautic NDT community whether we were aware of it. The design of POD studies still seems to us a major issue, as the world is very often not as simple as in the case of discontinuities in lap joints, where it seems much easier to define clear samples than, for instance, when you have cracks on an edge of a part versus indications obtained with a sliding probe, especially if you want to calculate the false alarms. What is a "location with discontinuity" and a "location without"? Do you calculate with samples of 1 mm in length, or with samples the length of the probe diameter (as the active area), or how do you define a sample location? Is it only a test specimen? For a good calculation of a false alarm rate you need a high number of samples without discontinuities: a very expensive situation if you declare only a complete test specimen to be "one" sample.
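The point that the false-alarm rate depends on how a "sample location" is defined can be made concrete with a small sketch. All numbers here are hypothetical:

```python
# Hypothetical illustration: a 1000 mm scan line containing 3 false
# indications. Declaring coarser or finer unflawed sampling units over the
# same scan changes the computed false-call rate dramatically.

false_indications = 3
scan_length_mm = 1000

rates = {}
for unit_mm in (1, 10, 100, 1000):
    n_units = scan_length_mm // unit_mm
    # count at most one false call per sampling unit
    fcr = min(false_indications, n_units) / n_units
    rates[unit_mm] = fcr
    print(f"unit = {unit_mm:4d} mm -> {n_units:4d} units, "
          f"false-call rate = {fcr:.3f}")
```

With 1 mm units the rate is 0.003; with the whole specimen as "one" sample it saturates at 1.0. Nothing about the inspection changed, only the definition of the sampling unit, which is exactly the questioner's concern.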

I hope I have put the situation in the correct words, as I am not a "native English speaking expert in statistics"; some expressions used in German for this topic can't easily be retrieved from regular dictionaries (I mostly use http://dict.leo.org - a very good living project of the Univ. of Munich), so sometimes it is hard for us to hit the definition 100%.

Conclusion: I like this discussion very much, and maybe it gives us some additional ideas to improve our work, to the benefit of safety.