Fibrosis with its endpoint, cirrhosis, is the main complication of chronic liver diseases and especially chronic hepatitis C. It is a key histologic feature in chronic hepatitis, useful for evaluating the severity of the disease, for treatment decisions, and for assessing drug efficacy. To date, liver biopsy remains the gold standard for fibrosis assessment in hepatitis C, but this procedure has several limitations including morbidity and mortality, observer variability, and sampling variation. Several studies already have addressed the problem of interobserver variability in fibrosis evaluation in chronic hepatitis and they generally conclude that reproducibility in scoring fibrosis is good whatever the scoring system used by the pathologists. In contrast, the sampling variability of liver fibrosis has been poorly investigated. This issue is relevant because the liver core biopsy specimen represents only a very limited part of the whole liver and fibrosis is a heterogeneously distributed lesion, as pointed out by several pioneer studies. To avoid these pitfalls and limit the risk of erroneous evaluation, the use of a biopsy specimen of sufficient length and including a sufficient number of portal tracts is usually recommended. However, the threshold length recommended for avoiding sampling errors is arbitrary, and to our knowledge no studies have sought to define this threshold on a rational basis. Whether the magnitude of fibrosis heterogeneity is similar at every stage of the disease is another unresolved question.

Several tools are available to the pathologist for assessing liver fibrosis on a biopsy. Various semiquantitative scoring systems of increasing complexity that assess fibrosis have been proposed. These are based on a limited number of well-characterized morphologic patterns. In addition to these semiquantitative scores, several investigators have recommended the use of image analysis, a partly automated technique based on computerized pattern recognition that bypasses the subjectivity of the pathologist's judgment. This very accurate method has been used recently in several therapeutic trials in experimental models or in humans. In those studies, morphometry was used to quantify the extent of fibrosis as the proportion of the area of the whole liver biopsy specimen that is occupied by fibrosis.

To address sampling variability in liver fibrosis in chronic hepatitis C, we used a digital approach that allowed us to reconstitute many virtual biopsy specimens of different sizes artificially created from a large surgical section. This material was used to assess the influence of sampling heterogeneity in defining the size threshold when using both image analysis and the METAVIR scoring system.

ABSTRACT. Fibrosis is a common endpoint of clinical trials in chronic hepatitis C, and liver biopsy remains the gold standard for fibrosis evaluation. However, variability in the distribution of fibrosis within the liver is a potential limitation. Our aim was to assess the heterogeneity of liver fibrosis and its influence on the accuracy of assessment of fibrosis with liver biopsy. Surgical samples of livers from patients with chronic hepatitis C were studied. Measurement of fibrosis was performed on the whole section by using both image analysis and METAVIR score (reference value). From the digitized image of the whole section, virtual biopsy specimens of increasing length were produced. Fibrosis was assessed independently on each individual virtual biopsy specimen. Results were compared with the reference value according to the length of the biopsy specimen. By using image analysis, the coefficient of variation of fibrosis measurement with 15-mm long biopsy specimens was 55%; and for biopsy specimens of 25-mm length it was 45%. By using the METAVIR scoring system, 65% of biopsies 15 mm in length were categorized correctly according to the reference value. This increased to 75% for a 25-mm liver biopsy specimen without any substantial benefit for longer biopsy specimens. Sampling variability of fibrosis is a significant limitation in the assessment of fibrosis with liver biopsy. In conclusion, this study suggests that a length of at least 25 mm is necessary to evaluate fibrosis accurately with a semiquantitative score. Sampling variability becomes a major limitation when using more accurate methods such as automated image analysis.

DISCUSSION BY AUTHORS

Liver fibrosis is a primary endpoint in the evaluation of the severity of chronic liver disease as well as in therapeutic trials of chronic viral hepatitis. Liver biopsy remains the gold standard for assessment of fibrosis although several studies already have pointed out sampling variability as a major potential limitation of biopsy. Most of these studies were published before the development of semiquantitative scoring systems for chronic hepatitis. However, these studies already underlined the difficulty caused by sampling errors, especially when using biopsy specimens to diagnose cirrhosis. Maharaj et al., by performing 3 transcutaneous biopsies in the same patients using different entry points, reported that in proven cirrhotic patients, a histopathologic feature of cirrhosis was present in all 3 biopsy specimens of only 50% of the patients. Similarly, Abdi et al. performed several post mortem biopsies and showed that the diagnosis of cirrhosis could be obtained from one biopsy specimen in only 16 of 20 cases, but that the performance increased to 100% with 3 biopsy specimens. For the diagnosis of chronic active hepatitis, the same problems were raised by several investigators, although to a lesser extent. Although Soloway et al. found major discrepancies in diagnosis in cirrhotic patients with sequential biopsies, they found a 90% degree of consistency in grading individual histologic features of chronic hepatitis including fibrosis. Less favorable results were published by Baunsgaard et al., who showed a concordance in only 36 of 50 patients when evaluating the fibrosis amount with 2 biopsies in the same patient. These studies already pointed out one of the major potential limitations of fibrosis assessment: sampling variation. As shown by these studies, the problem of heterogeneity could be resolved partially by taking several biopsy specimens from the same patient, a procedure that raises major ethical concerns due to an increase in the risk for morbidity and mortality.

In this context of sampling variability, the size of the biopsy specimen must be considered because it is evident that the longer the biopsy specimen, the lower the risk for an erroneous evaluation due to sampling error. Because of ethical problems, this question has been rarely addressed. In an elegant study, Hohlund et al. assessed, on 25-mm long biopsy specimens, the accuracy of diagnosis of chronic active hepatitis or cirrhosis when increasing amounts of tissue were visible. They concluded that diagnosis is safe for biopsy specimens that are 15 mm or longer. Their study did not assess either individual lesions or their semiquantitative evaluation, and did not consider the possibility that a 25-mm long biopsy specimen, which was their reference, might not be long enough to make a correct diagnosis in 100% of cases.

To answer this question, we used virtual biopsy specimens to artificially obtain a large number of samples of variable length from one large section in which the fibrosis measurement served as a reference value. By matching the width of the optical field to the width of standard transcutaneous liver biopsy performed with an 18-gauge needle, we were able to obtain from one surgical section more than 500 virtual biopsy specimens from 2.5 to 200 mm in length. By using this approach and comparing scores obtained on each individual virtual biopsy specimen with the reference score, we showed that with 25-mm long biopsy specimens (a length usually considered adequate by pathologists), only 75% of the biopsy specimens were scored correctly using the METAVIR scale. This yield decreased to 65% for 15-mm biopsy specimens, a size threshold that usually is accepted for assessment of fibrosis in clinical trials. When the percentage of correctly classified biopsy specimens was compared stage by stage, a similar trend was observed for each stage, with a sharp increase in the percentage of correctly classified biopsy specimens up to a size of 25 mm, and no major improvement for longer samples. The performance was less favorable for reference values between F1 and F3, but was better for F0 and F4. This result should be interpreted with caution because F0 and F4 are both at the extremes of the METAVIR scale, a situation that limits the possibilities of erroneous diagnosis by comparison with F1, F2, and F3 stages, at which errors can be made on both sides of the score. Furthermore, it must be pointed out that the METAVIR score of fibrosis in virtual biopsy specimens was established by taking into account only the amount of fibrosis. In fact, another architectural criterion, nodular regeneration, is used in association with annular fibrosis to define the F4 cirrhotic stage. Such a bias might slightly overestimate the percentage of incorrectly classified cirrhotic livers.
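The virtual-biopsy principle can be illustrated with a minimal sketch. It is not the software used in the study: the section is a synthetic binary grid, fibrosis is laid down in arbitrary square patches, and every name and parameter (`make_section`, `virtual_biopsies`, the pixel dimensions, the 10% fibrosis fraction) is a hypothetical choice made for illustration only. The sketch cuts randomly placed strips of fixed width and increasing length from one section and reports the coefficient of variation (CV) of the fibrosis area fraction for each length.

```python
import random
import statistics


def make_section(rows, cols, fibrosis_fraction, patch, seed=0):
    """Synthetic binary section: 1 = fibrosis pixel, 0 = parenchyma.

    Fibrosis is laid down in square patches so it is unevenly distributed.
    """
    rng = random.Random(seed)
    grid = [[0] * cols for _ in range(rows)]
    target = int(fibrosis_fraction * rows * cols)
    filled = 0
    while filled < target:
        r0, c0 = rng.randrange(rows), rng.randrange(cols)
        for r in range(r0, min(r0 + patch, rows)):
            for c in range(c0, min(c0 + patch, cols)):
                if grid[r][c] == 0:
                    grid[r][c] = 1
                    filled += 1
    return grid


def area_fraction(grid, r0, c0, height, width):
    """Fraction of fibrosis pixels inside one rectangular window."""
    fibrosis = sum(grid[r][c]
                   for r in range(r0, r0 + height)
                   for c in range(c0, c0 + width))
    return fibrosis / (height * width)


def virtual_biopsies(grid, width, length, n, seed=1):
    """Area fractions of n randomly placed strips of fixed width and length."""
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    return [area_fraction(grid,
                          rng.randrange(rows - width + 1),
                          rng.randrange(cols - length + 1),
                          width, length)
            for _ in range(n)]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical section of 200 x 400 pixels with about 10% fibrosis.
section = make_section(rows=200, cols=400, fibrosis_fraction=0.10, patch=8)
reference = area_fraction(section, 0, 0, 200, 400)  # whole-section value
cv_by_length = {
    length: coefficient_of_variation(
        virtual_biopsies(section, width=4, length=length, n=300))
    for length in (10, 50, 200)
}
print(f"reference fibrosis fraction: {reference:.3f}")
for length, cv in cv_by_length.items():
    print(f"strip length {length:>3} px: CV = {cv:.2f}")
```

In this toy model the CV falls as the strips lengthen, which is the qualitative behavior described above. The absolute values depend entirely on the assumed patch size and fibrosis fraction and should not be read as estimates for real liver tissue.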

Quantification of fibrosis by image analysis is a powerful method with several potential advantages but also limitations. It is quantitative and very accurate, thus theoretically enabling detection of the slightest variation in fibrosis between biopsy specimens of a given patient. This method has been used in several therapeutic trials of chronic hepatitis C. Contrary to semiquantitative assessment, which is based on the pathologist's analysis and is therefore in part subjective, image analysis is a partly automated method that eliminates errors caused by the pathologist's judgment. Our results show that the size of the biopsy specimen also has a major influence on the performance of the technique. When the performance of image analysis for discriminating 2 adjacent scores is assessed by ROC curves, the AUC increases with the size of the biopsy specimen, whatever the stage being considered. Curves that plot the area of fibrosis according to the size of the biopsy specimen clearly show major coefficients of variation for 15-mm or even 25-mm long biopsy specimens, a magnitude of CV that clearly would disqualify any biologic test from use in clinical practice. Such variability cancels the potential advantage provided by the high accuracy of the technique for fibrosis assessment except for biopsy specimens of 40 mm or longer, a condition rarely obtained with a single biopsy in routine practice.
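For readers less familiar with ROC analysis, the AUC for discriminating 2 adjacent stages can be computed directly as the probability that a specimen from the higher stage shows a larger fibrosis area than a specimen from the lower stage, with ties counted as one half. The sketch below shows that calculation only; the function name `auc` and the fibrosis-area values are invented for illustration and are not data from this study.

```python
def auc(lower_stage_areas, higher_stage_areas):
    """Area under the ROC curve from pairwise comparisons.

    Equals the probability that a randomly chosen higher-stage value
    exceeds a randomly chosen lower-stage value (ties count 0.5).
    """
    wins = 0.0
    for high in higher_stage_areas:
        for low in lower_stage_areas:
            if high > low:
                wins += 1.0
            elif high == low:
                wins += 0.5
    return wins / (len(lower_stage_areas) * len(higher_stage_areas))


# Invented fibrosis areas (% of specimen area), for illustration only.
overlapping = auc([2.0, 3.0, 4.0], [3.5, 5.0, 6.0])   # 8 of 9 pairs ordered correctly
separated = auc([2.0, 3.0, 4.0], [5.0, 6.0, 7.0])     # all 9 pairs ordered correctly
print(f"overlapping groups: AUC = {overlapping:.3f}")
print(f"separated groups:   AUC = {separated:.3f}")
```

An AUC of 0.5 means the measurement cannot tell the 2 stages apart, and 1.0 means perfect separation. Greater sampling scatter in short specimens widens the overlap between adjacent stages and therefore lowers the AUC.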

This study included only biopsy specimens or surgical samples from patients with chronic hepatitis C. Whether the results and conclusions of this study also are valid for other chronic liver diseases is unknown. Chronic fibrosing liver diseases can be split into 2 categories according to the mechanism of fibrogenesis: those in which fibrosis starts in zone 3, such as alcoholic and nonalcoholic steatohepatitis or chronic vascular liver disease, and those that are closely related to inflammation and necrosis and begin in periportal areas, such as viral hepatitis, autoimmune hepatitis, or chronic biliary diseases. Whether the same heterogeneity, with its practical consequences, applies to the first category is a possibility but remains to be proven by a similar study.

Comparison between the METAVIR score and image analysis of fibrosis confirms that an increase in METAVIR stage is associated with a progressive increase in the fibrosis area. Interestingly, this accumulation of fibrous tissue is not linear: for example, an F2 stage has 3-fold more fibrosis than an F0 stage (normal liver), whereas F3 and F4 have, respectively, 7- and 12-fold more than F0. These data should be taken into consideration when modeling the rate of fibrosis progression in chronic liver diseases. Our study suggests that if the METAVIR score of fibrosis appears to increase linearly with time, this in fact corresponds to an exponential accumulation of fibrous tissue within the liver.
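The nonlinearity can be checked with simple arithmetic on the fold values quoted above. The sketch below fits a straight line through F0 and F2 and extrapolates it to F3 and F4. The linear extrapolation is an assumption introduced here purely for comparison, and no value is assumed for F1 because none is given in the text.

```python
# Fibrosis area relative to F0, as quoted in the text (F1 is not given).
observed_fold = {0: 1.0, 2: 3.0, 3: 7.0, 4: 12.0}

# Straight line through F0 and F2: +1 fold per METAVIR stage.
slope = (observed_fold[2] - observed_fold[0]) / (2 - 0)
linear_prediction = {stage: observed_fold[0] + slope * stage
                     for stage in (3, 4)}

for stage in (3, 4):
    predicted = linear_prediction[stage]
    observed = observed_fold[stage]
    print(f"F{stage}: linear extrapolation {predicted:.0f}-fold, "
          f"observed {observed:.0f}-fold "
          f"({observed / predicted:.2f} times the linear value)")
```

A linear trend would predict 4-fold and 5-fold at F3 and F4, against the observed 7-fold and 12-fold, so each additional stage at the upper end of the scale corresponds to a larger absolute gain in fibrous tissue than at the lower end.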

Besides the METAVIR scoring system, more detailed systems are used to score fibrosis in chronic hepatitis C. It is a strong possibility that the same limitations apply to these more detailed scores, but this also needs to be shown by a specific study.

Finally, given this sampling variability, we might ask whether liver biopsy should remain the gold standard for assessment of liver fibrosis, particularly in therapeutic trials in viral hepatitis. Noninvasive methods of fibrosis assessment such as surrogate serum markers or imaging techniques might avoid these pitfalls. However, their performance also should be questioned because their diagnostic value usually is assessed by comparison with the results of liver biopsy, the performance of which is limited by sampling variability. Sampling errors in liver fibrosis evaluation, as shown in this study, should be taken into consideration in therapeutic trials with liver fibrosis as an endpoint, both when calculating the number of patients to be included and when defining the size threshold and the method of fibrosis measurement.

Sample size and the uneven distribution of lesions in the whole liver have always presented a problem to both clinician and pathologist. Needle biopsy specimens generally measure between a few millimeters and several centimeters in length, so that each represents only one hundred-thousandth to one thirty-thousandth of the whole organ.
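The order of magnitude of these fractions can be reproduced with a back-of-the-envelope calculation that treats the biopsy core as a cylinder. The sketch below rests on three assumptions that are not stated in the paragraph above: an adult liver volume of about 1.5 L, a core diameter of 1.4 mm, and specimen lengths of 10 and 30 mm. The function names are hypothetical.

```python
import math

LIVER_VOLUME_MM3 = 1_500_000   # assumption: about 1.5 L for an adult liver
CORE_DIAMETER_MM = 1.4         # assumption: core diameter of the needle


def core_volume_mm3(length_mm, diameter_mm=CORE_DIAMETER_MM):
    """Volume of a cylindrical biopsy core."""
    radius = diameter_mm / 2
    return math.pi * radius ** 2 * length_mm


def liver_to_core_ratio(length_mm):
    """How many cores of this length would make up the assumed liver volume."""
    return LIVER_VOLUME_MM3 / core_volume_mm3(length_mm)


for length_mm in (10, 30):  # assumption: 1-cm and 3-cm specimens
    print(f"{length_mm}-mm core: about 1 part in "
          f"{liver_to_core_ratio(length_mm):,.0f} of the liver")
```

Under these assumptions a 10-mm core is a little under one hundred-thousandth of the organ and a 30-mm core a little over one thirty-thousandth, in line with the range quoted above.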

Small and therefore unrepresentative samples can make histologic diagnosis difficult or impossible and can even lead to a wrong diagnosis. When liver biopsies are used to determine therapy for patients with chronic hepatitis C, and when they form the basis of grading and staging in clinical trials of new therapies, size matters. Pathologists like big samples, but smaller samples are usually considered to be safer for the patient. A compromise has to be reached, yet there has so far been little attempt to study the problem scientifically.

In this issue of HEPATOLOGY, Bedossa et al. from France report on a carefully conducted study relating biopsy length to the variability of fibrosis as measured by image analysis. Their aim was to determine the minimum cylinder length giving an acceptably representative measurement. Their method was to measure fibrous tissue, stained with picrosirius red, in large tissue sections from 17 resected livers from patients with chronic hepatitis C. They then calculated the results for virtual needle biopsy samples of different lengths, theoretically obtained from the same tissue blocks. As expected, variability of the relative amount of fibrous tissue per unit area decreased as specimen length increased, with little improvement beyond a specimen length of 40 mm.

The results of image analysis were correlated with semiquantitative staging using the METAVIR scoring system,1 and the investigators concluded that the minimum acceptable specimen length for staging was 25 mm. They did not do the same correlation with other widely used staging systems such as the one proposed by Ishak et al.,2 but the definitions of stages in different systems are broadly comparable, and it seems likely that the results will be applicable to other systems. This will need confirmation.

A different approach to the same problem has recently been taken by Colloredo et al.3 This Italian group used the Ishak system2 to score 161 large (30 mm or more in length) biopsies from patients with hepatitis B and C and examined the effect on grading and staging scores of evaluating complete biopsy sections and then sections in which part of the tissue had been masked. Reducing the amount of tissue studied significantly reduced the scores for both necroinflammation and fibrosis. The same effect was produced by reducing biopsy specimen diameter in order to mimic biopsies taken with fine needles of 1-mm rather than 1.4-mm internal diameter. They concluded that evaluating small or slender biopsies was likely to lead to underestimation of disease severity and recommended that grading and staging should be carried out using specimens at least 20 mm long and 1.4 mm wide. The investigators believe that one of the important factors in reducing reliability in small specimens is the low number of portal tracts included in a section; their recommendation for grading and staging is a minimum of 11 complete portal tracts.

The specimen sizes recommended both by Bedossa et al. and Colloredo et al.3 are greater than those often accepted in past studies,4-8 while sometimes no minimum size is specified.9,10 Kage et al.11 rejected sections with fewer than 5 portal tracts, and Kaserer et al.12 those with fewer than 8. Bravo et al.,13 reviewing the technique of liver biopsy, commented that a specimen 1.5 cm long was usually adequate for diagnosis of diffuse liver disease but did not specify the length needed for grading or staging. The investigators thought that most pathologists were satisfied with specimens containing 6 to 8 portal tracts. In a recent study of the natural progression of hepatitis C, Zarski et al.14 gave a specimen length of 20 mm as the minimum accepted, in keeping with the size recommended by Colloredo et al.3

The reports by Bedossa et al. and Colloredo et al.3 both have important implications for future studies using liver biopsy as the method of assessment of severity of hepatitis and structural changes. In clinical trials, investigators will be encouraged to specify the minimum biopsy size accepted for analysis. Results obtained using smaller specimens will inevitably be open to question. Even with relatively large specimens as recommended in the two articles, sampling variability can never be completely eliminated. It can, however, be partly overcome by the use of large cohorts of patients and biopsies; when large numbers are studied, variation is likely to be random and multidirectional, and therefore of less consequence than in small studies.
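The averaging argument can be sketched with a small simulation. It assumes that each patient's biopsy carries an independent, unbiased sampling error drawn from a normal distribution, which is a simplification introduced here and not a finding of either study. The function names, cohort sizes, and error magnitude are hypothetical.

```python
import random
import statistics


def cohort_mean_error(n_patients, error_sd, rng):
    """Mean sampling error across one cohort, one biopsy per patient."""
    return sum(rng.gauss(0.0, error_sd) for _ in range(n_patients)) / n_patients


def spread_of_cohort_mean(n_patients, error_sd=1.0, repeats=200, seed=0):
    """Standard deviation of the cohort-mean error over repeated cohorts."""
    rng = random.Random(seed)
    means = [cohort_mean_error(n_patients, error_sd, rng)
             for _ in range(repeats)]
    return statistics.pstdev(means)


small_cohort = spread_of_cohort_mean(10)
large_cohort = spread_of_cohort_mean(1000)
print(f"cohort of   10: spread of mean error = {small_cohort:.3f}")
print(f"cohort of 1000: spread of mean error = {large_cohort:.3f}")
```

With random errors the spread of the cohort mean shrinks roughly with the square root of the number of patients. A systematic error, such as consistent understaging in short specimens, would not average out in this way.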

There are implications for studies attempting to find noninvasive methods for assessing fibrosis. Such studies are currently evaluated by comparing results with liver biopsy findings, and the results, while sometimes encouraging, do not yet allow liver biopsy to be replaced by surrogate markers.15 Again, future studies will have to take biopsy size into account, so that surrogate markers and histologically assessed fibrous tissue can be correlated with greater confidence. Interestingly, Poynard et al.16 recently found that specimens longer than 15 mm and containing 6 or more portal tracts gave better correlation of histology with biochemical surrogate markers of fibrosis and disease activity than smaller samples.

In the individual patient, comparison of one biopsy with another is never very accurate because of the confounding factors of intra- and interobserver error (reduced or even eliminated when multiple biopsies are compared at one time by the same observer), the subjective nature of histologic assessment, and sampling. The sampling problem will be reduced by applying the new information from the Italian and French studies.

Bedossa et al. found that fibrosis as measured by image analysis was related to the METAVIR stages in a nonlinear fashion. For example, stages F3 (septa but no cirrhosis) and F4 (cirrhosis) represented 7- and 12-fold increases over F0 (normal fibrous tissue only). This nonlinearity must be taken into account when staging is used to evaluate the rate of fibrosis progression. It also underlines the fundamental difference between staging and image analysis. As the French group emphasizes, staging is a subjective procedure while image analysis provides objective measurement. Moreover, staging takes structural changes such as nodule formation into account in addition to assessment of fibrous tissue. Staging depends on the skill, experience, and bias of each individual pathologist. Differences in the results obtained by one pathologist and another for the same sample do not mean that one or both of them is wrong; the differences are to be expected in a subjective analysis. A further important difference is that staging and grading create discrete categories, which could just as well be denoted by letters as by numbers. They require a different statistical approach from that appropriate for measurements. An example of an appropriate method for ordered categorical data is given in some detail in a report on fibrosis in hepatitis C by Lagging et al.17

Which is better for the study of progression of chronic hepatitis C, image analysis or staging? The answer is surely neither. Image analysis and staging generate different types of information, and each may lead to helpful insights into disease progression. The results of the two procedures are broadly similar,5,18,19 although in one comparative study20 correlation was restricted to specimens with high staging scores. Because staging can be performed quickly and requires no special equipment, it is likely to remain the more common way of assessing fibrosis in routine practice for the time being. Image analysis should provide useful data in clinical trials and other research projects.