Background

Today, the standardized uptake value (SUV) is essentially the only means for quantitative evaluation of static [18F-]fluorodeoxyglucose (FDG) positron emission tomography (PET) investigations. However, the SUV approach has several well-known shortcomings that adversely affect the reliability of the SUV as a surrogate of the metabolic rate of glucose consumption. The standard uptake ratio (SUR), i.e., the uptake time-corrected ratio of tumor SUV to image-derived arterial blood SUV, has been shown in first clinical studies to overcome most of these shortcomings, to decrease test-retest variability, and to increase the prognostic value in comparison to SUV. However, it is unclear to what extent the SUR approach is vulnerable to observer variability of the additionally required blood SUV (BSUV) determination. The goal of the present work was to investigate the interobserver variability of image-derived BSUV.

Methods

FDG PET/CT scans from 83 patients (72 male, 11 female) with non-small cell lung cancer (N = 46) or head and neck cancer (N = 37) were included. BSUV was determined by 8 individuals, each applying a dedicated delineation tool for the BSUV determination in the aorta. Two of the observers applied two further tools. Altogether, five different delineation tools were used. With each tool used, delineation was performed for the whole patient group, resulting in 12 distinct observations per patient. Intersubject variability of BSUV determination was assessed using the fractional deviations of the individual patients from the patient group average and was quantified as standard deviation (SDis), 95% confidence interval, and range.

Interobserver variability of BSUV determination was assessed using the fractional deviations of the individual observers from the observer-average for the considered patient and quantified as standard deviations (SD p, SD d) or root mean square (RMS), 95% confidence interval, and range in each patient, each observer, and the pooled data respectively.

Results

Interobserver variability in the pooled data amounts to RMS = 2.8% and is much smaller than the intersubject variability of BSUV (SDis = 16%). Averaged over the whole patient group, deviations of individual observers from the observer average are very small and fall in the range [−0.96, 1.05]%. However, interobserver variability differs markedly between patients, covering a range of [0.7, 7.4]% in the investigated patient group.

Conclusion

The present investigation demonstrates that the image-based manual determination of BSUV in the aorta is sufficiently reproducible across different observers and delineation tools, which is a prerequisite for accurate SUR determination. This finding is in line with the already demonstrated superior prognostic value of SUR in comparison to SUV in the first clinical studies.

Today, the standardized uptake value (SUV), defined as the tracer concentration at a certain time point normalized to injected dose per unit body weight, is essentially the only means for quantitative evaluation of static [18F-]fluorodeoxyglucose (FDG) positron emission tomography (PET) investigations. However, the SUV approach has several well-known shortcomings, notably, uptake time dependence of the SUV, interstudy variability of the arterial input function (AIF), and susceptibility to errors in scanner calibration [1–3], which adversely affect the reliability of the SUV as a surrogate of the metabolic rate of glucose consumption. This possibly explains the unsatisfactory performance of SUV-based therapy outcome prediction for various tumor diseases [4–16]. In recent publications, we were able to show that the uptake time-corrected ratio of tumor SUV to (image-derived) blood SUV (standard uptake ratio (SUR)) overcomes most of these shortcomings [17, 18], decreases test-retest variability [19], and increases the prognostic value compared to SUV in patients with esophageal carcinoma [20, 21] and non-small cell lung cancer [22].

While the assumptions underlying the SUR concept [17, 18] are sound, reliability of the image-based blood SUV (BSUV) determination required for SUR computation might be questioned. In our previous clinical studies [20–22], BSUV was consistently determined by the strategy described in the “Materials and methods” section and used for SUR computation. The observed superior performance of SUR in comparison to SUV demonstrates that insufficient accuracy of BSUV determination was not a critical issue in these studies. However, in all these investigations, the same individual determined BSUV with the same delineation tool and it is conceivable that reliability of BSUV is distinctly inferior when it is determined by different observers with the same or a different delineation tool. Both systematic as well as random interobserver differences would obviously limit the usefulness of SUR in longitudinal as well as cross-sectional clinical studies.

Consequently, the goal of the present work was the investigation of the interobserver variability of image-derived BSUV within single patients and across a substantial patient group. For this purpose, 8 observers from 6 institutions determined BSUV in image data from 83 patients using one or more of five different delineation tools.

Patient group and data acquisition

The investigated patient group included 83 patients (72 male, 11 female, mean age 59.5 years, range 37–84). Data were acquired prospectively from August 2005 to August 2009 at the University Hospital, Technische Universität Dresden, in the context of two different studies (ClinicalTrials.gov identifier: NCT00180245, patients with head and neck squamous cell carcinoma (HNSCC), N = 37 and ClinicalTrials.gov identifier: NCT00180154, patients with non-small cell lung cancer (NSCLC), N = 46) and were evaluated retrospectively in the present study. All patients included in the prospective studies were also included here. Retrospective evaluation of the data was approved by the local Clinical Institutional Review Board and complies with the Declaration of Helsinki.

BSUV determination

Define a circular ROI at the center of the aorta in this CT image. Adjust the radius to keep approximately 8 mm away from the aortic wall. Step through consecutive planes along the descending aorta and repeat the ROI definition. Skip the plane in case of

- visible spill-in into the aorta from adjacent “hot” structures
- visible attenuation correction artifacts affecting the aorta

3. Exclude planes near and below the diaphragm (which are susceptible to motion-induced attenuation artifacts).

4. Process a sufficient number of planes to obtain a total ROI volume of at least 5 ml. If the minimum volume cannot be achieved in the descending aorta alone, delineation can be extended to the ascending aorta.

5. Review the final delineation and verify its integrity regarding the mentioned exclusion criteria.

6. Copy the resulting ROI to the corresponding PET data and compute BSUV as the mean value of the aorta ROI.
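The final steps of the prescription above (pooling the accepted planes, checking the 5 ml minimum volume, and averaging) can be illustrated with a minimal Python sketch. The `PlaneROI` container, the `compute_bsuv` function, and the voxel volume used in the example are hypothetical names and values chosen for illustration; they are not part of any of the delineation tools used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlaneROI:
    """Hypothetical container for one transaxial plane of the aorta ROI."""
    suv_values: List[float]   # PET SUVs of the voxels inside the circular ROI
    excluded: bool = False    # True if the plane is skipped (spill-in, artifacts, near diaphragm)


def compute_bsuv(planes: List[PlaneROI], voxel_volume_ml: float,
                 min_volume_ml: float = 5.0) -> float:
    """Return the mean SUV over all accepted ROI voxels.

    Raises ValueError if the accepted voxels do not reach the minimum
    total ROI volume (5 ml in the prescription).
    """
    voxels = [v for plane in planes if not plane.excluded for v in plane.suv_values]
    total_volume_ml = len(voxels) * voxel_volume_ml
    if total_volume_ml < min_volume_ml:
        raise ValueError(
            f"ROI volume {total_volume_ml:.2f} ml is below {min_volume_ml} ml; "
            "add more planes (e.g., extend to the ascending aorta)"
        )
    return sum(voxels) / len(voxels)


# Example with an assumed voxel volume of 0.064 ml (4 x 4 x 4 mm^3).
example_planes = [
    PlaneROI([1.5] * 40),
    PlaneROI([9.0] * 40, excluded=True),  # plane with visible spill-in, skipped
    PlaneROI([1.7] * 40),
]
example_bsuv = compute_bsuv(example_planes, voxel_volume_ml=0.064)
```

Averaging over voxels rather than over per-plane means weights each plane by its ROI size, which matches taking the mean value of the complete three-dimensional aorta ROI.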

Fig. 1

Example of a valid aorta ROI delineation (highlighted in red) observing the prescription described in the “Materials and methods” section

The observers were free to use a delineation tool of their choice for the delineation task. With all tools used, the time required for a single data set was below 5 min. Overall, delineation was performed by eight observers using five different delineation tools. Each observer applied each chosen tool to the whole patient group. Six individuals used a single tool, and two individuals used three different tools, resulting in a total of D = 12 delineations for each of the P = 83 patients (see Table 1). We denote the individually derived values as BSUVdp (with d = 1, …, D and p = 1, …, P), where p enumerates the patients and d enumerates the observer/delineation tool combinations. In the following, we simply use the term “observer” to denote the different observer/delineation tool combinations.

Table 1

Overview of the software tools used for aorta delineation

| Software | Versions | No. of observers |
| EBW; Philips Healthcare, Best, The Netherlands | 4.0.3.5 | 3 |
| MIM; MIM Software Inc., Cleveland, OH, USA | 6.7.6 | 1 |
| OsiriX; Pixmeo SARL, Bernex, Switzerland | 9.5; 10.0 | 2 |
| PMOD; PMOD Technologies LLC, Zurich, Switzerland | 3.905; 3.804 | 3 |
| ROVER; ABX GmbH, Radeberg, Germany | 3.0.32; 3.0.36 | 3 |

The third column shows the number of observers who applied the respective software to the whole patient group

Data evaluation

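For each patient p, the observer average \(\overline {\text {BSUV}}_{p} = \frac {1}{D} \cdot \sum _{d=1}^{D} \text {BSUV}_{dp}\) was used as the best available estimator of the true (observer) population mean (the theoretical value resulting from averaging over infinitely many observers performing the delineation for this patient). Description of the intersubject variability of this quantity was based on the fractional deviation of individual patients from the patient group average \(\overline {\text {BSUV}} = \frac {1}{P} \cdot \sum _{p=1}^{P} \overline {\text {BSUV}}_{p}\):

\[ \frac {\overline {\text {BSUV}}_{p} - \overline {\text {BSUV}}}{\overline {\text {BSUV}}} \]

The standard deviation of this quantity over all patients is denoted SDis.

Interobserver variability was assessed using the fractional deviation of the individual observers from the observer average for the considered patient:

\[ \Delta \text {BSUV}_{dp} = \frac {\text {BSUV}_{dp} - \overline {\text {BSUV}}_{p}}{\overline {\text {BSUV}}_{p}} \qquad (1) \]

Interobserver variability was quantified as standard deviation, 95% CI, and range of ΔBSUVdp separately for each patient (SD p, taken over the D observers) and each observer (SD d, taken over the P patients), respectively. In the pooled group of all patients and observers, the standard deviation is replaced by the root mean square (RMS) deviation for description of the width of the distribution since it follows from Eq. 1 that the mean \(\Delta \overline {\text {BSUV}}\) (the average over all observers and patients) is exactly zero:

\[ \text {RMS} = \sqrt {\frac {1}{D \cdot P} \sum _{d=1}^{D} \sum _{p=1}^{P} \Delta \text {BSUV}_{dp}^{2}} \qquad (2) \]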
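The variability measures described in this section can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in the study: the function name `variability_measures` and the nested-list layout `bsuv[d][p]` are assumptions, and the use of the sample standard deviation (normalization by n − 1, as in `statistics.stdev`) is also an assumption, since the normalization is not stated in the text.

```python
import math
import statistics


def variability_measures(bsuv):
    """bsuv[d][p] holds the BSUV found by observer d for patient p."""
    D, P = len(bsuv), len(bsuv[0])

    # Observer average for each patient and patient group average.
    obs_mean = [sum(bsuv[d][p] for d in range(D)) / D for p in range(P)]
    group_mean = sum(obs_mean) / P

    # Fractional deviation of each observer from the observer average.
    delta = [[(bsuv[d][p] - obs_mean[p]) / obs_mean[p] for p in range(P)]
             for d in range(D)]

    # Per-patient spread (over observers) and per-observer spread (over patients).
    sd_p = [statistics.stdev([delta[d][p] for d in range(D)]) for p in range(P)]
    sd_d = [statistics.stdev(delta[d]) for d in range(D)]

    # Pooled RMS deviation; the pooled mean of delta is zero by construction.
    rms = math.sqrt(sum(x * x for row in delta for x in row) / (D * P))

    # Intersubject variability of the observer-averaged BSUV.
    delta_is = [(m - group_mean) / group_mean for m in obs_mean]
    sd_is = statistics.stdev(delta_is)

    return {"delta": delta, "sd_p": sd_p, "sd_d": sd_d, "rms": rms, "sd_is": sd_is}


# Toy example with 2 observers and 3 patients (made-up numbers).
toy = variability_measures([[1.0, 2.0, 3.0],
                            [1.1, 1.8, 3.0]])
```

Because each patient's deviations are taken relative to that patient's own observer average, they sum to zero for every patient, which is why the RMS can stand in for the standard deviation in the pooled data.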

A boxplot of the observed BSUVdp grouped by patient is shown in Fig. 2. The corresponding boxplot of ΔBSUVdp is shown in Fig. 3. There is a clear patient dependence of the interobserver variability, as indicated by the variable interquartile ranges in these plots. A pairwise comparison of the variances of the corresponding distributions revealed a significant difference (P < 0.05, two-tailed F test) in 30% of the comparisons. This patient dependence is further illustrated in Fig. 4, which shows the frequency distribution of SD p. A boxplot of the derived ΔBSUVdp grouped by observer is shown in Fig. 5. Averaged over the whole patient group, the individual observers differ only slightly (range [−0.96, 1.05]%) from the observer average, although the difference reaches statistical significance in 5 out of 12 observers according to a two-sided Mann-Whitney test. No significant difference of the variances of the corresponding distributions was found in a pairwise comparison. Figure 6 shows the corresponding SD d distribution, which demonstrates the (small) differences in observer performance. Finally, Fig. 7 shows the histogram of the complete pooled ΔBSUVdp data. The relevant quantitative measures are summarized in Table 2.

Fig. 2

Boxplot of the observed blood SUV (BSUVdp), grouped by patient. Note that intersubject variability is much larger than interobserver variability for each patient

Fig. 3

Boxplot of fractional deviation from observer mean for the respective patient (ΔBSUVdp), grouped by patient. Note the patient dependence of the magnitude of the interobserver variability

Fig. 4

Histogram of patient-specific interobserver variability, described by SD p (Eq. 3), the standard deviation of the distribution of fractional deviations ΔBSUVdp (Eq. 1) from observer mean for the respective patient grouped by patient as illustrated in Fig. 3

Fig. 5

Boxplot of fractional deviation from observer mean for the respective patient (ΔBSUVdp), grouped by observer. Note the comparable performance of all observers

Fig. 6

Histogram of observer performance contribution to the interobserver variability, described by SD d (Eq. 4), the standard deviation of the distribution of fractional deviations ΔBSUVdp (Eq. 1) from observer mean for the respective patient grouped by observer as illustrated in Fig. 5

Table 2

Intersubject and interobserver variability of BSUV described by the quantities defined in Eqs. 1–4 (at the stated accuracy level, RMS of ΔBSUVdp according to Eq. 2 is identical to the standard deviation)

In this study, we investigated the interobserver variability of image-based BSUV determination in the aorta. In the pooled group of all observers and patients, we found an interobserver variability of RMS=2.8%. This figure has to be compared with an intersubject variability of (observer-averaged) BSUV of SDis=16% in the investigated patient group (which is in complete agreement with other reports [24, 25]).

Thus, our main result is that interobserver variability of manually determined BSUV is much smaller (by nearly a factor of six) than the typical intersubject variability of this quantity and has, therefore, no relevant negative effect on assessment of true intersubject variability of BSUV. Regarding the use of image-derived BSUV in SUR computation, this finding demonstrates that validity of the SUR approach is not compromised by observer-induced uncertainties of BSUV determination. It should be emphasized that it is of no concern in this context, whether part of the observed substantial intersubject variability of BSUV is possibly caused by imperfections of SUV calibration of the considered PET system and/or trivial errors such as erroneous dose or body weight since any such effect causes a global rescaling of the image data and will thus cancel in computation of SUR.
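The two quantitative statements in this paragraph can be written out explicitly. As a sketch, let \(c\) be a hypothetical global scale factor (for instance from a calibration, dose, or body weight error) that multiplies the whole image, and let \(\varepsilon\) be a hypothetical fractional observer error affecting BSUV only; the uptake time correction is omitted because it depends on neither:

```latex
% Global rescaling cancels in the tumor-to-blood ratio:
\[
\text{SUR}' = \frac{c \cdot \text{SUV}_{\text{tumor}}}{c \cdot \text{BSUV}}
            = \frac{\text{SUV}_{\text{tumor}}}{\text{BSUV}} = \text{SUR}
\]
% An error in BSUV alone does not cancel and propagates to first order:
\[
\text{SUR}' = \frac{\text{SUV}_{\text{tumor}}}{(1 + \varepsilon) \cdot \text{BSUV}}
            \approx (1 - \varepsilon) \cdot \text{SUR}
\]
% Ratio of intersubject to interobserver variability:
\[
\frac{\text{SD}_{\text{is}}}{\text{RMS}} = \frac{16\%}{2.8\%} \approx 5.7
\]
```

The second relation is why the interobserver variability of BSUV matters for SUR at all: unlike a global scale error, it carries over to SUR with roughly the same fractional size.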

As demonstrated by our data, it is, however, relevant to ensure that the evaluated portions of the reconstructed images are free of spurious changes of the lesion to blood image contrast which might be caused by attenuation and scatter correction related effects in certain regions, notably induced by organ motion near the diaphragm and liver dome. Indeed, while the overall interobserver variability in the investigated patient group is very small, closer inspection of the data on a per-patient basis revealed that some patients exhibit substantially increased interobserver variability (see Figs. 2 and 3). Consequently, the SD p histogram in Fig. 4 shows a tail towards higher SD p values in a small fraction of patients. Retrospective examination of the affected image data identified, in most of these patients, a spurious motion-induced signal decrease due to attenuation undercorrection and/or scatter overcorrection (caused by attenuation/emission mismatch near the liver dome). This signal drop also affects part of the aorta, and the affected areas were erroneously not excluded from delineation by some observers (thus deviating from the provided procedure guideline). Such sporadic oversights are possibly unavoidable, as their occurrence in the present study suggests. It might therefore be advisable to exclude the potentially affected region categorically (instead of letting the observer decide this on a per-case basis) by not extending delineation below a plane about 5 cm above the diaphragm. But even with the presently used prescription, the worst-case deviation from the observer mean for any patient remained below 11%, which is still much smaller than the observed BSUV intersubject variability (range [−37, 41]%). Nevertheless, a clear patient dependence of the interobserver variability as described by SD p is present, with a range of [0.7, 7.4]%. In comparison, the overall performance of the different observers when averaged over the whole patient group is rather similar, as illustrated by Fig. 5 and the small SD d range of [2.3, 3.4]%.

A potential shortcoming of the present study is the limited number of observers and delineation tools included. However, considering the very consistent performance of all observers and software tools regarding variability and deviation from the observer average, the obtained results are statistically already sufficiently reliable in our view. Therefore, our results overall demonstrate a very low interobserver variability of image-derived BSUV. Theoretically, the obtained BSUVs could still be negatively biased by partial volume effects (which would lead to systematic errors when computing SURs). However, by using a prescribed safety margin of about 8 mm to the aortic wall, partial volume effects are reduced to a negligible level. Even for a rather pessimistic scenario with a combination of small luminal aorta diameter of 21 mm [26, 27] and low spatial resolution in the image data of FWHM=8 mm, signal recovery of delineation-averaged BSUV in a straight cylinder is equal to 0.985.
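The quoted signal recovery for the pessimistic scenario can be checked approximately with a short numerical sketch. The text does not state how the value was obtained, so everything in this model is an assumption: a transaxial cut through a straight cylinder is treated as a uniform disk with zero background, the image resolution as an isotropic two-dimensional Gaussian, and the ROI as a centered circle whose radius is the lumen radius minus the 8 mm margin (2.5 mm here). The function name `roi_recovery` and the grid step are likewise illustrative.

```python
import math


def roi_recovery(lumen_diameter_mm=21.0, margin_mm=8.0, fwhm_mm=8.0, step_mm=0.5):
    """Mean recovery inside a centered circular ROI for a uniform disk
    (activity 1 inside, 0 outside) blurred by a 2D Gaussian."""
    sigma = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    r_lumen = lumen_diameter_mm / 2.0
    r_roi = r_lumen - margin_mm

    # Pixel centers of a square grid covering the lumen.
    n = int(math.ceil(r_lumen / step_mm))
    coords = [(i + 0.5) * step_mm for i in range(-n, n)]
    lumen = [(x, y) for x in coords for y in coords
             if x * x + y * y <= r_lumen ** 2]
    roi = [(x, y) for (x, y) in lumen if x * x + y * y <= r_roi ** 2]

    # Riemann sum of the Gaussian kernel over the lumen for each ROI pixel.
    norm = step_mm ** 2 / (2.0 * math.pi * sigma ** 2)
    inv_two_var = 1.0 / (2.0 * sigma ** 2)
    total = 0.0
    for xr, yr in roi:
        total += sum(math.exp(-((xr - x) ** 2 + (yr - y) ** 2) * inv_two_var)
                     for (x, y) in lumen)
    return norm * total / len(roi)


recovery = roi_recovery()
```

Under these assumptions the result should come out close to the quoted 0.985. Calling `roi_recovery(margin_mm=2.0)` gives a noticeably lower recovery, which illustrates why the prescription asks for a generous distance to the aortic wall.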

The present investigation demonstrates that the image-based manual determination of BSUV in the aorta is sufficiently reproducible across different observers and delineation tools, which is a prerequisite for accurate SUR determination. This finding is in line with the already demonstrated superior prognostic value of SUR in comparison to SUV in the first clinical studies. The next logical step will be to fully automate BSUV determination for a more streamlined use of SUR in the clinical setting. The presented data might serve as a valuable resource for validation of such future algorithms.

Acknowledgements

Funding

This work was supported in parts by the German Federal Ministry of Education and Research (BMBF contract 03ZIK42/OncoRay).

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Authors’ contributions

FH, JM, and JvdH designed the study, performed the data evaluation, and wrote the manuscript. All authors performed aorta delineations and/or contributed to the final manuscript. FH and JM share first authorship. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Retrospective evaluation of the data was in accordance with the ethical standards of the institutional research committee of each site and approved accordingly (ethics committee of the University Hospital Carl Gustav Carus of the Technische Universität Dresden – EK161082004 and EK158082004). Written informed consent was obtained from all individual participants included in the study. Patients were diagnosed and treated according to national guidelines and the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.