The RRASOR, Static-99R and Static-2002R All Add Incrementally to the Prediction of Recidivism among Sex Offenders

Abstract

Empirically derived actuarial tools are increasingly being used in applied psychology, particularly for the assessment of risk for crime and violence. Although evaluators commonly use more than one scale, it is unclear how evaluators should interpret divergent findings. The current study examined the predictive accuracy and incremental validity of three risk assessment scales (RRASOR, Static-99R, and Static-2002R) in twenty distinct samples of sex offenders (N = 7,491). Static-99R and Static-2002R outperformed the RRASOR in the prediction of sexual, violent, and any recidivism. No differences in predictive accuracy were found between Static-99R and Static-2002R. Nevertheless, almost all the scales provided incremental validity to the prediction of all types of recidivism. The direction of the incremental effects, however, was not consistently positive. When controlling for the other measures, high scores on the RRASOR were associated with lower risk for violent and general recidivism. Consequently, decisions concerning the interpretation of multiple risk scales must be informed by the construct validity of the measures. When scales measure the same domain of risk factors, an averaging approach can be justified. If the selected scales are not sampling the same types of risk factors, then evaluators need a defensible model concerning (1) the latent constructs measured by the scales and (2) empirical evidence concerning how the constructs should be weighted and combined.

Authors' Note

The views expressed are those of the authors and not necessarily those of Public Safety Canada. Correspondence concerning this report should be addressed to: R. Karl Hanson, Corrections Research, Public Safety Canada, 340 Laurier Avenue West, Ottawa, ON, Canada, K1A 0P8. E-mail: karl.hanson@ps-sp.gc.ca


Most psychological tests are designed to assess latent constructs and their results have practical importance to the extent that the latent constructs are related to outcomes of interest. Although desirable, it is not always necessary to fully understand the latent psychological constructs being assessed for a measure to have practical utility. In fact, complete understanding is rare (Cronbach & Meehl, 1955). Experts can continue to argue about the nature of major psychological constructs (e.g., positive mental health, intelligence, sexual deviance) while agreeing on the practical utility of existing measures for applied decision-making (e.g., discharge from treatment, school placement, risk assessment). Measures can have importance based simply on their empirical relationships with the outcome of interest (e.g., Meehl, 1956). Such an empirical prediction is particularly relevant when the evaluator's primary concern is predicting a discrete (i.e., yes/no) outcome (e.g., depression relapse, school failure, sexual recidivism).

One domain in which empirical prediction has gained prominence in recent years is the evaluation of risk for crime and violence (Hanson, 2005, 2009; Quinsey, Harris, Rice, & Cormier, 2006). In the United States, the Daubert criteria (Daubert v. Merrell Dow Pharmaceuticals, Inc., 1993) are the most commonly used legal standard to determine whether scientific evidence (e.g., risk factors) is admissible in court (Monahan & Walker, 2010). The Daubert criteria require that testimony has empirical support, but the expert does not need to convince the court of a "cosmic understanding" (Daubert v. Merrell Dow Pharmaceuticals, Inc., 1993, para. 43) of the issues at hand. Using the Daubert criteria, US courts routinely accept empirical evidence on risk factors for crime and violence without necessarily understanding the causal mechanisms involved.

Although there is consensus that risk factors need to be empirically established (e.g., Kraemer et al., 1997), evaluators disagree on the best way of combining risk factors into an overall evaluation. Research has consistently found that structured risk assessments are more accurate than unstructured professional opinion (Gendreau, Goggin, & Law, 1997; Grove, Zald, Lebow, Snitz, & Nelson, 2000; Hanson & Morton-Bourgon, 2009); however, there is no consensus on how such assessments should be structured.

In the violence risk assessment field, most evaluators use some form of structured professional judgement (SPJ; Archer, Buffington-Vollum, Stredny, & Handel, 2006). In this form of evaluation, the risk factors are selected in advance based on their relationship with the outcome of interest. The combination of these items into an overall evaluation, however, is left to the judgement of the evaluator (Douglas & Kropp, 2002). In contrast to SPJ, mechanical prediction tools specify in advance the items and provide explicit methods for combining the items into a total score (Grove et al., 2000). When mechanical prediction tools also provide empirically derived probability estimates for a particular outcome of interest, they are called actuarial (Dawes, Faust, & Meehl, 1989; Meehl, 1954).

The use of actuarial risk tools is common in certain high-stakes risk evaluations. In sexual civil commitment trials, for example, 95% of civil commitment evaluators report using Static-99 (an actuarial risk tool for sexual recidivism) always or most of the time (Jackson & Hess, 2007). This contrasts with decision-making methods in general clinical psychology, where the majority of psychologists (68%) rely on unstructured, clinical prediction (Vrieze & Grove, 2009).

Although the use of mechanical and actuarial risk tools has clear strengths (e.g., reduced bias, high reliability; Garb, 2003), there are barriers to their routine use. For many applied decisions, validated prediction tools are simply not available (Vrieze & Grove, 2009). The current study, however, addresses the opposite problem: What should evaluators do when there are several different risk prediction tools available?

Although evaluators often use more than one measure (Jackson & Hess, 2007), it is not clear how to interpret the results when the measures disagree, and unfortunately, divergent results are common (e.g., Mills & Kroner, 2006). Barbaree, Langton, and Peacock (2006) found that less than 8% (n = 20) of sex offenders sampled (N = 262) were consistently identified as high risk or as low risk by five commonly used actuarial risk tools (i.e., Violent Risk Appraisal Guide [VRAG; Quinsey et al., 2006], SORAG, Static-99, RRASOR, and MnSOST–R). Consequently, evaluators interested in actuarial risk prediction with sexual offenders must decide which measures to use, and, if they use more than one, how to interpret divergent results.

When these general criteria are applied to sexual risk assessment, however, no one instrument is identified as superior. Specifically, all actuarial risk tools for sex offenders have acceptable and similar levels of interrater reliability (Barbaree, Seto, Langton, & Peacock, 2001; G. T. Harris et al., 2003), and there are minimal differences in their overall predictive accuracy (Hanson & Morton-Bourgon, 2009; Rettenberger, Matthes, Boer, & Eher, 2010).

In the absence of a clear winner, psychometric theory supports the use of multiple instruments. Classical test theory holds that test error can be minimized by increasing the item pool ("the more, the better"). Specifically, an observed score on a test (or item) has two components: the true score (or under item response theory, the examinee's ability or trait parameter) and measurement error (see Rust & Golombok, 2009, for a review). As such, increasing items or instruments should reduce the amount of measurement error. Because error is theorized to be random, the errors are expected to cancel themselves out across observations (Nunnally & Bernstein, 1994). Consequently, adding items to prediction tools should result in increased predictive accuracy. Of course, if the additional items are substantially worse (less predictive) than the items already considered, the accuracy of the overall prediction would deteriorate.

Incremental Validity

When using multiple scales in applied risk assessment, a central concern is incremental validity. Specifically, incremental validity is the extent to which new information improves the accuracy of a prediction above and beyond that of the previous instrument(s) used. Conceptually, if an instrument provides new information to better understand an offender's risk, it provides incremental information. For example, additional information about antisociality would aid in understanding an offender's risk to reoffend above that provided by a particular risk instrument that only considered mental health problems.

That certain items or domains of risk factors add incrementally to the prediction of violence or crime is uncontroversial. Indeed, the construction of most actuarial tools considered the incremental validity of the items retained in the final scale (e.g., the Level of Service Inventory-Revised [LSI-R; Andrews & Bonta, 1995], Static-99, Static-2002, and VRAG). Having these established risk assessment tools, the question then becomes how effectively the final measures have sampled and weighted the relevant variables. Namely, of the measures that are currently in use and are intended to be global assessments of risk, to what extent is it possible to identify other variables or scales that add incrementally to these measures?

Research on the incremental validity of commonly used risk instruments is mixed. Seto (2005) found that routinely used scales (i.e., RRASOR, Static-99, SORAG, and VRAG) did not add incrementally to one another in the prediction of sexual recidivism. Such findings suggest that the use of multiple instruments is an unnecessary hassle. The study, however, was limited by a small sample size of sex offenders (N = 215). In addition, of the risk instruments sampled, Seto (2005) found the RRASOR to be the most predictive of sexual recidivism. Most available studies, however, have found the RRASOR to be inferior to other available risk instruments, such as Static-99 (Hanson & Morton-Bourgon, 2009). In short, Seto's (2005) recommendation to choose the best instrument is difficult to apply because, as yet, there is no scientific consensus concerning which is the "best" instrument for the prediction of sexual recidivism, and different instruments may be better or worse for specific decisions in specific jurisdictions.

Lloyd (2008) examined a large set of actuarial instruments (MNSOST-R, Risk Matrix 2000 [Thornton et al., 2003], RRASOR, SORAG, Static-99), structured clinical guidelines (Structured Risk Assessment - Need Assessment [SRA; Thornton, 2002], SVR-20) and other variables hypothesized to predict sexual recidivism (e.g., number of male victims) in a group of sex offenders (N = 391). Lloyd (2008) found that a combination of risk scales best predicted sexual recidivism and added incremental validity to one another (including the SORAG, MNSOST-R, the Social-Affective score of the SRA, and the SVR-20). Although there may be some question of overfitting due to a large number of variables entered into the regression equation, the study demonstrates the possibility that existing scales can add incrementally to one another in the prediction of sexual recidivism.

Mills and Kroner (2006) expanded the examination of incremental validity by examining the impact of discordance among the risk instruments. They examined the incremental validity of the General Statistical Information on Recidivism Scale (GSIR; Nuffield, 1982), the LSI-R, and the VRAG for the prediction of general and violent recidivism for offenders (approximately 3/4 violent offenders). Further, they divided offenders into those with low discordance among risk instruments (i.e., the average standardized differences between instruments were small, suggesting consistency across instruments in relative risk estimates) and high discordance (i.e., the average standardized differences between instruments were large, suggesting inconsistency across instruments in relative risk estimates). Mills and Kroner (2006) found that the scales added incrementally to the prediction of general and violent recidivism for offenders with low discordance (n = 140), but not those with high discordance (n = 69). Given the small sample size of the discordant group, a plausible explanation for the null finding is lack of statistical power required to test such hypotheses.

Welsh, Schmidt, McKinnon, Chattha, and Meyers (2008) examined the incremental validity of the Youth Level of Service/Case Management Inventory (YLS/CMI; Hoge & Andrews, 2002), Structured Assessment of Violence Risk in Youth (SAVRY; Borum, Bartel, & Forth, 2002) and Psychopathy Checklist: Youth Version (PCL:YV; Forth, Kosson, & Hare, 2003) in a sample of juvenile offenders (N = 105), for predicting general and violent recidivism. Even with a small sample size, Welsh and colleagues (2008) found that the SAVRY added incrementally to the PCL:YV and the YLS/CMI for both violent and general recidivism. In addition, the PCL:YV was found to add incrementally to the YLS/CMI, whereas the YLS/CMI did not add incremental validity to the other two scales.

In summary, there are relatively few studies examining the incremental validity of unmodified (e.g., no items removed) risk scales for crime and violence, and most available studies are limited by small sample sizes. Overall, the research suggests that multiple risk instruments may add incremental validity to one another. Further research with larger samples is required, however, to better understand whether there is practical utility in using several risk instruments.

Current Study

The purpose of the present study was to compare the predictive validity of three commonly used measures for the prediction of recidivism among sexual offenders: RRASOR, Static-99R, and Static-2002R. Specifically, we examined (1) whether the RRASOR, Static-99R, or Static-2002R predicted sexual, violent, and any recidivism more accurately than the others and (2) whether the three instruments added incremental validity to one another in the prediction of the three types of recidivism. All three scales included in the current study are similar to each other in that they have the same purpose (predicting sexual recidivism) and are based on similar demographic and criminal history variables. If one of the instruments was clearly superior in terms of predictive accuracy and no other scales added incrementally to it, evaluators would be justified in using only the "best" measure. The choice of instruments would be less clear, however, if none of the measures had superior predictive accuracy or if they were found to add incrementally to one another.

Method

Measures

Rapid Risk Assessment for Sex Offence Recidivism (RRASOR)

The RRASOR (Hanson, 1997) is an actuarial instrument designed to measure risk of sexual recidivism. Scores range from 0 to 6, with a higher score indicating greater risk of sexual recidivism. It has four items: (1) prior sexual offenses, (2) any unrelated victims, (3) any male victims, and (4) offender is less than 25 years of age. For the current study, the items of Static-99 were used to compute the RRASOR. The coding rules for the items of the RRASOR and Static-99 are identical with the exception of prior sexual offences. Specifically, unlike the RRASOR, the coding rules of Static-99 do not count pseudo-recidivism as prior sexual offences. Pseudo-recidivism is estimated to affect approximately 5% of offenders (Phenix, Doren, Helmus, Hanson, & Thornton, 2009), and hence, the difference between using the item scoring of Static-99 rather than RRASOR is expected to be minimal.

In the development study, the RRASOR differentiated sexual recidivists from nonrecidivists with an Area Under the Curve (AUC) of .71 (Hanson, 1997). A recent meta-analysis conducted by Hanson and Morton-Bourgon (2009) found that the RRASOR showed similar, although slightly smaller effects, when averaged across 34 diverse follow-up studies (weighted mean d = 0.60, 95% CI = 0.54 to 0.65, N = 11,031, k = 34; which translates to an AUC of .66, 95% CI = .65 to .68).
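The translation from d to AUC reported above can be approximated with a short calculation. The sketch below assumes the common conversion AUC = Φ(d / √2), which holds when recidivists and nonrecidivists have normally distributed scores with equal variance; the source does not state which conversion was used, and the function name `d_to_auc` is hypothetical.

```python
import math


def d_to_auc(d: float) -> float:
    """Convert a standardized mean difference (Cohen's d) to an AUC.

    Assumption (not stated in the source): both groups are normally
    distributed with equal variance, so AUC = Phi(d / sqrt(2)).
    """
    z = d / math.sqrt(2.0)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Point estimate and 95% CI bounds reported for the RRASOR (d = 0.60, 0.54 to 0.65).
for d in (0.54, 0.60, 0.65):
    print(f"d = {d:.2f} -> AUC = {d_to_auc(d):.2f}")
```

Under this assumption the three values round to .65, .66, and .68, which agrees with the interval reported for the RRASOR.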

Static-99R contains all the RRASOR items as well as additional items concerned with relationship history (1 item), sexual offence history (stranger victims, non-contact sexual offences), and general criminal history (number of prior sentencing occasions, index non-sexual violence, prior non-sexual violence; see Table 1). A recent meta-analysis found a moderate relationship between Static-99 and sexual recidivism (weighted mean d = 0.67, 95% CI = 0.62 to 0.72, N = 20,010, k = 63; which translates to an AUC of .68, 95% CI = .67 to .70; Hanson & Morton-Bourgon, 2009). For an overview of research on Static-99, see Anderson and Hanson (2010).

Table 1. Items Contained in the RRASOR, Static-99R and Static-2002R

| Notes | RRASOR | STATIC-99/STATIC-99R | STATIC-2002/STATIC-2002R |
| --- | --- | --- | --- |
| a | Offender's age at release | Offender's age at release | Offender's age at release |
| b | Number of prior sexual offence charges and convictions | Number of prior sexual offence charges and convictions | Prior sentencing occasions for sexual offences |
| c | Any unrelated victims of sexual assaults | Any unrelated victims of sexual assaults | Any unrelated victims of sexual assaults |
| c | Any male victims of sexual assaults | Any male victims of sexual assaults | Any male victims of sexual assaults |
| d | | Convictions for non-contact sexual offences | Convictions for non-contact sexual offences |
| d | | Any stranger victims of sexual assaults | Any stranger victims of sexual assaults |
| a | | Number of prior sentencing dates | Prior sentencing occasions for anything |
| e | | Conviction for non-sexual violence prior to the Index Offence | Prior violent non-sexual sentencing occasion |
| f | | Conviction for non-sexual violence at the time of the Index Offence | Any prior involvement with the criminal justice system |
| f | | Ever lived with an intimate partner for two consecutive years | Any young, unrelated victims |
| f | | | Rate of sexual offences |
| f | | | Any community supervision violation |
| f | | | Arrests for sexual offences as both an adult and a juvenile |
| f | | | Years free prior to Index |

Note. Adapted from A. J. R. Harris and Hanson (2010). Static-99 and Static-2002 are identical with their "R" versions, with the exception of the cut-points and weights accorded to age.

a. Same definition, but different cut-points and weights.
b. Static-99 and RRASOR have the same definitions and same weights for prior sex offences, but Static-99 scoring includes the concept of "pseudo-recidivism" whereas RRASOR does not. Static-2002 has a different definition than the other measures.
c. Identical item across all three measures.
d. Identical item for Static-99 and Static-2002.
e. Similar concepts, different definitions.
f. Different items (no equivalent on the other scale).

Static-2002R

Static-2002 (Hanson & Thornton, 2003) was created with the aim of improving Static-99. Static-2002R is a 14-item actuarial measure that assesses recidivism risk of adult male sexual offenders. The items are identical to Static-2002 (Hanson & Thornton, 2003), with the exception of updated age weights (see Helmus et al., 2010). Important differences between Static-99 and Static-2002 are that Static-2002 added and altered some items, organized items into meaningful subscales to aid interpretation, and has more standardized coding rules. Static-2002 has a moderate relationship with sexual recidivism (weighted mean d = 0.70, 95% CI = 0.59 to 0.81, N = 3,330, k = 8; which translates to an AUC of .69, 95% CI = .66 to .72; Hanson & Morton-Bourgon, 2009). Previous research found that Static-2002 was more predictive of sexual, violent, and any recidivism than Static-99 (Hanson, Helmus, & Thornton, 2010; Stalans, Hacker, & Talbot, 2010).

A list of the items in the RRASOR, Static-99R, and Static-2002R is provided in Table 1. For further information on Static-99R and Static-2002R, see http://www.static99.org.

Samples

Multiple samples from diverse jurisdictions were used. Table 2 presents the main characteristics of each sample (k = 20, N = 7,491). All twenty samples had both RRASOR and Static-99R scores, but only 7 had Static-2002R scores. Most samples were drawn from Canada (k = 10) or the United States (k = 4), followed by single samples from Austria, Denmark, Germany, New Zealand, Sweden, and the United Kingdom. The current study examined three types of recidivism: sexual, violent (including sexual recidivism), and any recidivism. Of the 20 samples, 4 samples only reported sexual recidivism, 2 samples reported both sexual and violent recidivism, and 14 samples reported all three types of recidivism.

Each dataset was verified for internal inconsistencies (e.g., miscalculation of total scores or item scores contradicted by other information in the dataset). Identified errors were corrected if possible; otherwise, the case was deleted. Cases were also deleted under the following circumstances: missing follow-up information, any missing Static-99R item other than Ever Lived with a Lover (Item 2), more than one missing Static-2002R item, the offender was less than 18 years old at time of release or less than 16 years old when they committed the index offence, or if the offender was female. The age and gender exclusionary criteria are specified in the coding rules for Static-99 (A. J. R. Harris, Phenix, Hanson, & Thornton, 2003) and Static-2002 (Phenix et al., 2009). The new age item of Static-99R and Static-2002R was calculated from the verified datasets for each sample.

The number of participants in these samples was smaller than previously reported (e.g., Helmus, 2009) because (1) the date of birth or age of the offender at release was required to code the new Static-99R and Static-2002R age weights, and (2) the total scores of at least two of the scales included in this study had to be available in the dataset (e.g., Static-99 item scores were needed to calculate RRASOR total scores). The samples are described in detail in Helmus (2009; available from http://www.static99.org).

Overview of Analyses

All analyses were conducted separately by the first and third authors to ensure accuracy.

Predictive accuracy

The first set of analyses used fixed-effect and random-effects meta-analyses to compute the weighted areas under receiver operating characteristic curves (ROC AUC) and 95% confidence intervals for each risk instrument. The AUC is a measure of relative risk and can be interpreted as the probability that a randomly selected recidivist has a higher score on the risk instrument than a randomly selected non-recidivist. The AUC is useful for comparing results across samples because it is not influenced by recidivism base rates (Rice & Harris, 1995). It is, however, influenced by the variance in the distribution of scores used to predict recidivism (Hanson, 2008; Humphreys & Swets, 1991).
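The probabilistic interpretation of the AUC can be illustrated by comparing every recidivist with every non-recidivist. This is a minimal sketch with made-up scores; counting ties as half a "win" is the usual Mann-Whitney convention and is an assumption here, as is the function name `auc_pairwise`.

```python
def auc_pairwise(recidivist_scores, nonrecidivist_scores):
    """AUC as the proportion of recidivist/non-recidivist pairs in which
    the recidivist has the higher score (ties counted as 0.5)."""
    wins = 0.0
    for r in recidivist_scores:
        for n in nonrecidivist_scores:
            if r > n:
                wins += 1.0
            elif r == n:
                wins += 0.5  # assumption: ties split evenly
    return wins / (len(recidivist_scores) * len(nonrecidivist_scores))


# Hypothetical risk scores, for illustration only.
recidivists = [3, 4, 5]
nonrecidivists = [1, 2, 3]
print(auc_pairwise(recidivists, nonrecidivists))
```

Because the calculation uses only the rank order of scores within pairs, changing the proportion of recidivists in the sample does not by itself change the expected AUC, which is the base-rate property noted above.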

Fixed-effect estimates of the AUCs and standard errors were calculated using the formula and procedures presented in Hedges (1994). Fixed-effect analyses have the advantage of providing an estimate of between-study variability (i.e., Cochran's Q statistic; Hedges & Olkin, 1985). A significant Cochran's Q statistic indicates that there is more variability across studies than expected by chance (the Q statistic is distributed as a chi-square, with k – 1 degrees of freedom). In random-effects meta-analysis, the between-study variability is included in the error term, resulting in wider (and often more realistic) confidence intervals (Schmidt, Oh, & Hayes, 2009). The results of the random-effects and fixed-effect models therefore converge as the amount of between-study variability decreases (when Q is less than the degrees of freedom, the results are identical). Random-effects estimates were calculated using Formulae 10, 12, and 14 from Hedges and Vevea (1998).
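The relationship between the two models can be sketched in a few lines. The code below uses inverse-variance weighting and the method-of-moments estimate of between-study variance; it is offered as an approximation of the general approach, not as a reproduction of the exact formulae in Hedges (1994) or Hedges and Vevea (1998). The function names and the input values are hypothetical.

```python
import math


def fixed_effect(estimates, std_errors):
    """Inverse-variance weighted mean, its standard error, and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, estimates)) / total
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, estimates))
    return mean, math.sqrt(1.0 / total), q


def random_effects(estimates, std_errors):
    """Random-effects mean and standard error.

    Assumption: between-study variance (tau^2) is estimated by the
    method of moments and truncated at zero when Q <= k - 1.
    """
    _, _, q = fixed_effect(estimates, std_errors)
    weights = [1.0 / se ** 2 for se in std_errors]
    k = len(estimates)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)
    adj = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mean = sum(w * y for w, y in zip(adj, estimates)) / sum(adj)
    return mean, math.sqrt(1.0 / sum(adj)), tau2


# Hypothetical AUCs and standard errors from three samples.
aucs, ses = [0.60, 0.70, 0.80], [0.02, 0.02, 0.02]
print(fixed_effect(aucs, ses))
print(random_effects(aucs, ses))
```

With heterogeneous inputs such as these, Q exceeds its degrees of freedom and the random-effects standard error is wider; when Q is at or below k - 1, tau² is set to zero and the two models give the same result.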

Table 2. Descriptive Information of Samples

| Study | N | Release Period | Follow-up (SD)ᵃ | Recidivism Rate: Sexual | Recidivism Rate: Violentᵇ | Recidivism Rate: Any | Age (SD) | RRASOR M (SD) | Static-99R M (SD) | Static-2002R M (SD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Allan et al. (2007) | 492 | 1990-2000 | 5.7 (2.9) | 9.6 | 16.5 | 25.2 | 42.3 (12.2) | 1.4 (1.4) | 1.8 (2.3) | - |
| Bengtson (2008) | 308 | 1978-1995 | 16.2 (4.2) | 34.1 | 52.3 | 64.6 | 32.5 (10.4) | 1.8 (1.2) | 3.8 (2.4) | 4.6 (2.4) |
| Bigras (2007) | 457 | 1995-2004 | 4.6 (1.9) | 5.7 | 14.7 | 23.4 | 42.8 (12.0) | 1.3 (1.3) | 2.1 (2.4) | 3.5 (2.5) |
| Boer (2003) | 296 | 1976-1994 | 13.3 (2.1) | 8.8 | 23.3 | 48.3 | 41.2 (12.5) | 1.4 (1.2) | 2.8 (2.8) | 3.9 (2.7) |
| Bonta & Yessine (2005) | 133 | 1992-2004 | 5.5 (2.4) | 15.8 | 33.8 | 48.9 | 39.8 (9.6) | 2.7 (1.3) | 5.0 (2.1) | - |
| Brouillette-Alarie & Proulx (2008) | 228 | 1979-2006 | 9.9 (4.5) | 20.2 | 30.7 | - | 36.0 (10.2) | 2.1 (1.4) | 3.9 (2.3) | - |
| Cortoni & Nunes (2007) | 73 | 2001-2004 | 4.6 (0.6) | 0.0 | 8.2 | 12.3 | 41.6 (12.3) | 1.2 (1.0) | 2.2 (2.1) | - |
| Eher et al. (2008) | 706 | 2000-2005 | 3.9 (1.1) | 4.0 | 14.7 | 26.2 | 40.7 (12.6) | 1.2 (1.0) | 2.3 (2.3) | - |
| Epperson (2003) | 177 | 1989-1998 | 7.9 (2.5) | 14.1 | - | - | 37.2 (13.2) | 1.5 (1.2) | 2.5 (2.6) | - |
| Haag (2005) | 190 | 1995 | 7.0 (0.0) | 24.7 | - | - | 36.7 (9.7) | 2.0 (1.4) | 4.1 (2.2) | 5.7 (2.3) |
| Hanson et al. (2007) | 702 | 2001-2005 | 3.4 (1.0) | 8.1 | 16.4 | 27.9 | 41.6 (13.2) | 1.5 (1.2) | 2.4 (2.4) | 3.5 (2.5) |
| Harkins & Beech (2007) | 190 | 1994-1998 | 10.4 (1.1) | 14.2 | 21.1 | 36.3 | 43.3 (12.5) | 1.5 (1.3) | 2.2 (2.6) | 3.7 (2.8) |
| Hill et al. (2008) | 86 | 1971-2002 | 12.6 (6.6) | 15.1 | 29.1 | 61.6 | 39.4 (11.1) | 1.9 (1.0) | 4.7 (2.0) | - |
| Johansen (2007) | 273 | 1994-2000 | 9.1 (1.1) | 7.7 | 20.5 | 53.5 | 37.8 (10.8) | 1.8 (1.2) | 2.9 (2.3) | - |
| Knight & Thornton (2007) | 466 | 1957-1986 | 8.6 (2.6) | 26.2 | 36.9 | 53.0 | 36.1 (11.4) | 2.4 (1.3) | 4.6 (2.4) | 6.1 (2.5) |
| Långström (2004) | 1,278 | 1993-1997 | 8.9 (1.4) | 7.5 | 21.4 | - | 41.5 (12.0) | 0.8 (0.9) | 2.0 (2.4) | - |
| Nicholaichuk (2001) | 281 | 1983-1998 | 6.4 (4.0) | 18.5 | - | - | 34.8 (9.4) | 2.4 (1.4) | 4.8 (2.4) | - |
| Swinburne Romine et al. (2008) | 680 | 1977-2007 | 16.8 (7.8) | 13.8 | - | - | 38.2 (12.3) | 1.2 (1.1) | 1.7 (2.2) | - |
| Ternowski (2004) | 247 | 1994-1998 | 7.5 (1.0) | 8.1 | 15.4 | 19.8 | 43.9 (13.0) | 1.2 (1.2) | 1.6 (2.5) | - |
| Wilson et al. (2007 a & b) | 228 | 1994-2007 | 5.2 (3.0) | 10.5 | 25.9 | 35.5 | 41.7 (11.4) | 2.8 (1.5) | 5.1 (2.3) | - |
| Total | 7,491 | 1957-2007 | 8.3 (5.2) | 12.0 | 22.4 | 35.9 | 39.8 (12.2) | 1.5 (1.3) | 2.7 (2.6) | 4.3 (2.7) |

The Hanley and McNeil (1983) test of correlated ROC areas was used to test whether the risk instruments differed in their level of predictive accuracy. The Hanley and McNeil test requires the following: (1) the average AUC for the two risk instruments that are being compared, and (2) the average correlation between the two instruments being compared, computed separately for the recidivists and non-recidivists. The AUCs and average correlations were computed for each of the three recidivism types (sexual, violent including sexual, and any recidivism). Hanley and McNeil (1983) proposed the use of Kendall's tau (τ) correlation rather than the Pearson correlation. The τ is a rank correlation that represents the relationship between the ordering of the data when ranked by the two separate measures (i.e., for ordinal data). The τ therefore provides a more conservative test compared to the Pearson correlation, which assumes interval data. Table 1 in Hanley and McNeil (1983, p. 841) associates an overall correlation based on the average AUC (for the two measures being compared) and the average τ (between the measures for the recidivists and the non-recidivists). We will refer to this new correlation, derived from Hanley and McNeil's Table 1, as the overall average r. Standard errors for the differences between two AUCs (A1 – A2) were based on Hanley and McNeil's (1983) Formula 3:

$$SE(A_1 - A_2) = \sqrt{SE^2(A_1) + SE^2(A_2) - 2\,r\,SE(A_1)\,SE(A_2)}$$

where r is the overall average r, and SE(A1) and SE(A2) are the respective standard errors for the AUC of each measure. If the 95% confidence interval of the difference between measures included zero, the difference between the two scales was not statistically significant.
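The comparison can be sketched as follows, assuming the standard Hanley and McNeil (1983) expression for the standard error of a difference between correlated AUCs, sqrt(SE1² + SE2² - 2·r·SE1·SE2). The overall average r must still be looked up from their table; here it is simply passed in. The function name and all input values are hypothetical.

```python
import math


def auc_difference_ci(auc1, se1, auc2, se2, r, z=1.96):
    """Difference between two correlated AUCs with a 95% confidence interval.

    r is the overall average correlation (taken from the Hanley and McNeil
    lookup table in the study; supplied directly here as an assumption).
    """
    diff = auc1 - auc2
    variance = se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2
    se_diff = math.sqrt(max(0.0, variance))  # guard against rounding below zero
    return diff, se_diff, diff - z * se_diff, diff + z * se_diff


# Hypothetical inputs, for illustration only.
diff, se_diff, lower, upper = auc_difference_ci(0.70, 0.02, 0.65, 0.02, r=0.5)
significant = not (lower <= 0.0 <= upper)
print(diff, se_diff, lower, upper, significant)
```

A larger r shrinks the standard error of the difference, which is why using the smaller τ in place of the Pearson correlation yields the more conservative test.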

Incremental validity

Incremental validity was examined using Cox regression (Allison, 1984). Cox regression estimates relative risk ratios (hazard ratios) associated with one or more predictor variables from survival data with unequal follow-up times. Each sample was used as a stratum to allow separate baseline hazard functions (i.e., recidivism rates) for each value of the stratified variable, effectively removing from the analysis the base rate variability across samples.

Results

Predictive Validity

The predictive validity of the three scales was measured using AUCs. Appendix A presents the AUCs for the RRASOR, Static-99R and Static-2002R by sample. Tables 3 to 5 present the weighted AUC for each risk instrument and the Hanley and McNeil test. Static-99R and Static-2002R predicted sexual, violent, and any recidivism similarly, with no one scale displaying greater predictive accuracy (Table 3). Given that τ provides a more conservative test than the Pearson correlation, the analyses were also computed using Pearson correlations. The results were similar, with one exception: Static-2002R was significantly better than Static-99R in predicting any recidivism (difference with fixed-effect = 0.0133, 95% CI = 0.00275 to 0.0238; difference with random-effects = 0.0138, 95% CI = 0.00115 to 0.0265) using the Pearson correlation, but this difference was not found when using the τ correlation coefficient.

Table 4 presents the meta-analyzed AUC for the RRASOR and Static-99R. The Hanley and McNeil test found that Static-99R had significantly greater accuracy in predicting sexual, violent, and any recidivism than the RRASOR, with larger differences found for violent (including sexual) and any recidivism. The same pattern of results was found for the RRASOR and Static-2002R, with Static-2002R predicting sexual, violent, and any recidivism more accurately than the RRASOR (see Table 5).

The differences in predictive accuracy between scales were similar for both the fixed-effect and random-effects analyses. In addition, the differences in predictive accuracy between the scales were remarkably consistent across samples for the prediction of sexual and violent recidivism, as indicated by a nonsignificant Q. For any recidivism, the comparison between the RRASOR and Static-99R as well as the comparison between the RRASOR and Static-2002R had significant variability, indicating that the differences in predictive accuracy for these comparisons were inconsistent across the samples.

Incremental Validity

Tables 6 to 8 present the Cox regression analyses used to examine the incremental validity of the risk instruments for each recidivism type. For the prediction of sexual recidivism, risk instruments were found to add incrementally to one another despite large correlations between instruments, ranging between .70 and .92 (Table 6). The RRASOR and Static-99R each added incrementally to one another; Static-99R and Static-2002R each added incrementally to one another; and, finally, Static-2002R added incremental validity to the RRASOR but the RRASOR did not add incrementally to Static-2002R. In addition, entering all three risk instruments into a model found that Static-99R and Static-2002R added incrementally to the model, but not the RRASOR. Namely, adding the RRASOR after accounting for both Static-99R and Static-2002R did not significantly improve the predictive accuracy of the model (χ² change = 0.48, df = 1, p = .49).
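The p value attached to a chi-square change with one degree of freedom can be checked by hand. This sketch relies on the identity that a chi-square variable with df = 1 is a squared standard normal, so the upper-tail probability equals erfc(sqrt(χ² / 2)); the function name is hypothetical and the code applies only to df = 1.

```python
import math


def chi2_pvalue_df1(chi2_change: float) -> float:
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_change / 2.0))


# Chi-square change reported for adding the RRASOR to Static-99R and Static-2002R.
print(f"{chi2_pvalue_df1(0.48):.2f}")  # prints 0.49
```

The result matches the reported p = .49, well above the conventional .05 threshold, which corresponds to a chi-square change of about 3.84.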

For the prediction of violent (including sexual) recidivism, all three instruments added incremental information for all analyses. Of note, however, was that the incremental effect for the RRASOR was reversed – namely, low scores on the RRASOR were associated with higher rates of violent recidivism once the other scales were controlled for (see Table 7). In addition, a model that included all three risk instruments found significant incremental validity for each instrument (with low scores on the RRASOR predicting violent recidivism).

For the prediction of any recidivism, all comparisons found that the risk instruments added incremental validity to one another (see Table 8). Specifically, the RRASOR and Static-99R added incrementally to one another, Static-99R and Static-2002R added incrementally to one another, and, finally, the RRASOR and Static-2002R added incrementally to one another. As with the prediction of violent recidivism, higher scores on the RRASOR were associated with a lower probability of any recidivism. Lastly, a model that included all three risk instruments found significant incremental validity for each instrument (with low RRASOR scores predicting high rates of general recidivism).

To examine the practical importance of the incremental finding, participants were also sorted into risk categories (low, moderate, and high) based on a scale-independent definition of nominal risk categories suggested by Babchishin and Hanson (2009) for Static-99R and Static-2002R. Specifically, offenders with a score associated with less than half the rate of sexual re-offending of the typical offender (risk ratio < 0.50) were classified as "low-risk." Offenders with a score associated with more than half the rate of re-offending of the typical offender, but less than twice the rate of re-offending of a typical offender

Table 8. Incremental Validity of the Risk Instrument for Predicting Any Recidivism

A simple crosstab of the sexual recidivism rates by Static-99R and Static-2002R risk categories is presented in Table 9 to allow for a visual representation of the recidivism rates of offenders for whom the scales provide discordant results (when both instruments sort offenders into different risk categories). Recidivism rates for discordant groups were intermediate between the two adjacent risk categories. For example, when both instruments classified offenders as moderate risk, the observed recidivism rate was 10.7% (146/1,360), and when both instruments rated offenders as high risk, the observed rate was 34.4% (174/506). When one instrument classified the offender moderate and the other instrument classified the offender high, the observed sexual recidivism rate was 21.9% (73/334).
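The percentages quoted above follow directly from the reported counts. The short sketch below recomputes them and compares the discordant rate with the midpoint of the two concordant rates; the midpoint comparison is an added illustration, not an analysis reported in the study.

```python
def rate(recidivists, total):
    """Observed recidivism rate as a percentage."""
    return 100.0 * recidivists / total


both_moderate = rate(146, 1360)  # both scales classify the offender as moderate risk
both_high = rate(174, 506)       # both scales classify the offender as high risk
discordant = rate(73, 334)       # one scale says moderate, the other says high

# Midpoint of the two concordant rates, for comparison with the discordant rate.
midpoint = (both_moderate + both_high) / 2.0

print(f"moderate/moderate: {both_moderate:.1f}%")
print(f"high/high:         {both_high:.1f}%")
print(f"moderate/high:     {discordant:.1f}%")
print(f"midpoint:          {midpoint:.1f}%")
```

The discordant rate falls between the two concordant rates and within about one percentage point of their midpoint.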

Note. Sexual recidivism rates from all cases, not controlling for length of follow-up. Average follow-up = 8.0 years (SD = 4.9).

Discussion

The purpose of the present study was to examine the relative and incremental validity of three scales designed to predict recidivism among sexual offenders. The current study found that Static-99R and Static-2002R outperformed the RRASOR in the prediction of sexual, violent, and any recidivism. No differences in predictive accuracy were found between Static-99R and Static-2002R. Despite large correlations between the scales, they all added incremental validity to one another for predicting sexual, violent, and any recidivism, with one exception: the RRASOR did not add incremental validity to the prediction of sexual recidivism after controlling for Static-2002R. Interestingly, the RRASOR added incrementally in a negative direction for violent and any recidivism, with higher scores indicating lower risk.

The finding of incremental validity in the current study is truly remarkable given the substantial overlap in the items of these scales, and is in stark contrast with Seto (2005) who did not find incremental validity of similar risk scales (albeit using a much smaller sample). It would be easy to assume that the high correlations between risk scales would preclude incremental validity. Given substantial overlap in content, Vrieze and Grove (2010) assumed that discordant results between the measures would form "…a prima facie reason to disbelieve" either scale and would "…undercut each others' statuses as knowledge claims" (Vrieze & Grove, 2010, p. 388). The current findings suggest that Vrieze and Grove (2010) are only partially correct. Equally valid measures can give divergent results. Even when the items "look" similar, they can be related to recidivism through different causal mechanisms, a point we will return to later in the discussion.

A previous meta-analysis with seven of the datasets used in the current study found that Static-2002 outperformed Static-99 in predicting sexual, violent, and any recidivism (Hanson et al., 2010). Stalans and colleagues (2010) also found that Static-2002 outperformed Static-99 in predicting sexual recidivism. The reason for the lack of differences in predictive accuracy between the revised versions of Static-99 and Static-2002 in the current study (despite using the same samples as Hanson et al., 2010) can most likely be attributed to the updated age weights in the revised scales. The revised age weights notably increased the predictive accuracy of Static-99R, whereas a smaller improvement was found in Static-2002R (Helmus et al., 2010). As such, Static-99R and Static-2002R are more similar in predictive accuracy than the original scales. There were also differences in statistical analyses between Hanson and colleagues (2010) and the current study. Specifically, Hanson and colleagues (2010) used Pearson correlation coefficients to compute the Hanley and McNeil (1983) test, whereas the current study used Kendall's Tau correlation coefficients, which yield a more conservative test. Re-analyzing Hanson and colleagues' (2010) data using Kendall's Tau, however, did not alter the findings (i.e., Static-2002 still significantly outperformed Static-99). As such, the similarity in predictive accuracy for the revised versions of Static-99 and Static-2002 is likely due to the revised age weights of the scales rather than the method used to examine the difference in predictive accuracy.
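To illustrate why the choice of correlation coefficient matters for this comparison, the following is a rough sketch of a Hanley and McNeil style z test for two AUCs estimated on the same cases. It assumes the widely used large-sample approximation for the standard error of an AUC; this section does not state which standard error formula was used. The parameter `r` stands for the correlation between the two AUC estimates, which Hanley and McNeil (1983) derive from the between-scale correlations of the scores (the step at which Pearson or Kendall's Tau enters). All numeric inputs are hypothetical.

```python
import math


def auc_standard_error(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley-McNeil large-sample formula)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def correlated_auc_z(auc1, auc2, se1, se2, r):
    """z statistic for the difference between two AUCs from the same cases."""
    var_diff = se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2
    if var_diff <= 0:
        raise ValueError("variance of the difference must be positive")
    return (auc1 - auc2) / math.sqrt(var_diff)


# Hypothetical sample: 100 recidivists, 600 non-recidivists.
se_a = auc_standard_error(0.72, 100, 600)
se_b = auc_standard_error(0.68, 100, 600)
z_low_r = correlated_auc_z(0.72, 0.68, se_a, se_b, r=0.40)
z_high_r = correlated_auc_z(0.72, 0.68, se_a, se_b, r=0.60)
print(f"z with r = 0.40: {z_low_r:.2f}; z with r = 0.60: {z_high_r:.2f}")
```

A larger `r` shrinks the variance of the difference and inflates z, which is consistent with the text's description of the Pearson-based test as less conservative than the test based on Kendall's Tau.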

Item Weighting

The finding of incremental validity in the current study demonstrates that the original weighting of the items in the RRASOR, Static-99R, and Static-2002R was not optimal. Remarkably, the RRASOR was found to add incremental validity to Static-99R in the prediction of sexual recidivism, despite the fact that all the items of the RRASOR are included in Static-99R. (In fact, we used the items of Static-99R to calculate the RRASOR.) The incremental validity findings therefore cannot be attributed to new constructs being captured by the RRASOR, but to the different weighting of the items.

Our findings provide clear evidence that the weightings for actuarial scales are unlikely to ever be optimal. Given large enough sample sizes, the null hypothesis (finding no incremental validity) can almost always be rejected (Cohen, 1994). The refinement of weights, however, is a never-ending task requiring larger sample sizes for decreasingly small gains in precision. Test developers also need to be vigilant about over-fitting the data, as small adjustments rarely generalize to other datasets (Cureton, 1950). As well, complex weights reduce the practical ease of scoring and increase the risk of error; integers are relatively simple.

Although some progress in risk assessment can be made by improving item weights, we do not believe this will solve the most pressing problems of applied risk assessment. Instead, we believe the way forward involves increasing attention to the construct validity of prediction tools.

Construct Validity and Combining Multiple Risk Scales

Most psychological tests are designed to assess latent constructs, such as mental health and intelligence. As such, concordance among alternate measures of the same construct (e.g., different intelligence tests) is expected, and evaluators routinely average findings from multiple measures (Weiner, 2003). Such an averaging approach is based on the assumption of classical test theory that increasing the item pool should reduce sampling error and produce more reliable results (Nunnally & Bernstein, 1994). Evaluators who find concordance between measures have increased confidence in the results.
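The classical test theory argument can be made concrete with the Spearman-Brown prophecy formula, which is not named in the text and is used here only as a standard illustration of how reliability grows when a test is lengthened with parallel items. The starting reliability of .70 is a hypothetical value.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when a test is lengthened by a factor of k
    using items parallel to the originals (Spearman-Brown prophecy formula)."""
    if not 0.0 <= reliability <= 1.0 or k <= 0:
        raise ValueError("reliability must be in [0, 1] and k must be positive")
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Hypothetical scale with reliability .70, doubled and tripled in length.
for k in (1, 2, 3):
    print(k, round(spearman_brown(0.70, k), 3))
```

The formula assumes the added items measure the same construct as the originals, which is the same condition the text attaches to averaging across scales.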

The scores used for violence risk prediction, however, have often been selected on a purely empirical basis, with little attention to construct validity. Without knowing what is being assessed, it is difficult for evaluators to know how to combine the results of different risk tools. The preferred method of combination will depend on whether or not the scales are measuring similar or different constructs.

When scales sample items from the same domains and have similar relationships with the outcome (i.e., recidivism), then it is plausible to base conclusions on the average of the measures. For example, in the current study Static-99R and Static-2002R had similar contributions to the prediction of sexual recidivism and can be assumed to sample from, and give similar weights to, the same latent constructs. Despite a relatively small incremental effect between Static-99R and Static-2002R, there was a noticeable difference in the recidivism rates of discordant cases. Namely, when Static-99R and Static-2002R were discordant, there was an approximately 10% difference in observed recidivism rates, with the recidivism rates of the discordant cases being intermediate between the two respective risk categories. A 10% difference is similar in size to the effects found for most of the well established risk factors (e.g., any male victims, single, any unrelated victims; Hanson & Bussière, 1998).

When scales sample items from different domains, it is less clear how to combine their findings into one coherent judgment. When scales measure different constructs, it should not be a surprise that the scales rank offenders differently. Averaging two distinct scales may not be advisable, as it may result in decreased predictive accuracy compared to other methods of combining the results. For example, the RRASOR attributes more weight to sexual deviancy than Static-99 (Doren, 2004; Roberts, Doren, & Thornton, 2002), which includes items from the domains of sexual deviancy as well as general antisociality. The method of combining results from scales sampling different domains must therefore also consider (1) which domain(s) are being assessed by the scales and (2) how each of the domains is related to the outcome of interest (i.e., recidivism). In the current study, the RRASOR added incrementally to Static-99R, but in different directions depending on the recidivism type (i.e., positive incremental validity to Static-99R for sexual recidivism, but negative incremental validity for violent and any recidivism). The negative relationship of the RRASOR to violent and any recidivism suggests that subtracting the RRASOR from Static-99R would be a better method of combination than averaging. For sexual recidivism, however, where both scales add incrementally with positive weights, it is possible that an approach that adds or averages the scales together would be more accurate.

In summary, the method used to combine findings from risk scales assessing different domains necessitates the identification of what the scales are actually measuring. This, however, is not an obvious task. Despite all the items of the RRASOR being included in Static-99R, the two scales had opposite relationships with violent recidivism, once the other scales were controlled for. Consequently, it can be assumed (albeit post-hoc) that the two scales are sampling different domains. Identifying the constructs being measured requires both theory and empirical evidence; without such evidence, reliability between assessors concerning the latent constructs would be expected to be low.

Implications for Researchers

We believe the results of the current study should motivate further consideration of construct validity in the development of empirical risk prediction tools. Although it is possible to address the problem of combining multiple measures without understanding what they are measuring, a pure prediction approach to this problem has considerable limitations. Vrieze and Grove (2010), for example, have proposed creating a superscale, with existing scales treated as items in the superscale. Although such an approach is logically consistent, it is inefficient and impractical. Specifically, such a superscale would require all the same steps required when creating any new scale, such as generating a scoring manual and completing cross-validation. Given that many of the individual scales have identical or nearly identical items, evaluators would soon tire of the repetition and quickly look for ways of combining items rather than the total scores of diverse measures.

We believe that future research on risk assessment should focus on identifying and assessing the psychologically meaningful characteristics associated with recidivism (Mann, Hanson, & Thornton, 2010). For example, a single dimension or propensity (e.g., antisociality) would be composed of and influenced by several markers (e.g., unemployment, substance abuse, history of criminal behaviour, procriminal attitudes). Once valid measures of the core constructs have been assessed, researchers can examine the independent contribution of these dimensions. Following dimensional theory (Loftus, Oberg, & Dillon, 2004) risk factors could be weighted at the construct level (e.g., antisociality, sexual deviancy) and the weight allocated to each construct can depend on the type of recidivism being predicted (e.g., violence vs. sexual).

One advantage of such a conceptual actuarial measure would be that the subcomponents are defined and, consequently, evaluators could identify the reasons for an offender's score. Understanding what items are measuring would allow evaluators to explain inconsistencies in risk rating across measures, thereby helping inform the method of combining multiple risk scales. This task would, however, be difficult as it requires not only an understanding of the underlying constructs, but knowledge of how the specific items measure these constructs. Nevertheless, this type of task is essential given that the incremental addition of scales is most likely not limited to the three actuarial scales examined in this study. If the scales are created using a purely predictive approach, risk evaluators will continue to be faced with the knowledge that other variables (and scales) add incremental validity without being able to explain why. The direction forward for risk assessment combines empirical prediction with the construct validity tradition.

Implications for Current Practice

The current study did not find a clear superiority for either Static-99R or Static-2002R for the prediction of sexual, violent, or general recidivism (both scales were, however, superior to the RRASOR). Consequently, evaluators choosing between them would need to consider other criteria. For example, evaluators interested in estimating absolute recidivism rates may prefer Static-99R over Static-2002R because of the relatively large normative samples available (Helmus, 2009). For other assessments, Static-2002R may be preferable to Static-99R because the items are grouped into subscales (i.e., age, sexual deviancy, general criminality) that suggest the source of the risk. In high stakes situations, evaluators may want to use both measures: both scales add incremental validity to one another, with recidivism rates of discordant cases being intermediate between the rates suggested by the individual scales.

For violent recidivism, both Static-99R and Static-2002R can be used. Risk evaluators should be aware, however, that the item weighting of these scales is not optimal for the prediction of violent recidivism (i.e., too much weight allocated to items assessing sexual deviancy). As such, if the evaluation is primarily concerned with violent recidivism, we recommend scales designed for that purpose (e.g., VRAG, SORAG – Quinsey et al., 2006; Risk Matrix-2000v and Risk Matrix-2000c – Thornton et al., 2003). These measures have stronger weights for general criminality than the RRASOR, Static-99R, and Static-2002R.

In the current study, we presented the prediction weights (standardized regression coefficients from the Cox regression analyses) of the RRASOR, Static-99R, and Static-2002R for illustrative purposes only. We do not advocate the use of these weights in applied practice because they would likely be affected by overfitting (Cureton, 1950). Without further replication studies (with large sample sizes), the extent to which the weights found are accurate and generalizable is unknown.

In summary, when evaluators select scales that measure similar domains of risk factors (e.g., Static-99R and Static-2002R), an averaging approach is likely to be the optimal method of combining the findings of the multiple scales. Such an approach follows classical test theory, in that a greater number of items measuring the same construct pool (and having similar predictive accuracy) reduces measurement error and increases predictive accuracy. Consequently, concordance among scales would increase evaluators' confidence in the accuracy of the risk assessment. In contrast, if the selected scales are not sampling the same latent constructs, then the evaluators would require a defensible model concerning (1) the latent constructs measured by the scales, (2) how the domains relate to the outcome of interest, and (3) empirical evidence concerning how the constructs should be weighted and combined. In the absence of such an empirically supported model, it would be prudent for evaluators to privilege the scale in which they have the most confidence.

References

References marked with an asterisk indicate studies included in the meta-analysis.

Hanson, R. K. (1997). The development of a brief actuarial scale for sex offender recidivism (User Report No. 1997-04). Ottawa, ON: Department of the Solicitor General of Canada. Available from http://www.defenseforsvp.com/Resources/Hanson_Static-99/RRASOR.pdf

Hanson, R. K. (2008). What statistics should we use to report predictive accuracy. Crime Scene, 15(1), 15-17. Available from http://www.cpa.ca/cpasite/userfiles/Documents/Criminal%20Justice/Crime%20Scene%20200804.pdf

Hanson, R. K., & Thornton, D. (2003). Notes on the development of Static-2002. (Corrections Research User Report No. 2003-01). Ottawa, ON: Department of the Solicitor General of Canada. Available from http://www.publicsafety.gc.ca/cnt/rsrcs/pblctns/nts-dvlpmnt-sttc/index-eng.aspx

*Nicholaichuk, T. (2001, November). The comparison of two standardized risk assessment instruments in a sample of Canadian Aboriginal sexual offenders. Paper presented at the annual Research and Treatment Conference of the Association for the Treatment of Sexual Abusers, San Antonio, TX.

Appendix A

Table 1A. ROC Areas for the RRASOR, Static-99R, and Static-2002R by Sample

| Sample | N | Scale | Sexual ROC | Sexual 95% CI | Violent ROC | Violent 95% CI | Any ROC | Any 95% CI |
|---|---|---|---|---|---|---|---|---|
| Allan et al. (2007) | 492 | RRASOR | .70 | .62, .78 | .60 | .53, .67 | .57 | .51, .63 |
| | | Static-99R | .72 | .64, .80 | .69 | .63, .75 | .70 | .65, .75 |
| Bengtson (2008) | 308 | RRASOR | .61 | .54, .68 | .60 | .54, .67 | .59 | .52, .66 |
| | | Static-99R | .62 | .56, .68 | .66 | .60, .72 | .64 | .57, .70 |
| | | Static-2002R | .64 | .57, .70 | .66 | .60, .72 | .67 | .60, .73 |
| Bigras (2007) | 457 | RRASOR | .60 | .48, .72 | .54 | .47, .62 | .55 | .48, .61 |
| | | Static-99R | .71 | .60, .82 | .69 | .62, .75 | .69 | .64, .75 |
| | | Static-2002R | .70 | .59, .81 | .69 | .63, .75 | .71 | .65, .76 |
| Boer (2003) | 296 | RRASOR | .69 | .57, .81 | .62 | .55, .70 | .60 | .53, .66 |
| | | Static-99R | .75 | .65, .85 | .75 | .69, .81 | .79 | .74, .84 |
| | | Static-2002R | .73 | .63, .83 | .74 | .67, .80 | .81 | .76, .86 |
| Bonta & Yessine (2005) | 133 | RRASOR | .50 | .36, .64 | .47 | .36, .57 | .48 | .38, .58 |
| | | Static-99R | .64 | .52, .77 | .66 | .57, .76 | .65 | .56, .74 |
| Brouillette-Alarie & Proulx (2008) | 228 | RRASOR | .67 | .59, .75 | .60 | .53, .68 | - | - |
| | | Static-99R | .68 | .60, .76 | .69 | .62, .76 | - | - |
| Cortoni & Nunes (2007)a | 73 | RRASOR | - | - | .73 | .56, .90 | .64 | .49, .80 |
| | | Static-99R | - | - | .71 | .53, .89 | .70 | .54, .87 |
| Eher et al. (2008) | 706 | RRASOR | .75 | .65, .85 | .68 | .63, .74 | .63 | .58, .68 |
| | | Static-99R | .71 | .61, .82 | .76 | .71, .80 | .71 | .67, .75 |
| Epperson (2003) | 177 | RRASOR | .73 | .62, .84 | - | - | - | - |
| | | Static-99R | .78 | .67, .88 | - | - | - | - |
| Haag (2005) | 190 | RRASOR | .64 | .55, .74 | - | - | - | - |
| | | Static-99R | .70 | .61, .78 | - | - | - | - |
| | | Static-2002R | .67 | .58, .76 | - | - | - | - |
| Hanson et al. (2007) | 702 | RRASOR | .70 | .64, .77 | .63 | .57, .68 | .61 | .56, .66 |
| | | Static-99R | .76 | .69, .82 | .75 | .70, .80 | .76 | .72, .80 |
| | | Static-2002R | .75 | .69, .82 | .76 | .72, .81 | .76 | .73, .80 |
| Harkins & Beech (2007) | 190 | RRASOR | .73 | .64, .83 | .66 | .56, .75 | .62 | .54, .71 |
| | | Static-99R | .78 | .68, .88 | .75 | .67, .84 | .77 | .69, .84 |
| | | Static-2002R | .79 | .68, .89 | .77 | .69, .85 | .77 | .70, .84 |
| Hill et al. (2008) | 86 | RRASOR | .61 | .46, .76 | .58 | .45, .71 | .53 | .40, .65 |
| | | Static-99R | .67 | .52, .82 | .62 | .50, .75 | .61 | .48, .74 |
| Johansen (2007) | 273 | RRASOR | .63 | .52, .75 | .61 | .53, .69 | .57 | .50, .64 |
| | | Static-99R | .65 | .53, .77 | .71 | .63, .79 | .72 | .66, .78 |
| Knight & Thornton (2007) | 466 | RRASOR | .62 | .56, .68 | .59 | .53, .64 | .56 | .51, .61 |
| | | Static-99R | .62 | .56, .67 | .64 | .59, .69 | .63 | .58, .68 |
| | | Static-2002R | .63 | .58, .69 | .64 | .59, .69 | .64 | .59, .69 |
| Långström (2004) | 1,278 | RRASOR | .72 | .66, .78 | .65 | .61, .68 | - | - |
| | | Static-99R | .73 | .68, .79 | .78 | .75, .81 | - | - |
| Nicholaichuk (2001) | 281 | RRASOR | .67 | .60, .75 | - | - | - | - |
| | | Static-99R | .74 | .67, .81 | - | - | - | - |
| Swinburne Romine et al. (2008) | 680 | RRASOR | .62 | .55, .68 | - | - | - | - |
| | | Static-99R | .63 | .57, .70 | - | - | - | - |
| Ternowski (2004) | 247 | RRASOR | .61 | .48, .74 | .60 | .50, .70 | .61 | .52, .69 |
| | | Static-99R | .75 | .66, .85 | .76 | .69, .84 | .77 | .70, .84 |
| Wilson et al. (2007 a & b) | 228 | RRASOR | .63 | .51, .75 | .52 | .43, .60 | .50 | .42, .58 |
| | | Static-99R | .57 | .43, .71 | .62 | .54, .70 | .60 | .52, .68 |

Note. All cases were used to compute AUC values, not controlling for length of follow-up. a No AUC value for sexual recidivism could be computed for Cortoni and Nunes (2007) because there were no sexual recidivists in that sample.