
Risk Assessment

Summary and Keywords

From rudimentary conceptions of risk in the late 18th century, risk assessment slowly evolved toward a more multifaceted conceptualization of risk and progressed to more sophisticated methods to calibrate offender risk levels. This story largely involves the struggles in criminology and applied agencies to achieve a successful “science-to-practice” advancement in risk technologies to support criminal justice decision making. This has involved scientific measurement issues such as reliability, predictive validity, construct validity, and ways to assess the accuracy of predictions and to effectively implement risk assessment methods. The urgent call for higher predictive accuracy from criminal justice policymakers has constantly motivated such change. Over time, the concept of risk has fragmented as diverse agencies, including pretrial release, probation, courts, and jails, have sought to assess specific risk outcomes that are critical for their policy goals. Most agencies are engaged in both risk assessment and risk reduction, with the latter requiring a deeper assessment of explanatory factors. Currently, risk assessment in criminal justice faces several turbulent challenges. The explosive trends in information technology regarding data access, computer memory, and processing speed are combining with new predictive analytic methods that may challenge the currently dominant techniques of risk assessment. A final challenge is that there is, as yet, insufficient standardization of risk assessment methods; nor is there a common language or set of definitions for offender risk categories. Thus, recent proposals for standardization are examined.

There has been some terminological confusion over several aspects of risk assessment in criminal justice. This confusion stems partly from the fact that multiple disciplines have joint interests in risk assessment and differences in their methods, theories, and policy goals. Thus, it is important to first discuss several terminological issues of recent interest in criminal justice risk assessment.

Basic Terminology Linked to Risk Assessment

Terminology for Risk Factors: An important paper by Kraemer, Kazdin, Offord, Kessler, Jensen, and Kupfer (1997) offered a new set of terms associated with the concept of risk factors in social sciences in order to add precision to the risk assessment area. Monahan and Skeem (2014), in turn, clarified these terms with regard to sentencing and criminal justice risk assessment—as follows:

• Fixed marker: This term refers to a factor that is linked to crime and is unchangeable. For example, an offender’s gender and age at onset of crime are fixed and cannot be changed. Yet, both markers are strong predictors of future antisocial behavior. The criminal justice literature often uses the term “static risk factor” to identify features of a person’s criminal history that are fixed, such as age at first arrest.

• Variable marker: This term refers to a factor that can change but cannot be changed by means of any intervention. For example, many predictive factors can change but can only increase over time (e.g., age, number of prior incarcerations).

• Variable risk marker: This factor can fluctuate and is also changeable by means of interventions; yet some such markers may not have been explicitly shown to have a causal relationship to antisocial outcomes (e.g., vocational skills). As Monahan and Skeem (2014) point out, randomized controlled trials have produced only sparse evidence to reliably establish causal risk factors in the criminal justice field. As Kraemer et al. (1997) noted, future advances in the field may change the status of a variable risk marker based on appropriate experimental designs.

• Causal risk factor: This is commonly understood as a risk factor that is changeable through interventions and has also been clearly linked to changes in outcomes. An example would be to show through a group randomized trial that inmates assigned to a cognitive-behavioral therapy (CBT) intervention showed more improvement on measures of criminal thinking, resulting in less recidivism compared with inmates assigned to a control group that did not have this intervention.

Static and Dynamic Variables: These terms have become widely accepted among criminal justice practitioners (Ward, 2015; Taxman & Caudy, 2015). Broadly, static factors (e.g., age at first offense, gender) are seen as unchanging and thus unable to be targets for change. The term overlaps with the term “fixed marker.” Dynamic factors, in contrast, are seen as changeable and as likely targets for interventions (e.g., self-efficacy, vocational resources). Many dynamic risk factors are often referred to by practitioners as criminogenic needs and are seen as being strongly related to crime production and as important targets for interventions. However, among some academics, the dynamic/static distinction has been criticized as ambiguous and as potentially undermining the clarity of purpose needed by criminal justice (Gottfredson & Moriarty, 2006). Meta-analytic studies have shown that both static and dynamic risk can have predictive power for recidivism, and both are frequently included as predictive factors in risk and needs assessments to improve predictive accuracy (Andrews, Bonta, & Wormith, 2006; Brennan, Dieterich, & Ehret, 2009; Taxman & Caudy, 2015). Criminal history factors—both static and dynamic (e.g., age at first offense, total convictions, prior incarcerations)—are routinely collected in criminal justice databases and are widely used in risk assessment models.

Bonta’s Generations of Risk and Need Assessment (RNA): A second terminological feature within the risk and need assessment literature in criminal justice is Bonta’s well-known framework (1996) for describing stages of risk assessment approaches. He proposed four broad “generations” of risk and needs assessments: (1G)—unaided clinical judgments that are basically driven by the personal or professional judgment of the decision maker with access to case data, but who is unaided by actuarial tools; (2G)—actuarial risk assessments driven mostly by summing scores on a specific set of criminal history factors and a few social or behavioral factors; (3G)—assessments that use multiple-factor and theory-guided RNA scales, including both static and dynamic factors, and utilize actuarial methods to assess risk; and (4G)—fully automated software systems that integrate validated RNA, static and dynamic factors, major needs, and strengths data, with a seamless integration of multiple predictive models and case management variables, within an efficient web-based software system. However, Bonta’s approach appears increasingly obsolete given the many rapid methodological developments that do not easily fit into this model—particularly the incorporation of machine learning (ML) methods into RNA.

The Risk–Need–Responsivity (RNR) Principles: These terms are now widely accepted as principles that provide a cornerstone framework linking risk and need assessment to treatment planning by criminal justice practitioners. The framework emerged from a series of outcome and meta-analytic studies undertaken to identify strong predictors and features of successful programs (Andrews, Bonta, & Wormith, 2006). However, the RNR framework also has its critics who question its theoretical foundations and caution that more research is needed to fully establish its efficacy (Ward & Beech, 2015; Skeem, Steadman, & Manchak, 2015). The following are the current meanings of these principles.

Risk Principle: This principle asserts that the intensity of services and supervision levels should be matched to a person’s risk level for reoffending. In many evidence-based outcome studies, such matching was found to significantly reduce recidivism for correctional populations. In contrast, it may be harmful to impose intensive interventions on low-risk groups since this may increase the likelihood of further crime, as well as wasting treatment resources.

Need Principle: This principle holds that the primary targets for interventions should be the main criminogenic needs or causal risk factors found for an offender. Decades of research have identified several major criminogenic needs (Andrews, Bonta, & Wormith, 2006). As noted earlier, these are commonly seen as changeable, predictive of crime, and potentially potent in reducing the risk of further criminal behaviors. The principle asserts that treatment and case management goals should prioritize the identified criminogenic needs of each offender and address these using appropriate programs and supervision levels.

Responsivity Principle: This principle emphasizes the need for carefully matching supervision and treatment services to the offender’s individual factors such as learning style, reading ability, motivation, pattern of needs, and practical accessibility to programs. Two types of responsivity have been discussed: (1) general responsivity and (2) specific responsivity.

General responsivity is based on findings that certain interventions appear generally effective for most offenders regardless of the type of crime. For example, social learning approaches and cognitive-behavioral therapies have been found effective with most offenders. Related interventions such as prosocial modeling, skills development, and problem-solving skills have all been shown to be generally effective and relevant for most offenders.

Specific responsivity requires a far more complex individualistic matching of different offenders to different kinds of programs and services. It raises the more complex question of what programs work for what kinds of offenders and under what circumstances. Thus, different sets of case management services may be tailored to the personal risk and needs patterns and circumstances of individual offenders. It may require revisions of standard interventions to effectively align these services to the offender’s specific gender, situation, strengths, motivation, or cultural characteristics.

Key Contextual Issues That May Influence the Development of Risk Assessment

Shifts in Sentencing Policies and New Legislation That Will Impact Risk Assessment: In what Hanson (2015) describes as a “remarkable change,” the last decade in the United States has seen many state legislative bodies introducing legislation requiring validated risk assessments to be available for sentencing deliberations. These legislative directives make it clear that risk assessment alone is not meant to determine sentences but should be considered as one factor among other relevant legal and background issues. This requirement for science-based risk assessment is motivated by concerns over public safety, high postsentence recidivism, fiscal pressures, desire for the transparency of risk assessment methods, and their higher predictive accuracy compared to human judgment. Such legislation is likely to accelerate the use of risk assessment across the criminal justice system. It will also serve to place risk assessment within the contested issue of sentencing for both the sanctioning and crime reduction components of sentencing.

The Rapidly Escalating Volume and Complexity of Criminal Justice Data: The increasing volume and complexity of criminal justice data have several implications for risk assessment. The social, psychological, and biosocial sciences, including criminal justice research, are widely acknowledged to be extremely complex (Richters, 1997; Mitchell, 2009). This is partly shown by the rise in the complexity and dimensionality of information that parallels the shift from 2G to more recent 3G and 4G RNA assessments (Taxman & Caudy, 2015; Ward, 2015). The expanding volume of data and their increasing complexity are driven by factors such as the following:

- The ever-present demands for increased predictive accuracy

- User demands for comprehensive coverage of “all relevant factors”

- The information requirements of the RNR principles

- Calls for more theoretically informed RNA instruments

- The ongoing discovery of new risk and need factors

- The need to go beyond the low explanatory and treatment relevance of static criminal history factors

Advances in Data Synthesis Between Disconnected Criminal Justice Databases: The last decade has witnessed a nationwide trend in the United States of integrating the many previously fenced-off databases of jails, courts, prisons, probation, and other agencies using web-based connectivity. Electronic synthesis of these multiple databases should enhance the speed, access, and depth of relevant data elements, all of which are designed to improve predictive risk assessment.

Automated Data Collection and Data Streaming Using Electronic Sensors: Criminal justice is in the early stages of an explosion of automated and efficient data-gathering technologies for continuous longitudinal streaming. These procedures use web-based data communication software linked to low-cost electronic sensors to provide continuous data gathering from widely placed data sources (e.g., individual offenders, ankle bracelets, cell phones, motor vehicles, single jail cells, kiosks). This development, in the parlance of the information technology world, is often referred to as the Internet of Things (IoT) and is a component of “big data.” This development has the potential to substantially improve risk prediction by providing longitudinal continuous data that can be used for monitoring, early detection, and prediction of individual pathways to recidivism. Analytical tools geared for data streaming, such as multivariate time series and anomaly detection, have already emerged to determine baseline levels, undulations over time, and early warning systems (Molenaar, 2006).
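The early-warning idea behind such streaming analytics can be sketched as a rolling-baseline anomaly detector: flag any reading that deviates sharply from the recent past. This is a minimal illustration with invented readings and an arbitrary threshold; operational systems would use richer multivariate time-series models.

```python
# Minimal sketch of an early-warning check over streamed sensor readings:
# maintain a rolling baseline and flag readings far outside it.
# Readings and threshold are illustrative, not from a real monitoring system.
from collections import deque
from statistics import mean, pstdev

def flag_anomalies(stream, window=5, z_threshold=3.0):
    baseline = deque(maxlen=window)  # rolling window of recent readings
    flags = []
    for t, value in enumerate(stream):
        if len(baseline) == window:
            m, s = mean(baseline), pstdev(baseline)
            # flag readings that deviate sharply from the recent baseline
            if s > 0 and abs(value - m) / s > z_threshold:
                flags.append(t)
        baseline.append(value)
    return flags

# e.g., nightly distances from an ankle-monitor home zone (hypothetical units)
readings = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 9.5, 1.0, 1.2, 1.1]
alerts = flag_anomalies(readings)  # indices of anomalous readings
```

The flagged time points could feed a supervision officer's early-warning queue rather than trigger automatic decisions.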

Confusion and Linguistic Chaos from a Proliferation of Risk Assessment Systems: A crisis of language and communication in risk assessment has emerged in criminal justice as a result of several decades of creative anarchy leading to a proliferation of diverse offender risk assessment systems. Across different agencies, the numerous risk assessments differ in choices of risk factors, numbers of categories, methods to identify categories, and naming systems (Brennan, 1987; Gottfredson & Moriarty, 2006; Hanson et al., 2017). Recent national studies of risk assessment and classification in prisons found that this proliferation involves many “one-off” or unique risk assessment models developed for specific jurisdictions or single institutions and based on diverse local offender populations (Hardyman et al., 2002; Byrne & Dezember, 2017). Austin and McGinnis (2004) commented on the “linguistic chaos” they found across the nation’s prisons. Thus, how an offender is classified for risk may still largely depend on what agency receives the offender. Unfortunately, recent surveys have found that this confusion still exists as various agencies continue to invent their own local offender risk assessment systems, terminologies, and definitions of risk categories (Hanson et al., 2017; Byrne & Dezember, 2017).

Basic Steps in Developing Effective Risk Assessments

Acceptable risk assessments for criminal justice should be developed to meet several performance standards and to avoid design flaws. The following steps, broadly sequential, outline the main challenges in designing new risk assessments:

Step 1. Clearly specify what “risk” is to be predicted, for what population sample, over what time period

A first task is to be specific about the exact risk outcome to be predicted by the risk assessment model. For example, a risk assessment may focus on any of the following outcomes: general recidivism, violent recidivism, parole failure, return to prison following release, risk of absconding, risk of probation violation or parole revocation, and so on. Unfortunately, agencies are often ambiguous as to the specific risk to be predicted. A second equally critical problem is to clearly conceptualize the specific offender population and the time period to be considered. All subsequent steps depend on these selections. Prediction equations may differ substantially based on the choice of a specific risk, the offender target population, and the length of outcome period to be addressed.

Agencies should be particularly clear about the performance requirements they hope to achieve from the risk assessment they adopt. Many performance requirements are either scientifically or legally required. Some key performance requirements are as follows:

Predictive, Concurrent, and Criterion Validity: Predictive validity is demonstrated when a risk assessment accurately predicts the outcome it was designed to predict (e.g., a new violent felony). Typically, a risk assessment is first administered to an offender, and after a follow-up interval, also referred to as the “time at risk,” the criminal activities of the offender are collected and evaluated. All risk factors used in calculating the risk scale must be measured first. Information for the outcome (e.g., violent offense) is often finalized at the end of the follow-up time. For predictive validity studies, the outcome is collected after the time at risk has elapsed. However, for concurrent validity, the risk predictors and the outcome may be collected simultaneously, often in a single testing session. Concurrent validity can be claimed if the risk assessment and its separate risk factors correlate significantly with the outcome of concern. Criterion validity is the umbrella term for both, referring to the correlation between a risk assessment and measures of the specific outcome (e.g., new violent felonies following release from prison).
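The logic of a predictive validity check can be sketched as follows: baseline risk scores are correlated with a binary recidivism outcome observed after the time at risk. This is a minimal illustration with made-up scores and outcomes, using the point-biserial correlation as one common index of association.

```python
# Minimal sketch of a predictive validity check: correlate baseline risk
# scores with a 0/1 recidivism outcome collected after the time at risk.
# Scores and outcomes are illustrative, not real data.
from statistics import mean, pstdev

def point_biserial(scores, outcomes):
    """Point-biserial correlation between continuous scores and a 0/1 outcome."""
    n = len(scores)
    m1 = mean(s for s, o in zip(scores, outcomes) if o == 1)  # mean, recidivists
    m0 = mean(s for s, o in zip(scores, outcomes) if o == 0)  # mean, desisters
    p = sum(outcomes) / n                                      # base rate
    s = pstdev(scores)
    return (m1 - m0) / s * (p * (1 - p)) ** 0.5

scores   = [2, 3, 5, 6, 7, 8, 4, 9]   # risk scores at assessment
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]   # recidivism within the follow-up period

r = point_biserial(scores, outcomes)
```

A significant positive correlation on data collected in this temporal order supports predictive validity; the same calculation on simultaneously collected data would instead bear on concurrent validity.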

Calibration: This term refers to the level of agreement between the predicted probabilities from a risk assessment model compared to the observed probabilities obtained from an external or independent sample of the population of interest. If these correlations are high, a risk instrument is considered well calibrated. In such comparisons, the observed outcome achieved in the independent or follow-up sample is often referred to as the “gold standard.” A popular statistic for assessing calibration is the Hosmer–Lemeshow test (Oliver, Dieterich, & Brennan, 2014).
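As an illustration of the calibration idea, the following sketch bins cases by predicted probability and compares observed with expected event counts per bin, in the spirit of the Hosmer–Lemeshow test. The probabilities and outcomes are invented; a real evaluation would use decile bins on a large independent sample and compare the statistic to a chi-square distribution.

```python
# Minimal Hosmer–Lemeshow-style calibration sketch: sort cases by predicted
# probability, bin them, and compare observed vs. expected event counts.
# Probabilities and outcomes are illustrative, not from a real instrument.

def hosmer_lemeshow(probs, outcomes, n_bins=4):
    pairs = sorted(zip(probs, outcomes))          # sort by predicted probability
    size = len(pairs) // n_bins
    stat = 0.0
    for b in range(n_bins):
        group = pairs[b * size:(b + 1) * size] if b < n_bins - 1 else pairs[b * size:]
        obs = sum(o for _, o in group)            # observed events in the bin
        exp = sum(p for p, _ in group)            # expected events in the bin
        n = len(group)
        denom = exp * (1 - exp / n)               # variance-style denominator
        if denom > 0:
            stat += (obs - exp) ** 2 / denom      # chi-square contribution
    return stat  # compare to chi-square with n_bins - 2 degrees of freedom

probs    = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.15, 0.85, 0.35, 0.65]
outcomes = [0,   0,   0,   1,   0,   1,   1,   1,   0,    1,    0,    1  ]

hl = hosmer_lemeshow(probs, outcomes)
```

A small statistic (relative to the reference chi-square distribution) indicates good agreement between predicted and observed risk; a large one signals miscalibration.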

Construct Validity: This facet of validity is concerned with whether a risk scale actually measures what it purports to measure. It is more typically used with reference to specific risk factors that drive the risk assessment instrument (e.g., peer associations, impulsivity) or to the overall risk assessment instrument. Construct validity is established by examining whether a specific risk correlates positively with other measures of the same construct or with other theoretically linked constructs. For example, a self-esteem scale should correlate highly with any alternative measure of self-esteem, whereas a risk assessment for future violence should correlate highly with any other risk instrument designed to predict future violence. Construct validity is usually established cumulatively by finding several significant positive correlations with other instruments to which the proposed instrument or scale should be theoretically correlated. Construct validity may change over time as the theoretical understanding of a construct evolves, often leading to new ways to assess the measure (Messick, 1995).

Clinical Validity/Practical Utility—Does a Risk Assessment Provide Useful Guidance in Treating and Managing Offenders?: Clearly, while a first requirement of a risk assessment is to achieve predictive accuracy in establishing risk, a second critical performance requirement is to provide staff users with useful information to guide security placements, case plans, and treatments. However, most risk assessments that follow the narrow 2G design typically eliminate need factors in an effort to be as brief as possible. Thus, these 2G instruments often have little to offer regarding understanding and treating offenders. Attempts to understand the offender and design treatment plans are more generally seen as the purpose of needs assessment instruments as used in the RNR principles and in later 3G and 4G RNA systems such as the Level of Service Inventory (LSI; Andrews et al., 2006) or the Correctional Offender Management Profiles for Alternative Sanctions (COMPAS; Brennan et al., 2009).

Face Validity: This form of validity assesses whether a risk assessment instrument is understandable, appears coherent, and is consistent with common sense. It can be pivotal in criminal justice agencies for achieving acceptance among users. Face validity implies that the meaning of a scale is transparent and can be quickly and intuitively understood. Risk assessment scales for criminal justice users should attempt to achieve face validity by excluding any items that have counterintuitive prediction directions. For example, a common counterintuitive finding is that a current serious violent felony conviction often has minimal association with future violence. Among ML methods, criminal justice users may not readily accept neural networks since many of them are “black-box” methods that often lack transparency or cannot offer a clear logic to explain their predictive results.

Generalizability—Are Risk Assessments Transportable across Jurisdictions?: Can a risk assessment model developed in one jurisdiction retain its predictive performance if it is adopted by another jurisdiction? An “ideal” in criminal justice is to have risk assessments for specific risk outcomes (e.g., violent felonies, parole revocation) that retain their predictive validity and general meaning across diverse jurisdictions or agencies. Generalizability is established by careful cross-validation of the risk assessment instrument on independent samples of the same target population, ideally from other institutions or geographical areas. Any preexisting risk assessment instrument that an agency is adopting should be cross-validated on a local sample before being implemented to test whether it retains its predictive power.

The hazards of assuming generalizability were shown in a classic study of a probation risk assessment instrument developed in Wisconsin that had become widely implemented as a national “model system,” with strong support from the National Institute of Corrections. Wright, Clear, and Dickson (1984) tested the predictive validity of the instrument in New York City but found that it could not replicate its predictive validity. This is not an uncommon finding and underscores the danger of adopting any new risk instrument without appropriate independent validation on new samples.
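A local cross-validation check of an adopted instrument can be sketched as follows: score a local sample with the existing tool, then evaluate its discrimination against observed local outcomes, here via the AUC (the probability that a randomly chosen recidivist outscores a randomly chosen non-recidivist). The scores and outcomes are illustrative.

```python
# Minimal sketch of a local validation check before adopting an instrument
# developed elsewhere. Scores and outcomes are illustrative, not real data.

def auc(scores, outcomes):
    """AUC: chance that a random recidivist outscores a random non-recidivist."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    # count pairwise wins, crediting ties at half weight
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

local_scores   = [3, 7, 5, 9, 2, 8, 4, 6]   # adopted instrument, local sample
local_outcomes = [0, 1, 1, 1, 0, 0, 0, 1]   # observed local recidivism

local_auc = auc(local_scores, local_outcomes)
# adopt the instrument only if discrimination holds up on the local sample
```

If the locally observed AUC falls well below the value reported in the instrument's original development study, the agency has evidence that the model does not transport and needs local revalidation or renorming.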

Interrater reliability (IRR): Reliability is an important performance criterion for risk assessment instruments. IRR assesses the agreement between two or more assessors when they use the same risk assessment tool to rate an offender’s risk level. In such studies, the assessors should base their ratings on exactly the same information such as videotaped interviews of offenders that can be presented to different assessors to test their levels of agreement. Agreement statistics can be calculated at both the raw item level and the subscale level, or for the overall risk instrument (e.g., high–medium–low agreement). IRR is primarily reported as an intraclass correlation. A satisfactory IRR score would be a correlation of 0.75 or greater. An important related measure is the standard error of measurement (SEM), which indicates the degree of confidence one should have in the risk assessment. However, IRR depends not only on the assessment instrument but also on many organizational factors such as the appropriateness of staff training, work overload, stressfulness of work conditions, and supervisor competence. Thus, IRR for the same instrument may differ between institutions to the extent that these conditions vary between institutions.
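The one-way intraclass correlation can be sketched as follows, using invented ratings in which rows are offenders and columns are raters; values near 1 indicate close agreement. (Two-way ICC variants also exist; the one-way random-effects form is shown for simplicity.)

```python
# Minimal sketch of a one-way random-effects intraclass correlation (ICC)
# for interrater reliability. Ratings are illustrative, not real data.
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects ICC for an n-subjects x k-raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(r for row in ratings for r in row)
    row_means = [mean(row) for row in ratings]
    # between-subjects and within-subjects mean squares (one-way ANOVA)
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((r - m) ** 2
              for row, m in zip(ratings, row_means) for r in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Two raters scoring five offenders from the same videotaped interviews
ratings = [[4, 5], [2, 2], [8, 7], [6, 6], [1, 2]]
icc = icc_oneway(ratings)
```

In this invented example the ICC comfortably exceeds the 0.75 benchmark noted above, so the two raters would be judged to agree satisfactorily.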

Internal Consistency Reliability: Internal consistency does not typically refer to the total risk assessment scale. It is more often applied to each specific risk factor or needs scale used as inputs to a risk assessment. The goal is to ensure that the various questions in each input scale are highly correlated with each other and are measuring a single meaningful construct (e.g., impulsivity, poverty, low self-esteem). Internal consistency is critical for any subscale that purports to measure a single unidimensional risk or need. In corrections, this form of consistency is generally assessed using Cronbach’s Alpha coefficient, which was designed to reflect such mutual interitem correlations within a scale. Alpha scores of around 0.50 are unacceptable, indicating that the questions in a scale are incoherent, with no clear meaning. An alpha of at least 0.60 or higher is minimally required in social sciences. Scores of 0.70 and above are viewed as satisfactory. Internal consistency is not typically applied to the overall risk assessment but rather is mostly calculated only for each separate subscale.
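Cronbach's alpha for a single subscale can be computed as sketched below, with invented item responses; alpha rises as the items correlate more strongly with one another.

```python
# Minimal sketch of Cronbach's alpha for one needs subscale. Each row is a
# respondent, each column an item; responses are illustrative, not real data.
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a table of rows (respondents) x columns (items)."""
    k = len(items[0])
    item_vars = [pvariance([row[j] for row in items]) for j in range(k)]
    total_var = pvariance([sum(row) for row in items])
    # alpha compares summed item variances to the variance of the total score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Four respondents answering a hypothetical three-item impulsivity subscale
responses = [[3, 4, 3], [1, 1, 2], [4, 5, 5], [2, 2, 2]]
alpha = cronbach_alpha(responses)
```

Here the invented items move together across respondents, so alpha lands well above the 0.70 satisfactory threshold; incoherent items would drag it toward the unacceptable range.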

Test–Retest Reliability: An important form of reliability is whether a risk assessment gives consistent results over a given time period (e.g., two weeks). When a sample of offenders is tested and then later retested with the same assessment tool, similar and highly correlated results are expected. This approach, however, relies on the absence of any confounding factors that may occur in the intervening time period. For example, the test–retest approach may be misleading in criminal justice if certain key events occur in the intervening period (e.g., exposure to a traumatic event or exposure to inmates holding strong antisocial attitudes). Also, if the time interval is too short, subjects may remember the questions, and this may influence their responses. Thus, test–retest reliability in a corrections setting may be compromised, so that other methods, such as internal consistency, are often jointly assessed.
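A test–retest check reduces to correlating the two administrations, as in this sketch with invented scores for six offenders assessed twice.

```python
# Minimal sketch of test–retest reliability: correlate scores from the same
# offenders at two administrations. Scores are illustrative, not real data.
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

time1 = [10, 14, 8, 20, 12, 16]   # initial administration
time2 = [11, 13, 9, 19, 12, 17]   # retest some weeks later

r = pearson_r(time1, time2)
```

A high correlation supports stability, but, as noted above, only if no confounding events occurred between administrations and the interval was long enough to limit memory effects.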

The process of selecting appropriate risk factors is largely guided by the preceding scientific performance requirements. If these requirements are well specified, this should help to select an effective set of risk and classification factors. Some specific approaches are as follows:

The Traditional BOGSAT Method: This acronym stands for “bunch of guys sitting around a table.” Risk factors selected by this approach dominated the traditional custody and security point-scale risk assessments that were widely used in corrections until the early 1980s and are still often used. For example, a committee of experienced jail or prison officers, treatment staff, or other knowledgeable experts would rely on personal experience (often considerable) to select apparently “relevant” risk factors for their agency. The resulting risk scales emphasized conventional correctional wisdom, such as violent current offenses, and often had minimal guidance from empirical research and paid little attention to whether the chosen factors played any role in theories of antisocial behavior. Such scales often included factors with little predictive validity. Fortunately, this approach has largely died out.

Theory-Guided Selection of Predictive Risk Factors: Most best-practice reviews emphasize the benefits of a “theory-guided” explanatory framework for selecting risk factors (Sechrest, 1987; Jones, 1996). The theory-guided approach is rare, although it is shown in some RNA instruments such as the LSI (social learning theory) or COMPAS (e.g., strain theory, control theory, routine activities theory, social learning). Each major criminological theory will include a core set of hypothesized explanatory variables. Many of these explanatory variables, such as antisocial peers, which are central to social learning theory, have been empirically confirmed in theory testing studies. Theory-based factors have the advantage of offering coherent explanatory arguments to clarify how a particular factor X potentially explains criminal behavior and, in turn, may be helpful in suggesting interventions. This reflects Lewin’s (1931) well-known statement, “There’s nothing more practical than a good theory.”

Empirical Research to Discover Correlates of Criminal Behavior: Empirical methods reflect the data-driven approach by using several techniques to identify factors that are significantly correlated with antisocial behavior. The predominant method uses cross-sectional samples and bivariate correlational or regression techniques to identify such factors. These studies may only establish the concurrent validity of the factors. Additionally, an increasing number of large-sample theory-guided longitudinal prospective studies have been conducted that can shift the analysis toward identifying potentially causal and predictive factors. A typical study uses a large sample of offenders who have been released from jail or prison and then are followed up over time to establish any specific factors that may differentiate those who remain crime free from those who recidivate. Another kind of study, termed “known-groups” validity, is used to contrast preselected categories of offenders (e.g., violent recidivists versus nonviolent) in order to discover factors that differentiate the categories. The technique of meta-analysis has also been used to summarize the results of the huge numbers of specific studies that have examined risk factors that correlate with antisocial behavior. Experimental studies of program effectiveness using strict design requirements for experimental and control groups also offer a useful way to identify factors that have predictive ability.

A Brief Review of Major Risk Factors including Meta-Analysis Findings: A voluminous literature exists on RNA and risk predictors for crime and delinquency. A detailed examination is beyond present purposes. The meta-analytic approach is a statistical procedure designed to combine and summarize the vast body of research studies on recidivism as a way of systematically identifying the most powerful predictive risk factors (Gendreau, Little, & Goggin, 1996). The following factors have emerged from meta-analytic studies as significant correlates of antisocial behaviors.

- History of antisocial behavior: This history includes age at first arrest, number of prior arrests and incarcerations, seriousness of crimes, and other criminal history factors.

- Antisocial personality factors: These factors include sensation-seeking, impulsivity, aggressive behavior, callous disregard for other people, low self-control, and others. This personality pattern is similar to the psychopathy construct of Hare (2003) and the low self-control construct in the General Theory of Crime (Gottfredson & Hirschi, 1990). The psychopathy construct is currently being reformulated, resulting in several recent approaches to its assessment.

- Antisocial cognitions: Offenders may use several attitudes and beliefs to justify their behaviors, make excuses for them, or minimize feelings of guilt. These may include blaming the victim, blaming the situation, citing the unfairness and corruption of the police and courts, complaining of unfair or unwarranted sentencing, asserting that the crime was minimal and that “nobody was really hurt,” and so on.

- Antisocial associates: Indicators of this risk factor may include time spent with antisocial companions who indulge in drugs, crime, and sensation-seeking lifestyles and who may have prior arrests or convictions.

- Family and marital features: This factor focuses on quality of parenting in the family of origin and in the current family if applicable (e.g., care, nurturance, neglect, levels of discipline and supervision, family violence, substance abuse, antisocial behaviors). A related theme is family disorganization, as reflected by broken homes, unstable housing, divorce, and others.

- Educational and vocational factors: These may include indicators such as failure/success in school and work settings, negative relations with teachers or work supervisors, satisfaction with school or work, dropout, failure to maintain employment, and lack of job skills.

- Leisure and recreation: Leisure can be oriented toward prosocial activities (e.g., organized sports, music, study, church). Alternatively, it may focus on drugs, looking for action and excitement, time spent with antisocial peers, and participation in high-risk antisocial adventures.

- Substance abuse: This well-established risk factor includes excessive use of alcohol or drugs. Prior arrests or prior treatment for alcohol or drugs may also indicate problems in these areas.

Several strength or protective variables are often used in RNA instruments. In some cases, they may be seen as the alternative or opposite pole of these high-risk factors. Protective factors may act to reduce or buffer the person against the risks of violence or antisocial behavior. They have not been studied to the same extent as risk factors; however, identifying and understanding protective factors may be as important as identifying risk factors (Lösel & Farrington, 2012). The following have been seen as protective factors:

Biological Factors: Finally, biological factors are not widely used in criminal justice risk assessment procedures. However, this is a very active research arena, and recent texts such as Adrian Raine’s (2013) exploration of “the anatomy of violence” have identified several biosocial factors with important implications for advancing risk assessment.

Design flaws in risk assessments can occur if any of the major sampling, cross-validation, or norming tasks are handled poorly. The key sampling issues for designing risk assessment models are as follows:

Development, Testing, and Norming Samples: Specific kinds of samples are required for several key purposes related to the initial development of RA instruments and for testing and norming a new RA model on fresh data. These are as follows:

Development or Derivation or Training Sample: This is a sufficiently large sample of the specific offender population for which the statistical model is being developed. This sample establishes the basic, but as yet untested, predictive risk assessment model. Both the nature and size of this sample are critical. It must represent the kind of offenders, or target population, for which the predictive model is being developed and on which it will be used. In many cases, a random sample from the total agency population may be appropriate. In other cases, a more specific sample, perhaps focusing on a particular kind of crime or a specific policy-relevant target population, is required (e.g., homicides, white-collar offenders). In effect, the “case mix” of the development sample must be carefully aligned with the intended target population for which the risk assessment instrument is being developed. Depending on the disciplinary orientation of the risk assessment developers, this sample may also be known as the training or derivation sample.

Test or Validation Sample, and Follow-up Samples: While a derivation or development sample is used to derive or “train” a new risk assessment, a separate test or cross-validation sample is employed to assess whether the new risk model retains its predictive accuracy when used on an independent validation or follow-up sample. A common strategy uses a “split sample” cross-validation in which the development sample is randomly split into two subsamples. The first subsample is used to derive the new risk assessment model, and the second is used to test its predictive accuracy. This approach has several variants depending on the splitting ratios and numbers of splits. The follow-up validation strategy can involve a single development sample to train the new predictive model, followed by a follow-up of this same sample for a substantial “time at risk” (e.g., three years). At the end of the follow-up period, the obtained outcomes from the follow-up data are correlated with the risk predictions of the development sample to assess how well the training model’s predictions fit the real outcomes. Typically, the risk assessment from any development or training sample will not fit new validation data as well as it fitted the original data. A cross-validation test on an independent sample indicates the degree of “shrinkage,” or erosion of predictive accuracy, of the new risk assessment on the new sample. Such tests are mandatory when developing any new risk assessment in criminal justice.
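The split-sample procedure can be sketched in a few lines. This is a toy illustration on synthetic data (scikit-learn is an assumed dependency; the eight risk factors, sample size, and outcome model are all invented), not any agency's actual validation protocol:

```python
# Split-sample cross-validation sketch: train on a development subsample,
# then measure how much predictive accuracy "shrinks" on the held-out half.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 8))                       # eight hypothetical risk factors
true_logit = 0.9 * X[:, 0] + 0.6 * X[:, 1] - 0.4 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)  # binary recidivism outcome

# Randomly split into development (derivation) and test (validation) subsamples
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_dev, y_dev)
auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
shrinkage = auc_dev - auc_val                     # erosion of accuracy on fresh data
print(f"development AUC={auc_dev:.3f}  validation AUC={auc_val:.3f}")
```

The gap between the development and validation AUCs is an estimate of the shrinkage that an independent sample would reveal.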

Norming Samples, Normative Scoring and Decile Cut-Points: A further task for any new risk assessment model is to obtain a large representative sample of the agency’s offender population to be used as a norming study. In many cases, the developmental sample can be used for this norming process. Norming is critical for aligning the offender’s raw risk assessment scores to the overall score distribution of the institutional norming sample. Distribution statistics for the norming sample can then be calculated (e.g., mean, standard deviation, z-scores, deciles, quartiles). The overall distribution of risk scores for the norming population provides a stable frame of reference against which any new offender can be compared to clarify the degree to which an offender is higher or lower than other offenders on a standard metric such as z-scores or deciles. These standard statistics provide simple benchmark scores for the total offender population and for any important subpopulations (e.g., male and female, racial subgroups). These standard scores give the standing or rank order of each individual offender compared to the overall offender population. For example, decile scores subdivide the total frequency distribution of the norming sample into ten groups of equal size and are typically scored one through ten. Alternatively, the exact probability of failure can be calculated for all offenders as another standard metric of risk. Traditionally, most criminal justice institutions appear to prefer the less precise ordinal rankings such as high, medium, and low, thus preferring simplicity over precision for these scores (Hanson et al., 2017).
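The norming arithmetic can be illustrated with a short sketch; the raw-score scale and the norming sample here are simulated, not taken from any real agency population:

```python
# Norming sketch: derive z-scores and decile cut-points from a norming sample,
# then place a new offender's raw score against those benchmarks.
import numpy as np

rng = np.random.default_rng(7)
norming_scores = rng.normal(loc=20, scale=5, size=5000)   # raw risk scores, norming sample

mu, sigma = norming_scores.mean(), norming_scores.std()
decile_cuts = np.percentile(norming_scores, np.arange(10, 100, 10))  # nine interior cut-points

def standardize(raw_score):
    """Return the z-score and decile (1-10) for a new offender's raw score."""
    z = (raw_score - mu) / sigma
    decile = int(np.searchsorted(decile_cuts, raw_score)) + 1
    return z, decile

z, decile = standardize(27.0)
print(f"z-score={z:.2f}, decile={decile}")
```

An agency preferring ordinal labels could simply collapse the deciles (e.g., 1-4 low, 5-7 medium, 8-10 high), trading the precision noted above for simplicity.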

Determining the Effective Sample Size and Providing a Power Analysis: Sampling issues are critical in developing risk assessments in criminal justice. For risk assessment models that use binary outcomes, sample size adequacy is determined not by the total number of cases in a sample but by the “effective sample size,” which is understood as the minimum of the number of outcome events (e.g., parole revocations) or nonevents (no revocations) in the sample that occur during the follow-up. The effective sample size is critical in influencing the precision of predictive risk measures such as the area under the receiver operating characteristic curve (AUC). As noted earlier, the AUC is a measure of predictive accuracy and has become widely used in assessing the predictive performance of risk assessment systems. Scores vary from 0.50, representing a worthless RA instrument, to a score of 1.00, representing perfect accuracy. Most good risk assessment instruments in criminal justice score in the area of 0.67 to 0.73 and ideally higher. A power analysis should always be carried out before a validation study to provide realistic estimates of how large the needed sample must be. Several factors are required to estimate the size of the development sample. These include the conjectured AUC (or other accuracy measures); the expected base rate of failure (depending on the sample, event of interest, and follow-up length); and the precision that is wanted for the AUC—or other point estimate—of predictive performance. When reporting results, the confidence interval for the point estimate of whatever measure of discrimination is used should also be provided.

A sufficiently large effective sample size is required if the goal is to detect differences in AUC or calibration between initial risk scale development and validation test sample results. It is also important for testing whether a new validation result differs from a previously published result or from a minimum standard (e.g., an AUC = 0.70). For example, Vergouwe, Steyerberg, Eijkemans, and Habbema (2005) used simulation methods to estimate the minimum effective sample size needed for a risk assessment to detect differences in AUC and calibration under various scenarios. Based on the results of their simulation study, the authors suggested that samples used for external validation should include a minimum of 100 events and 100 nonevents.
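As a rough precision check, the widely used Hanley and McNeil (1982) approximation for the standard error of an AUC can translate an effective sample size into a confidence interval. The conjectured AUC of 0.70 below is an assumption chosen purely for illustration:

```python
# Hanley-McNeil (1982) standard-error approximation for an AUC,
# used here to show the precision implied by a given effective sample size.
import math

def auc_ci(auc, n_events, n_nonevents, z=1.96):
    """95% confidence interval for an AUC, given counts of events and nonevents."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_events - 1) * (q1 - auc**2)
           + (n_nonevents - 1) * (q2 - auc**2)) / (n_events * n_nonevents)
    se = math.sqrt(var)
    return auc - z * se, auc + z * se

# Vergouwe et al.'s suggested minimum of 100 events and 100 nonevents:
lo, hi = auc_ci(0.70, 100, 100)
print(f"AUC = 0.70, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Under this approximation, 100 events and 100 nonevents yield a 95% interval of roughly ±0.07 around an AUC of 0.70; agencies wanting tighter intervals need larger effective samples.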

Step 5. Select a method to combine risk factors into a risk decision

A further critical task is to choose a method to effectively combine the separate risk factors into a single score to delineate classification risk levels. There are several popular approaches in criminal justice for making this choice, as follows.

Unstructured Clinical Assessments: These assessments largely conform to Bonta’s 1G method; that is, they use the traditional clinical judgment of an expert clinician or other trained decision makers who will gather a body of case information using an open-ended clinical interview, typically augmented by a record search. This expert then engages in a personal internal deliberation process to synthesize the information before reaching a risk decision. This cognitive process is often quick and intuitive, and the human decision maker is not always able to articulate the exact underlying process. Another common feature of this approach is the relatively high level of “confidence” with which expert decision makers regard their decision.

Structured Professional Judgment (SPJ): This approach uses a formal list of well-established risk factors, including many of those listed in Step 3, to structure an assessment interview. The twenty or more factors are not always numerically scored, or they may be scored only as low, moderate, or high (Webster, Douglas, Eaves, & Hart, 1997). The method uses no explicit algorithm to combine the factors into a risk score. Instead, the clinician again intuitively combines the specified factors into a risk-level judgment. This approach estimates risk only for individual offenders based on their personal pattern of risk factors and avoids using “risk group” profiles as in the actuarial approach. SPJ appears to be effective in increasing the quality, transparency, and reliability of decision making. Additionally, the rater or decision maker may introduce other factors on a case-by-case basis.

Unweighted Linear Additive Point Scales: This method simply sums several unweighted risk factors into a single risk score for an offender. Thus, if an offender is scored as having five risk factors out of a possible ten factors, and each factor is scored as 1 or 0, then his or her overall scale score is 5 points. Cut-points are applied to such scales to demarcate high-, medium-, and low-scoring risk levels; these cut-points are developed either logically through policy arguments or by an actuarial method. This simple approach remains common in many current security point classifications for jails and prisons (Brennan, Wells, & Demory, 2004). It remains popular owing to its simplicity, transparency, and minimal time or skill demands on staff. Many studies have shown that this simple additive method performs just as well as multiple regression actuarial methods in predicting recidivism (Wainer, 1976).
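A minimal sketch of such a scale follows; the factor names and cut-points are purely illustrative and not drawn from any operational instrument:

```python
# Unweighted additive point scale: sum 1/0 risk indicators, then apply
# policy cut-points to map the total score to an ordinal risk level.
RISK_FACTORS = ["prior_arrests", "age_at_first_arrest_under_16", "substance_abuse",
                "antisocial_peers", "employment_problems", "school_dropout",
                "antisocial_attitudes", "family_conflict", "impulsivity",
                "prior_supervision_failure"]

def score_offender(factors_present):
    """Sum the 1/0 indicators and map the total to a risk level via cut-points."""
    total = sum(1 for f in RISK_FACTORS if f in factors_present)
    if total <= 3:
        level = "low"
    elif total <= 6:
        level = "medium"
    else:
        level = "high"
    return total, level

total, level = score_offender({"prior_arrests", "substance_abuse",
                               "antisocial_peers", "employment_problems",
                               "impulsivity"})
print(total, level)   # 5 points -> "medium"
```

The transparency is evident: any staff member can recount the points by hand, which is part of why such scales persist in jail and prison classification.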

Weighted Linear Scales—The Multiple Regression Family of Methods: Multiple regression (MR) produces statistical linear additive point scales in which each risk factor is given an MR-generated scoring weight that partially reflects its predictive importance. MR is of value in specifying weights that seek to optimize the predictive accuracy of the total set of risk factors. Decision makers are typically allowed discretionary adjustments or overrides to take account of aggravating or mitigating factors. The use of discretionary overrides, however, is strongly contested (Quinsey et al., 1998), and there are mixed findings in the recent literature on whether they raise or lower errors. A recent powerful numerical method for developing predictive risk assessment models is the lasso (“least absolute shrinkage and selection operator”; Hastie, Tibshirani, & Friedman, 2009), which fits a linear model as MR does. However, the lasso excels in efficiently handling huge numbers of potential risk factors and in identifying irrelevant noise factors that can be safely dropped from the analysis (Oliver, Dieterich, & Brennan, 2014). This parsimony improves the interpretability of the final linear model.
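A small illustration of lasso-style selection, using L1-penalized logistic regression from scikit-learn (an assumed stand-in for the lasso fitting machinery) on synthetic data in which three real risk factors are buried among seventeen noise factors:

```python
# Lasso-style variable selection: the L1 penalty drives the coefficients of
# irrelevant noise factors exactly to zero, leaving a sparser, more
# interpretable linear model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1500
X = rng.normal(size=(n, 20))                     # columns 0-2 are real factors; 3-19 are noise
true_logit = 1.0 * X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])            # factors surviving the penalty
print("columns retained:", kept.tolist())
```

With a sufficiently strong penalty, most noise columns receive exactly zero weight while the true predictors are retained, which is the parsimony property noted above.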

The next challenge is to evaluate the predictive performance of the selected risk assessment instrument. This choice has become difficult for criminal justice managers owing to the proliferation of performance measures, each capturing a different facet of performance. A safe choice in most studies is the AUC (area under the curve) coefficient from signal detection theory (SDT). It is the most widely used general measure of the discriminative power of risk assessment instruments—although it is not without controversy (Hand, 2009). Its interpretation is simple. An AUC of 0.50 indicates chance performance, no better than flipping a coin. AUC values close to or over 0.70 are considered acceptable in criminal justice. A value of 1.0 indicates perfect performance, with no predictive errors. In recent studies, most widely used risk assessment systems in criminal justice have achieved AUC scores ranging from 0.67 to 0.75, depending on the kind of offenders and the age and gender breakdowns (Duwe & Kim, 2015; Brennan, Dieterich, & Ehret, 2009). Given its wide use and relatively simple interpretation, there are advantages in using the AUC together with one or two of the more traditional measures of specific facets of predictive performance of interest to the agency. These include sensitivity, specificity, overall percent correctly classified (TCP), positive predictive power (PPP), negative predictive power (NPP), and the false positive and false negative rates. All of these measures are relevant for public safety policy decisions in corrections. If discrimination between risk levels on percent failures is of key importance to policymakers, a measure such as the Dispersion Index for Risk (DIFR) may be useful (Silver, Smith, & Banks, 2000).
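These traditional measures are simple ratios computed from a 2×2 table of predicted versus actual outcomes. The counts below are invented for illustration:

```python
# Traditional performance measures from a hypothetical 2x2 confusion matrix
# (predicted high risk vs. actual recidivism; all counts invented).
tp, fp, fn, tn = 160, 90, 40, 710

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
ppp = tp / (tp + fp)                         # positive predictive power
npp = tn / (tn + fn)                         # negative predictive power
accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall percent correctly classified
false_positive_rate = 1 - specificity
false_negative_rate = 1 - sensitivity

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPP={ppp:.2f} NPP={npp:.2f} accuracy={accuracy:.2f}")
```

Note how the facets diverge: here overall accuracy is high while positive predictive power is modest, which is why agencies are advised to report several measures rather than one.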

Selected Controversies and Dilemmas

Discretionary Adjustments/Overrides—Should They Be Allowed?

Most correctional agencies allow a discretionary adjustment or override for staff decision makers who disagree with the formal risk assessment score. Agencies typically require a written justification whenever an override is used, as well as a supervisor’s review of the aggravating or mitigating factors cited to justify it. The discretionary adjustment thus gives the human decision maker the final say in accepting or rejecting the risk assessment result, and most criminal justice decision makers strongly prefer to retain it. Senior decision makers (e.g., judges, parole boards) often see actuarial risk methods as inferior to their own expert judgment. Yet several decades and dozens of research studies have generally found the opposite: actuarial predictive methods are more accurate than expert human judgment across a variety of professional fields (Grove & Meehl, 1996; Garb, 1998). For example, Quinsey et al. (1998) reported that the discretionary option often reduces predictive accuracy by introducing more errors. On this basis, they concluded that discretionary adjustments should not be allowed and may even be unethical. Separately, to keep risk assessment out of the sanctioning component of sentencing, some jurisdictions have introduced policy rules limiting its role to the crime reduction component. The use of discretionary adjustments thus remains contested in criminal justice.

In some criminal justice jurisdictions, policymakers may see certain subpopulations as requiring their own separate normative scores (e.g., male/female, ethnic, or racial groups). Separate norms may be important if a particular group is believed to be systematically different from the overall offender population. In such cases, separate norming studies can be conducted to provide group-specific normative scores based on samples of those subpopulations (e.g., male and female offenders, or distinct institutional populations such as prison, jail, and probation populations). A carefully selected, large sample is typically used to ensure that the parameters for the selected normative population (e.g., mean, standard deviation, z-scores, decile scores, or cut-points) are strictly based on a representative norming sample for that agency or jurisdiction. Norm-referenced assessments focus primarily on rank ordering, or other key parameters, for the specified population.

The norming process can be highly political, and agencies may come under attack from advocates of specific subgroups if there is suspicion that scoring procedures and the resulting normative score distributions of the overall population are inappropriate or biased against specific subgroups (e.g., women or minorities). This may set off a never-ending proliferation of separate norms for a variety of ethnic or racial groups, male and female groups, and so forth. Such proliferation of norms for progressively smaller subgroups may become confusing for both program staff and the agency (Wormith, 2001). However, this issue has advocates on both sides and intermittently flares across different jurisdictions.

The Challenge of Standardization and a Lingua Franca for Criminal Justice Risk Assessment and Classification: Coordination or Chaos?

In a broad challenge to the mental health fields, Kendell (1975) argued that any scientific discipline must choose between developing standardized and unified classification systems for its cases or facing a potentially chaotic diversity of uncoordinated diagnostic systems. He comes down squarely in favor of striving for a standardized classification system that might evolve into a lingua franca for the profession, supporting meaningful identification and communication among practitioners, researchers, and policymakers. The alternative is a continuation of the myriad separate state and local risk classifications found in criminal justice.

As noted earlier, the need for a standardized risk assessment in criminology and criminal justice has been contested. For example, Hardyman et al. (2002), in a national review of prison risk assessment for security levels, appear to condone the current fragmentation by concluding that “there is no best model … nor should there be one.” This claim challenges both the value and the possibility of standardized risk classification for prison security risks. Hardyman et al. argued that no standard model is possible, given their findings of diverse resources, cultures, and architectures among prison systems. The ongoing lack of standardization and the resulting communication problems were confirmed in a recent national survey of risk assessment in U.S. prisons (Byrne & Dezember, 2017). Admittedly, national standardization is a difficult challenge, as seen in the heated debates and constant revisions of the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders. That system was designed to offer a common language and standard criteria for the classification of mental disorders, and it brought order and greater reliability to diagnostic decisions in the mental health professions. In criminal justice, some federal agencies (e.g., the National Institute of Corrections) have, since the late 1980s, supported a decades-long nationwide training and technical assistance program to encourage jails to adopt a standardized approach to offender security/risk classification. Yet, three decades later, the nation’s jails and prisons have had ongoing implementation problems and, despite some successes, remain far from the desired levels of standardized risk assessment for security purposes (Brennan, 1999; Byrne & Dezember, 2017).

A Current National-Level Attempt to Achieve Standardized Language for Risk Classification

In 2014, the United States Bureau of Justice Assistance (BJA) and the National Reentry Resource Center (NRRC), in their role as the primary sources of professional guidance, initiated an effort to achieve standardized terminology for risk levels. This effort convened several meetings of international experts on RNA—including researchers from multiple disciplines, scientists, policymakers, and correctional practitioners. The goal was to develop standard categories to communicate risk and needs, regardless of the specific instrument used by an agency. A recently released white paper reported the progress on this project (Hanson et al., 2017). It is beyond present purposes to provide full details, so only the main recommendations are summarized here.

The paper proposed a five-class standardized model for offender risk categories, described and linked to implementation strategies based on the risk–need–responsivity (RNR) principles. Each category defined a prototypical risk and needs profile to inform case planning, formulate differential programming, and estimate recidivism outcome rates. Interestingly, consensus was not reached on names/labels for the five categories, and each was simply labeled with the Roman numerals I through V. Level I corresponds to the lowest risk of reoffending, whereas Level V corresponds to the highest (Hanson et al., 2017). The categories are as follows:

Level I: These offenders have few, if any, criminogenic or noncriminogenic needs, and any identified needs may be minimal or transitory. They have several psychological, interpersonal, and lifestyle strengths. Custody (prison or jail) is seen as counterproductive and may increase the risk of recidivism. These offenders are likely to comply with community supervision, so minimal monitoring is warranted. The prognosis for this group is that reoffending rates would be similar to the offending rate of the average nonarrested citizen (less than or equal to 5% over three years). These persons were seen as most likely to desist from criminal behavior with or without a correctional intervention.

Level II: These offenders may have one or two criminogenic needs that are transitory or acute. They have several psychological, interpersonal, and lifestyle strengths and should respond well to services. The estimated reoffending rate at 19% is higher than that at Level I. Long-term custody was not seen as appropriate, given the negative effects of incarceration. These offenders are expected to comply with community supervision. Traditional case management is appropriate to monitor compliance and program participation and should focus primarily on short-term interventions, problem solving, and community services. With appropriate services, Level II cases are expected to shift to Level I over a short time (e.g., six months) with its lower recidivism rate. Desistance appears likely for Level II when their criminogenic needs are addressed.

Level III: These offenders are seen as being in the midrange of the risk distribution of correctional populations. Their profiles may show multiple psychological, interpersonal, and lifestyle criminogenic needs and some noncriminogenic needs (e.g., mental health). Although they have several strengths, their multiple needs may undermine the protective effects of these resources. Reoffending is likely to reach about 40% over two years. Some short-term custody is envisioned, mainly for risk management purposes. This level is expected to benefit from community supervision to monitor compliance. Other services should focus on prosocial change and reduction of criminogenic needs, with secondary attention to noncriminogenic needs. With successful interventions, about half of these offenders may reach lower reoffense rates similar to those of Level II offenders (19% over two years). However, these offenders are expected to continue their criminal careers for the next three to seven years.

Level IV: These offenders will reflect multiple criminogenic needs on all risk-relevant psychological, interpersonal, and lifestyle domains. These are typically paralleled by several chronic noncriminogenic needs. The group is likely to have had multiple prior incarcerations. Their two-year reoffending rate was estimated at 65% or more. Further incarceration may be warranted to ensure both community safety and treatment engagement. Lengthy in-custody programs and services (200–300 hours) should focus on reducing criminogenic needs. This may be followed by postrelease community treatment and monitoring. A reduction of reoffending (about 10%) is expected if these cases receive a sufficient dosage of appropriate treatments. Successful rehabilitation, however, typically involves gradual life changes over a long period of time (10+ years) and desistance with aging.

Level V: This category reflects multiple criminogenic needs in all risk domains, several noncriminogenic needs, and multiple prior incarcerations. The estimated reoffending rate may exceed 85% over two years. Custody is often appropriate for community safety. These offenders’ high propensity for crime may warrant highly structured and substantial change-focused supervision and treatment (e.g., over 300 hours). Change-focused services should occur in secure prerelease settings, where these persons can ideally demonstrate positive behavioral and attitudinal change while building social and vocational resources. Postrelease, this group will require intensive community supervision and close monitoring. Reduced reoffending may occur only gradually, with sufficient dosages of evidence-based strategies. Only with advancing age (50+) will these offenders fall to the low reoffending base rate of Level II.

A perhaps overly optimistic assumption of this project is that agencies will not have to develop any new risk and need assessments. The expectation in the white paper is that agencies can continue using their current risk and need assessment, and that these will suffice for assigning offenders to these five standard categories. This assumption was based on field testing in several agencies, suggesting that these standard category descriptions were recognizable in most agencies. This approach clearly will need further investigation to demonstrate appropriate assignment accuracy.

The Emergence of Risk Assessment for Women

A further challenge to risk assessment that gathered force during the 1990s concerned the practice of using identical risk and need assessment tools for male and female offenders. Most mainstream theories of crime—and the risk assessments based on them—assumed that males and females had similar pathways to crime and could be assessed and understood using the same risk factors. The LSI exemplified this gender-neutral (GN) approach. However, a number of feminist criminologists challenged this assumption (Bloom et al., 2003; Reisig, Holtfreter, & Morash, 2006; Hannah-Moffat, 2004; Holtfreter & Morash, 2003; Morash et al., 1998; Van Voorhis & Presser, 2001). They identified various deficiencies of the LSI regarding women offenders and sought to bring attention to specific risk factors of particular relevance for women (Holtfreter, Reisig, & Morash, 2004). These have become known as gender-responsive (GR) factors. Reisig et al. (2006) found that the LSI appeared ineffective for predicting recidivism among a substantial segment of women, particularly those characterized by GR risk factors. An overall concern was that GN assessments, by omitting GR factors, would be incapable of targeting relevant programs and services for women (Belknap & Holsinger, 2006; Bloom et al., 2003; Hannah-Moffat, 2009; Van Voorhis et al., 2010).

These concerns motivated an impressive body of research on GR factors during the 1990s. The influential work of Daly (1992, 1994) formulated a compelling typology of five distinct pathways of women offenders, each defined by a specific pattern of GR risk factors (Brennan et al., 2012; Wattanaporn & Holtfreter, 2014). The Daly studies converged with other classic, mostly qualitative studies (Bloom, 1996; Chesney-Lind & Shelden, 2004; Owen, 1998) to identify key factors that appeared to differentiate women’s pathways to crime from those of men.

A landmark event in the late 1990s was the funding by the National Institute of Corrections (NIC) of a multiyear, multistate project to develop and validate a women’s risk/needs assessment instrument (WRNA) and to formalize the measurement of many of the proposed GR factors. Led by Van Voorhis and colleagues, the project produced a series of studies of the psychometric reliability and predictive validity of the new WRNA scales (Van Voorhis et al., 2008; Van Voorhis & Groot, 2010). The critical GR factors that initially emerged from the WRNA studies included anger, family conflict, relationship dysfunction, child abuse, adult abuse, mental health history, depression (symptoms), psychosis (symptoms), and parenting stresses. Several strengths that appeared of great importance for women were also added, including family support, parental involvement, self-efficacy, and educational assets (Van Voorhis, 2012).

Another important goal of the WRNA effort was to design a “trailer” assessment to serve as an addendum to gender-neutral risk/needs instruments such as the LSI-R and COMPAS, allowing the GR factors to be used jointly with GN assessments. The trailer was programmed into the COMPAS software and pilot-tested on a large prison sample in California (Brennan et al., 2008; Van Voorhis & Groot, 2010). This study also demonstrated the predictive validity of the WRNA and led to further development of Daly’s original typology, producing a more precise classification of the pathways to crime among women offenders (Brennan et al., 2012).

Unsurprisingly, given the multiple criticisms of the LSI, Andrews and Bonta (2010) strongly contested the need for gender-responsive scales, arguing that gender-responsive needs were noncriminogenic, were unrelated to future offending, and did not qualify as risk factors. They argued that GN instruments such as the LSI were equally predictive for male and female offenders. Smith, Cullen, and Latessa (2009), in a large meta-analytic study of the LSI-R, found that it had similar effect sizes in predicting recidivism of females and males. However, Van Voorhis and colleagues, while acknowledging the predictive evidence for GN factors, produced an impressive body of research supporting the incremental predictive validity of the WRNA scales. They found that the addition of the GR factors significantly improved the overall predictive validity of the GN risk/needs assessments for women offenders for a variety of criminal outcomes (Salisbury & Van Voorhis, 2009; Van Voorhis, 2012). This debate will no doubt continue as more research is completed on the WRNA scales.

Risk Assessment and Racial Bias

Racial bias in risk assessment was given intense scrutiny during 2016 when an investigative journalist group confidently concluded in the U.S. national media that all criminal justice risk assessments were biased against blacks. This conclusion prompted a useful debate, including several forceful rebuttals, clarification of appropriate methods to detect bias in risk assessments, and several ways to conceptualize and measure fairness and bias. This debate raised concerns that risk assessment instruments may have a disparate and adverse effect on racial minorities. The context for this debate is that risk assessment is becoming steadily more involved in various decision-making junctures in criminal justice aimed at achieving several positive goals, including (1) fairness and consistency—by reducing classification errors, (2) transparency—by using objective and replicable risk methods, and (3) higher accuracy—quantitative risk assessments have been shown to be more accurate than human decision makers. The concern is that risk assessment may violate constitutional requirements for the equal treatment of racial minorities and the poor.
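One of the measurement approaches aired in this debate compares error rates across groups. The sketch below computes group-wise false positive and false negative rates from invented counts; real analyses would also examine calibration, since the competing fairness criteria generally cannot all be satisfied at once:

```python
# Toy check of error-rate balance across two groups (all counts are invented).
groups = {
    "group_a": {"tp": 120, "fp": 80, "fn": 30, "tn": 270},
    "group_b": {"tp": 60, "fp": 20, "fn": 40, "tn": 380},
}

rates = {}
for name, c in groups.items():
    rates[name] = {
        "fpr": c["fp"] / (c["fp"] + c["tn"]),   # nonrecidivists flagged as high risk
        "fnr": c["fn"] / (c["fn"] + c["tp"]),   # recidivists rated as low risk
    }

for name, r in rates.items():
    print(f"{name}: FPR={r['fpr']:.2f} FNR={r['fnr']:.2f}")
```

In this invented example the two groups' error rates diverge sharply even though each calculation is mechanical, which is why the choice of fairness measure, not just the arithmetic, drives the conclusions reached in such audits.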

Risk Assessment and Concerns over Risk Factors: If any risk factor is correlated to any degree with race, critics argue that it may act as a proxy for race and unfairly raise risk scores for minorities. A further objection against certain risk factors (e.g., gender) is that they are immutable and static and beyond the control of the offender (Starr, 2014). Thus, factors such as age, gender, and race are seen as objectionable and, in fact, are typically removed from most current risk assessment instruments. While rhetoric in these debates may be forceful, it is important to subject such concerns to careful data-analytic tests. Following is a brief discussion of some of the risk factors of concern in these debates.

Criminal History: This important factor is widely used in risk assessment for estimating risk of future reoffending. Several aspects of criminal history do correlate with race, although the correlations are mixed and can vary as a function of how criminal history is operationalized (Skeem & Lowenkamp, 2016). Some critics would allow its use in risk assessment (Starr, 2014), while others reject it as a “proxy” for race (Harcourt, 2015). Some minority groups typically have somewhat higher mean criminal history risk scores than whites. However, as Skeem and Lowenkamp emphasize, mean differences in risk assessment scores cannot be directly taken as indicators of bias. A related concern is with differential rates of participating in crime versus differential criminal justice responses to crime (e.g., convictions).

Socioeconomic and Educational Factors: These factors are also often used in risk assessment and have also been challenged as potential proxies for race (Starr, 2014). They do correlate to some degree with both race and recidivism and are potential predictors for risk assessment and key targets for rehabilitation efforts.

Behavioral and Psychological Factors: These factors are among the top predictors of recidivism based on meta-analytic research. They also correlate with race to varying degrees. They are widely used to guide risk reduction programs since they are changeable (dynamic) and can be modified through interventions. For example, peer group choice, drug abuse, and criminal thinking are all valid predictive risk factors and are often the main focus for rehabilitative treatments. Selected factors from these categories can be found in most current risk assessment instruments.

Scientific Approaches to Evaluating Bias in Risk Assessment Tools—The Need to Transcend Rhetoric: As noted earlier, it is imperative that claims of bias be examined with appropriate methods to detect bias and that debate move beyond rhetoric to data. Given recent public concerns about bias against blacks, the following focuses on the possibility of test bias against black offenders. Flores et al. (2016) recently noted that rigorous scientific methods to detect bias in risk assessments have achieved national consensus across several mature social science disciplines that also face high-stakes decisions (e.g., education, employment, intelligence testing). These fields have adopted standard methods and explicit criteria to detect bias, jointly issued by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME, 2014). Some critical questions regarding bias among these criteria are as follows:

1. Does a risk instrument have essentially equal predictive accuracy (e.g., AUC values) across the groups?

2. Does a given risk score have essentially the same form and meaning across the groups, so that the same score corresponds to the same probability of recidivism?

3. Are factors such as criminal history, socioeconomic factors, and selected psychological factors proxies for race?

Two recent studies have applied these standard NCME methods to assess racial bias on two well-known risk assessment instruments used by criminal justice agencies. Flores, Bechtel, and Lowenkamp (2016) tested the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) on a large cohort of released defendants in Broward County, Florida. Skeem and Lowenkamp (2016) evaluated the Post Conviction Risk Assessment (PCRA) developed by the Administrative Office of the U.S. Courts to improve the effectiveness of federal community supervision. Technical details are fully described in the cited studies. The key results are summarized next.

Equality of Predictive Accuracies for Black and White Groups: Skeem and Lowenkamp found that the PCRA had good predictive accuracy coefficients (AUCs) that were fundamentally equal for black and white offenders. Flores et al. similarly found that the COMPAS demonstrated parity of predictive accuracy between black and white groups, with good AUCs for the total sample (0.71) and for the white (0.69) and black (0.70) groups.
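The group-wise accuracy check used in these studies can be illustrated with a small sketch. The code below computes the AUC through its Mann-Whitney interpretation (the probability that a randomly chosen recidivist scores higher than a randomly chosen non-recidivist) and compares it across two groups; the scores and outcomes are purely hypothetical, not data from either study.

```python
from itertools import product

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a randomly
    chosen recidivist (label 1) scores above a randomly chosen
    non-recidivist (label 0), with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores (1-10 scale) and recidivism outcomes for two groups
auc_a = auc([2, 3, 5, 6, 8, 9], [0, 0, 1, 0, 1, 1])
auc_b = auc([1, 4, 4, 7, 7, 9], [0, 0, 1, 0, 1, 1])
print(f"AUC group A: {auc_a:.2f}, AUC group B: {auc_b:.2f}")
# → AUC group A: 0.89, AUC group B: 0.78
```

Roughly equal group AUCs, as reported for the PCRA and COMPAS, are evidence of predictive parity; a large gap between groups would warrant closer scrutiny.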

Equality of Form and Score Meanings across Black and White Groups: Both the Skeem and the Flores studies confirmed that risk scores for both the PCRA and the COMPAS had essentially the same meanings regardless of racial group. Thus, the risk scores for both instruments had very similar regression slopes for black and white offenders, and the interaction terms between race and the risk scores were nonsignificant. This implies that the same risk score would carry essentially the same probability of recidivism irrespective of racial group.
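A simple tabular version of this "same score, same meaning" check is to compare observed recidivism rates at each score level separately by group. The sketch below uses entirely hypothetical scores, outcomes, and group labels; the published studies used regression-based tests, but the underlying question is the same.

```python
from collections import defaultdict

def calibration_by_group(scores, outcomes, groups):
    """Observed recidivism rate at each score level, separately by group.
    If a score has the same meaning across groups, the observed rates at
    a given score should be roughly equal for every group."""
    tally = defaultdict(lambda: [0, 0])  # (group, score) -> [recidivists, n]
    for s, y, g in zip(scores, outcomes, groups):
        tally[(g, s)][0] += y
        tally[(g, s)][1] += 1
    return {key: rec / n for key, (rec, n) in sorted(tally.items())}

# Hypothetical data: score level, recidivism outcome (0/1), and group label
scores   = [3, 3, 3, 3, 8, 8, 8, 8] * 2
outcomes = [0, 0, 0, 1, 1, 1, 1, 0] * 2
groups   = ["A"] * 8 + ["B"] * 8

rates = calibration_by_group(scores, outcomes, groups)
# Same score -> same observed rate in both groups:
# {('A', 3): 0.25, ('A', 8): 0.75, ('B', 3): 0.25, ('B', 8): 0.75}
```

In real data the rates would not match exactly; the studies cited above tested whether the race-by-score interaction was statistically distinguishable from zero.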

Are Criminal History and Other Socioeconomic Factors Proxies for Race? Skeem and Lowenkamp carefully examined this issue. They concluded that although criminal history and race were modestly correlated, criminal history and the other socioeconomic risk factors did not meet the statistical criteria to be considered proxies for race. The core criterion in this approach was that criminal history predicted the arrest outcome far more strongly than race did.

Are the Prediction Results Due to Biased Criminal Histories Driving Biased Risk Assessments? In a study of whether biased criminal history records for blacks were driving the predictive results, Skeem and Lowenkamp registered an emphatic “no” to this question. The key finding was that criminal history predicted arrest outcomes with essentially equal accuracy for the black and white groups analyzed separately.

Conclusions Regarding Race Bias: This research concluded that two commonly used risk assessments (PCRA, COMPAS), when tested with rigorous statistical methods, had equal predictive accuracies for black and white groups and that the risk assessment scores had essentially the same form and led to the same recidivism probabilities irrespective of racial group (black and white). A stark conclusion of Skeem and Lowenkamp was that risk assessment is not “race assessment.” Additionally, given the evidence of equal predictive accuracies and minimal racial bias, these authors and others suggest that risk assessment should be viewed as useful input into criminal justice decisions. It should be noted that the findings for the two risk assessment instruments that were tested cannot be generalized to other risk instruments that have not been tested with these procedures, given the numerous differences in design and purposes of different risk assessment instruments.

Criticisms of the Multiple Regression Approach to Risk Assessment

A clear dilemma for criminal justice is that, in spite of the dominance and popularity of multiple regression (MR), critics from an increasing number of diverse disciplines have been voicing challenges to this standard approach to risk assessment (Cronbach, 1982; Ragin, 2000; Lykken, 1991; Molenaar & Campbell, 2009). Some of these criticisms are as follows:

Difficulties of Meeting Parametric Assumptions for MR with Complex Data: A basic challenge to MR, noted by Berk and Bleich (2013), is the difficulty of always meeting its required parametric assumptions (e.g., normality, linearity, additivity) when faced with increasingly complex criminal justice data. Traditionally, criminal justice researchers using MR have relied on various “work-around” tools (e.g., variable transformations, interaction terms, dummy variables) to meet these requirements. Berk and Bleich are, however, quite blunt in suggesting that these parametric assumptions cannot always be met in practice. This problem is likely to become more pervasive as the volume and complexity of criminal justice data for risk assessment continue to rise.

Loss of Fidelity of MR to the Complexity of Criminal Justice Data—Cronbach’s Flat Earth Society: A related criticism is that MR oversimplifies the data for risk analysis by compressing the data into a compact procrustean linear mathematical model and that to satisfy its parametric assumptions it eliminates most of the complexity of the data structures (Cronbach, 1982; Richters, 1997; Lykken, 1991; Molenaar & Campbell, 2009). Decades ago, Cronbach (1982) noted that these procedures would flatten or remove individual heterogeneity from further analyses and that parameters based on such pooled sample-wide analyses (e.g., correlation coefficients, partial regression coefficients, effect sizes) would be misleading if the underlying complexities and heterogeneities of the data had been obliterated or masked. He asserted that “some of our colleagues are beginning to sound like a Flat Earth Society … The Flat Earth folk seek to bury any complex hypothesis with an empirical bulldozer” (1982, p. 70).

Predictive Errors When Linear MR Is Used on Nonlinear Data: A third criticism focuses on predictive errors. Several studies have compared the predictive accuracy of MR against machine learning (ML) methods and documented higher error levels with MR when the data contain nonlinearities (Berk & Bleich, 2013). These researchers concluded that if a simple linear decision boundary is wrongly assumed, there can be a “substantial” loss of predictive accuracy. However, they also acknowledged that these errors with MR emerge only with higher complexity data, so that the predictive accuracies of MR and ML would be mostly similar on less complex data.
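A toy illustration of this point, using two hypothetical binary risk factors rather than real criminal justice data: when the outcome depends purely on an interaction (an XOR-style pattern), no linear decision rule can classify every case, while a simple interaction rule, of the kind a small decision tree can learn, is perfect.

```python
# XOR-style pattern: reoffending (1) occurs only when exactly one of two
# binary risk factors is present -- a pure interaction effect.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_accuracy(w1, w2, b):
    """Accuracy of the linear rule: predict 1 if w1*x1 + w2*x2 + b > 0."""
    return sum((w1 * x1 + w2 * x2 + b > 0) == bool(y)
               for (x1, x2), y in data) / len(data)

# Search a grid of linear rules: none exceeds 75% accuracy on this pattern,
# because the two classes are not linearly separable.
best_linear = max(linear_accuracy(w1, w2, b)
                  for w1 in (-2, -1, 0, 1, 2)
                  for w2 in (-2, -1, 0, 1, 2)
                  for b in (-1.5, -0.5, 0.5, 1.5))

# A single interaction rule captures the pattern exactly.
tree_accuracy = sum((x1 != x2) == bool(y) for (x1, x2), y in data) / len(data)

print(best_linear, tree_accuracy)  # → 0.75 1.0
```

The gap here is extreme by construction; in practice the loss from a wrongly assumed linear boundary is smaller and, as Berk and Bleich note, appears mainly in more complex data.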

Arrival of Potential Alternatives to MR—Nonlinear Machine Learning (ML) and Artificial Intelligence (AI) Methods: Alternative predictive analytic methods to MR have now clearly emerged. ML and AI methods appear well aligned to address predictive risk when dealing with data characterized by nonlinearity and higher complexity. This trend toward more complex data appears likely in criminal justice. The challenge of complexity in predicting human behavior has long been recognized. The systems theorist Kenneth Boulding (1991) argued that the complex patterns of social science data would create problems for the commonly used linear mathematical tools for predicting human behavior. He viewed such complex patterns as “beyond the reach” of the algebraic methods and compact linear equations that have dominated social sciences. He stressed the need for analytical methods that were more suited for this complexity. ML methods may offer this alternative since they appear better suited for high complexity data. Several recent comparative studies show that ML methods have already reached similar predictive accuracy to MR (Berk & Bleich, 2013; Duwe & Kim, 2015; Oliver, Dieterich, & Brennan, 2014; Breitenbach et al., 2009).

Following Boulding (1991), critics of MR have focused not only on predictive errors but also on its overall appropriateness for conducting research, including predictive modeling, on complex open dynamic systems. Open systems have permeable boundaries allowing multiple complex interactions between a system and its environment, including feedback mechanisms to exchange energy, information, and adaptive behaviors. In contrast, closed systems have nonpermeable boundaries that block all such exchanges. This distinction between open and closed systems and their respective complexity is highly relevant for the choice of analytical methods for risk assessment. A growing number of researchers in the social, psychological, and biological disciplines see persons as prime examples of complex open systems; as such, these systems will require analytical methods capable of addressing this complexity. Since ML methods do not require parametric assumptions, they may be better suited to address this complexity regarding pattern recognition, nonlinearity, and heterogeneous latent class structures. More detailed treatment of these issues can be found in Ragin (2000), Richters (1997), Granic and Patterson (2006), and Brennan (2017).

Controversy over a 2009 Critical Review of Risk Assessment

A widely circulated paper by Baird (2009) raised a brief controversy by making several criticisms primarily aimed at the LSI-R that he extended more generally to all “modern risk assessments.” Given the wide range of issues in Baird’s paper, it would take a lengthy discussion to fully respond. However, a core problem is that Baird makes a consistent categorization error by lumping all current risk and needs assessments together with the LSI-R into one homogeneous category. For example, he criticized the LSI for its simple additive summation of all fifty-four items to create its overall risk model. As is well known, this practice will allow many nonpredictive factors to enter a predictive model. However, it has long been abandoned by most “modern” 3G/4G assessment instruments in favor of regression and other multivariate predictive analytic methods. In a second complaint, Baird objected to combining needs and risk items into a composite predictive model, as the LSI does, and suggested that they be strictly separated. However, most 3G and 4G RNA systems typically develop regression models for risk prediction only after combining item-level needs data into separate assessment scales for each major need factor. These needs scales can then be entered into a risk assessment model if they add incremental validity to the combined model. As noted by Berk and Bleich (2013), the dominant goal of any risk assessment is strictly to optimize predictive accuracy; if a particular need factor has sufficient predictive power to increase accuracy, it should be added to the model in the service of that goal of reducing errors. It might also be noted that several of the most useful predictor variables identified in meta-analytic studies are best construed as needs factors (Andrews, Bonta, & Wormith, 2006).

Other criticisms of the LSI by Baird were appropriate (e.g., lack of attention to interrater reliability and failure to periodically upgrade the instrument over an almost three-decade period). However, the LSI-R clearly represents an upgrade over the original LSI, and there was no shortage of validation studies of the LSI during the many years since its appearance in the early 1980s. This brief controversy was raised by Baird early in 2009 and seemed largely ended when he subsequently had a public debate with Don Andrews at a well-attended national criminal justice conference later that year.

Can Any Current Predictive Risk Assessment Instrument Be Regarded as Best Based on Recent Comparative Evaluations?

This question has two themes. First, several recent comparative studies suggest that most current well-designed risk assessment models achieve fairly similar predictive accuracies with a narrow range of differences as assessed by AUCs. Thus, the current conclusion is that no clear-cut “winner” has emerged. For example, Yang, Wong, and Coid (2010), in a meta-analysis that included twenty-eight separate studies, reported that the predictive performances of nine well-validated risk assessment instruments were basically interchangeable. Most had AUC values in a range of 0.65 to 0.71, and all measures were strongly correlated with each other. Another large-scale study to examine the predictive accuracy of COMPAS conducted twenty-seven predictive analyses of AUC values for multiple gender, racial, and offense groups (Brennan, Dieterich, & Ehret, 2009). This study reported AUC levels that were similar to those reported by Yang, with values mostly ranging from 0.66 to 0.76 and a majority over 0.70.

A second kind of comparative study focuses on the various multiple regression and machine learning quantitative methods to assess the relative predictive accuracies of these different predictive methods (e.g., neural networks, random forests, support vector machines, and logistic regression). Over a dozen such studies have been conducted in the last decade (e.g., Berk & Bleich, 2013; Duwe & Kim, 2015; Breitenbach et al., 2009; Oliver, Dieterich, & Brennan, 2014; Hamilton et al., 2014). Several conclusions emerged. Once again, no single analytical method performed best across all datasets and performance measures. Most AUCs fell in a fairly narrow range of 0.72 to 0.75, with no consistent rank order and numerous reversals. Importantly, MR had similar or better accuracies than the various ML methods in certain studies. However, random forests marginally achieved the top ranking (with MR) in those studies that had more complex, higher dimensional datasets or included nonlinear features (Berk & Bleich, 2013; Duwe & Kim, 2015; Breitenbach et al., 2009). Researchers are still in the early stages of understanding how best to use ML methods and cannot yet answer the question “What method works best, with what kind of data?”

Can Group Risk Levels Be Applied to Individuals?

Another unresolved issue is a challenge posed by Hart, Michie, and Cooke (2007) that the margins of error and confidence intervals for individual risk assessments of violence are so wide that such predictions for individuals are “virtually meaningless” (p. 263). This position is strongly contested by other researchers. Skeem and Monahan (2011), for example, assert that group risk data are useful in many professions where risk decisions are unavoidable and can provide valuable empirical information that is typically required when making high-stakes decisions about individuals. They note that in other industries (e.g., insurance, weather forecasting) this decision support is critical for maintaining the basic operations of these giant enterprises. In another well-known paper, Grove and Meehl (1996) humorously proposed a hypothetical Russian roulette game to those who claim that actuarial results are meaningless. In this scenario, two revolvers are placed on the table, and players are informed that one gun has five live rounds and one empty chamber, while the other has five empty chambers and one live round. The player is required to choose a gun. Those choosing the gun with five empty chambers clearly have better odds. In most well-designed risk assessment instruments, the odds of recidivism generally span at least a six-fold range from the low-risk category (around 10% recidivism) to the high-risk category (around 60% or more). Those who say the odds don’t matter (or who have a death wish) might feel free to ignore the odds in choosing their gun. This scenario has been challenged but appears sensible to most people.

Conclusions

Has Risk Assessment for Criminal Justice Reached a Ceiling?

Skeem and Monahan (2011) have questioned whether risk assessment has reached a ceiling, given the slow recent progress and apparently diminishing returns. They ask whether attention should shift to other assessment issues, such as risk reduction, explanatory research that aims to understand rather than predict violence, or the decision-making heuristics of Goldstein and Gigerenzer (2009). However, an alternative view is that several technical advances now unfolding in the basic components of risk assessment warrant more optimism, given the likely synergies among them. These advances in information technologies (IT) may presage substantial improvements in criminal justice risk assessment. They are as follows.

Advances in Data Collection Techniques: As noted earlier, rapid advances in automated data collection, data storage, data communication, and sensor technologies are occurring in parallel. These are likely to facilitate cheaper, faster, and broader data coverage for each individual offender. They also may provide a more powerful range of biological, social, psychological, and environmental data that may enhance risk assessment. A particularly important aspect of this is the potential for continuous “data streaming” to provide ongoing monitoring and the detection of early warnings over broad domains of persons, environments, and behaviors.

Advances in Analytical Techniques—Machine Learning (ML), Artificial Intelligence (AI), and Regression Methods: Analytic tools are also undergoing rapid advancement. The lasso technique, for example, in contrast to standard MR, can handle hundreds or even thousands of candidate risk factors while still producing interpretable models. The fields of ML and AI are also rapidly expanding the range and power of new quantitative techniques that should enhance predictive analytics, clarify offender target populations, and offer pattern recognition methods to accurately and efficiently assign offenders to their most likely risk level. This turbulent IT environment is likely to produce synergies between enhanced data access and analytical techniques. From a risk assessment standpoint, ML techniques such as multivariate time series analysis have already been developed to handle the expected massive production of within-individual longitudinal data for early detection and prevention strategies (Nesselroade & Molenaar, 2010).
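The lasso's ability to winnow many candidate risk factors down to an interpretable model can be sketched in a few lines. Under the simplifying assumption of standardized, uncorrelated predictors, the lasso solution reduces to soft-thresholding each ordinary least squares coefficient; the factor names and coefficient values below are purely hypothetical.

```python
def soft_threshold(beta, lam):
    """The lasso shrinkage operator: shrink each coefficient toward zero
    by lam, and set coefficients with magnitude below lam exactly to zero.
    (This zeroing is the source of the lasso's variable selection.)"""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical OLS coefficients for candidate risk factors (standardized
# and assumed uncorrelated -- the case where this closed form is exact).
ols = {"prior_arrests": 0.80, "age_at_first_arrest": -0.55,
       "peer_group": 0.40, "shoe_size": 0.05, "birth_month": -0.03}

lasso = {name: soft_threshold(b, lam=0.10) for name, b in ols.items()}
selected = {name: b for name, b in lasso.items() if b != 0.0}
# Weak, noisy predictors (shoe_size, birth_month) drop out entirely,
# leaving a compact model of genuinely predictive factors.
```

With correlated predictors, as in real criminal justice data, the lasso must be fit iteratively (e.g., by coordinate descent), but the shrink-and-select behavior is the same.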

Advances in Computing Power and Memory Storage Capacities: The above two areas of advancement are both enabled and accelerated by the continuing almost exponential growth in computational power, database memories, and so on. These should also produce mutually reinforcing synergies.

Advances in Risk Factor Discovery: While there has been some pessimism that a ceiling on potential crime markers has been reached, it seems likely that other theoretical and predictive domains in criminology, including the biological, biosocial, and environmental, will begin to yield useful new risk factors for risk assessment. New predictive factors seem especially likely to emerge from the biological domain. Adrian Raine’s (2013) The Anatomy of Violence: The Biological Roots of Crime is one of many recent efforts uncovering potential biological markers linked to violence and criminal behavior. Although biological markers are not yet widely used in criminal justice and may introduce difficult political and ethical challenges, they are gradually being identified. Practical tests are becoming faster and less expensive and may soon offer reliable options for routine use in criminal justice (Fishbein, 2000).

Austin, J., & McGinnis, K. (2004). Classification of high-risk and special management prisoners: A national assessment of current practices. Washington, DC: National Institute of Corrections.

Brennan, T., & Oliver, W. L. (2013). The emergence of machine learning techniques in criminology: Implications of complexity in our data and in research questions. Criminology and Public Policy, 12, 551–562.

Byrne, & Dezember. (2017). The prison research director perspective on the design, implementation, and impact of risk assessment and offender classification systems in USA prisons: A national survey. In F. Taxman (Ed.), Handbook on risk and need assessment: Theory and practice. New York: Routledge.

Duwe, G., & Kim, K. (2015). Out with the old and in with the new? An empirical comparison of supervised learning algorithms to predict recidivism. Criminal Justice Policy Review, 1–31.

Nesselroade, J. R., & Molenaar, P. C. (2010). Emphasizing intraindividual variability in the study of development over the life span. In Handbook of life-span development. Wiley Online Library.

Taxman, F. S., & Caudy, M. S. (2015). Risk tells us who, but not what or how: Empirical assessment of the complexity of criminogenic needs to inform correctional programming. Criminology and Public Policy, 14(1), 71–103.

Van Voorhis, P., & Presser, L. (2001). Classification of women offenders: A national assessment of current practices. Washington, DC: National Institute of Corrections, U.S. Department of Justice.