Abstract

Purpose: The mini-Clinical Evaluation Exercise (mCEX) is increasingly being used to assess the clinical skills of medical trainees. Existing mCEX research has typically focused on isolated aspects of the instrument's reliability and validity. A more thorough validity analysis is necessary to inform use of the mCEX, particularly in light of increased interest in high-stakes applications of the methodology.

Method: Kane's (2006) validity framework, in which a structured argument is developed to support the intended interpretation(s) of assessment results, was used to evaluate mCEX research published from 1995 to 2009. In this framework, evidence to support the argument is divided into four components (scoring, generalization, extrapolation, and interpretation/decision), each of which relates to different features of the assessment or resulting scores. The strengths and limitations of the reviewed research were identified in relation to these components, and the findings were synthesized to highlight overall strengths and weaknesses of existing mCEX research.

Results: The scoring component yielded the most concerns relating to the validity of mCEX score interpretations. More research is needed to determine whether scoring-related issues, such as leniency error and high interitem correlations, limit the utility of the mCEX for providing feedback to trainees. Evidence within the generalization and extrapolation components is generally supportive of the validity of mCEX score interpretations.

Conclusions: Careful evaluation of the circumstances of mCEX assessment will help to improve the quality of the resulting information. Future research should address issues of rater selection, training, and monitoring, which can affect rating accuracy.

When the American Board of Internal Medicine (ABIM) discontinued its patient-based oral examination in 1972, sole responsibility for assessing residents' clinical skills was assigned to program directors. In response to the need for a practical method of evaluating the clinical skills of residents, the Clinical Evaluation Exercise (CEX) was created. The main purpose of the CEX was to ensure that resident–patient encounters would be directly observed, and the typical implementation involved a faculty member observing a resident performing a complete history and physical examination of a new patient followed by a case presentation and discussion of patient management. Each CEX lasted about two hours, and the format and experience were generally well received by faculty and residents. Despite its benefits, there were significant concerns related to the use of the CEX as a performance assessment tool. Observation of a single patient encounter by a single rater significantly limited score reproducibility and the validity of score interpretations. Additionally, the unit of analysis (i.e., observation of a complete history and physical examination) was not representative of the focused patient encounters that more commonly occurred as part of residents' and practicing doctors' clinical interactions with patients.1,2

The mini-Clinical Evaluation Exercise (mCEX), a modification of the traditional CEX, was introduced to address some of the shortcomings of the original assessment tool. In the mCEX, faculty members observe residents conducting a focused clinical task in a variety of different settings. Each encounter is intended to last about 15 minutes; this allows mCEX observations to be integrated into daily clinical and educational activities. Raters use a nine-point scale ranging from “unsatisfactory” to “superior” to provide six domain-specific ratings and one rating for overall clinical competence.3 Raters also are expected to provide educational feedback to residents at the end of the encounter. This feedback should indicate both areas of strength and areas that are in need of further development. In contrast to the CEX, the mCEX was intended to be used more frequently during residency training; this approach was expected to enhance the reliability and validity of the resulting scores and score interpretations. Initial research supported the reliability and feasibility of the mCEX for measuring residents' clinical skills,1 and subsequent research endorsed the superior psychometric properties of the mCEX over the traditional CEX.4

In the last decade, numerous studies have investigated the feasibility, reliability, and validity of the mCEX when used for assessing clinical skills in both undergraduate and graduate medical education. Although the findings generally have supported use of the tool for formative assessment of trainees, variability in study design and conflicting results raise questions about the optimal use of the mCEX. In addition to this residual uncertainty, reports of the use of the mCEX as a summative tool5 and recent consideration of its use in high-stakes assessment of practicing physicians6 suggest the need for a thorough analysis of the psychometric properties of this instrument. The purpose of this article is to provide such an analysis. Specifically, this report evaluates existing mCEX literature in order to (1) provide a broader understanding of the psychometric properties of the mCEX, (2) inform the use of the mCEX, and (3) identify areas in which additional research is needed.

Method

One author (R.H.) used the following terms to conduct a MEDLINE search for literature published between January 1995 and January 2009: mini-Clinical Evaluation Exercise, mCEX, Clinical Evaluation Exercise, and CEX. Three study authors (M.M., S.D., J.N.) reviewed the resulting list of articles and conducted their own independent literature searches to identify additional relevant research. We also obtained articles from the reference lists of the original set of articles. We developed the process for reviewing publications and discussed all articles by conference call. Research that used all or some of the items on the mCEX scale without significant changes in scale structure or item descriptors was included in this validity analysis. The analysis did not include instruments that were significantly modified from the original mCEX format (e.g., that did not retain the global, multidimensional focus or changed the scale structure or descriptors), although the research describing such instruments was referenced where pertinent to the validity discussion.7–12

Our validity analysis was based on work that was described originally by Kane13 and subsequently adapted by Clauser and colleagues14 for use in medical education. This approach conceptualizes the validation process as the development of a structured argument in support of the intended interpretation or use of assessment results. Evidence collected in support of this argument is divided into four components: scoring, generalization, extrapolation, and interpretation/decision. The strength of the validity argument is directly related to the chain of inferences within and across these four components; confidence in the validity argument is most significantly influenced by the weakest link in this chain. Although research studies that target individual components of the argument may yield important information about the use of a particular assessment, this approach to validation may result in relying too heavily on an individual piece of evidence and increase the potential for misinterpretation of results.

Support for the scoring component of the validity argument includes evidence that the assessment was properly administered and that rules or procedures for scoring performance were consistently and accurately applied. For assessments involving direct observation of clinical performance, evidence that observers in different contexts (e.g., ward, clinic) assessed the same construct (e.g., history taking) in the same way would support the validity argument.

The generalization component includes two kinds of evidence: (1) evidence that the material on which the examinee is being assessed is appropriately representative of the larger universe of instances from which that material was drawn, and (2) evidence that the sample is large enough to produce reliable results. To constitute evidence within the generalization component of an assessment of clinical performance, the assessment should include a broad sample of clinical content that reflects the types of patients that would be seen in practice and enough observations to allow for reasonable reproducibility of the results. Observing a resident only on cardiac cases when the assessment is intended to be a general internal medicine clinical performance assessment would call into question the rater's ability to generalize from performance on cardiac cases to performance on other case content in internal medicine.

Evidence for the extrapolation component would include analytic results indicating that assessment outcomes are related to the proficiency or construct of interest. For example, positive correlations between observational ratings (e.g., the medical interviewing score on the mCEX) and other measures of clinical skills (e.g., patient ratings of communication skills) would be considered evidence for this component of the validity argument. It is important to note that judgments regarding an instrument's validity based on a comparison instrument should consider evidence supporting the validity of scores generated by the comparison instrument (patient ratings per this example).

The final component of the validity argument is interpretation/decision. This component includes evidence in support of the theoretical framework required for interpreting assessment results and evidence of the credibility of the procedures used in informing the resulting inferences or decisions. For clinical performance ratings, analysis of the theoretical rationale for making inferences or decisions about trainee performance and justification for use of the scores for summative purposes would address validity considerations within this component. For example, if mCEX scores will be used to identify people who will receive remediation, providing evidence that identified individuals will benefit more from the remediation than those who were not selected (because of higher mCEX scores) would be relevant to this component of the argument.

List 1 provides examples of the questions that were considered when evaluating the evidence within each component. Depending on the focus and content of the individual articles reviewed, not all validity components were relevant and not all questions within components were appropriate. In particular, information related to the interpretation/decision component was less frequently provided. Table 1 includes an excerpt from a working data table that informed the discussion during author conference calls. It is important to note that although the organizational framework presented in this report is useful for structuring the process of developing a validity argument, the distinctions between the components are not fixed. As outlined below, occasional overlap across the components is expected, and some specific pieces of evidence will provide answers to the questions contained within multiple components.

Results

Scoring

The primary consideration for the scoring component was how raters used the mCEX rating scale to evaluate trainee performance. Additional scoring-related considerations were the impact of rater selection and rater training on mCEX outcomes (see List 1 for a review of the questions that are relevant to each component of the validity argument).

Investigation into the use of the scale revealed a number of phenomena that commonly are associated with the use of global rating forms. The first general finding is that raters did not use the full nine-point rating scale; ratings were concentrated toward the higher end of the scale. Leniency among raters (defined here as mean scores of 6 or higher) was common, and the highest scores tended to be assigned to the competencies of professionalism and humanism.1,2,15–17 Reports of low scores (ratings of less than 4 and labeled “unsatisfactory”) were relatively infrequent. Although low scores are uncommon, when they do occur they may be associated with specific performance deficits or may identify trainees who will have difficulty with more high-stakes performance assessments.18–20 However, reported shifts in score distributions also may be attributable to factors unrelated to examinee proficiency. For example, Boulet and colleagues18 reported lower mean ratings overall for their group of international medical graduates. At first glance, these results may suggest inferior performance by international medical graduates. Alternative explanations for the findings could be that (1) raters felt more comfortable providing lower ratings because of the lower stakes associated with receiving such ratings in the context of a research study, and (2) rating videotapes (rather than directly observed encounters) allowed raters to feel that their evaluations were removed from the actual performance and therefore would not impact the examinee as directly as they would in a live observation setting.
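
To make the scoring definitions used in this section concrete, the following minimal Python sketch summarizes a set of ratings using the thresholds stated above (a mean of 6 or higher as leniency, ratings of less than 4 as "unsatisfactory"). The ratings and the function name `summarize_ratings` are hypothetical and are not drawn from any of the reviewed studies.

```python
from statistics import mean

SCALE_POINTS = range(1, 10)  # nine-point mCEX scale, 1 through 9


def summarize_ratings(ratings):
    """Summarize scale use for a list of integer ratings on the 1-9 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in SCALE_POINTS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 9")
    avg = mean(ratings)
    return {
        "mean": avg,
        # Leniency as defined in the text: mean score of 6 or higher.
        "lenient": avg >= 6,
        # Share of ratings in the "unsatisfactory" range (less than 4).
        "share_unsatisfactory": sum(r < 4 for r in ratings) / len(ratings),
        # Scale points that were never used.
        "unused_points": [p for p in SCALE_POINTS if p not in ratings],
    }


# Hypothetical ratings, invented for illustration only.
example = [7, 8, 6, 7, 9, 8, 7, 3, 8, 7]
print(summarize_ratings(example))
```

A summary of this kind describes how the scale was used; it cannot by itself separate rater leniency from genuinely strong trainee performance.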

Another common finding was that the individual competencies on the mCEX tended to be highly intercorrelated.1,2,4,15–19,21 This raises questions about the ability of raters to discriminate between individual strengths and weaknesses among trainees when using this instrument. Though this finding may seem indicative of a “halo” effect (when raters' impressions about performance in some domains inappropriately influence ratings in other domains), the possibility also exists that this reflects the true relationships among related performance domains (e.g., counseling and interviewing should be highly correlated). It also is plausible that many trainees do not demonstrate decided patterns of strengths and weaknesses that can be captured within the individual mCEX domains. Lastly, the fact that the rating form has overlapping descriptors (e.g., attention to patient comfort and modesty are described as underlying both physical examination and humanism/professionalism) may predispose raters to provide similar ratings of different competencies.3
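
As an illustration of what "highly intercorrelated" means operationally, the sketch below computes pairwise Pearson correlations between domain ratings. The domain names follow the mCEX form, but the ratings, the data layout, and the helper names `pearson` and `interitem_correlations` are assumptions made for this example only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def interitem_correlations(ratings_by_domain):
    """Return {(domain_a, domain_b): r} for every pair of domains."""
    return {
        (a, b): pearson(ratings_by_domain[a], ratings_by_domain[b])
        for a, b in combinations(ratings_by_domain, 2)
    }


# Hypothetical ratings for six trainees; values are invented for illustration.
ratings = {
    "interviewing": [6, 7, 8, 5, 7, 9],
    "counseling": [6, 7, 7, 5, 8, 9],
    "physical_exam": [7, 6, 8, 6, 7, 8],
}
for pair, r in interitem_correlations(ratings).items():
    print(pair, round(r, 2))
```

A high coefficient from such a calculation is compatible with both a halo effect and a true relationship between domains, which is why the correlations alone do not settle the question raised above.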

The literature provided limited information about the impact of rater selection on assessment results. Though some studies used raters who either had general rating experience or experience with the mCEX specifically, no comparisons with less experienced raters were provided. These studies also yielded mixed results regarding the quality of ratings among types of raters. Some research indicated significant variation in rater stringency for faculty raters,20,21 and faculty in general seemed to be more stringent than resident raters.16,17 In one study, faculty selected by their program directors to participate in an intensive faculty development workshop viewed videotaped resident performances that were scripted to be at three distinct performance levels (unsatisfactory, satisfactory, and superior) and rated them on history taking, physical examination, and counseling. Results indicated that faculty were able to discriminate in the appropriate direction between the performance categories (although a broad range of ratings for each performance level was noted).22

Published research on the mCEX sheds little light on the impact of rater training on assessment outcomes. In a number of studies, no specific form of rater preparation was described.1,4,15,17,23 For the studies in which rater training was addressed, the type of preparation included such varied approaches as provision of guidance notes, written and/or verbal orientation during meetings, and multiple-hour interactive workshops.16,20,21,24–26 Regardless of the specific training approach, the use of control groups to measure the impact of rater preparation on assessment outcomes was not common, and only two studies systematically investigated the impact of training on mCEX ratings by comparing the results across three training conditions: participation in an intensive rater training workshop, review of written training materials, and no intervention.25,26 Results of these two studies conflicted: One found that use of an interactive workshop (that incorporated elements of frame-of-reference training) led to an improved ability to discriminate between different levels of performance and a modest decrease in leniency error among raters,25 and the other (involving a somewhat shorter intervention) showed no difference in accuracy or reliability between trained and untrained raters.26

Generalization

The generalization stage of the validity argument requires consideration of two main kinds of evidence: (1) that observations are representative of the domain to which the score is to be generalized and (2) that the sampling is extensive enough that it prevents the observed scores from being unduly influenced by sampling error (see List 1). For direct observation of clinical performance using the mCEX, the specific factors of interest are the number and diversity of encounters/patients, the number and diversity of the raters, and the rating form itself.

Review of the research indicated that study designs varied widely in terms of the numbers of encounters and raters. Much of the work is in uncontrolled settings where it is not possible to tease apart the effects of patients, raters, and the rating form.4,15,16,19,23 However, more recent studies have been conducted in controlled settings using videotaped and sometimes scripted encounters.21,27 Analysis of the research findings from studies in both of these settings suggests that data from between 6 and 14 encounters would be sufficient to produce a dependability or phi coefficient of 0.80 (the dependability coefficient represents the expected correlation between scores across replications of the assessment procedure). This is comparable to what is reported for other methods of assessing clinical skills.
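
The relationship between the number of encounters and the dependability coefficient can be sketched with a simplified decision-study calculation. The code assumes a single-facet design in which all absolute error variance is attributed to encounters, so that phi equals person variance divided by person variance plus absolute error variance over the number of encounters. The variance components are hypothetical, and the reviewed studies may have used more elaborate designs (for example, separating raters from patients).

```python
def phi_coefficient(var_person, var_abs_error, n_encounters):
    """Dependability (phi) of a mean score over n_encounters.

    Simplified single-facet formula (an assumption of this sketch):
        phi = var_person / (var_person + var_abs_error / n_encounters)
    """
    if var_person <= 0 or var_abs_error < 0 or n_encounters < 1:
        raise ValueError("invalid variance components or encounter count")
    return var_person / (var_person + var_abs_error / n_encounters)


def encounters_needed(var_person, var_abs_error, target=0.80, max_n=1000):
    """Smallest number of encounters for which phi reaches the target."""
    for n in range(1, max_n + 1):
        if phi_coefficient(var_person, var_abs_error, n) >= target:
            return n
    raise ValueError("target not reached within max_n encounters")


# Hypothetical variance components, invented for illustration only.
VAR_PERSON = 0.5      # true differences between trainees
VAR_ABS_ERROR = 1.2   # single-encounter absolute error variance

for n in (1, 4, 8, 10, 14):
    print(n, round(phi_coefficient(VAR_PERSON, VAR_ABS_ERROR, n), 3))
print("encounters for phi >= 0.80:",
      encounters_needed(VAR_PERSON, VAR_ABS_ERROR))
```

Because the required number of encounters scales with the ratio of error variance to person variance, modest differences in estimated variance components across studies can plausibly account for a range as wide as the one reported.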

Dependability coefficients are a very useful means of comparing the reliability of different methods of assessment. However, for domain-referenced interpretation of scores, a 95% confidence interval (CI) built on the standard error of measurement is often more useful. Here there is a difference between the controlled and uncontrolled studies. For the uncontrolled studies, two to four encounters would be sufficient to obtain a 95% CI of one point or less on the nine-point rating scale. For the controlled studies, six or seven encounters are necessary. In these controlled studies, however, the videotaped encounters represented a much greater range of trainee competence than exists in most applied settings. It is unclear whether this influenced the results, the degree to which it did so, or the direction of the influence.
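
The confidence-interval reasoning can be sketched in the same way. The code below assumes that the standard error of measurement for a mean over n encounters is the square root of the single-encounter absolute error variance divided by n, that the interval is the observed mean plus or minus 1.96 standard errors, and that "one point or less" refers to the half-width of the interval. All three are assumptions of this illustration, and the variance values are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def sem(var_abs_error, n_encounters):
    """Standard error of measurement for a mean over n_encounters."""
    if var_abs_error < 0 or n_encounters < 1:
        raise ValueError("invalid error variance or encounter count")
    return sqrt(var_abs_error / n_encounters)


def ci_half_width(var_abs_error, n_encounters):
    """Half-width of the 95% CI around a trainee's mean rating."""
    return Z_95 * sem(var_abs_error, n_encounters)


def encounters_for_ci(var_abs_error, max_half_width=1.0, max_n=1000):
    """Smallest n whose 95% CI half-width is at most max_half_width."""
    for n in range(1, max_n + 1):
        if ci_half_width(var_abs_error, n) <= max_half_width:
            return n
    raise ValueError("requested precision not reached within max_n")


# Hypothetical single-encounter error variances, for illustration only.
for label, var_err in (("smaller error variance", 1.0),
                       ("larger error variance", 1.7)):
    n = encounters_for_ci(var_err)
    print(label, n, round(ci_half_width(var_err, n), 2))
```

Unlike the dependability coefficient, this calculation depends only on error variance and not on the spread of trainee ability, which is one reason the two criteria can point to different numbers of encounters.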

Of the facets of measurement that influence reliability, the length of the rating form has the smallest effect. Forms composed of 5 to 10 items are sufficient for most assessment purposes, and longer forms do not significantly increase the reliability of the results.9 In contrast, the numbers of raters and patients both have substantial effects on reliability. Some studies have made an excellent start at teasing these two apart,21,27 but additional research is necessary before firm conclusions can be drawn.

Extrapolation

The extrapolation component of the validity argument is concerned with the ability to relate performance on the assessment to performance in practice. Perhaps the most notable strength of the mCEX is that its use involves observation of what actually happens in clinical practice, but this is not in and of itself evidence for the validity of interpretations about performance that will be made based on mCEX scores.

Perhaps the most straightforward area in which the reviewed research provided evidence for the extrapolation component of the validity argument was through investigations of the relationship between mCEX scores and performance on other measures that are assumed to assess related abilities. Some research reported lower (or the lowest) mCEX scores for examinees who were unsuccessful on other related assessments,18,19 and other research reported that mCEX scores were moderately to highly related to outcomes of similar assessments such as postclerkship summative evaluations,16 high-stakes clinical skills examinations,18 and ABIM monthly evaluations.15 The consistent finding of positive (and often statistically significant) relationships between mCEX scores and other assessment outcomes provides further evidence for the validity of mCEX score interpretations.

Additional areas of research that can potentially provide support for the extrapolation component are those that investigate mCEX performance across levels of education, observation settings, and encounter complexity. Most of the reviewed studies reported improvement in mCEX scores within and across both undergraduate and graduate years of study, particularly for domains such as clinical judgment and organization and efficiency (in which improved performance would be expected).1,4,20,24 The finding of consistent results across observation settings also provides evidence for the extrapolation component; if it were demonstrated that mCEX performance differed based on where the observations were taking place, it would call the overall interpretation of the results into question.2 Variation in ratings relative to encounter complexity suggests that raters tend to factor in the difficulty of the case when assigning their ratings.4 This leniency effect, in which raters seem to give examinees the benefit of the doubt, also makes straightforward interpretation of the results difficult.

While it remains important to collect evidence to support score interpretations by demonstrating the existence of certain desirable relationships between assessment outcomes, it also is necessary to demonstrate that other undesirable relationships do not exist. Collecting evidence of the absence of the influence on scores of factors that are unrelated to the intended score interpretation—referred to as construct-irrelevant variance—therefore has become an important feature of this more recent conceptualization of validity. Halo effects are one example of a finding that is potentially illustrative of the impact of construct-irrelevant variance. As mentioned previously, results of the present research indicated a fairly common result of high intercorrelations between mCEX items.1,2,4,15–19,21 What is not clear is whether this result is an accurate representation of real and expected relationships between proficiencies or whether some feature of the encounter (e.g., the ratee, the rater, or the testing format) that is unrelated to the proficiencies of interest is influencing scores.

Construct underrepresentation refers to the extent to which scores fail to reflect aspects of the proficiency of interest. This is relevant to the extrapolation component in that score interpretations essentially will be meaningless if a scoring approach and resulting outcomes are not representative of the constructs the test was intended to measure. Assessment using the mCEX closely mirrors practice, and the scoring approach reinforces practice-relevant behaviors; these factors should help to reduce construct-irrelevant variance and construct underrepresentation, but published studies have not explicitly addressed these issues.

Interpretation/decision

This component of the validity argument focuses researchers and practitioners on two main types of validity evidence: (1) the extent to which interpretations are based on theoretical constructs that are reasonable and credible, and (2) the extent to which any decision rules that are applied to assessment outcomes are based on sound and defensible procedures. This is a critical step of the overall process of collecting validity evidence because the interpretations and resulting decisions made based on assessment outcomes are what directly impact the population of interest. In fact, the entire validation process is focused not on providing evidence that one has developed or implemented a valid test but, instead, on demonstrating that there is evidence for the validity of the interpretations that will be made about the resulting scores.

In terms of the credibility of theoretical constructs, one of the themes that emerged from the work reviewed in this article, with one exception,15 is that performance on the mCEX differed by level of proficiency and/or training. From a theoretical perspective, it is reasonable to believe that the level of skill with which clinical tasks are performed would increase with the level of training or knowledge. As mentioned in the extrapolation section, the studies that reported higher mCEX scores across levels of undergraduate16 and graduate or postgraduate training1,4,20 are sensible within the overall theoretical framework of skill development or content mastery (with deliberate practice).28,29 Although variation in scores with duration of training is supportive of the validity argument, important information that is not specifically addressed in the reviewed research relates to the effectiveness of targeted educational interventions on increasing low overall or domain-specific mCEX scores.

More generally, the original purpose of the instrument also relates to the intended score interpretations. The mCEX was designed by the ABIM for use as a formative tool that would encourage education rather than simply as a summative tool used for evaluation.1 The educational component makes clear the importance of feedback in the overall process, yet the majority of researchers reported on the more objective outcomes of mCEX implementation (i.e., the ratings) and neglected to comment on the feedback component. Although some researchers specifically investigated this aspect of the mCEX,11,30 the general lack of attention to an important intended component of the mCEX raises questions about whether the mCEX is being used in the manner that was intended and about the consequences of this potential misuse for the validity of score interpretations.

Discussion

This report analyzes the validity of mCEX scores using a framework that considers the process of validation as one of building an argument, or chain of inferences, in support of intended score interpretations. Overall, the weakest component of the mCEX validity argument seems to be in the area of scoring, while analysis of the other components of the argument is generally supportive. Unfortunately, there are relatively few studies of the mCEX, and many of them are based on limited settings and small numbers of trainees, examiners, and patients. Consequently, it is difficult to separate problems with the method from gaps and limitations in the research conducted to date.

In terms of the scoring component, three issues are of primary concern: high interitem correlations, rater selection and training, and leniency. The finding of high interitem correlations requires additional study of mCEX use in both formative and summative assessment. There are at least five potential causes for this finding: (1) examiners are unable to distinguish among the different dimensions that need to be rated, (2) the dimensions themselves are highly correlated, (3) the preponderance of the patient encounters used in the studies do not elicit differential performance, (4) a substantial portion of the examinees do not have focal strengths or weaknesses, and/or (5) descriptors associated with the scale promote similar ratings for different items. Research aimed at identifying the relative contributions of each to the high interitem correlations is needed. In the meantime, on the basis of scores alone, it is difficult to target specific areas for learner improvement and/or identify whether there is a primary reason for an overall low score. Specific written or verbal feedback should be encouraged, but research on feedback quality is limited, and available reports suggest it is highly variable.11,30

The manner in which examiners are selected and trained seems to affect rating outcomes. If trainees are allowed to select their cases and examiners, differences in case difficulty and rater stringency may impact the fairness of grading outcomes, especially in the setting of summative assessment.4,16,17 Results from the studies describing examiner interventions are mixed. Further research on the efficacy of examiner training clearly is needed; research also should focus on the cultural and environmental influences on ratings as well as the potential value of a quality assurance and feedback process for examiners. In addition, the data speak to the importance of ensuring that individual trainees be assessed by as many different examiners as is feasible.

In terms of leniency, lower (and generally more accurate) ratings are given in the research context compared with those given in clinical educational settings, and this has particular ramifications for summative assessment. This finding suggests that examiners may inflate their ratings when the results have greater implications for trainees.18,25,26 This effect may be mitigated to some degree by examiner training.25 Alternatively, being aware of observation and assessment in the clinical setting may elevate trainee performance. Research to clarify the sources and magnitudes of these effects is needed, as are means to alter them as necessary.

The generalization component of the validity argument takes into account the appropriateness of the sample of patients, examiners, and encounters. More work is needed in this area, because many of the studies had small sample sizes, and this limits the ability to interpret the findings within a specific context. Nonetheless, with some consistency the research found that a defensible reliability or generalizability result was obtained with 8 to 10 encounters. Depending on the purpose of the assessment (formative versus summative), the stakes, and the nature of the trainees and setting, this number may be higher or lower. Several papers described deliberate sampling strategies targeting the distribution of case content and complexity, setting, task, and/or raters.1,2,4,18–22,26,27 Attention to sampling across the relevant characteristics is important to the validity of intended interpretations based on aggregate mCEX scores.

The mCEX is an assessment that is designed to closely approximate the practice setting, and this enables the validity argument for the extrapolation component to be quite strong. The close correspondence with the practice setting also may limit the effects of extrapolation-related threats to validity (such as construct-irrelevant variance and construct underrepresentation). Of course, the similarity between assessment method and practice setting does not, by itself, guarantee that the score represents the proficiency of interest.

Theoretically, the mCEX assesses constructs similar to those assessed by other methods, such as standardized patient-based examinations and, to a lesser degree, monthly attending ratings and written examinations. Overall, the results did provide evidence for interpretations that assert that the mCEX assesses constructs that are similar to those evaluated, in whole or in part, in the other assessments.15,16,18,19 Although attention to this type of validity evidence is important, even an established and widely referenced instrument will need to provide sufficient evidence of validity of its own score interpretations or decisions if it is to be properly used for such an argument.

Review of published reports on the mCEX with regard to the extrapolation component reveals several areas for future research. Most of the studies have focused on trainees in internal medicine. This is not particularly surprising given the instrument's origin, but much less is known about the use of the mCEX in other specialties. Similarly, the ability of mCEX ratings to predict real-world outcomes of care (e.g., patient satisfaction) also is largely unexplored. It is not clear to what degree authentic behaviors captured and rated by the mCEX relate to subsequent clinical performance.

The final component of the validity argument, interpretation/decision, requires attention to the ways in which scores will be used and interpreted. For example, if mCEX scores are used to identify individuals who will receive remediation, it would be reasonable to provide evidence that the identified individuals will benefit more from the remediation than those who were not selected or that the remediation acts to decrease or eliminate the discrepancy in performance between the two groups. Though the mCEX was originally developed for formative purposes, investigations into its use as a summative assessment make clear the need to carefully consider how decisions about people will be made. If cut scores are used to make decisions, evidence is required to show that those cut scores were established in a sensible manner.

Finally, as noted above, although the organizational framework presented in this paper is useful for structuring the process of developing a validity argument, the distinctions between the components are not fixed. Instead, overlap across the components is expected, and some specific pieces of evidence will provide answers to the questions contained within multiple components. What is important is not the structure, per se, but the process of collecting evidence that will provide an overall argument for the validity of interpretations that will be made about assessment scores.
