The modified Rankin Scale (mRS) is the preferred outcome measure in stroke trials. Typically, mRS assessment is based on a clinician's rating of a patient interview, and interobserver variability is common. Meta-analysis suggests an overall reliability of k=0.46, but this may be lower (k=0.25) in multicentre studies. Mandatory training in mRS assessment is employed in most trials to mitigate this, but the problem persists. Variability in assigning outcomes may lead to endpoint misclassification, increasing the challenge of accurately demonstrating a treatment effect. We aimed to assess the impact of endpoint misclassification on trial power and to explore methods to improve the use of the mRS in acute stroke trials.

First, we used the mRS outcome distributions of previous phase III randomised controlled trials (RCTs) in stroke (the NXY-059 and NINDS tPA studies) to perform statistical simulations, generating power estimates and sample sizes for simulated mRS studies under various combinations of sample size, mRS reliability and adjudication panel size. The simulations suggest that improving mRS reliability from k=0.25 to k=0.5, k=0.7 or k=0.9 may allow a reduction in sample size of n=386, n=490 or n=488 respectively in a typical n=2000 RCT.

We then developed a method for group adjudication of mRS endpoints and examined the feasibility, reliability and validity of its use in a multicentre clinical trial. We conducted a "virtual" acute stroke trial across 14 UK sites. Local mRS interviews were scored as usual but also recorded on digital video camera. Video clips were uploaded via a secure web portal for scoring by adjudication committee reviewers. We demonstrated excellent technical success rates, with acceptability to both participants and investigators. 370 participants were included in our "virtual" acute stroke trial and 563 mRS video assessments were uploaded for central review; 96% (538/563) of study visits resulted in an adjudicated mRS score.
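The simulation approach described above can be sketched in miniature. The code below is an illustrative toy model only: the mRS grade distributions, the one-grade-drift noise model for imperfect reliability, and all function names are assumptions for demonstration, not the published NXY-059/NINDS data or the thesis's actual simulation machinery. It dichotomises the mRS at the conventional 0-2 "good outcome" cut-point and estimates power by Monte Carlo.

```python
import math
import random

# Illustrative mRS grade probabilities (grades 0-6); placeholders only,
# not the published NXY-059 / NINDS tPA trial distributions.
CONTROL = [0.10, 0.15, 0.15, 0.20, 0.20, 0.10, 0.10]
TREATED = [0.15, 0.18, 0.16, 0.19, 0.16, 0.08, 0.08]

def draw(dist, n, rng):
    """Sample n mRS grades (0-6) from a probability distribution."""
    return rng.choices(range(7), weights=dist, k=n)

def misclassify(grades, p_correct, rng):
    """Crude stand-in for imperfect interobserver reliability: with
    probability 1 - p_correct a recorded grade drifts one rank up or down."""
    out = []
    for g in grades:
        if rng.random() > p_correct:
            g = min(6, max(0, g + rng.choice([-1, 1])))
        out.append(g)
    return out

def significant(ctrl, trt):
    """Two-sided two-proportion z-test on the mRS 0-2 'good outcome'."""
    n1, n2 = len(ctrl), len(trt)
    x1 = sum(g <= 2 for g in ctrl)
    x2 = sum(g <= 2 for g in trt)
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    z = (x2 / n2 - x1 / n1) / se
    return abs(z) > 1.96  # alpha = 0.05

def power(p_ctrl, p_trt, n_per_arm, p_correct, n_sims=300, seed=1):
    """Monte Carlo power estimate under a given reliability level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        c = misclassify(draw(p_ctrl, n_per_arm, rng), p_correct, rng)
        t = misclassify(draw(p_trt, n_per_arm, rng), p_correct, rng)
        hits += significant(c, t)
    return hits / n_sims
```

Sweeping `p_correct` (a loose proxy for kappa) and `n_per_arm` over a grid is then enough to read off how many participants a given gain in reliability could save, which is the shape of the sample-size comparison reported above.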
At 30 and 90 days respectively, 57.5% (161/280) and 50.8% (131/258) of clips were misclassified. Agreement was measured using kappa statistics (k/kw) and the intraclass correlation coefficient. Agreement within the adjudication committee was very good (30 days kw=0.85 [95% CI 0.81-0.86]; 90 days kw=0.86 [95% CI 0.82-0.88]), with no significant or systematic bias in mRS scoring in comparison to the local mRS. We demonstrated criterion and construct validity of centrally adjudicated mRS scores through comparison with the locally assigned mRS score and with other measures known to affect stroke outcome, including baseline NIHSS (bNIHSS), systolic blood pressure (SBP), blood glucose and home time.

We studied our cohort of mRS video clips to identify features predictive of variability in mRS scoring. Patient-specific variables included participant age, pre-stroke mRS, baseline stroke severity as graded by bNIHSS, and presence of language disorder. Interview-specific variables included length of interview, poor sound quality, location of the interview, use of a proxy, and discussion of prior disability. At both 30 and 90 days, only interview length was a significant predictor of agreement in mRS scoring.

Using a sample of mRS video clips in English and Mandarin, we conducted a pilot study to assess the effect of translation of mRS interviews on interobserver reliability. The interobserver reliability of the translated mRS assessments was similar to that of native-language clips (native (n=69) kw=0.91 [95% CI 0.86-0.99]; translated (n=89) kw=0.90 [95% CI 0.83-0.96]). We then incorporated a translation step into the central adjudication model using our existing web portal. Interobserver reliability for the modified clips (kw=0.85 [95% CI 0.74-0.95]) was similar to that seen in the original video files (kw=0.88 [95% CI 0.78-0.99]).
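The kw values above are weighted kappas, which credit near-miss disagreements that the unweighted statistic penalises in full. A minimal sketch of both statistics for two raters on the 0-6 mRS scale follows; it assumes linear weights for illustration (quadratic weights are also common, and the specific weighting scheme used in the analyses above is not restated here).

```python
def weighted_kappa(a, b, k=7, weight="linear"):
    """Cohen's kappa between two raters on an ordinal 0..k-1 scale.
    weight=None gives unweighted kappa; 'linear' discounts disagreement
    in proportion to the distance between grades (the usual kw)."""
    n = len(a)
    # Joint distribution of (rater A grade, rater B grade).
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1 / n
    pa = [sum(row) for row in obs]                               # A marginals
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]    # B marginals
    if weight == "linear":
        w = lambda i, j: 1 - abs(i - j) / (k - 1)
    else:
        w = lambda i, j: float(i == j)
    po = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    pe = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)
```

On a set of ratings where every disagreement is a single grade apart, the weighted statistic sits well above the unweighted one, which is why kw around 0.85 can coexist with a more modest unweighted k.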
Finally, we investigated the ability of raters to detect more subtle degrees of disability within mRS grades through blinded assessment of pairs of clips with matching mRS grades. These pairs contained either two clips with full agreement in mRS grade at initial group review, or one clip with full agreement and one clip where scores were skewed in the direction of "more" or "less" disability. Pairs were randomly assigned to multiple raters. We could not identify any reliable pattern in identification of the "less disabled" clip; on the basis of this exploratory study, more sensitive grading of the mRS into "good" or "bad" forms of each grade is not reliable. Alternative methods of converting the ordinal ranks of the mRS into a more continuous distribution, such as a mean mRS score derived from multiple ratings, should perhaps be investigated.

Prior estimates of mRS reliability in multicentre studies are poor (k=0.25), and the risks of endpoint misclassification degrading trial power are substantial. Simulations suggest that improving interobserver reliability and using multiple mRS assessments may reduce study sample size by 25%, with substantial ethical and financial benefits. Agreement within our adjudication committee was good (k=0.59 [95% CI 0.53-0.63]; kw=0.86 [95% CI 0.82-0.88]). Central review may bring additional benefits: "expert" review, quality control and improved blinding in complex trial designs. Central adjudication of mRS assessments is feasible, reliable and valid, including the use of translated mRS assessments. This model of outcome assessment has been incorporated into four ongoing large clinical trials: CLEAR-3, MISTIE-3, EUROHYP-1 and SITS-OPEN.
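The rationale for averaging multiple mRS ratings can be shown with a small simulation: independent rating errors partially cancel, so the measurement error of an m-rater mean shrinks roughly as 1/sqrt(m). The noise model, panel sizes and function names below are illustrative assumptions, not the thesis's adjudication procedure.

```python
import random
import statistics

def rate(true_grade, p_correct, rng):
    """One rater's score: correct with probability p_correct, otherwise
    off by one grade (illustrative noise model, as an assumption)."""
    if rng.random() <= p_correct:
        return true_grade
    return min(6, max(0, true_grade + rng.choice([-1, 1])))

def measurement_sd(n_raters, p_correct=0.7, n_patients=4000, seed=2):
    """SD of (mean panel score - true grade) across simulated patients."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_patients):
        true = rng.randint(1, 5)  # interior grades, to avoid boundary effects
        mean_score = statistics.fmean(
            rate(true, p_correct, rng) for _ in range(n_raters)
        )
        errors.append(mean_score - true)
    return statistics.pstdev(errors)
```

Under this toy model a three-rater mean roughly halves the measurement noise of a single rater, which is one way an adjudication panel could translate into the sample-size savings estimated above.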