Is Accessibility Conformance an Elusive Property?

We undertook a study of the validity and reliability of WCAG 2.0 and found that an 80% target for agreement is not attainable when audits are conducted without communication between evaluators. Even with experienced evaluators the error rate is relatively high, and untrained accessibility auditors, be they developers or quality testers from other domains, do much worse than this. Read the full published text via ACM Author-izer Open Access on the publications page.

You can also cite us using the BibTeX at the bottom of this post. But what does this really mean? Well, it means that it’s going to be very difficult to reach the 80% agreement that WCAG 2 requires of expert evaluators. Now this isn’t the deal breaker it may seem, as we do see agreement at the 70% mark, so agreement does still happen; however, strict adherence to the standard cannot be achieved, at least in our strict experimental work. It may be that a post-evaluation round-table discussion would push up the agreement, or that a collaborative round-table evaluation would do so. We don’t know, as there is no empirical work out there to support this.
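To make the 80% threshold concrete, here is a minimal sketch in Python of one way to check it for a single success criterion. It assumes that "agreement" means the share of evaluators who gave the most common verdict; that is an illustrative definition, not necessarily the measure used in the paper. The function names and the sample verdicts are hypothetical.

```python
from collections import Counter


def agreement(verdicts):
    """Share of evaluators who gave the most common verdict.

    Assumption: agreement is the majority share. This is an
    illustrative definition, not the paper's exact measure.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    # most_common(1) returns [(verdict, count)] for the top verdict
    _, top_count = Counter(verdicts).most_common(1)[0]
    return top_count / len(verdicts)


def reliably_testable(verdicts, threshold=0.8):
    """True if the majority share meets the threshold (80% by default)."""
    return agreement(verdicts) >= threshold


# Hypothetical verdicts from ten evaluators on one success criterion
sample = ["pass"] * 7 + ["fail"] * 3
print(agreement(sample))          # 0.7
print(reliably_testable(sample))  # False
```

With seven of ten evaluators agreeing, the criterion sits at the 70% mark and falls short of the 80% bar, which is the situation described above.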

@article{Harper2012hgy,
Abstract = {The Web Content Accessibility Guidelines (WCAG) 2.0 separate testing into both “Machine” and “Human” audits; and further classify “Human Testability” into “Reliably Human Testable” and “Not Reliably Testable”; it is human testability that is the focus of this paper. We wanted to investigate the likelihood that “at least 80% of knowledgeable human evaluators would agree on the conclusion” of an accessibility audit, and therefore understand the percentage of success criteria that could be described as reliably human testable, and those that could not. In this case, we recruited twenty-five experienced evaluators to audit four pages for WCAG 2.0 conformance. These pages were chosen to differ in layout, complexity, and accessibility support, thereby creating a small but variable sample.

We found that an 80% agreement between experienced evaluators almost never occurred and that the average agreement was at the 70–75% mark, while the error rate was around 29%. Further, trained—but novice—evaluators performing the same audits exhibited the same agreement to that of our more experienced ones, but a reduction on validity of 6–13% ; the validity that an untrained user would attain can only be a conjecture. Expertise appears to improve (by 19%) the ability to avoid false positives. Finally, pooling the results of two independent experienced evaluators would be the best option, capturing at most 76% of the true problems and producing only 24% of false positives. Any other independent combination of audits would achieve worse results.