In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis becomes more dependable as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.

Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.

Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:

An item that has been presented many times may need to be retired. There is no hard-and-fast number for how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.

If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.

If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
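
The checks described above can be sketched in a few lines of Python. This is only an illustration, not how Questionmark's reports are implemented: the function names `choice_distribution` and `flag_item`, the flag messages, and the "more than half" cutoff for a majority are all assumptions made for the example.

```python
from collections import Counter

def choice_distribution(responses, choices):
    """Share of presentations on which each choice was selected."""
    if not responses:
        raise ValueError("no responses recorded for this item")
    counts = Counter(responses)
    total = len(responses)
    return {choice: counts.get(choice, 0) / total for choice in choices}

def flag_item(responses, choices, correct):
    """Return a rough, hypothetical flag for an item based on choice counts."""
    dist = choice_distribution(responses, choices)
    top_choice, top_share = max(dist.items(), key=lambda kv: kv[1])
    if top_share > 0.5 and top_choice != correct:
        # Most test takers agree on a choice that is keyed as wrong.
        return "check key or training"
    if top_share <= 0.5:
        # No choice is selected a majority of the time.
        return "possible guessing"
    return "ok"

# Example: seven of ten test takers pick "B", but "A" is keyed as correct.
responses = ["B"] * 7 + ["A"] * 2 + ["C"]
print(len(responses), "presentations")
print(choice_distribution(responses, ["A", "B", "C", "D"]))
print(flag_item(responses, ["A", "B", "C", "D"], correct="A"))
```

A flag from a sketch like this is only a prompt to review the item by hand; it cannot tell you whether the problem lies with the answer key, the wording, or the training.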

Item analysis typically provides two key pieces of information: the Difficulty Index (the proportion of test takers who answered the item correctly) and the Point-Biserial Correlation.

Point-Biserial Correlation: how well the item discriminates between those who did well on the exam and those who did not

Positive value = those who got the item correct also did well on the exam, and those who got the item wrong also did poorly on the exam

Negative value = those who did well on the test got the item wrong, and those who did poorly on the test got the item right

+0.10 or above is typically required to keep an item
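
For readers who want to see where these numbers come from, here is a minimal Python sketch. It assumes each item is scored 1 (correct) or 0 (incorrect), takes the Difficulty Index to be the proportion of correct answers, and computes the point-biserial as the ordinary Pearson correlation between the item score and the total test score. The function names are hypothetical, and some tools instead correlate against the total score with the item itself removed, which gives slightly lower values.

```python
from math import sqrt

def difficulty_index(item_scores):
    """Proportion of test takers who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    if n != len(total_scores) or n == 0:
        raise ValueError("need one item score and one total score per test taker")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((x - mean_item) * (y - mean_total)
              for x, y in zip(item_scores, total_scores))
    ss_item = sum((x - mean_item) ** 2 for x in item_scores)
    ss_total = sum((y - mean_total) ** 2 for y in total_scores)
    if ss_item == 0 or ss_total == 0:
        # Everyone scored the same on the item or on the test: undefined.
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(ss_item * ss_total)

# Five test takers: item score (1 = correct) and total exam score.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 4, 3]
print(difficulty_index(item))
r = point_biserial(item, totals)
print(round(r, 2), "keep" if r >= 0.10 else "review")
```

In this made-up data the test takers who answered the item correctly also have the highest totals, so the correlation is strongly positive and clears the +0.10 threshold.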

Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)

Reliability Coefficient: range of 0 – 1.00

Acceptable value depends on consequences of testing error

If failing means having to take some training again, a lower value might be acceptable

If failing means the health and safety of coworkers might be in jeopardy, a high value is required

There are a number of different types of consistency:

Test – Retest: repeatability of test scores with the passage of time

Alternate / Parallel Form: consistency of score across two or more forms by same test taker

Inter-Rater: consistency of test score when rated by different raters

Internal Consistency: extent to which items on a test measure the same thing

Most common measures: Kuder-Richardson 20 (KR-20) or Coefficient Alpha

For KR-20, items must be single answer (right/wrong)

May be low if test measures several different, unrelated objectives

Low value can also indicate many very easy or hard items, poorly written items that do not discriminate well, or items that do not test the proper content
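
As an illustration of internal consistency, KR-20 can be computed directly from a table of right/wrong item scores. The sketch below assumes one row per test taker and one column per item, scored 1 or 0, and uses the population variance of the total scores; some tools use the sample variance instead, which changes the result slightly. The function name `kr20` is hypothetical.

```python
from statistics import pvariance

def kr20(score_matrix):
    """KR-20 for rows of 0/1 item scores (one row per test taker).

    KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores),
    where k is the number of items, p is the proportion answering an item
    correctly, and q = 1 - p.
    """
    n_people = len(score_matrix)
    n_items = len(score_matrix[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in score_matrix]
    var_total = pvariance(totals)  # population variance (an assumption)
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Five test takers, four right/wrong items (made-up data).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 2))
```

A real analysis would use far more than five test takers; the tiny table here only makes the arithmetic easy to follow by hand.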