
Slide 2: Outline
- Contrast IRT with classical test theory
- Introduce basic concepts in IRT
- Illustrate IRT methods with ADL and IADL scales
- Discuss empirical comparisons of IRT and CTT
- Advantages and disadvantages of IRT
- When would it be appropriate to use IRT?

Slide 3: Test Theory
- Any item in any health measure has two parameters:
  - The level of ability required to answer the question correctly. In health, this translates into the level of health at which the person does not report the problem.
  - The level of discrimination of the item: how accurately it distinguishes the well from the sick.

Slide 4: Classical Test Theory
- The most common paradigm for scale development and validation in health.
- Few theoretical assumptions, so broadly applicable.
- Partitions the observed score into True Score + Error.
- The probability of a given item response is a function of the person to whom the item is administered and the nature of the item.
- Item difficulty: the proportion of examinees who answer the item correctly (in a health context, item severity).
- Item discrimination: the biserial correlation between the item and the total test score.
- Because of this, estimates of item parameters (item-total correlation, difficulty, discrimination) will change with different types of samples; validity must be re-established for a different population.
- The total score depends on the particular set of items used.
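The two CTT item statistics above can be sketched in a few lines. This is an illustrative example, not the lecture's own computation: the response matrix is invented, and a simple Pearson item-total correlation stands in for the biserial correlation the slide mentions.

```python
# Sketch of classical test theory item statistics for a small 0/1
# response matrix (rows = examinees, columns = items).

def item_difficulty(responses, item):
    """Proportion of examinees scoring 1 on the item."""
    scores = [row[item] for row in responses]
    return sum(scores) / len(scores)

def item_total_correlation(responses, item):
    """Pearson correlation between item scores and total test scores
    (a simple stand-in for the biserial correlation)."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    n = len(responses)
    mi = sum(item_scores) / n
    mt = sum(totals) / n
    cov = sum((x - mi) * (t - mt) for x, t in zip(item_scores, totals)) / n
    sdi = (sum((x - mi) ** 2 for x in item_scores) / n) ** 0.5
    sdt = (sum((t - mt) ** 2 for t in totals) / n) ** 0.5
    return cov / (sdi * sdt)

# Invented responses: four examinees, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(responses, 0))   # 0.75: an "easy" item
```

Note that both statistics are computed from this particular sample, which is exactly the sample dependence the slide warns about: a healthier sample would make every item look "easier".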

Slide 5: Classical Test Theory
- The probability of a 'no' answer depends on the type of item (difficulty) and the level of physical functioning (e.g., SF-36 bathing vs. able to do vigorous activities).
- Some limitations:
  - Item difficulty, discrimination, and ability are confounded.
  - Sample dependent: item difficulty estimates will differ in different samples, and the estimate of ability is item dependent.
  - Difficult to compare scores across two different tests because they are not on the same scale.
  - The test is often on an ordinal scale of measurement.
  - Assumes equal errors of measurement at all levels of ability.

Slide 6: Item Response Theory
- A complete theory of measurement and item selection.
- Theoretically, item characteristics are not sample dependent, and estimates of ability are not item dependent.
- Item scores are reported on the same scale as ability.
- Puts all individual scores on a standardized, interval-level scale, making comparisons between tests and individuals easy.
- This means that item validity should be the same for different samples.
- In CTT, the total test score depends on the difficulties of the items; in IRT, item difficulty is taken into account when estimating total ability. This means the estimate of ability should be the same for different groups of items.

Slide 7: Item Response Theory
- Assumes that a normally distributed latent trait underlies performance on a measure.
- Assumes unidimensionality: all items measure the same construct.
- Assumes local independence: items are uncorrelated with each other when ability is held constant. That is, if you hold ability constant, the probability of responding correctly to one item is independent of the probability of responding correctly to another item; all variation in the items is due to the underlying latent trait.
- Given unidimensionality, the response to an item is a monotonically increasing function of the latent trait (see the item characteristic curves on the next slide).

Slide 8: Illustration of IRT with the ADL and IADL Scales
- The latent traits represent the ability to perform self-care activities and instrumental activities (those necessary for independent living).
- Item difficulty (b): the level of function corresponding to a 50% chance of endorsing the item.
- Item discrimination (a): the slope of the item characteristic curve, i.e., how well it differentiates low- from high-functioning people.
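The definitions of (a) and (b) above correspond to the two-parameter logistic (2PL) item characteristic curve, P(theta) = 1 / (1 + exp(-a(theta - b))). A minimal sketch, with invented parameter values:

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: probability of endorsing the
    item at latent trait level theta, given discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A hard item (b = 1.0) with fairly high discrimination (a = 2.0):
print(icc(1.0, a=2.0, b=1.0))   # 0.5: exactly 50% endorsement at theta == b
print(icc(0.0, a=2.0, b=1.0))   # well below 0.5 for lower-functioning respondents
```

At theta equal to b the curve passes through 0.5, which is exactly the "50% chance of endorsing the item" definition of difficulty on this slide; a controls how steeply the curve rises through that point.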

Slide 11: IRT can show the distribution of respondents along theta, and can also show the distribution of item difficulties (lower chart).

Slide 12: It can also show the theta location of different response levels (here, a 0-3 scale).

Slide 13: Differential Item Functioning
- Assuming that the measured ability is unidimensional and that the items measure the same ability, the item curve should be unique except for random variations, irrespective of the group for whom the item curve is plotted.
- Items that do not yield the same item response function for two or more groups are violating one of the fundamental assumptions of item response theory, namely that the item and the test in which it is contained are measuring the same unidimensional trait.

Slide 15: Item Bias
- Items may be biased against one gender, linguistic, or social group.
- Bias can result in people being falsely identified with problems, or in problems being missed.
- Two elements in bias detection:
  - Statistical detection of Differential Item Functioning
  - Item review
- If the source of the difference is not related to performance, then the item is biased.

Slide 16: DIF Detection
- An important part of test validation; helps to ensure measurement equivalence.
- Scores on individual items are compared for two groups:
  - The reference group
  - The focal group under study
- The groups are matched on total test score (ability).

Slide 17: DIF Detection
- DIF can be uniform or nonuniform.
- Uniform: the probability of answering the item correctly is consistently higher for one group.
- Nonuniform: the probability of answering the item correctly is higher for one group at some points on the scale, and perhaps lower at other points.
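In 2PL terms, the uniform/nonuniform distinction can be sketched as follows. All parameter values here are invented for illustration: under uniform DIF the groups differ only in difficulty b, so one group's curve lies below the other's everywhere; under nonuniform DIF the discriminations a also differ, so the curves cross.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-2, -1, 0, 1, 2]

# Uniform DIF: same a, different b -> a constant ordering of the curves.
uniform_gap = [icc(t, a=1.5, b=0.0) - icc(t, a=1.5, b=0.5) for t in thetas]
print(all(g > 0 for g in uniform_gap))          # True: one group higher at every theta

# Nonuniform DIF: different a -> the sign of the gap flips across theta.
nonuniform_gap = [icc(t, a=0.8, b=0.0) - icc(t, a=2.0, b=0.0) for t in thetas]
print(nonuniform_gap[0] > 0 and nonuniform_gap[-1] < 0)   # True: the curves cross
```

Crossing curves are what make nonuniform DIF harder to detect: averaged over the whole theta range, the two groups' overall endorsement rates can look similar even though the item behaves differently for them.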

Slide 21: Steps in IRT (continued)
- Score the examinees.
- Get item information estimates, based on discrimination adjusted for 'standard error'.
- Study test information.
- If choosing items from a larger pool, you can discard items with low information and retain items that give more information where it is needed.

Slide 22: Item Information
- Item information is a function of item difficulty and discrimination: it is high when the item difficulty is close to the average level of function in the group and when the ICC slope is steep.
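For the 2PL model, this relationship has a simple closed form: Fisher item information is I(theta) = a^2 * P(theta) * (1 - P(theta)). A sketch with invented (a, b) values, also showing that test information (slide 21) is just the sum over items:

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL Fisher item information: a^2 * P * (1 - P).
    Peaks at theta == b, and grows with the square of a."""
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

# Information is highest where theta equals the item difficulty b ...
print(item_information(0.0, a=2.0, b=0.0))   # 1.0 (the peak: a^2 * 0.25)
print(item_information(2.0, a=2.0, b=0.0))   # much lower away from b

# ... and test information is the sum of item informations.
items = [(2.0, 0.0), (1.0, -1.0), (1.5, 1.0)]   # invented (a, b) pairs
test_info = sum(item_information(0.0, a, b) for a, b in items)
```

This is why items can be selected from a pool to "give more information where it is needed": each candidate item's information curve shows exactly which region of theta it sharpens.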

Slide 23: The ADL Scale Example
- Caregiver ratings of ADL and IADL performance for 1686 people.
- 1048 with dementia and 484 without dementia.
- 1364 had complete ratings.

Slide 24: ADL/IADL Example: Procedures
- Assessed dimensionality; found two dimensions: ADL and IADL.
- Assessed the fit of the one-parameter and two-parameter models for each scale.
  - The two-parameter model was better: only 3 items fit the one-parameter model, and there was a significant improvement in chi-square goodness of fit.
- Used the two-parameter model to get item statistics for 7 ADL items and 7 IADL items.
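The model comparison above can be sketched as a likelihood-ratio test: the 1PL model (one common discrimination) is nested in the 2PL model (one discrimination per item), so twice the log-likelihood difference is approximately chi-square distributed with degrees of freedom equal to the number of extra parameters. The numbers below are invented for illustration, not taken from the ADL data.

```python
# Hypothetical fitted log-likelihoods (illustrative values only).
loglik_1pl = -4321.7
loglik_2pl = -4302.4
n_extra_params = 6        # e.g., 7 item discriminations instead of 1 shared one

# Likelihood-ratio statistic for the nested comparison.
lr_stat = 2.0 * (loglik_2pl - loglik_1pl)
critical_value = 12.59    # chi-square critical value at alpha = 0.05, df = 6

print(round(lr_stat, 1))            # 38.6
print(lr_stat > critical_value)     # True: the 2PL model fits significantly better
```

This is one common way to justify the extra parameters of the 2PL model; the slide's "significant improvement in chi-square goodness of fit" refers to a comparison of this kind.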

Slide 25: ADL/IADL
- Results for each item: difficulty, discrimination, and fit to the model.
- Results for item information and total scale information.

Slide 26: Example of IRT with the Relative's Stress Scale
- The latent trait (theta) represents the intensity of stress due to recent life events.
- Item severity or difficulty (b): the level of stress corresponding to a 50% chance of endorsing the item.
- Item discrimination (a): the slope of the item characteristic curve, i.e., how well it differentiates low- from high-stress cases.
- Item information is a function of both: it is high when (b) is close to the group's stress level and the slope (a) is steep.

Slide 27: Stress Scale: Item Information
- Item information is a function of item difficulty and discrimination: it is high when the item difficulty is close to the group's stress level and when the ICC slope is steep.

Slide 29: Stress Scale: Item Discrimination
- Item discrimination is reflected in the slope of the item characteristic curve (ICC): how well does the item differentiate low- from high-stress cases?

Slide 30: Example: Developing an Index of Instrumental Support
- Community sample: CSHA-1.
- A baseline indicator of social support was needed, as it is an important predictor of health.
- Concept: availability and quality of instrumental support.
- Blended IRT and classical methods.

Slide 31: Sample and Procedures
- Sample: 8089 people, randomly divided into two samples: development and validation.
- Procedures: item selection and coding (7 items).

Slide 34: Empirical Comparison of IRT and CTT in Scale Validation
- Few studies so far; proponents of IRT assume it is better. However:
  - IRT and CTT often select the same items.
  - There are high correlations between CTT and IRT estimates of difficulty and discrimination.
  - There are very high correlations (0.93) between CTT and IRT estimates of the total score.

Slide 35: Empirical Comparisons (cont'd)
- Little difference in the criterion or predictive validity of IRT scores; IRT scores are only slightly better.
- When item discriminations are highly varied, IRT is better.
- IRT item parameters can be sample dependent, so validity needs to be established on different samples, as in CTT.

Slide 36: Advantages of IRT
- The contribution of each item to the precision of the total test score can be assessed.
- Estimates the precision of measurement at each level of ability and for each examinee.
- With a large item pool, item and test information are excellent for building tests to suit different purposes.
- Graphical illustrations are helpful.
- Tests can be tailored to needs: for example, a criterion-referenced test can be developed that has the most precision around the cut-off score.

Slide 37: Advantages of IRT
- Interval-level scoring: more analytic techniques can be used with the scale.
- Ability on different tests can be easily compared.
- Good for tests where a core of items is administered but different groups get different subsets (e.g., cross-cultural testing, computer-adaptive testing).

Slide 38: Disadvantages of IRT
- Strict assumptions.
- Requires a large sample size (minimum 200; 1000 for complex models).
- More difficult to use than CTT: computer programs are not readily available.
- Models are complex and difficult to understand.

Slide 39: When Should You Use IRT?
- In test-building with a large item pool and a large number of subjects.
- In cross-cultural testing.
- To develop short versions of tests (but also use CTT, and your knowledge of the test).
- In test validation, to supplement information from classical analyses.

Slide 40: Software for IRT Analyses
- Rasch or one-parameter models: BICAL (Wright), RASCH (Rossi), RUMM 2010
- Two- or three-parameter models: NOHARM (McDonald), LOGIST, TESTFACT, LISREL, MULTILOG