6
Assess a student’s knowledge of topic X
Based on a sequence of items that are dichotomously scored
– E.g., the student can get a score of 0 or 1 on each item

7
Scoring
Not a simple average of the 0s and 1s
– That’s an approach that is used for simple tests, but it’s not IRT
Instead, a function is computed based on the difficulty and discriminability of the individual items

8
Key assumptions
There is only one latent trait or skill being measured per set of items
– There are other models that allow for multiple skills per item; we’ll talk about them later in the semester
Each learner has ability θ
Each item has difficulty b and discriminability a
From these parameters, we can compute the probability P(θ) that the learner will get the item correct
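As a concrete sketch, the standard 2PL form of this probability is P(θ) = 1 / (1 + e^(−a(θ − b))); the function name below is illustrative, not from the slides:

```python
import math

def p_correct(theta, a, b):
    # 2PL: probability that a learner with ability theta answers an item
    # with difficulty b and discriminability a correctly
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A learner whose ability exactly matches the item's difficulty
# has a 50% chance of a correct response
print(p_correct(theta=0.0, a=1.0, b=0.0))  # 0.5
```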

9
Note
The assumption that all items tap the same latent construct, but have different difficulties, is a very different assumption than is seen in other approaches such as BKT (which we’ll talk about later)
Why might this be a good assumption?
Why might this be a bad assumption?

10
Item Characteristic Curve
Can anyone walk the class through what this graph means?
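For readers working from the text alone, a minimal sketch of what the graph shows, assuming a standard 2PL curve: ability θ on the x-axis, P(correct) on the y-axis, rising monotonically and crossing 0.5 where θ equals the difficulty b:

```python
import math

def icc(theta, a=1.0, b=0.0):
    # Item characteristic curve: P(correct) as a function of ability theta
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The curve rises monotonically with ability and crosses 0.5 at theta = b;
# the discriminability a controls how steep that rise is
for theta in (-3, -1, 0, 1, 3):
    print(f"theta={theta:+d}  P(correct)={icc(theta):.3f}")
```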

11
Item Characteristic Curve
If Iphigenia is an Idiot, but Joelma is a Jenius, where would they fall on this curve?

31
Model Degeneracy
Where a model works perfectly well computationally, but makes no sense / does not match intuitive understanding of parameter meanings
What parts of the 2PL parameter space are degenerate?
What does the ICC look like?
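One commonly discussed degenerate region, sketched here under the standard 2PL form, is negative discriminability a, where the ICC slopes downward:

```python
import math

def p2pl(theta, a, b):
    # Standard 2PL item response function
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# With a < 0 the ICC slopes downward: more able students are *less*
# likely to get the item right -- computationally fine, conceptually
# degenerate
low_ability = p2pl(-2.0, a=-1.0, b=0.0)
high_ability = p2pl(2.0, a=-1.0, b=0.0)
print(low_ability > high_ability)  # True
```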

41
Model Degeneracy
Where a model works perfectly well computationally, but makes no sense / does not match intuitive understanding of parameter meanings
What parts of the 3PL parameter space are degenerate?
What does the ICC look like?
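A sketch of one degenerate region of the 3PL space, assuming the usual lower-asymptote ("guessing") parameter c: when c is very high, the curve is nearly flat and the item tells us little about anyone:

```python
import math

def p3pl(theta, a, b, c):
    # 3PL adds a lower asymptote c, often read as a "guessing" parameter
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c = 0.9, even a learner far below the item's difficulty succeeds
# over 90% of the time -- the item can barely distinguish anyone
print(p3pl(theta=-3.0, a=1.0, b=0.0, c=0.9))
```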

42
Fitting an IRT model
Typically done with Maximum Likelihood Estimation (MLE)
– Which parameters make the data most likely?
We’ll do it here with Maximum A Posteriori estimation (MAP)
– Which parameters are most likely, given the data?

43
The difference
Mostly a matter of religious preference
– In many models (though not IRT) they are the same thing
– MAP is usually easier to calculate
– Statisticians frequently prefer MLE
– Data miners sometimes prefer MAP
– In this case, we use MAP solely because it’s easier to do in real time
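A minimal sketch of the distinction, with illustrative names and toy data (the normal prior on ability is an assumption for illustration): MAP simply adds a log-prior to the log-likelihood that MLE maximizes, which pulls the estimate toward the prior mean:

```python
import math

def log_likelihood(theta, items, responses):
    # items: list of (a, b) pairs; responses: list of 0/1 scores
    ll = 0.0
    for (a, b), y in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

def log_posterior(theta, items, responses, prior_mean=0.0, prior_sd=1.0):
    # MAP adds a log-prior on ability (here, a normal prior) to the
    # log-likelihood that MLE maximizes
    log_prior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    return log_likelihood(theta, items, responses) + log_prior

# Toy data: three items (a, b) and one student's 0/1 responses
items = [(1.0, -1.0), (1.0, 0.0), (1.5, 1.0)]
responses = [1, 1, 0]

# Grid search over ability; a real fit would use a numerical optimizer
grid = [i / 100.0 for i in range(-400, 401)]
mle = max(grid, key=lambda t: log_likelihood(t, items, responses))
map_est = max(grid, key=lambda t: log_posterior(t, items, responses))
print(mle, map_est)  # the prior pulls the MAP estimate toward 0
```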

45
Let’s fit IRT parameters to this data
We’ll use SSR (sum of squared residuals) as our goodness criterion
– Lower SSR = less disagreement between data and model = better model
– This is a standard goodness criterion within statistical modeling
Why SSR rather than just the sum of residuals?
What are some other options?
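A minimal sketch of SSR (function name illustrative); squaring keeps positive and negative residuals from cancelling, which a plain sum of residuals would allow:

```python
def ssr(observed, predicted):
    # Sum of squared residuals; squaring stops positive and negative
    # residuals from cancelling out, unlike a plain sum of residuals
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

# Three items: observed 0/1 scores vs. model-predicted probabilities
print(round(ssr([1, 0, 1], [0.9, 0.2, 0.6]), 2))  # 0.21
```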

53
Standard Error in Estimation of Student Knowledge
1.96 standard errors in each direction = 95% confidence interval
Standard error bars are typically 1 standard error
– If you compare two different values, each of which has 1-standard-error bars
– Then if the bars do not overlap, the values are significantly different
This glosses over some details, but is basically correct
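A minimal sketch of these two rules of thumb (function names are illustrative):

```python
def ci95(estimate, se):
    # 1.96 standard errors in each direction gives a 95% confidence interval
    return (estimate - 1.96 * se, estimate + 1.96 * se)

def one_se_bars_overlap(est1, se1, est2, se2):
    # Error bars spanning +/- 1 standard error; if they do not overlap,
    # the two values are significantly different (glossing over details)
    return est1 + se1 >= est2 - se2 and est2 + se2 >= est1 - se1

print(ci95(0.0, 0.5))                           # (-0.98, 0.98)
print(one_se_bars_overlap(0.0, 0.3, 1.0, 0.3))  # False
```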

54
Standard Error in Estimation of Student Knowledge
Let’s estimate the standard error of some of our student estimates in the data set
Are there any students for whom the estimates are not trustworthy?

55
Final Thoughts
IRT is the classic approach to assessing knowledge through tests
Extensions are used heavily in Computer-Adaptive Tests
Not frequently used in Intelligent Tutoring Systems
– Where models that treat learning as dynamic are preferred; more next class