Presidential Address: Improving precision of CAT measures

The basic idea of adaptive testing is quite simple and has been implemented for over a century (Binet-Simon; oral examinations; etc.). Over the years, item selection algorithms such as MI, MPP and WI have been developed to maximize efficiency and convergence, and MLE, EAP and MAP are commonly used to estimate ability.

Dichotomously scored MCQs are mostly used to obtain response vectors in CATs. This means that a response is scored as either correct or incorrect. However, a correct response doesn’t necessarily mean that the test taker knew the answer. Although the SEM decreases steadily as the provisional ability estimate is updated, the question is whether the process can be improved at the item response level. In other words, can more information be extracted from a response than a simple 0 or 1? My presentation addresses this question.
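The shrinking SEM described above can be illustrated with a minimal sketch. It is not the speaker's method: it assumes a Rasch model, reads MI as maximum-information item selection, uses EAP estimation on a grid with a standard normal prior, and takes the posterior standard deviation as the SEM. The item bank, the true ability and all function names are hypothetical.

```python
import math
import random


def p_correct(theta, b):
    """Rasch model: probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: P * (1 - P)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def eap(posterior, grid):
    """Posterior mean (EAP) and posterior SD, used here as the SEM."""
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(grid, posterior)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, posterior)) / total
    return mean, math.sqrt(var)


def simulate_cat(true_theta, bank, test_length, seed=0):
    """Simulate one adaptive test; return (theta_hat, SEM) after each item."""
    rng = random.Random(seed)
    grid = [-4.0 + 0.1 * k for k in range(81)]
    # Unnormalized standard normal prior on the grid (an assumption).
    posterior = [math.exp(-0.5 * t * t) for t in grid]
    available = list(bank)
    theta_hat, sem = eap(posterior, grid)
    history = [(theta_hat, sem)]
    for _ in range(test_length):
        # Maximum-information selection at the provisional estimate.
        b = max(available, key=lambda d: item_information(theta_hat, d))
        available.remove(b)
        # Simulated dichotomous response: 1 = correct, 0 = incorrect.
        u = 1 if rng.random() < p_correct(true_theta, b) else 0
        # Bayes update of the posterior with the likelihood of the response.
        posterior = [
            w * (p_correct(t, b) if u else 1.0 - p_correct(t, b))
            for t, w in zip(grid, posterior)
        ]
        theta_hat, sem = eap(posterior, grid)
        history.append((theta_hat, sem))
    return history


# Hypothetical bank of 25 items with difficulties from -3.0 to 3.0.
bank = [-3.0 + 0.25 * k for k in range(25)]
history = simulate_cat(true_theta=0.8, bank=bank, test_length=15)
for step, (theta_hat, sem) in enumerate(history):
    print(f"items={step:2d}  theta_hat={theta_hat:+.2f}  SEM={sem:.2f}")
```

Each response enters the update only as a 0 or 1, which is the limitation the abstract questions.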

Bio: John is the founder and Executive Director of EPEC Pty Ltd in Melbourne, Australia and an internationally recognized expert in psychometrics, assessment and education. His work and leadership have been recognized through a dual appointment as professor, in an adjunct capacity at The University of Sydney and in an honorary position at The University of Cape Town. He also taught courses in psychometrics, statistics and research methodology on a sessional basis at the Australian Catholic University from 2000 to 2013. John is a full member of a number of professional organizations, most recently as a Board member of the International Association for Computerized Adaptive Testing (IACAT) since 2010, elected as Vice President in 2013 and as President from 2015. John is also a member of the International Assessments Joint National Advisory Committee (IAJNAC). He is a consulting editor of the Journal of Computerized Adaptive Testing and a member of the International Editorial Board of the SA Journal of Science and Technology.

He holds three doctorates (D.Ed.; Ph.D.; Ed.D.) following two master's degrees (M.Sc.; M.Ed.). He specializes in psychometrics, computer-based assessment and measurement theories. His acclaimed work as a researcher has been acknowledged with various awards and promotion to the highest rank as chief research specialist. Following studies in the US, John pioneered the implementation of IRT in South Africa at national level, developed an item banking system and published the first CATs for selection of students in the late 1980s. After resigning as professor at the University of South Africa, he migrated to Australia in 1996 to take up a position as research director and head of psychometrics. Since founding EPEC in 2000, John has been active in numerous projects, from engineering CATs, through psychometric services, to the development of Option Probability Theory.

This presentation introduces several exciting new developments concerning Computerized Adaptive Testing (CAT) foundations and implementations. The first one is the establishment of a mathematical foundation to demonstrate that an examinee should be allowed to revise the answers to previously administered items during the course of testing. Currently very few operational CAT programs permit item revision, even though it is allowed in traditional paper-and-pencil tests. This has become such a major concern for both examinees and testing companies that some testing programs have decided to switch from CAT to other modes of testing. Most recently, Wang, Fellouris, and Chang have demonstrated that allowing item revision will not compromise test efficiency and security.

Then, we address a number of issues emerging from large-scale implementation and show how theoretical work can solve practical problems. Our new focus will be on Cognitive Diagnostic CAT (CD-CAT), which has become a powerful tool for schools to assess students' mastery of various skills. In particular, we will present our research results on the use of CD-CAT to improve STEM learning and retention. We will also show how CD-CAT can support individualized learning on a mass scale. Lastly, we will discuss some possible future directions of research on CAT.

Bio: Dr. Hua-Hua Chang is a Professor of Educational Psychology, Psychology and Statistics at the University of Illinois at Urbana-Champaign (UIUC). He is a practitioner turned professor. Before moving to academia in 2001, he worked in the testing industry for nine years: six years at Educational Testing Service in Princeton, NJ and three years at the National Board of Medical Examiners in Philadelphia, PA. He is the Editor-in-Chief of Applied Psychological Measurement, past President of the Psychometric Society (2012-2013), and a Fellow of the American Educational Research Association. Since 2008, he has been included on the “List of Teachers Ranked Excellent by Their Students” at UIUC for six consecutive years. Dr. Chang currently serves as the director of the Confucius Institute at UIUC, and he was most recently awarded the Changjiang Scholar Chair Professor by the Ministry of Education of PR China.

Cees Glas, Department of Research Methodology, Measurement and Data Analysis, Faculty of Behavioural Sciences, University of Twente, the Netherlands

Abstract

Computerized adaptive testing is becoming more and more prominent, not only in educational measurement, but also in fields such as industrial and organizational psychology and health assessment, especially in the assessment of quality of life and of physical ability. While unidimensional IRT models are usually satisfactory for educational measurement, a field such as assessment of quality of life often calls for multidimensional IRT models. In this presentation, we outline specific problems encountered when developing a multidimensional CAT (MCAT) for fatigue for use with rheumatoid arthritis patients. The item bank consists of polytomously scored items. Some of the problems discussed apply to CAT in general, while others are specific to MCAT.

The first topic addressed is the calibration phase. Discussed are the item administration design, and the estimation and testing procedures used. The second topic concerns the operational phase. Discussed are the item administration procedure, estimation of person parameters, model fit, differences in parameter estimates between the calibration and the operational phase, and updating strategies for item parameters. The final topic is how to conduct secondary analyses using the patients’ parameter estimates. Critical comments are made regarding debatable common practices, and suggestions are made to improve common practices in secondary analysis using data emanating from MCAT, and CAT in general.

Bio: Cees Glas is the chair of the Department of Research Methodology, Measurement and Data Analysis, at the Faculty of Behavioural Sciences of the University of Twente in the Netherlands. The focus of his work is on estimation and testing of latent variable models in general and of IRT models in particular, and on the application of IRT models in educational and psychological testing. He has participated in numerous research projects, including projects of the Dutch National Institute for Educational Measurement (Cito, the Netherlands), the Law School Admission Council (USA) and the OECD international educational survey PISA. He serves as the chair of the technical advisory committee of the OECD project PIAAC.

With Wim van der Linden, he co-edited the volume Elements of Adaptive Testing (2010) published by Springer. Published articles, book chapters and supervised theses cover such topics as testing of fit to IRT models, Bayesian estimation of multidimensional and multilevel IRT models using MCMC, modeling with non-ignorable missing data, concurrent modeling of item responses and response times, concurrent modeling of item response and textual input, and the application of computerized adaptive testing in the context of health assessment and organisational psychology.

With its well-known advantages such as improved measurement efficiency, computerized adaptive testing (CAT) is quickly becoming mainstream in the testing industry. Many test takers, however, say they are not necessarily happy with the testing experience under CAT. To prevent individuals from attempting to game the CAT system, most (if not all) CAT programs do not allow test takers to review and change their responses during the testing process. According to findings from our recent research study, more than 50% of test takers complained about increased test anxiety due to these CAT restrictions and more than 80% of test takers believe they would perform better on the test if they were allowed to review and change their responses. In this keynote session, Chris Han from the Graduate Management Admission Council (GMAC®) will introduce several CAT testing options that would allow for response review and revision while still retaining the measurement efficiency of CAT and its robustness against attempts to game the CAT system.

Bio: Kyung (Chris) T. Han is a senior psychometrician and director at the Graduate Management Admission Council, responsible for designing various test programs, including the Graduate Management Admission Test® (GMAT®) exam, and for conducting psychometric research to improve and ensure the quality of the test programs. Han received his doctorate in Research and Evaluation Methods from the University of Massachusetts at Amherst. He received the Alicia Cascallar Award for an Outstanding Paper by an Early Career Scholar in 2012 and the Jason Millman Promising Measurement Scholar Award in 2013 from the National Council on Measurement in Education (NCME). He has presented and published numerous papers and book chapters on a variety of topics, from item response theory, test validity, and test equating to adaptive testing. He has also developed several psychometric software programs, including WinGen, IRTEQ, and SimulCAT, which are used widely in the test measurement field.

Keynote 4: The future of CAT should be Open Source

Michal Kosinski, Stanford University

Abstract

Beyond high-stakes testing, the applications of CAT are still frustratingly rare. Furthermore, even well-established adaptive tests often employ only the most rudimentary methods, lagging decades behind the cutting edge of CAT research. Finally, a shortage of talent and software tools inflates the costs of expanding CAT portfolios incurred by the testing industry and its clients. As a result, individuals and economies are affected by the preventable loss of time, misallocation of talent, and decreased efficiency. I argue that the community of test publishers and CAT researchers could efficiently address these problems by embracing an open-source approach. We should focus on three areas: developing open-source research tools, open-source testing platforms, and open-source item banks. I will discuss how open-source approaches can boost CAT research, expand the pool of CAT talent, increase the quality of CAT tests, and increase profits in the CAT industry.

Bio: Michal is a Professor of Organizational Behaviour at Stanford University Graduate School of Business. His research focuses on humans in a digital environment and employs cutting-edge computational methods and Big Data mining. Michal holds a PhD in Psychology from the University of Cambridge, an MPhil in Psychometrics, and an MS in Social Psychology. He previously worked at Microsoft Research, founded a successful ITC start-up and served as a brand manager for a major digital brand.

Keynote 5: A Self-replenishing Adaptive Test

Wim van der Linden, Pacific Metrics

Abstract

Items in the operational pool for an adaptive test have a restricted life span. Ideally, we should be able to replace them periodically, using the response data immediately both to calibrate the new items and score the examinees. In my presentation, I will show how a fully Bayesian approach to calibration, item selection, and examinee scoring statistics can be exploited to realize the ideal.

Bio: Wim J. van der Linden is Distinguished Scientist and Director of Research Innovation, Pacific Metrics Corporation, Monterey, CA, and Professor Emeritus of Measurement and Data Analysis, University of Twente. He received his PhD in psychometrics from the University of Amsterdam. His research interests include test theory, computerized adaptive testing, optimal test assembly, test equating, modeling response times on test items, as well as decision theory and its application to problems of educational decision making. He is the author of Linear Models for Optimal Test Design, published by Springer in 2005, and the editor of a new three-volume Handbook of Item Response Theory: Models, Statistical Tools, and Applications, to be published by Chapman & Hall/CRC in 2015. He is also a co-editor of Computerized Adaptive Testing: Theory and Applications (Boston: Kluwer, 2000; with C. A. W. Glas), and its sequel Elements of Adaptive Testing (New York: Springer, 2010; with C. A. W. Glas). Wim van der Linden has served on the editorial boards of nearly all major test-theory journals and is co-editor for the Chapman & Hall/CRC Series on Statistics for Social and Behavioral Sciences. He is also a former President of both the National Council on Measurement in Education (NCME) and the Psychometric Society, a Fellow of the Center for Advanced Study in the Behavioral Sciences, Stanford, CA, was awarded an Honorary Doctorate from Umeå University in Sweden in 2008, and is a recipient of the ATP and NCME Career Achievement Awards for his work on educational measurement.

Keynote 6: CAT and Optimal design for Rasch Poisson Counts Models

Heinz Holling, University of Münster

Abstract

The Rasch Poisson counts model (RPCM) may be considered the first item response theory (IRT) model. It allows for the analysis of count data that are assumed to follow a Poisson distribution. Although many educational and psychological tests yield such data, the RPCM has received little attention compared to other IRT models designed for binary or polytomous responses. Unlike its counterpart, the logistic Rasch model, the classical RPCM as published in the monograph by Rasch (1960) does not benefit from adaptive testing strategies. However, recently developed extensions of this model become more efficient when such procedures are used.
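One way to see why adaptive item selection does not help the classical model is to write out the Fisher information. This is a sketch under an assumed multiplicative parameterization, with $\theta_j$ the ability of person $j$ and $\sigma_i$ the easiness of item $i$; the notation is illustrative and not taken from the abstract.

```latex
X_{ij} \sim \mathrm{Poisson}(\theta_j \sigma_i), \qquad
\ell(\theta_j) = x_{ij}\log(\theta_j \sigma_i) - \theta_j \sigma_i - \log(x_{ij}!),
\qquad
I_i(\theta_j) = -\mathrm{E}\!\left[\frac{\partial^2 \ell}{\partial \theta_j^2}\right]
= \frac{\mathrm{E}[X_{ij}]}{\theta_j^2}
= \frac{\sigma_i}{\theta_j}.
```

Under this parameterization the ratio $I_i(\theta_j)/I_k(\theta_j) = \sigma_i/\sigma_k$ does not depend on $\theta_j$. The ranking of items by information is therefore the same for every examinee, and tailoring the selection to a provisional ability estimate yields no gain.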

In this presentation, two issues will be addressed. First, locally D-optimal designs for calibrating item parameters of an RPCM using a K-way layout with binary explanatory variables are derived. To overcome the dependence of these designs on the parameters to be estimated, a sequential procedure will be introduced. The proposed method is especially suited for tests consisting of automatically generated rule-based items such as the Münster Mental Speed Test. The second topic concerns adaptive testing using the item characteristic curve Poisson counts model (ICCPCM) recently introduced by Doebler, Doebler and Holling (2014). To justify the application of this model, which is more general and flexible than the classical RPCM, we will introduce the covariate adjusted frequency plot.

Bio: Heinz Holling is professor of statistics and quantitative methods at the department of psychology of the University of Muenster, Germany. His research focuses on adaptive rule-based testing, optimal design and meta-analysis. He has published numerous articles in international journals, e.g. Psychometrika, Journal of Statistical Planning and Inference or Biometrika. Furthermore, he is the author of several books on statistics and methodology and is currently associate editor of three journals, e.g. Journal of Probability and Statistics. His research has been continuously funded for the past 30 years by the German Research Foundation and the Federal Ministry of Education and Research.

Calibration of items for adaptive testing is a very well-studied problem, and frequently the solution involves carefully curating the populations exposed to particular items. However, in an adaptive learning environment, students are exposed to items based on their current goals and needs. Thus, in adaptive learning environments, response patterns of students are necessarily biased in ways that can typically be avoided in assessment contexts. In this talk, we discuss several variations on item response theory and response patterns observed in data from Knewton’s adaptive learning platform, and we connect the pitfalls they highlight to corner cases of these models that practitioners should be aware of.

Bio: Kevin studied mathematics at the University of Michigan, where he re-implemented the computer science department's autograding technology and studied parallel SAT solving. During this time, he also obtained several cryptography-related patents. He went on to get his Ph.D. at Princeton University where he worked with Manjul Bhargava and focused on the connections between representation theory, commutative algebra, and number theory. Kevin continues collaborating with number theorists, being especially interested in the security of certain cryptographic schemes and in random families of representations. He joined Knewton in 2012 and specializes in the proficiency models underlying Knewton’s recommendation system.