Secondary school foreign language qualifications in England through the lens of the Common European Framework of Reference for Languages (CEFR): are assessment standards too high?
Milja Curcin, Beth Black

Balancing between psychometric validity and content validity: the case of differential item functioning for gender in a national assessment of French as a foreign language
Koen Aesaert, Jo Denis, Karen Van Renterghem

The CEFR as an assessment tool for learner linguistic and content competence: assisting learners in understanding the language proficiency needed for specific content goals in the CLIL classroom.
Stuart Shaw

How Can We Use Item Response Times in the Low Stakes Testing? Ideas on Reliability, Cross-National Comparability, and Responses Classification
Denis Federiakin

Looking beyond the test scores: Latent motivational profiling of teenage English language learners from four country contexts
Karen Dunn

What do international large-scale assessments tell us about high achievement in mathematics and science, with specific reference to Ireland and some comparison…
Vasiliki Pitsia, Michael O'Leary, Gerry Shiel, Zita Lysaght

The ‘grey history’ of assessment: understanding the origins of England’s new model of assessment of practical work in Science
Tim Oates

Age-standardising on-demand tests: Is there an effect of “learning time”?
Ben Smith

External evaluation as a tool for school development: how do Flemish teachers and school leaders engage with school-level feedback from large-scale national assessments?
Evelyn Goffin, Mieke Heyvaert, Isabel Laenen, Rianne Janssen, Jan Vanhoof

09:30-10:00

The Nordic student experience: How do students in Finland, Norway and Sweden experience instructional quality in Language Arts and Mathematics?
Astrid Roe, Marte Blikstad-Balas, Michael Tengberg

Differential Item Functioning (DIF) analysis is an analytic method useful for identifying potentially biased items in assessments. Because simply comparing two groups’ total scores can lead to incorrect conclusions about test fairness, many DIF detection methods have been proposed, both those based on total scores and those based on Item Response Theory (IRT) models (Martinková, Drabinová et al., 2017).

The workshop will offer an introduction to DIF detection from a practical point of view. We will first provide psychometric background, from Classical Test Theory (CTT) to IRT models. Then, we will introduce the most commonly used DIF detection methods, including i) Delta plots, ii) the Mantel-Haenszel test, iii) Logistic regression, iv) Nonlinear regression, v) the Lord/Wald test, vi) Raju’s area between item characteristic curves, and vii) SIBTEST. We will also discuss the multinomial regression model for detection of Differential Distractor Functioning (DDF). Further, we will cover other technical aspects, such as item purification and correction for multiple comparisons. We will discuss the pros and cons of each method and focus on their application in practice using real data examples.
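As an illustration of one method from this list, the Mantel-Haenszel approach can be sketched in a few lines. The workshop itself uses R and its DIF packages; the Python function below is only a minimal sketch under stated assumptions: the function name, the group labels "ref" and "focal", and the use of the total score as the matching variable are all hypothetical choices made for this example. It computes the common odds ratio across score strata and the corresponding value on the ETS delta scale, without the significance test or item purification.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    item  : list of 0/1 responses to the studied item
    total : list of matching scores (here assumed to be the total test score)
    group : list of labels, "ref" for the reference group, anything else = focal
    Returns (alpha_MH, delta_MH), or None if the ratio is undefined.
    """
    # One 2x2 table per score stratum:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, s, g in zip(item, total, group):
        idx = (0 if g == "ref" else 2) + (0 if y == 1 else 1)
        tables[s][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n  # reference correct x focal incorrect
        den += b * c / n  # reference incorrect x focal correct

    if num == 0 or den == 0:
        return None  # odds ratio cannot be estimated from these data

    alpha = num / den
    delta = -2.35 * math.log(alpha)  # ETS delta scale
    return alpha, delta


if __name__ == "__main__":
    # Toy data (made up): one score stratum, item easier for the reference group.
    item = [1, 1, 1, 0, 1, 0, 0, 0]
    total = [2] * 8
    group = ["ref"] * 4 + ["focal"] * 4
    print(mantel_haenszel_dif(item, total, group))
```

An odds ratio of 1 (delta of 0) means that, among test takers with the same matching score, both groups have the same odds of answering correctly; negative delta values indicate an item that is relatively harder for the focal group.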

The free statistical software R will be used throughout the sessions, along with its packages difNLR, difR, deltaPlotR, and mirt. Moreover, the ShinyItemAnalysis package will provide an interactive, user-friendly interface helpful for those who are new to R (Martinková & Drabinová, 2018).

The workshop is intended for researchers, graduate students, practitioners and anyone interested in how to conduct DIF analysis in practice. An introductory statistical background is expected. Some experience in R is helpful, but not required.

Attendees are expected to bring their own laptop with R installed, together with the latest versions of the R package ShinyItemAnalysis and its dependencies, including the packages difNLR, difR, mirt and deltaPlotR. Electronic training materials will be provided to the attendees.

Tentative Schedule

09.00  Coffee and registration
09.30  Welcome & introductions; Outline of the Workshop (Patrícia, Adéla)
09.45  Background to DIF analysis: from CTT to IRT (Patrícia)
11.00  Break
11.30  Examples in R and ShinyItemAnalysis (Adéla)
13.00  Lunch
14.00  Differential item functioning: Methods and approaches (Patrícia)
15.30  Break
15.45  Examples of DIF analyses in R and ShinyItemAnalysis (Adéla)
16.30  Workshop close

Presenters’ Bios:

Patrícia Martinková is a researcher at the Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences and the principal investigator of the Center for Educational Measurement and Psychometrics at the Faculty of Education, Charles University, Prague. She is also a Fulbright alumna and 2013-2015 visiting research scholar with the Center for Statistics and the Social Sciences (CSSS), and an affiliate assistant professor at the Department of Statistics and CSSS, University of Washington. Her research focuses on developing models and estimators for detection of disparities in rating and for better understanding of the differences among different groups. She has a long history in analyzing achievement tests, including those used in university admissions, grant selection or hiring. She taught Item Response Theory Models of Testing at the University of Washington, and she teaches Selected topics in psychometrics and Seminar in Psychometrics at Charles University, Prague. Martinková has published a number of innovative papers on detection of disparities in rating and assessment. She is the maintainer of the R package ShinyItemAnalysis and she initiated the development of the R package difNLR. For more information about Martinková, visit the webpage: http://www.cs.cas.cz/martinkova/

Adéla Hladká (née Drabinová) is a PhD student at the Department of Probability and Mathematical Statistics, Charles University and a PhD fellow at the Institute of Computer Science of the Czech Academy of Sciences. Her field of interest is developing statistical tools in psychometrics, with a main focus on detection of differential item functioning (DIF). Hladká is the main author of the difNLR R package, a tool for DIF detection using generalized logistic regression models, and one of the main developers of the ShinyItemAnalysis R package and application. She is a teaching assistant for Selected topics in psychometrics at Charles University. For more information about Hladká, see the webpage: http://www.cs.cas.cz/hladka/

Martinková and Hladká organized the 11th workshop on Psychometric computing, Psychoco 2019, in Prague. They are both authors of the difNLR package, with almost 20,000 downloads from CRAN, and of the ShinyItemAnalysis R package and interactive online application, now with more than 13,000 downloads from CRAN and 10,000 online accesses from almost 100 countries around the world. The ShinyItemAnalysis package was featured in the December 2018 issue of The R Journal.

Papers about DIF and detection of rating disparities, published by the presenters:

Presentation Title

Improving student’s performance with Active Learning

Abstract

Is it possible to implement active learning techniques in practice at the university level, so that students really improve their learning performance? The answer is clearly positive, as has long been evidenced by primary and secondary schools, and more recently by engaging initiatives in higher education, mainly in Europe and North America.

The talk will review successful practical implementations of active learning at the classroom level, focusing on the necessary changes that organizations, classrooms and teachers must address in their everyday practice.[1] Supporting evidence will be drawn from the extensive scientific literature now available. This evidence has been organized in a “frequently asked questions” format, which has proven effective when addressing hesitant stakeholders.

The talk will also present the recent conclusions of the Thematic Peer Group on Promoting Active Learning and Teaching in Universities, more specifically from the paper issued by the European University Association in 2019.[2] The main points are: a) implementation of active learning has to be done concurrently at all institutional levels; b) students must have a major role in driving and assessing such implementation; c) teacher career paths must consistently value teaching activity; and d) evidence-based learning and teaching requires an adequate communication strategy, so as to foster the necessary change in mindset.

Short bio

Xavier Giménez Font (Barcelona, 1963) is Professor of Chemistry at the Chemistry Department of the University of Barcelona. He currently teaches Environmental Chemistry and Physical Chemistry of Materials, conducts research on Computational Simulation of Molecular Systems, speaks and writes widely about popular science and, last but not least, is heavily involved in teaching innovation. He completed research stays at the University of Perugia, Italy, CNRS in Paris, and the University of California, Berkeley. He is the author of more than 100 research papers and four books about popular science and the teaching of chemistry. He belongs to the Active Learning and Teaching Thematic Peer Group of the European University Association, and created the synchronous flipped classroom methodology SABER, which is used in several universities as one of their active learning methodologies.

Dr. Aisling Keane

Presentation Title

Abstract

Informed by Rogoff’s Three Planes of Analysis framework and influenced by situated participationism, the Kathleen Tattersall New Researcher Award keynote lecture will explore and expand discussions surrounding formative assessment, offering alternatives to the practices currently dominant in the early years of undergraduate teaching and in educational transitions in general. This is important because a clearly articulated sociocultural perspective provides comprehensive theoretical insight into formative assessment practices in first-year higher education that inadvertently have a negative impact on student enculturation into a new community. This work advocates sociocultural approaches which see pedagogies as transformative for newcomers when they rely on clear frameworks of mutual participation between staff and students in valuable ongoing cultural activities. Such pedagogies facilitate learner involvement in recognising the processes and efforts which contribute to community goals.

Short bio

Upon completing her Ph.D. in Anatomy (National University of Ireland, Galway, Ireland), Aisling joined the Centre for Biomedical Sciences Education at Queen’s University Belfast (QUB), Northern Ireland, in 2005 as a Lecturer (Education). Recognising the importance of educational research in third level education, Aisling undertook a Doctorate in Education at QUB, graduating in 2019. Her educational research is underpinned by sociocultural approaches to exploring the nature of assessment, learning and student transition to third level education, and the scholarship of teaching and learning in Higher Education. Aisling’s work makes an original contribution to the field through the application of a sociocultural framework to explore student experiences of formative assessment in the first year of university and the impact of this on subsequent approaches to assessment and learning, particularly in the second year.

Dr. Kadriye Ercikan

Presentation Title

Using Response Process Data for Informing Group Comparisons

Abstract

Group comparisons, such as those between gender, ethnic or cultural groups, or between low- and high-performing students, are one of the key uses of assessment results. One goal of comparing groups is to gain insights to inform policy and practice; another is to examine the comparability of scores and score meaning across the comparison groups. Such comparisons typically focus on examinees’ final answers and responses to test questions. In this presentation my goal is to discuss and demonstrate the use of response process data in enhancing the methodologies used in comparing groups. Response processes may reveal important information about differences in student engagement that may not be captured by the final responses, and provide insights about differences in response patterns that may be identified by using final responses. I argue for the use of response process data, in addition to final responses to test questions, in comparing groups and in examining measurement comparability. I demonstrate the use of process data in comparing groups in three example cases. In Study 1, I examine response times for English Learners (EL) and Non-EL groups on a mathematics assessment and explore how such differences may inform measurement and measurement comparability. In Study 2, I examine sequences of actions captured in keystroke data for EL and Non-EL students on a writing assessment. In Study 3, I focus on using response times in examining measurement comparability in an international assessment. Across these examples I discuss distinctions between response process differences that may constitute measurement inequivalence and others that reflect group differences in engagement with the test and do not constitute measurement inequivalence.

Short bio

Kadriye Ercikan is the Vice President of Psychometrics, Statistics and Data Sciences at ETS and Professor of Education at the University of British Columbia. She is the current Vice President of the American Educational Research Association (AERA), a member of the AERA Executive Board of Directors and of the ITC Executive Council, and has been a member of the NCME Board of Directors. She is the recipient of the Significant Contributions Award from AERA Division D.

Developing selected response test items

Presenter: Ezekiel Sweiry

The purpose of this workshop is to present and discuss guidance on developing selected response (SR) items. The guidance is based on a synthesis of available research literature on SR item writing, relevant aspects of cognitive psychology (including models of language comprehension and working memory capacity) and the presenter’s own experience of high-stakes test development across primary and secondary education in the UK.

The guidance focuses primarily on ensuring that items are, as far as possible, free from construct-irrelevant variance, which occurs when scores are influenced by factors irrelevant to the construct (Messick, 1984). These factors can make items unintentionally easy (construct-irrelevant easiness) or difficult (construct-irrelevant difficulty).

Guidelines will focus on a range of issues including language accessibility, the central role played by distractors in affecting the difficulty and validity of SR items, unintentional cues that can betray the correct answers, and the assessment of higher-order skills. While most SR guidelines and research are based specifically on conventional multiple choice questions (e.g. Haladyna et al., 2002), this workshop will address the full range of SR item types, including true/false, matching, ordering, multiple correct answer and cloze. Structural differences between these item types can influence their difficulty and proneness to different validity issues.

While most of the workshop will be focused on writing SR items in traditional (paper and pencil) summative contexts, consideration will also be given to the use of SR e-assessment item formats (e.g. drag and drop, drop-down menus) and the use of selected response items in diagnostic assessments.

Finally, the workshop will consider how evidence from item trialling can be used to identify problematic items, in particular through the use of distractor analysis.
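To make the idea of distractor analysis concrete, the following is a minimal Python sketch rather than the presenter's own procedure. It assumes a hypothetical function name and data layout, splits test takers into score groups by total score, and reports the proportion in each group choosing each option of one item.

```python
from collections import Counter


def distractor_table(choices, totals, n_groups=3):
    """Proportion choosing each option within score groups (lowest group first).

    choices  : list of selected options for one item, e.g. "A", "B", "C"
    totals   : list of total test scores for the same test takers
    n_groups : number of roughly equal-sized score groups (assumed default: 3)
    """
    if len(choices) < n_groups:
        raise ValueError("need at least one test taker per score group")

    # Rank test takers from lowest to highest total score.
    order = sorted(range(len(choices)), key=lambda i: totals[i])
    size = len(order) / n_groups
    options = sorted(set(choices))

    table = []
    for g in range(n_groups):
        members = order[round(g * size): round((g + 1) * size)]
        counts = Counter(choices[i] for i in members)
        table.append({opt: counts[opt] / len(members) for opt in options})
    return table


if __name__ == "__main__":
    # Toy data (made up): "A" is the key; "B" attracts lower scorers.
    choices = ["A", "B", "B", "A", "A", "A"]
    totals = [1, 2, 3, 4, 5, 6]
    for row in distractor_table(choices, totals, n_groups=2):
        print(row)
```

In a well-functioning item, the proportion choosing the key typically rises from the lowest to the highest score group, while each distractor attracts fewer test takers as scores increase; a distractor that attracts high scorers, or attracts no one, is a candidate for revision.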

The workshop will incorporate a number of practical activities, and all aspects of the guidance given will be illustrated through the use of example questions. Participants will also have the opportunity to review potential revisions to a variety of sample questions.

The workshop is aimed at anyone with an interest in SR item writing, for both summative and diagnostic assessment purposes, including test developers, educational assessment researchers and teachers. No specific prior knowledge is needed.

Efforts towards offering tests and examinations on a computer started in the early 1980s, using almost exclusively constrained response items. However, almost four decades and much improved onscreen presentation later, the most frequently used onscreen assessment items are still only variations of the original constrained response items, i.e. multiple choice, short (numeric) responses, sequencing or matching response elements. One often-mentioned reason is that marking items with constrained responses is easily automated, enabling immediate feedback and adaptive testing. On the other hand, constrained item types are not necessarily the ideal instruments to assess complex thinking skills. Large improvements in computing and screen technology have enabled the development of assessments that make more effective use of digital opportunities. These include items with interactive responses, computer algebra systems, gaming, simulations, and even augmented reality (e.g. Bressler & Bolton, 2013), and provide opportunities for assessing thinking processes or complex thinking skills. Paradoxically, these assessment types have not been widely adopted.

In the past decade, two taxonomies have been published classifying digital assessments by item characteristics (Scalise, 2009; Parshall, Harmes, Davey and Pashley, 2010). These classification systems are helpful to establish the level of sophistication of current digital assessments, identify gaps in knowledge and point ways forward for item development. However, these models do not seem to support test design, in the sense that they lack guidance linking item types and assessment objectives. Scalise’s taxonomy categorises items on two dimensions, level of constraint and complexity of item format, indicating to some extent that different item types can be used to assess different objectives. In this workshop participants are invited to propose links between item types within Scalise’s taxonomy and a range of assessment objectives organised in a revised Bloom’s taxonomy (Heer, 2012). In this activity, participants will have access to a range of fully functional digital summative assessment items. The connections between item type and assessment objectives proposed by participants will be summarised to create an enhanced version of Scalise’s taxonomy.

Digital assessment is often presented as offering many opportunities for advanced design. However, creating digital assessment items also presents its own challenges, which are not always obvious at first glance. The transition towards computer-based assessment (CBA) often starts with what is seen as a simple or straightforward reformatting of an original paper-based assessment (PBA) to the new format. Using examples from their own examinations or tests (or from existing PBAs provided), participants will experience this process of redesigning and propose a manageable CBA version. In this process they will confront the considerations, both technical and logistical, that often drive and shape the transition process. The groups will then share their choices, the obstacles and the resulting designs, which will encourage deeper understanding of the many considerations feeding into the development of valid, manageable and fair on-screen assessment. Participants will experience the intricacies of digital transition.

The SIG E-Assessment pre-conference workshop will interest anyone currently involved in preparing for the digital migration of PBA and those interested in the development of CBA items for assessing complex thinking skills.

The workshop will focus on some of the findings that have emerged from our research on assessment fairness which has drawn upon important material from a variety of sources, different disciplines and disparate jurisdictions in order to illustrate concepts with concrete examples and case studies.

The workshop will consist of an introductory overview followed by four sessions each separated by group discussion work.

Part One opens with a conceptual preface and distinguishes six different uses of ‘fair’ which have relevance to assessment. We also raise questions about several assumptions which are often made, relating to, for example, “Fairness to whom?” and whether fairness applies to groups rather than individuals. Debate about fairness in assessment can involve a wide range of people, who bring their own expectations, conceptual apparatus and assumptions. We have found the metaphor of “lenses” useful for describing and distinguishing different approaches. We also describe a common structure of questions to apply to each lens, which helps marshal the structure of the workshop.

In Part Two we consider fairness through the lens of educational measurement and assessment. Fairness as viewed through this lens has variants, such as the psychometric paradigm found in authoritative US texts such as the Standards, and the approach to public, award-based qualifications offered by UK awarding bodies, which is grounded in a curriculum-embedded paradigm. First, we explore the history of, and consensus on, fairness through a number of key publications, focusing in particular on The Standards and Educational Measurement, and then critique some aspects of that consensus.

In Parts Three, Four and Five we extend the list to lenses that bring in concepts and assumptions from three other disciplines or traditions:

Legal approaches

Philosophical approaches

Fairness as a contributor to social reform

In Part Three we shall explore international case studies where a mix of statute and common law reflects a range of legal traditions (rights-based, process-based, outcomes-based) and is increasingly influenced by legislation defining prohibited grounds for discrimination.

In Part Four we explore the links between assessment fairness and social justice. It is precisely because assessment has the potential for important, life-changing impact on students’ current and future well-being, shaping their educational experiences in a multitude of ways and informing their future directions and careers, that assessment fairness is a social justice issue.

In Part Five, we view fairness through the lens of philosophical approaches. Philosophers from Aristotle to John Rawls and beyond have linked fairness with concepts of justice, typically seeing “fairness” as a narrower concept, linked to, but not the same as, the wider concept of justice. We suggest that a closer look at philosophers’ treatment of fairness reveals some common ground with the accounts of assessment theorists.

We conclude the workshop by presenting a fairness agenda for the 21st Century.

Each part of the workshop will be punctuated by international case studies which will afford ample opportunity for participants to share, comment and assess in light of their own contexts and experiences.

The proposed structure and content of the workshop brings together a wide range of intellectual disciplines and experiences. From experience with groups of students, teachers and assessment practitioners, there is considerable interest in educational topics which bridge disciplinary divides and explicitly raise wider questions about social justice and public policy.

The workshop is envisaged as a resource for postgraduate students in educational measurement and assessment, for key practitioners in assessment agencies who wish to gain a deeper understanding of the implications for (un)fair assessment, for those with an academic interest in fairness, for teachers and for the novice who should be able to benefit from attending the workshop.

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: American Educational Research Association.

IRT in R made easy

Presenters

Remco Feskens, Dutch National Institute for Educational Assessment, Cito, The Netherlands and Jesse Koops, Dutch National Institute for Educational Assessment, Cito, The Netherlands.

The first session of the workshop starts with a gentle introduction to R. The participants will learn the general principles of R, how to read data into R, how to manipulate data so that it can be used to perform CTT and IRT analyses, and how to produce graphs suitable for reporting to stakeholders. We will use the framework of data science as presented by Grolemund and Wickham in R for Data Science (2017).

The second part of the workshop will focus on the analyses of test data in R by making use of the R package dexter (https://CRAN.R-project.org/package=dexter). Dexter is intended as a robust and fairly comprehensive system for managing and analyzing test data organized in booklets. It includes facilities for importing and managing test data, assessing and improving the quality of data through basic test-and-item analysis, fitting an IRT model, and computing various estimates of ability. During this session we will illustrate the general properties of IRT using illustrative examples. CTT and IRT output will be explained, discussed and interpreted based on exercises and supplementary materials.
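For readers new to IRT, the core idea of relating a response probability to a person's ability and an item's difficulty can be sketched briefly. The Python code below is only an illustration under stated assumptions, not a description of how dexter works internally: it uses the Rasch model, one of the simplest IRT models, treats the item difficulties as already known, and estimates ability by Newton-Raphson maximum likelihood. The function names are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(responses, difficulties, iterations=25):
    """Maximum likelihood ability estimate with known item difficulties.

    responses    : list of 0/1 item scores
    difficulties : list of item difficulties on the same scale as ability
    """
    score = sum(responses)
    if score == 0 or score == len(responses):
        # The likelihood has no finite maximum for zero or perfect scores.
        raise ValueError("MLE is not finite for zero or perfect scores")

    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = score - sum(probs)                   # observed minus expected score
        information = sum(p * (1 - p) for p in probs)   # test information at theta
        theta += gradient / information                 # Newton-Raphson step
    return theta


if __name__ == "__main__":
    # Toy example (made up): three items of increasing difficulty.
    print(estimate_ability([1, 1, 0], [-1.0, 0.0, 1.0]))
```

At the estimate, the expected score under the model equals the observed score, which is why under the Rasch model all test takers with the same total score on the same items receive the same ability estimate.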

In the third session participants will analyze parts of the PISA-2012 test data in R by themselves, with the assistance of the course instructors. PISA is an educational survey which takes place every three years and is commissioned by the OECD. PISA aims to measure the performance of 15-year-old students in subjects like mathematics, reading and science within different countries. More information on the PISA survey can be found at http://www.oecd.org/pisa/. Topics that will be discussed are CTT and the IRT concepts of item estimation, person ability estimation, and equating. During this session participants will experience the main features of IRT through these hands-on exercises.

In the afternoon there will be time to discuss questions from the participants. Alternatively, participants can either work on their own data or do additional exercises using the PISA dataset. In the latter case, exercises will concentrate on differential item functioning in PISA.

Schedule

Time   Session
09:00  Coffee and registration
09:30  Welcome & introductions; Outline of the Workshop
09:45  Intro R
11:00  Break
11:15  IRT in R
13:00  Lunch
14:00  Analysing PISA in R
15:15  Break
15:30  Analysing PISA and/or own data in R
16:30  Workshop close

Prerequisites and preparation for the workshop

Participants might be novices or more experienced users in IRT and/or R. No prior knowledge is required to attend the workshop. Participants are invited to bring their own laptops for practicing. Participants are asked to install R (http://cran.r-project.org) and RStudio (http://www.rstudio.com/) on their own laptop before the course. During the course we will make use of PISA datasets. These are publicly available and can be downloaded from the OECD website: http://www.oecd.org/pisa/data/. It is useful to download the following files beforehand: a SAS control file, which can be found at http://www.oecd.org/pisa/pisaproducts/PISA2012_SAS_scored_cognitive_item.sas, and the scored cognitive data, which can be downloaded from the OECD website as well: https://www.oecd.org/pisa/pisaproducts/INT_COG12_S_DEC03.zip. After downloading, the files need to be unzipped manually. It would also be useful (but not necessary) to have already installed the R packages tidyverse and dexter. Apart from that, no special preparation is required. However, participants who want to be especially well prepared are invited to read the free ebook ‘R for Data Science’ and browse the help and vignettes of the dexter package. Whenever possible, participants are encouraged to bring their own data and analyses for discussion.

Why AEA members should attend this workshop:

The workshop will offer an introduction to IRT and its applications from a practical point of view using R. IRT is used for many measurement applications, including item banking, test construction, scaling, linking and equating, test scoring and score reporting. The main features of these applications will be addressed in the workshop. The course will also provide an introduction to using R and, more specifically, to how R can be used to perform CTT and IRT analyses. Participants will be able to understand and assess the usefulness of IRT using R in their own work.

Who this Workshop is for:

The workshop is aimed at those who want to know more about IRT with a focus on applications in R. Participants will learn the general principles of psychometrics and how to analyze test data in R using CTT and IRT techniques. No specific background knowledge is required.

Presenters’ Bios

Remco Feskens: After obtaining his Ph.D. in Methods and Statistics in 2009, Remco Feskens worked at Utrecht University and Statistics Netherlands. In 2010 he joined Cito’s research department, where he currently works as a senior research scientist. He is a trained research methodologist and provides national and international organizations with consultancy on psychometric and methodological issues. He is also involved in the psychometric analyses for nation-wide test administrations in the Netherlands, England and Kazakhstan, with a focus on sampling, design, calibration and equating. He has (co-)authored several publications in international statistical and methodological journals. Remco Feskens has taught courses in IRT, R and survey methodology in several international summer schools.

Jesse Koops: After receiving his MSc in psychology in 2007, Jesse Koops worked as a researcher for the Dutch Association of Mental Health and Addiction Care and the Free University of Amsterdam. In 2009 Jesse joined Cito, where he currently works as a senior research scientist and computer programmer. He has taught courses on item construction, analysis and programming in R, both nationally and internationally. His focus is on data management, ‘psychometric programming’ and analysis. He is also involved in analyses for nation-wide test administrations in the Netherlands, England and Kazakhstan, with a focus on data management and calibration. In addition, he is one of the co-developers of Dexter.

Presentation Title

Towards a New Generation of External Assessments: Reflections on a Systematic Literature Review

Short CV

Domingos Fernandes is a full and tenured professor of Educational Evaluation and an integrated researcher of the Research and Development Unit in Education and Training at the Institute of Education of the University of Lisbon. He currently serves as coordinator of the Department of Education and Training Policies and of the Master’s and Doctoral programs in Educational Evaluation. His main research and teaching interests are Evaluation Theory, Program Evaluation, Policies Evaluation, and Learning Assessment. He has been a visiting professor at a number of foreign universities, such as Texas A&M University in the USA, University of São Paulo (USP) and State University of São Paulo (UNESP) in Brazil, and University of La Salle in Colombia. Moreover, he has been the principal researcher and coordinator of several funded national and international research and evaluation projects. He is the author of more than one hundred publications (e.g., research journal articles, books, book chapters, monographs, research and evaluation reports).