9 Step Two: Missing Data Mechanism (the probability distribution of missingness)
- Consider the probability of missingness.
- Are certain groups more likely to have missing values?
  Example: Respondents in service occupations are less likely to report income.
- Are certain responses more likely to be missing?
  Example: Respondents with high income are less likely to report income.
- Certain analysis methods assume a particular probability distribution of missingness.

10 Missing Data Mechanisms
- Missing Completely at Random (MCAR): the probability that y is missing depends on neither x nor y.
  Example: Some survey questions are asked of a simple random subsample of the original sample.
- Missing at Random (MAR): the probability that y is missing depends on x, but not on y itself once x is taken into account.
  Example: Respondents in service occupations are less likely to report income.
- Missing Not at Random (NMAR, also written MNAR): the probability that y is missing depends on the value of y itself.
  Example: Respondents with high income are less likely to report income.
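The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the income levels, the share of service workers, and the missingness probabilities are made-up assumptions, not figures from the slides. It compares the mean of the observed incomes with the mean of all incomes under each mechanism.

```python
import random
import statistics


def simulate_observed_mean(mechanism, n=20000, seed=0):
    """Return (mean of all incomes, mean of observed incomes).

    x = 1 for a service occupation, 0 otherwise; y = income in $1000s.
    Every number below is a hypothetical value chosen for illustration.
    """
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0
        y = rng.gauss(40 if x else 60, 10)
        if mechanism == "MCAR":
            p_missing = 0.3                      # same for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if x else 0.1        # depends on x only
        elif mechanism == "NMAR":
            p_missing = 0.6 if y > 60 else 0.1   # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:            # y is reported
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for mech in ("MCAR", "MAR", "NMAR"):
    full, observed = simulate_observed_mean(mech)
    print(f"{mech}: all = {full:.1f}, observed = {observed:.1f}")
```

With these assumed settings, the observed mean should stay close to the full mean under MCAR, drift upward under MAR (low-income service workers drop out), and drift downward under NMAR (high earners drop out). Under MAR the bias can be removed by conditioning on x; under NMAR it cannot be removed using x alone.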

12 Good News!!
- Some analysis methods that assume MAR still perform reasonably well with NMAR data.
- There may be another measured variable that indirectly predicts the probability of missingness.
  Example: Those with higher incomes are less likely to report income, BUT we have a variable for years of education and/or number of investments.
- ML and MI often show little bias with NMAR data, even though they assume the data are MAR.
- See Schafer & Graham (2002).

17 Pairwise Deletion (Available Case Analysis)
- Each analysis uses all cases in which the variables of interest are present.
- Advantages:
  - Keeps as many cases as possible for each analysis
  - Uses all information possible with each analysis
- Disadvantage:
  - Can't compare analyses, because the sample is different each time
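A minimal sketch of why the sample differs from one analysis to the next under pairwise deletion. The five records and the variable names (`income`, `educ`, `age`) are hypothetical; `None` marks a missing value.

```python
def pairwise_n(rows, variables):
    """For each pair of variables, count the cases where both are observed."""
    counts = {}
    for i, a in enumerate(variables):
        for b in variables[i + 1:]:
            counts[(a, b)] = sum(
                1 for r in rows if r[a] is not None and r[b] is not None
            )
    return counts


def listwise_n(rows, variables):
    """Count the cases with no missing value on any of the variables."""
    return sum(1 for r in rows if all(r[v] is not None for v in variables))


# Hypothetical data for illustration only.
rows = [
    {"income": 52, "educ": 16, "age": 41},
    {"income": None, "educ": 12, "age": 35},
    {"income": 38, "educ": None, "age": 29},
    {"income": 61, "educ": 18, "age": None},
    {"income": None, "educ": 14, "age": 50},
]
variables = ["income", "educ", "age"]

print(pairwise_n(rows, variables))
print(listwise_n(rows, variables))
```

Here the income/educ and income/age pairs each use 2 cases, the educ/age pair uses 3, and only 1 case is complete on all three variables. Each pairwise statistic therefore rests on a different subsample, which is what makes the analyses hard to compare.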

21 Dummy Variable Adjustment
- Create an indicator for the missing value (1 = value is missing for the observation; 0 = value is observed).
- Set missing values to a constant (such as the mean).
- Include the missing indicator in the regression.
- Advantage:
  - Uses all available information about the missing observation
- Disadvantages:
  - Results in biased estimates
  - Not theoretically driven
- NOTE: Results are not biased if the value is missing because of a legitimate skip.
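The data-preparation step for dummy variable adjustment can be sketched as follows. The function name and the example values are hypothetical, and the mean of the observed values is used as the fill constant (one of the options mentioned above). Both returned columns would then enter the regression as predictors.

```python
import statistics


def dummy_variable_adjust(values):
    """Return (filled_values, missing_indicator) for one predictor.

    Missing entries (None) are replaced by the mean of the observed values,
    and the indicator is 1 where the original value was missing.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical predictor with two missing entries.
educ = [10, None, 30, None, 20]
educ_filled, educ_missing = dummy_variable_adjust(educ)
print(educ_filled)    # missing entries replaced by 20.0
print(educ_missing)   # [0, 1, 0, 1, 0]
```

The sketch only prepares the columns; it does not remove the bias noted on the slide.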

25 Model-based Methods: Maximum Likelihood Estimation
- Identifies the set of parameter values that produces the highest log-likelihood.
- ML estimate: the parameter value under which the observed data are most likely.
- Conceptually, the process is the same with or without missing data.
- Advantages:
  - Uses full information (both complete and incomplete cases) to calculate the log-likelihood
  - Unbiased parameter estimates with MCAR/MAR data
- Disadvantage:
  - SEs are biased downward; this can be adjusted by using the observed information matrix
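To illustrate how ML uses incomplete cases, here is a sketch for the simplest setting with a closed-form answer: x is fully observed, y is missing for some cases, (x, y) are assumed bivariate normal, and missingness in y depends only on x (MAR). In that setting the ML estimate of the mean of y is the complete-case mean of y, shifted by the complete-case regression slope times the gap between the mean of x over all cases and over complete cases. The simulated data and all parameter values are assumptions chosen for illustration; general-purpose software maximizes the likelihood numerically rather than using this shortcut.

```python
import random
import statistics


def ml_mean_y(x, y):
    """ML estimate of mean(y) when x is complete and y has None for missing.

    Assumes (x, y) bivariate normal and y missing at random given x.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xbar_obs = statistics.fmean(p[0] for p in pairs)
    ybar_obs = statistics.fmean(p[1] for p in pairs)
    # Regression slope of y on x, estimated from the complete cases.
    sxy = sum((a - xbar_obs) * (b - ybar_obs) for a, b in pairs)
    sxx = sum((a - xbar_obs) ** 2 for a, _ in pairs)
    slope = sxy / sxx
    # Incomplete cases contribute through the mean of x over ALL cases.
    return ybar_obs + slope * (statistics.fmean(x) - xbar_obs)


def simulate(n=20000, seed=1):
    """Hypothetical data: x = years of education, y = income in $1000s."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        xi = rng.gauss(12, 3)
        yi = 10 + 3 * xi + rng.gauss(0, 8)       # true mean of y is 46
        p_missing = 0.7 if xi > 12 else 0.1      # depends on x only (MAR)
        x.append(xi)
        y.append(None if rng.random() < p_missing else yi)
    return x, y


x, y = simulate()
complete_case_mean = statistics.fmean([v for v in y if v is not None])
ml_estimate = ml_mean_y(x, y)
print(f"complete-case mean: {complete_case_mean:.1f}")
print(f"ML estimate:        {ml_estimate:.1f}")
```

With these assumed settings the complete-case mean should fall well below the true mean of 46, while the ML estimate should land close to it, because the x values of the incomplete cases still inform the estimate.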

26 Multiple Imputation
1. Impute: Missing data are filled in with imputed values using a specified regression model. This step is repeated m times, resulting in a separate dataset each time.
2. Analyze: Analyses are performed within each dataset.
3. Pool: Results are pooled into one estimate.
- Advantages:
  - Variability is more accurate with multiple imputations for each missing value
  - Considers variability due to sampling AND variability due to imputation
- Disadvantages:
  - Cumbersome coding
  - Room for error when specifying models
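The pooling step can be sketched with the standard combining rules (Rubin's rules). The slides do not say which formula is used, so treating these rules as the pooling method is an assumption. The pooled estimate is the average of the m estimates, and the total variance adds the average within-imputation variance (sampling) to the between-imputation variance (imputation), inflated by 1 + 1/m. The input numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # variability due to sampling
    between = statistics.variance(estimates)    # variability due to imputation
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficient and squared SE from each of m = 5 imputed datasets.
coefs = [2.0, 2.2, 1.8, 2.1, 1.9]
squared_ses = [0.04, 0.04, 0.04, 0.04, 0.04]

estimate, se = pool_rubin(coefs, squared_ses)
print(f"pooled estimate = {estimate:.2f}, pooled SE = {se:.3f}")
```

In this example each dataset alone reports an SE of 0.20, while the pooled SE is about 0.265. The difference is the extra uncertainty due to imputation that a single filled-in dataset would hide.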

29 ice & mim
- ice: imputation using chained equations
  - A series of equations, each predicting one variable at a time
  - Creates as many datasets as desired
- mim: prefix used before an analysis command; it performs the analysis across the datasets and pools the estimates

43 Notes and Help with mi in Stata
- LOTS of options
- Can specify exactly how you want each variable imputed
- Can specify the model appropriately (e.g., using the svy command)
- mi impute mvn (multivariate normal regression) is also useful
- help mi is useful
- Also, UCLA has a great website about ice and mi

44 General Tips
- Try a few methods: if they result in similar estimates, you can note this in a footnote to support your chosen method.
- Some analysts don't impute the dependent variable, but would still use it to impute the independent variables.
