The Comparative Efficacy of Imputation Methods for Missing Data in Structural Equation Modeling
Missing data, Imputation, Multivariate statistics, Regression, Structural equation modeling
Missing data is a problem that permeates much of the research being done today. Traditional techniques for replacing missing values may have serious limitations, and recent developments in computing allow more sophisticated techniques to be used. This paper compares the efficacy of five current and promising methods for dealing with missing data. Efficacy is judged by the percent bias in the resulting parameter estimates. The focus is on structural equation modeling (SEM), a popular statistical technique that subsumes many traditional statistical procedures. To make the comparison, the paper examines a full structural equation model generated by simulation in accord with previous research. The five techniques compared are expectation maximization (EM), full information maximum likelihood (FIML), mean substitution (Mean), multiple imputation (MI), and regression imputation (Regression). All of these techniques except FIML impute the missing values and produce a complete dataset that researchers can use for further analysis. FIML, by contrast, imputes nothing but still estimates the parameters of the model from the incomplete data. The study involves two sample sizes (100 and 500) and seven levels of incomplete data (2%, 4%, 8%, 12%, 16%, 24%, and 32% missing completely at random). After extensive bootstrapping and simulation, the results indicate that FIML is the superior method for estimating most types of parameters in an SEM framework. MI, in turn, is found to be superior in the estimation of standard errors. MI is also an excellent estimator overall, except for datasets with more than 24% missing information.
Because FIML is a direct method that does not actually impute the missing data, whereas MI does impute it and so yields a complete dataset for the researcher to analyze, we conclude that MI, given its theoretical and distributional underpinnings, is probably the most promising method for future applications in this field.
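The percent-bias criterion and the missing-completely-at-random (MCAR) design described above can be illustrated with a minimal sketch. It is not the simulation used in the study: no structural equation model is fitted, a single variable's variance stands in for a model parameter, the complete-data estimate is treated as the "true" value, and only mean substitution is shown. All function names, the seed, and the distribution parameters are assumptions made for illustration.

```python
import random
import statistics

# The seven MCAR missingness levels named in the abstract.
MISSING_RATES = [0.02, 0.04, 0.08, 0.12, 0.16, 0.24, 0.32]


def percent_bias(estimate, true_value):
    """Percent bias of an estimate relative to a reference value."""
    return 100.0 * (estimate - true_value) / true_value


def make_mcar(values, rate, rng):
    """Delete each value independently with probability `rate` (MCAR)."""
    return [None if rng.random() < rate else v for v in values]


def mean_substitute(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    observed_mean = statistics.mean(observed)
    return [observed_mean if v is None else v for v in values]


def simulate(n=500, seed=0):
    """Return {missing rate: percent bias of the variance after mean substitution}.

    Assumption: the complete-data sample variance is used as the reference
    value, standing in for a true model parameter.
    """
    rng = random.Random(seed)
    complete = [rng.gauss(10.0, 2.0) for _ in range(n)]
    reference_variance = statistics.variance(complete)

    results = {}
    for rate in MISSING_RATES:
        incomplete = make_mcar(complete, rate, rng)
        imputed = mean_substitute(incomplete)
        results[rate] = percent_bias(statistics.variance(imputed), reference_variance)
    return results


if __name__ == "__main__":
    for rate, bias in simulate().items():
        print(f"{rate:.0%} missing: variance bias {bias:+.1f}%")
```

Every imputed value sits exactly at the observed mean, so mean substitution can only shrink the variance. The bias in this sketch is therefore never positive and becomes more negative as the missing rate rises, which is one of the limitations of traditional replacement techniques that the abstract alludes to.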