Performance Evaluation of Imputation Methods for Missing Data in Logistic Regression Model: Simulation and Application
Keywords: Expectation maximization, imputation by random forests, K-nearest neighbor, multivariate imputation by chained equations, predictive mean matching
Abstract
Missing data is a common phenomenon that most analysts have experienced. Even if a dataset includes a large number of data points, many of the variables of interest will have missing values. The most prevalent way of dealing with such data points is to leave them out of the analysis. This approach is not ideal for two main reasons. First, unless the data are missing completely at random, leaving out data points with missing values will bias the results of the analysis. Second, it reduces the size of the dataset used for analysis. In this paper, we discuss some commonly used imputation methods, such as Expectation-Maximization (EM), multiple imputation by chained equations, and K-nearest neighbor. Furthermore, we propose a new imputation method (EPK). A Monte Carlo simulation study is conducted to examine the efficiency of nine imputation methods in the binary logistic regression model when the missingness mechanism is missing at random. Moreover, we use a real dataset on social network advertising as an empirical study to examine these methods. The results of our simulation and empirical studies indicate that the EPK and EM methods are more efficient than the other imputation methods: EPK and EM have the smallest values of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), whether the missing data are in the independent variables only, in the dependent variable only, or in both.
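
The evaluation pipeline summarized above (simulate data, delete values under a missing-at-random mechanism, impute, fit a binary logistic regression, and compare AIC and BIC) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the coefficients, the sample size, and the missingness model are hypothetical, and simple mean imputation is used as a stand-in; it is not the EPK method or any of the nine methods compared in the paper. The model is fitted by plain gradient ascent, so the reported AIC and BIC are based on an approximately maximized log-likelihood.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_mar(n, rng):
    """Draw (x1, x2, y) from an assumed logistic model, then delete x2 under MAR."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Hypothetical true coefficients: intercept -0.5, slopes 1.0 and 0.8.
        y = 1 if rng.random() < sigmoid(-0.5 + x1 + 0.8 * x2) else 0
        # MAR: the chance that x2 is missing depends only on the observed x1.
        if rng.random() < sigmoid(x1 - 1.0):
            x2 = None
        rows.append((x1, x2, y))
    return rows


def mean_impute(rows):
    """Replace missing x2 by the observed mean (stand-in imputation method)."""
    observed = [x2 for _, x2, _ in rows if x2 is not None]
    mean_x2 = sum(observed) / len(observed)
    return [(x1, mean_x2 if x2 is None else x2, y) for x1, x2, y in rows]


def fit_logistic(X, y, steps=500, lr=0.5):
    """Approximate the MLE by gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            resid = yi - sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(p):
                grad[j] += resid * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood of the fitted model."""
    total = 0.0
    for xi, yi in zip(X, y):
        prob = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += yi * math.log(prob) + (1 - yi) * math.log(1.0 - prob)
    return total


def aic_bic(loglik, k, n):
    """AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L."""
    return 2 * k - 2 * loglik, k * math.log(n) - 2 * loglik


def run_demo(seed=1, n=300):
    """One replication: simulate, impute, fit, and score."""
    rng = random.Random(seed)
    rows = simulate_mar(n, rng)
    n_missing = sum(1 for _, x2, _ in rows if x2 is None)
    completed = mean_impute(rows)
    X = [(1.0, x1, x2) for x1, x2, _ in completed]  # leading 1.0 is the intercept
    y = [yi for _, _, yi in completed]
    beta = fit_logistic(X, y)
    aic, bic = aic_bic(log_likelihood(X, y, beta), k=len(beta), n=n)
    return {"n_missing": n_missing, "beta": beta, "aic": aic, "bic": bic}


if __name__ == "__main__":
    print(run_demo())
```

In a Monte Carlo comparison, `run_demo` would be repeated over many seeds and once per imputation method, and the AIC and BIC values averaged across replications; lower averages indicate a better-fitting imputed dataset under these criteria.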
