Performance Evaluation of Imputation Methods for Missing Data in Logistic Regression Model: Simulation and Application
Keywords: Expectation maximization, imputation by random forests, K-nearest neighbor, multivariate imputation by chained equations, predictive mean matching
Missing data is a problem most analysts encounter. Even when a dataset contains a large number of observations, many of the variables of interest will have missing values. The most common way of handling such observations is to exclude them from the analysis. This approach is not ideal for multiple reasons. First, unless the data are missing completely at random, excluding observations with missing values biases the results of the analysis. Second, it reduces the size of the dataset available for analysis. In this paper, we discuss some commonly used imputation methods, such as expectation-maximization (EM), multiple imputation by chained equations, and K-nearest neighbor. We also propose a new imputation method (EPK). A Monte Carlo simulation study is conducted to examine the efficiency of nine imputation methods in the binary logistic regression model when the missingness mechanism is missing at random. We also apply these methods to a real dataset on social network advertising as an empirical study. The results of the simulation and empirical studies indicate that the EPK and EM methods are more efficient than the other imputation methods: they have the smallest values of the Akaike information criterion (AIC) and Bayesian information criterion (BIC), whether the missing data are in the independent variables only, the dependent variable only, or both.
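The kind of comparison summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' simulation design: the sample size, the coefficients, the missingness model, and the helper names (`solve`, `fit_logit`, `aic_bic`) are all assumptions made for illustration, and mean imputation is used only as a simple stand-in, not as the EPK or EM method. The sketch generates binary logistic data, deletes values of one predictor under a missing-at-random mechanism, and computes AIC and BIC for the model fitted to the complete and the imputed data.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_logit(X, y, iters=25):
    """Maximum-likelihood logistic fit by Newton-Raphson; returns (beta, loglik)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * z - math.log1p(math.exp(z))
    return beta, loglik

def aic_bic(loglik, k, n):
    """AIC = 2k - 2 log L, BIC = k log(n) - 2 log L."""
    return 2 * k - 2 * loglik, k * math.log(n) - 2 * loglik

# Assumed data-generating model: logit P(y = 1) = -0.5 + x1 - x2
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < sigmoid(-0.5 + a - b) else 0 for a, b in zip(x1, x2)]

# MAR: the chance that x1 is missing depends only on the fully observed x2
missing = [random.random() < sigmoid(-1.0 + b) for b in x2]
n_missing = sum(missing)

# Stand-in imputation: replace missing x1 values with the observed mean
observed = [a for a, m in zip(x1, missing) if not m]
x1_mean = sum(observed) / len(observed)
x1_imp = [x1_mean if m else a for a, m in zip(x1, missing)]

X_full = [[1.0, a, b] for a, b in zip(x1, x2)]     # benchmark, known only in simulation
X_imp = [[1.0, a, b] for a, b in zip(x1_imp, x2)]  # after imputation

beta_full, ll_full = fit_logit(X_full, y)
beta_imp, ll_imp = fit_logit(X_imp, y)
aic_full, bic_full = aic_bic(ll_full, 3, n)
aic_imp, bic_imp = aic_bic(ll_imp, 3, n)

print(f"missing x1 values: {n_missing} of {n}")
print(f"complete data: AIC={aic_full:.1f}  BIC={bic_full:.1f}")
print(f"mean-imputed:  AIC={aic_imp:.1f}  BIC={bic_imp:.1f}")
```

Both fits use all n rows, so their AIC and BIC values are on the same footing; a listwise-deletion fit would use fewer rows and its criteria would not be directly comparable. A full Monte Carlo study would repeat these steps over many replications and average the criteria for each imputation method.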
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.