A Comparative Study of Imputation Techniques for Handling Multivariate Missing Completely at Random in Numeric Datasets

Ratchaneewan Paisanwarakiat; Anamai Na-udom; Jaratsri Rungrattanaubol

Authors

Ratchaneewan Paisanwarakiat, Department of Mathematics, Faculty of Science, Naresuan University, Phitsanulok, Thailand
Anamai Na-udom, Department of Mathematics, Faculty of Science, Naresuan University, Phitsanulok, Thailand
Jaratsri Rungrattanaubol, Department of Computer Science and Information Technology, Faculty of Science, Naresuan University, Phitsanulok, Thailand

Keywords:

Linear regression imputation, predictive mean matching, expectation-maximization imputation, K-nearest neighbors imputation, random forest imputation, normalized root mean squared error

Abstract

Effective handling of missing data is essential in Big Data analytics, as missing values, particularly those occurring randomly across input variables, can significantly affect the reliability and accuracy of results. This research compares and evaluates the effectiveness of eleven imputation techniques: Mean Imputation (MI), Median Imputation (MEI), Deterministic Linear Regression (DLR), Stochastic Linear Regression (SLR), Bayesian Linear Regression (BLR), Bootstrap Linear Regression (BTLR), Predictive Mean Matching (PMM), Expectation-Maximization (EM), K-Nearest Neighbors with Median (KNNM), K-Nearest Neighbors with Weighted Average (KNNW), and Random Forest (RF). The techniques were assessed on nine numeric datasets of varying sizes: three small, three medium, and three large. Multivariate missing data were simulated using the Missing Completely at Random (MCAR) mechanism, with missing rates ranging from 10% to 50%. The effectiveness of the imputation techniques was evaluated using the normalized root mean squared error (NRMSE), and the consistency of their performance was tested with Kendall's W test. The results indicated that MI outperformed MEI. Among the linear regression techniques, DLR outperformed SLR, BLR, and BTLR. Additionally, KNNW performed better than KNNM. Across all datasets, RF, KNNW, and EM were the top performers. As recommendations, EM is most suitable for small datasets, KNNW or EM is effective for medium datasets, and RF performs best on large datasets. However, both RF and KNNW demand considerably longer processing times, particularly with large datasets. These insights provide practical guidance for selecting the most appropriate imputation method based on the characteristics of the dataset.
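The evaluation loop described in the abstract (mask values under MCAR, impute, score with NRMSE) can be illustrated with a minimal Python sketch. It is not the authors' code: the synthetic data, the 30% missing rate, the helper names `mask_mcar`, `impute_column_stat`, and `nrmse`, and the choice to normalize errors by each column's standard deviation are all assumptions made for illustration, since the abstract does not state which NRMSE normalization was used. Only MI and MEI are shown because they need nothing beyond NumPy.

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """Hide about `rate` of all entries, chosen independently of the data (MCAR)."""
    mask = rng.random(X.shape) < rate      # True marks an entry to be removed
    X_miss = X.astype(float).copy()
    X_miss[mask] = np.nan
    return X_miss, mask

def impute_column_stat(X_miss, stat=np.nanmean):
    """Fill each missing entry with a per-column statistic of the observed values."""
    fill = stat(X_miss, axis=0)            # one value per column, NaNs ignored
    return np.where(np.isnan(X_miss), fill, X_miss)

def nrmse(X_true, X_imp, mask):
    """RMSE over the masked entries, with errors scaled by each column's
    standard deviation (an assumed normalization, see the text above)."""
    scaled_err = (X_true - X_imp) / X_true.std(axis=0)
    return float(np.sqrt(np.mean(scaled_err[mask] ** 2)))

# Synthetic stand-in for a numeric dataset (hypothetical, not from the study).
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 5.0, 10.0], scale=[1.0, 2.0, 3.0], size=(500, 3))

X_miss, mask = mask_mcar(X, rate=0.3, rng=rng)
X_mean = impute_column_stat(X_miss, np.nanmean)      # MI
X_median = impute_column_stat(X_miss, np.nanmedian)  # MEI

print("NRMSE, mean imputation:  ", nrmse(X, X_mean, mask))
print("NRMSE, median imputation:", nrmse(X, X_median, mask))
```

Because the score is computed only on the entries that were deliberately hidden, the same `mask` can be reused to compare any other imputer on equal terms.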

