Impact of COVID-19 Pandemic on Road Traffic Accident Severity in Thailand: An Application of K-Nearest Neighbor Algorithm with Feature Selection Techniques
Keywords:
Classifiers, feature selection, KNN, machine learning, random forest, road safety
Abstract
This study develops road crash severity classifiers from available Thai government data, with a focus on the COVID-19 outbreak period and possible contributing factors. Three machine learning (ML) algorithms were used: logistic regression, random forest, and K-Nearest Neighbor (KNN). To identify the factors affecting accident severity, feature importance was analyzed with three selection techniques: stepwise selection, mean decrease in accuracy (MDA), and mean decrease in impurity (MDI). Combining the three ML algorithms with the three feature selection techniques yielded nine predictive models, which were evaluated on accuracy, precision, recall, and F1-score. The results indicated that KNN with feature selection outperformed the other candidate models: KNN-MDA performed best for the pre-pandemic period and KNN-MDI for the pandemic period. Among the eight features, vehicle type was the most important factor contributing to a higher number of fatal accidents, followed by region, crash type, weather, and time of incident. In particular, motorcycle riders and pedestrians are especially susceptible. These findings can help practitioners formulate effective management policies to enhance road safety.
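As a rough illustration of the evaluation described above, the sketch below classifies a few records with a minimal KNN and then computes accuracy, precision, recall, and F1-score. Everything in it is an assumption made for illustration: the toy records, the two features, the binary "fatal" versus "non-fatal" labels, the Hamming distance for categorical features, and the function names `knn_predict` and `binary_metrics`. It is not the study's data, feature set, or implementation.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training rows.

    Distance is the Hamming distance (number of differing categorical
    features), an assumption chosen for this toy example.
    """
    distances = sorted(
        (sum(a != b for a, b in zip(row, query)), i)
        for i, row in enumerate(train_X)
    )
    votes = Counter(train_y[i] for _, i in distances[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive="fatal"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical records: (vehicle type, weather). Not taken from the study.
train_X = [
    ("motorcycle", "clear"),
    ("motorcycle", "rain"),
    ("pedestrian", "clear"),
    ("car", "clear"),
    ("car", "rain"),
    ("truck", "clear"),
]
train_y = ["fatal", "fatal", "fatal", "non-fatal", "non-fatal", "non-fatal"]

test_X = [
    ("motorcycle", "clear"),
    ("car", "rain"),
    ("pedestrian", "rain"),
    ("truck", "rain"),
]
test_y = ["fatal", "non-fatal", "fatal", "non-fatal"]

predictions = [knn_predict(train_X, train_y, row, k=3) for row in test_X]
scores = binary_metrics(test_y, predictions)
print(predictions)
print(scores)
```

In practice, a comparison like the one in the abstract would fit each classifier on the feature subset chosen by each selection technique and then compare these four scores across the resulting models.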
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.