Detecting Automobile Insurance Fraud: A Novel Two-Step Strategy Using Effective Ensemble Learning Techniques

Wikanda Phaphan; Samach Sathitvudh; Tikumporn Suntornsuwan; Kamon Budsaba; Teerawat Simmachan

Authors

Wikanda Phaphan, Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
Samach Sathitvudh, Department of Statistics, School of Computer, Data and Information Sciences, University of Wisconsin-Madison, Madison, USA
Tikumporn Suntornsuwan, Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
Kamon Budsaba, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
Teerawat Simmachan, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand

Keywords:

Corruption, ensembles, fraudulent claims, machine learning, security threats

Abstract

Like companies in other industries, insurance companies processed large volumes of data during the industrial revolution. The industry’s major concern is the increasing number of fraudulent claims. These claims not only cause financial losses but also harm the entire industry, honest policyholders, and society. Machine learning (ML) approaches have recently been used in insurance fraud detection to reduce such losses. To improve detection further, this article introduces a novel prediction framework for fraudulent claims called the two-step models. An anonymous US auto insurance dataset was used to demonstrate and evaluate the framework. Under-sampling and the synthetic minority over-sampling technique (SMOTE) were used to balance the data. Mutual information was employed as a feature selection tool. Five proposed models were built in two steps. In the first step, eight basic ML models were implemented, and the three most effective models were chosen based on their F-measure scores. In the second step, the predicted values of these three models were used as components to construct the two-step models using ensemble techniques. Statistical tests were used to compare all models. Numerical results indicated that the proposed models yielded significant improvements. Moreover, the most effective model is a combination of SMOTE and an improved multilayer perceptron (IMLP). This research could help insurance firms improve their fraud detection systems to prevent insurance abuse.
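
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which ensemble technique combines the predictions, so majority voting is assumed here. The model names, the labels, and the use of four base models instead of eight are hypothetical toy choices, and resampling and feature selection are omitted.

```python
from collections import Counter


def f_measure(y_true, y_pred, positive=1):
    """F-measure (F1) for the positive class, here the fraud label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_k_models(y_true, predictions, k=3):
    """Step 1: rank base models by F-measure and keep the k best names."""
    ranked = sorted(
        predictions,
        key=lambda name: f_measure(y_true, predictions[name]),
        reverse=True,
    )
    return ranked[:k]


def majority_vote(predictions, selected):
    """Step 2 (assumed technique): majority vote over the selected models."""
    columns = zip(*(predictions[name] for name in selected))
    return [Counter(votes).most_common(1)[0][0] for votes in columns]


# Hypothetical validation labels (1 = fraudulent claim) and base-model outputs.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],
    "model_c": [0, 0, 1, 1, 0, 1, 1, 1],
    "model_d": [0, 1, 0, 0, 1, 0, 1, 0],
}

selected = top_k_models(y_true, predictions, k=3)
ensemble = majority_vote(predictions, selected)

for name in selected:
    print(name, round(f_measure(y_true, predictions[name]), 3))
print("ensemble", round(f_measure(y_true, ensemble), 3))
```

In this toy data the three selected models make errors on different claims, so the vote corrects each of them. Real base models usually have correlated errors, and any gain would need to be checked on held-out data.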
