Detecting Fraudulent Claims in Automobile Insurance Policies by Data Mining Techniques

Authors

  • Teerawat Simmachan, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Weerapong Manopa, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Pailin Neamhom, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Achiraya Poothong, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Wikanda Phaphan, Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand

Keywords:

Naïve Bayes, random forest, adaptive boosting, logistic regression, variable selection

Abstract

The insurance industry is fast-growing and handles substantial amounts of data. Fraudulent claims are the main problem in the industry, and auto insurance fraud is one of the most prominent types of insurance fraud. Numerous fraudulent claims affect not only the insurance company but also honest policyholders, because they lead to higher premiums. Typically, fraud report data are unbalanced. Overlooking this generally leads to classifiers that predict the minority class (fraudulent claims) poorly, which makes fraud detection a challenging problem. Traditional approaches are difficult to apply and inefficient, whereas data mining has recently made significant contributions to insurance analysis. Data mining techniques are therefore used to predict fraudulent claims. The aims of this research are, first, to determine which features should be used to build the predictive model and, second, to develop a statistical learning strategy to classify whether a fraud report is fraudulent or not. To discover important sets of features, logistic regression (a parametric method) and random forest (a non-parametric method) are used as variable selection algorithms. The selection is carried out with cross-validation to reduce uncertainty, yielding two sets of important features. Four algorithms, namely logistic regression, random forest, Naïve Bayes, and adaptive boosting, are employed as classifiers. A confusion matrix is used to evaluate each algorithm’s performance. The results suggest that the set of important features obtained from the non-parametric method provides better performance than the set obtained from the parametric method. Random forest is considered the best algorithm for identifying fraudulent claims, with the highest sensitivity (99.19%) and positive predictive value (93.62%). This work could help in the screening process for investigating claims, thus reducing the human resources required and the monetary losses in the insurance industry.
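
The two evaluation measures named in the abstract, sensitivity and positive predictive value, are both read off a confusion matrix. The following is a minimal sketch of that calculation, not the authors' code: the function names and the ten toy labels are hypothetical and chosen only for illustration, with 1 standing for a fraudulent claim and 0 for a legitimate one.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the fraud label."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def sensitivity(tp, fn):
    """Share of truly fraudulent claims that were flagged."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def positive_predictive_value(tp, fp):
    """Share of flagged claims that were truly fraudulent."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


def accuracy(tp, fp, fn, tn):
    """Share of all claims classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical toy data (not from the paper): 3 fraudulent, 7 legitimate claims.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("PPV:", positive_predictive_value(tp, fp))

# A trivial classifier that never flags fraud, for comparison.
always_legit = [0] * len(y_true)
b_tp, b_fp, b_fn, b_tn = confusion_counts(y_true, always_legit)
print("baseline accuracy:", accuracy(b_tp, b_fp, b_fn, b_tn))
print("baseline sensitivity:", sensitivity(b_tp, b_fn))
```

On unbalanced data the never-flag baseline still reaches a fairly high accuracy while detecting no fraud at all, which is one reason to report measures that focus on the minority class, such as sensitivity and positive predictive value.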

Published

2023-06-28

How to Cite

Simmachan, T., Manopa, W., Neamhom, P., Poothong, A., & Phaphan, W. (2023). Detecting Fraudulent Claims in Automobile Insurance Policies by Data Mining Techniques. Thailand Statistician, 21(3), 552–568. Retrieved from https://ph02.tci-thaijo.org/index.php/thaistat/article/view/250065

Section

Articles