Enhancing Random Oversampling for Imbalanced Classification
Keywords:
imbalanced data, random oversampling, Hotelling's T-squared, classification
Abstract
Imbalanced data is a common challenge in classification because classifiers tend to assign new sample points to the majority class, which degrades prediction performance for the minority class. Random oversampling is a simple technique for addressing class imbalance, but some of the data points it resamples may contribute little to classification. This research introduces the Hotelling Important Data Point Oversampling Algorithm (HIDPO), an improved version of random oversampling. The study compared classification performance on the original data, on data balanced with random oversampling, and on data balanced with the proposed HIDPO method across 96 simulated scenarios. The scenarios varied in four parameters: 1) imbalance rate (IR), 2) the number of relevant predictor variables (RelVar), 3) the difference in predictor means between the minority and majority groups (ClassDif), and 4) sample size (n). Logistic regression models were used as the classifier. The results showed that HIDPO yielded the highest F-measure in scenarios where the minority and majority classes differ only slightly, particularly under severe imbalance, which are the hardest cases to classify. For the true positive rate and the true negative rate, HIDPO yielded moderate values.
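As a rough illustration of the baseline and the evaluation measures named in the abstract, the following Python sketch shows plain random oversampling together with the true positive rate, true negative rate, and F-measure. It is not the HIDPO algorithm, whose selection step is not described here. The function names `random_oversample` and `classification_metrics`, the choice of label 1 for the minority class, and the decision to oversample until the two classes are equal in size are assumptions made for this example.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumption: binary labels, with `minority_label` marking the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == minority_label]
    n_extra = (len(labels) - len(minority)) - len(minority)
    if not minority or n_extra <= 0:
        # Nothing to resample: no minority rows, or the data is already balanced.
        return list(rows), list(labels)
    # Sample minority indices with replacement.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_rows, new_labels


def classification_metrics(y_true, y_pred, positive=1):
    """Return true positive rate, true negative rate, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tpr
    f_measure = 2 * precision * tpr / denom if denom else 0.0
    return {"tpr": tpr, "tnr": tnr, "f_measure": f_measure}


if __name__ == "__main__":
    # Toy data: 2 minority rows (label 1) and 8 majority rows (label 0).
    rows = [[float(i), float(i) * 2.0] for i in range(10)]
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    new_rows, new_labels = random_oversample(rows, labels)
    print(len(new_rows), new_labels.count(1), new_labels.count(0))
    print(classification_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

In practice the resampling would be applied to the training split only, so that duplicated minority rows do not leak into the data used to compute the metrics.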
License
Copyright (c) 2024 Huachiew Chalermprakiet Science and Technology Journal

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All published articles are the copyright of the Faculty of Science and Technology, Huachiew Chalermprakiet University.