A Hybrid Data Level Approach for Improving Classification Performance in Imbalanced Dataset

วันทนี ประจวบศุภกิจ

Abstract

The class imbalance problem occurs when the number of instances in one class greatly exceeds the number in another. Classifying imbalanced data is difficult because traditional classifiers tend to predict the majority class well while predicting the minority class poorly. The aim of this research is therefore to propose a hybrid data-level approach that improves classification performance on two-class imbalanced datasets. The research introduces a new approach that combines k-means clustering with over-sampling techniques, namely the Clustering Switching Method for Sampling Imbalanced Data, or ClusIM. The results show that ClusIM achieves higher F-measure and G-mean values than the other methods. In particular, on the majority classes ClusIM obtains F-measure and G-mean values of about 90% on all datasets. Moreover, ClusIM reduces the class overlap and the imbalance ratio between classes, which contributes to its good performance.
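
As an illustration of the two evaluation measures named in the abstract, the following minimal sketch computes the F-measure and the G-mean from confusion-matrix counts using their standard definitions. It is not taken from the article: the function names and the example counts are hypothetical, and the F-measure shown is the balanced F1 variant, which is assumed here because the abstract does not state which weighting was used.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """G-mean: geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for a 10:90 imbalanced test set; the minority class
# is treated as the positive class.
tp, fn, tn, fp = 5, 5, 88, 2

accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy  = {accuracy:.3f}")
print(f"F-measure = {f_measure(tp, fp, fn):.3f}")
print(f"G-mean    = {g_mean(tp, fn, tn, fp):.3f}")
```

With these made-up counts the classifier finds only half of the minority instances, yet accuracy stays high because the majority class dominates. The F-measure and the G-mean both drop noticeably, which is why they are commonly preferred over accuracy on imbalanced data.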

Article Details

How to Cite
ประจวบศุภกิจ ว. A Hybrid Data Level Approach for Improving Classification Performance in Imbalanced Dataset. Prog Appl Sci Tech. [Internet]. 2018 Dec. 30 [cited 2024 May 6];8(2):125-42. Available from: https://ph02.tci-thaijo.org/index.php/past/article/view/243034
Section
Information and Communications Technology
