Optimization of Imbalanced Tuberculosis Data Classification Using Cost-Sensitive Binary Logistic Regression

Ihsan Fathoni Amri; Muhammad Ivan Ardiansyah; Febrian Hikmah Nur Rohim; Novia Yunanita; Amelia Kusuma Wardani

Authors

Ihsan Fathoni Amri, Department of Data Science, Faculty of Science and Agriculture, Universitas Muhammadiyah Semarang, Semarang, Indonesia
Muhammad Ivan Ardiansyah, Department of Data Science, Faculty of Science and Agriculture, Universitas Muhammadiyah Semarang, Semarang, Indonesia
Febrian Hikmah Nur Rohim, Department of Data Science, Faculty of Science and Agriculture, Universitas Muhammadiyah Semarang, Semarang, Indonesia
Novia Yunanita, Department of Data Science, Faculty of Science and Agriculture, Universitas Muhammadiyah Semarang, Semarang, Indonesia
Amelia Kusuma Wardani, Department of Data Science, Faculty of Science and Agriculture, Universitas Muhammadiyah Semarang, Semarang, Indonesia

Keywords:

Tuberculosis, binary logistic regression, cost-sensitive learning, SMOTE, class imbalance

Abstract

Tuberculosis (TB) remains a major public health challenge in Indonesia, particularly in urban areas. This study aims to improve the classification of TB cases by comparing three binary logistic regression approaches: standard binary logistic regression, cost-sensitive binary logistic regression, and SMOTE-based binary logistic regression. The dataset consists of 5,180 patient samples obtained from a health foundation. Initial analysis reveals a significant class imbalance: TB-negative cases dominate the data, while TB-positive cases are relatively scarce. The standard binary logistic regression model performs poorly on positive cases. Out of 195 TB-positive cases, only 4 were correctly identified, while 191 were misclassified as negative, which poses a high risk in real-world implementation.

In contrast, the cost-sensitive binary logistic regression approach assigns higher weights to the minority class to reduce the bias caused by class imbalance. The class weights are determined from the inverse class frequency using the formula $w_k = N/N_k$, where $N$ is the total number of training samples and $N_k$ is the number of training samples in class $k$. Based on the distribution of the training dataset, which consists of 3,175 negative cases and 451 positive cases, the resulting weights are approximately $w_{negative} \approx 1.14$ and $w_{positive} \approx 8.04$. Applying this weighting scheme improves the model's ability to detect positive cases, with 76 cases correctly classified, which is particularly relevant in a context of low public disclosure regarding health conditions. The SMOTE-based binary logistic regression model achieves a higher recall, detecting 82 positive cases; however, the use of synthetic data raises potential concerns about predictive validity. Overall, the cost-sensitive model achieved a recall of 39%, an F1-score of 32%, and an overall accuracy of 79%, with higher AUC-ROC and AUC-PR values than the baseline model. Although the resulting recall remains moderate, the cost-sensitive approach shows potential for enhancing the model's ability to detect positive cases. This approach may therefore be considered a supporting method in efforts to develop more targeted TB control strategies in Indonesia.
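The arithmetic behind the reported weights and recall values can be checked with a short sketch. The function and variable names below are illustrative and not taken from the study. The recall calculation assumes that the 76 correctly classified cases are counted against the same 195 TB-positive cases reported for the baseline model.

```python
def inverse_frequency_weights(class_counts):
    """Return w_k = N / N_k for each class k, where N is the total sample count."""
    total = sum(class_counts.values())
    return {label: total / count for label, count in class_counts.items()}


def recall(true_positives, actual_positives):
    """Share of actual positive cases that the model identified correctly."""
    return true_positives / actual_positives


# Training-set class distribution reported in the abstract.
train_counts = {"negative": 3175, "positive": 451}
weights = inverse_frequency_weights(train_counts)

# Assumption: both counts refer to the same 195 TB-positive evaluation cases.
baseline_recall = recall(4, 195)
cost_sensitive_recall = recall(76, 195)

for label, weight in weights.items():
    print(f"w_{label} = {weight:.2f}")
print(f"baseline recall = {baseline_recall:.1%}")
print(f"cost-sensitive recall = {cost_sensitive_recall:.1%}")
```

Under these assumptions, the computed weights round to the values quoted above, and the cost-sensitive recall rounds to the reported 39%.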
