Depression Classification with Imbalanced Data Problems: Literature Survey

Artitayaporn Rojarath; Wararat Songpan; Olarik Surinta

Authors

Artitayaporn Rojarath, Mahasarakham University, Thailand, https://orcid.org/0009-0001-7852-1557
Wararat Songpan, Khon Kaen University, Thailand, https://orcid.org/0000-0002-3813-6910
Olarik Surinta, Mahasarakham University, Thailand

Keywords:

Depressive classification, Imbalanced data, Resampling method, Oversampling technique, Machine learning

Abstract

Depression is an increasingly serious global mental health concern, and the number of affected individuals continues to rise. In Thailand, more than 70% of the working-age population is at risk of developing depressive conditions, as reported by the Thai Depression Center. A significant challenge in depression research is imbalanced data: depressive cases (the minority class) are far fewer than non-depressive cases (the majority class). This imbalance often yields classification models that are biased toward the majority class, which reduces their accuracy and effectiveness in classifying depression. This literature survey addresses critical gaps in the field by focusing on the imbalanced data problem in depression classification. Previous studies have relied primarily on traditional oversampling and undersampling techniques, but oversampling can intensify overfitting and undersampling can discard valuable information. We review various resampling methods, with particular emphasis on advanced oversampling techniques that aim to preserve data integrity while mitigating overfitting. The survey also presents a comparative analysis of evaluation metrics, including accuracy, precision, recall, F1-score, and AUC, to give a more nuanced view of classifier performance on imbalanced data. Our findings indicate that oversampling methods are generally effective, but careful implementation is essential to avoid overfitting, which can distort the model's predictive accuracy.
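The point about majority-class bias and the choice of evaluation metrics can be illustrated with a minimal Python sketch. It is not taken from the surveyed studies: the 95/5 class split, the always-negative baseline, and the helper names `metrics` and `random_oversample` are assumptions chosen only for illustration, and plain random oversampling stands in for the more advanced techniques the survey reviews.

```python
import random


def metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes match."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_count = len(y) - len(minority_idx)
    n_extra = max(0, majority_count - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical toy data: 95 non-depressive (0) and 5 depressive (1) cases.
X = list(range(100))
y_true = [0] * 95 + [1] * 5

# A model that always predicts the majority class looks accurate
# but never detects a single depressive case.
baseline = metrics(y_true, [0] * 100)
print(baseline)

# Random oversampling balances the class counts by duplicating minority rows.
X_bal, y_bal = random_oversample(X, y_true)
print(y_bal.count(0), y_bal.count(1))
```

In this toy setting the baseline reaches 0.95 accuracy with zero recall and zero F1, which is why accuracy alone is a weak guide on imbalanced data. Because random oversampling only duplicates existing minority samples, it is usually applied to the training split alone; duplicating before the train/test split would leak copies into the test set and overstate performance.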

Author Biographies

Artitayaporn Rojarath, Mahasarakham University, Thailand

Multi-agent Intelligent Simulation Laboratory (MISL) Research Unit, Department of Information Technology, Faculty of Informatics, Mahasarakham University, Khamriang Sub-District, Kantarawichai District, Mahasarakham 44150, Thailand

Wararat Songpan, Khon Kaen University, Thailand

Department of Computer Science, College of Computing, Khon Kaen University, Nai Muang sub-District, Muang District, Khon Kaen 40002, Thailand

Olarik Surinta, Mahasarakham University, Thailand

Multi-agent Intelligent Simulation Laboratory (MISL) Research Unit, Department of Information Technology, Faculty of Informatics, Mahasarakham University, Thailand
