Objective: Many machine learning classification algorithms are not robust to imbalanced classes and produce inaccurate, biased models. One way to address class imbalance is to assign weights to the classes. This article proposes a new class-weighting approach to improve classification when two classes are imbalanced. Material and Methods: The performance of the new formulation was compared with that of the previously proposed Inverse of Square Root of Number of Samples formula, the effective number of samples weighting formula, and the unweighted Random Forest solution. A simulation study with 1,000 repetitions was performed across 3 imbalance rates (0.10, 0.20, 0.30), 6 sample sizes (250, 300, 350, 400, 450, 500), and 4 methods. Additionally, the methods were applied to a lung cancer dataset with 39 samples in the minority group and 270 samples in the majority group. Results: Experimental results demonstrated that the proposed weighting formula, least number of ratio and range multiplier, performed as well as or better than Inverse of Square Root of Number of Samples in both the simulations and the real data. In general, the minority class accuracy and balanced accuracy of the proposed formulation were either very close to or higher than those of Inverse of Square Root of Number of Samples. Conclusion: The new formulation provided accuracy estimates for the 2 classes in a balanced way at each sample size and each imbalance rate. Additionally, as the sample size increased from 250 to 500, stable, decreasing weights were obtained for the patient and control groups.
Keywords: Class imbalance; class-weighting methods; classification; Random Forest algorithm
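The two baseline weighting schemes named in the abstract, and the balanced accuracy metric, can be illustrated with a minimal sketch. It uses the commonly cited forms w_c = 1/sqrt(n_c) for Inverse of Square Root of Number of Samples and w_c = (1 - beta)/(1 - beta^n_c) for the effective number of samples. The value beta = 0.999 and the normalization of the weights to sum to 1 are assumptions made here for illustration, and all function names are hypothetical. The proposed least number of ratio and range multiplier formula is not reproduced, because the abstract does not state it.

```python
import math

def isns_weights(counts):
    """Inverse of Square Root of Number of Samples: w_c = 1 / sqrt(n_c)."""
    return {c: 1.0 / math.sqrt(n) for c, n in counts.items()}

def ens_weights(counts, beta=0.999):
    """Effective number of samples: w_c = (1 - beta) / (1 - beta ** n_c).
    beta = 0.999 is an assumed value, not taken from the abstract."""
    return {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}

def normalize(weights):
    """Rescale weights so they sum to 1 (an illustrative convention)."""
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (recalls)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Class sizes of the lung cancer dataset described in the abstract.
counts = {"minority": 39, "majority": 270}
isns = normalize(isns_weights(counts))
ens = normalize(ens_weights(counts))

# A classifier that always predicts the majority class looks accurate overall
# but has a balanced accuracy of only 0.5.
y_true = [1] * 39 + [0] * 270
y_pred = [0] * 309
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(isns, ens, overall_acc, bal_acc)
```

Both schemes give the minority class the larger weight. The all-majority prediction shows why the abstract reports minority class accuracy and balanced accuracy rather than overall accuracy alone.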