Objective: This study aimed to develop a novel, user-friendly, machine learning-based web tool that applies a variety of sampling techniques to address the class imbalance problem.
Material and Methods: The web tool was developed with Shiny, an open-source R package. The interactive tool handles class imbalance in binary classification datasets by implementing sampling-based methods. As a clinical application, a dataset obtained retrospectively from the database of the Cardiovascular Surgery Department of Turgut Özal Medical Center, İnönü University, Malatya, Türkiye was analyzed with the tool. Sampling-based methods were applied to the original dataset to overcome the class imbalance, after which hypertension in patients with coronary artery disease was classified using three classification models.
Results: According to the outputs of the web application, the best classification performance was obtained by the support vector machine with radial basis function kernel (SVM-RBF) model after applying the density-based synthetic minority over-sampling technique (DBSMOTE). The accuracy, sensitivity, specificity, precision, F-measure, and G-mean of this model were 0.99, 0.99, 0.99, 0.95, 0.97, and 0.97, respectively.
Conclusion: In this study, the oversampling methods improved the classification performance of the models more than the undersampling methods did. With undersampling, none of the three classification models achieved successful classification performance, whereas with oversampling the SVM-RBF model outperformed the other two models. The interactive web application is freely accessible at http://biostatapps.inonu.edu.tr/twoclsbalancer.
Keywords: Classification; coronary artery disease; hypertension; class imbalance problem; web-based application
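The six performance metrics reported in the Results can all be derived from the four cells of a binary confusion matrix. The following is a minimal Python sketch of the standard textbook definitions, not the code of the web tool (which is built with R and Shiny). The function names `safe_div` and `binary_metrics` and the example counts are hypothetical and chosen only for illustration. The abstract does not state which G-mean definition was used; the sketch assumes the square root of sensitivity times specificity, while another common definition is the square root of precision times recall.

```python
from math import sqrt


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives, and false negatives.
    """
    sensitivity = safe_div(tp, tp + fn)   # recall on the positive class
    specificity = safe_div(tn, tn + fp)   # recall on the negative class
    precision = safe_div(tp, tp + fp)     # positive predictive value
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    # F-measure: harmonic mean of precision and sensitivity.
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # G-mean (assumed definition): geometric mean of sensitivity and specificity.
    g_mean = sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical confusion-matrix counts, used only to exercise the function.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

On imbalanced data, accuracy alone can look high even when the minority class is poorly recognized, which is why class-sensitive measures such as sensitivity, specificity, F-measure, and G-mean are reported alongside it.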