Comparison of Unbalanced Data Methods for Support Vector Machines

Pelin AKIN^a, Yüksel TERZİ^b
^aDepartment of Statistics, Çankırı Karatekin University Faculty of Science, Çankırı, TURKEY
^bDepartment of Statistics, Ondokuz Mayıs University Faculty of Science, Samsun, TURKEY

Turkiye Klinikleri J Biostat. 2021;13(2):138-46

doi: 10.5336/biostatic.2020-80268

Article Language: EN

ABSTRACT
Objective: A major problem in applying classification algorithms is that the classification categories are often not equally distributed. Eight different re-sampling methods were used to balance the dataset.

Material and Methods: Support vector machines (SVM) were used to compare these methods. SVMs are supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis. The main task of the algorithm is to find the best line, or hyperplane, that divides the data into two classes. An SVM is basically a linear classifier for linearly separable data, but in general the feature vectors may not be linearly separable. To overcome this issue, the kernel trick was used.

Results: This article presents a comparative study of different kernel functions (linear, radial, and sigmoid) for unbalanced data. The myocardial infarction dataset, taken from GitHub, was classified using 10-fold cross-validation to improve performance. Accuracy, sensitivity, specificity, precision, G-mean, and F score were used to compare the methods. The analysis was carried out in R.

Conclusion: For the linear and sigmoid kernel functions, the random over-sampling examples (ROSE) re-sampling method improved the performance metrics relative to the original data. The SMOTE method performed better with the radial kernel. In general, imbalance in the data biases the results of classification algorithms and should be eliminated.

Keywords: Support vector machine; unbalanced dataset; data sampling methods
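
The six performance metrics named in the abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal Python sketch, not the authors' R code. It assumes the usual definitions: G-mean as the geometric mean of sensitivity and specificity, and the F score with equal weight on precision and sensitivity (F1). The function name and the example counts are hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute six metrics from binary confusion-matrix counts.

    tp/fn count the positive (minority) class, tn/fp the negative class.
    Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    g_mean = sqrt(sensitivity * specificity)
    # F1: harmonic mean of precision and sensitivity (assumed beta = 1)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": g_mean,
        "f_score": f_score,
    }


# Hypothetical counts for an unbalanced test set: 10 positives, 90 negatives.
m = classification_metrics(tp=5, fp=10, tn=80, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the accuracy is 0.85 even though only half of the positive cases are detected (sensitivity 0.5), which is why metrics such as G-mean and F score are reported alongside accuracy for unbalanced data.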

ÖZET (Turkish abstract, in English translation)
Objective: The biggest problem we encounter when applying classification algorithms is that the classification categories are not equally distributed. Eight different re-sampling methods are used to balance the dataset. Material and Methods: Support vector machines (SVM) were used to compare these methods. SVM is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis. The main task of the algorithm is to find the most accurate line or hyperplane that separates the data into 2 classes. SVM is basically a linear classifier that classifies linearly separable data, but in general the feature vectors may not be linearly separable. The kernel trick is used to overcome this problem. Results: This article presents a comparative study of different kernel functions (linear, radial, and sigmoid) for unbalanced data. For the myocardial infarction dataset taken from GitHub, 10-fold cross-validation was used to improve performance. Accuracy, sensitivity, specificity, precision, G-mean, and F-measure were used to compare the methods. The analysis was carried out in R. Conclusion: For the linear and sigmoid kernel functions, the 'random over sampling examples' re-sampling method improved the performance metrics relative to the original data. For the radial kernel, the SMOTE method improved performance. In classification algorithms, imbalance in the data gives biased results, and this problem should be eliminated.

Keywords: Support vector machine; unbalanced dataset; data sampling methods
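
The idea of balancing a dataset by re-sampling can be illustrated with the simplest variant: plain random over-sampling, which duplicates minority-class rows until every class is as large as the largest one. This is a hedged Python sketch, not one of the authors' eight methods as implemented in R, and it is deliberately simpler than ROSE or SMOTE, which generate new synthetic examples instead of copying existing ones. The function name, seed, and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class (with replacement)
    until all classes match the size of the largest class."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    counts = Counter(labels)
    target = max(counts.values())         # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):       # zero iterations for the majority
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy unbalanced data: four rows of class 0, two rows of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
labels = [0, 0, 0, 0, 1, 1]

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)
print(balanced_counts)
```

When re-sampling is combined with cross-validation, it is generally safer to re-sample only the training folds, so that duplicated or synthetic rows do not leak into the data used for evaluation.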
