Objective: High dimension, low sample size (HDLSS) settings are now common in many fields, including genetics, bioinformatics, and medical imaging. This study aimed to investigate the classification performance of Random Ferns, a relatively new classifier, on HDLSS settings with different characteristics. Material and Methods: Through simulation studies, artificial data sets with different characteristics in terms of dimension, sample size, correlation structure, noise ratio, and prevalence were generated. Each scenario was repeated 1,000 times, and the classification performance of Random Ferns was compared with that of support vector machines (SVM), which are known for their high classification performance. Results: The performance of Random Ferns varied differently from that of SVM. Most notably, it performed better at small sample sizes (n=20). When the F values were examined, the classification performance of Random Ferns under imbalanced class distributions changed little relative to the balanced case and was higher than that of SVM. The high accuracy expected in classical data structures, however, was not achieved. Conclusion: Notably, Random Ferns outperformed SVM, especially under balanced distributions and small sample sizes. This method, which so far has few applications in the health field, may contribute to studies where the number of observations is quite low.
Keywords: Random Ferns; classification; machine learning; high dimensional data; simulation
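The abstract does not restate the Random Ferns algorithm itself. As a rough illustration of the kind of simulation comparison described above, the following is a minimal NumPy sketch of a Random Ferns classifier (in the sense of Özuysal et al., 2010) applied to one synthetic HDLSS data set. All sizes (n=20, p=1,000, 10 informative features), the class shift of 1.5, and the fern settings are illustrative assumptions, not the study's actual scenarios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative HDLSS data: n=20 samples, p=1000 features, two classes.
n, p, n_informative = 20, 1000, 10
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :n_informative] += 1.5 * y[:, None]  # class signal in the first 10 features

def fit_ferns(X, y, n_ferns=50, depth=5, rng=rng):
    """Each fern is `depth` random (feature, threshold) binary tests.
    The test outcomes form a leaf index; class-conditional leaf
    probabilities are estimated from counts with Laplace smoothing."""
    bits = 1 << np.arange(depth)
    ferns = []
    for _ in range(n_ferns):
        feats = rng.integers(0, X.shape[1], size=depth)
        thr = rng.normal(size=depth)
        leaves = (X[:, feats] > thr) @ bits        # leaf index per sample
        counts = np.ones((2, 1 << depth))          # Laplace prior
        np.add.at(counts, (y, leaves), 1.0)
        log_post = np.log(counts / counts.sum(axis=1, keepdims=True))
        ferns.append((feats, thr, log_post))
    return ferns

def predict_ferns(ferns, X):
    """Sum per-fern log-posteriors (semi-naive Bayes), take the argmax."""
    bits = 1 << np.arange(len(ferns[0][0]))
    score = np.zeros((X.shape[0], 2))
    for feats, thr, log_post in ferns:
        leaves = (X[:, feats] > thr) @ bits
        score += log_post[:, leaves].T
    return score.argmax(axis=1)

ferns = fit_ferns(X, y)
acc = (predict_ferns(ferns, X) == y).mean()  # training accuracy on one data set
print(f"training accuracy: {acc:.2f}")
```

In the study itself, each such scenario would be regenerated and refit 1,000 times and summarized with F values as well as accuracy, and the comparator would be an off-the-shelf SVM; the reference implementations used in practice (e.g. the rFerns R package) would replace this sketch.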