Evaluation of the Performance of Machine Learning and Deep Learning Methods at Different Sample Sizes in Genome-Wide Association Studies

Ragıp Onur ÖZTORNACI^a, Erdal COŞGUN^b, Bahar TAŞDELEN^a
^aMersin University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Mersin, TÜRKİYE
^bMicrosoft Research, 14820 NE 36th Street, Building 99, Redmond, WA, 98052, USA

Turkiye Klinikleri J Biostat. 2020;12(2):204-10

doi: 10.5336/biostatic.2020-73403

Article Language: TR


ABSTRACT
Objective: Machine learning (ML) and deep learning (DL) methods are increasingly used in genome-wide association studies (GWAS). Used alongside the classical GWAS methodology, these methods help validate the results obtained. The aim of this study was therefore to compare the most commonly used ML methods and a DL method at different sample sizes and to identify the classification method that most accurately separates patients from controls. Material and Methods: For the ML and DL methods, data sets of different sample sizes were generated with the same prevalence and minor allele frequency (MAF) and with equal numbers of patients and controls. In all data sets, 10-fold cross-validation was applied for every method, and the data were split into a 70% training set and a 30% test set to assess the validity and predictive power of the models. PLINK was used for the simulations, the R programming language for ML, and the Python programming language on Microsoft Azure GPU machines for DL. Results: For a sample size of N=200, the best classification performance among the ML methods was achieved by Random Forest (RF) and CART, each with an accuracy (Acc.) of 0.73, while DL reached an Acc. of 0.60. For N=600, support vector machines (SVM) achieved the best classification performance among the ML methods with an Acc. of 0.78, while DL reached an Acc. of 0.64. At the maximum sample size, the correct classification rate did not appear to change appreciably for either the ML or the DL methods. Conclusion: In recent years, SVM has achieved high correct classification rates, particularly in data sets in which the independent variables are strongly correlated. In this study, the highest result was obtained with SVM regardless of sample size. This demonstrates the success of nonlinear ML methods, which are frequently encountered in genetic data.

Keywords: Single nucleotide polymorphism; machine learning; deep learning
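
The evaluation protocol described in the abstract (balanced patient-control data at a fixed MAF, a 70%/30% training/test split, and 10-fold cross-validation) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the study simulated data with PLINK and fitted RF, CART, SVM and DL models, whereas the sketch below uses only the Python standard library and a simple nearest-centroid classifier as a stand-in. The function names and the values `n_snps=20`, `maf=0.3`, `effect=0.15` and the choice of five causal SNPs are assumptions made purely for illustration.

```python
import random

def simulate_case_control(n, n_snps=20, maf=0.3, effect=0.15, seed=0):
    """Simulate balanced case-control genotypes coded as minor-allele counts (0/1/2).

    Hypothetical signal: cases carry the minor allele more often at the first 5 SNPs.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate control (0) and case (1) for equal group sizes
        row = []
        for j in range(n_snps):
            p = maf + effect if (label == 1 and j < 5) else maf
            row.append(sum(rng.random() < p for _ in range(2)))  # two alleles per SNP
        X.append(row)
        y.append(label)
    return X, y

def train_test_split(n, test_frac=0.3, seed=0):
    """Shuffle sample indices and split them into training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit_centroids(X, y, idx):
    """Stand-in classifier: mean genotype vector per class."""
    cents = {}
    for label in (0, 1):
        rows = [X[i] for i in idx if y[i] == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, cents[lab])))

def accuracy(X, y, cents, idx):
    """Proportion of correctly classified samples among the given indices."""
    return sum(predict(cents, X[i]) == y[i] for i in idx) / len(idx)

X, y = simulate_case_control(200)
train_idx, test_idx = train_test_split(len(y))

# 10-fold cross-validation inside the training set
cv_scores = [accuracy(X, y, fit_centroids(X, y, tr), va) for tr, va in k_fold(train_idx)]
cv_acc = sum(cv_scores) / len(cv_scores)

# Final fit on the full training set, evaluated once on the held-out 30%
test_acc = accuracy(X, y, fit_centroids(X, y, train_idx), test_idx)
print(f"10-fold CV accuracy: {cv_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Calling `simulate_case_control` with other values of `n` (for example 600) repeats the comparison at a different sample size; any classifier exposing a fit and predict step could replace the nearest-centroid stand-in.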

