Alternatif Uç Birleştirme Bölgelerinin Makine Öğrenimi ve Derin Öğrenme Yöntemleriyle Tahmin Edilmesinde Örnek Genişliğinin Etkisi: Metodolojik Bir Çalışma

Ragıp Onur ÖZTORNACI

doi:10.5336/biostatic.2024-101686

Turkiye Klinikleri Journal of Biostatistics

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Effect of Sample Size on Predicting Alternative Splicing Regions Using Machine Learning and Deep Learning Methods: A Methodological Study

Ragıp Onur ÖZTORNACI^a
^aKoç University Research Center for Translational Medicine, Department of Biostatistics, İstanbul, Türkiye

Turkiye Klinikleri J Biostat. 2024;16(2):84-94

Article Language: TR

ÖZET (English translation)
Objective: Alternative splicing is the process by which the exons of a gene are joined in different combinations during RNA processing. This process allows different forms of the protein encoded by a gene to be produced. The aim of this study is to determine which model detects alternative splicing regions with higher accuracy. Material and Methods: Simulations were carried out in the Python programming language, with sample sizes set to 25, 50, 100, 150, and 200. In the genetic datasets, the frequencies of the A, C, G, and T nucleotides were distributed equally. The splicing regions were defined as GTAGC, GTAGT, GTAGA, and GTCGA, and the simulation was repeated 10,000 times. For all datasets, to prevent overfitting, the data were divided into a 67.5% training set, a 10% test set, and a 22.5% validation set for the support vector machine, random forest (RF), long short-term memory (LSTM), and deep neural network (DNN) methods, and the predictive power of the models was tested. The R programming language was used for the graphs, and all operations were performed on a Unix operating system. Results: Model performance was observed to increase as the sample size increased. In particular, the performance of the DNN and LSTM models rose steadily with increasing sample size. The LSTM model generally had higher F1-score, specificity, and sensitivity values than the other models. The RF model generally worked effectively at large sample sizes. Conclusion: This study shows that machine learning and deep learning models can be used effectively in identifying alternative splicing events.

Keywords: Alternative splicing; deep learning; machine learning
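
The simulation design described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's code: the sequence length of 60, the balanced two-class labelling (a sequence is positive if one of the four motifs is planted in it), and the helper names `simulate_sequence`, `simulate_dataset`, and `split_dataset` are all assumptions, since the abstract does not specify them. Only the nucleotide frequencies, the motifs, the sample sizes, and the 67.5/22.5/10 split come from the text.

```python
import random

NUCLEOTIDES = "ACGT"
# Splicing regions named in the abstract.
MOTIFS = ["GTAGC", "GTAGT", "GTAGA", "GTCGA"]


def contains_motif(seq):
    """Return True if any of the four motifs occurs anywhere in seq."""
    return any(motif in seq for motif in MOTIFS)


def simulate_sequence(length, rng, with_motif):
    """Draw a sequence with equal A/C/G/T frequencies.

    Positives get one motif planted at a random position; negatives are
    redrawn until no motif occurs by chance (an assumed labelling rule).
    """
    while True:
        seq = [rng.choice(NUCLEOTIDES) for _ in range(length)]
        if with_motif:
            motif = rng.choice(MOTIFS)
            start = rng.randrange(length - len(motif) + 1)
            seq[start:start + len(motif)] = motif
            return "".join(seq)
        candidate = "".join(seq)
        if not contains_motif(candidate):
            return candidate


def simulate_dataset(n_samples, length=60, seed=0):
    """Build a shuffled list of (sequence, label) pairs with alternating labels."""
    rng = random.Random(seed)
    data = []
    for i in range(n_samples):
        label = i % 2  # assumed balanced design: half positives, half negatives
        data.append((simulate_sequence(length, rng, bool(label)), label))
    rng.shuffle(data)
    return data


def split_dataset(data, train=0.675, val=0.225):
    """Split into training, validation, and test parts; the remainder is the test set."""
    n_train = round(len(data) * train)
    n_val = round(len(data) * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])


# One replicate per sample size used in the study.
for n in (25, 50, 100, 150, 200):
    train_set, val_set, test_set = split_dataset(simulate_dataset(n, seed=n))
    print(n, len(train_set), len(val_set), len(test_set))
```

In the study this procedure would be repeated 10,000 times per sample size, with a new seed for each replicate. The 67.5/22.5/10 proportions are what results from holding out 10% for testing and then splitting the remaining 90% in a 75:25 ratio, although the abstract does not say whether the split was done in that order.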

ABSTRACT
Objective: Alternative splicing is the process by which the exons of a gene are joined in different combinations during RNA processing, allowing a single gene to produce several protein isoforms. This study aims to identify which model detects alternative splicing regions with the highest accuracy. Material and Methods: Simulations were run in Python with sample sizes of 25, 50, 100, 150, and 200. The A, C, G, and T nucleotides were given equal frequencies in the simulated genetic datasets. Splicing regions were defined as GTAGC, GTAGT, GTAGA, and GTCGA, and the simulation was repeated 10,000 times. To limit overfitting, each dataset was split into 67.5% training, 22.5% validation, and 10% test sets for the support vector machine (SVM), random forest (RF), long short-term memory (LSTM), and deep neural network (DNN) methods, and the predictive power of each model was then tested. Graphs were produced in R, and all computations were run on a Unix system. Results: Model performance improved as sample size increased, most steadily for the DNN and LSTM models. The LSTM model generally achieved higher F1-score, specificity, and sensitivity than the other models. The RF model performed well mainly at larger sample sizes. Conclusion: This study shows that machine learning and deep learning models can be used effectively to identify alternative splicing events. The findings also highlight the importance of model choice and sample size for the accuracy of alternative splicing region detection in RNA processing studies.

Keywords: Alternative splicing; deep learning; machine learning
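
The three metrics named in the results (F1-score, specificity, and sensitivity) all derive from the confusion matrix of a binary classifier. The sketch below shows the standard definitions. The function names and the example labels are hypothetical and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return sensitivity, specificity, and F1-score (0.0 when undefined)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Made-up test-set labels and predictions, for illustration only.
example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(classification_metrics(example_true, example_pred))
```

With very small test sets, such as the 10% held out from a 25-sequence sample, each metric can take only a few distinct values. This is one reason to average the metrics over many simulation replicates rather than read them from a single run.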
