Classification of RNA-Sequencing Data Via Poisson and Negative Binomial Linear Discriminant Analyses: A Methodological Study

Dinçer GÖKSÜLÜK^a, Ahmet Ergün KARAAĞAOĞLU^b
^aDepartment of Biostatistics, Erciyes University Faculty of Medicine, Kayseri, Türkiye
^bDepartment of Biostatistics, Lokman Hekim University Faculty of Medicine, Ankara, Türkiye

Turkiye Klinikleri J Biostat. 2023;15(3):150-60

doi: 10.5336/biostatic.2023-98597

Article Language: EN

ABSTRACT
Objective: Microarray and RNA sequencing (RNA-Seq) technologies are frequently employed in genetic data analysis for detecting disease-associated genes, identifying cancer subtypes, and enabling molecular diagnosis. While numerous methods have been proposed for classifying microarray data, few have been developed for classifying RNA-Seq data. This study aims to compare the performance of novel methods developed for RNA-Seq data on three distinct real-life datasets. Material and Methods: Cervical cancer, Alzheimer's disease, and kidney cancer RNA-Seq data were used in this study. The data were divided into training and test sets at a ratio of 70% to 30%, respectively. Various preprocessing steps, such as normalization, power transformation, and variance filtering, were applied to the data. The Poisson Linear Discriminant Analysis (PLDA) and Negative Binomial Linear Discriminant Analysis (NBLDA) models were used for classification, and their predictive performances were compared. Results: Among the three datasets, the Alzheimer's data exhibited the lowest overdispersion, while the cervical cancer data exhibited the highest. The NBLDA model demonstrated better classification performance than the PLDA model. In cases of mild-to-moderate overdispersion, the predictive performance of the PLDA model improved when power transformation was applied, becoming similar to that of the NBLDA model. Conclusion: PLDA and NBLDA are two novel and promising techniques for classifying RNA-Seq data. The performance of these models is influenced by the degree of overdispersion. When overdispersion is high, the NBLDA model is recommended.
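The Poisson model behind PLDA can be illustrated with a minimal sketch. It follows the formulation in Witten's Poisson classification paper without the shrinkage step, and it is not the code used in this study. The function names `fit_plda` and `predict_plda`, the prior parameter `beta`, and the toy counts are assumptions made for illustration. Each count is modeled as Poisson with mean s_i * g_j * d_kj, where s_i is a sample size factor (total-count normalization is assumed here), g_j is the gene total, and d_kj is the class-specific fold change.

```python
import math

def fit_plda(counts, labels, beta=1.0):
    """Estimate Poisson LDA parameters from a samples-by-genes count matrix.

    beta is a smoothing constant (assumed value) that keeps every
    class effect d_kj strictly positive.
    """
    n_genes = len(counts[0])
    total = sum(sum(row) for row in counts)
    # Size factor s_i: share of all training reads found in sample i.
    size = [sum(row) / total for row in counts]
    # Gene total g_j: reads for gene j summed over training samples.
    gene_tot = [sum(row[j] for row in counts) for j in range(n_genes)]
    d, prior = {}, {}
    for k in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == k]
        prior[k] = len(idx) / len(labels)
        d[k] = []
        for j in range(n_genes):
            observed = sum(counts[i][j] for i in idx)
            expected = sum(size[i] * gene_tot[j] for i in idx)
            d[k].append((observed + beta) / (expected + beta))
    return {"total": total, "gene_tot": gene_tot, "d": d, "prior": prior}

def predict_plda(model, x):
    """Return the class with the largest Poisson discriminant score."""
    s_new = sum(x) / model["total"]  # size factor of the new sample
    scores = {}
    for k, d_k in model["d"].items():
        score = math.log(model["prior"][k])
        for x_j, g_j, d_kj in zip(x, model["gene_tot"], d_k):
            # Poisson log-likelihood terms that depend on the class k.
            score += x_j * math.log(d_kj) - s_new * g_j * d_kj
        scores[k] = score
    return max(scores, key=scores.get)

# Toy data (invented): 4 samples, 3 genes, 2 classes.
train = [[50, 10, 5], [60, 12, 4], [5, 11, 55], [4, 9, 60]]
labels = ["tumor", "tumor", "normal", "normal"]
model = fit_plda(train, labels)
print(predict_plda(model, [40, 8, 6]))
print(predict_plda(model, [3, 10, 45]))
```

NBLDA replaces the Poisson likelihood with a negative binomial one that carries a gene-wise dispersion parameter, which is why it is expected to cope better with strongly overdispersed counts. That extension is not shown in this sketch.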

Keywords: Genomics; RNA-Sequencing; PLDA; NBLDA; classification

ABSTRACT (TURKISH VERSION)
Objective: Microarray and RNA sequencing technologies are frequently used in genetic studies for purposes such as detecting disease-associated genes, identifying cancer subtypes, and molecular diagnosis. Many methods have been proposed in the literature for classification problems with microarray data. However, only a limited number of methods exist for classification problems with RNA sequencing data. This study aims to compare the performance of new methods developed for RNA sequencing data on three different real datasets. Material and Methods: Cervical cancer, Alzheimer's disease, and kidney cancer RNA sequencing data were used in this study. The data were split into training and test sets at 70% and 30%, respectively. After various preprocessing steps such as normalization, power transformation, and variance filtering, the data were modeled using the Poisson Linear Discriminant Analysis (PLDA) and Negative Binomial Linear Discriminant Analysis (NBLDA) models, and the predictive performances of the models were compared. Results: Among the three datasets, the Alzheimer's data had the lowest overdispersion and the cervical cancer data had the highest. The NBLDA model showed better classification performance than the PLDA model. Where mild-to-moderate overdispersion was observed, the predictive performance of the PLDA model improved when power transformation was applied, and performance similar to that of NBLDA was obtained. Conclusion: PLDA and NBLDA models are new and promising techniques for classifying RNA sequencing data. The performance of these models is affected by the degree of overdispersion in the dataset. When the data are highly overdispersed, the NBLDA model is recommended.

Keywords: Genomics; RNA-sequencing; Poisson Linear Discriminant Analysis; Negative Binomial Linear Discriminant Analysis; classification

