Objective: Colon cancer originates in the inner wall of the large intestine, and patient survival depends strongly on early detection. Clinical diagnosis often takes a long time, especially in most developing countries with limited facilities. Microarray technology has given oncologists a new approach: identifying cancerous tissue with non-clinical machine learning methods. This paper aims to predict the status of colon cancer tissues using Bayesian Additive Regression Trees (BART) and two other machine learning methods. Material and Methods: BART was implemented and compared with two competing methods, Random Forest (RF) and Gradient Boosting Machine (GBM). The analysis used the microarray colon cancer dataset, which consists of 2,000 gene expression measurements for 62 tissue samples. Results: The methods were compared on overall metrics (accuracy, balanced accuracy, detection rate, F-measure and AUC) and class-specific metrics (sensitivity, specificity, positive predictive value and negative predictive value). On the overall metrics, RF was the best method. On the class-specific metrics, BART was better than RF. Conclusion: On average, BART is more sensitive in detecting the presence of colon cancer cells, while RF is more accurate and specific in detecting the presence or absence of colon cancer cells.
Keywords: Colon cancer; Bayesian trees; random forest; gradient boosting
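The overall and class-specific metrics named in the Results can all be derived from a 2x2 confusion matrix, with AUC additionally requiring predicted scores. The following is a minimal sketch, not the authors' code: the abstract does not define the metrics, so the formulas below assume the common conventions (detection rate as true positives over all samples, F-measure as the harmonic mean of positive predictive value and sensitivity, AUC as the probability that a positive sample is scored above a negative one). The names `ConfusionCounts`, `classification_metrics` and `auc`, and the counts in the usage example, are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a binary confusion matrix (positive class = tumour)."""
    tp: int  # tumour samples predicted as tumour
    fn: int  # tumour samples predicted as normal
    fp: int  # normal samples predicted as tumour
    tn: int  # normal samples predicted as normal


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return overall and class-specific metrics.

    Assumes every denominator is non-zero, i.e. both classes are
    present and both classes are predicted at least once.
    """
    total = c.tp + c.fn + c.fp + c.tn
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)
    npv = c.tn / (c.tn + c.fn)
    return {
        # Overall metrics
        "accuracy": (c.tp + c.tn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "detection_rate": c.tp / total,  # assumed definition
        "f_measure": 2 * ppv * sensitivity / (ppv + sensitivity),
        # Class-specific metrics
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


def auc(pos_scores, neg_scores) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    counts = ConfusionCounts(tp=8, fn=2, fp=1, tn=9)
    for name, value in classification_metrics(counts).items():
        print(f"{name}: {value:.3f}")
    # Hypothetical predicted probabilities for tumour and normal samples.
    print(f"auc: {auc([0.9, 0.8, 0.4], [0.7, 0.3]):.3f}")
```

Separating the two groups of metrics in this way shows how one method can lead on sensitivity while another leads on accuracy and specificity: the quantities use different rows and columns of the same table.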