Objective: This study presents a method for optimally selecting gene subsets to improve the non-clinical diagnostic classification and prediction of colorectal cancer from gene expression profiles obtained with an Affymetrix oligonucleotide array. Material and Methods: A hybrid multi-objective support vector machine (SVM) feature selection and classification algorithm was employed to identify biomarker gene subsets that are statistically and clinically relevant to the 62 (tumour or normal) responses in the gene expression data. Gene selection was carried out in two stages: the first stage used the Bayesian t-test to prune non-informative genes, and the second stage used a multi-objective optimization method that adds genes sequentially to determine the optimal subset of the pre-selected genes. An SVM with a radial basis function (RBF) kernel (SVM_RBF) was fitted sequentially to select the set of near-optimal genes that are correlated with the response class. Results: The optimally selected gene subset yielded an accuracy of 90.1% on test data that were never used in building the algorithm. Furthermore, principal component analysis and complete-linkage hierarchical clustering indicated near-perfect discrimination between the two clinical response groups (colorectal cancer status) of the patients. Conclusion: This work demonstrates that non-clinical diagnosis and prediction of colon cancer from patients' gene signatures in microarray expression data is feasible when appropriate data mining techniques are used.
Keywords: Support vector machines; feature selection; multi-objective optimization; principal component analysis; clustering
Objective: This study presents a method for the optimal selection of gene subsets, with the aim of improving the non-clinical diagnostic classification and prediction of colorectal cancer using the gene expression levels of profiles obtained with an Affymetrix oligonucleotide array. Materials and Methods: A hybrid multi-objective support vector machine (SVM) feature selection and classification algorithm was used to determine the biomarker subsets that are statistically and clinically highly relevant to the 62 (tumour or normal) responses of the gene expression levels. Gene selection was carried out in two stages: in the first stage, the Bayesian t-test was used to prune uninformative genes; in the second stage, a multi-objective optimization method that allows the sequential addition of genes was used for optimal determination of the pre-selected gene subsets. The SVM with an RBF kernel (SVM_RBF) was fitted sequentially to select the set of near-optimal genes correlated with the response class. Results: The optimally selected gene subset achieved an accuracy of 90.1% on test data that were never used in building the algorithm. In addition, the results of principal component analysis and complete-linkage hierarchical clustering showed near-perfect discrimination between the two clinical response groups of colorectal cancer status. Conclusion: This study shows that non-clinical diagnosis of colon cancer, and prediction for patients from their gene signatures in microarray expression data, is very possible when appropriate data mining tools are used.
Keywords: Support vector machines; feature selection; multi-objective optimization; principal component analysis; clustering
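The two-stage selection described in the abstract (a test-based filter followed by sequential addition of genes) can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the authors' implementation: the Welch t statistic stands in for the Bayesian t-test, a leave-one-out nearest-centroid rule stands in for the RBF-kernel SVM, accuracy is the only objective optimized, and the data are synthetic rather than the colon expression profiles. All function names (`welch_t`, `filter_genes`, `loo_centroid_accuracy`, `forward_select`) are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (plain stand-in for the Bayesian t-test)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_genes(X, y, keep):
    """Stage 1: rank genes by |t| between the two classes and keep the top ones."""
    scores = []
    for g in range(len(X[0])):
        tumour = [row[g] for row, label in zip(X, y) if label == 1]
        normal = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(tumour, normal)), g))
    return [g for _, g in sorted(scores, reverse=True)[:keep]]


def loo_centroid_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid rule on the chosen genes.

    Hypothetical stand-in scorer; the study fits an RBF-kernel SVM instead.
    """
    hits = 0
    for i in range(len(X)):
        dist = {}
        for cls in (0, 1):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == cls]
            centroid = [mean(r[g] for r in rows) for g in genes]
            dist[cls] = sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroid))
        hits += int(min(dist, key=dist.get) == y[i])
    return hits / len(X)


def forward_select(X, y, candidates, score_fn, max_genes=5):
    """Stage 2: add one gene at a time while the score keeps improving."""
    chosen, best = [], 0.0
    while len(chosen) < max_genes:
        trials = [(score_fn(X, y, chosen + [g]), g)
                  for g in candidates if g not in chosen]
        if not trials:
            break
        score, gene = max(trials, key=lambda t: t[0])
        if score <= best:  # no improvement: stop adding genes
            break
        chosen.append(gene)
        best = score
    return chosen, best


# Synthetic example: 40 samples, 30 genes, only genes 0 and 1 differ by class.
random.seed(0)
n, p = 40, 30
y = [i % 2 for i in range(n)]
X = [[random.gauss(3.0 * label if g < 2 else 0.0, 1.0) for g in range(p)]
     for label in y]

top = filter_genes(X, y, keep=10)
subset, acc = forward_select(X, y, top, loo_centroid_accuracy)
print("pre-selected genes:", top)
print("selected subset:", subset, "leave-one-out accuracy:", acc)
```

Because `forward_select` takes the scorer as an argument, a cross-validated RBF-kernel SVM scorer could be passed in place of `loo_centroid_accuracy` without changing the selection loop. Accuracy should still be reported on held-out samples that played no part in either stage, as the abstract does for its test data.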