Objective: This study compares the effects of three dimensionality reduction methods [least absolute shrinkage and selection operator (LASSO), principal component analysis (PCA), and independent component analysis (ICA)] on support vector machine (SVM) classification with different kernel functions in high-dimensional acute myeloid leukemia (AML) gene expression data.
Material and Methods: Gene expression profiles of AML patients were obtained from the Gene Expression Omnibus database. The data comprised expression levels of 22,283 genes for 64 individuals. SVM models with different kernel functions were fitted to the data reduced by LASSO, PCA, and ICA. Repeated 10-fold cross-validation (10 repeats) was used for resampling, and random search was used for hyperparameter optimization. Model performance was evaluated using the accuracy, sensitivity, specificity, precision, and F criterion averaged over 500 iterations.
Results: Filtering reduced the AML data to 6,201 genes. Ten components were extracted with PCA/ICA, and 21 genes were selected as biomarkers for AML. The polynomial-kernel SVM with PCA achieved the highest accuracy, and SVM models with a polynomial kernel performed best across all analyses. The models were then selected for their potential biomarkers.
Conclusion: When building classification models from gene expression data, high dimensionality should first be addressed with dimensionality reduction methods. Doing so shortens analysis time and improves prediction performance. In the AML gene expression dataset, SVM models with a polynomial kernel gave better results than linear and radial basis function models.
Keywords: Dimension reduction; gene expression; feature extraction; feature selection; biomarker discovery
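The resampling and evaluation scheme described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the function names (`classification_metrics`, `repeated_kfold`, `average_metrics`) are hypothetical, the "F criterion" is assumed here to be the F1 score, and the folds are plain shuffled splits without stratification. The abstract does not say how its 500 iterations are composed, so the sketch only shows the 10 repeats of 10-fold splitting for 64 samples and does not try to reproduce that count.

```python
import random
from statistics import mean


def classification_metrics(y_true, y_pred, positive=1):
    """Binary metrics from true and predicted labels (F assumed to be F1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) index lists for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Every k-th element of the shuffled order forms one fold.
        folds = [order[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train_idx, test_idx


def average_metrics(per_resample):
    """Average each metric over a list of per-resample metric dictionaries."""
    return {name: mean(m[name] for m in per_resample) for name in per_resample[0]}


# 64 samples, 10 repeats of 10-fold cross-validation.
splits = list(repeated_kfold(64, k=10, repeats=10))

# Toy labels only, to show the metric calculation; these are not study data.
example = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

In practice, each `(train_idx, test_idx)` pair would be used to fit the reduction step and the SVM on the training indices only and to score the held-out indices, after which `average_metrics` would summarize the per-resample results.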