Objective: This study compares the accuracy, reliability, and validity of regularized logistic regression models, a family of data mining classification techniques, by applying them to a dataset and evaluating them with several performance measures. Material and Methods: With advances in computing and technology, machine learning is now used in many fields, including medicine, and it has become especially popular in cancer diagnosis. A urinary biomarkers dataset from the public Kaggle platform, which is freely available to all researchers, was used to identify the most appropriate model for diagnosing pancreatic ductal adenocarcinoma (PDAC). Because the variables in the dataset exhibit multicollinearity, the following regression models were considered for classifying the diagnosis: logistic lasso, logistic ridge, logistic elastic net, logistic adaptive lasso, logistic adaptive elastic net, and logistic adaptive group lasso. The models were compared in terms of classification success using reliability and validity criteria. Results: Three variables were statistically significant in all regularized logistic models for PDAC diagnosis. When the estimated models were compared, the logistic adaptive group lasso regression model appeared to perform better than the others in diagnosing PDAC. In addition to the three variables that were significant in all models, this model identified age and plasma CA19-9 as important variables for PDAC diagnosis. Conclusion: In the comparative analyses, the logistic adaptive group lasso regression model outperformed the other models on the performance measures.
Keywords: Logistic regression; regularization methods; lasso; adaptive group lasso
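As an illustration only (the abstract does not describe the authors' implementation), the following minimal Python sketch shows how the ridge, lasso, elastic net, and adaptive lasso penalties differ when a logistic model is fitted to collinear predictors. It uses synthetic data rather than the Kaggle urinary biomarkers dataset. The function names (`fit_penalized_logistic`, `make_data`), the penalty values, and the proximal gradient solver are assumptions chosen for brevity, and the group lasso variants are omitted.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_penalized_logistic(X, y, lam, alpha, weights=None, lr=0.5, n_iter=300):
    """Proximal gradient descent (hypothetical helper) for
    mean log-loss + lam * (alpha * sum_j w_j*|b_j| + (1 - alpha)/2 * sum_j b_j**2).
    alpha=1 is the lasso, alpha=0 is ridge, values in between give the elastic net;
    non-uniform weights give the adaptive lasso. The intercept is not penalized."""
    n, p = len(X), len(X[0])
    w = weights if weights is not None else [1.0] * p
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            # Residual of the predicted probability for this observation.
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        b0 -= lr * g0
        for j in range(p):
            # Gradient step on the smooth part (log-loss plus ridge term) ...
            v = b[j] - lr * (g[j] + lam * (1.0 - alpha) * b[j])
            # ... followed by soft-thresholding for the (weighted) L1 term.
            b[j] = soft_threshold(v, lr * lam * alpha * w[j])
    return b0, b


def make_data(n=200, seed=0):
    """Synthetic stand-in data (an assumption, not the study's dataset):
    x2 is nearly collinear with x1, and x3 is unrelated to the outcome."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = x1 + 0.1 * rng.gauss(0, 1)
        x3 = rng.gauss(0, 1)
        X.append([x1, x2, x3])
        y.append(1 if rng.random() < sigmoid(2.0 * x1) else 0)
    return X, y


X, y = make_data()
_, ridge = fit_penalized_logistic(X, y, lam=0.1, alpha=0.0)
_, lasso = fit_penalized_logistic(X, y, lam=0.05, alpha=1.0)
_, enet = fit_penalized_logistic(X, y, lam=0.05, alpha=0.5)

# Adaptive lasso: penalize each coefficient in inverse proportion to an
# initial estimate (here the ridge fit), with exponent gamma.
gamma = 1.0
adaptive_w = [1.0 / (abs(c) + 1e-8) ** gamma for c in ridge]
_, ada_lasso = fit_penalized_logistic(X, y, lam=0.05, alpha=1.0, weights=adaptive_w)

for name, coef in [("ridge", ridge), ("lasso", lasso),
                   ("elastic net", enet), ("adaptive lasso", ada_lasso)]:
    print(name, [round(c, 3) for c in coef])
```

In this sketch the ridge fit supplies the initial estimates for the adaptive weights, which is one reasonable option when predictors are collinear; other initial estimators could be substituted. In practice the penalty strength would be tuned, for example by cross-validation, rather than fixed as it is here.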