Objective: The aim of this study is to divide the data into suitable subgroups using appropriate clustering methods and numbers of clusters, and to examine how classification performance changes when the analyses for the dependent variable of interest are carried out within these subgroups. Material and Methods: The data set was obtained from the Kaggle database. It consists of 16 variables (15 independent and 1 dependent) and contains data from 948 patients. Analyses were performed in the R programming language (ver. 4.1.2) using the chisquare, wilcox.test, cluster, RWeka and e1071 packages. K-means was used as the clustering method; logistic regression, multilayer perceptron, decision table, support vector machine, random forest, J48, IBk, bagging and voted perceptron were used as classification methods. The data set was evaluated with 10-fold cross-validation, and all analyses were repeated 1,000 times. Accuracy, F-measure, Matthews correlation coefficient, area under the ROC curve and area under the precision-recall curve were used as performance criteria. Results: When the models' predictions for patients who will and will not develop Type 2 diabetes were examined, performance for patients predicted to develop Type 2 diabetes was higher in the first cluster than in the general model. In the second cluster, performance for patients predicted not to develop Type 2 diabetes was likewise higher than in the general model. Conclusion: After the cluster analysis, patient-specific characteristics were identified more clearly, and the models built with cluster-based classification showed better classification performance.
Keywords: Clustering; K-means; classification; machine learning; heterogeneity
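The comparison described in the abstract, performance of the general model versus performance within each cluster, can be illustrated with a small sketch. It is written in Python using only the standard library, not the R packages used in the study, and it covers only three of the listed criteria (accuracy, F-measure and Matthews correlation coefficient). The function names `binary_metrics` and `metrics_by_cluster`, the `positive` parameter and the toy labels below are assumptions made for illustration; none of them come from the study or its data.

```python
from collections import defaultdict
from math import sqrt


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, F-measure and Matthews correlation coefficient (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("at least one observation is required")

    # Confusion-matrix counts with respect to the chosen positive class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)

    # F-measure in count form: 2TP / (2TP + FP + FN); 0 when undefined.
    f_denom = 2 * tp + fp + fn
    f_measure = 2 * tp / f_denom if f_denom else 0.0

    # MCC; set to 0 by convention when a marginal total is zero.
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


def metrics_by_cluster(y_true, y_pred, clusters, positive=1):
    """Compute the same metrics separately inside each cluster (hypothetical helper)."""
    true_by_cluster = defaultdict(list)
    pred_by_cluster = defaultdict(list)
    for t, p, c in zip(y_true, y_pred, clusters):
        true_by_cluster[c].append(t)
        pred_by_cluster[c].append(p)
    return {
        c: binary_metrics(true_by_cluster[c], pred_by_cluster[c], positive)
        for c in true_by_cluster
    }


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the study.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    clusters = [1, 1, 1, 1, 2, 2, 2, 2]

    print("overall:", binary_metrics(y_true, y_pred))
    for cluster, scores in sorted(metrics_by_cluster(y_true, y_pred, clusters).items()):
        print("cluster", cluster, scores)
```

In an actual analysis the cluster labels would come from a clustering step such as K-means, and the predictions from classifiers evaluated with repeated cross-validation; this sketch covers only the final scoring step under those assumptions.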