Explainable Machine Learning Methods for Person-Based Prediction in Simulated and Real Datasets: Methodological Research

İrem KAR; Batuhan BAKIRARAR

doi:10.5336/biostatic.2023-99050

Türkiye Klinikleri Biyoistatistik Dergisi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

İrem KAR^a, Batuhan BAKIRARAR^a
^aDepartment of Biostatistics, Ankara University Faculty of Medicine, Ankara, Türkiye

Turkiye Klinikleri J Biostat. 2023;15(3):171-81

Article Language: EN

ABSTRACT
Objective: The aim of this study is to build person-based prediction models for simulated and real datasets separately using the SHapley Additive exPlanations (SHAP) method, and to determine whether the resulting person-based models are more valid and applicable than overall models.

Material and Methods: The simulated datasets comprised 13 independent variables and 1 dependent variable, with sample sizes of 250, 500, and 1,000; the real dataset contained 826 patient records with 11 variables. The "bindata", "shapper", and "RWeka" packages in the R programming language (version 4.1.2) were used. Extreme Gradient Boosting, Bagging, Random Forest, Support Vector Machine, and Logistic Regression were used as classification methods. The assessment used 10-fold cross-validation, repeated 1,000 times.

Results: The accuracy of the overall model in the datasets with 250, 500, and 1,000 samples was 0.856, 0.886, and 0.891, respectively. In these datasets, the person-based accuracy values were 0.886, 0.964, and 0.962 for individuals with a "yes" prediction, and 0.930, 0.961, and 0.961 for individuals with a "no" prediction, respectively. In the real dataset, the accuracy of the overall model was 0.736, whereas the person-based accuracy was 0.783 for the patient predicted to have a stroke and 0.868 for the patient predicted not to have a stroke.

Conclusion: Person-based predictions yielded higher accuracy than the overall model in all datasets. Given the heterogeneity among individuals in real life, this difference should not be overlooked. When this diversity is taken into account, person-based modeling is expected to produce a more realistic and clinically applicable model.

Keywords: Prediction; person-based prediction models; SHapley Additive exPlanations
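
As an illustration of what a person-based SHAP explanation computes, the following minimal sketch derives exact Shapley values for one individual by enumerating all feature coalitions. It is not the authors' code: the study used R with the "shapper" package, whereas this sketch uses plain Python. The logistic model, its coefficients, the three binary features, and the background rows are all hypothetical and chosen only to keep the example small.

```python
from itertools import combinations
from math import exp, factorial


def predict(x):
    """Hypothetical logistic model; coefficients are illustrative only."""
    z = -1.0 + 0.8 * x[0] + 1.2 * x[1] - 0.5 * x[2]
    return 1.0 / (1.0 + exp(-z))


def coalition_value(x, background, subset):
    """Mean prediction when features in `subset` take the person's values
    and the remaining features take the values of each background row."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += predict(mixed)
    return total / len(background)


def shapley_values(x, background):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                gain = (coalition_value(x, background, s | {i})
                        - coalition_value(x, background, s))
                phi[i] += weight * gain
    return phi


# Hypothetical binary data: four background individuals and one person to explain
background = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
person = [1, 1, 0]

base_value = coalition_value(person, background, set())  # average prediction
phi = shapley_values(person, background)

print("base value:", round(base_value, 4))
print("contributions:", [round(p, 4) for p in phi])
print("prediction:", round(predict(person), 4))
```

The base value plus the sum of the contributions equals the model's prediction for that person, which is the additive property that gives the method its name. Exact enumeration grows exponentially with the number of features, so it is only practical for small illustrations like this one.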
