Evaluation of Traditional Machine Learning and Mixed Effect Machine Learning Model Performances by Simulation Study

Ebru TURGAL

doi:10.5336/biostatic.2022-91651

Turkiye Klinikleri Journal of Biostatistics

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Ebru TURGAL^a
^aDepartment of Biostatistics, Ankara University Faculty of Medicine, Ankara, Türkiye

Turkiye Klinikleri J Biostat. 2022;14(3):147-57

Article Language: EN

ABSTRACT
Objective: This study examines the effectiveness of the linear mixed-effects (LME) model, one of the traditional models used in the classification of clustered data, and of mixed-effects machine learning models, which are more recent approaches. Material and Methods: For the simulation, data sets were generated with different numbers of groups (250, 500, 1000) and different sample sizes (5000, 10000, 15000). The LME model, the mixed-effects random forest (MERF) model, and the Gaussian process boosting (GPBoost) model were compared in terms of root mean square error (RMSE) on two functions, one linear and one nonlinear. Results: For the linear function, when the error variance (EV) was 4, the sample size was small, and the number of groups was high, the RMSE of the MERF model was smaller. In all other scenarios, the RMSE of the linear model was smaller. For the nonlinear function, when EV was 1, the sample size was small, and the number of groups was high, the RMSE of MERF was smaller. In all other scenarios with EV of 1, the RMSE of the GPBoost model was smaller, p0.05 for MERF. For the nonlinear function, when EV was 4 and both the sample size and the number of groups were high, the RMSE of MERF was smaller. In all other scenarios with EV of 4, the RMSE of the GPBoost model was smaller. Conclusion: For a nonlinear function, GPBoost performed better than the MERF and LME methods in terms of RMSE and time. When a linear function is considered, however, LME gives a better result.

Keywords: Grouped data; machine learning; random effects; simulation
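
The simulation grid described in the abstract (number of groups, sample size, error variance, linear or nonlinear function, RMSE as the criterion) can be sketched as below. This is only an illustration under stated assumptions, not the study's code: the data-generating process (a single random intercept per group with variance 1), the two example functions, the 80/20 split, and all function names are hypothetical. It does not fit LME, MERF, or GPBoost; it only contrasts a predictor that ignores the grouping with one that adds a per-group mean residual, to show how RMSE is compared across scenarios.

```python
import math
import random


def simulate_clustered(n_obs, n_groups, error_var, nonlinear, seed=0):
    """Generate rows (group, x, f(x), y) for clustered data.

    Assumed model (not taken from the paper):
    y = f(x) + b_group + e, with b_group ~ N(0, 1) and e ~ N(0, error_var).
    """
    rng = random.Random(seed)
    intercepts = [rng.gauss(0.0, 1.0) for _ in range(n_groups)]
    noise_sd = math.sqrt(error_var)
    rows = []
    for i in range(n_obs):
        g = i % n_groups  # spread observations evenly over groups
        x = rng.uniform(-1.0, 1.0)
        # Hypothetical fixed-effect functions, chosen only for illustration.
        fx = math.sin(3.0 * x) if nonlinear else 2.0 * x
        y = fx + intercepts[g] + rng.gauss(0.0, noise_sd)
        rows.append((g, x, fx, y))
    return rows


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def group_mean_residuals(rows, n_groups):
    """Estimate each group's intercept as the mean of y - f(x) in that group."""
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for g, _x, fx, y in rows:
        sums[g] += y - fx
        counts[g] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def run_scenario(n_obs, n_groups, error_var, nonlinear, seed=1):
    """Return (RMSE ignoring groups, RMSE with group intercepts) on held-out rows."""
    rows = simulate_clustered(n_obs, n_groups, error_var, nonlinear, seed)
    cut = int(0.8 * n_obs)  # first 80% for estimation, last 20% for testing
    train, test = rows[:cut], rows[cut:]
    b_hat = group_mean_residuals(train, n_groups)
    y_true = [y for _g, _x, _fx, y in test]
    # Both predictors use the true f(x); only the group term differs.
    pred_ignore = [fx for _g, _x, fx, _y in test]
    pred_group = [fx + b_hat[g] for g, _x, fx, _y in test]
    return rmse(y_true, pred_ignore), rmse(y_true, pred_group)


if __name__ == "__main__":
    # Grid taken from the abstract: groups, sample sizes, error variances.
    for n_groups in (250, 500, 1000):
        for n_obs in (5000, 10000, 15000):
            for error_var in (1.0, 4.0):
                r_ignore, r_group = run_scenario(n_obs, n_groups, error_var, True)
                print(n_groups, n_obs, error_var,
                      round(r_ignore, 3), round(r_group, 3))
```

In this sketch the benefit of modelling the group term shrinks when there are few observations per group and the error variance is large, because each group intercept is then estimated from only a handful of noisy residuals.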
