The Effect of Cluster Size for Model Performance in High-Dimensional Longitudinal Studies: A Simulation Study

Merve TÜRKEGÜN ŞENGÜL^a, Bahar TAŞDELEN^b, Saim YOLOĞLU^c
^aDepartment of Biostatistics and Medical Informatics, Alaaddin Keykubat University Faculty of Medicine, Antalya, Türkiye
^bDepartment of Biostatistics and Medical Informatics, Mersin University Faculty of Medicine, Mersin, Türkiye
^cDepartment of Biostatistics and Medical Informatics, İnönü University Faculty of Medicine, Malatya, Türkiye

Turkiye Klinikleri J Biostat. 2023;15(3):161-70

doi: 10.5336/biostatic.2023-98699

Article Language: EN

ABSTRACT
Objective: In high-dimensional longitudinal studies, risk models are built with penalized methods to prevent model estimation errors and deviations. The aim of this study is to examine the effect of small cluster size on the performance of generalized estimating equations (GEE) and penalized GEE (PGEE) models in high-dimensional longitudinal data. Material and Methods: A simulation study was designed to compare the model performance, Type I error rates, and power of GEE and PGEE in two-period longitudinal data structures with different cluster sizes (n=20, 30, 50, 100, 200), different numbers of predictors (p=10, 20, 50), and different correlation levels between predictors (r=0.20, 0.50, 0.80). Results: With insufficient cluster sizes and high correlations between predictors, the GEE coefficient estimates were misleading and inconsistent, the Type I error rates were high, and the power of the test was low. Even when the number of predictors and the cluster size were in balance (p=10; n=100, 200), the Type I error rates of GEE were high; increasing the cluster size was not enough to reduce them. The PGEE produced better results than GEE under all conditions, and its power exceeded 80% in all scenarios. Conclusion: The PGEE yielded more consistent results by controlling the relationships both within the cluster and between the predictors. In high-dimensional longitudinal studies, PGEE was observed to be more effective than GEE.

Keywords: Generalized estimating equations; penalized generalized estimating equations; model selection; penalized methods; high-dimensional longitudinal data
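
The simulation design described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' code: the design factors (n, p, r, two periods) come from the abstract, while the shared-factor construction of correlated predictors, the continuous outcome, the random cluster intercept, and the names `scenario_grid`, `equicorrelated_row`, and `simulate_dataset` are assumptions introduced here. Model fitting with GEE or PGEE is deliberately left out.

```python
import itertools
import math
import random

# Design factors as stated in the abstract.
CLUSTER_SIZES = [20, 30, 50, 100, 200]   # n
N_PREDICTORS = [10, 20, 50]              # p
PREDICTOR_CORRS = [0.20, 0.50, 0.80]     # r
N_PERIODS = 2                            # two-period longitudinal design


def scenario_grid():
    """All (n, p, r) combinations of the design factors."""
    return list(itertools.product(CLUSTER_SIZES, N_PREDICTORS, PREDICTOR_CORRS))


def equicorrelated_row(p, r, rng):
    """One row of p standard-normal predictors with pairwise correlation r.

    Assumption: an exchangeable structure built from one shared factor;
    the abstract does not say how the correlation was induced.
    """
    shared = rng.gauss(0.0, 1.0)
    return [math.sqrt(r) * shared + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
            for _ in range(p)]


def simulate_dataset(n, p, r, beta, cluster_sd=1.0, seed=0):
    """Return rows (cluster_id, period, x, y) for n clusters and 2 periods.

    Assumptions (hypothetical): a continuous outcome, a random cluster
    intercept that creates within-cluster correlation, unit-variance noise.
    """
    rng = random.Random(seed)
    rows = []
    for cluster_id in range(n):
        cluster_effect = rng.gauss(0.0, cluster_sd)
        for period in range(N_PERIODS):
            x = equicorrelated_row(p, r, rng)
            linear_part = sum(b_j * x_j for b_j, x_j in zip(beta, x))
            y = linear_part + cluster_effect + rng.gauss(0.0, 1.0)
            rows.append((cluster_id, period, x, y))
    return rows
```

Under these assumptions, each of the 45 scenarios would be simulated many times, with both models fitted to every replicate; Type I error would then be estimated from the predictors whose true coefficient is zero and power from those whose true coefficient is nonzero.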
