Performance Comparison of Some Imputation Methods Used in Missing Value(s) Analysis: A Simulation Study

Ahmet Kadir ARSLAN^a, Zeynep TUNÇ^a, Emek GÜLDOĞAN^a, Cemil ÇOLAK^a
^aİnönü University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, TURKEY

Turkiye Klinikleri J Biostat. 2019;11(1):15-23

doi: 10.5336/biostatic.2018-62788

Article Language: EN

ABSTRACT
Objective: Missing values in a dataset are undesirable, and researchers must find ways to handle them. The main purpose of this study is to develop new, user-friendly, web-based software that applies several techniques for handling missing value(s).
Material and Methods: To assess the performance of the software, several simulation scenarios were tested: five normally distributed variables; different sample sizes (n = 1000, 1500, 2000 and 2500); high (r < -0.70 or r > 0.70) and low (-0.30 < r < 0.30) correlations among the variables; and different proportions of missing values in the variables (5%, 10% and 20%). The missing values were imputed with the developed web software and the results were compared, so that the performance of the software under different conditions could be evaluated. Shiny, an open-source R package, was used to develop the web tool. The software imputes missing values with linear regression (LR), random forest (RF), classification and regression trees (CART) and predictive mean matching (PMM). To obtain more unbiased and reliable results, the "number of repetitions" and "number of multiple imputations" options of the software were used. The normalized root mean squared error (NRMSE) was used to assess the performance of the imputation techniques. The developed web-based application can be accessed free of charge at http://biostatapps.inonu.edu.tr/KDAY/.
Results: According to the outputs of the developed web-based application, the LR and PMM models gave better results for imputing missing values in datasets with high correlation. In datasets with low correlation, the four methods showed similar imputation performance.
Conclusion: For the datasets used in this study, when the correlation between the variables is high, the best imputation performance is obtained with the LR and PMM models, regardless of the size of the dataset and the percentage of missing values.

Keywords: Imputation methods; missing value(s) analysis; Shiny; simulation; web-based software
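
The simulation design summarized in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' software (which is built in R with Shiny); it is a hypothetical Python illustration under several assumptions: the five normal variables share one common pairwise correlation `rho`, values are removed completely at random from a single variable only, the imputation is a plain deterministic linear regression on the other four variables (no multiple imputation or repetitions), and NRMSE is normalized by the standard deviation of the true values, which is one common convention; the abstract does not state which normalization was used. All function names are made up for this example.

```python
import math
import random


def simulate(n, rho, rng):
    """Draw n rows of 5 standard-normal variables with pairwise correlation rho.

    One-factor construction: x_j = sqrt(rho) * z + sqrt(1 - rho) * e_j.
    """
    root_shared, root_own = math.sqrt(rho), math.sqrt(1.0 - rho)
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        data.append([root_shared * z + root_own * rng.gauss(0.0, 1.0)
                     for _ in range(5)])
    return data


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))) / m[r][r]
    return x


def fit_ols(xs, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + row for row in xs]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def nrmse(true_vals, imputed_vals):
    """RMSE of the imputed entries divided by the SD of the true entries (assumed convention)."""
    n = len(true_vals)
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    mean_t = sum(true_vals) / n
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_t)


def run_scenario(n, rho, missing_rate, seed=0):
    """Simulate data, hide part of variable 0, impute it by regression, return NRMSE."""
    rng = random.Random(seed)
    data = simulate(n, rho, rng)
    # Rows whose first variable is treated as missing (completely at random).
    missing = set(rng.sample(range(n), int(missing_rate * n)))
    observed = [i for i in range(n) if i not in missing]
    beta = fit_ols([data[i][1:] for i in observed], [data[i][0] for i in observed])
    held_out = sorted(missing)
    imputed = [beta[0] + sum(b * x for b, x in zip(beta[1:], data[i][1:]))
               for i in held_out]
    return nrmse([data[i][0] for i in held_out], imputed)


high = run_scenario(n=1000, rho=0.8, missing_rate=0.10)
low = run_scenario(n=1000, rho=0.1, missing_rate=0.10)
print(f"NRMSE, high correlation: {high:.3f}")
print(f"NRMSE, low correlation:  {low:.3f}")
```

Under this normalization, an NRMSE near 1 means the imputation does little better than filling in the variable's mean, while values closer to 0 indicate that the other variables carry real information about the missing entries. Varying `n`, `rho` and `missing_rate` over the grid listed in the Material and Methods section gives a rough analogue of the scenarios described there.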
