The Effect of Missing Data Mechanisms on Deep Learning in Binary Classification: A Simulation Study

Ebru ÖZTÜRK^a, Yağmur ZENGİN^a, Merve KAŞIKCI^a, Erdal COŞGUN^b
^aDepartment of Biostatistics, Hacettepe University Faculty of Medicine, Ankara, Türkiye
^bGenomics Team, Microsoft Research, WA, USA

Turkiye Klinikleri J Biostat. 2023;15(1):1-12

doi: 10.5336/biostatic.2022-90804

ABSTRACT
Objective: Real-world data are complex and often contain missing observations, so the effects of missing data, and the methods for overcoming the problems that missingness causes in statistical models, are a significant research topic. The statistical approaches used for missing data are complete case analysis and missing data imputation. Handling missing data requires evaluating the missing data mechanism and pattern, but the mechanism is not easy to identify in relatively large data sets. Owing to computational advances, deep learning algorithms have recently been widely used for classification, regression, and clustering tasks in large data sets. The objective of this study is to present the effect of missing data mechanisms on the performance of a deep learning algorithm in binary classification problems.

Material and Method: An extensive simulation study was conducted on a virtual machine on Microsoft Azure, considering the proportion of missing data, the correlation structure, and the missing data mechanism in a large data set. For each missing data mechanism, the performance of deep learning after list-wise deletion and after imputation was compared with its performance on the original data set.

Results: The proportion and the mechanism of missingness affected the performance of deep learning only slightly, whereas the correlation level of the data had a comparatively larger effect.

Conclusion: Although the area under the curve values differed slightly, deep learning algorithms can overcome the problems caused by missingness in large data sets.

Keywords: Missing data; missing data imputation; missing data mechanism; deep learning
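
The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes the standard three-way taxonomy of missing data mechanisms (MCAR, MAR, MNAR), which the abstract does not name explicitly, and it uses mean imputation only as a simple stand-in for whatever imputation method the study applied. All function names and parameter values (`simulate`, `make_missing`, `rate`, `rho`) are hypothetical, and no deep learning model is fitted here.

```python
import math
import random


def simulate(n=2000, rho=0.5, seed=0):
    """Draw n pairs (x1, x2) of standard normals with correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2])
    return rows


def make_missing(rows, mechanism, rate=0.2, seed=1):
    """Return a copy of rows in which x2 is set to None under a mechanism."""
    rng = random.Random(seed)
    out = []
    for x1, x2 in rows:
        if mechanism == "MCAR":    # probability ignores the data entirely
            p = rate
        elif mechanism == "MAR":   # probability depends on the observed x1
            p = min(1.0, 2 * rate) if x1 > 0 else 0.0
        elif mechanism == "MNAR":  # probability depends on the hidden x2 itself
            p = min(1.0, 2 * rate) if x2 > 0 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append([x1, None if rng.random() < p else x2])
    return out


def listwise_delete(rows):
    """Complete case analysis: keep only rows with no missing value."""
    return [r for r in rows if None not in r]


def mean_impute(rows):
    """Replace each missing x2 with the mean of the observed x2 values."""
    observed = [r[1] for r in rows if r[1] is not None]
    mean = sum(observed) / len(observed)
    return [[r[0], mean if r[1] is None else r[1]] for r in rows]


if __name__ == "__main__":
    data = simulate()
    for mech in ("MCAR", "MAR", "MNAR"):
        kept = listwise_delete(make_missing(data, mech))
        mean_x2 = sum(r[1] for r in kept) / len(kept)
        print(mech, len(kept), round(mean_x2, 3))
```

Running the script prints, for each mechanism, the number of complete cases and the mean of `x2` among them. Because the true mean of `x2` is zero, a value far from zero indicates bias introduced by list-wise deletion, which is the kind of distortion a comparison against the original data set is meant to expose.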
