Objective: The early phase of drug discovery includes a virtual screening step, in which active molecules are detected among a large number of small drug molecules. The number of publicly available datasets of drug molecules is growing exponentially every year thanks to databases such as PubChem and ChEMBL. There is therefore a strong need for automated processes that analyze these datasets and retrieve useful information from them. Machine learning algorithms are often used to predict the activity of small drug compounds because they are faster and comparatively cheaper. Deep neural networks have emerged as a powerful machine learning method that is well suited to large, high-dimensional datasets. Material and Methods: In this study, we trained deep neural network models under different settings to reveal the effects of learning rate, batch size, and minority class weight on network performance. Results: A small learning rate and a large batch size were found to be the most important factors in improving the performance of the deep neural network. The best-performing model yielded 89% accuracy and an area under the curve of 0.78. Conclusion: The findings of this study are promising for the use of deep neural networks in virtual screening of small drug compounds from publicly available databases.
Keywords: Drug molecule; deep learning; classification; databases; virtual screening
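The abstract names three tuned settings (learning rate, batch size, minority class weight) and two reported metrics (accuracy and area under the curve). The following is a minimal sketch of how such a comparison could be set up, not the authors' implementation. The grid values, the function names, and the toy labels and scores are all hypothetical, and the weighted cross-entropy is one common way to apply a minority class weight; the abstract does not say which loss or settings were actually used.

```python
import itertools
import math

# Hypothetical grid values; the abstract does not list the settings that were tried.
LEARNING_RATES = [0.1, 0.01, 0.001]
BATCH_SIZES = [32, 128, 512]
MINORITY_WEIGHTS = [1.0, 5.0, 10.0]


def settings_grid():
    """Enumerate every (learning rate, batch size, minority weight) combination."""
    return [
        {"learning_rate": lr, "batch_size": bs, "minority_weight": w}
        for lr, bs, w in itertools.product(
            LEARNING_RATES, BATCH_SIZES, MINORITY_WEIGHTS
        )
    ]


def weighted_log_loss(y_true, y_prob, minority_weight):
    """Binary cross-entropy in which each active (label 1) sample
    counts minority_weight times as much as an inactive one."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        w = minority_weight if y == 1 else 1.0
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random active is scored above a random inactive
    (ties count as one half). Quadratic in sample count; fine for a sketch."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))


# Toy imbalanced data (hypothetical): nine inactives, one active.
TOY_LABELS = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
TOY_SCORES = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1, 0.3]
```

On the toy data, `accuracy(TOY_LABELS, TOY_SCORES)` is 0.9 even though the single active is never predicted as active, while `roc_auc(TOY_LABELS, TOY_SCORES)` is about 0.83. This illustrates why accuracy and area under the curve are usefully reported together when actives are the minority class, and why a minority class weight is a natural setting to vary.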