Objective: The main goal in the early phase of drug discovery is to identify small drug molecules that show activity against a specific receptor. For this purpose, small drug molecules are classified as active or inactive through high-throughput screening (HTS) experiments. The datasets obtained from these experiments are uploaded to the PubChem database, which contains more than one million bioassays obtained through HTS experiments. As an alternative to HTS experiments, classification models can be developed using the datasets in the PubChem database. Material and Methods: In this study, we obtained 5 datasets with different degrees of class imbalance from the PubChem database. We trained deep neural networks (DNN) on these datasets to classify small drug molecules as active or inactive. The test set performance of the DNN models was compared with that of the support vector machine (SVM) and random forest (RF) algorithms. Results: The DNN achieved better balanced accuracy (minimum-maximum: 0.764-0.865), recall (minimum-maximum: 0.630-0.823), F1-score (minimum-maximum: 0.496-0.843) and Matthews correlation coefficient (minimum-maximum: 0.439-0.721) than the SVM and RF. Conclusion: Our results show that the DNN is a well-performing machine learning algorithm that can be used in the early phase of drug discovery studies, since it performs better than traditional machine learning algorithms when the class structure is imbalanced.
Keywords: Deep learning; imbalanced classes; support vector machines; random forest; virtual screening
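The four metrics reported in the Results are all derived from the binary confusion matrix, and they are preferred over plain accuracy when actives are rare. The following is a minimal sketch of how they can be computed, not the code used in the study; the function name `imbalance_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    Actives are treated as the positive class. The function name and
    argument order are illustrative assumptions, not from the study.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Balanced accuracy averages the per-class recalls, so the majority
    # (inactive) class cannot dominate the score.
    balanced_accuracy = (recall + specificity) / 2

    # F1 is the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "balanced_accuracy": balanced_accuracy,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 100 actives, 900 inactives.
example = imbalance_metrics(tp=80, fp=50, tn=850, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, plain accuracy would be 0.93 even though 20% of the actives are missed, which is why balanced accuracy, recall, F1-score and the Matthews correlation coefficient are the more informative comparison on imbalanced bioassay data.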