Evaluation of the Performance of Large Language Models (ChatGPT-3.5, ChatGPT-4, Bing and Bard) in Turkish Ophthalmology Chief-Assistant Exams: A Comparative Study

Mehmet CANLEBLEBİCİ; Ali DAL; Murat ERDAĞ

doi:10.5336/ophthal.2024-102632

Türkiye Klinikleri Oftalmoloji Dergisi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Mehmet CANLEBLEBİCİ^a, Ali DAL^b, Murat ERDAĞ^c
^aKayseri State Hospital, Clinic of Ophthalmology, Kayseri, Türkiye
^bUğur Eye Hospital, Clinic of Ophthalmology, Kahramanmaraş, Türkiye
^cVan Training and Research Hospital, Clinic of Ophthalmology, Van, Türkiye

Turkiye Klinikleri J Ophthalmol. 2024;33(3):163-70

Article Language: EN

ABSTRACT
Objective: Recent advances in artificial intelligence, particularly in large language models (LLMs), have prompted their exploration in medical education. This study investigates the proficiency of four LLMs (ChatGPT-3.5, ChatGPT-4.0, Bing, and Bard) in Turkish chief-assistant ophthalmology exams. The aim is to compare their accuracy in answering 200 questions spanning six major ophthalmology topics and to provide insight into their potential applications in medical education. Material and Methods: The questions were in Turkish and were taken from past chief-assistant exams administered by the Ministry of National Education, obtained from the internet. In October 2023, all 200 questions were presented individually to each LLM: ChatGPT-3.5, ChatGPT-4.0, Bing, and Bard. The questions were divided into six topic groups: Retina and Vitreous (Group 1); Cornea, Cataract and Anterior Segment (Group 2); Glaucoma (Group 3); Pediatric Ophthalmology, Genetics and Clinical Refraction (Group 4); Adnexa, Uvea and Oculoplastic (Group 5); and Neuro-Ophthalmology and Strabismus (Group 6). The primary outcome measure was response accuracy, reported overall and for each of the six groups. The accuracy and reliability of the responses were analyzed statistically with Pearson's chi-square test. Results: ChatGPT-4.0 was the most accurate LLM, with a correct response rate of 77.5%, followed by Bing at 63.0%. ChatGPT-3.5 and Bard were less accurate, at 51% and 45.5%, respectively. In the subgroup analyses, ChatGPT-4.0 outperformed the other models in all six topic groups. Conclusion: Despite these promising results, accuracy remains a challenge, and LLMs need continual improvement, particularly for use in clinical applications and education.

Keywords: Artificial intelligence; education and training; large language models; ophthalmology
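
The overall accuracy figures in the abstract can be turned back into counts and compared with a Pearson chi-square test. The sketch below is only an illustration under stated assumptions, not the authors' analysis. It assumes each percentage refers to all 200 questions, so the correct-answer counts are reconstructed by arithmetic rather than taken from the paper's tables, and it assumes a single 4 x 2 table of models by correct/incorrect. The function names are hypothetical, and the p-value uses the closed form that holds only for 3 degrees of freedom.

```python
import math

N_QUESTIONS = 200  # questions presented to each model (from the abstract)

# Overall accuracy reported in the abstract, in percent.
accuracy_pct = {"ChatGPT-4.0": 77.5, "Bing": 63.0, "ChatGPT-3.5": 51.0, "Bard": 45.5}


def correct_counts(pct_by_model, n_questions):
    """Reconstruct correct-answer counts from percentages (an assumption)."""
    return {m: round(n_questions * p / 100) for m, p in pct_by_model.items()}


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with exactly 3 d.o.f."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


counts = correct_counts(accuracy_pct, N_QUESTIONS)
# One row per model: [correct, incorrect].
table = [[c, N_QUESTIONS - c] for c in counts.values()]
chi2, dof = pearson_chi_square(table)
p_value = chi2_sf_3dof(chi2)  # valid here because dof == 3

print(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
```

Under these assumptions the reconstructed counts are 155, 126, 102 and 91 correct answers out of 200, and the test indicates that accuracy differs across the four models. Pairwise or per-group comparisons would need the per-group counts, which the abstract does not give.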
