Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot


Article Information

Title: Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot

Authors: İskender Aksoy, Merve Kara Arslan

Journal: Pakistan Journal of Medical Sciences

HEC Recognition History

Category   From         To
Y          2024-10-01   2025-12-31
X          2023-07-01   2024-09-30
X          2022-07-01   2023-06-30
X          2021-07-01   2022-06-30
X          2020-07-01   2021-06-30
W          2009-07-01   2020-06-30
Z          2006-11-07   2009-07-01
Y          1900-01-01   2005-06-30

Publisher: Professional Medical Publications

Country: Pakistan

Year: 2025

Volume: 41

Issue: 4

Language: en

DOI: 10.12669/pjms.41.4.11178

Keywords: Artificial Intelligence, ChatGPT, Medical Education, Emergency Medicine, Gemini, Copilot

Abstract

Objective: The use of artificial intelligence tools built on different software architectures for clinical and educational purposes in medicine has recently attracted considerable interest. In this study, we compared the answers given by three different artificial intelligence chatbots to an emergency medicine question pool drawn from the Turkish National Medical Specialization Exam. We also investigated how question content, form, and wording affected the answers by classifying the questions and examining the question sentences.
Methods: The emergency medicine questions from the Medical Specialization Exams administered between 2015 and 2020 were recorded. The questions were posed to three artificial intelligence models: ChatGPT-4, Gemini, and Copilot. The length of each question, the question type, and the topics of the incorrectly answered questions were recorded.
Results: The most successful chatbot in terms of total score was Microsoft Copilot (7.8% error rate), while the least successful was Google Gemini (22.9% error rate) (p<0.001). Notably, all chatbots had their highest error rates on questions about trauma and surgical approaches, and all made mistakes on burns and pediatrics questions. The increased error rates on questions containing the root "probability" also showed that question style affected the answers given.
Conclusions: Although chatbots show promising success in identifying the correct answer, we think that examinees should not treat them as a primary source for exam preparation, but rather as a useful auxiliary tool to support their learning process.
doi: https://doi.org/10.12669/pjms.41.4.11178
How to cite this: Aksoy I, Arslan MK. Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot. Pak J Med Sci. 2025;41(4):968-972. doi: https://doi.org/10.12669/pjms.41.4.11178
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
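To make the reported comparison concrete, here is a minimal Python sketch of how per-model error rates can be compared with a chi-square test of homogeneity. The question-pool size, the per-model error counts, and the choice of test are illustrative assumptions; the abstract reports only the percentages and p<0.001, not the underlying data.

```python
# Hypothetical sketch: comparing per-model error rates with a chi-square test.
# The pool size and error counts below are illustrative assumptions, not the
# study's actual data (the abstract gives only percentages and p<0.001).
from scipy.stats import chi2_contingency

n_questions = 96                                          # assumed pool size
errors = {"ChatGPT-4": 12, "Gemini": 22, "Copilot": 7}    # assumed error counts

# Build a models x (wrong, correct) contingency table.
table = [[e, n_questions - e] for e in errors.values()]

chi2, p, dof, _ = chi2_contingency(table)

for model, e in errors.items():
    print(f"{model}: error rate {e / n_questions:.1%}")
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

With these assumed counts the printed rates land near the reported 7.8% and 22.9%, but reproducing the study's p-value would require its actual data.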

