Posted February 10, 2023
With the advent of ChatGPT, a large language model developed by OpenAI, there have been growing conversations about advancements in artificial intelligence (AI) programs and their intersection with medicine and medical education. Several studies have been conducted on the use of AI to answer multiple-choice test questions on medical knowledge. Some conversations about these studies seem to suggest that AI tools can correctly answer USMLE test questions, and we wanted to provide some additional context.
A review of the MedQA-USMLE database revealed that one study used test preparation materials from a third party unaffiliated with USMLE. Another study examined ChatGPT's performance on practice questions available at USMLE.org. It's not surprising that ChatGPT answered many of these questions correctly, as the input material is largely representative of medical knowledge available from online sources.
However, it's important to note that the practice questions used in these studies are not representative of the full depth and breadth of USMLE exam content as experienced by examinees. For example, certain question types were not included in the studies, such as those using pictures, heart sounds, and computer-based clinical skill simulations. As a result, other critical test constructs are not fully represented in these studies.
Although there is insufficient evidence to support current claims that AI can pass the USMLE Step exams, we would not be surprised to see AI models improve their performance dramatically as the technology evolves. If used appropriately, these tools could have a positive impact on how assessments are built and how students learn.
The USMLE co-sponsors (NBME and Federation of State Medical Boards) recognize the importance of these studies and their findings. In the future, we would be very interested in examining the questions that ChatGPT answered incorrectly and the implications of those results. As the technology advances, we will continue to look for ways to enhance the assessment of skills and behaviors so that we may evolve in tandem with medical education and potential changes to the practice of medicine. While we are optimistic, we remain mindful of the risks that large language models bring, including the potential to spread misinformation and perpetuate harmful biases.