An artificial intelligence (AI) chatbot largely outperformed a panel of specialist ophthalmologists when given prompts about glaucoma and retinal health, a comparative single-center study found.
The ChatGPT chatbot powered by GPT-4 scored better than the panelists on measures of diagnostic and treatment accuracy when it analyzed 20 real-life cases and considered 20 possible patient questions, reported Andy S. Huang, MD, of the Icahn School of Medicine at Mount Sinai in New York City, and colleagues in JAMA Ophthalmology.
Huang said he had expected the chatbot to do worse, "but there's no place where people did better." AI obviously can't do surgery, he said, but its ability to answer questions and evaluate cases does raise "the question of whether this is a real threat to optometrists and ophthalmologists."
The findings also provide more evidence that chatbots are getting better at offering reliable guidance regarding eye health. When researchers gave retinal health questions to a chatbot in January 2023, it bungled almost all the answers and even offered harmful advice. But the responses improved 2 weeks later as the chatbot evolved, and a similar study reported high levels of accuracy. Another study found that a chatbot's responses to eye health questions from an online forum were about as accurate as those written by ophthalmologists.
The study by Huang's team is one of many that researchers have launched in recent months to gauge the accuracy of a type of AI program known as a large language model (LLM), which analyzes vast arrays of text to learn how likely words are to occur next to each other.
Huang said the new study was inspired by his own experiences experimenting with a chatbot: "I slowly realized that it was doing a better job than I was in a lot of tasks, and I started using it as an adjunct to improve my diagnoses," he said.
The findings are "eye opening," he said, adding that he doesn't think ophthalmologists should turn in their eye charts and let AI robots take over. "Right now we're hoping to use it as an adjunct, such as in places where there's a significant number of complex patients or a high volume of patients," Huang said. AI could also help primary care physicians triage patients with eye problems, he said.
Moving ahead, "it's very important for ophthalmologists to understand how powerful these large language models are for fact-checking yourself and significantly improving your workflow," Huang said. "This tool has been tremendously helpful for me with triaging or just improving my thoughts and diagnostic abilities."
In an accompanying commentary, Benjamin K. Young, MD, MS, of Casey Eye Institute of Oregon Health & Science University in Portland, and Peter Y. Zhao, MD, of New England Eye Center of Tufts University School of Medicine in Boston, said the study "presents proof of concept that patients can copy the summarized history, examination, and clinical data from their own notes and ask version 4 to produce its own assessment and plan to cross-check their physician's knowledge and judgment."
Young and Zhao added that "medical errors will potentially be caught in this way," and that "at this time, LLMs should be considered a potentially fast and useful tool to enhance the knowledge of a clinician who has examined a patient and synthesized their active clinical situation." (The duo were co-authors of the previously mentioned January 2023 chatbot study.)
For the new study, the chatbot was told that an ophthalmologist was directing it to assist with "medical management and answering questions and cases." The chatbot replied that it understood its job was to provide "concise, accurate, and precise medical information in the manner of an ophthalmologist."
The chatbot analyzed extensive details from 20 real patients from Icahn School of Medicine at Mount Sinai-affiliated clinics (10 glaucoma cases and 10 retinal cases) and developed treatment plans. The chatbot also considered 20 questions randomly derived from the American Academy of Ophthalmology's list of commonly asked questions.
The researchers then asked 12 fellowship-trained retinal and glaucoma specialists and three senior trainees (ages 31 to 67 years) from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine to respond to the same prompts. Panelists then rated every response except their own, in a blinded fashion, on scales of accuracy (1-10) and medical completeness (1-6).
For glaucoma, the combined question-case mean ranks for accuracy were 506.2 for the chatbot and 403.4 for the glaucoma specialists (n=831, Mann-Whitney U=27,976.5, P<0.001). The mean ranks for completeness were 528.3 and 398.7, respectively (n=828, Mann-Whitney U=25,218.5, P<0.001).
For retina-related questions, the mean ranks for accuracy were 235.3 for the chatbot and 216.1 for the retina specialists (n=440, Mann-Whitney U=15,518.0, P=0.17). The mean ranks for completeness were 258.3 and 208.7, respectively (n=439, Mann-Whitney U=13,123.5, P=0.005).
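For readers unfamiliar with the statistics quoted above, the following is a minimal Python sketch of how mean ranks and a Mann-Whitney U statistic are derived from two groups of ratings. It is not the study's analysis code, and the ratings are invented for illustration. The function names `average_ranks` and `mann_whitney` are hypothetical, and reporting the smaller of the two U values is an assumed convention; the article does not say which U the authors report.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney(group_a, group_b):
    """Return the mean rank of each group and the smaller U statistic."""
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)  # ranks are assigned over the pooled ratings
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    return {
        "mean_rank_a": rank_sum_a / n_a,
        "mean_rank_b": rank_sum_b / n_b,
        "u": min(u_a, u_b),  # assumed convention: report the smaller U
    }


# Invented accuracy ratings on a 1-10 scale (NOT data from the study).
chatbot_ratings = [9, 8, 10, 7, 9]
specialist_ratings = [7, 6, 8, 9, 5, 7]

result = mann_whitney(chatbot_ratings, specialist_ratings)
print(result["mean_rank_a"])  # 7.9
print(result["u"])            # 5.5
```

A higher mean rank means that group's ratings tended to sit higher in the pooled ordering, which is how the chatbot's 506.2 versus the specialists' 403.4 should be read. Turning U into a P value takes a further step (an exact distribution or a normal approximation) that this sketch leaves out.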
The results showed that "both trainees and specialists rated the chatbot's accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z=3.23; P=0.007) and completeness (z=5.86; P<0.001)," wrote Huang and co-authors.
Limitations included that the single-center, cross-sectional study evaluated LLM proficiency only at a single time point and among one group of attendings and trainees. In addition, the investigators cautioned that the "findings, while promising, should not be interpreted as endorsing direct clinical application due to chatbots' unclear limitations in complex decision-making, alongside necessary ethical, regulatory, and validation considerations not covered in this report."
Disclosures
The study was funded by the Manhattan Eye and Ear Ophthalmology Alumni Foundation and Research to Prevent Blindness.
Huang reported grants from the Manhattan Eye and Ear Ophthalmology Alumni Foundation, as did a co-author, who also reported a financial relationship with Twenty Twenty and grants from the National Eye Institute, the Glaucoma Foundation, and Research to Prevent Blindness.
Young and Zhao had no disclosures.
Primary Source
JAMA Ophthalmology
Huang AS, et al "Assessment of a large language model's responses to questions and cases about glaucoma and retina management" JAMA Ophthalmol 2024; DOI: 10.1001/jamaophthalmol.2023.6917.
Secondary Source
JAMA Ophthalmology
Young BK, Zhao PY "Large language models and the shoreline of ophthalmology" JAMA Ophthalmol 2024; DOI: 10.1001/jamaophthalmol.2023.6937.