ChatGPT Not So Great at Solving Complex Cardiology Cases

— But the tool performed well on cardiovascular trivia and answering patient questions

MedicalToday

ChatGPT was good at answering low-complexity cardiology questions, but came up short on more complicated cases, researchers found.

When analyzing a series of case vignettes of cardiology-related questions that patients had for their primary care providers, the generative artificial intelligence (AI) tool answered correctly 90% of the time. But with a more complex set of vignettes involving questions that primary care doctors referred to their cardiology colleagues, the technology answered correctly only half of the time, reported Ralf Harskamp, MD, PhD, and Lukas de Clercq, MSc, of Amsterdam UMC in the Netherlands.

ChatGPT also performed reasonably well when answering cardiology trivia questions, scoring 74% overall, the researchers reported in a preprint posted to medRxiv.

"Specialized medical information probably comprises a very small part of the training data, and yet that implicit knowledge is somewhere in this giant stochastic space of the model. So we were surprised to see that it performed rather well," said de Clercq, a PhD candidate.

"We did not give it any context for these questions," he added. "It might perform even better if you provide it with the current guidelines and then ask questions regarding those guidelines."

David Asch, MD, MBA, senior vice dean for strategic initiatives at the University of Pennsylvania Perelman School of Medicine in Philadelphia, who was not involved in the study, said it's important to evaluate ChatGPT's performance in all situations -- from answering medical trivia questions, to more complex tasks such as assessing patient questions and complicated clinical scenarios.

"We have high standards for accuracy in medicine," said Asch, who recently published a conversation with ChatGPT in NEJM Catalyst. "If you're using an AI program to direct patients to hospital parking, a little bit of inaccuracy is not life-threatening. It may be frustrating, but there are no fatalities."

Now that ChatGPT has shown off its general medical capabilities by passing the U.S. Medical Licensing Examination (USMLE) several times, researchers have started to conduct more specialty-specific assessments, including in hepatology and now cardiology.

Harskamp and de Clercq first put its cardiology knowledge to the test by posing 50 cardiovascular trivia questions that were based on quizzes for medical professionals. While its overall accuracy was 74%, accuracy varied across domains, from 80% for each of three categories -- coronary artery disease, pulmonary and venous thrombotic embolism, and heart failure -- to 70% for atrial fibrillation and 60% for cardiovascular risk management.
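
For readers who want to check how the per-domain figures relate to the 74% overall score, the arithmetic can be sketched as follows. The article does not say how many questions fell into each domain; the sketch assumes the 50 questions were split evenly across the five domains (10 each), which is an assumption made here for illustration, not a detail reported by the researchers.

```python
# Per-domain accuracy (percent correct) as reported in the article.
domain_accuracy = {
    "coronary artery disease": 80,
    "pulmonary and venous thrombotic embolism": 80,
    "heart failure": 80,
    "atrial fibrillation": 70,
    "cardiovascular risk management": 60,
}

TOTAL_QUESTIONS = 50  # reported in the article

# ASSUMPTION (not stated in the article): questions are split evenly by domain.
questions_per_domain = TOTAL_QUESTIONS // len(domain_accuracy)

# Convert each percentage into a count of correct answers.
correct_per_domain = {
    domain: pct * questions_per_domain // 100
    for domain, pct in domain_accuracy.items()
}

total_correct = sum(correct_per_domain.values())
overall_pct = 100 * total_correct / TOTAL_QUESTIONS

print(f"{total_correct}/{TOTAL_QUESTIONS} correct = {overall_pct:.0f}%")
```

Under that even-split assumption, the domain figures add up to the reported overall accuracy.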

They then posed a total of 20 clinical vignettes to the program, comparing its accuracy with expert clinical opinion.

Of the 10 scenarios where patients posed questions to their primary care doctors -- about whether their symptoms should be a reason for concern, medication use, and behavioral changes such as diet, for instance -- ChatGPT's advice was in line with actual clinical advice and care in nine of the cases.

One major inconsistency occurred when the chatbot advised using thrombolytic agents as secondary preventive medications in a patient with a prior myocardial infarction (MI). "While there is a place for thrombolytics in the acute phase [in settings where primary percutaneous coronary intervention is not available], these medications have no place in chronic management of patients post-MI," the researchers wrote.

As for the 10 questions that primary care doctors sent out for expert consultation, five of ChatGPT's answers fully matched the advice provided by experts. Two answers were partial matches, one was inconclusive, and two were incorrect.
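
The vignette results can be tallied in a few lines to show where the 90% and 50% figures come from. This is a minimal sketch using only the counts reported in the article; the category labels and the `share` helper are names chosen here for illustration.

```python
from collections import Counter

# Counts as reported in the article for the two sets of 10 vignettes.
patient_vignettes = Counter({"in line with clinical advice": 9, "inconsistent": 1})
consult_vignettes = Counter({
    "full match": 5,
    "partial match": 2,
    "inconclusive": 1,
    "incorrect": 2,
})

def share(counts: Counter, label: str) -> float:
    """Return the percentage of vignettes in `counts` that carry `label`."""
    return 100 * counts[label] / sum(counts.values())

patient_pct = share(patient_vignettes, "in line with clinical advice")
consult_pct = share(consult_vignettes, "full match")
total_vignettes = sum(patient_vignettes.values()) + sum(consult_vignettes.values())

print(f"Patient questions answered correctly: {patient_pct:.0f}%")
print(f"Expert-consultation questions fully matched: {consult_pct:.0f}%")
print(f"Total vignettes: {total_vignettes}")
```

Note that the 50% figure counts only full matches; counting the two partial matches as well would give a different share, so the choice of category matters when comparing the two sets.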

The two questions it got completely wrong were: "Should I use NT-proBNP [N-terminal pro-B-type natriuretic peptide] to monitor chronic heart failure or rely on symptoms and weight gain only?" and "My patient is taking spironolactone and developed bilateral gynecomastia, is there an alternative drug available without this side effect?"

For the former, it recommended using NT-proBNP for monitoring heart failure, which is not recommended by any European guidelines. For the latter, it said thiazide diuretics, angiotensin-converting enzyme inhibitors, angiotensin receptor blockers, and beta blockers could be used, when the correct answer is eplerenone (Inspra), a selective mineralocorticoid receptor antagonist with less anti-androgen activity.

"This suggests that ChatGPT was not sufficiently trained with this type of data, and may not have fully comprehended the nuanced context of the medical questions that we provided it with," Harskamp and de Clercq wrote.

"Hallucinations" are a major limitation of generative AI technologies at this time, de Clercq noted. They occur, he said, because in training, the technology "rarely encounters places in which the author of the text will state that they don't know something."

"The model rarely expresses doubt or uncertainty because it's rarely found in written text," he added. "This is probably the biggest hindrance of using a model such as this. As of yet, there is no real way to express uncertainty."

Given that limitation, de Clercq believes medical applications of the technology "should not be done unless we achieve such a level of distinction between, 'This is what the model puts out in terms of, language-wise, what would probabilistically follow these words,' and, 'This is something that the model is certain of because of these specific facts.'"

Asch agreed that ChatGPT and generative AI technologies need to be deployed safely and judiciously in healthcare at this time.

"We can get better at offloading work from clinicians to AI in spaces that are still relatively safe," he said. "I personally think that some of the greatest opportunities for this kind of work right now are more on the administrative end."

In terms of patient interactions, the technology can gradually become more helpful and reduce physician workload, Asch said. "Eventually it can give advice on managing hypertension or diabetes. And then, it builds to, how do you manage immunosuppression for your solid organ transplant? You eventually move up the chain to things that are more complicated."

"I think everyone wants to get there; I don't think anybody wants to get there recklessly," he added. "When you graduate from medical school, we don't let you suddenly do heart surgery. You go through residency in which you gradually take on more responsibility in a supervised setting. And even afterwards, there's constant evaluation of heart surgeons. Why should it be any different for a technology that's being released into the system?"

    Kristina Fiore leads MedPage’s enterprise & investigative reporting team. She’s been a medical journalist for more than a decade and her work has been recognized by Barlett & Steele, AHCJ, SABEW, and others. Send story tips to k.fiore@medpagetoday.com.

Disclosures

Harskamp and de Clercq reported no conflicts of interest.

Asch reported being a partner of VAL Health, LLC.

Primary Source

medRxiv

Harskamp RE, de Clercq L "Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2)" medRxiv 2023; DOI: 10.1101/2023.03.25.23285475.