MedicalToday

AI-Generated Case Reports Indistinguishable From Those Written by Humans

– Study compared AI tools with human authors in dermatology case reports


New study findings suggest that some artificial intelligence (AI) tools are capable of producing dermatology case reports that are essentially indistinguishable from those created by humans.

Reviewers were able to accurately detect an AI-generated case report only 35% of the time. AI-derived text had statistically higher scores for overall quality/readability compared with the original manuscripts. The results, investigators said, also underscore ethical concerns about ensuring that AI-generated content is reliable and appropriately identified.

Commonly available AI detection tools were accurate only 50% of the time in identifying AI-generated content. Reviewers who correctly identified AI-generated manuscripts cited three key features in their decisions: inaccurate references, poorly worded exam characteristics, and imprecise description of pathophysiology.

Charles Dunn, MD, a medical resident and dermatologist-researcher at Kansas City University in Missouri, served as the first author of the research letter, which appeared in the Journal of the American Academy of Dermatology. Dunn's exchange with the Reading Room has been edited for length and clarity.

What knowledge gap were you attempting to address?

Dunn: To answer that question, I think it's important to understand generative AI. Generative AI, at this point, has all but invaded society across multiple fields. Ever since ChatGPT became publicly available in November 2022 and rapidly demonstrated this remarkable capacity to both synthesize and communicate information in a way that sounds intelligent, people have discovered many uses for it.

Even right at the beginning, these kinds of tools superficially invaded the medical literature. By the time it was about two months old, ChatGPT had contributed to several published medical manuscripts.

Still, there has been and continues to be this profound skepticism within the medical community about AI's capabilities to meaningfully or even independently contribute to more advanced forms of medical literature. By that I mean things like case reports, systematic reviews, meta-analyses, and clinical practice guidelines. These are pretty nuanced and complex manuscripts. It's that skepticism that underlies the thought process behind this project.

How did the study compare AI- and human-generated case reports?

Dunn: We created four dermatology case reports. Two were authored by humans, and the other two we created using patient information we fed into ChatGPT.

We then used 20 blinded medical reviewers with a range of experience levels, and we had them do three things: score each report using a validated grading rubric we provided; guess which manuscripts were created by humans and which by AI; and rate their pre- and post-study confidence in their ability to detect AI-generated content.

What were your key findings? The study noted a few caveats to the results.

Dunn: The results were fascinating: as noted, reviewers were accurate only 35% of the time. Notably, the more experience the reviewer had, the more accurate they were in identifying AI-generated content. So it seems like reviewer experience had a role in this.

The second really interesting result was that AI detection tools were only accurate 50% of the time. Not all tools in this vein are created equal.

Another result I thought was interesting was that the post-study reviewer confidence significantly declined. Reviewers were very confident that they'd be able to identify AI at the beginning of the experiment, and then by the end they were not confident.

What are the implications of this study -- particularly for those involved in developing, reviewing, and/or publishing case reports?

Dunn: I actually think these findings have implications for anyone consuming medical literature. This was a dermatology-specific case report study, but I don't think it's that big of a leap to extrapolate these results to all fields of medicine.

For authors, I'd say this study supports the notion that ChatGPT, and others like it, could be powerful, helpful tools that improve the medical writing process. However, ChatGPT cannot replace a detailed literature review or a fundamental understanding of the subject matter being written about. If you mindlessly use this tool for nuanced manuscripts like case reports, you can actually harm the medical literature by propagating inaccurate information. At the same time, ChatGPT can drastically improve your logical sequencing, writing conventions, and the quality of what you produce.

For reviewers, I think this study provides valuable data about identified weaknesses. If reviewers in this experiment had taken the time to look at the references at the bottom of the manuscript, they might have found a falsified reference. So when we're reviewing medical literature, we need to trust but verify. You need to look at the source of the information you're receiving to ensure that it's accurate.

For editorial teams, there is so much to unpack here. The first thing our study shows and quantifies is that not all AI detection tools are equivalent. So if you use an AI detection tool in your editorial process, make sure it's a vetted tool that reliably detects AI-generated output.

What are the ethical considerations to be mindful of in these results?

Dunn: I think this dives into the ethics of medical writing. It falls to authors to know the limitations of these tools and to ensure that, when we use AI to create content, the content is reliable, accurate, and transparently labeled.

Dunn disclosed no relevant financial relationships with industry.

Primary Source

Journal of the American Academy of Dermatology

Source Reference:

AAD Publications Corner