Are AI Chatbots in Healthcare Ethical?

— Their use must require informed consent and independent review

MedicalToday
A photo of a man typing a question into the ChatGPT interface on a laptop

Within a week of its Nov. 30, 2022 release by OpenAI, ChatGPT was the most widely used and influential artificial intelligence (AI) chatbot in history with registered users. Like other chatbots built on large language models, ChatGPT is capable of accepting natural language text inputs and producing novel text responses based on probabilistic analyses of enormous bodies or corpora of pre-existing text. ChatGPT has been praised for producing particularly articulate and detailed text in many domains and formats, including not only casual conversation, but also expository essays, fiction, song, poetry, and computer programming languages. ChatGPT has displayed enough domain knowledge to narrowly miss for accountants, to earn C+ grades on and B- grades on , and to pass parts of the U.S. Medical Licensing Exams. It has been listed as a co-author on scientific publications.

At the same time, like other large language model chatbots, ChatGPT regularly makes misleading or flagrantly false statements with great confidence (sometimes referred to as "AI hallucinations"). Despite significant improvements over earlier models, it has at times of algorithmic racial, gender, and religious bias. Additionally, data entered into ChatGPT is explicitly stored by OpenAI and used in training, threatening user privacy. In my experience, I've asked ChatGPT to evaluate hypothetical clinical cases and found that it can generate reasonable but inexpert differential diagnoses, diagnostic workups, and treatment plans. Its responses are comparable to those of a well-read and overly confident medical student with poor recognition of important clinical details.

This suddenly widespread use of large language model chatbots has brought new urgency to questions of artificial intelligence ethics in education, law, cybersecurity, journalism, politics -- and, of course, healthcare.

As a case study on ethics, let's examine the results of a from the free peer-to-peer therapy platform Koko. The program used the same GPT-3 large language model that powers ChatGPT to generate therapeutic comments for users experiencing psychological distress. Users on the platform who wished to send supportive comments to other users had the option of sending AI-generated comments rather than formulating their own messages. Koko's co-founder Rob Morris reported: "Messages composed by AI (and supervised by humans) were rated significantly higher than those written by humans on their own," and "Response times went down 50%, to well under a minute." However, the experiment was quickly discontinued because "once people learned the messages were co-created by a machine, it didn't work." Koko has made ambiguous and conflicting statements about whether users understood that they were receiving AI-generated therapeutic messages but has consistently reported that there was or review by an independent institutional review board.

ChatGPT and Koko's therapeutic messages raise an urgent question for clinicians and clinical researchers: Can large language models be used in standard medical care or should they be restricted to clinical research settings?

In terms of the benefits, ChatGPT and its large language model cousins might offer guidance to clinicians and even participate directly in some forms of healthcare screening and psychotherapeutic treatment, potentially increasing access to specialist expertise, reducing error rates, lowering costs, and improving outcomes for patients. On the other hand, they entail currently unknown and potentially large risks of false information and algorithmic bias. Depending on their configuration, they can also be enormously invasive to their users' privacy. These risks may be especially harmful to vulnerable individuals with medical or psychiatric illness.

As researchers and clinicians begin to explore the potential use of large language model artificial intelligence in healthcare, applying principals of clinical research will be key. As most readers will know, clinical research is work with human participants that is intended primarily to develop generalizable knowledge about health, disease, or its treatment. Determining whether and how artificial intelligence chatbots can safely and effectively participate in clinical care would prima facie appear to fit perfectly within this category of clinical research. Unlike standard medical care, clinical research can involve deviations from the standard of care and additional risks to participants that are not necessary for their treatment but are vital for generating new generalizable knowledge about their illness or treatments. Because of this flexibility, additional ethical (and -- for federally funded research -- legal) requirements that do not apply to standard medical care but are necessary to protect research participants from exploitation. In addition to informed consent, clinical research is subject to independent review by knowledgeable individuals not affiliated with the research effort -- usually an institutional review board. Both clinical researchers and independent reviewers are responsible for ensuring the proposed research has a favorable risk-benefit ratio, with potential benefits for society and participants that outweigh the risks to participants, and minimization of risks to participants wherever possible. These informed consent and independent review processes -- while imperfect -- are enormously important to protect the safety of vulnerable patient populations.

There is another newer and evolving category of clinical work known as quality improvement or quality assurance, which uses data-driven methods to improve healthcare delivery. Some tests of artificial intelligence chatbots in clinical care might be considered quality improvement. Should these projects be subjected to informed consent and independent review? The NIH lays out a for determining whether such efforts should be subjected to the added protections of clinical research. Among these, two key questions are whether techniques deviate from standard practice, and whether the test increases the risk to participants. For now, it is clear that use of large language model chatbots is both a deviation from standard practice and introduces novel uncertain risks to participants. It is possible that in the near future, as AI hallucinations and algorithmic bias are reduced and as AI chatbots are more widely adopted, that their use may no longer require the protections of clinical research. At present, informed consent and institutional review remain critical to the safe and ethical use of large language model chatbots in clinical practice.

is a neurologist at Yale School of Medicine and the Yale Comprehensive Epilepsy Center, and the inaugural director of the Yale New Haven Health System Center for Clinical Ethics.