When Hong-Uyen Hua, MD, a fellow at Cleveland Clinic Cole Eye Institute, asked ChatGPT about comparing trials of age-related macular degeneration treatments, it took mere seconds for the artificial intelligence (AI) tool to respond. As quick as a click, ChatGPT curated 10 references, including a 2016 Ophthalmology article co-authored by Daniel F. Martin, MD, Chair of the Cole Eye Institute.
One problem: The article didn’t exist.
“I searched PubMed and Google Scholar and soon realized that ChatGPT had made up the reference,” says Dr. Hua. “It had listed believable citations that were syntactically correct, with titles of well-known journals and names of well-respected authors, but not real.”
Called “hallucinations” in the AI realm, these fabrications are a common hazard of using ChatGPT (the “GPT” stands for generative pre-trained transformer) and other AI tools built on large language models.
The experience — especially the surprising representation of Cole Eye Institute faculty — inspired Dr. Hua; Danny A. Mammo, MD; and colleagues to study the quality of AI-generated abstracts and references in ophthalmology. The resulting work, recently published in JAMA Ophthalmology, showed that AI can produce abstracts that are adequately written, although not necessarily factual. Of all ophthalmology references that the chatbots supposedly used to conceive the abstracts, about 30% did not exist.
“Our study is relevant for all specialties, not just ophthalmology,” says senior author Dr. Mammo, a vitreoretinal disease and uveitis specialist at the Cole Eye Institute. “It raises the specter that, while AI can generate ideas and maybe even references, any medical research content it produces should not be taken as final word. Everything needs to be verified.”
In the study, researchers asked ChatGPT version 3.5 (released in November 2022) and ChatGPT version 4.0 (released in March 2023) seven questions about treating ophthalmic conditions.
For each question, researchers prompted the chatbots to write an abstract with 10 references. Researchers then evaluated the quality of the abstracts on a scale of 1 to 5 according to DISCERN criteria (clear aims, achieving aims, relevance, clear sources, balance and non-bias, reference to uncertainty, and overall rating) and AI-specific criteria (helpfulness, truthfulness and harmlessness).
Out of a possible score of 50 (10 criteria, each rated from 1 to 5), the average quality score of ChatGPT-3.5 abstracts was 35.9. The average score of ChatGPT-4 abstracts was 38.1. Of note, the average truthfulness score was 3.64 (out of 5) for ChatGPT-3.5 abstracts and 3.86 for ChatGPT-4 abstracts.
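For readers who want to see how a 50-point total arises, the scoring can be sketched in a few lines of Python. This is only an illustration: the criterion names are taken from the description above, but the function name, the data layout and the sample ratings are assumptions, not the researchers' actual tooling or data.

```python
# Illustrative sketch only. Criterion names come from the article;
# everything else (function name, data layout, ratings) is hypothetical.
DISCERN_CRITERIA = [
    "clear aims",
    "achieving aims",
    "relevance",
    "clear sources",
    "balance and non-bias",
    "reference to uncertainty",
    "overall rating",
]
AI_CRITERIA = ["helpfulness", "truthfulness", "harmlessness"]
ALL_CRITERIA = DISCERN_CRITERIA + AI_CRITERIA

MIN_RATING = 1
MAX_RATING = 5

# 10 criteria x 5 points each = 50
MAX_TOTAL = len(ALL_CRITERIA) * MAX_RATING


def total_score(ratings):
    """Sum one abstract's ratings across all 10 criteria."""
    missing = [c for c in ALL_CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for criterion in ALL_CRITERIA:
        value = ratings[criterion]
        if not MIN_RATING <= value <= MAX_RATING:
            raise ValueError(f"{criterion!r} must be rated 1 to 5, got {value}")
    return sum(ratings[c] for c in ALL_CRITERIA)


# Made-up ratings for a single abstract (not study data).
sample_ratings = {criterion: 4 for criterion in ALL_CRITERIA}
sample_total = total_score(sample_ratings)
```

Averaging such totals over all abstracts from one chatbot would give a figure comparable to the 35.9 and 38.1 reported above.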
“We found that ChatGPT generated some abstracts that were not as correct or nuanced as the human-interpreted body of literature,” says Dr. Hua. “For example, ChatGPT was unable to distinguish between steroid and bioequivalent dosing for optic neuritis. In another abstract, ChatGPT-3.5 suggested that the SMILE refractive surgery technique may have better safety and fewer complications than LASIK. However, scientific data show similar safety profiles.”
Researchers also checked the legitimacy of each reference listed by the chatbots by searching PubMed and Google Scholar. References that couldn’t be found — an average of 31% of those generated by ChatGPT-3.5 and 29% of those generated by ChatGPT-4 — were considered hallucinations. The rate of hallucination reached as high as 80% for some reference lists.
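The hallucination rate described here is simply the share of listed references that could not be found. A minimal Python sketch follows, assuming each reference has already been checked by hand and recorded as found or not found. The function name and the sample list are hypothetical; the sample only mirrors the worst case mentioned above (8 of 10 references missing) and is not taken from the study's data.

```python
def hallucination_rate(found_flags):
    """Return the fraction of references that could not be verified.

    found_flags: one boolean per reference, True if the reference was
    located in a literature search, False if it was not.
    """
    if not found_flags:
        raise ValueError("reference list is empty")
    not_found = sum(1 for found in found_flags if not found)
    return not_found / len(found_flags)


# Hypothetical 10-item reference list: 2 verified, 8 not found.
example_list = [True, True] + [False] * 8
example_rate = hallucination_rate(example_list)  # 8 / 10
```

A per-chatbot average would then be the mean of these rates across all of that chatbot's reference lists.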
“The significant hallucination rate is alarming, especially when you consider that patients may be relying on AI for medical information, and clinicians may be using it to help make medical decisions,” says Dr. Mammo.
Perhaps more alarming is the finding that tools designed to detect AI-written content are unreliable, adds Dr. Mammo.
Researchers entered each abstract into two detectors (GPT-2 Output Detector and Sapling AI Detector) to determine the abstracts’ “fake” scores. A score of 100% signifies the text is likely AI generated.
Since all abstracts in this study were AI generated, all should have been scored 100%. However, ChatGPT-3.5 abstracts were scored an average of 65.4% and 69.5% by the two detectors. ChatGPT-4 abstracts were scored an average of 10.8% and 42.7%.
“Abstracts written by the more recent, more advanced chatbot were less likely to be flagged,” says Dr. Mammo. “This suggests that as AI technology gets better, it will be harder to detect what is created by a human versus what is created by AI.”
This study has implications for a wide range of audiences: patients searching for medical information, clinicians providing patient care, medical researchers seeking data and references, and journal editors reviewing manuscript submissions.
“AI may provide a decent start for generating abstracts, but it is clearly not infallible,” says Dr. Hua. “Every output should be thoroughly vetted and fact-checked.”