Don’t Believe Everything Chatbots Say About Ophthalmic Research

When Hong-Uyen Hua, MD, a fellow at Cleveland Clinic Cole Eye Institute, asked ChatGPT about comparing trials of age-related macular degeneration treatments, it took mere seconds for the artificial intelligence (AI) tool to respond. As quick as a click, ChatGPT curated 10 references, including a 2016 Ophthalmology article co-authored by Daniel F. Martin, MD, Chair of the Cole Eye Institute.

One problem: The article didn’t exist.

“I searched PubMed and Google Scholar and soon realized that ChatGPT had made up the reference,” says Dr. Hua. “It had listed believable citations that were syntactically correct, with titles of well-known journals and names of well-respected authors, but not real.”

Called “hallucinations” in the AI realm, these fabrications are a common hazard of using ChatGPT (GPT stands for generative pre-trained transformer) and other AI tools built on large language models.

The experience — especially the surprising representation of Cole Eye Institute faculty — inspired Dr. Hua; Danny A. Mammo, MD; and colleagues to study the quality of AI-generated abstracts and references in ophthalmology. The resulting work, recently published in JAMA Ophthalmology, showed that AI can produce abstracts that are adequately written, although not necessarily factual. Of all ophthalmology references that the chatbots supposedly used to conceive the abstracts, about 30% did not exist.

“Our study is relevant for all specialties, not just ophthalmology,” says senior author Dr. Mammo, a vitreoretinal disease and uveitis specialist at the Cole Eye Institute. “It raises the specter that, while AI can generate ideas and maybe even references, any medical research content it produces should not be taken as final word. Everything needs to be verified.”

Chatbots answer 7 ophthalmic clinical research questions

In the study, researchers asked ChatGPT version 3.5 (released in November 2022) and ChatGPT version 4.0 (released in March 2023) seven questions about treating ophthalmic conditions:

Do fish oil supplements or other vitamin supplements improve dry eye symptoms?
What is the most effective anti-VEGF injection for wet age-related macular degeneration?
What is the best first-line treatment for glaucoma?
Comparing LASIK and SMILE, which procedure results in the best refractive outcomes?
What is the best first-line treatment for thyroid eye disease?
What are the best treatments to slow myopic progression in children?
How effective are oral corticosteroids compared to intravenous corticosteroids in the treatment of optic neuritis?

For each question, researchers prompted the chatbots to write an abstract with 10 references. Researchers then evaluated the quality of the abstracts on a scale of 1 to 5 according to DISCERN criteria (clear aims, achieving aims, relevance, clear sources, balance and non-bias, reference to uncertainty, and overall rating) and AI-specific criteria (helpfulness, truthfulness and harmlessness).

Out of a possible score of 50, the average quality score of ChatGPT-3.5 abstracts was 35.9. The average score of ChatGPT-4 abstracts was 38.1. Of note, the average truthfulness score was 3.64 (out of 5) for ChatGPT-3.5 abstracts and 3.86 for ChatGPT-4 abstracts.
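The 50-point maximum is consistent with the ten criteria listed above (seven DISCERN items plus three AI-specific items), each rated from 1 to 5. The short Python sketch below illustrates that arithmetic only. The function name and the example ratings are hypothetical and are not taken from the study.

```python
# Criteria as named in the article; the scoring helper itself is an assumption.
DISCERN_CRITERIA = [
    "clear aims", "achieving aims", "relevance", "clear sources",
    "balance and non-bias", "reference to uncertainty", "overall rating",
]
AI_CRITERIA = ["helpfulness", "truthfulness", "harmlessness"]
ALL_CRITERIA = DISCERN_CRITERIA + AI_CRITERIA

# Ten criteria, each rated 1 to 5, gives a maximum of 50.
MAX_SCORE = 5 * len(ALL_CRITERIA)


def total_quality_score(ratings):
    """Sum the 1-to-5 ratings across all ten criteria."""
    for criterion in ALL_CRITERIA:
        if criterion not in ratings:
            raise ValueError(f"missing rating for: {criterion}")
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"rating out of range for: {criterion}")
    return sum(ratings[criterion] for criterion in ALL_CRITERIA)


# Hypothetical ratings for one abstract (illustrative, not study data).
example_ratings = {criterion: 4 for criterion in ALL_CRITERIA}
example_ratings["truthfulness"] = 3

print(total_quality_score(example_ratings), "out of", MAX_SCORE)
```

Averaging such totals across the seven abstracts from each chatbot version would yield a figure comparable to those reported here, assuming the study summed the criteria in this way.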

“We found that ChatGPT generated some abstracts that were not as correct or nuanced as the human-interpreted body of literature,” says Dr. Hua. “For example, ChatGPT was unable to distinguish between steroid and bioequivalent dosing for optic neuritis. In another abstract, ChatGPT-3.5 suggested that the SMILE refractive surgery technique may have better safety and fewer complications than LASIK. However, scientific data show similar safety profiles.”

Hallucination rate as high as 80%

Researchers also checked the legitimacy of each reference listed by the chatbots by searching PubMed and Google Scholar. References that couldn’t be found — an average of 31% of those generated by ChatGPT-3.5 and 29% of those generated by ChatGPT-4 — were considered hallucinations. The rate of hallucination reached as high as 80% for some reference lists.
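As a rough illustration of how a per-list hallucination rate can be computed, the Python sketch below counts the references that could not be located. The function name and the example list are hypothetical. In the study, each reference was checked by searching PubMed and Google Scholar, a step this sketch does not automate.

```python
def hallucination_rate(found_flags):
    """Return the fraction of references that could not be located.

    found_flags holds one boolean per reference: True if the reference
    was found in a literature search, False if it was not.
    """
    if not found_flags:
        raise ValueError("reference list is empty")
    not_found = sum(1 for found in found_flags if not found)
    return not_found / len(found_flags)


# Hypothetical list of 10 references, only 2 of which were found
# (illustrative values, not study data).
reference_found = [True, False, False, False, True,
                   False, False, False, False, False]

rate = hallucination_rate(reference_found)
print(f"{rate:.0%} of references could not be found")
```

With 8 of 10 references unverifiable, this hypothetical list matches the 80% worst case described above.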

“The significant hallucination rate is alarming, especially when you consider that patients may be relying on AI for medical information, and clinicians may be using it to help make medical decisions,” says Dr. Mammo.

AI detectors prove unreliable

Perhaps more alarming is the finding that tools designed to detect AI-written content are unreliable, adds Dr. Mammo.

Researchers entered each abstract into two detectors (GPT-2 Output Detector and Sapling AI Detector) to determine the abstracts’ “fake” scores. A score of 100% signifies the text is likely AI generated.

Since all abstracts in this study were AI generated, all should have been scored 100%. However, ChatGPT-3.5 abstracts were scored an average of 65.4% and 69.5% by the two detectors. ChatGPT-4 abstracts were scored an average of 10.8% and 42.7%.

“Abstracts written by the more recent, more advanced chatbot were less likely to be flagged,” says Dr. Mammo. “This suggests that as AI technology gets better, it will be harder to detect what is created by a human versus what is created by AI.”

What this means for humans

This study has implications for a wide range of audiences: patients searching for medical information, clinicians providing patient care, medical researchers seeking data and references, and journal editors reviewing manuscript submissions.

“AI may provide a decent start for generating abstracts, but it is clearly not infallible,” says Dr. Hua. “Every output should be thoroughly vetted and fact-checked.”