February 27, 2026/Behavioral Health/Research

When More Isn’t Better: ChatGPT’s Readability Gap in Opioid Use Disorder Education

New study highlights the need for plain-language prompting and human oversight in addiction care communication

As generative artificial intelligence becomes embedded in clinical and patient-facing workflows, questions about accuracy are increasingly joined by concerns about readability and tone. A new comparative analysis of ChatGPT-generated responses and U.S. health organization frequently asked questions (FAQs) on opioid use disorder (OUD) provides timely data on how large language models perform when tasked with patient education in a stigmatized and literacy-sensitive domain.

OUD affects an estimated 16 million people worldwide and has contributed to more than 1.2 million deaths globally between 2014 and 2023, including more than 500,000 opioid-involved overdose deaths in the United States alone. Against this backdrop, accessible and non-stigmatizing communication has become an essential component of treatment, explains Cleveland Clinic psychiatrist Akhil Anand, MD, who coauthored the study.

“When addressing a disorder that has claimed more than a million lives globally in less than a decade, how we communicate is central to care,” he says. “Patients with OUD are often navigating shame, misinformation and ambivalence about treatment. If the information they encounter is overly complex or subtly stigmatizing, we risk reinforcing barriers that can directly influence whether someone seeks treatment.”

Key findings

The study, recently published in The American Journal on Addictions, evaluated 50 OUD-related FAQs drawn from U.S. federal and state public health agencies, academic medical centers and national professional societies. Each question was entered into ChatGPT, and responses were compared with the original organizational FAQ answers. Outcomes included structural measures (word and sentence counts), linguistic complexity (lexical density, syllables and characters per word), six standard readability indices and frequency of stigmatizing terms using the National Institute on Drug Abuse “Words Matter” framework.
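The structural and linguistic measures described above can be computed with simple text heuristics. The sketch below is illustrative only: the tokenization and vowel-group syllable counter are crude approximations, not the study's actual instruments.

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.I)

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels, with a floor of one syllable.
    return max(1, len(VOWEL_GROUPS.findall(word)))

def text_measures(text: str) -> dict:
    """Word/sentence counts plus per-word length measures of the kind
    reported in the study (approximate operationalizations)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "chars_per_word": chars / len(words),
        "syllables_per_word": syllables / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }
```

Comparing these measures across paired ChatGPT and FAQ answers is essentially what the study's structural analysis did at scale.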

The differences were striking, says Dr. Anand, an addiction specialist at Lutheran Hospital.

ChatGPT responses were substantially longer, with a mean word count of 253.7 compared with 76.6 for organizational FAQs—a mean difference of 177 words (95% CI, 151–203). Sentence counts nearly doubled (18.2 vs. 9.0; mean difference 9.2). Lexical density was higher by 6.5 percentage points (95% CI, 4.0–9.0), and ChatGPT used longer words, with greater characters and syllables per word. Although words per sentence were only modestly higher, the cumulative effect was increased syntactic and informational load.

Readability indices pointed in the same direction across the board. Compared with organizational FAQs, ChatGPT responses scored higher (indicating more difficult reading levels) on the Coleman–Liau Index (+3.43), Gunning Fog (+3.47), SMOG (+2.96), Flesch–Kincaid Grade Level (+3.61), and Automated Readability Index (+4.33). Flesch Reading Ease scores were lower by 20.4 points. All differences were statistically significant (p < .05). Notably, both sources exceeded the recommended sixth- to eighth-grade reading level for patient materials, but ChatGPT deviated further from established health literacy targets.
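For reference, the two Flesch formulas cited here are standard and can be computed directly from word, sentence and syllable counts:

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Standard Flesch Reading Ease formula; higher scores mean easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Standard Flesch-Kincaid Grade Level; higher scores mean harder text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

A passage of 100 words in 5 sentences with 150 syllables, for example, scores about 59.6 on Reading Ease and roughly a grade 9.9 level, already above the sixth- to eighth-grade target for patient materials.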

By contrast, stigmatizing language was infrequent in both groups and did not differ significantly. Sentences containing terms flagged by the National Institute on Drug Abuse list occurred in 9.6% of ChatGPT responses versus 6.0% of organizational FAQs (difference 3.57 percentage points; p = .16). The study team emphasized that automated screening was supplemented with human review, underscoring the limits of purely computational approaches to stigma detection.
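Automated stigma screening of this kind can be approximated with a word-boundary match against a flagged-term list. The terms and suggested replacements below are a small illustrative subset chosen for demonstration, not the study's actual NIDA list, which is one reason human review remains essential.

```python
import re

# Illustrative subset only; the study used the full NIDA "Words Matter"
# guidance plus human review. Terms and replacements here are assumptions.
FLAGGED_TERMS = {
    "addict": "person with a substance use disorder",
    "abuser": "person who uses drugs",
    "habit": "substance use disorder",
}

def flag_sentences(text: str) -> list[str]:
    """Return sentences containing any flagged term.
    Word boundaries keep 'addict' from matching inside 'addiction'."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, FLAGGED_TERMS)) + r")\b", re.I
    )
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if pattern.search(s)]
```

Even with careful boundary matching, a keyword screen cannot judge context or tone, which is why the study team paired it with human review.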

Addressing literacy

For physicians, the key takeaway is not that ChatGPT produces problematic content per se, but that its default language may be misaligned with the literacy needs of many patients with OUD.

“Clinicians often assume that more information is better – but in OUD care, cognitive load matters,” Dr. Anand says. “When responses triple in length and jump by three or four grade levels, you risk losing the very patients you’re trying to engage.”

He notes that while ChatGPT’s answers were more comprehensive, they also reflected a more academic, written style — higher lexical density and longer words — that may challenge patients with limited health literacy.

“The model appears to err on the side of completeness and nuance,” he notes. “That’s admirable from a medical standpoint, but it doesn’t necessarily translate into clarity for a patient in crisis.”

Dr. Anand emphasizes that the findings also raise equity concerns: health literacy is unevenly distributed and closely tied to social determinants of health, digital access and educational opportunity. He notes that default outputs that exceed recommended reading levels may disproportionately disadvantage patients with limited literacy, older adults and those with chronic conditions — populations already overrepresented in OUD morbidity and mortality statistics.

Importantly, the study did not evaluate factual accuracy, empathic tone or language consistent with motivational interviewing — factors that are central to addiction care. Nor did it assess how patients interpret or act on chatbot-generated information. The analysis represents a snapshot of a single model version at a single time point, and large language models are evolving rapidly.

Still, the results quantify a trade-off that many clinicians have intuited: scalability and comprehensiveness may come at the cost of readability.

“Large language models can simplify text when explicitly prompted,” Dr. Anand observes. “But this study shows that if you use them ‘out of the box,’ you may get content that’s technically sound yet overly complex.”
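A minimal sketch of what explicit plain-language prompting might look like. The template wording, function name and parameters are illustrative assumptions, not a prompt validated by the study.

```python
def plain_language_prompt(question: str, grade_level: int = 6) -> str:
    """Wrap a patient question in explicit plain-language instructions.
    Hypothetical template for illustration; not drawn from the study."""
    return (
        f"Answer the following question for a patient. "
        f"Use plain language at roughly a grade-{grade_level} reading level, "
        f"short sentences, and person-first, non-stigmatizing terms. "
        f"Keep the answer under 120 words.\n\n"
        f"Question: {question}"
    )
```

In practice, such a wrapper would sit in front of whatever model the clinic uses, with the generated answer still checked against readability targets and reviewed by a human before it reaches patients.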

Looking ahead

For addiction medicine in particular, Dr. Anand says the study’s implications are clear.

“Communication is not neutral; it shapes trust, stigma, and willingness to seek treatment,” he explains. “Although we found no significant increase in stigmatizing terminology, increased complexity alone may constitute a barrier to care.”

As generative AI continues to permeate clinical practice, Dr. Anand notes that physicians will need to evaluate not only whether a model is accurate, but whether it is accessible.

The researchers ultimately support a hybrid approach that leverages AI for scalability and draft generation, but anchors patient education in human judgment, health literacy standards and person-first language.

“In OUD care, where engagement can be fragile and stakes are high, plain language is not a stylistic preference – it is a clinical intervention,” Dr. Anand concludes. “And for now at least, the art of clear communication in addiction care remains a distinctly human responsibility.”
