<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="research-article"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIR Cardio</journal-id><journal-id journal-id-type="publisher-id">cardio</journal-id><journal-id journal-id-type="index">26</journal-id><journal-title>JMIR Cardio</journal-title><abbrev-journal-title>JMIR Cardio</abbrev-journal-title><issn pub-type="epub">2561-1011</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v9i1e68817</article-id><article-id pub-id-type="doi">10.2196/68817</article-id><article-categories><subj-group subj-group-type="heading"><subject>Original Paper</subject></subj-group></article-categories><title-group><article-title>Improving the Readability of Institutional Heart Failure&#x2013;Related Patient Education Materials Using GPT-4: Observational Study</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>King</surname><given-names>Ryan C</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Samaan</surname><given-names>Jamil S</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff2">2</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Haquang</surname><given-names>Joseph</given-names></name><degrees>DO</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Bharani</surname><given-names>Vishnu</given-names></name><degrees>MD</degrees><xref ref-type="aff" 
rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Margolis</surname><given-names>Samuel</given-names></name><degrees>BS</degrees><xref ref-type="aff" rid="aff3">3</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Srinivasan</surname><given-names>Nitin</given-names></name><degrees>BA</degrees><xref ref-type="aff" rid="aff4">4</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Peng</surname><given-names>Yuxin</given-names></name><degrees>BS</degrees><xref ref-type="aff" rid="aff5">5</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Yeo</surname><given-names>Yee Hui</given-names></name><degrees>MD, MSc</degrees><xref ref-type="aff" rid="aff2">2</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Ghashghaei</surname><given-names>Roxana</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib></contrib-group><aff id="aff1"><institution>Department of Medicine, Division of Cardiology, University of California, Irvine Medical Center</institution><addr-line>101 The City Dr S</addr-line><addr-line>Orange</addr-line><addr-line>CA</addr-line><country>United States</country></aff><aff id="aff2"><institution>Department of Medicine, Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center</institution><addr-line>Los Angeles</addr-line><addr-line>CA</addr-line><country>United States</country></aff><aff id="aff3"><institution>David Geffen School of Medicine, University of California, Los Angeles</institution><addr-line>Los Angeles</addr-line><addr-line>CA</addr-line><country>United States</country></aff><aff id="aff4"><institution>Keck School of Medicine, University of Southern California</institution><addr-line>Los Angeles</addr-line><addr-line>CA</addr-line><country>United States</country></aff><aff id="aff5"><institution>School of Mathematics and 
Statistics, Xi'an Jiaotong University</institution><addr-line>Xi'an</addr-line><country>China</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Rivers</surname><given-names>John</given-names></name></contrib></contrib-group><contrib-group><contrib contrib-type="reviewer"><name name-style="western"><surname>Rouhi</surname><given-names>Armaun D</given-names></name></contrib><contrib contrib-type="reviewer"><name name-style="western"><surname>Nomali</surname><given-names>Mahin</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Ryan C King, MD, Department of Medicine, Division of Cardiology, University of California, Irvine Medical Center, 101 The City Dr S, Orange, CA, 92868, United States, 1 714-456-7890; <email>ryan.king2517@gmail.com</email></corresp></author-notes><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>8</day><month>7</month><year>2025</year></pub-date><volume>9</volume><elocation-id>e68817</elocation-id><history><date date-type="received"><day>15</day><month>11</month><year>2024</year></date><date date-type="rev-recd"><day>05</day><month>06</month><year>2025</year></date><date date-type="accepted"><day>08</day><month>06</month><year>2025</year></date></history><copyright-statement>&#x00A9; Ryan C King, Jamil S Samaan, Joseph Haquang, Vishnu Bharani, Samuel Margolis, Nitin Srinivasan, Yuxin Peng, Yee Hui Yeo, Roxana Ghashghaei. Originally published in JMIR Cardio (<ext-link ext-link-type="uri" xlink:href="https://cardio.jmir.org">https://cardio.jmir.org</ext-link>), 8.7.2025. 
</copyright-statement><copyright-year>2025</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cardio, is properly cited. The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://cardio.jmir.org">https://cardio.jmir.org</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://cardio.jmir.org/2025/1/e68817"/><abstract><sec><title>Background</title><p>Heart failure management involves comprehensive lifestyle modifications such as daily weights, fluid and sodium restriction, and blood pressure monitoring, placing additional responsibility on patients and caregivers, with successful adherence often requiring extensive counseling and understandable patient education materials (PEMs). Prior research has shown PEMs related to cardiovascular disease often exceed the American Medical Association&#x2019;s fifth- to sixth-grade recommended reading level. 
The large language model (LLM) ChatGPT may be a useful tool for improving PEM readability.</p></sec><sec><title>Objective</title><p>We aim to assess the readability of heart failure&#x2013;related PEMs from prominent cardiology institutions and evaluate GPT-4&#x2019;s ability to improve these metrics while maintaining accuracy and comprehensiveness.</p></sec><sec sec-type="methods"><title>Methods</title><p>A total of 143 heart failure&#x2013;related PEMs were collected from the websites of the top 10 institutions listed on the 2022&#x2010;2023 US News &#x0026; World Report for &#x201C;Best Hospitals for Cardiology, Heart &#x0026; Vascular Surgery.&#x201D; PEMs were individually entered into GPT-4 (version updated July 20, 2023), preceded by the prompt, &#x201C;Please explain the following in simpler terms.&#x201D; Readability was assessed using the Flesch Reading Ease score, Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index, Coleman-Liau Index, Simple Measure of Gobbledygook Index, and Automated Readability Index. The accuracy and comprehensiveness of revised GPT-4 PEMs were assessed by a board-certified cardiologist.</p></sec><sec sec-type="results"><title>Results</title><p>For 143 institutional heart failure&#x2013;related PEMs analyzed, the median FKGL was 10.3 (IQR 7.9-13.1; high school sophomore) compared to 7.3 (IQR 6.1-8.5; seventh grade) for GPT-4&#x2019;s revised PEMs (<italic>P</italic>&#x003C;.001). Of the 143 institutional PEMs, there were 13 (9.1%) below the sixth-grade reading level, which improved to 33 (23.1%) after revision by GPT-4 (<italic>P</italic>&#x003C;.001). No revised GPT-4 PEMs were graded as less accurate or less comprehensive compared to institutional PEMs. A total of 33 (23.1%) GPT-4 PEMs were graded as more comprehensive.</p></sec><sec sec-type="conclusions"><title>Conclusions</title><p>GPT-4 significantly improved the readability of institutional heart failure&#x2013;related PEMs. 
The model may be a promising adjunct resource in addition to care provided by a licensed health care professional for patients living with heart failure. Further rigorous testing and validation is needed to investigate its safety, efficacy, and impact on patient health literacy.</p></sec></abstract><kwd-group><kwd>patient education</kwd><kwd>heart failure</kwd><kwd>artificial intelligence</kwd><kwd>large language models</kwd><kwd>ChatGPT</kwd><kwd>GPT-4</kwd><kwd>health literacy</kwd><kwd>readability</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>Heart failure affects approximately 1%&#x2010;2% of adults globally, with an estimated prevalence of 64 million people [<xref ref-type="bibr" rid="ref1">1</xref>]. Treatment involves extensive patient adherence to lifestyle modifications such as daily weights, fluid and sodium restriction, and rigorous guideline-directed medication regimens. Altogether, these interventions attempt to prevent disease progression and hospital admissions, which drive most of the financial burden ($39.2-$60 billion) related to the disease [<xref ref-type="bibr" rid="ref2">2</xref>]. Due to the complex degree of self-management required by patients with heart failure, improving patient education and health literacy may play a crucial role in improving outcomes [<xref ref-type="bibr" rid="ref3">3</xref>,<xref ref-type="bibr" rid="ref4">4</xref>].</p><p>In the United States, the average adult&#x2019;s reading comprehension level is approximately seventh to eighth grade proficiency [<xref ref-type="bibr" rid="ref5">5</xref>], resulting in the American Medical Association (AMA) recommendation of written patient education materials (PEMs) being at a fifth- to sixth-grade reading level [<xref ref-type="bibr" rid="ref6">6</xref>]. 
However, a 2019 readability analysis of cardiovascular disease&#x2013;related PEMs reported that the mean reading level of materials was tenth grade, comparable to that of a high school sophomore [<xref ref-type="bibr" rid="ref7">7</xref>]. Inadequate health literacy has been associated with increased relative risk of emergency department visits, hospitalizations, and mortality for patients with heart failure [<xref ref-type="bibr" rid="ref4">4</xref>,<xref ref-type="bibr" rid="ref8">8</xref>], highlighting the need for accessible, readable, and high-quality PEMs.</p><p>ChatGPT is a large language model (LLM) that is gaining widespread public adoption [<xref ref-type="bibr" rid="ref9">9</xref>]. With an increasing number of patients seeking health information online [<xref ref-type="bibr" rid="ref10">10</xref>], the model has the potential to enhance patient health education and address the complexity of heart failure&#x2013;related PEMs. As ChatGPT&#x2019;s acceptance and usage have increased, initial research involved evaluating the model&#x2019;s accuracy and reliability. Several studies have shown that ChatGPT provides appropriate, accurate, and reliable knowledge across a wide range of cardiac and noncardiac medical conditions, including heart failure [<xref ref-type="bibr" rid="ref11">11</xref>-<xref ref-type="bibr" rid="ref16">16</xref>]. In addition to accuracy, ChatGPT has been found to deliver more empathetic responses to real-world patient questions than physicians in online forums [<xref ref-type="bibr" rid="ref17">17</xref>]. 
As prior data regarding accuracy have been promising, an emerging focus has been on investigating the readability of the model&#x2019;s output.</p><p>Prior studies have shown ChatGPT provides accurate and comprehensive responses to questions related to heart failure, and another demonstrated its responses were at a college reading level, highlighting the need for further assessment of the readability of GPT&#x2019;s outputs [<xref ref-type="bibr" rid="ref12">12</xref>,<xref ref-type="bibr" rid="ref18">18</xref>]. Similarly, another study examining GPT-4&#x2019;s responses related to amyloidosis showed that while responses were often accurate and comprehensive, the average readability of responses ranged from a grade level of 10.3 (high school sophomore) to 21.7 (beyond graduate school) [<xref ref-type="bibr" rid="ref16">16</xref>]. We aim to expand on the previous literature by assessing the readability of heart failure&#x2013;related online PEMs from renowned cardiology institutions, assessing GPT-4&#x2019;s ability to improve the readability of these PEMs, and comparing the accuracy and comprehensiveness between institutional PEMs and GPT-4&#x2019;s revised PEMs.</p></sec><sec id="s2" sec-type="methods"><title>Methods</title><sec id="s2-1"><title>Institutional Patient Education Materials</title><p>There were 143 PEMs (<xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref> and <xref ref-type="fig" rid="figure1">Figure 1</xref>) related to heart failure collected in July 2023 from the top 10 ranked cardiology institutions (deidentified) listed on the 2022&#x2010;2023 US News &#x0026; World Report website as &#x201C;Best Hospitals for Cardiology, Heart &#x0026; Vascular Surgery.&#x201D; These PEMs include frequently asked questions (FAQs) presented as text descriptions of various aspects of heart failure such as causes, symptoms, medications, and procedures. 
Duplicate institutional PEMs were included since education materials varied between institutions, and readability of each PEM was the primary outcome of interest.</p><fig position="float" id="figure1"><label>Figure 1.</label><caption><p>Diagram of institutional heart failure&#x2013;related PEM curation, revised GPT-4 PEM generation, and subsequent assessment of readability, accuracy, and comprehensiveness. Created in BioRender [<xref ref-type="bibr" rid="ref19">19</xref>]. FAQ: frequently asked question; PEM: patient education material.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="cardio_v9i1e68817_fig01.png"/></fig></sec><sec id="s2-2"><title>GPT-4 Response Generation</title><p>Each institution&#x2019;s PEMs were entered into GPT-4 (version updated July 20, 2023), preceded by the prompt, &#x201C;Please explain the following in simpler terms.&#x201D; GPT-4 was accessed using the OpenAI website interface. Default model settings were used (temperature, max tokens, etc). The &#x201C;new chat&#x201D; function was used for each PEM, thus creating a new conversation without a record of prior inputs. Materials containing nontext components (images or videos) were excluded.</p></sec><sec id="s2-3"><title>Readability Assessment</title><p>The readability of institutional PEMs and GPT-4&#x2019;s revised PEMs were then assessed using the following validated formulas: Flesch Reading Ease (FRE) score [<xref ref-type="bibr" rid="ref20">20</xref>], Flesch-Kincaid Grade Level (FKGL) [<xref ref-type="bibr" rid="ref21">21</xref>], Gunning Fog Index [<xref ref-type="bibr" rid="ref22">22</xref>], Coleman-Liau Index [<xref ref-type="bibr" rid="ref23">23</xref>], Simple Measure of Gobbledygook (SMOG) Index [<xref ref-type="bibr" rid="ref24">24</xref>], and Automated Readability Index [<xref ref-type="bibr" rid="ref25">25</xref>]. 
The FRE score, measured on a scale of 0 to 100, indicates a text with a higher score has better ease of understanding. The remaining formulas directly translate a score into its corresponding US reading grade level, such as a score of 10 translating to a tenth-grade reading level. These metrics derive their scores from the mean length of sentences and words used in a given text. In contrast to the FRE, lower scores in the other formulas correspond to an easier level of understanding. The readability formulas were assessed using the <italic>Textstat</italic> library in Python (Python Software Foundation) and the <italic>Textstat readability</italic> package in R software (R Foundation for Statistical Computing).</p></sec><sec id="s2-4"><title>Accuracy and Comprehensiveness</title><p>Accuracy and comprehensiveness of GPT-4&#x2019;s revised PEMs (<xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>) were assessed as secondary outcomes by an actively practicing board-certified cardiologist at a tertiary academic medical center. The reviewer was not blinded during grading. The reviewer used the following grading scale in <xref ref-type="other" rid="box1">Textbox 1</xref> when grading the original institutional PEMs and revised GPT-4 PEMs.</p><boxed-text id="box1"><title>Grading scale used by reviewer.</title><p>&#x201C;Compared to the institutional PEM, the GPT-4 revised PEM is&#x201D;:</p><list list-type="order"><list-item><p>Less accurate</p></list-item><list-item><p>Equally accurate</p></list-item><list-item><p>More accurate</p></list-item></list><p>&#x201C;Compared to the institutional PEM, the GPT-4 revised PEM is&#x201D;:</p><list list-type="order"><list-item><p>Less comprehensive</p></list-item><list-item><p>Equally comprehensive</p></list-item><list-item><p>More comprehensive</p></list-item></list></boxed-text></sec><sec id="s2-5"><title>Statistical Analysis</title><p>Descriptive statistics are presented as medians and IQRs. 
Readability metrics for institutional PEMs and GPT-4&#x2019;s revised PEMs were compared using the Mann-Whitney <italic>U</italic> test. Further subanalysis was performed investigating the proportion of PEMs meeting the sixth-grade reading level recommendation by the AMA among institutional PEMs and GPT-4&#x2019;s revised PEMs. Statistical analysis was conducted using SPSS (version 29; IBM Corporation).</p></sec><sec id="s2-6"><title>Ethical Considerations</title><p>The data collection process in this observational study did not involve patients and did not require the deidentification or protection of data. Therefore, no institutional review board approval was sought.</p></sec></sec><sec id="s3" sec-type="results"><title>Results</title><sec id="s3-1"><title>Readability Assessment</title><p>Readability analysis revealed GPT-4&#x2019;s revised PEMs were significantly more readable compared to institutional PEMs across all 6 metrics (<italic>P</italic>&#x003C;.001) (<xref ref-type="fig" rid="figure2">Figure 2</xref>). The FRE score increased from a median institutional score of 48.6 (IQR 38.0-63.3; <italic>P</italic>&#x003C;.001; hard-to-read text, college reading level) to 72.2 (IQR 66.2-77.5; <italic>P</italic>&#x003C;.001; fairly easy-to-read text, seventh-grade level) after GPT-4 revision [<xref ref-type="bibr" rid="ref20">20</xref>]. The FKGL also saw improvement, decreasing from an institutional median reading level of tenth grade (IQR 7.9-13.1; <italic>P</italic>&#x003C;.001) to seventh grade (IQR 6.1-8.5; <italic>P</italic>&#x003C;.001) after GPT-4 revision. Furthermore, the institutional Automated Readability Index of 11.2 (IQR 7.7-14.5; <italic>P</italic>&#x003C;.001) improved to 8.3 (IQR 6.7-9.3; <italic>P</italic>&#x003C;.001) after GPT-4 revision. 
The other readability metrics (Gunning Fog Index, Coleman-Liau Index, and SMOG Index) also showed improved scores after GPT-4 revision: 9.8 (IQR 8.5-11.1; <italic>P</italic>&#x003C;.001), 8.9 (IQR 8.1-10.0; <italic>P</italic>&#x003C;.001), and 9.6 (IQR 8.5-10.7; <italic>P</italic>&#x003C;.001), respectively, compared to the median institutional scores of 13.1 (IQR 10.6-16.2), 12.3 (IQR 10.1-14.5), and 12.2 (IQR 10.3-14.6). Before GPT-4 revision, 9.1% (13/143) of institutional PEMs met the AMA&#x2019;s recommended sixth-grade reading level (<xref ref-type="table" rid="table1">Table 1</xref>). However, after GPT-4&#x2019;s revision, 23.1% (33/143) of PEMs met the sixth-grade recommendation. On average, GPT-4 revision led to a 3.6 reading grade level reduction.</p><p>An example of this simplification in reading level was seen when describing different types of heart failure. The institutional PEM described right-sided heart failure as most often resulting from left-sided heart failure due to increased pressure from the left ventricle not propelling blood to the rest of the body. However, GPT-4 provided a more basic explanation using an analogy of ventricles being small rooms and gave a more simplified explanation of right-sided heart failure as a result of left-sided heart failure. 
In another example, when explaining the various causes of heart failure, one institutional PEM provided a list of etiologies such as &#x201C;heart valve disease&#x201D; or &#x201C;coronary artery disease&#x201D; without a description, compared to GPT-4, which more thoroughly described the role of each cause in relation to heart failure in simple language.</p><fig position="float" id="figure2"><label>Figure 2.</label><caption><p>Box and whiskers plot of median readability scores across 5 metrics including Automated Readability Index, Coleman-Liau Index, Flesch-Kincaid Grade Level, Gunning Fog Index, and Simple Measure of Gobbledygook (SMOG) Index for institutional and GPT-4&#x2019;s revised PEMs. PEMs: patient education materials. * <italic>P</italic>&#x003C;.05.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="cardio_v9i1e68817_fig02.png"/></fig><table-wrap id="t1" position="float"><label>Table 1.</label><caption><p>Comparison of the proportion of patient education materials (PEMs) meeting the American Medical Association&#x2019;s (AMA) recommended sixth-grade reading level between institutional and GPT-4&#x2019;s revised PEMs.</p></caption><table id="table1" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom"/><td align="left" valign="bottom">&#x2264;Sixth-grade reading level</td><td align="left" valign="bottom">&#x003E;Sixth-grade reading level</td><td align="left" valign="bottom">Percent meeting AMA recommendation</td></tr></thead><tbody><tr><td align="left" valign="top">Institutional Flesch-Kincaid Grade Level</td><td align="left" valign="top">13</td><td align="left" valign="top">130</td><td align="left" valign="top">9.10</td></tr><tr><td align="left" valign="top">GPT-4 Flesch-Kincaid Grade Level</td><td align="left" valign="top">33</td><td align="left" valign="top">110</td><td align="left" valign="top">23.10</td></tr></tbody></table></table-wrap></sec><sec id="s3-2"><title>Accuracy and 
Comprehensiveness</title><p>Following review by a board-certified cardiologist, 33 out of 143 (23.1%) revised GPT-4 PEMs were graded as more comprehensive than the corresponding institutional PEMs (<xref ref-type="table" rid="table2">Table 2</xref>). Additionally, all 143 (100%) revised GPT-4 PEMs were graded as equally accurate as their institutional PEM counterpart.</p><table-wrap id="t2" position="float"><label>Table 2.</label><caption><p>Evaluation of GPT-4&#x2019;s accuracy and comprehensiveness of revised patient education materials (PEMs) compared to institutional PEMs (N=143).</p></caption><table id="table2" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Scoring</td><td align="left" valign="bottom">Accuracy, n (%)</td><td align="left" valign="bottom">Comprehensiveness, n (%)</td></tr></thead><tbody><tr><td align="left" valign="top">Less</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">0 (0)</td></tr><tr><td align="left" valign="top">Equal</td><td align="left" valign="top">143 (100)</td><td align="left" valign="top">110 (76.9)</td></tr><tr><td align="left" valign="top">More</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">33 (23.1)</td></tr></tbody></table></table-wrap></sec></sec><sec id="s4" sec-type="discussion"><title>Discussion</title><sec id="s4-1"><title>Principal Results</title><p>LLMs are a rapidly developing technology with the potential to enhance the delivery of PEMs to patients of all levels of health literacy. In this study, we expanded on existing research that evaluated ChatGPT&#x2019;s ability to generate accurate and reliable answers to heart failure questions by examining GPT-4&#x2019;s ability to improve the readability of institutional PEMs. Our analysis shows that GPT-4, when prompted, was able to significantly enhance the readability of institutional PEMs for common heart failure&#x2013;related patient questions. 
After evaluation by a board-certified cardiologist, all of GPT-4&#x2019;s revised PEMs were graded as equally accurate and many were graded as more comprehensive than institutional PEMs, with no revised PEMs graded as less accurate or less comprehensive. GPT-4&#x2019;s capabilities to provide accurate, comprehensive, and readable PEMs in real-time and in a conversational manner underscore the future potential of LLMs to enhance patient education and ultimately patient health literacy.</p></sec><sec id="s4-2"><title>Comparison With Prior Work</title><p>Previous research has demonstrated that ChatGPT possesses a broad knowledge base comprising various medical conditions, including cirrhosis, hepatocellular carcinoma, and bariatric surgery [<xref ref-type="bibr" rid="ref14">14</xref>,<xref ref-type="bibr" rid="ref15">15</xref>,<xref ref-type="bibr" rid="ref26">26</xref>,<xref ref-type="bibr" rid="ref27">27</xref>]. Its knowledge base also spans cardiovascular diseases such as acute coronary syndrome [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref28">28</xref>], heart failure [<xref ref-type="bibr" rid="ref12">12</xref>], atrial fibrillation [<xref ref-type="bibr" rid="ref29">29</xref>], and even rare disorders like amyloidosis [<xref ref-type="bibr" rid="ref16">16</xref>]&#x2014;a multisystemic infiltrative disease. Specifically, regarding amyloidosis, while GPT-4 provided accurate, comprehensive, and reliable answers to gastrointestinal, neurologic, and cardiology queries, the average FKGL of responses was 15.5 (college level), significantly exceeding the recommended sixth-grade reading level set forth by the AMA [<xref ref-type="bibr" rid="ref16">16</xref>]. 
Similar results were shown when examining responses to the surgical treatment of retinal diseases and hypothyroidism in pregnancy [<xref ref-type="bibr" rid="ref30">30</xref>,<xref ref-type="bibr" rid="ref31">31</xref>].</p><p>A previous study examined ChatGPT&#x2019;s ability to simplify the readability of responses to bariatric surgery&#x2013;related FAQs [<xref ref-type="bibr" rid="ref32">32</xref>]. GPT-4 reduced the average grade reading level of PEMs from eleventh (high school junior) to sixth grade, aligning with the AMA&#x2019;s recommendation. Another study also showed that GPT-4 improved the readability of cardiovascular magnetic resonance reports, reducing the average reading level from tenth grade to fifth grade while maintaining high factual accuracy [<xref ref-type="bibr" rid="ref33">33</xref>]. When simplifying PEMs relating to aortic stenosis, GPT-3.5 was able to lower the mean FKGL from 9.2 to 5.9 when instructed to &#x201C;translate to a 5th grade reading level&#x201D; [<xref ref-type="bibr" rid="ref34">34</xref>]. Our study further contributes to this body of work by demonstrating GPT-4&#x2019;s ability to improve the median readability of institutional PEMs from 10.3 (high school sophomore) to 7.3 (seventh grade) while maintaining accuracy and often enhancing comprehensiveness (<xref ref-type="table" rid="table1">Table 1</xref>). However, a unique aspect of our study was the use of a general prompt, &#x201C;Please explain the following in simpler terms,&#x201D; compared to other studies that specifically requested simplification to a fifth- to sixth-grade reading level [<xref ref-type="bibr" rid="ref34">34</xref>]. Our prompt simulates an organic patient encounter with the GPT-4 platform written in language meant to mirror an actual patient request for simplification. This difference in prompting but similar significant improvement in readability shows the adaptability of LLMs in this domain and may increase the likelihood of future adoption. 
Furthermore, the enhanced readability underscores the potential of LLMs in fostering better patient understanding of heart failure&#x2013;related information.</p></sec><sec id="s4-3"><title>Limitations and Ethical Concerns</title><p>ChatGPT, while adept at generating conversational answers, has inherent limitations in accuracy and privacy. The model cannot access real-time patient records and often does not cite peer-reviewed articles or reference updated guidelines, which is crucial for accurate and evidence-based responses. Additionally, the current model may not reliably understand nuanced medical topics or accurately interpret complex medical questions [<xref ref-type="bibr" rid="ref35">35</xref>], leading to potential patient misunderstandings. In some cases, ChatGPT may also generate answers that initially seem factual due to its confident-appearing language but disseminate inaccurate information, known as artificial hallucinations [<xref ref-type="bibr" rid="ref36">36</xref>]. Utilizing artificial intelligence (AI) models like ChatGPT in health care settings may also not guarantee secure handling of patient information as the model may collect users&#x2019; conversation data for future training. Although OpenAI does have a privacy setting allowing for disabling user data collection, prioritizing patient confidentiality will be an important aspect of development if the technology is to be used as an adjunct health care tool [<xref ref-type="bibr" rid="ref37">37</xref>].</p><p>Furthermore, ChatGPT may also perpetuate social disparities due to implicit biases and contribute to accessibility gaps. Recent studies revealed that GPT-4 tended to promote outdated race-based medicine and overrepresent or underrepresent certain racial groups and sexes depending on the circumstance and thus potentially reinforce stereotypes [<xref ref-type="bibr" rid="ref38">38</xref>,<xref ref-type="bibr" rid="ref39">39</xref>]. 
Another concern is equitable access, as patients with lower socioeconomic status often have less access to certain technology such as the internet and may have barriers to utilizing these new AI tools [<xref ref-type="bibr" rid="ref40">40</xref>]. Altogether, these validity and ethical considerations emphasize that clinical oversight, such as US Food and Drug Administration regulation, is warranted prior to LLM incorporation in patient care [<xref ref-type="bibr" rid="ref41">41</xref>]. This would allow for consistent monitoring of this rapidly evolving technology, ensuring optimization of safety protocols with each new update of the model.</p><p>Our study has several limitations. Although we employed validated readability scoring systems as a surrogate for patient understanding, these formulas have their limitations, as previously reported [<xref ref-type="bibr" rid="ref42">42</xref>,<xref ref-type="bibr" rid="ref43">43</xref>]. These formulas often generate a reading level score that inherently grades longer words and sentences as being more complex but are unable to assess a text&#x2019;s content for structure and clarity. Our study also did not involve patients, which is essential for the comprehensive assessment of ChatGPT as a patient educational resource. Future studies would benefit from involving patients to ensure relevance of questions, preference in language used, and assessment of patient understanding. A baseline assessment of a patient&#x2019;s understanding of the given topic would also be beneficial to assess if ChatGPT can improve comprehension rather than relying on scoring tools. Additionally, we employed only one expert reviewer to assess the accuracy and comprehensiveness of ChatGPT&#x2019;s responses. To limit the potential for bias through subjective review and promote diverse perspectives, future research would benefit from involving multiple reviewers from different backgrounds and training institutions. 
Our reviewer was also not blinded to the source of each PEM, allowing for possible bias when evaluating accuracy and comprehensiveness. Our study could also not incorporate or interpret questions containing multimedia at the time of data collection, but with the release of multimodal LLMs, like GPT-4V, including visual aids would be another valuable component of PEMs to investigate. The PEMs used are not comprehensive of all questions that may be asked by patients, which limits the generalizability of our results. Future studies using real-world patients and questions would be helpful to further understand the broad spectrum of questions patients may ask.</p></sec><sec id="s4-4"><title>Future Directions</title><p>We opted for a pragmatic approach in designing the GPT-4 prompt used to revise institutional PEMs. Our focus was on ensuring the prompt reflected a simple, intuitive command that patients would be likely to use in real-world scenarios. Although this method provided promising results, highlighting the versatility of GPT-4, exploring more intricate prompts may yield even more impressive outputs and functionality. We advocate further research into prompt engineering to better replicate natural conversations and offer specific instructions for generating higher-quality and personalized responses.</p><p>Medical institutions can utilize this technology by integrating ChatGPT directly into their online patient education platforms with customized readability based on the highest level of education completed by the patient. This type of personalization of readability assessment can be implemented in all patient-facing AI applications to ensure the appropriate reading level of text for all patients. 
For example, Buoy Health, a chatbot developed by Harvard Medical School in 2014, uses natural language processing to help users assess symptoms with reported accuracy rates of 90%&#x2010;98% [<xref ref-type="bibr" rid="ref44">44</xref>,<xref ref-type="bibr" rid="ref45">45</xref>]. Boston Children&#x2019;s Hospital has adopted this platform on their website to guide patients on symptoms and recommended next steps in seeking medical care [<xref ref-type="bibr" rid="ref44">44</xref>,<xref ref-type="bibr" rid="ref45">45</xref>]. While not solely focused on education, it demonstrates how leading institutions are successfully leveraging chatbots as interactive tools. The consideration of readability assessment and adaptability in these patient-facing applications may increase patient engagement and ensure patients of all education levels can use these tools. Greater collaboration between trusted medical institutions and LLM platforms could improve patient access to simplified, accurate medical information that aligns with the AMA&#x2019;s recommended fifth- to sixth-grade reading level.</p></sec><sec id="s4-5"><title>Conclusions</title><p>Our study demonstrates GPT-4&#x2019;s ability to improve the readability of institutional heart failure&#x2013;related PEMs while also maintaining accuracy and comprehensiveness. Our results underscore the potential future utility of LLMs in improving the delivery of easy-to-understand and readable PEMs to patients of all health literacy levels. While ChatGPT may potentially be a valuable future tool in patient care, it should be used as a supplement to, rather than a replacement for, human expertise and judgment of a licensed health care professional. We recommend the development of future studies examining the optimization of readability outputs, personalization, and real-world implementation.</p></sec></sec></body><back><ack><p>ChatGPT-4 (version updated 16 May 2024), by OpenAI, was used to improve readability. 
There was no funding obtained for this study.</p></ack><notes><sec><title>Data Availability</title><p>All data generated or analyzed during this study are included in this paper's main text and <xref ref-type="supplementary-material" rid="app2">Multimedia Appendix 2</xref>.</p></sec></notes><fn-group><fn fn-type="conflict"><p>RG is a consultant for Pfizer, Alnylam, and AstraZeneca. None of the other authors have interests to disclose.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AI</term><def><p>artificial intelligence</p></def></def-item><def-item><term id="abb2">AMA</term><def><p>American Medical Association</p></def></def-item><def-item><term id="abb3">FAQ</term><def><p>frequently asked question</p></def></def-item><def-item><term id="abb4">FKGL</term><def><p>Flesch-Kincaid Grade Level</p></def></def-item><def-item><term id="abb5">FRE</term><def><p>Flesch Reading Ease score</p></def></def-item><def-item><term id="abb6">LLM</term><def><p>large language model</p></def></def-item><def-item><term id="abb7">PEM</term><def><p>patient education material</p></def></def-item><def-item><term id="abb8">SMOG</term><def><p>Simple Measure of Gobbledygook</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Groenewegen</surname><given-names>A</given-names> </name><name name-style="western"><surname>Rutten</surname><given-names>FH</given-names> </name><name name-style="western"><surname>Mosterd</surname><given-names>A</given-names> </name><name name-style="western"><surname>Hoes</surname><given-names>AW</given-names> </name></person-group><article-title>Epidemiology of heart failure</article-title><source>Eur J Heart Fail</source><year>2020</year><month>08</month><volume>22</volume><issue>8</issue><fpage>1342</fpage><lpage>1356</lpage><pub-id 
pub-id-type="doi">10.1002/ejhf.1858</pub-id><pub-id pub-id-type="medline">32483830</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Urbich</surname><given-names>M</given-names> </name><name name-style="western"><surname>Globe</surname><given-names>G</given-names> </name><name name-style="western"><surname>Pantiri</surname><given-names>K</given-names> </name><etal/></person-group><article-title>A systematic review of medical costs associated with heart failure in the USA (2014-2020)</article-title><source>Pharmacoeconomics</source><year>2020</year><month>11</month><volume>38</volume><issue>11</issue><fpage>1219</fpage><lpage>1236</lpage><pub-id pub-id-type="doi">10.1007/s40273-020-00952-0</pub-id><pub-id pub-id-type="medline">32812149</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Berkman</surname><given-names>ND</given-names> </name><name name-style="western"><surname>Sheridan</surname><given-names>SL</given-names> </name><name name-style="western"><surname>Donahue</surname><given-names>KE</given-names> </name><name name-style="western"><surname>Halpern</surname><given-names>DJ</given-names> </name><name name-style="western"><surname>Crotty</surname><given-names>K</given-names> </name></person-group><article-title>Low health literacy and health outcomes: an updated systematic review</article-title><source>Ann Intern Med</source><year>2011</year><month>07</month><day>19</day><volume>155</volume><issue>2</issue><fpage>97</fpage><lpage>107</lpage><pub-id pub-id-type="doi">10.7326/0003-4819-155-2-201107190-00005</pub-id><pub-id pub-id-type="medline">21768583</pub-id></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name 
name-style="western"><surname>Peterson</surname><given-names>PN</given-names> </name><name name-style="western"><surname>Shetterly</surname><given-names>SM</given-names> </name><name name-style="western"><surname>Clarke</surname><given-names>CL</given-names> </name><etal/></person-group><article-title>Health literacy and outcomes among patients with heart failure</article-title><source>JAMA</source><year>2011</year><month>04</month><day>27</day><volume>305</volume><issue>16</issue><fpage>1695</fpage><lpage>1701</lpage><pub-id pub-id-type="doi">10.1001/jama.2011.512</pub-id><pub-id pub-id-type="medline">21521851</pub-id></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="web"><article-title>Fast facts: adult literacy</article-title><source>NCES</source><year>2019</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://nces.ed.gov/fastfacts/display.asp?id=69">https://nces.ed.gov/fastfacts/display.asp?id=69</ext-link></comment></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Weiss</surname><given-names>BD</given-names> </name></person-group><source>Health Literacy: A Manual for Clinicians</source><year>2003</year><publisher-name>American Medical Association Foundation and American Medical Association</publisher-name></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ayyaswami</surname><given-names>V</given-names> </name><name name-style="western"><surname>Padmanabhan</surname><given-names>D</given-names> </name><name name-style="western"><surname>Patel</surname><given-names>M</given-names> </name><etal/></person-group><article-title>A readability analysis of online cardiovascular disease-related health education materials</article-title><source>Health Lit Res 
Pract</source><year>2019</year><month>04</month><volume>3</volume><issue>2</issue><fpage>e74</fpage><lpage>e80</lpage><pub-id pub-id-type="doi">10.3928/24748307-20190306-03</pub-id><pub-id pub-id-type="medline">31049489</pub-id></nlm-citation></ref><ref id="ref8"><label>8</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Fabbri</surname><given-names>M</given-names> </name><name name-style="western"><surname>Murad</surname><given-names>MH</given-names> </name><name name-style="western"><surname>Wennberg</surname><given-names>AM</given-names> </name><etal/></person-group><article-title>Health literacy and outcomes among patients with heart failure: a systematic review and meta-analysis</article-title><source>JACC Heart Fail</source><year>2020</year><month>06</month><volume>8</volume><issue>6</issue><fpage>451</fpage><lpage>460</lpage><pub-id pub-id-type="doi">10.1016/j.jchf.2019.11.007</pub-id><pub-id pub-id-type="medline">32466837</pub-id></nlm-citation></ref><ref id="ref9"><label>9</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>Sidoti</surname><given-names>O</given-names> </name><name name-style="western"><surname>McClain</surname><given-names>C</given-names> </name></person-group><article-title>34% of U.S. 
adults have used ChatGPT, about double the share in 2023</article-title><source>Pew Research Center</source><year>2025</year><month>06</month><day>25</day><access-date>2025-06-26</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.pewresearch.org/short-reads/2025/06/25/34-of-us-adults-have-used-chatgpt-about-double-the-share-in-2023/">https://www.pewresearch.org/short-reads/2025/06/25/34-of-us-adults-have-used-chatgpt-about-double-the-share-in-2023/</ext-link></comment></nlm-citation></ref><ref id="ref10"><label>10</label><nlm-citation citation-type="web"><article-title>The social life of health information</article-title><source>Pew Research Center</source><year>2009</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.pewresearch.org/internet/2009/06/11/the-social-life-of-health-information">https://www.pewresearch.org/internet/2009/06/11/the-social-life-of-health-information</ext-link></comment></nlm-citation></ref><ref id="ref11"><label>11</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Sarraju</surname><given-names>A</given-names> </name><name name-style="western"><surname>Bruemmer</surname><given-names>D</given-names> </name><name name-style="western"><surname>Van Iterson</surname><given-names>E</given-names> </name><name name-style="western"><surname>Cho</surname><given-names>L</given-names> </name><name name-style="western"><surname>Rodriguez</surname><given-names>F</given-names> </name><name name-style="western"><surname>Laffin</surname><given-names>L</given-names> </name></person-group><article-title>Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model</article-title><source>JAMA</source><year>2023</year><month>03</month><day>14</day><volume>329</volume><issue>10</issue><fpage>842</fpage><lpage>844</lpage><pub-id 
pub-id-type="doi">10.1001/jama.2023.1044</pub-id><pub-id pub-id-type="medline">36735264</pub-id></nlm-citation></ref><ref id="ref12"><label>12</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>King</surname><given-names>RC</given-names> </name><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Mody</surname><given-names>B</given-names> </name><name name-style="western"><surname>Lombardo</surname><given-names>DM</given-names> </name><name name-style="western"><surname>Ghashghaei</surname><given-names>R</given-names> </name></person-group><article-title>Appropriateness of ChatGPT in answering heart failure related questions</article-title><source>Heart Lung Circ</source><year>2024</year><month>09</month><volume>33</volume><issue>9</issue><fpage>1314</fpage><lpage>1318</lpage><pub-id pub-id-type="doi">10.1016/j.hlc.2024.03.005</pub-id><pub-id pub-id-type="medline">38821760</pub-id></nlm-citation></ref><ref id="ref13"><label>13</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>King</surname><given-names>RC</given-names> </name><name name-style="western"><surname>Bharani</surname><given-names>V</given-names> </name><name name-style="western"><surname>Shah</surname><given-names>K</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name></person-group><article-title>GPT-4V passes the BLS and ACLS examinations: an analysis of GPT-4V&#x2019;s image recognition capabilities</article-title><source>Resuscitation</source><year>2024</year><month>02</month><volume>195</volume><fpage>110106</fpage><pub-id 
pub-id-type="doi">10.1016/j.resuscitation.2023.110106</pub-id><pub-id pub-id-type="medline">38160904</pub-id></nlm-citation></ref><ref id="ref14"><label>14</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Ng</surname><given-names>WH</given-names> </name><etal/></person-group><article-title>Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma</article-title><source>Clin Mol Hepatol</source><year>2023</year><month>07</month><volume>29</volume><issue>3</issue><fpage>721</fpage><lpage>732</lpage><pub-id pub-id-type="doi">10.3350/cmh.2023.0089</pub-id><pub-id pub-id-type="medline">36946005</pub-id></nlm-citation></ref><ref id="ref15"><label>15</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Rajeev</surname><given-names>N</given-names> </name><etal/></person-group><article-title>Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery</article-title><source>OBES SURG</source><year>2023</year><month>06</month><volume>33</volume><issue>6</issue><fpage>1790</fpage><lpage>1796</lpage><pub-id pub-id-type="doi">10.1007/s11695-023-06603-5</pub-id><pub-id pub-id-type="medline">37106269</pub-id></nlm-citation></ref><ref id="ref16"><label>16</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>King</surname><given-names>RC</given-names> </name><name 
name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><etal/></person-group><article-title>A multidisciplinary assessment of chatgpt&#x2019;s knowledge of amyloidosis: observational study</article-title><source>JMIR Cardio</source><year>2024</year><month>04</month><day>19</day><volume>8</volume><fpage>e53421</fpage><pub-id pub-id-type="doi">10.2196/53421</pub-id><pub-id pub-id-type="medline">38640472</pub-id></nlm-citation></ref><ref id="ref17"><label>17</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ayers</surname><given-names>JW</given-names> </name><name name-style="western"><surname>Poliak</surname><given-names>A</given-names> </name><name name-style="western"><surname>Dredze</surname><given-names>M</given-names> </name><etal/></person-group><article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title><source>JAMA Intern Med</source><year>2023</year><month>06</month><day>1</day><volume>183</volume><issue>6</issue><fpage>589</fpage><lpage>596</lpage><pub-id pub-id-type="doi">10.1001/jamainternmed.2023.1838</pub-id><pub-id pub-id-type="medline">37115527</pub-id></nlm-citation></ref><ref id="ref18"><label>18</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Riddell</surname><given-names>CW</given-names> </name><name name-style="western"><surname>Chan</surname><given-names>C</given-names> </name><name name-style="western"><surname>McGrinder</surname><given-names>H</given-names> </name><name name-style="western"><surname>Earle</surname><given-names>NJ</given-names> </name><name name-style="western"><surname>Poppe</surname><given-names>KK</given-names> </name><name 
name-style="western"><surname>Doughty</surname><given-names>RN</given-names> </name></person-group><article-title>College-level reading is required to understand ChatGPT&#x2019;s answers to lay questions relating to heart failure</article-title><source>Eur J Heart Fail</source><year>2023</year><month>12</month><volume>25</volume><issue>12</issue><fpage>2336</fpage><lpage>2337</lpage><pub-id pub-id-type="doi">10.1002/ejhf.3083</pub-id><pub-id pub-id-type="medline">37964183</pub-id></nlm-citation></ref><ref id="ref19"><label>19</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>King</surname><given-names>R</given-names> </name></person-group><article-title>Figure 1</article-title><source>BioRender</source><access-date>2025-06-27</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://BioRender.com/imijjhx">https://BioRender.com/imijjhx</ext-link></comment></nlm-citation></ref><ref id="ref20"><label>20</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>Flesch</surname><given-names>R</given-names> </name></person-group><article-title>Guide to academic writing</article-title><source>University of Canterbury School of Business and Economics</source><year>2016</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://web.archive.org/web/20160712094308/http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml">https://web.archive.org/web/20160712094308/http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml</ext-link></comment></nlm-citation></ref><ref id="ref21"><label>21</label><nlm-citation citation-type="report"><person-group person-group-type="author"><name name-style="western"><surname>Kincaid</surname><given-names>J</given-names> </name><name name-style="western"><surname>Fishburne</surname><given-names>R</given-names> </name><name 
name-style="western"><surname>Rogers</surname><given-names>R</given-names> </name><name name-style="western"><surname>Chissom</surname><given-names>B</given-names> </name></person-group><article-title>Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel</article-title><year>1975</year><access-date>2025-06-25</access-date><publisher-name>Institute for Simulation and Training</publisher-name><comment><ext-link ext-link-type="uri" xlink:href="https://stars.library.ucf.edu/cgi/viewcontent.cgi?article=1055&#x0026;context=istlibrary">https://stars.library.ucf.edu/cgi/viewcontent.cgi?article=1055&#x0026;context=istlibrary</ext-link></comment></nlm-citation></ref><ref id="ref22"><label>22</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gunning</surname><given-names>R</given-names> </name></person-group><article-title>The Fog Index after twenty years</article-title><source>Journal of Business Communication</source><year>1969</year><month>01</month><volume>6</volume><issue>2</issue><fpage>3</fpage><lpage>13</lpage><pub-id pub-id-type="doi">10.1177/002194366900600202</pub-id></nlm-citation></ref><ref id="ref23"><label>23</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Coleman</surname><given-names>M</given-names> </name><name name-style="western"><surname>Liau</surname><given-names>TL</given-names> </name></person-group><article-title>A computer readability formula designed for machine scoring</article-title><source>Journal of Applied Psychology</source><year>1975</year><volume>60</volume><issue>2</issue><fpage>283</fpage><lpage>284</lpage><pub-id pub-id-type="doi">10.1037/h0076540</pub-id></nlm-citation></ref><ref id="ref24"><label>24</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name 
name-style="western"><surname>McLaughlin</surname><given-names>GH</given-names> </name></person-group><article-title>SMOG grading: a new readability formula</article-title><source>J Read</source><year>1969</year><access-date>2025-06-25</access-date><volume>12</volume><issue>8</issue><fpage>639</fpage><lpage>646</lpage><comment><ext-link ext-link-type="uri" xlink:href="https://www.jstor.org/stable/40011226">https://www.jstor.org/stable/40011226</ext-link></comment></nlm-citation></ref><ref id="ref25"><label>25</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Smith</surname><given-names>EA</given-names> </name><name name-style="western"><surname>Senter</surname><given-names>RJ</given-names> </name></person-group><article-title>Automated readability index</article-title><source>AMRL TR</source><year>1967</year><month>05</month><fpage>1</fpage><lpage>14</lpage><pub-id pub-id-type="medline">5302480</pub-id></nlm-citation></ref><ref id="ref26"><label>26</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Ng</surname><given-names>WH</given-names> </name><etal/></person-group><article-title>ChatGPT&#x2019;s ability to comprehend and answer cirrhosis related questions in Arabic</article-title><source>Arab J Gastroenterol</source><year>2023</year><month>08</month><volume>24</volume><issue>3</issue><fpage>145</fpage><lpage>148</lpage><pub-id pub-id-type="doi">10.1016/j.ajg.2023.08.001</pub-id><pub-id pub-id-type="medline">37673708</pub-id></nlm-citation></ref><ref id="ref27"><label>27</label><nlm-citation citation-type="other"><person-group person-group-type="author"><collab>OpenAI</collab><name 
name-style="western"><surname>Achiam</surname><given-names>J</given-names> </name><name name-style="western"><surname>Adler</surname><given-names>S</given-names> </name><etal/></person-group><article-title>GPT-4 technical report</article-title><source>arXiv</source><comment>Preprint posted online on  Mar 15, 2023</comment><pub-id pub-id-type="doi">10.48550/arXiv.2303.08774</pub-id></nlm-citation></ref><ref id="ref28"><label>28</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gurbuz</surname><given-names>DC</given-names> </name><name name-style="western"><surname>Varis</surname><given-names>E</given-names> </name></person-group><article-title>Is ChatGPT knowledgeable of acute coronary syndromes and pertinent European Society of Cardiology Guidelines?</article-title><source>Minerva Cardiol Angiol</source><year>2024</year><month>06</month><volume>72</volume><issue>3</issue><fpage>299</fpage><lpage>303</lpage><pub-id pub-id-type="doi">10.23736/S2724-5683.24.06517-7</pub-id><pub-id pub-id-type="medline">38391252</pub-id></nlm-citation></ref><ref id="ref29"><label>29</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Lee</surname><given-names>TJ</given-names> </name><name name-style="western"><surname>Campbell</surname><given-names>DJ</given-names> </name><name name-style="western"><surname>Rao</surname><given-names>AK</given-names> </name><etal/></person-group><article-title>Evaluating ChatGPT responses on atrial fibrillation for patient education</article-title><source>Cureus</source><year>2024</year><month>06</month><volume>16</volume><issue>6</issue><fpage>e61680</fpage><pub-id pub-id-type="doi">10.7759/cureus.61680</pub-id><pub-id pub-id-type="medline">38841294</pub-id></nlm-citation></ref><ref id="ref30"><label>30</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name 
name-style="western"><surname>Onder</surname><given-names>CE</given-names> </name><name name-style="western"><surname>Koc</surname><given-names>G</given-names> </name><name name-style="western"><surname>Gokbulut</surname><given-names>P</given-names> </name><name name-style="western"><surname>Taskaldiran</surname><given-names>I</given-names> </name><name name-style="western"><surname>Kuskonmaz</surname><given-names>SM</given-names> </name></person-group><article-title>Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy</article-title><source>Sci Rep</source><year>2024</year><month>01</month><day>2</day><volume>14</volume><issue>1</issue><fpage>243</fpage><pub-id pub-id-type="doi">10.1038/s41598-023-50884-w</pub-id><pub-id pub-id-type="medline">38167988</pub-id></nlm-citation></ref><ref id="ref31"><label>31</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Momenaei</surname><given-names>B</given-names> </name><name name-style="western"><surname>Wakabayashi</surname><given-names>T</given-names> </name><name name-style="western"><surname>Shahlaee</surname><given-names>A</given-names> </name><etal/></person-group><article-title>Appropriateness and readability of ChatGPT-4-generated responses for surgical treatment of retinal diseases</article-title><source>Ophthalmol Retina</source><year>2023</year><month>10</month><volume>7</volume><issue>10</issue><fpage>862</fpage><lpage>868</lpage><pub-id pub-id-type="doi">10.1016/j.oret.2023.05.022</pub-id><pub-id pub-id-type="medline">37277096</pub-id></nlm-citation></ref><ref id="ref32"><label>32</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Srinivasan</surname><given-names>N</given-names> </name><name name-style="western"><surname>Samaan</surname><given-names>JS</given-names> </name><name 
name-style="western"><surname>Rajeev</surname><given-names>ND</given-names> </name><name name-style="western"><surname>Kanu</surname><given-names>MU</given-names> </name><name name-style="western"><surname>Yeo</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Samakar</surname><given-names>K</given-names> </name></person-group><article-title>Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources</article-title><source>Surg Endosc</source><year>2024</year><month>05</month><volume>38</volume><issue>5</issue><fpage>2522</fpage><lpage>2532</lpage><pub-id pub-id-type="doi">10.1007/s00464-024-10720-2</pub-id><pub-id pub-id-type="medline">38472531</pub-id></nlm-citation></ref><ref id="ref33"><label>33</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Salam</surname><given-names>B</given-names> </name><name name-style="western"><surname>Kravchenko</surname><given-names>D</given-names> </name><name name-style="western"><surname>Nowak</surname><given-names>S</given-names> </name><etal/></person-group><article-title>Generative Pre-trained Transformer 4 makes cardiovascular magnetic resonance reports easy to understand</article-title><source>J Cardiovasc Magn Reson</source><year>2024</year><volume>26</volume><issue>1</issue><fpage>101035</fpage><pub-id pub-id-type="doi">10.1016/j.jocmr.2024.101035</pub-id><pub-id pub-id-type="medline">38460841</pub-id></nlm-citation></ref><ref id="ref34"><label>34</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Rouhi</surname><given-names>AD</given-names> </name><name name-style="western"><surname>Ghanem</surname><given-names>YK</given-names> </name><name name-style="western"><surname>Yolchieva</surname><given-names>L</given-names> 
</name><etal/></person-group><article-title>Can artificial intelligence improve the readability of patient education materials on aortic stenosis? A pilot study</article-title><source>Cardiol Ther</source><year>2024</year><month>03</month><volume>13</volume><issue>1</issue><fpage>137</fpage><lpage>147</lpage><pub-id pub-id-type="doi">10.1007/s40119-023-00347-0</pub-id><pub-id pub-id-type="medline">38194058</pub-id></nlm-citation></ref><ref id="ref35"><label>35</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Cascella</surname><given-names>M</given-names> </name><name name-style="western"><surname>Montomoli</surname><given-names>J</given-names> </name><name name-style="western"><surname>Bellini</surname><given-names>V</given-names> </name><name name-style="western"><surname>Bignami</surname><given-names>E</given-names> </name></person-group><article-title>Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios</article-title><source>J Med Syst</source><year>2023</year><month>03</month><day>4</day><volume>47</volume><issue>1</issue><fpage>33</fpage><pub-id pub-id-type="doi">10.1007/s10916-023-01925-4</pub-id><pub-id pub-id-type="medline">36869927</pub-id></nlm-citation></ref><ref id="ref36"><label>36</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Alkaissi</surname><given-names>H</given-names> </name><name name-style="western"><surname>McFarlane</surname><given-names>SI</given-names> </name></person-group><article-title>Artificial hallucinations in ChatGPT: implications in scientific writing</article-title><source>Cureus</source><year>2023</year><month>02</month><volume>15</volume><issue>2</issue><fpage>e35179</fpage><pub-id pub-id-type="doi">10.7759/cureus.35179</pub-id><pub-id pub-id-type="medline">36811129</pub-id></nlm-citation></ref><ref 
id="ref37"><label>37</label><nlm-citation citation-type="web"><article-title>New ways to manage your data in ChatGPT</article-title><source>OpenAI</source><year>2023</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://openai.com/index/new-ways-to-manage-your-data-in-chatgpt">https://openai.com/index/new-ways-to-manage-your-data-in-chatgpt</ext-link></comment></nlm-citation></ref><ref id="ref38"><label>38</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zack</surname><given-names>T</given-names> </name><name name-style="western"><surname>Lehman</surname><given-names>E</given-names> </name><name name-style="western"><surname>Suzgun</surname><given-names>M</given-names> </name><etal/></person-group><article-title>Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study</article-title><source>Lancet Digit Health</source><year>2024</year><month>01</month><volume>6</volume><issue>1</issue><fpage>e12</fpage><lpage>e22</lpage><pub-id pub-id-type="doi">10.1016/S2589-7500(23)00225-X</pub-id><pub-id pub-id-type="medline">38123252</pub-id></nlm-citation></ref><ref id="ref39"><label>39</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Omiye</surname><given-names>JA</given-names> </name><name name-style="western"><surname>Lester</surname><given-names>JC</given-names> </name><name name-style="western"><surname>Spichak</surname><given-names>S</given-names> </name><name name-style="western"><surname>Rotemberg</surname><given-names>V</given-names> </name><name name-style="western"><surname>Daneshjou</surname><given-names>R</given-names> </name></person-group><article-title>Large language models propagate race-based medicine</article-title><source>NPJ Digit 
Med</source><year>2023</year><month>10</month><day>20</day><volume>6</volume><issue>1</issue><fpage>195</fpage><pub-id pub-id-type="doi">10.1038/s41746-023-00939-z</pub-id><pub-id pub-id-type="medline">37864012</pub-id></nlm-citation></ref><ref id="ref40"><label>40</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Wang</surname><given-names>X</given-names> </name><name name-style="western"><surname>Sanders</surname><given-names>HM</given-names> </name><name name-style="western"><surname>Liu</surname><given-names>Y</given-names> </name><etal/></person-group><article-title>ChatGPT: promise and challenges for deployment in low- and middle-income countries</article-title><source>Lancet Reg Health West Pac</source><year>2023</year><month>12</month><volume>41</volume><fpage>100905</fpage><pub-id pub-id-type="doi">10.1016/j.lanwpc.2023.100905</pub-id><pub-id pub-id-type="medline">37731897</pub-id></nlm-citation></ref><ref id="ref41"><label>41</label><nlm-citation citation-type="report"><article-title>Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based Software as a Medical Device (SaMD)</article-title><year>2019</year><access-date>2025-06-26</access-date><publisher-name>Food and Drug Administration</publisher-name><comment><ext-link ext-link-type="uri" xlink:href="https://www.fda.gov/media/122535/download?attachment">https://www.fda.gov/media/122535/download?attachment</ext-link></comment></nlm-citation></ref><ref id="ref42"><label>42</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>Scott</surname><given-names>B</given-names> </name></person-group><article-title>The Gunning Fog Index (or FOG) readability formula</article-title><source>Readability Formula</source><year>2025</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" 
xlink:href="https://readabilityformulas.com/the-gunnings-fog-index-or-fog-readability-formula">https://readabilityformulas.com/the-gunnings-fog-index-or-fog-readability-formula</ext-link></comment></nlm-citation></ref><ref id="ref43"><label>43</label><nlm-citation citation-type="web"><article-title>Tip 6. Use caution with readability formulas for quality reports</article-title><source>AHRQ</source><year>2015</year><access-date>2024-10-29</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.ahrq.gov/talkingquality/resources/writing/tip6.html">https://www.ahrq.gov/talkingquality/resources/writing/tip6.html</ext-link></comment></nlm-citation></ref><ref id="ref44"><label>44</label><nlm-citation citation-type="web"><article-title>Buoy Health: a chatbot that helps diagnose your symptoms</article-title><source>Product Hunt</source><year>2017</year><access-date>2025-05-01</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.producthunt.com/posts/buoy-health">https://www.producthunt.com/posts/buoy-health</ext-link></comment></nlm-citation></ref><ref id="ref45"><label>45</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>&#x0106;irkovi&#x0107;</surname><given-names>A</given-names> </name></person-group><article-title>Evaluation of four artificial intelligence-assisted self-diagnosis apps on three diagnoses: two-year follow-up study</article-title><source>J Med Internet Res</source><year>2020</year><month>12</month><day>4</day><volume>22</volume><issue>12</issue><fpage>e18097</fpage><pub-id pub-id-type="doi">10.2196/18097</pub-id><pub-id pub-id-type="medline">33275113</pub-id></nlm-citation></ref></ref-list><app-group><supplementary-material id="app1"><label>Multimedia Appendix 1</label><p>Accuracy and comprehensiveness data.</p><media xlink:href="cardio_v9i1e68817_app1.xlsx" xlink:title="XLSX File, 116 KB"/></supplementary-material><supplementary-material 
id="app2"><label>Multimedia Appendix 2</label><p>Comparison of readability of institutional and GPT-4&#x2019;s revised patient education materials.</p><media xlink:href="cardio_v9i1e68817_app2.png" xlink:title="PNG File, 144 KB"/></supplementary-material></app-group></back></article>