Comparative Analysis of Large Language Models for Pediatric Kidney Stone Patient Education: A Multi-dimensional Assessment of Readability, Quality and Reliability


Ok F., Sukur I. H., Ok Z. O., Ates T., Değer M.

Archivos Espanoles de Urologia, vol. 79, no. 2, pp. 247-254, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 79 Issue: 2
  • Publication Date: 2026
  • DOI: 10.56434/j.arch.esp.urol.20267902.30
  • Journal Name: Archivos Espanoles de Urologia
  • Indexed in: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, DIALNET
  • Page Numbers: pp. 247-254
  • Keywords: artificial intelligence, chatbots, information quality, patient education, pediatric urolithiasis, readability
  • Çukurova University Affiliated: Yes

Abstract

Background: Pediatric urolithiasis is an increasingly important health concern, and affected children and their families require information that is both accurate and easily understandable. Artificial intelligence (AI)-powered chatbots have become widely used sources of health information; however, the readability, quality, and reliability of their outputs remain insufficiently evaluated. This study aimed to assess the effectiveness and reliability of AI chatbots in providing patient-oriented information on pediatric kidney stone disease and to identify factors influencing the quality and readability of their responses.

Methods: Four AI chatbots (ChatGPT-5, Google Gemini, Claude 3 Opus, and DeepSEEK) were queried with 30 standardized questions related to pediatric kidney stones. Readability was evaluated using the Average Reading Level Consensus (ARLC), Automated Readability Index (ARI), and Simple Measure of Gobbledygook (SMOG). Response quality and reliability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and the Modified DISCERN score. Statistical analyses included one-way analysis of variance (ANOVA), Kruskal-Wallis tests, and appropriate post hoc comparisons.

Results: Readability differed significantly among the chatbots. Google Gemini demonstrated the highest reading levels across all metrics (ARLC: 14.93, ARI: 16.2, and SMOG: 13.32), whereas ChatGPT, Claude, and DeepSEEK produced less complex text (p < 0.001; large effect sizes, η² = 0.195–0.512). EQIP scores did not differ significantly between models (p = 0.491, ε² = 0.021, negligible effect), indicating comparable informational quality. In contrast, reliability varied significantly: ChatGPT and Google Gemini achieved higher Modified DISCERN scores (median 4.00) than Claude and DeepSEEK (median 3.00; p = 0.001, ε² = 0.318, large effect). Subgroup analyses by question category revealed notable differences in performance, highlighting model-specific strengths and limitations.

Conclusions: Substantial variability exists in the readability and reliability of AI-generated health information on pediatric urolithiasis. Although ChatGPT and Google Gemini provided more reliable information, Google Gemini’s responses were consistently more complex and less accessible. These findings emphasize the need for careful validation and language simplification of AI-generated content before its use in patient and caregiver education.
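To make the scoring approach concrete, the Python sketch below shows, under stated assumptions, how two of the readability metrics (ARI and SMOG) and a Kruskal-Wallis comparison with an epsilon-squared effect size might be computed for chatbot responses. The ARI and SMOG formulas are standard; the sentence and syllable splitting, the scipy dependency, and the example scores are illustrative assumptions and do not reproduce the study's actual tooling or data. The ARLC consensus score is omitted because its component formulas are not specified here.

# Illustrative sketch (not the study's code): ARI, SMOG, and a
# Kruskal-Wallis test with an epsilon-squared effect size.
import math
import re

from scipy import stats  # assumed dependency for the Kruskal-Wallis test


def automated_readability_index(text: str) -> float:
    """ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(w) for w in words)
    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43


def count_syllables(word: str) -> int:
    """Very rough vowel-group syllable counter (an assumption, for illustration)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_index(text: str) -> float:
    """SMOG = 1.0430*sqrt(polysyllables * 30/sentences) + 3.1291."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * (30 / len(sentences))) + 3.1291


def kruskal_with_epsilon_squared(*groups):
    """Kruskal-Wallis H test plus the epsilon-squared effect size:
    eps^2 = H / ((n^2 - 1) / (n + 1)), where n is the total sample size."""
    h, p = stats.kruskal(*groups)
    n = sum(len(g) for g in groups)
    eps_squared = h / ((n ** 2 - 1) / (n + 1))
    return h, p, eps_squared


if __name__ == "__main__":
    # Hypothetical per-question reliability scores for four chatbots (not real data).
    scores = {
        "ChatGPT": [3, 4, 4, 4, 5, 4],
        "Gemini": [4, 4, 5, 4, 4, 4],
        "Claude": [3, 3, 3, 4, 3, 3],
        "DeepSeek": [3, 3, 4, 3, 3, 3],
    }
    h, p, eps2 = kruskal_with_epsilon_squared(*scores.values())
    print(f"H = {h:.2f}, p = {p:.3f}, epsilon^2 = {eps2:.3f}")

    sample = "Kidney stones in children can often be treated without surgery."
    print(f"ARI = {automated_readability_index(sample):.1f}, "
          f"SMOG = {smog_index(sample):.1f}")

In a design like the one described above, each chatbot's 30 responses would yield per-response readability and reliability scores, which then serve as the groups entering the ANOVA or Kruskal-Wallis comparisons.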