Reliability of AI Chatbots in Providing Urinary Tract Infection Health Information: A Comparative Study of ChatGPT, Google Gemini, and DeepSeek

Shabana M1, Lalkiya D1, Suartz C1, Shahrour W1, Elkoushy M1, Shabana W1

Research Type: Clinical

Abstract Category: Urotechnology

Abstract 315
Urology 10 - Artificial Intelligence/Technology in Urology
Scientific Podium Short Oral Session 27
Saturday 20th September 2025
14:15 - 14:22
Parallel Hall 3
Keywords: Infection, Urinary Tract; Female; Infection, other
1. Northern Ontario School of Medicine

Abstract

Hypothesis / aims of study
This study aims to evaluate and compare the accuracy and completeness of responses generated by three AI models—ChatGPT, Gemini, and DeepSeek—when prompted with patient-oriented questions regarding female urinary tract infections (UTIs). Responses were benchmarked against evidence-based clinical guidelines and publications.
Study design, materials and methods
A cross-sectional design was employed. Researchers developed five standardized, patient-focused questions on UTI management based on recent evidence and authoritative guidelines. Each question was submitted individually to ChatGPT, Gemini, and DeepSeek in a private browser session. Two medical professionals independently rated each AI-generated response for accuracy (3-point scale: 3 = Correct, 2 = Partially Correct, 1 = Incorrect) and completeness (2-point scale: 2 = Complete, 1 = Incomplete), comparing each response against the American Urological Association (AUA) guidelines. Inter-rater agreement was used to assess the consistency of ratings between evaluators.
Results
For completeness, both raters scored all three AI models 2 (Complete) on all five questions. For accuracy, Rater 1 assigned a score of 3 (Correct) to all three AI models across all five responses. Rater 2 scored DeepSeek 3 (Correct) on all five responses but scored ChatGPT and Gemini 2 (Partially Correct) on their responses in the prevention category.
Interpretation of results
Inter-rater agreement was high across all models. Overall agreement for accuracy was 86.7% (13 of 15 ratings), while completeness ratings showed 100% agreement. DeepSeek demonstrated the highest consistency, with 100% agreement between evaluators on both accuracy and completeness. ChatGPT and Gemini each showed 80% agreement for accuracy but maintained full agreement for completeness.
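The agreement figures above can be reproduced as simple percent agreement over the accuracy ratings reported in the results. The sketch below is a minimal reconstruction, assuming Rater 1 scored every response 3 and Rater 2 scored all responses 3 except the prevention answers from ChatGPT and Gemini (scored 2), as the abstract states; the position of the prevention question within each five-item list is an illustrative assumption.

```python
# Accuracy ratings: 3 = Correct, 2 = Partially Correct, 1 = Incorrect.
# One score per question (five questions per model); the prevention
# question is assumed (hypothetically) to be the third item.
MODELS = ["ChatGPT", "Gemini", "DeepSeek"]

rater1 = {m: [3, 3, 3, 3, 3] for m in MODELS}
rater2 = {
    "ChatGPT":  [3, 3, 2, 3, 3],  # prevention response rated Partially Correct
    "Gemini":   [3, 3, 2, 3, 3],  # prevention response rated Partially Correct
    "DeepSeek": [3, 3, 3, 3, 3],
}

def percent_agreement(a, b):
    """Share of items on which the two raters gave identical scores, as a %."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

per_model = {m: percent_agreement(rater1[m], rater2[m]) for m in MODELS}
overall = percent_agreement(
    [s for m in MODELS for s in rater1[m]],
    [s for m in MODELS for s in rater2[m]],
)

print(per_model)            # DeepSeek 100%, ChatGPT and Gemini 80%
print(round(overall, 1))    # 86.7 (13 of 15 ratings agree)
```

Percent agreement is the simplest inter-rater statistic; chance-corrected measures such as Cohen's kappa would be a natural extension but are not reported in the abstract.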
Concluding message
All three AI models produced generally accurate and complete responses to UTI-related patient questions. High inter-rater agreement, especially for completeness, suggests strong reliability of the content. However, small variations in accuracy ratings highlight the importance of consistent evaluation frameworks. DeepSeek demonstrated the highest overall consistency, indicating potential for reliable patient education support.
Figure 1 Response Scores for ChatGPT, Gemini, and DeepSeek
Disclosures
Funding: NA. Clinical Trial: No. Subjects: None.