Reliability of AI Chatbots in Providing Urinary Tract Infection Health Information: A Comparative Study of ChatGPT, Google Gemini, and DeepSeek

Shabana M1, Lalkiya D1, Suartz C1, Shahrour W1, Elkoushy M1, Shabana W1

Research Type: Clinical

Abstract Category: Urotechnology

Abstract 315
Urology 10 - Artificial Intelligence/Technology in Urology
Scientific Podium Short Oral Session 27
Saturday 20th September 2025
14:15 - 14:22
Parallel Hall 3
Keywords: Infection, Urinary Tract; Female; Infection, other
1. Northern Ontario School of Medicine

Abstract

Hypothesis / aims of study
This study aims to evaluate and compare the accuracy and completeness of responses generated by three AI models—ChatGPT, Gemini, and DeepSeek—when prompted with patient-oriented questions regarding female urinary tract infections (UTIs). Responses were benchmarked against evidence-based clinical guidelines and publications.
Study design, materials and methods
A cross-sectional design was employed. Researchers developed five standardized, patient-focused questions on UTI management based on recent evidence and authoritative guidelines. Each question was submitted individually to ChatGPT, Gemini, and DeepSeek in a private browser session. Two medical professionals independently rated each AI-generated response for accuracy (3-point scale: 3 = Correct, 2 = Partially Correct, 1 = Incorrect) and completeness (2-point scale: 2 = Complete, 1 = Incomplete), comparing each response against the American Urological Association (AUA) guidelines. Inter-rater agreement was used to assess the consistency of ratings between evaluators.
Results
For completeness, both raters scored all three AI models 2 (Complete) on all five questions. For accuracy, Rater 1 assigned a score of 3 (Correct) to all three AI models across all five responses. Rater 2 scored DeepSeek 3 (Correct) on all five responses but scored ChatGPT and Gemini 2 (Partially Correct) on their responses in the prevention category.
Interpretation of results
Inter-rater agreement was high across all models. Overall agreement for accuracy was 86.7% (13 of 15 ratings), while completeness ratings showed 100% agreement. DeepSeek demonstrated the highest consistency, with 100% agreement between evaluators on both accuracy and completeness. ChatGPT and Gemini each showed 80% agreement for accuracy but maintained full agreement for completeness.
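The agreement figures above can be reproduced as simple percent agreement over the accuracy ratings reported in the results. The sketch below is a minimal reconstruction, assuming Rater 1 scored every response 3 and Rater 2 scored all responses 3 except the prevention answers from ChatGPT and Gemini (scored 2), as the abstract states; the position of the prevention question within each five-item list is an illustrative assumption.

```python
# Accuracy ratings: 3 = Correct, 2 = Partially Correct, 1 = Incorrect.
# One score per question (five questions per model); the prevention
# question is assumed (hypothetically) to be the third item.
MODELS = ["ChatGPT", "Gemini", "DeepSeek"]

rater1 = {m: [3, 3, 3, 3, 3] for m in MODELS}
rater2 = {
    "ChatGPT":  [3, 3, 2, 3, 3],  # prevention response rated Partially Correct
    "Gemini":   [3, 3, 2, 3, 3],  # prevention response rated Partially Correct
    "DeepSeek": [3, 3, 3, 3, 3],
}

def percent_agreement(a, b):
    """Share of items on which the two raters gave identical scores, as a %."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

per_model = {m: percent_agreement(rater1[m], rater2[m]) for m in MODELS}
overall = percent_agreement(
    [s for m in MODELS for s in rater1[m]],
    [s for m in MODELS for s in rater2[m]],
)

print(per_model)            # DeepSeek 100%, ChatGPT and Gemini 80%
print(round(overall, 1))    # 86.7 (13 of 15 ratings agree)
```

Percent agreement is the simplest inter-rater statistic; chance-corrected measures such as Cohen's kappa would be a natural extension but are not reported in the abstract.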
Concluding message
All three AI models produced generally accurate and complete responses to UTI-related patient questions. High inter-rater agreement, especially for completeness, suggests strong reliability of the content. However, small variations in accuracy ratings highlight the importance of consistent evaluation frameworks. DeepSeek demonstrated the highest overall consistency, indicating potential for reliable patient education support.
Figure 1 Response Scores for ChatGPT, Gemini, and DeepSeek
Disclosures
Funding: NA. Clinical Trial: No. Subjects: None.