Hypothesis / aims of study
Artificial intelligence (AI), designed to perform tasks that normally require human cognition, is transforming many sectors of modern life, and health care is no exception. Daily advancements highlight AI’s potential to improve medical practice, benefiting both physicians and patients. However, it is essential to question whether these capabilities are being applied appropriately, particularly as support tools in overcrowded health systems.
Bladder diary analysis is a key diagnostic tool for lower urinary tract dysfunction but is often time-consuming. This study compares bladder diary analyses performed by different AI models with those conducted by a clinician, to assess the potential of AI as a routine tool in urological practice.
Study design, materials and methods
Randomly selected 3-day ICIQ bladder diaries were analyzed using four AI models: ChatGPT (Chat Generative Pre-trained Transformer), Microsoft Copilot, Gemini, and Jasper. The diaries were completed manually by patients, then scanned and analyzed. A single clinician analyzed each bladder diary, recording the results in an Excel spreadsheet. The time taken for each analysis was recorded to estimate the average review duration.
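The abstract does not state whether the models were queried through their chat interfaces or programmatically. Purely as an illustrative sketch, a scanned diary could be submitted to a vision-capable model through the OpenAI Python SDK; the model name, prompt, and file name below are assumptions, not the study's actual protocol.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scanned diary page
with open("diary_scan.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Extract from this 3-day ICIQ bladder diary: "
                      "micturition frequency, maximum voided volume, "
                      "nocturnal urine volume, and pad usage.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)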
Descriptive statistics were used to calculate the mean and standard deviation (SD) of each parameter for the AI models and the clinician. Agreement between these sources was assessed using the intraclass correlation coefficient (ICC) and the kappa statistic. Statistical significance was determined using p-values, with p < 0.05 indicating significant differences between methods.
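The abstract does not name the statistical software or the specific significance test used. As a minimal sketch, the agreement analysis could be reproduced in Python with the pingouin and scikit-learn packages, assuming a hypothetical long-format table with columns diary_id, rater, and the measured parameter:

import pandas as pd
import pingouin as pg
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format data: one row per (diary, rater) measurement
df = pd.read_csv("bladder_diaries_long.csv")

# ICC for a continuous parameter, e.g. maximum voided volume
icc = pg.intraclass_corr(data=df, targets="diary_id",
                         raters="rater", ratings="max_voided_volume")
print(icc[["Type", "ICC", "pval"]])

# Paired t-test as one plausible significance test
# (the abstract does not specify which test was used)
wide = df.pivot(index="diary_id", columns="rater",
                values="max_voided_volume")
t, p = stats.ttest_rel(wide["ChatGPT"], wide["clinician"])
print(f"paired t-test: t={t:.2f}, p={p:.4f}")

# Cohen's kappa for a categorical parameter, e.g. pad usage category:
# kappa = cohen_kappa_score(ai_categories, clinician_categories)

The rater labels, column names, and input file are illustrative; the sketch assumes each diary was rated on the same parameters by every source.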
Results
Twenty-five bladder diaries were analyzed. Two of the AI models were unable to analyze the data: Jasper AI was not designed for image analysis, and Gemini AI provided steps for manual calculation of the bladder diary rather than performing the analysis itself. The comparative analysis of bladder diary parameters between ChatGPT, Microsoft Copilot, and the clinician reference values revealed varying levels of agreement. Parameters such as maximum voided volume and pad usage showed high reliability between AI and clinician (ICC > 0.80). However, other metrics, including nocturnal urine volume, micturition frequency, and nocturnal polyuria index, exhibited statistically significant differences (p < 0.05), with mean discrepancies of up to 1000 mL in urine volume and 4-5 voids in micturition frequency.
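For reference, the nocturnal polyuria index reported above is conventionally calculated as the fraction of total 24-hour urine output voided at night:

NPi = (nocturnal urine volume / 24-hour urine volume) × 100%

Nocturnal polyuria is commonly defined as an NPi above roughly 33% in older adults (about 20% in younger adults), so a discrepancy approaching 1000 mL in nocturnal urine volume can easily change this classification.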
Interpretation of results
Incorporating AI into bladder diary analysis offers valuable enhancements but requires standardizing how patients complete these charts to ensure compatibility. The discrepant results can be directly attributed to patients' handwriting and how the models interpret the digits: variations in human handwriting affect AI perception, for example by confusing similar-looking numbers such as 5s and 8s. If the diaries were filled in digitally and integrated into software with built-in AI capabilities, the process would be simplified for clinicians. Nevertheless, this could impose an additional burden on patients, potentially discouraging them from consistently completing the bladder diaries.
The time required to analyze each bladder diary varied across cases, influenced by the quality of the diary completion and the handwriting. The mean analysis time was 5:08 minutes, with the shortest analysis taking 3:47 minutes and the longest 8:29 minutes. Implementing AI in bladder diary evaluations can automate data extraction and classification, significantly reducing the time required for analysis. This automation enhances efficiency, promotes consistent and accurate assessments, and alleviates clinician workload.
Concluding message
AI models, particularly those designed for pattern analysis, are not yet fully comparable to human analysis for critical aspects of clinical practice. The inconsistencies found in this study suggest that AI software may classify hand-filled data differently from the clinician, resulting in significant variations in metrics. While some parameters align well, differences in classification methods suggest that certain AI-generated values require closer calibration to meet clinician standards. Improper use of these tools could negatively impact patient treatment. Therefore, further development and refinement are necessary before integrating them into daily urological practice.