Accuracy of a Large Language Model (ChatGPT) in Responding to Patient-Generated Queries Following Benign Prostatic Hyperplasia Surgeries

Jad Najdi; Bilal Alameddine; Alexandre Armache; Marwan Zein; William S Azar; Towfik Sebai; Yara Ghandour; Albert El-Hajj

doi:10.33590/emjurol/NXXB4817

INTRODUCTION

The rapid advancements in AI, especially in large language models like ChatGPT (OpenAI, San Francisco, California, USA), hold potential for various applications in healthcare.1–6 This study aims to assess the accuracy of ChatGPT in responding to post-operative patient enquiries after surgery for benign prostatic hyperplasia.

METHODS

Common post-operative questions were collected from discharge instruction booklets, online forums, and social media platforms. Surgeries of interest included transurethral resection of the prostate (TURP), simple prostatectomy, laser enucleation of the prostate, Aquablation, Rezum, greenlight photovaporisation of the prostate, Urolift, and iTIND. ChatGPT 3.5 outputs were graded by two independent senior urology residents using pre-defined evaluation criteria. A third senior reviewer resolved grading discrepancies. Response errors were categorised into different types. Categorical variables were analysed using the Chi-square test. Inter-rater agreement was measured using Cohen’s Kappa coefficient.
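The two statistics named above can be illustrated with a minimal sketch. The abstract does not describe the software used or publish the individual grades, so the function names, the 1 to 4 grade coding, and all input values below are hypothetical and are not study data. The sketch computes Cohen's Kappa for two raters and the Pearson Chi-square statistic for a contingency table of counts, using only the Python standard library.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected_count) ** 2 / expected_count
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical grades (1 = correct ... 4 = completely inaccurate); NOT study data.
grades_a = [1, 1, 2, 3, 1, 4, 1, 2, 1, 1]
grades_b = [1, 1, 2, 1, 1, 4, 1, 3, 1, 1]
kappa = cohens_kappa(grades_a, grades_b)

# Hypothetical 2 x 2 table: rows = procedure group, columns = correct / not correct.
stat, dof = chi_square_statistic([[30, 10], [20, 20]])

print(f"kappa = {kappa:.3f}, chi-square = {stat:.3f} on {dof} df")
```

Converting the Chi-square statistic to a p-value requires the Chi-square distribution, which a statistics package would normally supply; that step is left out here to keep the sketch dependency-free.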

RESULTS

A total of 496 questions were evaluated by two reviewers, of which 280 were excluded. Of the 216 graded responses, 78.2% were comprehensive and correct, 9.3% were incomplete or partially correct, 10.2% were misleading or contained a mix of accurate and inaccurate information, and 2.3% were completely inaccurate (Figure 1). The percentage of correct answers was higher for newer procedures (Aquablation, Rezum, iTIND) than for older procedures (TURP, simple prostatectomy). The most common errors were lack of context or incorrect information (36.6%).
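As a consistency check on the figures above, the reported percentages can be converted back to whole-number counts. The abstract reports only percentages, so the counts produced below are implied by rounding, not reported values, and the variable names are illustrative.

```python
# Figures taken from the results paragraph.
TOTAL_QUESTIONS = 496
EXCLUDED = 280
N_GRADED = TOTAL_QUESTIONS - EXCLUDED  # 216 graded responses

reported_pct = {
    "comprehensive and correct": 78.2,
    "incomplete or partially correct": 9.3,
    "misleading or mixed": 10.2,
    "completely inaccurate": 2.3,
}

# Implied counts (an inference from the percentages, not reported data).
implied_counts = {
    label: round(pct * N_GRADED / 100) for label, pct in reported_pct.items()
}

# Round each implied count back to one decimal place for comparison.
recomputed_pct = {
    label: round(100 * count / N_GRADED, 1) for label, count in implied_counts.items()
}

for label, count in implied_counts.items():
    print(f"{label}: {count} of {N_GRADED} ({recomputed_pct[label]}%)")
```

Under this rounding, the implied counts sum to the 216 graded responses and reproduce each reported percentage to one decimal place, so the four categories are internally consistent.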

Figure 1: Percentage of answers in the four different grading categories divided by procedure type.
AQUA: Aquablation; G-PVP: greenlight photovaporisation of the prostate; LEP: laser enucleation of the prostate; Simple P: simple prostatectomy; TURP: transurethral resection of the prostate.

CONCLUSION

ChatGPT demonstrates promise in providing accurate post-operative guidance for patients undergoing benign prostatic hyperplasia surgeries. However, incomplete or misleading responses raise concerns about its current clinical applicability, emphasising the need for further research to enhance its accuracy and ensure patient safety.
