
VALIDATION OF GUIDELINE-ALIGNED LARGE LANGUAGE MODELS FOR SAFE CLINICAL DECISION MAKING IN DIGESTIVE DISEASES

GIUFFRÈ, MAURO
2026

Abstract

Large language models (LLMs) promise major gains for clinical decision support in gastroenterology and hepatology, but safe adoption requires more than clever prompting. This dissertation develops and validates a translational pipeline that (i) renders clinical guidance machine-readable, (ii) embeds expert oversight, and (iii) layers an automated safety stack. Across systematic evidence synthesis, expert-benchmarked tests, and society-curated question banks, engineered, guideline-grounded systems achieve expert-adjacent performance on defined tasks while preserving clear boundaries for human supervision. A systematic review of baseline LLM performance (Chapter 2) showed wide variation in accuracy (6.4–91.4%) across 18 studies, with a median near 50% for larger question sets, driven by inconsistent question design, evaluator expertise, and non-standard grading. Foundational models without domain adaptation pose unacceptable safety risks. Chapter 3 introduced a guideline-grounded retrieval-augmented generation (RAG) framework using European Association for the Study of the Liver (EASL) hepatitis C virus (HCV) guidelines. Ablations showed that converting guidelines into LLM-friendly formats (cleaning text, converting tables to structured lists, and enforcing schemas), combined with principled prompts, raised overall accuracy from 43% (baseline GPT-4 Turbo) to 99%; accuracy on table questions rose from 28% to 96%, and on clinical scenarios from 20% to 100%. For external validation (Chapter 4), international hepatology experts, including guideline authors, blindly graded multiple configurations. Supervised fine-tuning (SFT) combined with RAG achieved mean scores of 9.45/10 (authors) and 8.7/10 (independent clinicians), exceeding baselines (6.4/10, p<0.001). Temperatures of 0–0.8 and top-p values of 0–0.5 minimized hallucinations. On direct-acting antiviral (DAA) regimen selection, the median rate of fully correct recommendations rose from 24% to 76% (RAG-Top10).
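The grounding step described above (retrieve guideline passages, then constrain the model to answer only from them) can be sketched in a few lines. This is an illustrative toy, not the dissertation's pipeline: the token-overlap scorer stands in for a real retriever, and the guideline snippets, `build_prompt` template, and scoring rule are all hypothetical.

```python
# Toy sketch of a guideline-grounded RAG step. The real system uses EASL
# guideline documents, structured table-to-list conversion, and an LLM
# backend; here a word-overlap scorer and string templates stand in.

def tokenize(text: str) -> set:
    """Lowercase word-set tokenizer used for toy overlap scoring."""
    return set(text.lower().split())

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Rank guideline chunks by token overlap with the query; return top-k."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list) -> str:
    """Assemble a grounded prompt: instruction, retrieved excerpts, question."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        "Answer strictly from the guideline excerpts below.\n"
        f"Excerpts:\n{ctx}\n"
        f"Question: {query}"
    )

# Hypothetical guideline snippets for demonstration only.
chunks = [
    "Pan-genotypic DAA regimens are recommended for treatment-naive HCV patients.",
    "Ultrasound surveillance every six months is advised in cirrhosis.",
    "Vaccination against hepatitis A and B is recommended in chronic liver disease.",
]
query = "Which DAA regimen for a treatment-naive HCV patient?"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
```

Constraining the prompt to retrieved excerpts is what lets the schema and table-to-list conversions pay off: the model sees clean, structured context rather than raw guideline PDF text.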
To scale oversight (Chapter 5), the Expert-of-Experts Verification and Alignment (EVAL) framework used fine-tuned Contextualized Late Interaction over BERT (ColBERT) embeddings (Spearman ρ=0.91 with human judgment) for model-level ranking. A reward model trained on expert labels reproduced expert grades at 87.9% accuracy; with rejection sampling, accuracy reached 98% even in high-temperature regimes, enabling automated filtering. Chapter 6 tested generalization on 110 EASL Campus multiple-choice questions (MCQs). Optimized RAG+SFT achieved 87.6% accuracy, roughly 31 points above pooled physicians (56.9%, p<0.001), with gains across liver tumors (95.0% vs 50.7%), viral hepatitis (80.0% vs 55.1%), and cirrhosis (80.0% vs 58.2%). LLMs can approach expert-level performance on guideline-referenced tasks when knowledge is machine-encoded, expert judgment shapes representation and evaluation, and automated verification enforces continuous safety. The work offers a reproducible blueprint for evidence-based AI in gastroenterology while delineating limits that require ongoing human oversight.
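The EVAL-style filtering step (score each candidate answer with a reward model, discard those below a threshold) can be sketched as follows. The `reward` function here is a toy stand-in for the expert-trained reward model, and the marker strings, candidate answers, and threshold are hypothetical.

```python
# Toy sketch of rejection sampling with a reward model. In the actual
# framework the reward model is trained on expert grades; here a simple
# keyword heuristic stands in so the filtering logic is visible.

def reward(answer: str) -> float:
    """Toy reward: favor guideline-cited answers, penalize a
    hypothetical hallucination marker."""
    score = 0.0
    if "per EASL guidance" in answer:
        score += 1.0
    if "unsupported" in answer:
        score -= 1.0
    return score

def rejection_sample(candidates: list, threshold: float = 0.5) -> list:
    """Keep only candidates whose reward clears the threshold."""
    return [c for c in candidates if reward(c) >= threshold]

# Hypothetical high-temperature generations for one clinical question.
candidates = [
    "Sofosbuvir/velpatasvir for 12 weeks, per EASL guidance.",
    "An unsupported regimen with no guideline basis.",
    "Glecaprevir/pibrentasvir for 8 weeks, per EASL guidance.",
]
kept = rejection_sample(candidates)
```

Because filtering happens after generation, high-temperature sampling can still be used for diversity: weak or hallucinated candidates are simply rejected before any output reaches a clinician.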
28 Jan 2026
English
LLM; AI; Gastroenterology; RAG; SFT
CROCÈ SAVERIA LORY
BORTOLUSSI, LUCA
Università degli Studi di Trieste
Files in this item:
TESI PHD DECEMBER 20.pdf (open access; License: All rights reserved; 18.44 MB, Adobe PDF)
TESI PHD DECEMBER 20_1.pdf (open access; License: All rights reserved; 18.44 MB, Adobe PDF)

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357307
The NBN code of this thesis is URN:NBN:IT:UNITS-357307