
VALIDATION OF GUIDELINE-ALIGNED LARGE LANGUAGE MODELS FOR SAFE CLINICAL DECISION MAKING IN DIGESTIVE DISEASES

GIUFFRÈ, MAURO
2026

Abstract

Large language models (LLMs) promise major gains for clinical decision support in gastroenterology and hepatology, but safe adoption requires more than clever prompting. This dissertation develops and validates a translational pipeline that (i) renders clinical guidance machine-readable, (ii) embeds expert oversight, and (iii) layers an automated safety stack. Across systematic evidence synthesis, expert-benchmarked tests, and society-curated question banks, engineered, guideline-grounded systems achieve expert-adjacent performance on defined tasks while preserving clear boundaries for human supervision. A systematic review of baseline LLM performance (Chapter 2) showed wide variation in accuracy (6.4–91.4%) across 18 studies, with a median near 50% for larger question sets, driven by inconsistent question design, evaluator expertise, and non-standard grading. Foundational models without domain adaptation pose unacceptable safety risks. Chapter 3 introduced a guideline-grounded retrieval-augmented generation (RAG) framework using European Association for the Study of the Liver (EASL) hepatitis C virus (HCV) guidelines. Ablations showed that converting guidelines into LLM-friendly formats (cleaning text, converting tables to structured lists, and enforcing schemas), combined with principled prompts, raised overall accuracy from 43% (baseline GPT-4 Turbo) to 99%; accuracy on table questions rose from 28% to 96%, and on clinical scenarios from 20% to 100%. For external validation (Chapter 4), international hepatology experts, including guideline authors, blindly graded multiple configurations. Supervised fine-tuning (SFT) combined with RAG achieved mean scores of 9.45/10 (authors) and 8.7/10 (independent clinicians), exceeding baselines (6.4/10, p<0.001). Temperatures of 0–0.8 and top-p values of 0–0.5 minimized hallucinations. On direct-acting antiviral (DAA) regimen selection, the median rate of fully correct recommendations rose from 24% to 76% (RAG-Top10).
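The grounding step described above (retrieve guideline passages, then constrain the model to answer only from them) can be sketched in a few lines. This is an illustrative toy, not the dissertation's pipeline: the token-overlap scorer stands in for a real retriever, and the guideline snippets, `build_prompt` template, and scoring rule are all hypothetical.

```python
# Toy sketch of a guideline-grounded RAG step. The real system uses EASL
# guideline documents, structured table-to-list conversion, and an LLM
# backend; here a word-overlap scorer and string templates stand in.

def tokenize(text: str) -> set:
    """Lowercase word-set tokenizer used for toy overlap scoring."""
    return set(text.lower().split())

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Rank guideline chunks by token overlap with the query; return top-k."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list) -> str:
    """Assemble a grounded prompt: instruction, retrieved excerpts, question."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        "Answer strictly from the guideline excerpts below.\n"
        f"Excerpts:\n{ctx}\n"
        f"Question: {query}"
    )

# Hypothetical guideline snippets for demonstration only.
chunks = [
    "Pan-genotypic DAA regimens are recommended for treatment-naive HCV patients.",
    "Ultrasound surveillance every six months is advised in cirrhosis.",
    "Vaccination against hepatitis A and B is recommended in chronic liver disease.",
]
query = "Which DAA regimen for a treatment-naive HCV patient?"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
```

Constraining the prompt to retrieved excerpts is what lets the schema and table-to-list conversions pay off: the model sees clean, structured context rather than raw guideline PDF text.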
To scale oversight (Chapter 5), the Expert-of-Experts Verification and Alignment (EVAL) framework used fine-tuned Contextualized Late Interaction over BERT (ColBERT) embeddings (Spearman ρ=0.91 with human judgment) for model-level ranking. A reward model trained on expert labels reproduced expert grades at 87.9% accuracy; with rejection sampling, accuracy reached 98% even in high-temperature regimes, enabling automated filtering. Chapter 6 tested generalization on 110 EASL Campus multiple-choice questions (MCQs). Optimized RAG+SFT achieved 87.6% accuracy, roughly 31 points above pooled physicians (56.9%, p<0.001), with gains across liver tumors (95.0% vs 50.7%), viral hepatitis (80.0% vs 55.1%), and cirrhosis (80.0% vs 58.2%). LLMs can approach expert-level performance on guideline-referenced tasks when knowledge is machine-encoded, expert judgment shapes representation and evaluation, and automated verification enforces continuous safety. The work offers a reproducible blueprint for evidence-based AI in gastroenterology while delineating limits that require ongoing human oversight.
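The EVAL-style filtering step (score each candidate answer with a reward model, discard those below a threshold) can be sketched as follows. The `reward` function here is a toy stand-in for the expert-trained reward model, and the marker strings, candidate answers, and threshold are hypothetical.

```python
# Toy sketch of rejection sampling with a reward model. In the actual
# framework the reward model is trained on expert grades; here a simple
# keyword heuristic stands in so the filtering logic is visible.

def reward(answer: str) -> float:
    """Toy reward: favor guideline-cited answers, penalize a
    hypothetical hallucination marker."""
    score = 0.0
    if "per EASL guidance" in answer:
        score += 1.0
    if "unsupported" in answer:
        score -= 1.0
    return score

def rejection_sample(candidates: list, threshold: float = 0.5) -> list:
    """Keep only candidates whose reward clears the threshold."""
    return [c for c in candidates if reward(c) >= threshold]

# Hypothetical high-temperature generations for one clinical question.
candidates = [
    "Sofosbuvir/velpatasvir for 12 weeks, per EASL guidance.",
    "An unsupported regimen with no guideline basis.",
    "Glecaprevir/pibrentasvir for 8 weeks, per EASL guidance.",
]
kept = rejection_sample(candidates)
```

Because filtering happens after generation, high-temperature sampling can still be used for diversity: weak or hallucinated candidates are simply rejected before any output reaches a clinician.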
28 Jan 2026
English
LLM; AI; Gastroenterology; RAG; SFT
CROCÈ SAVERIA LORY
BORTOLUSSI, LUCA
Università degli Studi di Trieste
Files in this item:
TESI PHD DECEMBER 20.pdf (open access; License: All rights reserved; 18.44 MB, Adobe PDF)
TESI PHD DECEMBER 20_1.pdf (open access; License: All rights reserved; 18.44 MB, Adobe PDF)

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357307
The NBN code of this thesis is URN:NBN:IT:UNITS-357307