Personalized Medicine and Process Optimization: Analysis and Implementation of Intelligent Tools to Support the Clinical Process
WEHBE, ALAA
2026
Abstract
Artificial Intelligence (AI) has profoundly transformed medical imaging, enabling new frontiers in disease detection, staging, and diagnostic decision support. Yet, despite remarkable progress, most AI tools remain siloed, disconnected from Picture Archiving and Communication Systems (PACS) and Radiology Information Systems (RIS), and often constrained by closed vocabularies that limit interpretability and clinical integration. This thesis addresses these challenges by designing a sequence of retrieval-driven, explainable AI frameworks that progressively bridge detection, retrieval, and multimodal reasoning for medical imaging, with a specific focus on lung cancer and thoracic radiography.

In the first stage, lung cancer is adopted as a case study for closed-vocabulary analysis. A 3D volumetric pipeline was developed using the YOLOv8 architecture for simultaneous detection and subtype classification of Adenocarcinoma (ADC), Squamous Cell Carcinoma (SCC), and Small Cell Lung Cancer (SCLC) on CT scans. Experimental results demonstrated a mean Average Precision (mAP) of 97.1%, with the YOLOv8-Small variant achieving 96.1% precision, a recall of 0.91, and a detection speed of 0.22 seconds, surpassing both the two-stage Faster R-CNN and the single-stage YOLOv7. The extracted features were subsequently used to train a custom TNMClassifier, achieving an overall accuracy of 98% in staging classification. This integration of subtype and stage detection established a strong foundation for clinically relevant representation learning.

Building upon these embeddings, the second stage introduced a Content-Based Image Medical Retrieval (CBIMR) system that leverages YOLOv8-derived features to retrieve clinically similar cases across cancer groups and TNM stages. The retrieval framework achieved a precision of 0.961, a recall of 0.945, and an mAP@0.5 of 0.971, effectively linking the detection and retrieval pipelines.
This detection-driven CBIR model demonstrated how region-aware embeddings can enhance interpretability, consistency, and case-based reasoning, which are key prerequisites for integration into PACS/RIS environments.

The final stage transitions toward open-vocabulary and multimodal learning through the introduction of MedFL (Medical Florence), a unified framework for medical image retrieval and report generation. Built upon the Florence-2 Vision–Language Model (VLM), MedFL employs parameter-efficient fine-tuning via LoRA on the MIMIC-CXR dataset and introduces a novel prompt-conditioned feature fusion strategy combining three complementary representations: CAPTION, DETAILED_CAPTION, and OD. The fused 2304-dimensional embeddings enable cross-modal retrieval and automated report generation within a unified space. Extensive experiments demonstrate that MedFL achieves Recall@10 = 0.930 for retrieval and BLEU-4 = 0.3363 for report generation, outperforming strong baselines such as CLIP, BioMedCLIP, and GatorTron-CLIP. Furthermore, MedFL attains mAP = 0.35 at 0.5 IoU on VinDr-CXR, confirming robust localization and visual–textual alignment.

Collectively, this thesis advances the paradigm of AI in medical imaging from detection to understanding, evolving from closed-vocabulary CT-based detection to open-vocabulary, multimodal reasoning in radiology. By integrating visual and textual representations into a unified, explainable, and retrieval-driven architecture, the proposed frameworks lay the groundwork for clinically deployable systems that can assist physicians in case retrieval, diagnostic interpretation, and automated reporting, ultimately supporting more transparent, personalized, and efficient medical decision-making.

Keywords: Medical imaging, Content-Based Image Retrieval (CBIR), YOLOv8, TNM staging, Vision–Language Models, MedFL, Multimodal learning, Florence-2, Explainable AI, Radiology
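The 2304-dimensional fused embedding described in the abstract is consistent with concatenating three 768-dimensional prompt-conditioned feature vectors (one per prompt: CAPTION, DETAILED_CAPTION, OD). A minimal sketch of such fusion and cosine-similarity retrieval follows; the per-view dimensionality, the function names, and the normalize-then-concatenate design are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def fuse_prompt_features(f_caption, f_detailed, f_od):
    """Fuse three prompt-conditioned feature vectors into one embedding.

    Each view is L2-normalized so no single prompt dominates, then the
    views are concatenated. With 768-dim inputs (an assumption), the
    fused vector is 3 * 768 = 2304-dimensional, matching the abstract.
    """
    parts = []
    for f in (f_caption, f_detailed, f_od):
        f = np.asarray(f, dtype=np.float64)
        parts.append(f / (np.linalg.norm(f) + 1e-12))
    return np.concatenate(parts)

def retrieve_top_k(query, gallery, k=10):
    """Rank gallery embeddings by cosine similarity to the query.

    Returns the indices of the k most similar cases, e.g. for
    computing Recall@10 over a retrieval set.
    """
    q = query / (np.linalg.norm(query) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    sims = g @ q  # cosine similarity per gallery row
    return np.argsort(-sims)[:k]
```

In this shared-space view, both images and generated report text would be embedded with the same fusion step, so the same `retrieve_top_k` ranking serves image-to-image and cross-modal queries.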
| File | Size | Format | |
|---|---|---|---|
| phdunige_5509991.pdf (open access; License: All rights reserved) | 7.01 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/362460
URN:NBN:IT:UNIGE-362460