A MACHINE LEARNING ENHANCED FRAMEWORK FOR NAVIGATING THE DIGITAL SPACE: RESULTS FROM THE CASE OF FINFLUENCERS
DE MATTEO, FRANCESCO
2026
Abstract
This thesis demonstrates how machine-learning-based web data extraction can serve as both a methodological foundation and a research enabler for computational social science, particularly in domains shaped by multimodal, high-velocity content. Building on a modular and reproducible architecture, the work integrates automated web scraping, transformer-based language models, and advanced object detection to collect and analyze large-scale social media data. It argues for the broader adoption of these tools within the social sciences, highlighting their growing accessibility and alignment with ethical research standards. Using this infrastructure, the thesis turns to the case of financial influencers (finfluencers), an emergent and under-theorized category of digital actors who communicate financial content on platforms such as Instagram. A comprehensive desk review maps the definitional contours, content strategies, monetization models, regulatory tensions, and ethical risks associated with finfluencer activity. This review establishes the conceptual grounding required for empirical analysis and identifies key research gaps in the literature. The final section presents two interlinked experiments. First, a large-scale thematic analysis of 22,854 Instagram captions, conducted with an LLM-assisted coding approach, reveals 36 dominant content themes. Second, a predictive model based on random forest classifiers tests whether visual cues from post images, extracted with the YOLOv11 object detection model, can predict the thematic category assigned to the corresponding captions. Results indicate a strong correspondence between visual and textual elements, offering new pathways for analyzing digital content where text is limited or absent.

| File | Access | License | Size | Format |
|---|---|---|---|---|
| Francesco De Matteo THESIS REVIEWED.pdf | Open access | All rights reserved | 1.41 MB | Adobe PDF |

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/361617
URN:NBN:IT:IULM-361617