Towards Socially Intelligent Robots: Multi-User Engagement and Multimodal Multi-Task Biometric Recognition

De Simone, Giuseppe
2026

Abstract

In recent years, the integration of social robots into everyday environments has grown substantially, driven primarily by the demand for more natural, intuitive, and effective forms of human–robot interaction (HRI). This trend reflects a broader shift toward robotic systems capable of understanding and responding to human social cues, thereby facilitating seamless collaboration and communication in real-world settings. In this context, this thesis addresses key challenges in social robotics by proposing a comprehensive framework that enhances HRI, with a particular focus on multi-user interaction, proactive robot behavior generation, and soft-biometric recognition through multimodal and multi-task learning strategies. First, we introduce a hardware-agnostic architecture based on the Robot Operating System (ROS), designed to support real-time interaction in dynamic, multi-user scenarios. The system integrates multimodal perception modules, including head pose estimation, active speaker detection, and direction-of-arrival analysis, together with reasoning components built upon behavior trees and finite state machines. This design addresses the multi-engagement problem, enabling robots to manage simultaneous interactions with multiple users. Furthermore, the framework is extended with an interactive game to support elderly assistance and enhance user engagement through entertainment. Second, the thesis presents a novel vision-language model (VLM)-based approach for generating context-aware sentences from video input. A dual-teacher knowledge distillation pipeline is proposed to automatically generate training data, combining the strengths of VLMs and large language models (LLMs). The resulting lightweight model is optimized for deployment on embedded platforms, enabling proactive and contextually relevant verbal behaviors in social robots. Third, the MAGNET architecture is introduced: a multimodal multi-task learning framework for soft-biometric estimation, capable of jointly recognizing gender and emotion from audio and video data. By employing soft parameter sharing, MAGNET achieves state-of-the-art performance while maintaining low computational overhead, making it suitable for real-time applications. Extensive quantitative and qualitative evaluations confirm the effectiveness of the proposed methods in improving the naturalness, responsiveness, and social acceptability of human–robot interactions. This thesis contributes a scalable and efficient solution for deploying socially intelligent robots in real-world environments.
19 March 2026
English
SAGGESE, Alessia
Università degli Studi di Salerno
Files in this record:

abstract_1.pdf
Access: open access
License: All rights reserved
Size: 52.59 kB
Format: Adobe PDF

PhD_Thesis-De Simone_1.pdf
Access: open access
License: All rights reserved
Size: 10.11 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/361447
The NBN code of this thesis is URN:NBN:IT:UNISA-361447