Towards Socially Intelligent Robots: Multi-User Engagement and Multimodal Multi-Task Biometric Recognition
De Simone, Giuseppe
2026
Abstract
In recent years, social robots have been increasingly integrated into everyday environments, driven primarily by the demand for more natural, intuitive, and effective forms of human–robot interaction (HRI). This trend reflects a broader shift toward robotic systems capable of understanding and responding to human social cues, thereby facilitating seamless collaboration and communication in real-world settings. In this context, this thesis addresses key challenges in social robotics by proposing a comprehensive framework that enhances HRI, with a particular focus on multi-user interaction, proactive robot behavior generation, and soft-biometric recognition through multimodal and multi-task learning strategies.
First, the thesis introduces a hardware-agnostic architecture based on the Robot Operating System (ROS), designed to support real-time interaction in dynamic, multi-user scenarios. The system integrates multimodal perception modules, including head pose estimation, active speaker detection, and direction-of-arrival analysis, together with reasoning components built upon behavior trees and finite state machines. This design addresses the multi-engagement problem, enabling robots to manage simultaneous interactions with multiple users. The framework was also extended with an interactive game to support elderly assistance and enhance user engagement through entertainment.
Second, the thesis presents a novel vision-language model (VLM)-based approach for generating context-aware sentences from video input. A dual-teacher knowledge distillation pipeline is proposed to automatically generate training data, combining the strengths of VLMs and large language models (LLMs). The resulting lightweight model is optimized for deployment on embedded platforms, enabling proactive and contextually relevant verbal behaviors in social robots.
Third, the thesis introduces the MAGNET architecture, a multimodal multi-task learning framework for soft-biometric estimation capable of jointly recognizing gender and emotion from audio and video data. By employing soft parameter sharing, MAGNET achieves state-of-the-art performance while maintaining low computational overhead, making it suitable for real-time applications.
Extensive quantitative and qualitative evaluations confirm the effectiveness of the proposed methods in improving the naturalness, responsiveness, and social acceptability of human–robot interactions. This thesis contributes a scalable and efficient solution for deploying socially intelligent robots in real-world environments.

| File | Size | Format | |
|---|---|---|---|
| abstract_1.pdf (open access; license: all rights reserved) | 52.59 kB | Adobe PDF | View/Open |
| PhD_Thesis-De Simone_1.pdf (open access; license: all rights reserved) | 10.11 MB | Adobe PDF | View/Open |

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/361447
URN:NBN:IT:UNISA-361447