Multimodal Culture-Aware Gesture Generation for Social Robots: Combining Semantic Similarity with Generative Models

GJACI, ARIEL
2025

Abstract

As social robots become integrated into daily life, optimizing their interactions with humans is crucial to enhancing their acceptance. Cultural differences significantly influence verbal and non-verbal human-robot communication, yet embedding culturally adaptive behaviors, such as co-speech gestures, remains largely unexplored. This thesis addresses the challenge of generating culture-aware co-speech gestures for social robots by introducing two novel approaches: a rule-based method leveraging semantic similarity scores and a data-driven model based on hierarchical diffusion transformers. Both approaches are designed to be computationally efficient and capable of generating gestures in real time, making them practical for social robots with limited computational resources. Furthermore, only upper-body movements are considered, which require minimal or no hand movement and can therefore be easily reproduced by humanoid robots, ensuring broader applicability across different robotic platforms.

The rule-based approach employs algorithms that identify contextually relevant words associated with Symbolic and Deictic gestures within sentences. Three algorithms are proposed: the first compares sentences that heuristically represent the context in which a set of gestures is produced against a fixed number of words in the objective sentence, the second allows the number of words to vary, and the third relies on a statistical analysis of participant-labeled sentences without using semantic similarity scores. Evaluation using Average Precision (AP), Intersection over Union (IOU), and Average Computational Time (ACT) demonstrates that the semantic-based algorithms outperform the statistics-based algorithm, with the variable-word approach achieving the highest performance despite increased computational demands.

The data-driven model, designed to map multimodal speech features to gestures while incorporating cultural components, employs an autoregressive architecture built on hierarchical diffusion transformers. It generates four seconds of motion-related encodings from noisy motion, audio, text, cultural embeddings, and one second of previous motion context; the motion encodings are then decoded by a pre-trained vector-quantized variational autoencoder. The model leverages two multi-head attention blocks to capture relationships within motion features and between motion and low-level audio features (e.g., rhythm and emphasis), while adaptive instance normalization layers condition the motion style on speech semantics and cultural context. Preliminary objective results indicate that the model produces realistic, culturally adaptive co-speech gestures in real time from these inputs.

To efficiently embed cultural components and better understand the impact of culture on the data, extensive studies were conducted across multimodal datasets, beginning with publicly available resources and progressing to a custom-built dataset. A high-level analysis of the existing LISI-HHI multimodal interaction dataset revealed cultural differences in textual and gestural features. Culture classification using Fully Connected Neural Networks (FCNNs) and Random Forest (RF) models achieved high accuracy with subject-dependent data splits but struggled to generalize to unseen speakers in subject-independent splits. Although adversarial learning improved speaker-invariant representations to some extent, the dataset's limited speaker count constrained further generalization.
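
To make the subject-dependent versus subject-independent evaluation just mentioned concrete, the following minimal Python sketch holds out whole speakers so that none appears in both training and test sets. The feature matrix, split ratio, and Random Forest settings are illustrative assumptions, not the pipeline used in the thesis.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupShuffleSplit

def subject_independent_accuracy(X, culture_labels, speaker_ids, seed=0):
    # Group by speaker so every sample of a given speaker lands on one side of
    # the split (subject-independent); a plain ShuffleSplit that ignores speaker
    # identity would correspond to the subject-dependent setting.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, culture_labels, groups=speaker_ids))
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X[train_idx], culture_labels[train_idx])
    return accuracy_score(culture_labels[test_idx], clf.predict(X[test_idx]))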
To overcome these limitations, the TED4C-L dataset was developed, comprising 737 speakers from four distinct cultures, extracted from YouTube TED Talks. TED4C-L offers a diverse, speaker-balanced, and multilingual collection that facilitates enhanced cultural representation learning. Repeating on TED4C-L the analysis conducted on the LISI-HHI dataset revealed significant improvements in classification accuracy, and subject-independent cultural representations derived from TED4C-L were subsequently embedded into the data-driven model to enhance its performance. Together, these efforts contribute to a robust framework for generating culture-aware co-speech gestures, paving the way for more effective and culturally adaptive human-robot interactions.
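
As a sketch of how a cultural embedding can be injected into a gesture generator through adaptive instance normalization, as described for the data-driven model above, the minimal PyTorch module below maps a culture vector to per-channel scale and shift parameters that modulate normalized motion features. The dimensions, normalization variant, and layer names are illustrative assumptions rather than the thesis architecture.

import torch
import torch.nn as nn

class CultureAdaIN(nn.Module):
    # Adaptive-instance-normalization-style conditioning: a cultural embedding
    # produces scale and shift parameters applied to normalized motion features.
    def __init__(self, motion_dim, culture_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(motion_dim, affine=False)
        self.to_scale_shift = nn.Linear(culture_dim, 2 * motion_dim)

    def forward(self, motion, culture_emb):
        # motion: (batch, motion_dim, time); culture_emb: (batch, culture_dim)
        scale, shift = self.to_scale_shift(culture_emb).chunk(2, dim=-1)
        normalized = self.norm(motion)
        return normalized * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)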
19 May 2025
English
SGORBISSA, ANTONIO
MASSOBRIO, PAOLO
Università degli studi di Genova
Files in this record:
phdunige_4075095.pdf (Adobe PDF, 19.44 MB, open access)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/209834
The NBN code of this thesis is URN:NBN:IT:UNIGE-209834