Co-occurrence of facial gestures and vocalisations across primates
CARUGATI, FILIPPO
2025
Abstract
Multimodal communication, understood as the sequential or simultaneous integration of signals across different communicative modalities, is a key aspect of primate social behaviour. Among these modalities, the interplay between facial gestures and vocal emissions has long been hypothesised to be relevant for the evolutionary emergence of complex communication systems, including human language. Nevertheless, empirical studies investigating the co-occurrence of signals across modalities remain scarce, partly because of technical limitations in integrating data derived from different methodological approaches. While continuous, quantitative measures prevail in acoustic research, studies of facial gestures still rely heavily on discrete, categorical labels derived from labour-intensive manual annotation of video footage. The present dissertation addresses this gap through a comparative investigation of facial–vocal integration in different primate species. In particular, I investigated whether facial configurations differ reliably when they co-occur with vocal emissions, and whether facial gestures are associated with distinct behavioural contexts. Moreover, by examining visual information on oro-facial configurations in conjunction with spectral measurements from simultaneously recorded audio, I investigated the extent of vocal tract tuning, a process in which articulatory adjustments modify the shape of the vocal tract to enhance sound intensity. To answer these questions, I collected video and audio recordings from four primate species that differ in phylogenetic distance and vocal repertoire: indri (Indri indri), diademed sifaka (Propithecus diadema), yellow-cheeked crested gibbon (Nomascus gabriellae), and cotton-top tamarin (Saguinus oedipus). Data were collected in both in situ and ex situ settings. I used deep learning–based software to develop species-specific models that predict facial keypoints from video footage; the keypoints were then transformed into descriptors of facial configuration. Applying three machine learning classifiers, I demonstrated that voiced facial gestures differ from unvoiced ones across all examined species, suggesting that vocal emissions are associated with distinctive facial configurations. These findings represent an important step toward understanding the evolutionary origins of multimodal integration between facial expressions and vocal production. Following the same analytical workflow, I found that the facial configurations of cotton-top tamarins exhibit varying degrees of context specificity, with the configurations associated with yawning, social activity, and resting being the most distinctive. Finally, I found a positive correlation between mouth opening and fundamental frequency in the singing behaviour of indris, supporting the hypothesis that indris can tune their supralaryngeal vocal tract to enhance sound intensity. By integrating comparative data with methodological innovation, this dissertation proposes a novel framework for studying multimodal communication in nonhuman primates. The application of machine learning–based techniques automated the extraction of information from video and yielded continuous parameters describing facial gestures.
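To make this workflow concrete, the sketch below shows one way such a pipeline could look in Python: predicted keypoint coordinates are converted into a continuous descriptor (here, mouth opening normalised by inter-eye distance) and a classifier is cross-validated on voiced versus unvoiced frames. This is an illustrative sketch only; the file name, column names, and the random-forest classifier are assumptions, not the models or features used in the dissertation.

```python
# Illustrative sketch (not the thesis code): keypoints -> continuous descriptor
# -> voiced/unvoiced classification. File and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mouth_opening(df):
    """Distance between upper- and lower-lip keypoints, normalised by
    inter-eye distance to control for subject-to-camera distance."""
    lips = np.hypot(df["lip_upper_x"] - df["lip_lower_x"],
                    df["lip_upper_y"] - df["lip_lower_y"])
    eyes = np.hypot(df["eye_left_x"] - df["eye_right_x"],
                    df["eye_left_y"] - df["eye_right_y"])
    return lips / eyes

# One row per video frame: keypoint coordinates plus a voiced/unvoiced label.
frames = pd.read_csv("keypoints_per_frame.csv")          # hypothetical file
frames["mouth_opening"] = mouth_opening(frames)

X = frames[["mouth_opening"]]                             # add further descriptors as needed
y = frames["voiced"]                                      # 1 = frame co-occurs with a vocalisation

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```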
In particular, the ability to quantify facial parameters, such as the degree of mouth opening, facilitated alignment with acoustic data, making it possible to examine the articulatory mechanisms involved in vocal production directly from video recordings. Beyond its methodological contribution, the dissertation's findings collectively support the view that facial–vocal integration plays a crucial role in primate communication, opening new avenues for understanding the evolutionary roots of human language.
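As an illustration of how a video-derived articulatory measure can be aligned with acoustic data, the sketch below correlates a per-frame mouth-opening series with the fundamental frequency of simultaneously recorded audio. It is a minimal example assuming a 25 fps frame rate, hypothetical file names, and Parselmouth (Praat) pitch tracking; it is not the analysis pipeline used in the thesis.

```python
# Illustrative sketch: align a per-frame mouth-opening series with the F0 track
# of the simultaneous audio and test their association. File names and the
# 25 fps frame rate are assumptions.
import numpy as np
import parselmouth
from scipy.stats import spearmanr

fps = 25.0
mouth = np.load("mouth_opening_per_frame.npy")            # one value per video frame
frame_times = np.arange(len(mouth)) / fps                 # timestamps of the video frames

snd = parselmouth.Sound("indri_song.wav")
pitch = snd.to_pitch(time_step=1.0 / fps)                 # F0 track sampled on the same grid
f0 = np.array([pitch.get_value_at_time(t) for t in frame_times])

voiced = ~np.isnan(f0)                                    # keep only frames where F0 is defined
rho, p = spearmanr(mouth[voiced], f0[voiced])
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")
```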
| File | Access | Size | Format |
|---|---|---|---|
| Thesis_PhD_Filippo_Carugati.pdf | open access | 2.27 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/219064
URN:NBN:IT:UNITO-219064