Automatic Speech Recognition system for people with impaired speech: development of an Italian Dysarthric Speech database and a new Speech Analysis Technique to improve speech recognition performance

MARINI, MARCO
2022

Abstract

Artificial Intelligence (AI) is now widely used to support both established and emerging tasks. Although it requires additional expertise, AI can be embedded in many scenarios, and doing so is often recommended because its versatility tends to improve the final result. One such case is Automatic Speech Recognition (ASR). In general terms, ASR is a process that analyses a voice signal and generates its most likely transcription. It is a challenging task because every speaker talks differently, even among people who share the same culture and region. In the last few years, ASR has become one of the most important research fields owing to the growing demand for hands-free interfaces. This technology helps people interact with smart devices in critical situations where their hands are occupied by another task. Driving is one such case: the user may need to interact with a smart device (e.g., a smartphone) without using their hands, so as to remain safe. However, what happens if a user has impaired speech? Can ASR technology deal with this issue? Dysarthria is a speech disorder caused by impaired neurological function, motor control, and/or speech articulators. These speech impairments can result from acquired brain or spinal cord injuries (e.g., stroke), from congenital and neurodegenerative diseases, and from age-related neurological decline. Within the field of ASR, processing dysarthric speech is a challenge because standard approaches are ineffective in the presence of dysarthria. As a result, users with such speech disorders cannot benefit from this kind of technology. Since ASR technology is based on statistical analysis and neural networks, the first step towards improving speech recognition performance for dysarthric speakers is to create dysarthric speech databases.
When we started our research, no Italian dysarthric speech database was available. Our first aim was therefore to develop the first Italian Dysarthric Speech database (IDEA), in partnership with three Italian medical facilities. For this purpose, we developed a dedicated PC tool named RECORDIA, which guides doctors and caregivers through patient characterization and speech recording procedures. All the data collected by our partners with RECORDIA are stored on an online server located at the University of Pisa and accessible from anywhere in the world. After the Italian data collection, we processed the data with well-known ASR technologies in order to evaluate their quality, and we compared our results with those we obtained on existing English databases. The standard ASR technologies considered are hybrid systems that combine a Hidden Markov Model with either a Gaussian Mixture Model (GMM-HMM) or a Deep Neural Network (DNN-HMM). We also adopted a new feature extraction technique for dysarthric speakers, which tunes the window and shift parameters of the Short-Time Fourier Transform according to the way a user speaks. Several experiments were carried out using audio contributions from 45 people with speech and communication disorders and 10 speakers without speech disorders. We then compared the performance of an ASR system built with the Kaldi toolkit when using standard speech processing against its performance when using our proposal. The proposed approach was found to be very effective for people with medium and high levels of dysarthria, improving ASR performance. An additional study was carried out to analyse the possible correlation between the new window and shift parameters and certain vocal characteristics of the subjects studied.
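
To make the role of the window and shift parameters concrete, the following is a minimal sketch of a magnitude Short-Time Fourier Transform in which both values are exposed in milliseconds. It is not the thesis's feature extraction pipeline: the function name `stft_magnitude`, the Hamming window, the synthetic test tone, and the 50 ms / 20 ms alternative setting are assumptions chosen for illustration. The 25 ms / 10 ms defaults are the values commonly used in standard ASR front ends.

```python
import numpy as np


def stft_magnitude(signal, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Magnitude STFT with the analysis window and shift given in milliseconds.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * shift_ms / 1000.0))
    if win_len <= 0 or hop_len <= 0:
        raise ValueError("window_ms and shift_ms must be positive")
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    # Number of complete frames that fit in the signal (no padding).
    n_frames = 1 + (len(signal) - win_len) // hop_len
    window = np.hamming(win_len)

    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(n_frames)
    ])

    # One row per frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a synthetic 200 Hz tone at 16 kHz (a stand-in for real speech).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 200.0 * t)

# Common defaults: 25 ms window, 10 ms shift.
default_spec = stft_magnitude(tone, sample_rate)

# Hypothetical longer window and shift, as might suit a slower speaker.
slow_spec = stft_magnitude(tone, sample_rate, window_ms=50.0, shift_ms=20.0)

print(default_spec.shape, slow_spec.shape)
```

A longer window gives finer frequency resolution but fewer, coarser time frames, which is the trade-off a per-speaker tuning of these two parameters would be exploring.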
1-Jun-2022
Italian
Artificial Intelligence
Automatic Speech Recognition
Database
Dysarthria
Kaldi
Fanucci, Luca
Files in this item:
Marco_Marini_PhD_Thesis_final_version.pdf (Adobe PDF, 6.8 MB, Open Access from 22/06/2025)
PhD_Report_Marco_Marini.pdf (Adobe PDF, 343.59 kB, Open Access from 22/06/2025)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/216163
The NBN code of this thesis is URN:NBN:IT:UNIPI-216163