
PERCEPTION, PRODUCTION AND INTERACTION: A COMPREHENSIVE INVESTIGATION ON HUMAN SPEECH

PASTORE, ALDO
2022-07-01T00:00:00+02:00

Abstract

In the present thesis, I focus on the three main approaches to speech research: perception, production, and interaction. Speech is a complex mechanism, and it was originally considered to be organized into two independent “streams”: the ventral stream for perception and the dorsal stream for production. Within the dorsal stream, a central speech-processing hub is Broca’s area, whose role is still under investigation. Its crucial importance for speech production has been demonstrated by means of electrical stimulation, which induces the complete arrest of ongoing speech. During my work, I designed a machine learning algorithm that, exploiting high-gamma activity recorded from the speech-arrest area, successfully predicts speech onset in patients. The algorithm represents an important step towards the development of speech neuroprostheses. Indeed, in patients with communication impairments, such as locked-in syndrome, decoding the intention to speak (and subsequently the covert speech itself) represents the only way to restore their connection with the external world. Broca’s area is, as mentioned, a fundamental component of the dorsal stream, where it is putatively directly connected to the part of the motor cortex governing the mouth articulators. On the other hand, past research excluded motor regions from the ventral stream, and thus from speech perception. In the last two decades, however, active listening theories have become prominent in speech perception modelling. Indeed, several studies have demonstrated the activation of motor areas during listening. In my thesis, I investigated the role of the motor system in speech perception by implementing an experiment in which subjects listened to sentences without having access to the mouth kinematics. During listening, neural tracking of speech by auditory cortical areas is enabled by brain entrainment to the speech signal.
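To make the onset-prediction idea concrete, the following is a minimal sketch of decoding an upcoming speech onset from per-electrode high-gamma band power. The data here are synthetic and the decoder (a least-squares linear classifier), the channel count, and the choice of "speech-arrest" channels are all illustrative assumptions, not the algorithm used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_channels = 200, 8

# Synthetic high-gamma (~70-150 Hz) band-power features per electrode.
# Pre-onset trials carry elevated power on a subset of "speech-arrest"
# electrodes (channels 0-2 here are a purely illustrative choice).
rest = rng.normal(0.0, 1.0, (n_trials, n_channels))
pre_onset = rng.normal(0.0, 1.0, (n_trials, n_channels))
pre_onset[:, :3] += 1.5

X = np.vstack([rest, pre_onset])
y = np.r_[-np.ones(n_trials), np.ones(n_trials)]  # -1 = rest, +1 = pre-onset

# Least-squares linear classifier (with bias term) as a stand-in decoder.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A[::2], y[::2], rcond=None)  # train on even trials
pred = np.sign(A[1::2] @ w)                          # test on odd trials
accuracy = np.mean(pred == y[1::2])
print(f"held-out accuracy: {accuracy:.2f}")
```

A real decoder would additionally handle filtering, artifact rejection, and causal (online) feature extraction, but the train/test split on labeled pre-onset windows captures the core prediction problem.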
Moreover, visual and kinematic speech-related information has also been shown to be tracked when available. In this work, for the first time, neural tracking of unavailable mouth kinematics is investigated by employing a novel information-theoretic approach: Partial Information Decomposition (PID). Through this method, the movement of the speaker’s tongue was observed to be encoded by motor regions, suggesting an ongoing kinematic simulation performed by motor areas. Additionally, information about a synergistic interaction between the speech signal and tongue kinematics was present. This result could imply the existence of an integration mechanism between acoustic and kinematic inputs. This evidence supports active listening theories of speech perception and highlights the possibility of a reconstructive, top-down, motor-driven mechanism that enhances speech comprehension. Speech comprehension is crucial when people interact verbally. To maximize mutual understanding during verbal interaction, speakers rely on a mechanism called convergence. Convergence is a multi-modal process which, at the acoustic level, entails the shift of acoustic features towards a common point. However, a mathematical definition of this complex mechanism is still missing. To address this question, I designed a deep learning model based on a Siamese architecture and Recurrent Neural Networks. The proposed model is tuned to learn temporal dependencies and to produce a similarity index between pairs of speech streams. When tested in a sentence-independence scenario, the Siamese model succeeded in computing the distance between voices. In contrast, its performance in the speaker-independence scenario still needs to be improved.
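Since PID may be unfamiliar, the following is a minimal self-contained sketch of the two-source decomposition for discrete variables, using Williams and Beer's original I_min redundancy measure. The thesis analysis operates on continuous neural and kinematic signals, so this discrete estimator is only illustrative of how joint mutual information splits into redundant, unique, and synergistic parts:

```python
import numpy as np
from collections import Counter

def pid_williams_beer(s1, s2, t):
    """Two-source PID (Williams & Beer's I_min) for discrete variables,
    estimated from empirical joint frequencies."""
    n = len(t)
    p = {k: v / n for k, v in Counter(zip(s1, s2, t)).items()}

    def marginal(idx):
        out = {}
        for k, v in p.items():
            key = tuple(k[i] for i in idx)
            out[key] = out.get(key, 0.0) + v
        return out

    p_t = marginal((2,))

    def specific_info(src):
        # I(S; T=t) = sum_s p(s|t) * log2( p(s,t) / (p(s) p(t)) )
        p_s = marginal((src,))
        p_st = marginal((src, 2))
        info = {tv: 0.0 for (tv,) in p_t}
        for (sv, tv), pst in p_st.items():
            pt = p_t[(tv,)]
            info[tv] += (pst / pt) * np.log2(pst / (p_s[(sv,)] * pt))
        return info

    i1, i2 = specific_info(0), specific_info(1)
    # Redundancy: expected minimum of the source-specific informations.
    red = sum(pt * min(i1[tv], i2[tv]) for (tv,), pt in p_t.items())
    mi1 = sum(pt * i1[tv] for (tv,), pt in p_t.items())
    mi2 = sum(pt * i2[tv] for (tv,), pt in p_t.items())
    p_s12 = marginal((0, 1))
    mi_joint = sum(v * np.log2(v / (p_s12[k[:2]] * p_t[(k[2],)]))
                   for k, v in p.items())
    unique1, unique2 = mi1 - red, mi2 - red
    synergy = mi_joint - red - unique1 - unique2
    return {"redundancy": red, "unique1": unique1,
            "unique2": unique2, "synergy": synergy}

# Sanity check: for T = S1 XOR S2, neither source alone is informative,
# so all 1 bit of joint information is synergistic.
s1 = [0, 0, 1, 1] * 25
s2 = [0, 1, 0, 1] * 25
t = [a ^ b for a, b in zip(s1, s2)]
result = pid_williams_beer(s1, s2, t)
print(result)
```

The XOR check mirrors the logic of the synergy finding above: two signals can jointly carry information about a target (here, neural activity standing in for the target) that neither carries alone.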
D'AUSILIO, Alessandro


Use this identifier to cite or link to this document: http://hdl.handle.net/11392/2491653