Synthetic speech optimised for low cognitive load

  • Position identifier: ESR6
  • Host partner: UEDIN

Text-to-speech (TTS) is computer-generated speech: a system takes text as input and outputs the corresponding speech waveform. TTS is used in many applications, including audiobooks, language-learning applications, voice-enabled email and assistive communication aids. However, the output of some text-to-speech systems is not as natural as a human voice, which leads to fatigue and an unsatisfactory experience for the end user. Improving the quality of synthetic speech is therefore important for making it easier and more pleasant to use.

With recent developments in HMM-based speech synthesis, DNN-based speech synthesis and concatenative synthesis, the naturalness and intelligibility of synthetic speech have improved drastically, to the extent that some synthesizers are now almost comparable to natural speech. Yet in adverse conditions such as noisy environments, synthetic speech is considered harder to listen to than natural speech. Even in quiet conditions, synthetic speech requires more effort from listeners.
The question this raises is: which factors in synthetic speech cause this “hard to listen to” effect?

Cognitive load is a measure that was once considered when evaluating synthetic speech, at a time when rule-based systems were popular. However, there has been little research on measuring cognitive load for current state-of-the-art text-to-speech systems. Little effort has gone into understanding how synthetic speech interacts with human cognitive processes, and no attempt is being made to understand the impact that synthetic speech has on the listener. We believe that cognitive load could be a valuable measure for determining the factors responsible for the “hard to listen to” effect.

Our research idea was to explore cognitive load as a measure for determining which factors of synthetic speech contribute to the “hard to listen to” effect. Identifying these factors gave us insight into where to make improvements, with the main goal of reducing the overall cognitive load imposed on the end user.

Objectives

The objectives of this project were to:

  • Investigate the cognitive load imposed by state-of-the-art synthetic speech in both quiet and noisy conditions, using the multi-instrument techniques provided by other researchers in the project;
  • Discover which aspects of synthetic speech contribute most to the ‘hard to listen to’ effect (which is hypothesised to be an increased cognitive load);
  • Design, implement and test novel forms of synthetic speech that impose the lowest possible cognitive load on listeners.

We expected that, by solving these problems, we would also make advances in the quality and naturalness of synthetic speech.

This project was carried out by Avashna Govender, supervised by Prof. Simon King, in collaboration with partners Radboud University Nijmegen (Netherlands) and the University of Crete (Greece).