SSW8 Accepted Papers

Sunday, September 1: OS3 OS4 KN2 PS2

Oral Session 3: Robustness in synthetic speech (OS3)

Sunday, September 1, 9:00 - 10:15

Chair: Alistair Conkie

The effectiveness of phonetic-contrast motivated adaptation of HMM-based synthetic voices has previously been tested successfully on English. The aim of this paper is to show that such adaptation can be exported with minor changes to languages with different intrinsic characteristics. Italian was chosen because it has no obvious phonemic configuration towards which human speech tends when hypo-articulated, such as the mid-central vowel (schwa) in English. Nonetheless, low-contrastive attractors were identified, and a linear transformation was trained by contrasting each phone's pronunciation with its nearest acoustic neighbour. Different degrees of hyper- and hypo-articulated synthetic speech were then achieved by scaling this adaptation along the dimension identified by each contrastive pair. The output of the Italian synthesiser, adapted with both the maximum and the minimum transformation strength, was evaluated with two objective assessments: an analysis of common acoustic correlates and the measurement of an intelligibility-in-noise index. For the latter, signals were mixed with different disturbances at various energy ratios and intelligibility was compared with that of standard TTS-generated speech. The experimental results show the transformation to be as effective on the Italian voices as on the English ones.
A phonetic-contrast motivated adaptation to control the degree-of-articulation on Italian HMM-based synthetic voices [bib]
09:00-09:25 Mauro Nicolao, Fabio Tesser, Roger K. Moore
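The core idea of scaling a single adaptation transform to move between hypo- and hyper-articulated output can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and parameter names are hypothetical.

```python
import numpy as np

def scaled_adaptation(x, A, b, strength):
    """Apply a linear adaptation y = A @ x + b with adjustable strength.

    x        : acoustic feature vector (e.g. a cepstral frame)
    A, b     : linear transform trained from contrastive phone pairs
    strength : 0 leaves x unchanged, 1 applies the full transform;
               values outside [0, 1] exaggerate or reverse the effect.
    """
    full = A @ x + b
    return x + strength * (full - x)
```

Interpolating between the identity and the full transform gives a continuous control over the degree of articulation, in the spirit of the scaling described in the abstract.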
Motivated by the fact that words are not equally confusable, we explore the idea of using word-level intelligibility predictions to selectively boost the harder-to-understand words in a sentence, aiming to improve overall intelligibility in the presence of noise. First, the intelligibility of a set of words from dense and sparse phonetic neighbourhoods was evaluated in isolation. The resulting intelligibility scores were used to inform two sentence-level experiments. In the first experiment the signal-to-noise ratio of one word was boosted to the detriment of another word. Sentence intelligibility did not generally improve. The intelligibility of words in isolation and in a sentence were found to be significantly different, both in clean and in noisy conditions. For the second experiment, one word was selectively boosted while slightly attenuating all other words in the sentence. This strategy was successful for words that were poorly recognised in that particular context. However, a reliable predictor of word-in-context intelligibility remains elusive, since this involves - as our results indicate - semantic, syntactic and acoustic information about the word and the sentence.
Using neighbourhood density and selective SNR boosting to increase the intelligibility of synthetic speech in noise [bib]
09:25-09:50 Cassia Valentini-Botinhao, Mirjam Wester, Junichi Yamagishi, Simon King
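The second experiment's strategy (boost one word, slightly attenuate the others) amounts to a per-word gain schedule. A minimal sketch follows; the gain values and names are illustrative, not those used in the paper.

```python
import numpy as np

def boost_word(signal, word_spans, target_idx, boost_db=3.0, atten_db=0.5):
    """Boost one word's level while slightly attenuating the rest.

    signal     : 1-D float array (speech waveform)
    word_spans : list of (start, end) sample indices, one per word
    target_idx : index of the word to boost
    """
    out = signal.copy()
    boost = 10.0 ** (boost_db / 20.0)    # dB -> linear amplitude gain
    atten = 10.0 ** (-atten_db / 20.0)
    for i, (s, e) in enumerate(word_spans):
        out[s:e] *= boost if i == target_idx else atten
    return out
```

Attenuating the other words keeps the overall signal energy roughly constant, which matters when the mixture-level SNR is fixed.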
Speaker adaptation for TTS has been receiving more attention in recent years for applications such as voice customisation or voice banking. If these applications are offered as an internet service, there is no control over the quality of the data that is collected: it can be noisy, with people talking in the background, or recorded in a reverberant environment. This makes adaptation more difficult. This paper explores the effect of different levels of additive and convolutional noise on speaker adaptation techniques based on cluster adaptive training (CAT) and the average voice model (AVM). The results indicate that although both techniques suffer some degradation, CAT is in general more robust than AVM.
Noise Robustness in HMM-TTS Speaker Adaptation [bib]
09:50-10:15 Kayoko Yanagisawa, Javier Latorre, Vincent Wan, Mark J. F. Gales, Simon King

Coffee Break

Oral Session 4: Issues in HMM-based speech synthesis (OS4)

Sunday, September 1, 10:45 - 12:25

Chair: Tomoki Toda

We present a new method to rapidly adapt the models of a statistical synthesizer to the voice of a new speaker. We apply a relatively simple linear transform that consists of a vocal tract length normalization (VTLN) part and a long-term average cepstral correction part. Despite the logical limitations of this approach, we will show that it effectively reduces the gap between source and target voices with only one reference utterance and without any phonetic segmentation. In addition, by using a minimum generation error criterion we avoid some of the problems that have been reported to arise when using a maximum likelihood criterion in VTLN.
New Method for Rapid Vocal Tract Length Adaptation in HMM-based Speech Synthesis [bib]
10:45-11:10 Daniel Erro, Agustin Alonso, Luis Serrano, Eva Navas, Inma Hernaez
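The frequency-warping half of such a transform is commonly realised with a first-order all-pass (bilinear) warp. The rough numerical sketch below warps a magnitude spectrum directly, which is only an illustration of the warping function itself, not the cepstral-domain linear transform the paper describes.

```python
import numpy as np

def bilinear_warp(freqs_norm, alpha):
    """Bilinear (first-order all-pass) frequency warp on [0, 1].

    freqs_norm : frequencies normalised to [0, 1] (1 = Nyquist)
    alpha      : warping factor; alpha = 0 is the identity
    """
    w = np.pi * freqs_norm
    w_warped = w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))
    return w_warped / np.pi

def warp_spectrum(mag, alpha):
    """Resample a magnitude spectrum along the warped frequency axis."""
    f = np.linspace(0.0, 1.0, len(mag))
    # bilinear_warp is monotonic, so it is a valid xp grid for interpolation
    return np.interp(f, bilinear_warp(f, alpha), mag)
```

The warp fixes the endpoints (0 and Nyquist) and stretches or compresses the axis in between, which is what makes a single scalar alpha a plausible rapid speaker-adaptation parameter.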
This paper proposes a text-to-speech synthesis (TTS) system based on a combined model of the Composite Wavelet Model (CWM) and Hidden Markov Model (HMM). Conventional HMM-based TTS systems using cepstral features tend to produce over-smoothed spectra, which often result in muffled and buzzy synthesized speech. This is simply caused by the averaging of spectra associated with each phoneme during the learning process. To avoid the over-smoothing of generated spectra, we consider it important to focus on a different representation of the generative process of speech spectra. In particular, we choose to characterize speech spectra by the CWM, whose parameters correspond to the frequency, gain and peakiness of each underlying formant. This idea is motivated by our expectation that averaging of these parameters would not directly cause the over-smoothing of spectra, as opposed to the cepstral representations. To describe the entire generative process of a sequence of speech spectra, we combine the generative process of a formant trajectory using an HMM and the generative process of a speech spectrum using the CWM. A parameter learning algorithm for this combined model is derived based on an auxiliary function approach. We confirmed through experiments that our speech synthesis system was able to generate speech spectra with clear peaks and dips, which resulted in natural-sounding synthetic speech.
Text-to-speech synthesizer based on combination of composite wavelet and hidden Markov models [bib]
11:10-11:35 Nobukatsu Hojo, Kota Yoshizato, Hirokazu Kameoka, Daisuke Saito, Shigeki Sagayama
This paper presents an experimental comparison of a broad range of leading vocoder types previously described in the literature. We use a reference implementation of each to create stimuli for a listening test using copy synthesis. The listening test is performed using both Lombard and normal read speech stimuli, and with two types of question for comparison. Multi-dimensional scaling (MDS) is conducted on the listener responses to analyse similarities between the vocoders in terms of quality. Our MDS and clustering results show that the vocoders which use a sinusoidal synthesis approach are perceptually distinguishable from the source-filter vocoders. To help interpret the axes of the resulting MDS space, we test for correlations with standard acoustic quality metrics and find that one axis is strongly correlated with PESQ scores. We also find that both speech style and the format of the listening-test question may influence test results. Finally, we present preference test results comparing each vocoder with natural speech.
An experimental comparison of multiple vocoder types [bib]
11:35-12:00 Qiong Hu, Korin Richmond, Junichi Yamagishi, Javier Latorre
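Multi-dimensional scaling of this kind can be illustrated with classical (Torgerson) MDS, which embeds items from a symmetric dissimilarity matrix. This is a generic textbook sketch, not the authors' exact analysis pipeline.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n items in k dimensions from an n x n
    symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigh returns ascending order
    order = np.argsort(vals)[::-1][:k]       # keep the k largest eigenvalues
    scale = np.sqrt(np.clip(vals[order], 0.0, None))
    return vecs[:, order] * scale            # n x k coordinates
```

When the dissimilarities are (close to) Euclidean, the leading dimensions recover the inter-item distances, and each axis can then be probed for correlation with metrics such as PESQ, as the abstract describes.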
To allow the average-voice-based speech synthesis technique to generate synthetic speech that is more similar to that of the target speaker, we propose a model training technique that introduces the label of speaker class. Speaker class represents the voice characteristics of speakers. In the proposed technique, first, all training data are clustered to determine classes of speaker type. The average voice model is trained using the labels of conventional context and speaker class. In the speaker adaptation process, the target speaker's class is estimated and is used to transform the average voice model into the target speaker's model. As a result, the speech of the target speaker is synthesized from the target speaker's model and the estimated target speaker's speaker class. The results of an objective experiment show that the proposed technique significantly reduces the RMS errors of log F0. Moreover, the results of a subjective experiment indicate that the proposal yields synthesized speech with better similarity than the conventional method.
Statistical Model Training Technique for Speech Synthesis Based on Speaker Class [bib]
12:00-12:25 Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno

Lunch Break

Keynote Session 2 (KN2)

Sunday, September 1, 14:00 - 14:50

Chair: Nick Campbell

In human-human dialog, over 80% of the variance in prosody can be explained by just 20 prosodic patterns, most of which involve actions of both speakers and most of which last several seconds. In dialog these patterns frequently occur simultaneously, at varying offsets, and they are additive at the signal level and apparently compositional at the semantic/pragmatic level. These patterns provide a simple, non-structural way to model the prosodic implications of various functions important in dialog, including managing turn-taking, framing topic structure, grounding, expressing attitude, and conveying instantaneous cognitive state, among others. These patterns have been used for language modeling, for detecting important moments in the speech stream, and for information retrieval from audio archives, and may be useful for speech synthesis for dialog applications.
Prosodic Patterns in Dialog [slides] [handout]
14:00-14:50 Nigel Ward

Poster Session 2 (PS2) & Coffee

Sunday, September 1, 14:50 - 16:30

Chair: Junichi Yamagishi

In this paper, we present a comparative overview of 9 studies on perceptual quality dimensions of synthetic speech. Different subjective assessment techniques have been used to evaluate the text-to-speech (TTS) stimuli in each of these tests: in a semantic differential, the test participants rate every stimulus on a given set of rating scales, while in a paired comparison test, the subjects rate the similarity of pairs of stimuli. Perceptual quality dimensions can be derived from the results of both test methods, either by performing a factor analysis or via multidimensional scaling. We show that even though the 9 tests differ in terms of synthesizer types, stimulus duration, language, and quality assessment methods, the resulting perceptual quality dimensions can be linked to 5 universal quality dimensions of synthetic speech: (i) naturalness of voice, (ii) prosodic quality, (iii) fluency and intelligibility, (iv) disturbances, and (v) calmness.
Is Intelligibility Still the Main Problem? A Review of Perceptual Quality Dimensions of Synthetic Speech [bib]
14:50-16:30 Florian Hinterleitner, Christoph Norrenbrock, Sebastian Möller
In HTS, an HMM-based speech synthesis system, about fifty contextual factors are used to label each segment when synthesizing English utterances. Published studies indicate that most of them serve to cluster the prosodic component of speech. Nevertheless, the influence of all these factors on modelling is still unclear for French. The work presented in this paper analyses the effect of contextual factors on acoustic parameter modelling for French synthesis. Two objective methodologies and one subjective methodology are used to conduct this study. The first relies on a GMM approach to achieve a global evaluation of the synthetic acoustic space. The second is based on a pairwise distance determined according to the acoustic parameter evaluated. Finally, a subjective evaluation completes the study. Experimental results show that using phonetic context improves overall spectrum and duration modelling, and that using syllable information improves F0 modelling. However, the other contextual factors do not significantly improve the quality of the HTS models.
Evaluation of contextual descriptors for HMM-based speech synthesis in French [bib]
14:50-16:30 Sébastien Le Maguer, Nelly Barbot, Olivier Boeffard
One of the biggest challenges in speech synthesis is the production of natural-sounding synthetic voices. This means that the resulting voice must not only be of high enough quality but must also capture the natural expressiveness imbued in human speech. This paper focuses on the expressiveness problem by proposing a set of techniques for extrapolating the expressiveness of proven high-quality expressive models to neutral speakers in HMM-based synthesis. As an additional advantage, the proposed techniques are based on adaptation approaches, which means that they can be used with little training data (around 15 minutes per style in this paper). For the final implementation, a set of 4 speaking styles was considered: news broadcasts, live sports commentary, interviews and political speech. Finally, the 5 techniques were tested through a perceptual evaluation, which shows that the deviations between neutral and expressive average models can be learned and used to imbue expressiveness into target neutral speakers as intended.
Towards Speaking Style Transplantation in Speech Synthesis [bib]
14:50-16:30 Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Junichi Yamagishi, Oliver Watts, Juan M. Montero
This paper presents the beginnings of a framework for formally testing the causes of the currently limited quality of HMM (Hidden Markov Model) speech synthesis. The framework separates the individual effects of modelling and observes their independent impact on vocoded speech parameters, in order to address the issues restricting progress towards highly intelligible and natural-sounding speech synthesis. The simulated HMM synthesis conditions are applied to spectral speech parameters and tested via a pairwise listening test, in which listeners give a "same or different" judgement on the quality of the synthesised speech produced under these conditions. The responses are then processed using multidimensional scaling to identify the qualities of modelled speech that listeners attend to, and thus why it is distinguishable from natural speech. Future improvements to the framework are also discussed, including the extension to more of the parameters modelled during speech synthesis.
Investigating the shortcomings of HMM synthesis [bib]
14:50-16:30 Thomas Merritt, Simon King
The generation of synthetic speech with a certain degree of expressiveness has been successful for some particular applications or speaking styles (e.g. emotions). In this context, there is a particular speaking style with subtle speech nuances that may be of great interest for delivering expressive speech: the storytelling style. The purpose of this paper is to take a first step towards developing a storytelling text-to-speech (TTS) synthesis system by modelling the specific prosodic patterns (pitch, intensity and tempo) of this speaking style. We base our analysis of a tale in Spanish on the discourse modes present in storytelling: narrative, descriptive and dialogue. Moreover, we introduce narrative situations (neutral narrative, post-character, decreasing suspense and affective situations) within the narrative mode, which are analysed at the sentence level. After grouping the sentences into modes and narrative situations, we analyse their corresponding prosodic patterns both objectively (via statistical tests) and subjectively (via a perceptual test using resynthesized sentences). The results show that the statistically validated prosodic rules perform as well as (or even better than) the original prosody in most sentences.
Prosodic analysis of storytelling discourse modes and narrative situations oriented to Text-to-Speech synthesis [bib]
14:50-16:30 Raúl Montaño, Francesc Alías, Josep Ferrer
This paper investigates using objective quality measures to evaluate speaker adaptation performance in HMM-based speech synthesis. We compare several objective measures to subjective evaluation results from our earlier work on 1) a comparison of speaker adaptation methods for child voices and 2) the effects of noise in speaker adaptation. The results analysed in this work indicate a reasonable correlation between several objective and subjective quality measures.
Objective evaluation measures for speaker-adaptive HMM-TTS systems [bib]
14:50-16:30 Ulpu Remes, Reima Karhila, Mikko Kurimo
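One widely used objective measure in this setting is mel-cepstral distortion (MCD). A minimal sketch of the standard formula follows; it assumes the reference and synthesised mel-cepstra are already time-aligned (in practice this usually requires dynamic time warping), and is not claimed to be the specific measure set used in the paper.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Mean frame-wise mel-cepstral distortion in dB.

    ref, syn : (frames, coeffs) arrays of time-aligned mel-cepstra;
    the 0th (energy) coefficient is conventionally excluded.
    """
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Correlating such per-utterance scores against subjective ratings (e.g. with a Pearson or Spearman coefficient) is the kind of comparison the abstract describes.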
This paper presents a preliminary study on the use of symbolic prosody extracted from the speech signal to improve parameter prediction in HMM-based speech synthesis. The relationship between the prosodic labelling and the actual prosody of the training data is usually ignored when building corpus-based TTS voices. In this work, different systems have been trained using prosodic labels predicted from speech and compared with the conventional system, which predicts those labels solely from text. Experiments were carried out using data from two speakers (one male and one female). An objective evaluation on a test set of the corpora shows that the proposed systems improve the prediction accuracy of phoneme durations and F0 trajectories. Advantages of using signal-driven rather than the conventional text-driven symbolic prosody, and future work on the effective use of this information in the synthesis stage of a text-to-speech system, are also described.
Experiments with Signal-Driven Symbolic Prosody for Statistical Parametric Speech Synthesis [bib]
14:50-16:30 Fabio Tesser, Giacomo Sommavilla, Giulio Paci, Piero Cosi
Phrase break prediction is very important for speech synthesis. Traditional methods of phrase break prediction have used linguistic resources such as part-of-speech (POS) sequence information to model these breaks. In the context of Indian languages, we propose to look at syllable-level features and explore the use of word-terminal syllables to model phrase breaks. We hypothesize that these terminal syllables serve to discriminate words based on syntactic meaning, and can therefore be used to model phrase breaks. We use these terminal syllables to build models for automatic phrase break prediction from text, and demonstrate by means of objective and subjective measures that these models perform as well as traditional models using POS sequence information. The proposed method thus avoids the need for POS taggers for prosodic phrasing in Indian languages.
Significance of word-terminal syllables for prediction of phrase breaks in Text-to-Speech systems for Indian languages [bib]
14:50-16:30 Anandaswarup Vadapalli, Peri Bhaskararao, Kishore Prahallad
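A toy version of this idea can be sketched as a count-based model of break probability conditioned on the word-terminal syllable. This is a deliberately simplified illustration with hypothetical names; the paper's actual models are more sophisticated.

```python
from collections import defaultdict

def train_break_model(labelled_words):
    """Estimate P(phrase break | word-terminal syllable) by counting.

    labelled_words : iterable of (terminal_syllable, is_break) pairs
    extracted from a break-annotated corpus.
    """
    counts = defaultdict(lambda: [0, 0])   # syllable -> [breaks, total]
    for syl, is_break in labelled_words:
        counts[syl][0] += int(is_break)
        counts[syl][1] += 1
    return {s: brk / tot for s, (brk, tot) in counts.items()}

def predict_break(model, syllable, threshold=0.5):
    """Predict a break after a word whose terminal syllable is break-prone."""
    return model.get(syllable, 0.0) >= threshold
```

The appeal of such features is exactly what the abstract notes: they need no POS tagger, only a syllabifier and break-labelled training text.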
We investigate whether listener age or native-speaker status has the bigger impact on the intelligibility of a synthetic New Zealand English voice. The paper presents findings from a speech intelligibility experiment based on a reminding task involving 67 participants. There were no significant differences in the results due to age (young versus older adults), but there were for native-speaker status. The non-native listeners performed significantly worse than the native listeners in the synthetic speech condition, although no differences were found in the natural speech condition. We argue that, despite the fact that ageing affects speech perception, the older native listeners were able to draw on their in-depth language model to help them parse the synthetic speech. The non-native listeners do not have such an in-depth model to assist them.
The Effect of Age and Native Speaker Status on Synthetic Speech Intelligibility [bib]
14:50-16:30 Catherine Watson, Wei Liu, Bruce MacDonald
In traditional voice conversion, converted speech is generated using statistical parametric models (for example, a Gaussian mixture model) whose parameters are estimated from parallel training utterances. A well-known problem of statistical parametric methods is that statistical averaging in parameter estimation over-smooths the speech parameter trajectories, and thus leads to low conversion quality. Inspired by the recent success of so-called exemplar-based methods in robust speech recognition, we propose a voice conversion system based on non-negative spectrogram deconvolution built on similar ideas. Exemplars, which are able to capture temporal context, are employed to generate the converted speech spectrogram convolutively. The exemplar-based approach is a data-driven, non-parametric alternative to the traditional parametric approaches to voice conversion. Experiments on the VOICES database indicate that the proposed method outperforms the conventional joint-density Gaussian mixture model by a wide margin in both objective and subjective evaluations.
Exemplar-Based Voice Conversion using Non-Negative Spectrogram Deconvolution [bib]
14:50-16:30 Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, Haizhou Li
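The frame-wise special case of exemplar-based conversion can be sketched with standard NMF multiplicative updates: activations estimated on a source exemplar dictionary drive paired target exemplars. The paper's convolutive model additionally spans temporal context; the names below are illustrative, not the authors' implementation.

```python
import numpy as np

def nmf_activations(V, W, n_iter=1000, eps=1e-9):
    """Estimate non-negative activations H with V ≈ W @ H using
    KL-divergence multiplicative updates (dictionary W held fixed).

    V : (bins, frames) source magnitude spectrogram
    W : (bins, exemplars) source exemplar dictionary
    """
    H = np.full((W.shape[1], V.shape[1]), 0.5)
    col_sums = W.sum(axis=0)[:, None] + eps   # denominator W^T 1
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / col_sums      # update keeps H non-negative
    return H

def convert_spectrogram(V_src, W_src, W_tgt, n_iter=1000):
    """Re-synthesise with paired target exemplars driven by the
    activations found on the source dictionary."""
    H = nmf_activations(V_src, W_src, n_iter)
    return W_tgt @ H
```

With W fixed, the KL objective is convex in H, so the activations converge reliably; the conversion step then simply swaps the dictionary while keeping the activation pattern.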

Guided Visit to Sagrada Familia & Park Güell

Sunday, September 1, 16:30 - 20:00

Dinner at the Restaurant Maritim of the Royal Barcelona Maritim Club

Sunday, September 1, 20:00 - 22:00


International Speech Communication Association.

