A simple and fast voice conversion method based only on vowel information is proposed. The proposed method relies on the empirical distribution of perceptual spectral distances between representative examples of each vowel segment, extracted using the TANDEM-STRAIGHT spectral envelope estimation procedure. Mapping functions of vowel spectra are designed to preserve the vowel space structure defined by the observed empirical distribution while transforming the position and orientation of the structure in an abstract vowel spectral space. By introducing physiological constraints on vocal tract shapes and vocal tract length normalization, difficulties in careful frequency alignment between vowel template spectra of the source and target speakers can be alleviated without significant degradation in the converted speech. The proposed method is a frame-based instantaneous method and is suitable for real-time processing. Applications of the proposed method in cross-language voice conversion are also discussed.
|Journal||Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH|
|Publication status||Published - 2009|
|Event||10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009 - Brighton, United Kingdom|
Duration: 6 Sep 2009 → 10 Sep 2009