A Voice Conversion Method Combining Segmental GMM Mapping with Target Frame Selection
Hung-Yan Gu and Sung-Feng Tsai
e-mail: guhy@mail.ntust.edu.tw


ABSTRACT
In this paper, we propose a voice conversion approach that combines two distinct ideas to improve converted-voice quality. The first idea is to map spectral features, e.g. discrete cepstrum coefficients (DCC), with segmental Gaussian mixture models (GMMs). That is, a single GMM with a large number of mixture components is replaced by several voice-content-specific GMMs, each consisting of far fewer mixture components. The second idea is to select, from the target-speaker frame pool of the segment class to which the input frame belongs, a frame whose spectral features are near the mapped feature vector. Both ideas are intended to alleviate a problem of traditional GMM-based conversion methods, namely that converted spectral envelopes are usually over-smoothed. To apply the first idea in an on-line voice conversion system, we propose an automatic GMM selection algorithm based on dynamic programming (DP). Furthermore, as pointed out by previous researchers, mapping with a single selected Gaussian probability density function (PDF), instead of a combination of several Gaussian PDFs, helps to obtain better converted-voice quality. Therefore, we also propose a Gaussian PDF selection algorithm and integrate it into our system. To implement the second idea, we adopt a DP-based algorithm that considers both frame matching distances and connecting distances. To evaluate the two ideas, three voice conversion systems were constructed and used in listening tests. The test results show that the system combining the two ideas obtains much improved voice quality, in addition to improved timbre similarity.
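To make the second idea more concrete, the following is a minimal sketch of DP-based frame selection, not the authors' implementation. It assumes Euclidean distances for both the matching cost (mapped vector to pool frame) and the connecting cost (between consecutively selected pool frames), and a single frame pool for the whole utterance; the function name `select_frames` and the parameter `connect_weight` are hypothetical.

```python
import numpy as np

def select_frames(mapped, pool, connect_weight=1.0):
    """Pick one pool frame per mapped vector by dynamic programming.

    mapped: (T, D) mapped feature vectors (e.g. DCC after GMM mapping).
    pool:   (N, D) target-speaker frames (assumed: one pool for all frames).
    Returns a list of T pool indices minimising the sum of matching
    distances plus connect_weight times the connecting distances.
    """
    mapped = np.asarray(mapped, dtype=float)
    pool = np.asarray(pool, dtype=float)
    # match[t, j]: distance between mapped vector t and pool frame j
    match = np.linalg.norm(mapped[:, None, :] - pool[None, :, :], axis=2)
    # connect[i, j]: assumed cost of selecting pool frame j right after frame i
    connect = connect_weight * np.linalg.norm(
        pool[:, None, :] - pool[None, :, :], axis=2)

    T, N = match.shape
    cost = match[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + connect        # total[i, j]: reach j from i
        back[t] = np.argmin(total, axis=0)     # best predecessor of each j
        cost = total[back[t], np.arange(N)] + match[t]

    # Trace the cheapest path backwards from the best final frame.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With `connect_weight=0` the sketch reduces to nearest-neighbour selection per frame; larger weights favour smoother sequences of selected frames at the expense of matching accuracy.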





Voice conversion from MA to MB

MA => MB
text (in Chinese) | VS (source) | VSX (source) | VT (target) | VXA (converted, original GMM) | VXB (converted, segmental GMMs) | VXC (converted, frame selection)
我明年將離開彰化去日本. 春天因你而閃閃發光, 笑臉因你而更加明媚, 微風因你而飄送芬芳, 日子就像緩緩的溪水. ("Next year I will leave Changhua for Japan. Spring sparkles because of you, smiling faces grow brighter because of you, the breeze carries fragrance because of you, and the days are like a gently flowing stream.") | link to wave file | link to wave file, spectrogram | link to wave file | link to wave file, spectrogram | link to wave file, spectrogram | link to wave file, spectrogram
接觸不到田野抓泥鰍的喜樂. ("Cannot experience the joy of catching loaches in the fields.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file
拿不出任何解決方案. ("Cannot come up with any solution.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file
欣賞穿著入時的來往行人. ("Admiring the fashionably dressed passers-by.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file






Voice conversion from MA to FA

MA => FA
text (in Chinese) | VS (source) | VSX (source) | VT (target) | VXA (converted, original GMM) | VXB (converted, segmental GMMs) | VXC (converted, frame selection)
我明年將離開彰化去日本. 春天因你而閃閃發光, 笑臉因你而更加明媚, 微風因你而飄送芬芳, 日子就像緩緩的溪水. ("Next year I will leave Changhua for Japan. Spring sparkles because of you, smiling faces grow brighter because of you, the breeze carries fragrance because of you, and the days are like a gently flowing stream.") | link to wave file | link to wave file, spectrogram | link to wave file | link to wave file, spectrogram | link to wave file, spectrogram | link to wave file, spectrogram
接觸不到田野抓泥鰍的喜樂. ("Cannot experience the joy of catching loaches in the fields.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file
拿不出任何解決方案. ("Cannot come up with any solution.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file
欣賞穿著入時的來往行人. ("Admiring the fashionably dressed passers-by.") | link to wave file | | link to wave file | link to wave file | link to wave file | link to wave file






Examples of automatically selected segmental GMMs

For the spoken sentence /zie-3 zyue-2 fang-1 an-4/ (解決方案, "solution"), the automatically selected segmental GMMs are /i/, /ie/, /yu/, /yue/, /ie/, /iao/, /ang/, /ia/, and /an/ when each batch is 30 frames long. For the spoken sentence "接觸不到田野抓泥鰍的喜樂", the automatically selected segmental GMMs, again with batches of 30 frames, are shown below.


Influence of batch length (in frames) on segment selection.
Batch length (in frames) | Number of segments selected | Segment boundaries | Converted voices
20 | 3, 3, 4, 2 | | link to wave file
30 | 2, 3, 2, 2 | | link to wave file
40 | 2, 2, 2, 2 | | link to wave file
50 | 2, 2, 2, 2 | | link to wave file
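
The batch-wise selection of segmental GMMs can be pictured with a small sketch. This page does not give the cost function of the DP algorithm, so the code below rests on assumptions: per-frame log-likelihoods under each segmental GMM are taken as given, and a fixed penalty discourages switching between GMMs within a batch. The function name `select_gmms_in_batches` and the parameters `batch_len` and `switch_penalty` are hypothetical.

```python
import numpy as np

def select_gmms_in_batches(loglik, batch_len=30, switch_penalty=5.0):
    """Choose a sequence of segmental GMMs for each batch of frames.

    loglik: (T, K) array; loglik[t, k] is the log-likelihood of frame t
            under segmental GMM k (assumed to be computed elsewhere).
    Returns one list per batch: the selected GMM indices with
    consecutive repeats collapsed into single segments.
    """
    loglik = np.asarray(loglik, dtype=float)
    T, K = loglik.shape
    # Assumed transition cost: 0 to stay, switch_penalty to change GMM.
    penalty = switch_penalty * (1.0 - np.eye(K))
    results = []
    for start in range(0, T, batch_len):
        batch = loglik[start:start + batch_len]
        score = batch[0].copy()
        back = np.zeros((len(batch), K), dtype=int)
        for t in range(1, len(batch)):
            cand = score[:, None] - penalty      # cand[i, j]: move i -> j
            back[t] = np.argmax(cand, axis=0)
            score = cand[back[t], np.arange(K)] + batch[t]
        # Backtrack the best label sequence of this batch.
        k = int(np.argmax(score))
        labels = [k]
        for t in range(len(batch) - 1, 0, -1):
            k = int(back[t][k])
            labels.append(k)
        labels.reverse()
        # Collapse runs of equal labels into segments.
        segments = [labels[0]] + [b for a, b in zip(labels, labels[1:]) if a != b]
        results.append(segments)
    return results
```

Under these assumptions, `[len(s) for s in select_gmms_in_batches(loglik, batch_len)]` gives the number of segments selected in each batch, so the effect of the batch length can be explored by varying `batch_len`.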





Illustration figures for the issues listed below




Text of the 375 sentences recorded from MA, MB, and FA (in Chinese)
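1. 請把這籃兔子送走. ("Please send this basket of rabbits away.")
2. 叫一客肉絲麵. ("Order a serving of shredded-pork noodles.")
3. 關心幼兒智力潛能. ("Care about young children's intellectual potential.")
. . .
358. 欣賞穿著入時的來往行人. ("Admiring the fashionably dressed passers-by.")
. . .
363. 接觸不到田野抓泥鰍的喜樂. ("Cannot experience the joy of catching loaches in the fields.")
. . .
371. 拿不出任何解決方案. ("Cannot come up with any solution.")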
. . .





Program interface

Screen recording of program execution, run on a notebook computer with an Intel Core 2 Duo T8300 CPU and 2 GB of memory.