TY - JOUR
T1 - Harvest
T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
AU - Morise, Masanori
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Numbers JP16H01734, JP15H02726, JP16H05899, JP16K12511.
Publisher Copyright:
Copyright © 2017 ISCA.
PY - 2017
Y1 - 2017
N2 - A fundamental frequency (F0) estimator named Harvest is described. The unique points of Harvest are that it can obtain a reliable F0 contour and reduce the error that the voiced section is wrongly identified as the unvoiced section. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm uses fundamental component extraction by many band-pass filters with different center frequencies and obtains the basic F0 candidates from filtered signals. After that, basic F0 candidates are refined and scored by using the instantaneous frequency, and then several F0 candidates in each frame are estimated. Since the frame-by-frame processing based on the fundamental component extraction is not robust against temporally local noise, a connection algorithm using neighboring F0s is used in the second step. The connection takes advantage of the fact that the F0 contour does not precipitously change in a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. Results showed that Harvest achieved the best performance of all algorithms.
AB - A fundamental frequency (F0) estimator named Harvest is described. The unique points of Harvest are that it can obtain a reliable F0 contour and reduce the error that the voiced section is wrongly identified as the unvoiced section. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm uses fundamental component extraction by many band-pass filters with different center frequencies and obtains the basic F0 candidates from filtered signals. After that, basic F0 candidates are refined and scored by using the instantaneous frequency, and then several F0 candidates in each frame are estimated. Since the frame-by-frame processing based on the fundamental component extraction is not robust against temporally local noise, a connection algorithm using neighboring F0s is used in the second step. The connection takes advantage of the fact that the F0 contour does not precipitously change in a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. Results showed that Harvest achieved the best performance of all algorithms.
KW - Fundamental component
KW - Fundamental frequency
KW - Instantaneous frequency
KW - Speech analysis
UR - http://www.scopus.com/inward/record.url?scp=85039150165&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2017-68
DO - 10.21437/Interspeech.2017-68
M3 - Conference article
AN - SCOPUS:85039150165
VL - 2017-August
SP - 2321
EP - 2325
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
Y2 - 20 August 2017 through 24 August 2017
ER -