Harvest: A high-performance fundamental frequency estimator from speech signals

Research output: Contribution to journal › Conference article › peer-review

24 Citations (Scopus)

Abstract

A fundamental frequency (F0) estimator named Harvest is described. Harvest is distinguished by its ability to obtain a reliable F0 contour and to reduce the error in which a voiced section is wrongly identified as unvoiced. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm performs fundamental component extraction with many band-pass filters that have different center frequencies, and obtains the basic F0 candidates from the filtered signals. The basic F0 candidates are then refined and scored by using the instantaneous frequency, which yields several F0 candidates in each frame. Since frame-by-frame processing based on fundamental component extraction is not robust against temporally local noise, the second step uses a connection algorithm that draws on neighboring F0s. The connection takes advantage of the fact that the F0 contour does not change precipitously over a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. The results showed that Harvest achieved the best performance of all the algorithms compared.
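
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the published Harvest implementation. The filter design (a Hann-windowed cosine of about four periods), the acceptance tolerance, the greedy connection rule, and all function and parameter names are assumptions made for illustration. The refinement and scoring by instantaneous frequency is omitted.

```python
import math


def bandpass_valid(signal, fs, fc):
    """Band-pass filter around fc (assumed design: Hann-windowed cosine).

    Returns only the samples where the filter fully overlaps the signal.
    """
    half = int(round(2 * fs / fc))  # filter spans about 4 periods of fc
    taps = [
        (0.5 + 0.5 * math.cos(math.pi * n / half))
        * math.cos(2 * math.pi * fc * n / fs)
        for n in range(-half, half + 1)
    ]
    return [
        sum(h * signal[i + k] for k, h in enumerate(taps))
        for i in range(len(signal) - 2 * half)
    ]


def zero_crossing_frequency(x, fs):
    """Mean frequency from the intervals between upward zero crossings."""
    times = []
    for i in range(1, len(x)):
        if x[i - 1] < 0.0 <= x[i]:
            # Linear interpolation locates the crossing between two samples.
            frac = -x[i - 1] / (x[i] - x[i - 1])
            times.append((i - 1 + frac) / fs)
    if len(times) < 2:
        return None
    return (len(times) - 1) / (times[-1] - times[0])


def f0_candidates(signal, fs, f_low=80.0, f_high=500.0,
                  per_octave=4, tolerance=0.1):
    """Step 1 (simplified): keep estimates that agree with the filter center."""
    candidates = []
    fc = f_low
    while fc <= f_high:
        f = zero_crossing_frequency(bandpass_valid(signal, fs, fc), fs)
        if f is not None and abs(f - fc) / fc < tolerance:
            candidates.append(f)
        fc *= 2.0 ** (1.0 / per_octave)
    return candidates


def connect_candidates(frames, max_jump=0.2):
    """Step 2 (simplified): choose one F0 per frame, avoiding abrupt changes.

    frames holds one candidate list per frame, best candidate first.
    An empty list marks an unvoiced frame, reported as 0.0.
    """
    contour = []
    previous = None
    for candidates in frames:
        if not candidates:
            contour.append(0.0)
            previous = None
            continue
        chosen = candidates[0]
        if previous is not None:
            nearest = min(candidates, key=lambda f: abs(f - previous))
            if abs(nearest - previous) / previous <= max_jump:
                chosen = nearest
        contour.append(chosen)
        previous = chosen
    return contour


# Synthetic voiced segment: 150 Hz fundamental plus two harmonics.
fs = 4000
signal = [
    sum(a * math.sin(2 * math.pi * 150.0 * k * n / fs)
        for k, a in ((1, 1.0), (2, 0.5), (3, 0.3)))
    for n in range(400)
]
candidates = f0_candidates(signal, fs)
contour = connect_candidates([[120.0], [240.0, 122.0], [125.0, 62.0], [], [200.0]])
```

In this sketch a harmonic of the true F0 can also survive as a candidate, which is one reason a scoring stage and a connection stage across neighboring frames are needed to arrive at a single contour.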

Original language: English
Pages (from-to): 2321-2325
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
Publication status: Published - 2017
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017

Keywords

  • Fundamental component
  • Fundamental frequency
  • Instantaneous frequency
  • Speech analysis
