The problem: Speech recognition software that can’t differentiate “Austin” from “Boston.”
The researcher: Stephen Zahorian, professor of electrical and computer engineering.
The research: Zahorian is developing a multilanguage, multispeaker database that will be available for spoken-language processing research. The project is supported by a grant of nearly half a million dollars from the Air Force Office of Scientific Research.
The strategy: First, Zahorian and his team will gather recordings of hundreds of speakers in each of three languages: English, Spanish and Mandarin Chinese. The recordings will be annotated to create a kind of closed captioning, complete with time stamps and descriptions of background noises. Once the recordings are transcribed, automatic speech-recognition algorithms will align each recording with its captions. Finally, software will be developed to verify and correct errors in that time alignment.
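The verification step can be pictured with a toy sketch. This is not the project's actual tooling; it is a minimal illustration, using Python's standard `difflib`, of how a human transcript might be compared against a recognizer's time-stamped output so that mismatched words are flagged for manual correction. The words and time stamps below are invented for the example.

```python
# Toy sketch (not the project's software): compare a human transcript
# against hypothetical ASR output to flag words whose time stamps
# cannot be trusted and need human review.
from difflib import SequenceMatcher

reference = ["she", "had", "your", "dark", "suit"]        # human caption
hypothesis = [("she", 0.0, 0.2), ("had", 0.2, 0.4),       # ASR output as
              ("dark", 0.5, 0.8), ("suit", 0.8, 1.1)]     # (word, start, end)

hyp_words = [w for w, _, _ in hypothesis]
matcher = SequenceMatcher(None, reference, hyp_words)

aligned, flagged = [], []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "equal":
        # Matching words inherit the recognizer's time stamps.
        aligned.extend(zip(reference[i1:i2], hypothesis[j1:j2]))
    else:
        # Insertions, deletions and substitutions are flagged for review.
        flagged.extend(reference[i1:i2])

print(flagged)  # words whose alignment could not be verified
```

Here the recognizer missed the word "your," so that word is flagged rather than silently given a guessed time stamp.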
“Speech-recognition algorithms begin by mimicking what your ear does,” Zahorian says. “But we want the algorithms to extract just the most useful characteristics of the speech, not all of the possible data. That’s because more detail can actually hurt performance, past a certain point.”
While algorithms typically convert sounds into numbers, Zahorian represents speech as a picture in a time-frequency plane. He uses image-processing techniques to extract features of the speech, which has led him to focus more on time than on frequency.
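The time-frequency picture he works from is essentially a spectrogram. The sketch below is an assumption-laden illustration of the general idea, not Zahorian's algorithm: it builds a magnitude spectrogram with NumPy, treats it as an image whose rows are frequency bins and whose columns are time frames, and applies a simple difference along the time axis, the kind of operation that emphasizes temporal change over frequency detail.

```python
# Illustrative sketch only (not Zahorian's method): a spectrogram as an
# image in the time-frequency plane, with a difference filter along time.
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows = frequency bins, columns = time frames."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

rng = np.random.default_rng(0)
audio = rng.standard_normal(4000)   # stand-in for a real recording
S = spectrogram(audio)

# A simple image-style feature: first difference along the time axis
# (the columns), highlighting how the spectrum changes over time.
delta_t = np.diff(S, axis=1)
print(S.shape, delta_t.shape)
```

Differencing along columns rather than rows is one concrete way an image-processing view can weight time over frequency.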
Researchers test algorithms using databases held by the Linguistic Data Consortium. Zahorian’s unusual image-based approach has returned some of the best results ever reported for automatic speech recognition experiments.
The database Zahorian develops will join the consortium's collection, offering any researcher a new way to test theories against samples of real-life speech.