Audio Hot Spotting: Finding a Needle in a Haystack of Digital Sound

February 2003

CNN has more than 100,000 hours of video programs in its archives. That's 11 nonstop years of TV viewing! Such bulging archives, however, pale in comparison to what's on the worldwide communications infrastructure, which annually produces more than an exabyte—one quintillion bytes—of mostly audio and video information. It is easy to see the futility of attempting to quickly sort through it all in search of an important yet elusive snippet of audio.

The federal government is grappling with an equally challenging problem. The enormous volume of audio recordings in its archives demands new thinking about how to search audio files efficiently. Recognizing the ramifications of this challenge, MITRE launched an investigation into possible solutions through a research project called Audio Hot Spotting. Within a year, the research team had developed a promising prototype tool.

Audio hot spotting is the process of finding "hot spots" within audio files, such as words of interest, speakers, or key sound effects (e.g., laughter, applause, and mechanical noises). Across industry, government, education, and medicine, there has long been a compelling need to pinpoint audio information, in either archived files or live streaming content, so that it can be quickly retrieved for reference and analysis. And as more and more multimedia takes up residence on the Internet, audio files will soon begin to dominate search engine indexes, making the hot spotting of audio files a critical necessity.
MITRE's Qian Hu proposed the audio hot spotting research project and serves as the principal investigator. "The current approach to audio information retrieval of simply combining text-based information retrieval with automatic speech recognition (ASR) does not meet the need for audio hot spotting in real-world applications," she explains. ASR word error rates remain too high, and existing systems return only whole documents, i.e., a defined time slice of audio such as a phone call, with no pinpointing of the relevant regions or of important information, such as who said what and in what noise environment.

Hu's first step was to put together a research team, drawing on varied expertise from across MITRE in areas that include intelligent information processing, information retrieval, automatic speech recognition, audio feature detection and extraction, audio indexing, keyword spotting, speaker and language identification, and graphical user interface development.

It didn't take the group long to come up with a prototype tool, which uses a Web-based query interface to automatically locate regions of interest in an audio or video file that meet the user's specified criteria. In a query, users may search for keywords or phrases, speakers, both keywords and speakers, nonverbal speech characteristics, or nonspeech signals of interest, and then play back the results. Accompanying each retrieved audio segment is a text transcription, generated and time-stamped by a commercial automatic speech recognizer. If video is the primary media source, both the video and its soundtrack are played.

The present and future applications of the prototype are considerable for both government and industry. "Since each component technology—automatic speech recognition, speaker identification, and natural language processing—is imperfect and as such unlikely to separately retrieve highly reliable information, combining the strength of each technology yields more reliable audio hot spotting results," explains Hu. Within six months, the team's prototype could query and retrieve keywords or phrases from multimedia files, query and retrieve both keywords and speaker information at the same time, index multimedia content, and detect and retrieve some audio effects, such as applause and laughter.
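To make the retrieval step concrete, here is a minimal sketch of how keyword hot spots might be located in a time-stamped, word-level ASR transcript, with an optional speaker filter standing in for the combination of ASR and speaker identification the team describes. The data layout and function names are illustrative assumptions, not MITRE's actual implementation.

```python
from dataclasses import dataclass

# A word-level ASR result: each recognized word carries its time
# offsets and a speaker label from a separate identification pass.
@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio file
    end: float
    speaker: str

def find_hot_spots(transcript, phrase, speaker=None, context=5.0):
    """Return (start, end) regions where `phrase` occurs, optionally
    restricted to one speaker and padded with `context` seconds of
    surrounding audio for playback."""
    query = phrase.lower().split()
    spots = []
    for i in range(len(transcript) - len(query) + 1):
        window = transcript[i:i + len(query)]
        if [w.text.lower() for w in window] != query:
            continue
        if speaker and any(w.speaker != speaker for w in window):
            continue
        spots.append((max(0.0, window[0].start - context),
                      window[-1].end + context))
    return spots

# Example: find where speaker "Hu" says "hot spotting"; the time
# offsets could then be handed to a media player for playback.
transcript = [Word("audio", 12.0, 12.4, "Hu"),
              Word("hot", 12.5, 12.7, "Hu"),
              Word("spotting", 12.8, 13.3, "Hu")]
print(find_hot_spots(transcript, "hot spotting", speaker="Hu"))
# [(7.5, 18.3)]
```

Returning a padded time region rather than a single word offset reflects the article's emphasis on playing back a retrievable segment of audio rather than a whole document.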
The project's rapid progress is partly attributed to collaboration with other MITRE projects. One such collaboration is with MITRE's Advanced Center for Information Service team, "which provided us with much-needed real application data and a corporate server for the test bed," says Hu. "They also shared with us their expertise in encoding, streaming, and network solutions." By adapting and customizing a commercial product, the two teams deployed a beta capability to index and retrieve multimedia content from the corporation's Distinguished Speaker Series and other company events, such as the president's annual address. The Next Generation Media Center beta, not yet available on the MITRE intranet, has the potential to provide the company with an unprecedented capability to live-index, query, and retrieve content from multimedia files.

And there is still much to be done. Unlocking the myriad secrets of audio is a daunting challenge. Raw digital audio files, unlike their familiar cousins, text files, are invisible until somehow processed. A speaker's words, sentences, and paragraphs are locked away in dense blocks of opaque digital streams, impenetrable to conventional computer applications. Until recently, listening was the only way to gain entry. But listening to audio recordings from end to end is an expensive, time-consuming, and mistake-prone exercise for humans. A machine would be the ideal substitute, if only one could be properly trained to take on the task.

The increase in speed and capacity of modern computers and networks now allows the inclusion of audio or multimedia as a data type. And, after decades of research, audio laboratories are finally beginning to make progress in understanding what makes speech so difficult for computers to understand. "No one lab has the complete answer," cautions Hu. "But when both commercial and academic work are taken together—and effectively integrated—a powerful solution begins to emerge. The team's job then is to apply our own expertise around that core, to expand the prototype's capability by integrating those refinements, and, finally, to prove that it all works as a single system."

"We're very pleased with our progress so far," says Hu of the prototype's debut, "but we have more challenging milestones ahead of us that we want to meet." Part of that future includes experimenting with applying and adapting text-based information retrieval and text mining algorithms to improve audio hot spotting performance. "For example, we envision the prototype allowing the user to query by example and retrieve not only by exact keyword match but also by semantic or phonetic association," adds Hu. (See the sketch below for one way the phonetic idea could work.)

One key to the success of this project is the MITRE Technology Program, which manages a wide range of research projects. "I have access to a great pool of expertise throughout the company," says Hu, "along with appropriate equipment and workspace. The whole process, as well as MITRE itself, encourages you to succeed."

For MITRE's many government sponsors, all warehousing tons of raw audio, Hu and her team's audio hot spotting architecture and approach could have a major impact.
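The phonetic association Hu mentions can be illustrated with a classic phonetic hash such as Soundex, which maps similar-sounding words to the same short code so a query can still find words the recognizer transcribed differently. This is only a hypothetical sketch of the idea, not the team's design; the function names are invented for illustration.

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits describing how
    the rest of the word sounds; 'Robert' and 'Rupert' both give R163."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""                      # vowels, h, w, and y are not coded
    word = word.lower()
    out, prev = word[0].upper(), code(word[0])
    for ch in word[1:]:
        d = code(ch)
        if d and d != prev:            # skip repeats of the same sound
            out += d
        if ch not in "hw":             # h and w do not break a repeat
            prev = d
    return (out + "000")[:4]

def phonetic_matches(query, transcript_words):
    """Return transcript words that sound like the query, catching
    ASR spellings such as 'Schmidt' for a query of 'Smith'."""
    target = soundex(query)
    return [w for w in transcript_words if soundex(w) == target]

print(phonetic_matches("Smith", ["Smyth", "Schmidt", "Jones"]))
# ['Smyth', 'Schmidt']
```

Semantic association could be layered on in the same way, expanding a query with related terms before matching against the transcript.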
—by Tom Green