2006 Technology Symposium > Human Language
Human Language
The Human Language area researches computer systems that understand and/or synthesize
spoken and written human languages. This area includes speech processing
(recognition, understanding, and synthesis), information extraction, handwriting
recognition, machine translation, text summarization, and language generation.
CLASR: Cross Language Automatic Speech Recognition
John Henderson, Principal Investigator
Location(s): Washington and Bedford
Problem
Fewer than 50 languages have large available lexical resources. Machine translation systems address only the 20 languages with obvious commercial and military impact. While progress has been made on these languages over 50 years of research, the techniques developed exploit large quantities of written resources. Consequently, these technologies are not applicable to languages with few written resources.
Objectives
We will investigate a novel technique for identifying and characterizing the concepts represented by target language (English) words in digitized audio recordings of a foreign source language. The result is a system for recovering English text from source language audio. It requires neither source language written resources for system development nor a source language written intermediate form at decoding time.
Activities
The problem of spoken language translation is broken down into several modeling subproblems, each of which is the target of recent research advances: automated acoustic unit discovery, bilingual lexicon design, structured language modeling, and transduction grammars. This project investigates how these separate research advances can be combined into a unified model and system for cross-language speech recognition.
Impact
Machine translation and speech recognition are typically viewed as properly separable efforts. This attempt at a joint model will likely spur similar approaches in the research community, encouraging others to explore bridging techniques. More practically, but over the longer term, many developing nations depend on outside assistance and resources; this technology will aid international assistance and monitoring efforts.
Clipper & TrIM: Embedded Machine Translation Prototypes
Rod Holland, Principal Investigator
Location(s): Washington and Bedford
Problem
The global involvement of the United States brings us into contact with partners, opponents, and populations that speak the world's languages. Coalition operations require command and control across language barriers. The production of correct and timely intelligence requires rapid exploitation of foreign language materials. There are never enough trained linguists to meet all such needs.
Objectives
We seek to demonstrate the potential for embedded machine translation in mission systems by constructing a series of task-oriented prototypes and evaluating them in the field, with real users and real problems.
Activities
We have constructed a coalition collaboration prototype, Translingual Instant Messaging (TrIM), that supports cross-language chat in 16 languages (English, French, Italian, Spanish, Portuguese, German, Dutch, Polish, Russian, Ukrainian, Japanese, Korean, Thai, Chinese, Hebrew, and Arabic). We have also constructed Clipper, a prototype that allows analysts to directly and rapidly exploit Web content and private collections in five languages (Chinese, Arabic, Russian, Spanish, and Portuguese).
Impact
TrIM has been widely deployed in exercises, experiments, and evaluations, including Yama Sakura, Ulchi Focus Lens, MEFEX, Combined Endeavor, BALTOPS, Rescuer, JWIDS, and FRUKUS. It has been the subject of two military utility assessments, and has been adopted for operational use by coalition organizations in SOUTHCOM, PACOM, CENTCOM, EUCOM, and NORTHCOM. Clipper has been validated through support of real analytical work, and is in daily use at multiple sites.
Closing the Semantic Gap
Marc Vilain, Principal Investigator
Location(s): Washington and Bedford
Problem
The explosive adoption of language-enabled analytic tools speaks to their ability to infer what people mean from what people say. But much language-enabled analysis has reached a semantic gap: progress on key tasks is stymied because these methods can only approximate what humans mean. This is especially true for the critical unsolved problem of identifying events and their ramifications.
Objectives
We will create a computational database of word meanings that will bridge the semantic gap and enable analytic tools to better model human language. This database will be compiled from dictionary definitions that cover the full range of the English language. Further, we will marshal algorithmic methods that apply the inter-relationships of word meanings to identifying events and their ramifications.
Activities
In order to create this lexical database, we will: infer hierarchies of word meanings from their dictionary definitions; establish those non-hierarchical word relations that further capture essential meanings; and derive a repertoire of primitive semantic elements from which meanings are composed. We will also define evaluation measures to better assess progress and determine goodness of fit to actual analytic tasks.
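As an illustration of the first activity, inferring hierarchies of word meanings from dictionary definitions, the sketch below extracts a genus (hypernym) term from each toy definition by preferring definition words that are themselves headwords. The dictionary contents and the heuristic are assumptions for illustration only, not the project's actual method.

```python
# Hypothetical sketch: infer a hypernym hierarchy from dictionary
# definitions. The toy dictionary and the genus-term heuristic
# (prefer definition words that are themselves headwords) are
# illustrative assumptions.

toy_dictionary = {
    "spaniel": "a breed of dog with long ears",
    "dog": "a domesticated carnivorous mammal",
    "mammal": "a warm-blooded vertebrate animal",
}

def genus_term(definition, headwords):
    """Return the first definition word that is itself a headword;
    that word is taken as the hypernym (the genus of the definition)."""
    for word in definition.split():
        w = word.lower().strip(",.")
        if w in headwords:
            return w
    return None  # no known genus: treat as a root of the hierarchy

def infer_hierarchy(dictionary):
    headwords = set(dictionary)
    return {word: genus_term(defn, headwords)
            for word, defn in dictionary.items()}

print(infer_hierarchy(toy_dictionary))
# {'spaniel': 'dog', 'dog': 'mammal', 'mammal': None}
```

Chaining the inferred genus links yields the hierarchy spaniel → dog → mammal; the non-hierarchical relations and primitive semantic elements mentioned above would layer on top of such a backbone.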
Impact
This work will sharpen analytic capabilities in many areas. In particular, being able to reliably identify and compare events is necessary to next-generation capabilities in Indications and Warnings. It will enable: the ability to catalogue what happens to entities of interest; the ability to filter redundant information about the same event; and the ability to detect inconsistent versions of events.
Document Analysis Methods in Fraud Detection
Marc Vilain, Principal Investigator
Location(s): Washington and Bedford
Problem
Each year, the U.S. Treasury is defrauded of $85 billion in corporate taxes owed by large- and medium-sized businesses. At fault are ever-new revenue-hiding schemes that are legally disallowed. The IRS, however, is limited in its ability to investigate, as corporate tax filings often lack key details required to detect noncompliance.
Objectives
A promising alternative is to exploit SEC filings to detect potentially noncompliant accounting practices. Our objective is to introduce text-oriented methods for securities filings that can complement the data-oriented methods used for tax filings. We are especially concerned with creating an exploratory testbed that will enable adaptability and interactive control by end-users.
Activities
We will apply current information extraction techniques to SEC filings and will identify relevant facts ("the company registered a tax loss") and relationships ("Mr. Smith is a shareholder in the partnership"). These language elements will be collated into document-level analyses through linguistic and statistical models. We will also pursue trainable classification techniques to abstract these analyses into automated noncompliance detectors.
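The kind of fact and relationship extraction described above can be caricatured with surface patterns. The sketch below is an illustration under strong assumptions: a real system would use trained linguistic and statistical models rather than two hand-written regular expressions, and the sample sentence simply echoes the examples in the text.

```python
# Hypothetical sketch of pattern-based fact and relationship extraction
# from filing text. Patterns and the sample sentence are illustrative;
# real extraction would use trained models, not two regexes.
import re

FACT_PATTERNS = [
    ("registered_tax_loss",
     re.compile(r"\b(\w[\w\s]*?) registered a tax loss\b")),
]
RELATION_PATTERNS = [
    ("shareholder_in",
     re.compile(r"\b(Mr\.|Ms\.)\s+(\w+) is a shareholder in (the \w+)")),
]

def extract(text):
    """Collect (label, argument...) tuples for each matched pattern."""
    facts = []
    for label, pat in FACT_PATTERNS:
        for m in pat.finditer(text):
            facts.append((label, m.group(1).strip()))
    relations = []
    for label, pat in RELATION_PATTERNS:
        for m in pat.finditer(text):
            relations.append((label, f"{m.group(1)} {m.group(2)}", m.group(3)))
    return facts, relations

sample = ("The company registered a tax loss. "
          "Mr. Smith is a shareholder in the partnership.")
print(extract(sample))
```

Tuples like these are the raw material that the document-level models and trainable noncompliance classifiers described above would then collate and abstract.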
Impact
As a research endeavor, this project will further the practice of information extraction through advances in fact detection, document categorization, and document structure understanding. Our experimental testbed will look at a number of key problems, among them discovering suspect ownerships in partnerships, identifying disallowed tax losses, and detecting new variants of existing tax abuse schemes. Because of the sheer magnitude of uncollected revenues, even partial solutions are valuable.
Foreign Language Exploitation Tools and Experimentation
David Day, Principal Investigator
Location(s): Washington and Bedford
Problem
Linguists face excessive workloads of backlogged documents, creating a need for aids that enhance timely human identification, extraction, and translation of critical intelligence content. Many analysts today are insufficiently trained in the source language being exploited. Can we leverage automated tools and improved visualization to aid the efficient understanding of foreign language material?
Objectives
We seek to improve the productivity and quality of human exploitation of foreign language material, including both document translation and the direct extraction of information for intelligence applications. We will establish an instrumented exploitation toolbox sufficient to perform carefully controlled experimentation that can help identify those capabilities that can provide the greatest increases in human quality and productivity.
Activities
We will perform a survey of existing tools that have been proposed or used as productivity aids for foreign language DocEx. We will integrate newly developed automatic foreign language processing and visualization capabilities that might contribute to effective DocEx. Using an instrumented toolbox we will study empirically the ways in which analyst productivity is most enhanced.
Impact
The results of this project's investment in integrated tool development and experimentation will be performance metrics and empirical results that can guide investment in improving the nation's DocEx capabilities, as well as an operational prototype that demonstrates some of these productivity and quality improvements.
Quick International Character Recognition (QUICR)
Amlan Kundu, Principal Investigator
Location(s): Washington and Bedford
TransTac
Sherri Condon, Principal Investigator
Location(s): Washington and Bedford