
August 2000, Volume 4, Number 2

Data Mining Issue



 Text Mining by Filter Composition

We all have access to lots of information, but are seldom in a position to exploit it effectively for decision making. In times of crisis, this problem can be especially severe.

Imagine you are a senior analyst besieged with news and intelligence reports of a hostage situation at an American embassy. Who is in charge of the terrorists? Is their group likely to attack other embassies? When the president calls for an emergency meeting, your boss is asked to make a 20-minute presentation that profiles the terrorist group and develops arguments describing their likely negotiation positions and the potential for further attacks.

How can computers help this process, which relies so critically on collective human understanding and insight, in the midst of the furor of a crisis?

Genoa, a project of the Defense Advanced Research Projects Agency (DARPA), aims to improve analysis and decision making in crisis situations. It provides tools that let analysts collaborate in developing structured arguments in support of particular conclusions and that help predict likely future scenarios. Genoa also provides knowledge discovery tools that mine news and intelligence sources for important patterns, trends, and anomalies, surfacing nuggets of valuable information.

One of the challenges Genoa faces is to make it easy for analysts to take knowledge gleaned with these discovery tools and embed it, in a concise and useful form, in an intelligence product as evidence in support of structured arguments. MITRE has been tasked with developing a summarization filter architecture to address this challenge. MITRE’s approach relies on component-based software composition, i.e., the assembly of software units that have contractually specified interfaces and that can be independently deployed and reused. This component-based approach, which leverages XML and JavaBeans technologies, allows the analyst to select various text mining tools from a menu and, with just a few mouse clicks, assemble them into a complex filter that fulfills whatever information discovery function is currently needed. A filter here is a tool that takes input information and turns it into a more abstract and useful representation. Filters can also weed out irrelevant parts of the input information.
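The idea of assembling small filters into a larger one can be sketched in a few lines. This is only an illustration under simplifying assumptions, not MITRE's XML and JavaBeans architecture: filters are modelled as plain Python functions over lists of strings, and the names `compose`, `keep_mentions`, and `first_sentence` are hypothetical.

```python
from functools import reduce
from typing import Callable, List

# Assumption: a filter maps a list of documents (plain strings) to a new list of strings.
Filter = Callable[[List[str]], List[str]]


def compose(*filters: Filter) -> Filter:
    """Chain filters left to right: each filter's output feeds the next one."""
    def pipeline(docs: List[str]) -> List[str]:
        return reduce(lambda data, f: f(data), filters, docs)
    return pipeline


def keep_mentions(term: str) -> Filter:
    """Hypothetical filter: weed out documents that do not mention `term`."""
    def run(docs: List[str]) -> List[str]:
        return [d for d in docs if term.lower() in d.lower()]
    return run


def first_sentence(docs: List[str]) -> List[str]:
    """Hypothetical filter: reduce each document to its first sentence."""
    return [d.split(". ")[0].rstrip(".") + "." for d in docs]


# Assemble a complex filter from two simple ones.
embassy_brief = compose(keep_mentions("embassy"), first_sentence)

# Invented sample documents, for illustration only.
docs = [
    "Gunmen seized the embassy at dawn. Talks began at noon.",
    "Markets closed higher. Analysts were surprised.",
]
print(embassy_brief(docs))  # ['Gunmen seized the embassy at dawn.']
```

Because every filter shares the same input and output shape, any filter can be swapped or reordered without changing the others, which is the property the contractually specified interfaces are meant to guarantee.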

For example, in response to the crisis situation discussed earlier, an analyst might use these mining tools to discover important nuggets of information in a large collection of news sources. This use of data mining tools can be illustrated by looking at TopCat, a MITRE-developed system that identifies different topics in a collection of documents and displays the key “players” for each topic. TopCat uses association rule mining technology to identify correlations among people, organizations, locations, and events (shown below in blue, violet, green, and red, respectively). Clustering these correlations creates topics such as the three in the following figure, built from six months of global news from several print, radio, and video sources (over 60,000 news stories in all).

Topics derived from clustering 60,000 news stories.


This allows the analyst to discover, say, an association between people involved in a bombing incident, which gives a starting point for further analysis, e.g., do McVeigh and Nichols belong to a common organization? This, in turn, can lead to new knowledge that can be leveraged in the analytical model used to help predict whether this terrorist organization is likely to strike elsewhere in the next few days. Similarly, the third topic reveals the important players in an election in Cambodia. This discovered information can be leveraged to help predict whether the situation in Cambodia is going to explode into a crisis that affects U.S. interests.
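To make the notion of correlations among named entities concrete, here is a minimal sketch of pairwise association rules with support and confidence thresholds. It is not TopCat's algorithm, whose details the article does not give: the function `entity_associations`, its thresholds, and the placeholder entity labels are all assumptions, and the clustering step that turns correlations into topics is omitted.

```python
from collections import Counter
from itertools import combinations


def entity_associations(stories, min_support=2, min_confidence=0.6):
    """Return (a, b, support, confidence) for rules a -> b.

    `stories` is an iterable of entity collections, one per news story.
    support    = number of stories mentioning both a and b
    confidence = support / number of stories mentioning a
    """
    single = Counter()
    pair = Counter()
    for entities in stories:
        unique = sorted(set(entities))       # count each entity once per story
        single.update(unique)
        pair.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair.items():
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):        # test the rule in both directions
            confidence = support / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Invented toy data: placeholder entities tagged by type.
stories = [
    ["PERSON:A", "PERSON:B", "LOC:Z"],
    ["PERSON:A", "PERSON:B"],
    ["PERSON:A", "ORG:X"],
    ["PERSON:C", "EVENT:election"],
]

for a, b, support, confidence in entity_associations(stories):
    print(f"{a} -> {b}  support={support}  confidence={confidence:.2f}")
```

On this toy input only the pair PERSON:A and PERSON:B co-occurs often enough to pass the support threshold, so the output contains one rule in each direction between them.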

Now, suppose an analyst wants to know more about the people in the last topic. Instead of reading more than 6,000 words of text from 10 articles on the topic, the analyst can compose a topic detection filter like TopCat with a biographical summarization filter that gleans facts about key persons from the topic’s articles. The result of the composition is a short, 86-word summary, seen below.

An 86-word summary of the news collection.


This summarization filter, developed under DARPA funding, identifies and aggregates descriptions of people from a collection of documents by means of an efficient syntactic analysis, the use of a thesaurus, and some simple natural language generation techniques. It also extracts from these documents salient sentences related to these people by weighting sentences based on the presence of the names of people as well as the location and proximity of terms in a document, their frequency, etc. (TopCat and a summarization filter perform a similar function for MITRE's Broadcast News Navigator, which applies them to continuously collected broadcast news in order to extract named entities and keywords and to identify the transcripts and sentences that contain them; see Personalized Broadcast News.) The summarization filter includes a parameter to specify the target length or the reduction rate, allowing summaries of different lengths to be generated. For example, allowing a longer summary would mean that facts about other people (e.g., Pol Pot) would also appear in the summary.
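The sentence-weighting step and the target length parameter can be illustrated with a small extractive sketch. It is a simplification under stated assumptions, not the MITRE filter: it scores sentences only by name mentions and position, ignores the syntactic analysis, thesaurus, and generation steps, and uses made-up weights. The function `summarize`, its `max_words` parameter, and the sample text are hypothetical.

```python
import re


def summarize(text, names, max_words=30):
    """Pick the highest-scoring sentences that fit in `max_words`, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        name_hits = sum(lowered.count(n.lower()) for n in names)
        # Assumed weights: 2.0 per name mention, plus a small bonus for early sentences.
        score = 2.0 * name_hits + 1.0 / (position + 1)
        scored.append((score, position, sentence))

    chosen, used = [], 0
    # Greedily take sentences from highest to lowest score while they still fit.
    for _, position, sentence in sorted(scored, key=lambda t: (-t[0], t[1])):
        length = len(sentence.split())
        if used + length <= max_words:
            chosen.append((position, sentence))
            used += length

    # Restore the original order so the summary reads naturally.
    return " ".join(sentence for _, sentence in sorted(chosen))


# Invented sample text with placeholder names.
text = (
    "The vote was held on Sunday. Candidate A leads the ruling party. "
    "Turnout was high in the capital. "
    "Candidate B, a rival of Candidate A, contested the count."
)
names = ["Candidate A", "Candidate B"]

print(summarize(text, names, max_words=10))
print(summarize(text, names, max_words=16))
```

Raising `max_words` from 10 to 16 admits a second sentence, which mirrors how a longer target length lets facts about additional people enter the summary.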

This example illustrates how mining a text collection using a composed summarization filter can reveal important associations at varying levels of detail. The component-based approach also allows these filters to be easily integrated into intelligence products such as reports and briefings. To help analysts present structured arguments and supporting information to decision makers, Genoa provides an electronic notebook briefing tool (the Virtual Situation Book) developed by Global Infotek. Summarization filters can be associated with regions on a page in a briefing book that can be shared across a community of collaborating analysts. When a document or a folder of documents is dropped onto a region associated with a filter, the filter applies and the textual summary or visualization appears in that region.
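The drop-onto-a-region behavior can be modelled as a region object that holds a filter and applies it whenever documents arrive. This is a hypothetical sketch, not the Virtual Situation Book's interface: the `Region` class, its `drop` method, and the trivial `word_count_summary` filter are invented for illustration.

```python
from typing import Callable, List


class Region:
    """Hypothetical page region that displays whatever its attached filter returns."""

    def __init__(self, name: str, filter_fn: Callable[[List[str]], str]):
        self.name = name
        self.filter_fn = filter_fn
        self.content = ""

    def drop(self, documents: List[str]) -> str:
        # Dropping a document (or a folder, modelled here as a list) runs the filter.
        self.content = self.filter_fn(documents)
        return self.content


def word_count_summary(documents: List[str]) -> str:
    """Stand-in filter: report how much text was dropped onto the region."""
    total = sum(len(d.split()) for d in documents)
    return f"{len(documents)} documents, {total} words"


region = Region("background", word_count_summary)
print(region.drop(["Talks resumed today.", "The embassy remains closed."]))
# 2 documents, 7 words
```

Any function with the same shape, such as a summarizer or a topic detector, could be attached to the region in place of the stand-in filter.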


For more information, please contact Inderjeet Mani using the employee directory.


Copyright © 1997-2013, The MITRE Corporation. All rights reserved.
MITRE is a registered trademark of The MITRE Corporation.
Material on this site may be copied and distributed with permission only.
