WebSumm: Finding needles in the Internet haystack.

Interactive Text Summarization for Fast Answers

Courtney is an analyst who is preparing a high-profile report on countries and companies that may have shipped weapons to Iraq before 1991. After running a search on "chemical weapons AND Gulf War", the system returns a list of 140 hits. Courtney is frustrated at the sight of so many hits. She won't have time to read them all, and by skipping some, she may miss valuable information. Instead of throwing up her hands, however, she turns to WebSumm.

Using the WebSumm tool, Courtney browses the automatically generated back-of-the-book-style index of terms, and pushes the most relevant articles with the terms "chemical weapons" and "Iraq" to the top of the list. The system has quickly and automatically summarized these articles based on these terms, and Courtney sees the relevant parts of each text that answer her question.

As shown in Figure 1, of 140 documents returned from a search engine, WebSumm found that the 48th document was the most relevant one containing the user's selected terms "chemical weapons," "nerve gas," and "Iraq." These articles were pushed to the top of the display and automatically summarized based on these terms. Several relevant sentences were extracted, including: "In January 1989, the West German chemical company Sigma Chemie admitted that it had shipped 7 ounces of mycotoxin to Iraq in 1987."
Figure 1: What companies and countries were associated with providing chemical weapons to Iraq? WebSumm makes it easy for an analyst to zoom in to relevant sections of relevant documents.

Now Courtney can choose what to do next: She can re-sort the hits by date, size, or title. She can view the original articles. She can create new summaries of the articles using other terms, or she can create longer summaries using the same terms. Courtney can also use WebSumm to compare documents for similarities and differences; for instance, as shown in Figure 2, she can quickly see that two different West German companies are alleged to be involved in shipping chemical weapons technology to Iraq. WebSumm can compare any pair of articles based on content; passages that are similarly related to the user's query terms are shown in the same color.
Figure 2: WebSumm produces a comparison of two similar articles.

WebSumm operates in two modes: (1) as a filtering tool for re-ordering and navigating through the headlines and initial pieces of text returned from search engines; and (2) as a more in-depth navigation tool that uses information extraction and text summarization techniques to analyze information from the full text of a collection of documents.

When the user submits a query to the system, WebSumm gathers a set of hits from a Web search engine. On the right side of the screen, the user is presented with a listing of hits with high-level summaries of the corresponding pages. The initial ordering of the hits is derived from terms from the initial query, and from a user-stated sorting preference (such as alphabetically, or by date or size). At the user's request, the system can perform a deeper summary analysis of documents corresponding to these hits. On the left side of the screen, terms extracted from these hits are used to create a "back-of-the-book" (BOB) index. The hits containing these index terms can be pushed up or down in the hit list based on user feedback about term relevance. By interactively selecting terms of interest and viewing the corresponding context-dependent summaries, users can quickly find answers relevant to their queries.
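The term-driven re-ordering described above can be illustrated with a minimal Python sketch. It is not WebSumm's actual implementation: the `Hit` class, the `build_bob_index` and `rerank` functions, the substring matching, and the additive feedback weights are all assumptions made for illustration. The sketch only shows the general idea of an index that maps terms to hits, with user feedback moving hits up or down the list.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result (hypothetical structure, not WebSumm's)."""
    title: str
    snippet: str


def build_bob_index(hits, terms):
    """Map each index term to the positions of the hits that contain it.

    Assumption: the terms were already extracted; matching is a plain
    case-insensitive substring test.
    """
    index = {}
    for term in terms:
        needle = term.lower()
        index[term] = {
            i for i, h in enumerate(hits)
            if needle in f"{h.title} {h.snippet}".lower()
        }
    return index


def rerank(hits, index, feedback):
    """Re-order hits using per-term feedback weights.

    A positive weight pushes hits containing the term up, a negative
    weight pushes them down. sorted() is stable, so hits with equal
    scores keep their original relative order.
    """
    scores = [0.0] * len(hits)
    for term, weight in feedback.items():
        for i in index.get(term, ()):
            scores[i] += weight
    order = sorted(range(len(hits)), key=lambda i: -scores[i])
    return [hits[i] for i in order]


# Illustrative data only; these hits are made up.
hits = [
    Hit("Oil prices", "Crude markets rallied."),
    Hit("Export probe", "Firm shipped chemical weapons precursors to Iraq."),
    Hit("Iraq talks", "Diplomats met in Geneva."),
]
index = build_bob_index(hits, ["chemical weapons", "Iraq", "oil"])
feedback = {"chemical weapons": 1.0, "Iraq": 1.0, "oil": -1.0}
print([h.title for h in rerank(hits, index, feedback)])
# prints ['Export probe', 'Iraq talks', 'Oil prices']
```

Keeping the index separate from the ranking step means feedback can change repeatedly without re-scanning the hits, which suits the interactive use the article describes.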
The technology behind WebSumm exploits recent advances in artificial intelligence and information retrieval. Extracted terms in the back-of-the-book index can consist of words, as well as phrases and proper names. Words are extracted by statistical techniques. Phrases are extracted by statistical and finite-state parsing methods; that is, by exploiting word-level statistics as well as grammatical patterns defined over the parts of speech of words in context (for instance, "lives" in "John lives" is a verb). The parts of speech are detected using MITRE's Alembic part-of-speech tagger. In addition, proper names are extracted from the text using a commercial name-extraction tool.

Once the back-of-the-book index is used to find articles, deeper summaries can be created by searching for information in the text related to the user's query. Here, a document is represented as a network, whose nodes correspond to terms at different positions within the document and whose links connect the terms within the text. The user's query is treated as an activation signal that activates directly related nodes in the network. By automatically propagating the activation signal through the network, nodes that are indirectly related to the query also get activated. Eventually, highly activated nodes result in the extraction of salient sentences or fragments from a document, providing a cogent summary. This technique extends easily to two documents: in comparing a pair of related documents, the common activated nodes are used to align sentences across the pair, providing a content-based characterization of similarities and differences between the pair.

In studies and user tests over the past year, WebSumm has been found to save valuable time when searching for information, without compromising the accuracy of the results.

The technology behind WebSumm was developed under MITRE Sponsored Research by David House, Dr. Eric Bloedorn, and Dr. Inderjeet Mani (Principal Investigator).
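As a rough illustration of the activation-based summarization described above, the following Python sketch builds a small term network and propagates a query signal through it. Everything specific here is an assumption rather than WebSumm's method: the link types (adjacent terms within a sentence, plus repeated occurrences of the same term), the stopword list, the `decay` and `rounds` parameters, the regex sentence splitter, and all function names.

```python
import re
from collections import defaultdict

# Assumed stopword list; the article does not specify one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "it",
             "had", "was", "is", "for", "on", "by", "with"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_network(sentences):
    """Nodes are (sentence index, position, term) occurrences.

    Assumed links: neighbouring terms in the same sentence, and every
    pair of occurrences of the same term anywhere in the document.
    """
    nodes = []
    for si, sent in enumerate(sentences):
        words = [w for w in re.findall(r"[a-z]+", sent.lower())
                 if w not in STOPWORDS]
        nodes.extend((si, pi, w) for pi, w in enumerate(words))
    links = defaultdict(set)
    by_term = defaultdict(list)
    for n, (si, _, w) in enumerate(nodes):
        by_term[w].append(n)
        if n > 0 and nodes[n - 1][0] == si:
            links[n].add(n - 1)
            links[n - 1].add(n)
    for occurrences in by_term.values():
        for a in occurrences:
            links[a].update(b for b in occurrences if b != a)
    return nodes, links


def spread(nodes, links, query_terms, decay=0.5, rounds=3):
    """Activate query nodes, then pass a decayed signal along links."""
    activation = [1.0 if w in query_terms else 0.0 for _, _, w in nodes]
    for _ in range(rounds):
        incoming = [0.0] * len(nodes)
        for n, neighbours in links.items():
            for m in neighbours:
                incoming[m] = max(incoming[m], decay * activation[n])
        # Taking the maximum keeps every activation within [0, 1].
        activation = [max(a, i) for a, i in zip(activation, incoming)]
    return activation


def summarize(text, query, k=1):
    """Return the k most activated sentences in document order."""
    sentences = split_sentences(text)
    nodes, links = build_network(sentences)
    activation = spread(nodes, links, set(query.lower().split()))
    scores = [0.0] * len(sentences)
    for (si, _, _), a in zip(nodes, activation):
        scores[si] += a
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]


text = ("Iraq bought nerve agents. The nerve agents came from Europe. "
        "Rain fell.")
print(summarize(text, "Iraq", k=2))
# prints ['Iraq bought nerve agents.', 'The nerve agents came from Europe.']
```

In this example the second sentence never mentions the query term. It is selected only because activation reaches it through the repeated term "nerve", which is the kind of indirect relatedness the article describes. Comparing two documents would additionally require matching activated terms across both texts, which this sketch does not attempt.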
For more information, please contact David House using the employee directory.