Knowledge Discovery from Data and Data Mining
January 1999

Advancements in processing power and storage capabilities have fueled unprecedented growth in data collection. The public and private sectors are amassing an increasing amount of information to gain a competitive edge and to advance science and engineering. To achieve these benefits, researchers must be able to find useful patterns in the data. Consequently, identifying methods for extracting the valuable information hidden in these vast data collections is an area of increasing interest.

Knowledge Discovery in Databases (KDD) is the multidisciplinary field that draws from work in databases, statistics, machine learning, language processing, and visualization to address this problem. The roots of KDD can be traced back to statistics, but KDD as it exists today began to take shape in the early 1990s, when a number of factors came together. In addition to the advancements in processing power and storage capabilities, improvements in network infrastructure and database tools and the maturity of machine learning techniques for discovering patterns led to the birth of KDD. KDD technology encompasses the entire process of selecting, preparing, extracting, and reviewing patterns in data, while data mining focuses on the extraction step of this process.

Several MITRE project teams are researching data mining techniques to help customers make better use of their data collections. In particular, our customers need tools that analyze structured data, unstructured data such as text and video, and combinations of these sources. Successful mining of tabular data has produced tremendous benefits in commercial settings; we believe mining less structured collections can yield similar gains. As a result, our engineers are focusing their research on identifying tools for mining text-based data.
By using available commercial and research tools, as well as identifying new methods, MITRE engineers are developing tools to find patterns in large text-based collections, to mine data that combines text with more structured data, and to summarize text.

Mining for Patterns in Large Text Collections

One of the more challenging aspects of data mining is identifying relationships in large collections of unstructured text. Dr. Chris Clifton, a lead scientist at MITRE, focuses a significant portion of his research on adapting commercial data mining tools to mine concepts extracted from text. As Clifton explains it, "We first developed a tool for extracting concepts, like a person or location, in text. From a research perspective, the next logical step was to explore modeling this tagged text with traditional data mining techniques. We extended the market basket model to tagged text." This market basket model finds association rules that describe correlations among the tagged concepts. Figure 1 shows this process.
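
As a rough illustration of the market basket idea applied to tagged text, the sketch below treats each document as a "basket" of extracted concepts and derives simple one-to-one association rules. It is a minimal sketch under stated assumptions, not the MITRE tool: the function name `association_rules`, the thresholds, and the sample concept tags are all hypothetical, and the measures "support" and "confidence" follow common association-rule usage rather than any definition given in this article.

```python
from collections import Counter
from itertools import permutations


def association_rules(baskets, min_support=0.5, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) of the form lhs -> rhs.

    support    = fraction of all baskets that contain both lhs and rhs
    confidence = fraction of baskets containing lhs that also contain rhs
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        # Count ordered pairs so each direction of a rule is scored separately.
        pair_counts.update(permutations(sorted(items), 2))

    rules = []
    for (lhs, rhs), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[lhs]
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical tagged documents: each set holds the concepts found in one text.
docs = [
    {"PERSON:A", "LOCATION:X", "ORG:Q"},
    {"PERSON:A", "LOCATION:X"},
    {"PERSON:A", "LOCATION:X", "ORG:R"},
    {"PERSON:B", "LOCATION:Y"},
]

rules = association_rules(docs)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

With this sample data, only the pairing of `PERSON:A` and `LOCATION:X` clears both thresholds, so the output contains that rule in each direction. A production tool would also consider rules with several concepts on either side, which this sketch omits for brevity.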
Figure 1. The text mining process

"Association rules can be particularly helpful to intelligence analysts examining large data sets," Clifton says. "While an analyst may be an expert on any number of subject details, association rules often help an analyst focus on the big picture."

For example, suppose an analyst is looking for evidence of a planned attack on South Korea by North Korea. Using traditional methods of intelligence research, the analyst would examine documents (messages and cable traffic selected using information retrieval or profiling techniques) for known indicators of an attack. Data mining goes beyond these methods to find additional indicators or data anomalies that suggest an attack is being planned.

While mining techniques provide analysts with information on relationships in data, the rules used in data mining may still require human filtering and analysis. As Clifton points out, "If the rule holds a high degree of reliability (there are few or no cases where the rule does not hold) and a high degree of confidence (the rule holds for a significant number of cases), an exception to the rule indicates an anomaly. Many of these anomalies may not be of interest in and of themselves, but they can identify clues as to what to examine more closely."

Challenges of Integrated Data

Another benefit of text-based data mining tools is the tremendous potential for increasing the integration and interoperation of data collections. Since traditional data analysis is often limited to highly structured data, the ability to successfully mine unstructured text, or a combination of structured and unstructured data, expands the interoperability of data sources. The potential for an integrated data mining tool is illustrated by a request from the Center for Advanced Aviation System Development (CAASD).

Dr. Eric Bloedorn, a MITRE senior engineer, describes CAASD's request for a tool to improve the analysis of airline safety records and human factors involved in air accidents. "The databases CAASD is using have a large number of well-defined fields. Some of these fields indicate human involvement (flight crew) and others indicate whether the action taken that led to the error was intentional or not. These fields are checked to see if the report indicates deficiencies in human performance," Bloedorn explains. "The problem is that human involvement in the accident is often not clearly identified with these fields. Rather, an interviewer is more likely to leave that data field blank and remark on human error relating to the accident in a comments box."

If the specially defined fields are not used, the database classifies the cause of the accident as unknown. As a result, it is difficult to identify human performance problems or to effectively evaluate possible problems with training, policies, or procedures.

"Data mining helps reclassify these unknowns to their appropriate category," says Bloedorn. "By using machine learning techniques, we generate rules directly from the data to classify the accident examples. We also use natural language processing techniques to take the inputted text and generate additional attributes or clues that help us determine whether human factors were involved. These attributes are then used in conjunction with the information extracted from the original structured fields to help us understand if human factors contributed to the accident."

Summarizing Text

Beyond finding associations in information stored in unstructured or semi-structured databases, MITRE scientists are also researching techniques for summarizing text after it has been extracted. Currently, MITRE engineers are investigating summarization technology and its applications to text-based data mining.

Principal Scientist Dr. Inderjeet Mani says, "Summarization technology is a tool for distilling information from text to produce a condensed version for a particular user and task. Given large collections of text data, this technology characterizes the content of these collections in a succinct manner, quickly focusing on text content relevant to a user's task."

Essentially, summarization technology builds a representation of the salient concepts in the text, rather than treating the text as a simple sequence of characters. In addition to extracting the most relevant sentences from individual documents in a collection, this technology aggregates information across texts to build abstracts of information content. For instance, an analyst may use summarization tools to discover the key topics in a collection, or the similarities and differences in information content among articles in a collection. This ability to characterize and classify content is an essential part of the overall data mining picture.

Consider a situation where an analyst is searching for information on countries and companies that may have shipped chemical weapons to the Middle East. Through a search engine, using key terms such as chemical weapons, sales, and Middle East, the analyst identifies hundreds of related articles. Summarization tools take this analysis one step further by quickly building a topic index based on key concepts in the retrieved information. These concepts are organized in a hierarchy and include individual words as well as phrases and proper names. By selecting concepts from this index, the user zooms in on a subset of articles more specifically related to the analysis. Individual articles are then summarized in more detail, or compared automatically to highlight points of similarity and difference.

As Mani points out, "A new generation of summarization tools is emerging, based on recent advances in information extraction, statistical language processing, and information retrieval. These tools construct much richer representations of information content than was earlier possible, without compromising on scalability. The full potential of these advances is just beginning to be exploited."

Ongoing Research

To make use of many current and future sources of data, the ability to mine data effectively for useful patterns needs to grow as quickly as the capacity to store information. Whether it is extracting text-based data, analyzing integrated data, or summarizing extracted information, MITRE is actively working to address our customers' needs for data analysis tools. As Clifton says, "Many MITRE customers have large amounts of textual information; the intelligence community being a prime example. This information is currently analyzed by humans, but with the explosion of information through computer-assisted generation and gathering of text, automation in the analysis process will be necessary."

Overall, it is MITRE's ability to provide end-to-end solutions for storing, analyzing, browsing, and processing data that is valued by our customers. Because many of our customers have large text and image bases, they depend on MITRE to research the technologies that will enable them to extract relevant information from these data collections. MITRE is committed to providing advanced, well-designed solutions to today's data mining problems and to developing strategies to address future challenges in data analysis.
Page last updated: March 12, 1999