Statisticians have used computers for decades to prove or disprove hypotheses about collected data. Linear regression, nearest-neighbor methods, and other types of analyses are common. Applications that depended on statistical analysis during the 1970s, 1980s, and 1990s include the drug approval process at the Food and Drug Administration and the creation of credit approval curves for credit card companies and banks. New statistical methods, including fuzzy logic and other non-linear means of analysis, have since evolved. However, the use of statistics continues to assume that an analyst starts with a hypothesis about the relationships among the data attributes and then uses the tool to validate or disprove that hypothesis. With data that exhibit dozens of attributes, this hypothesize-and-test paradigm becomes time consuming.

Another element in the development of data mining is the computer's increasing ability to store vast amounts of data. In the 1970s, most data storage depended on COBOL programs and data structures, which were not conducive to highly interactive analysis. Organizations can now store and query terabytes and even petabytes of data in sophisticated database management systems. In addition, the development of multidimensional data models, such as those used in data warehouses, has allowed users to move from a transaction-oriented way of thinking to a more analytical way of viewing the data. The human capacity to analyze all of this data manually, however, even with sophisticated visualization mechanisms or on-line analytical processing tools, is extremely limited.

A final thread in data mining's development is artificial intelligence (AI), whose capabilities for analyzing data were first touted during the 1970s. During the 1980s, with continued development of AI algorithms designed to enable a machine to learn, machine learning algorithms became realistic tools, and the idea of applying them to larger data sets became feasible. Unlike statistical techniques, which require the user to begin with a hypothesis, these algorithms automatically analyze data and identify relationships among attributes and entities to build models that allow domain experts (i.e., non-statisticians) to understand the relationship between the attributes and the class. The hypothesize-and-test paradigm has therefore been relaxed to a test-and-hypothesize paradigm.

As a result of these developments, data mining flowered during the late 1990s. Retail companies eagerly applied complex analytical capabilities to their data to increase their customer base. The financial community found trends and patterns to predict fluctuations in interest rates, stock prices, and economic demand. The Financial Crimes Enforcement Network (FinCEN) in the Treasury Department built an application using sophisticated link analysis visualization tools that has helped analysts identify over $1 trillion in laundered money during the past eight years. These successes have contributed to the overall popularity of data mining: rather than requiring a human to deal with tens or hundreds of attributes, data mining allows automatic analysis of data and recognition of trends and patterns. As this brief history shows, data mining is actually the synthesis of several technologies, including data management, statistics, machine learning (which can include pattern recognition techniques), and visualization.
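To make the test-and-hypothesize idea concrete, the following is a minimal sketch, not drawn from the original text, in which a decision-tree learner is given labeled data with no prior hypothesis and returns a human-readable model of the attribute-class relationships. The well-known Iris data set and the scikit-learn library are assumed here purely for illustration.

```python
# Minimal sketch of automatic model induction: no hypothesis is supplied;
# the algorithm searches the data for informative attribute/class splits.
# Assumes scikit-learn is installed; the Iris data set is illustrative only.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit a shallow tree so the resulting model stays easy to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The induced model can be inspected directly by a domain expert.
print(export_text(tree, feature_names=list(iris.feature_names)))
```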
Today, data mining tools can classify data sets, associate certain attributes or entities with other attributes or entities, segment the data into similar clusters, and identify outliers in the data (a brief illustrative sketch appears at the end of this section). The entire process of knowledge discovery in databases (KDD) includes collection, abstraction, and cleansing of the data; use of data mining tools to find patterns; validation and verification of the patterns; visualization of the developed models; and refinement of the collection process.

The number of commercial off-the-shelf (COTS) data mining tool vendors is shrinking as larger, more stable companies buy out the start-ups. At the same time, the tools are absorbing one another's techniques (e.g., a decision tree tool absorbs a clustering technique and then provides both decision trees and clusters). Although the tools that remain have similar capabilities, their usability varies greatly because of differences in the user interface, the visualization of the output patterns, the ease of manipulating the variables of the specific data mining technique, and so on. The smaller, less capable PC versions of the tools are truly user friendly; still, the user must understand basic information about the entire KDD process, especially about validation of the rules.

With unstructured data, such as text, imagery, video, and audio, COTS tools are very limited. They cannot accommodate, for example, newer technology for text summarization and text mining, which seeks to integrate information retrieval and language understanding techniques with machine learning and statistical techniques to obtain a summary of one or more documents. For more information, please contact Eric Bloedorn using the employee directory.
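As a brief illustration of two of the capabilities described above, clustering and outlier identification, the following is a minimal sketch under the assumption that NumPy and scikit-learn are available; the synthetic data and the specific algorithms (k-means and isolation forest) are illustrative choices, not a reference to any particular COTS tool.

```python
# Minimal sketch: segment records into similar clusters and flag outliers.
# Assumptions: NumPy and scikit-learn are installed; data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Three well-separated groups of two-attribute records, plus a few anomalies.
clusters = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(100, 2))
    for center in ((0, 0), (5, 5), (0, 5))
])
anomalies = rng.uniform(low=-3, high=8, size=(5, 2))
data = np.vstack([clusters, anomalies])

# Segmentation: assign each record to one of three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# Outlier identification: the isolation forest marks anomalous records with -1.
flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(data)

print("cluster sizes:", np.bincount(labels))
print("records flagged as outliers:", int((flags == -1).sum()))
```

In a full KDD process, the patterns found this way would still require validation and verification before being presented to a domain expert.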