What is Data-Mining?

Data-mining combines statistical techniques and knowledge-based methods to extract meaningful patterns from large datasets. Data-mining is related to areas of computer science called artificial intelligence and machine learning. It has been applied to many fields, from financial or marketing analysis, to drug design and epidemiology, to analysis of terrestrial or astronomical data from satellite photographs. There are two major approaches to data-mining: supervised and unsupervised. In the supervised approach, specific examples of a target concept are given, and the goal is to learn how to recognize members of the class using the description attributes. For example, in a clinical study, medical records for a set of healthy patients and patients with a disease like heart disease might be collected. The goal would be to learn what combination of attributes, such as obesity, high-cholesterol, or smoking, are characteristic of patients with heart disease and discriminates them from healthy patients. In the unsupervised approach, a set of examples is provided without any prior classification, and data-mining is used to discover regularities and patterns, most often by identifying clusters or subsets of similar examples. An example of this approach would be to extract general rules about consumer purchasing behavior from a database of point-of-sale transactions in a supermarket.

There are a wide variety of algorithms available for data-mining. Some algorithms attempt to construct human-interpretable representations of the derived patterns, such as decision trees or rule sets. Other algorithms focus more heavily on a statistical characterization of the patterns, such as Bayesian networks or hidden Markov models. Still other algorithms avoid the issue of interpretability altogether, opting for powerful methods for capturing patterns with even a certain degree of non-linearity, such as multi-layer neural networks. Current topics of research interest in the data-mining community include scalability (can the algorithm be ported efficiently to a massively-parallel supercomputer to run on terabyte databases?), dealing with imperfect data (handling missing values, detecting outliers, reducing the effect of noise, managing uncertainty), finding good representations (using mathematical operations to construct new high-level summary features with improved target correlations or predictive power), and incorporating domain knowledge (using background knowledge about the target concept to focus the search for patterns, which is especially important if data is scarce).


Thomas R. Ioerger (8/10/98)