What is Data-Mining?
Data-mining combines statistical techniques and knowledge-based
methods to extract meaningful patterns from large datasets.
Data-mining is related to areas of computer science called artificial
intelligence and machine learning. It has been applied to many
fields, from financial or marketing analysis, to drug design and
epidemiology, to analysis of terrestrial or astronomical data from
satellite photographs.
There are two major approaches to data-mining: supervised and
unsupervised. In the supervised approach, specific examples of a
target concept are given, and the goal is to learn how to recognize
members of the class from their descriptive attributes.
For example, in a clinical study, medical records might be collected
for a set of healthy patients and a set of patients with a disease
such as heart disease. The goal would be to learn what combination of
attributes, such as obesity, high cholesterol, or smoking, is
characteristic of patients with heart disease and discriminates them
from healthy patients. In the unsupervised approach, a set of examples is provided
without any prior classification, and data-mining is used to discover
regularities and patterns, most often by identifying clusters or
subsets of similar examples. An example of this approach would be to
extract general rules about consumer purchasing behavior from a
database of point-of-sale transactions in a supermarket.
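
The two approaches can be contrasted with a minimal Python sketch. Everything in it is an assumption made for illustration: the patient records and shopping baskets are invented toy data, and the function names `best_single_attribute` and `pair_rules` are hypothetical. The supervised part picks the single attribute that best agrees with the given labels; the unsupervised part derives simple "if A then B" rules from item co-occurrence, with no labels at all. Practical systems use far richer methods than either.

```python
from collections import Counter
from itertools import combinations

# Invented toy records: (attributes, has_heart_disease). Not clinical data.
PATIENTS = [
    ({"obesity": True,  "high_cholesterol": True,  "smoking": True},  True),
    ({"obesity": False, "high_cholesterol": True,  "smoking": True},  True),
    ({"obesity": True,  "high_cholesterol": False, "smoking": True},  True),
    ({"obesity": True,  "high_cholesterol": False, "smoking": False}, False),
    ({"obesity": False, "high_cholesterol": False, "smoking": False}, False),
    ({"obesity": False, "high_cholesterol": True,  "smoking": False}, False),
]

# Invented toy point-of-sale transactions, one set of items per basket.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def best_single_attribute(records):
    """Supervised: return the attribute whose value most often equals the label."""
    scores = {}
    for name in records[0][0]:
        hits = sum(1 for attrs, label in records if attrs[name] == label)
        scores[name] = hits / len(records)
    best = max(scores, key=scores.get)
    return best, scores[best]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Unsupervised: return (lhs, rhs, support, confidence) rules over item pairs."""
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / len(baskets)  # fraction of baskets with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    print(best_single_attribute(PATIENTS))
    for lhs, rhs, support, confidence in pair_rules(BASKETS):
        print(f"{lhs} -> {rhs} (support {support:.2f}, confidence {confidence:.2f})")
```

On this toy data the supervised step selects smoking, which matches every label, and the unsupervised step reports rules such as "butter implies bread". The support and confidence thresholds are arbitrary choices for the sketch, not recommended values.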
A wide variety of algorithms is available for data-mining.
Some algorithms attempt to construct human-interpretable
representations of the derived patterns, such as decision trees or
rule sets. Other algorithms focus more heavily on a statistical
characterization of the patterns, such as Bayesian networks or hidden
Markov models. Still other algorithms set interpretability aside
altogether, opting for powerful methods that can capture patterns
even when they are non-linear to some degree, such as multi-layer
neural networks. Current topics of research interest in
the data-mining community include scalability (can the algorithm be
ported efficiently to a massively parallel supercomputer to run on
terabyte-scale databases?), dealing with imperfect data (handling missing
values, detecting outliers, reducing the effect of noise, managing
uncertainty), finding good representations (using mathematical
operations to construct new high-level summary features with improved
target correlations or predictive power), and incorporating domain
knowledge (using background knowledge about the target concept to
focus the search for patterns, which is especially important if data
is scarce).
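
Two of these topics, dealing with imperfect data and finding good representations, can be made concrete with a short Python sketch. It is only one possible approach under stated assumptions: the function names are hypothetical, missing values are represented as `None`, imputation uses the median, outliers are flagged with a modified z-score based on the median absolute deviation (the constant 0.6745 and the cutoff 3.5 are a common convention, not a requirement), and body-mass index stands in for a constructed summary feature.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be flagged by this rule.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def body_mass_index(weight_kg, height_m):
    """Constructed feature: combine two raw attributes into one summary value."""
    return weight_kg / height_m ** 2


if __name__ == "__main__":
    # Invented cholesterol readings with one missing and one implausible value.
    readings = [180, 195, None, 210, 188, 900]
    filled = impute_median(readings)
    print(filled)
    print(flag_outliers(filled))
    print(round(body_mass_index(70, 1.75), 1))
```

The median is used in both steps because, unlike the mean, it is not pulled far off by the very outliers the code is trying to detect. A derived attribute like body-mass index may correlate with the target better than weight or height alone, which is the motivation behind constructing new features.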
Thomas R. Ioerger
(8/10/98)