Articles

intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

38 AI MAGAZINE

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM's ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://www.ffly.com/>). CRAYON (<http://crayon.net/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www.sjmercury.com/hound/>) from the San Jose Mercury News and FARCAST (<http://www.farcast.com/>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The basic problem addressed by the KDD process is one of mapping low-level data into other forms that might be more compact, more abstract, or more useful.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process.

FALL 1996 39
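To make the data-mining step concrete, the market-basket patterns mentioned earlier ("If customer bought X, he/she is also likely to buy Y and Z") reduce to counting support and confidence over transactions. The following is a minimal sketch, not taken from any system described here; the transactions, item names, and thresholds are invented for illustration.

```python
# Toy market-basket sketch: find rules "customers who bought X also bought Y"
# by counting support (how often X and Y occur together) and confidence
# (how often Y occurs given X). All baskets below are invented.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

# Enumerate simple one-item -> one-item rules above the chosen thresholds.
items = sorted(set().union(*transactions))
rules = []
for x, y in combinations(items, 2):
    for a, c in (({x}, {y}), ({y}, {x})):
        if support(a | c) >= 0.4 and confidence(a, c) >= 0.7:
            rules.append((min(a), min(c), round(confidence(a, c), 2)))
```

On this toy data the rule beer -> diapers holds with confidence 1.0; real systems run the same counts, with much cleverer enumeration, over millions of baskets.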

KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's "art" of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Figure 1. An Overview of the Steps That Compose the KDD Process.
(The figure shows the flow Data → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge, produced by the successive steps Selection, Preprocessing, Transformation, Data Mining, and Interpretation/Evaluation.)

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.
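As a toy illustration of validity as "estimated prediction accuracy on new data," the sketch below derives a one-threshold loan pattern from one sample and then scores it on a holdout sample it was not derived from. The records, the rule form, and the threshold search are all invented for illustration.

```python
# Validity sketch: a pattern "if income > threshold, the loan is good" is
# chosen on a discovery sample and its certainty is estimated as accuracy
# on unseen holdout records. All numbers are invented.

discovery_sample = [  # (income, status) pairs used to pick the threshold
    (20, "default"), (25, "default"), (40, "default"),
    (55, "good"), (60, "good"), (80, "good"),
]
holdout_sample = [    # new records used to estimate validity
    (30, "default"), (45, "good"), (50, "default"),
    (65, "good"), (70, "good"), (90, "good"),
]

def best_threshold(sample):
    """Pick the income cutoff with the fewest errors on the sample."""
    candidates = sorted(income for income, _ in sample)
    def errors(t):
        return sum((income > t) != (status == "good")
                   for income, status in sample)
    return min(candidates, key=errors)

threshold = best_threshold(discovery_sample)
accuracy = sum(
    (income > threshold) == (status == "good")
    for income, status in holdout_sample
) / len(holdout_sample)
```

Here the pattern is perfect on the sample it was mined from but only about 83 percent accurate on the holdout, which is the kind of gap the validity criterion is meant to expose.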

Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention in the literature.
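The steps above can be compressed into a runnable caricature of the selection, preprocessing, transformation, mining, and interpretation flow. The records, the cleaning rule, and the one-feature mining step below are toy stand-ins invented for illustration, not a prescription from the article.

```python
# A miniature KDD pipeline: each function is a stand-in for one of the
# process steps, and the final result is an interpretable rule.
raw_records = [
    {"income": 40, "debt": 55, "status": "default"},
    {"income": 85, "debt": 30, "status": "good"},
    {"income": None, "debt": 20, "status": "good"},   # missing field
    {"income": 60, "debt": 25, "status": "good"},
    {"income": 30, "debt": 60, "status": "default"},
]

def select(records):                       # step 2: create the target data set
    return [r for r in records if "status" in r]

def preprocess(records):                   # step 3: handle missing data
    return [r for r in records if r["income"] is not None]

def transform(records):                    # step 4: project to a useful feature
    return [(r["income"] - r["debt"], r["status"]) for r in records]

def mine(examples):                        # step 7: fit a one-feature pattern
    # place a threshold midway between the two class means
    goods = [x for x, s in examples if s == "good"]
    bads = [x for x, s in examples if s == "default"]
    return (sum(goods) / len(goods) + sum(bads) / len(bads)) / 2

def interpret(threshold):                  # step 8: render the pattern as a rule
    return f"IF income - debt > {threshold:.1f} THEN loan is good"

pattern = interpret(mine(transform(preprocess(select(raw_records)))))
```

Every stage here is trivially simple on purpose; the point is only that mining is one call in the middle of a longer, user-driven chain.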

The Data-Mining Step of the KDD Process

The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods.

The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user's hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.
(The figure plots debt against income; cases are marked x or o according to class.)

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).
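A data set with the shape of figure 2 can be synthesized in a few lines. Only the structure mirrors the figure (23 cases, income and debt fields, two classes that are not cleanly separable); the value ranges, the risk rule, and the noise level below are invented for illustration.

```python
# Synthetic stand-in for the loan data of figure 2: each case is a person
# with an income, a total debt, and a loan outcome ("x" = defaulted,
# "o" = good standing). All numeric choices are invented.
import random

random.seed(0)

def make_loan_data(n_cases=23):
    cases = []
    for _ in range(n_cases):
        income = random.uniform(10, 100)   # say, thousands of dollars
        debt = random.uniform(5, 80)
        # defaults cluster where debt is high relative to income, with
        # noise so the classes cannot be separated perfectly
        risky = debt - 0.6 * income + random.gauss(0, 10) > 0
        cases.append({"income": income, "debt": debt,
                      "label": "x" if risky else "o"})
    return cases

data = make_loan_data()
labels = {c["label"] for c in data}
```

Generating such artificial data is a common way to sanity-check a mining method before pointing it at real, and far larger, records.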

The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

Data-Mining Methods

The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data-mining methods.

Classification is learning a function that maps (classifies) a data item into one of several predefined classes (Weiss and Kulikowski 1991; Hand 1981). Examples of classification methods used as part of knowledge discovery applications include the classifying of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether future loan applicants will be given a loan or not.

Figure 3. A Simple Linear Classification Boundary for the Loan Data Set.
The shaded region denotes class no loan.
(The linear boundary splits the debt-income plane into a no-loan region and a loan region.)

Regression is learning a function that maps a data item to a real-valued prediction variable. Regression applications are many, for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of simple linear regression where total debt is fitted as a linear function of income: The fit is poor because only a weak correlation exists between the two variables.

Figure 4. A Simple Linear Regression for the Loan Data Set.
(A single regression line is fitted through the scattered cases in the debt-income plane.)
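The least-squares fit of figure 4 can be reproduced in miniature. The (income, debt) pairs below are invented, not read off the figure; the correlation coefficient at the end makes the "weak correlation, poor fit" observation quantitative.

```python
# Ordinary least squares for debt = slope * income + intercept on a
# handful of invented (income, debt) pairs, plus the sample correlation
# coefficient that explains why such a fit can be poor.
incomes = [20, 35, 40, 55, 60, 75, 90]
debts = [30, 55, 20, 45, 25, 50, 35]
n = len(incomes)

mean_x = sum(incomes) / n
mean_y = sum(debts) / n
sxx = sum((x - mean_x) ** 2 for x in incomes)
syy = sum((y - mean_y) ** 2 for y in debts)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, debts))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
correlation = sxy / (sxx * syy) ** 0.5   # small |r| means a poor linear fit
```

With these numbers the correlation is weak (|r| well below 0.5), so the fitted line, like the one in figure 4, explains little of the variation in debt.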

9.
Articles
where one seeks to identify a ﬁnite set of cat-
egories or clusters to describe the data (Jain
and Dubes 1988; Titterington, Smith, and + Cluster 2
Debt
Makov 1985). The categories can be mutually
Cluster 1 + +
exclusive and exhaustive or consist of a richer +
representation, such as hierarchical or over- +
+
lapping categories. Examples of clustering ap- +
+ +
plications in a knowledge discovery context +
+
include discovering homogeneous subpopula- + +
tions for consumers in marketing databases +
+ +
and identifying subcategories of spectra from +
+ +
infrared sky measurements (Cheeseman and +
+
Stutz 1996). Figure 5 shows a possible cluster- + + Cluster 3
ing of the loan data set into three clusters;
note that the clusters overlap, allowing data Income
points to belong to more than one cluster.
The original class labels (denoted by x’s and
o’s in the previous ﬁgures) have been replaced Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters.
by a + to indicate that the class membership Note that original labels are replaced by a +.
is no longer assumed known. Closely related
to clustering is the task of probability density
estimation, which consists of techniques for
estimating from data the joint multivariate
discovering the most signiﬁcant changes in
probability density function of all the vari-
the data from previously measured or norma-
ables or ﬁelds in the database (Silverman
tive values (Berndt and Clifford 1996; Guyon,
1986).
Matic, and Vapnik 1996; Kloesgen 1996;
Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Basseville and Nikiforov 1993).

The Components of Data-Mining Algorithms

The next step is to construct specific algorithms to implement the general methods we outlined. One can identify three primary components in any data-mining algorithm: (1) model representation, (2) model evaluation, and (3) search.

This reductionist view is not necessarily complete or fully encompassing; rather, it is a convenient way to express the key concepts of data-mining algorithms in a relatively unified and compact manner. Cheeseman (1990) outlines a similar structure.

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models increases the danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.
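The three components can be made concrete with a deliberately tiny sketch: the model representation is a one-variable threshold rule on income (of the kind shown later in figure 6), the evaluation criterion is classification accuracy, and the search is an exhaustive scan over candidate thresholds. All data values are invented for the example.

```python
# Model representation: "predict good-loan iff income > t" for a threshold t.
# Model evaluation: classification accuracy on the observed cases.
# Search: exhaustive scan over candidate thresholds (the parameter search);
# a richer system would also loop over model families (the model search).

def evaluate(t, cases):
    # cases are (income, label) pairs; label True means the loan was repaid
    return sum((income > t) == label for income, label in cases) / len(cases)

def search(cases):
    candidates = sorted({income for income, _ in cases})
    return max(candidates, key=lambda t: evaluate(t, cases))

cases = [(15, False), (25, False), (40, True), (55, True), (70, True)]
best_t = search(cases)  # threshold with the highest accuracy on these cases
```

Swapping any one component (a richer representation, a complexity-penalized fit function, a greedy rather than exhaustive search) changes the algorithm without changing this overall structure.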
FALL 1996 45

Model-evaluation criteria are quantitative statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by the empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

Search method consists of two components: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model-evaluation criteria are fixed, then the data-mining problem has been reduced to purely an optimization task: Find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: The model representation is changed so that a family of models is considered.

Some Data-Mining Methods

A wide variety of data-mining methods exist, but here, we focus only on a subset of popular techniques. Each method is discussed in the context of model representation, model evaluation, and search.

Decision Trees and Rules

Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to comprehend. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and, thus, the approximation power) of the model. For example, figure 6 illustrates the effect of a threshold split applied to the income variable for a loan data set: It is clear that using such simple threshold splits (parallel to the feature axes) severely limits the type of classification boundaries that can be induced. If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to comprehend. A large number of decision tree and rule-induction algorithms are described in the machine-learning and applied statistics literature (Quinlan 1992; Breiman et al. 1984). To a large extent, they depend on likelihood-based model-evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to explore the superexponential space of possible models. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling (Agrawal et al. 1996).

[Figure 6. Using a Single Threshold on the Income Variable to Try to Classify the Loan Data Set. A vertical split at income = t divides the income-debt plane into a no-loan region and a loan region.]

Nonlinear Regression and Classification Methods

These methods consist of a family of techniques for prediction that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression (see Elder and Pregibon [1996], Cheng and Titterington [1994], and Friedman [1989] for more detailed discussions). Consider neural networks, for example. Figure 7 illustrates the type of nonlinear decision boundary that a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets.
Also, the standard squared error and cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively (Ripley 1994; Geman, Bienenstock, and Doursat 1992). Back propagation is a parameter-search method that performs gradient descent in parameter (weight) space to find a local maximum of the likelihood function starting from random initial conditions. Nonlinear regression methods, although powerful in representational power, can be difficult to interpret. For example, although the classification boundaries of figure 7 might be more accurate than the simple threshold boundary of figure 6, the threshold boundary has the advantage that the model can be expressed, to some degree of certainty, as a simple rule of the form "if income is greater than threshold, then loan will have good status."

[Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set. A curved boundary in the income-debt plane separates the loan and no-loan regions.]

Example-Based Methods

The representation is simple: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

[Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set. The boundary is piecewise linear, determined by the closest training points in the income-debt plane.]

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables. Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated.
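A one-nearest-neighbor classifier of the kind shown in figure 8 fits in a few lines. The sketch below is illustrative only; the training points are invented (income, debt) pairs, and it uses squared Euclidean distance, which is reasonable here only because both variables share the same units.

```python
# 1-nearest-neighbor: a new point gets the label of the closest training point.
def nearest_neighbor(query, training):
    # training is a list of ((income, debt), label) pairs
    _, label = min(
        training,
        key=lambda item: sum((q - v) ** 2 for q, v in zip(query, item[0])),
    )
    return label

train = [((20, 40), "no loan"), ((30, 35), "no loan"),
         ((60, 20), "loan"), ((75, 25), "loan")]
print(nearest_neighbor((65, 22), train))  # closest training point is (60, 20)
```

Note that the "model" here is just the stored training data plus the metric, which is exactly why such methods can be hard to interpret.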
Related techniques include kernel-density estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics). However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining.
For example, some use the term to designate any manual search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes. It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problems that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables), so the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
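The arithmetic behind this multiple-testing trap, and a Bonferroni-style correction, can be checked directly. The numbers below are illustrative, not drawn from any particular system.

```python
# With purely random data, a test run at significance level alpha accepts a
# spurious model with probability alpha, so N independent tests yield about
# N * alpha false "discoveries" in expectation.
alpha, n_tests = 0.001, 10_000
expected_false = n_tests * alpha          # about 10 spurious models on average

# Bonferroni adjustment: test each model at alpha / N so that the chance of
# accepting even one spurious model stays below alpha overall.
adjusted = alpha / n_tests
family_wise = 1 - (1 - adjusted) ** n_tests   # below alpha for independent tests
print(expected_false, adjusted, family_wise)
```

The Bonferroni bound is conservative; randomization testing, mentioned below, is a common less conservative alternative.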
This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

Changing data and knowledge: Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only (Matheus, Piatetsky-Shapiro, and McNeill 1996). See also Agrawal and Psaila (1995) and Mannila, Toivonen, and Verkamo (1995).

Missing and noisy data: This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies (Heckerman 1996; Smyth et al. 1996).

Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed (Dzeroski 1996; Djoko, Cook, and Holder 1995).

Understandability of patterns: In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations (Buntine 1996; Heckerman 1996), rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule-refinement strategies (for example, Major and Mangano [1995]) can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search (for example, Simoudis, Livezey, and Kerber [1995]).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Examples of integrated KDD systems are described by Simoudis, Livezey, and Kerber (1995) and Stolorz, Nakamura, Mesrobian, Muntz, Shek, Santos, Yi, Ng, Chien, Mechoso, and Farrara (1995).

Concluding Remarks: The Potential Role of AI in KDD

In addition to machine learning, other AI fields can potentially contribute significantly to various aspects of the KDD process. We mention a few examples of these areas here:

Natural language presents significant opportunities for mining in free-form text, especially for automated annotation and indexing prior to classification of text corpora. Limited parsing capabilities can help substantially in the task of deciding what an article refers to. Hence, the spectrum from simple natural language processing all the way to language understanding can help substantially. Also, natural language processing can contribute significantly as an effective interface for stating hints to mining algorithms and for visualizing and explaining knowledge derived by a KDD system.

Planning is needed when a complicated data analysis process involves conducting complicated data-access and data-transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data-access constraints. Typically, data-processing steps are expressed in terms of desired postconditions and preconditions for the application of certain routines, which lends itself easily to representation as a planning problem. In addition, planning ability can play an important role in automated agents (see next item) to collect data samples or conduct a search to obtain needed data sets.
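The precondition/postcondition view of data-processing steps maps directly onto classical forward-search planning. The sketch below is a minimal illustration; the operator names and the facts they produce are invented examples, not a real KDD pipeline.

```python
# Each data-processing step is an operator with preconditions (facts that must
# already hold) and postconditions (facts it makes true). A forward search
# chains operators until the goal facts hold. Operator names are hypothetical.
OPERATORS = {
    "load_table":  (set(),          {"raw_data"}),
    "clean_nulls": ({"raw_data"},   {"clean_data"}),
    "discretize":  ({"clean_data"}, {"mining_ready"}),
}

def plan(state, goal):
    steps = []
    state = set(state)
    while not goal <= state:
        for name, (pre, post) in OPERATORS.items():
            # apply the first operator whose preconditions hold and whose
            # postconditions add something new
            if pre <= state and not post <= state:
                steps.append(name)
                state |= post
                break
        else:
            return None  # no applicable operator: goal unreachable
    return steps

print(plan(set(), {"mining_ready"}))
```

Starting from an empty state, the search chains the three operators in order; a real planner would additionally handle costs, resource constraints, and alternative operator orderings.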
Intelligent agents can be fired off to collect necessary information from a variety of sources. In addition, information agents can be activated remotely over the network or can trigger on the occurrence of a certain event and start an analysis operation. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area growing in importance.

Uncertainty in AI includes issues for managing uncertainty, proper inference mechanisms in the presence of uncertainty, and reasoning about causality, all fundamental to KDD theory and practice. In fact, the KDD-96 conference had a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the KDD system.

These potential contributions of AI are but a sampling; many others, including human-computer interaction, knowledge-acquisition techniques, and the study of mechanisms for reasoning, have the opportunity to contribute to KDD.

In conclusion, we presented some definitions of basic notions in the KDD field. Our primary aim was to clarify the relation between knowledge discovery and data mining. We provided an overview of the KDD process and basic data-mining methods. Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope: There are many data-mining techniques, particularly specialized methods for particular types of data and domain. Although various algorithms and applications might appear quite different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the task of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD. We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

Acknowledgments

We thank Sam Uthurusamy, Ron Brachman, and the KDD-96 referees for their valuable suggestions and ideas.

Note

1. Throughout this article, we use the term pattern to designate a pattern found in data. We also refer to models. One can think of patterns as components of models, for example, a particular rule in a classification model or a linear component in a regression model.

References

Agrawal, R., and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3-8. Menlo Park, Calif.: American Association for Artificial Intelligence.

Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, I. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307-328. Menlo Park, Calif.: AAAI Press.

Apte, C., and Hong, S. J. 1996. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 514-560. Menlo Park, Calif.: AAAI Press.

Basseville, M., and Nikiforov, I. V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall.

Berndt, D., and Clifford, J. 1996. Finding Patterns in Time Series: A Dynamic Programming Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 229-248. Menlo Park, Calif.: AAAI Press.

Berry, J. 1994. Database Marketing. Business Week, September 5, 56-62.

Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 37-58. Menlo Park, Calif.: AAAI Press.

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth.

Brodley, C. E., and Smyth, P. 1996. Applying Classification Algorithms in Practice. Statistics and Computing. Forthcoming.

Buntine, W. 1996. Graphical Models for Discovering Knowledge. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 59-82. Menlo Park, Calif.: AAAI Press.

Cheeseman, P. 1990. On Finding the Most Probable Model. In Computational Models of Scientific Discovery and Theory Formation, eds. J. Shrager and P. Langley, 73-95. San Francisco, Calif.: Morgan Kaufmann.

Cheeseman, P., and Stutz, J. 1996. Bayesian Classification (AUTOCLASS): Theory and Results. In Advances in Knowledge Discovery and Data Mining, eds.

Gregory Piatetsky-Shapiro is a principal member of the technical staff at GTE Laboratories and the principal investigator of the Knowledge Discovery in Databases (KDD) Project, which focuses on developing and deploying advanced KDD systems for business applications. Previously, he worked on applying intelligent front ends to heterogeneous databases. Piatetsky-Shapiro received several GTE awards, including GTE's highest technical achievement award for the KEFIR system for health-care data analysis. His research interests include intelligent database systems, dependency networks, and Internet resource discovery. Prior to GTE, he worked at Strategic Information developing financial database systems. Piatetsky-Shapiro received his M.S. in 1979 and his Ph.D. in 1984, both from New York University (NYU). His Ph.D. dissertation on self-organizing database systems received NYU awards as the best dissertation in computer science and in all natural sciences. Piatetsky-Shapiro organized and chaired the first three (1989, 1991, and 1993) KDD workshops and helped in developing them into successful conferences (KDD-95 and KDD-96). He has also been on the program committees of numerous other conferences and workshops on AI and databases. He edited and coedited several collections on KDD, including two books, Knowledge Discovery in Databases (AAAI Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996), and has many other publications in the areas of AI and databases. He is a coeditor in chief of the new Data Mining and Knowledge Discovery journal. Piatetsky-Shapiro founded and moderates the KDD Nuggets electronic newsletter (kdd@gte.com) and is the web master for Knowledge Discovery Mine (<http://info.gte.com/~kdd/index.html>).

Padhraic Smyth received a first-class-honors Bachelor of Engineering from the National University of Ireland in 1984 and an MSEE and a Ph.D. from the Electrical Engineering Department at the California Institute of Technology (Caltech) in 1985 and 1988, respectively. From 1988 to 1996, he was a technical group leader at the Jet Propulsion Laboratory (JPL). Since April 1996, he has been a faculty member in the Information and Computer Science Department at the University of California at Irvine. He is also currently a principal investigator at JPL (part-time) and is a consultant to private industry. Smyth received the Lew Allen Award for Excellence in Research at JPL in 1993 and has been awarded 14 National Aeronautics and Space Administration certificates for technical innovation since 1991. He was coeditor of the book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996). Smyth was a visiting lecturer in the Computational and Neural Systems and Electrical Engineering Departments at Caltech (1994) and regularly conducts tutorials on probabilistic learning algorithms at national conferences (including UAI-93, AAAI-94, CAIA-95, and IJCAI-95). He is general chair of the Sixth International Workshop on AI and Statistics, to be held in 1997. Smyth's research interests include statistical pattern recognition, machine learning, decision theory, probabilistic reasoning, information theory, and the application of probability and statistics in AI. He has published 16 journal papers, 10 book chapters, and 60 conference papers on these topics.