1. INTRODUCTION

In its broadest sense, data mining is simply the act of turning raw data
from an observation into useful information. This information can be
interpreted in terms of a hypothesis or theory, and used to make further
predictions. This scientific method, where useful statements are made
about the world, has been widely employed to great effect in the West
since the Renaissance, and even earlier in other parts of the
world. What has changed in the past few decades is the exponential rise
in available computing power, and, as a related consequence, the
enormous quantities of observed data, primarily in digital form. The
exponential rise in the amount of available data is now creating, in
addition to the natural world, a digital world, in which extracting new
and useful information from the data already taken and archived is
becoming a major endeavor in itself. This activity of knowledge
discovery in databases (KDD) is what the phrase data mining most
commonly refers to, and it forms the basis for our review.

Astronomy has been among the first scientific disciplines to experience
this flood of data. The emergence of data mining within this and other
subjects has been described [1, 2, 3] as the fourth
paradigm. The first two paradigms are the
well-known pair of theory and observation, while the third is another
relatively recent addition, computer simulation. The sheer volume of
data not only necessitates this new paradigmatic approach, but the
approach must be, to a large extent, automated. In more formal terms, we
wish to leverage a computational machine to find patterns in digital
data, and translate these patterns into useful information, hence
machine learning. This learning must be returned in a useful
manner to a human investigator, which hopefully results in human learning.

It is perhaps not entirely unfair to say, however, that scientists in
general do not yet appreciate the full potential of this fourth
paradigm. There are good reasons for this of course: scientists are
generally not experts in databases, or cutting-edge branches of
statistics, or computer hardware, and so forth. What we hope to do in
this review, primarily for the data mining skeptic, is to shed light on
why this is a useful approach. To accomplish this goal, we emphasize
algorithms that have been, or could currently be, usefully employed, and
the actual scientific results they have enabled. We also hope to give an
interesting and fairly comprehensive overview to those who do already
appreciate this approach, and perhaps provide inspiration for exciting
new ideas and applications. However, despite referring to data mining as
a whole new paradigm, we try to emphasize that it is, like theory,
observation, and simulation, only a part of the broader scientific
process, and should be viewed and utilized as such. The algorithms
described are tools that, when applied correctly, have vast
potential for the creation of useful scientific results. But, given that
it is only part of the process, it is, of course, not the answer to
everything, and we therefore enumerate some of the limitations of this
new paradigm.

We start in Section 1.1 with a summary of some of the
advantages of this approach. In Section 2, we
summarize the process from the input of raw data to the visualization of
results. This is followed in Section 3 by the
actual application of data mining tools in astronomy.
Section 2 is arranged algorithmically, and
Section 3 is arranged astrophysically. It is
likely that the expert in astronomy or data mining, respectively, could
infer much of Section 3 from
Section 2, and vice-versa. But we hope that the
combination of the two sections has new ideas or insights to offer to
either audience. Following these two
sections, in Section 4, we combine the lessons
learned to discuss the future of data mining in astronomy, pointing out
likely near-term future directions in both the data mining process and
its physical application. We conclude with a summary of the main points
in Section 5.

Of course, what astronomers care about is not a fashionable new
computational method for ever more complex data analysis, but the
science. A fancy new data mining system is not worth much if all
it tells you is what you could have gained by the judicious application
of existing tools and a little physical insight
[4].
We therefore summarize some of the advantages of this approach:

Getting anything at all: upcoming datasets will be almost
overwhelmingly large. When one is faced with Petabytes of data, a
rigorous, automated approach that intelligently extracts pertinent
scientific information will be the only one that is tractable.

Simplicity: despite the apparent plethora of methods,
straightforward applications of very well-known and well-tested data
mining algorithms can quickly produce a useful result. These methods can
generate a model appropriate to the complexity of an input dataset,
including nonlinearities, implicit prior information, systematic biases,
or unexpected patterns. With this approach, a priori data
sampling, of the type exemplified by elaborate color cuts, is not
necessary. For many algorithms, new data can be trivially incorporated
as they become available.

Prior information: this can be either fully incorporated, or the
data can be allowed to completely `speak for themselves'. For example,
an unsupervised clustering algorithm can highlight new classes of
objects within a dataset that might be missed if a prior set of
classifications were imposed.
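As a toy illustration of that idea (not an algorithm from this review), the sketch below runs a minimal k-means clusterer on synthetic objects in a hypothetical two-color space; the group labels emerge from the data alone, with no prior classification imposed. The data, cluster count, and parameter values are all illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its members."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k randomly chosen data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, shape (n, k)
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two synthetic populations in a toy two-color space
rng = np.random.default_rng(1)
pop_a = rng.normal([0.3, 0.2], 0.05, size=(100, 2))
pop_b = rng.normal([1.2, 0.9], 0.05, size=(100, 2))
data = np.vstack([pop_a, pop_b])

labels, centroids = kmeans(data, k=2)
```

With well-separated populations such as these, the recovered labels coincide with the two populations even though the algorithm was never told they exist.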

Pattern recognition: an appropriate algorithm can highlight
patterns in a dataset that might not otherwise be noticed by a human
investigator, perhaps due to the high dimensionality. Similarly, rare or
unusual objects can be highlighted.
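To make the rare-object point concrete, here is a small sketch (again illustrative, not a method endorsed by the review) that scores each object by its mean distance to its k nearest neighbors: isolated, unusual objects receive the largest scores. The synthetic data and the choice k=5 are assumptions for the example.

```python
import numpy as np

def knn_outlier_scores(points, k=5):
    """Score each point by its mean distance to its k nearest
    neighbors; isolated (rare) objects get the largest scores."""
    # full pairwise distance matrix, shape (n, n)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)                  # column 0 is the zero self-distance
    return d[:, 1:k + 1].mean(axis=1)

rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 1.0, size=(200, 3))   # ordinary objects
rare = np.array([[8.0, 8.0, 8.0]])           # one unusual object
data = np.vstack([bulk, rare])

scores = knn_outlier_scores(data)
print(scores.argmax())  # index 200: the injected rare object
```

The same score can be computed in a high-dimensional feature space where a human inspector would have no hope of spotting the outlier by eye.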

Complementary approach: there are numerous examples where the
data mining approach demonstrably exceeds more traditional methods in
terms of scientific return. But even when the approach does not
produce a substantial improvement, it still acts as an important
complementary method of analyzing data, because different approaches to
an overall problem help to mitigate systematic errors in any one
approach.