5. CONCLUSIONS

In this review, we have introduced data mining in astronomy, given an
overview of its implementation in the form of knowledge discovery in
databases, reviewed its application to various science problems, and
discussed its future. Throughout, we have tried to emphasize data mining
as a tool to enable improved science, not as an end in itself, and to
highlight areas where improvements have been made over previous
analyses, where they might yet be made, and where the approach has
limitations.

Astronomers are generally no more cutting-edge experts in data mining
algorithms than they are in statistics, databases, hardware, or
software, but they will need to know enough to usefully apply such
approaches to the science problem they wish to address. Progress is
likely to be made via collaboration with experts in these areas,
particularly within large projects, which will employ specialists and
have working groups dedicated to data mining. Fully implemented,
commercial-level databases will be required, since the data will be too
large to organize, download, or analyze in any other way.

The available infrastructure should, therefore, be designed so that this
data mining approach to research is maximally enabled. The raw or
minimally processed data should be made available in a manner that
allows one to apply user-specific codes, either locally or, if data size
necessitates it, using computational resources local to the data. It is
unlikely that most researchers will either require or trust the exact
resources made available by higher-level tools. Such tools will be
useful for exploratory work, but ultimately one must be able to run
personal or trusted code on the data, from the level of re-reduction
upwards.

A problem arises when one wishes to utilize multiple or distributed
datasets, for example in cross-matching data for multi-wavelength
studies. Datasets should therefore be made available in a form that is
easily made interoperable via a standard storage schema, so that a user
can bring computing power and algorithms to bear on their particular
science question. The problem is particularly acute when large datasets
are held at widely separated sites, because transferring such data
across the network is currently impractical. A great deal of science is
done on small subsets of the full data, so data will still frequently be
downloaded and analyzed locally, but the paradigm of downloading entire
datasets is not sustainable.
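
As an illustration of the kind of user-specific code one might run on a
small, locally downloaded subset, the following is a minimal sketch of a
positional cross-match between two catalogs. It is not drawn from any
particular survey pipeline. The function names, the catalog layout (lists
of (source_id, ra_deg, dec_deg) tuples), the 1 arcsecond default radius,
and the example coordinates are all assumptions made for illustration.
The brute-force search compares every pair of sources, so it is only
sensible for small subsets; large catalogs would need spatial indexing.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula).

    All inputs are in degrees.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """Match each source in catalog_a to its nearest neighbor in catalog_b.

    Each catalog is a list of (source_id, ra_deg, dec_deg) tuples
    (a hypothetical layout chosen for this sketch). Returns a list of
    (id_a, id_b, separation_arcsec) for pairs closer than radius_arcsec.
    Brute force, O(N*M): suitable only for small subsets.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation_deg) of the closest match so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches


# Made-up example positions, not real survey data.
optical = [("opt1", 150.0000, 2.0000), ("opt2", 150.1000, 2.1000)]
infrared = [("ir1", 150.0001, 2.0001), ("ir2", 151.0000, 3.0000)]

for id_a, id_b, sep in cross_match(optical, infrared, radius_arcsec=1.0):
    print(f"{id_a} <-> {id_b}: {sep:.2f} arcsec")
```

A sketch of this kind only works once both catalogs share a common
coordinate convention and column layout, which is the role a standard
storage schema would play for distributed datasets.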

Acknowledgments

We thank the referee for a useful and comprehensive report.

The authors acknowledge support from NASA through grants NN6066H156 and
NNG06GF89G, from Microsoft Research, and from the University of
Illinois.

The authors made extensive use of the storage and computing facilities
at the National Center for Supercomputing Applications and thank the
technical staff for their assistance in enabling this work.