1. UCI machine learning database
++++++++++++++++++++++++++++++++
A large collection of data sets accessible via anonymous FTP at
ftp.ics.uci.edu [128.195.1.1] in directory
/pub/machine-learning-databases or via web browser at
http://www.ics.uci.edu/~mlearn/MLRepository.html
2. UCI KDD Archive
++++++++++++++++++
The UC Irvine Knowledge Discovery in Databases (KDD) Archive at
http://kdd.ics.uci.edu/ is an online repository of large datasets which
encompasses a wide variety of data types, analysis tasks, and application
areas. The primary role of this repository is to serve as a benchmark
testbed to enable researchers in knowledge discovery and data mining to
scale existing and future data analysis algorithms to very large and
complex data sets. This archive is supported by the Information and Data
Management Program at the National Science Foundation, and is intended to
expand the current UCI Machine Learning Database Repository to datasets
that are orders of magnitude larger and more complex.
3. The neural-bench Benchmark collection
++++++++++++++++++++++++++++++++++++++++
Accessible at http://www.boltz.cs.cmu.edu/ or via anonymous FTP at
ftp://ftp.boltz.cs.cmu.edu/pub/neural-bench/. In case of problems or if
you want to donate data, email contact is "neural-bench@cs.cmu.edu". The
data sets in this repository include the 'nettalk' data, 'two spirals',
protein structure prediction, vowel recognition, sonar signal
classification, and a few others.
4. Proben1
++++++++++
Proben1 is a collection of 12 learning problems consisting of real data.
The datafiles all share a single simple common format. Along with the
data comes a technical report describing a set of rules and conventions
for performing and reporting benchmark tests and their results.
Accessible via anonymous FTP on ftp.cs.cmu.edu [128.2.206.173] as
/afs/cs/project/connect/bench/contrib/prechelt/proben1.tar.gz and also
on ftp.ira.uka.de as /pub/neuron/proben1.tar.gz. The file is about 1.8 MB
and unpacks into about 20 MB.
5. Delve: Data for Evaluating Learning in Valid Experiments
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Delve is a standardised, copyrighted environment designed to evaluate the
performance of learning methods. Delve makes it possible for users to
compare their learning methods with other methods on many datasets. The
Delve learning methods and evaluation procedures are well documented,
such that meaningful comparisons can be made. The data collection
includes not only isolated data sets, but "families" of data sets in
which properties of the data, such as number of inputs and degree of
nonlinearity or noise, are systematically varied. The Delve web page is
at http://www.cs.toronto.edu/~delve/
6. Bilkent University Function Approximation Repository
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
A repository of data sets collected mainly by searching resources on the
web can be found at http://funapp.cs.bilkent.edu.tr/DataSets/. Most of the
data sets are used for the experimental analysis of function
approximation techniques and for training and demonstration by the
machine learning and statistics communities. The original sources of most
data sets can be accessed via associated links. A compressed tar file
containing all data sets is available.
7. NIST special databases of the National Institute Of Standards
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
And Technology:
+++++++++++++++
Several large databases, each delivered on a CD-ROM. Here is a quick
list.
o NIST Binary Images of Printed Digits, Alphas, and Text
o NIST Structured Forms Reference Set of Binary Images
o NIST Binary Images of Handwritten Segmented Characters
o NIST 8-bit Gray Scale Images of Fingerprint Image Groups
o NIST Structured Forms Reference Set 2 of Binary Images
o NIST Test Data 1: Binary Images of Hand-Printed Segmented Characters
o NIST Machine-Print Database of Gray Scale and Binary Images
o NIST 8-Bit Gray Scale Images of Mated Fingerprint Card Pairs
o NIST Supplemental Fingerprint Card Data (SFCD) for NIST Special
Database 9
o NIST Binary Image Databases of Census Miniforms (MFDB)
o NIST Mated Fingerprint Card Pairs 2 (MFCP 2)
o NIST Scoring Package Release 1.0
o NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM
Here are example descriptions of two of these databases:
NIST special database 2: Structured Forms Reference Set (SFRS)
--------------------------------------------------------------
The NIST database of structured forms contains 5,590 full page images of
simulated tax forms completed using machine print. THERE IS NO REAL TAX
DATA IN THIS DATABASE. The structured forms used in this database are 12
different forms from the 1988 IRS 1040 Package X. These include Forms
1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F
and SE. Eight of these forms contain two pages or form faces making a
total of 20 form faces represented in the database. Each image is stored
in bi-level black and white raster format. The images in this database
appear to be real forms prepared by individuals but the images have been
automatically derived and synthesized using a computer and contain no
"real" tax data. The entry field values on the forms have been
automatically generated by a computer in order to make the data available
without the danger of distributing privileged tax information. In
addition to the images the database includes 5,590 answer files, one for
each image. Each answer file contains an ASCII representation of the data
found in the entry fields on the corresponding image. Image format
documentation and example software are also provided. The uncompressed
database totals approximately 5.9 gigabytes of data.
NIST special database 3: Binary Images of Handwritten Segmented
---------------------------------------------------------------
Characters (HWSC)
-----------------
Contains 313,389 isolated character images segmented from the 2,100
full-page images distributed with "NIST Special Database 1": 223,125
digits, 44,951 upper-case, and 45,313 lower-case character images. Each
character image has been centered in a separate 128 by 128 pixel region.
The error rate of the segmentation and assigned classification is less
than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
of image data and includes image format documentation and example
software.
The system requirements for all databases are a 5.25" CD-ROM drive with
software to read ISO-9660 format. Contact: Darrin L. Dimmick;
dld@magi.ncsl.nist.gov; (301)975-4147
The prices of the databases are between US$ 250 and 1895. If you wish to
order a database, please contact: Standard Reference Data; National
Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899;
Phone: (301)975-2208; FAX: (301)926-0416
Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in
directory /pub/data. A more complete description of the available
databases can be obtained from the same host as
/pub/databases/catalog.txt
8. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZIP Codes, Digits, and Alphabetic Characters
++++++++++++++++++++++++++++++++++++++++++++
The Center Of Excellence for Document Analysis and Recognition (CEDAR) at
the State University of New York at Buffalo announces the availability of
CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains
handwritten words and ZIP Codes in high resolution grayscale (300 ppi
8-bit) as well as binary handwritten digits and alphabetic characters
(300 ppi 1-bit). This database is intended to encourage research in
off-line handwriting recognition by providing access to handwriting
samples digitized from envelopes in a working post office.
Specifications of the database include:
+ 300 ppi 8-bit grayscale handwritten words (cities,
  states, ZIP Codes)
   o 5632 city words
   o 4938 state words
   o 9454 ZIP Codes
+ 300 ppi binary handwritten characters and digits:
   o 27,837 mixed alphas and numerics segmented
     from address blocks
   o 21,179 digits segmented from ZIP Codes
+ every image supplied with a manually determined
  truth value
+ extracted from live mail in a working U.S. Post
  Office
+ word images in the test set supplied with dictionaries
  of postal words that simulate partial recognition of
  the corresponding ZIP Code.
+ digit images included in test set that simulate
  automatic ZIP Code segmentation. Results on these
  data can be projected to overall ZIP Code recognition
  performance.
+ image format documentation and software included
System requirements are a 5.25" CD-ROM drive with software to read
ISO-9660 format. For further information, see
http://www.cedar.buffalo.edu/Databases/CDROM1/ or send email to Ajay
Shekhawat at <ajay@cedar.Buffalo.EDU>
There is also a CEDAR CDROM-2, a database of machine-printed Japanese
character images.
9. AI-CD-ROM (see question "Other sources of information")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10. Time series
+++++++++++++++
Santa Fe Competition
--------------------
Various datasets of time series (to be used for prediction learning
problems) are available for anonymous ftp from ftp.santafe.edu in
/pub/Time-Series. Data sets include:
o Fluctuations in a far-infrared laser
o Physiological data of patients with sleep apnea
o High frequency currency exchange rate data
o Intensity of a white dwarf star
o J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge"
Some of the datasets were used in a prediction contest and are described
in detail in the book "Time series prediction: Forecasting the future and
understanding the past", edited by Weigend/Gershenfeld, Proceedings
Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity
series of Addison Wesley (1994).
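A common way to turn a single time series into a prediction learning
problem is to slide a fixed-length window over it and use each window to
predict the value that follows. The sketch below illustrates that general
idea only; it is not a procedure prescribed by the contest, and the
function name make_windows, the parameter n_lags, and the toy values are
assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Build (inputs, targets) pairs for one-step-ahead prediction.

    Each input holds n_lags consecutive values; the matching target is
    the value that immediately follows them in the series.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    # A series shorter than n_lags + 1 yields no training pairs.
    for start in range(len(series) - n_lags):
        inputs.append(list(series[start:start + n_lags]))
        targets.append(series[start + n_lags])
    return inputs, targets


if __name__ == "__main__":
    toy = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy values, not contest data
    X, y = make_windows(toy, n_lags=2)
    for x_row, target in zip(X, y):
        print(x_row, "->", target)
```

The window length is a modelling choice; longer windows give the network
more context but leave fewer training pairs from a series of fixed length.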
M3 Competition
--------------
3003 time series from the M3 Competition can be found at
http://forecasting.cwru.edu/Data/index.html
The numbers of series of various types are given in the following table:
Interval  Micro Industry Macro Finance Demog Other Total
Yearly      146      102    83      58   245    11   645
Quarterly   204       83   336      76    57     0   756
Monthly     474      334   312     145   111    52  1428
Other         4        0     0      29     0   141   174
Total       828      519   731     308   413   204  3003
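As a quick sanity check, the row and column totals in the M3 table above
can be recomputed from the individual counts. The sketch below only
re-adds the numbers printed in the table; the variable and function names
are illustrative assumptions and nothing is read from the M3 data files.

```python
# Counts copied from the M3 table above. Column order:
# Micro, Industry, Macro, Finance, Demog, Other.
M3_COUNTS = {
    "Yearly":    [146, 102, 83, 58, 245, 11],
    "Quarterly": [204, 83, 336, 76, 57, 0],
    "Monthly":   [474, 334, 312, 145, 111, 52],
    "Other":     [4, 0, 0, 29, 0, 141],
}
# Totals as stated in the table.
STATED_ROW_TOTALS = {"Yearly": 645, "Quarterly": 756, "Monthly": 1428, "Other": 174}
STATED_COLUMN_TOTALS = [828, 519, 731, 308, 413, 204]
STATED_GRAND_TOTAL = 3003


def check_totals() -> bool:
    """Return True if every stated total matches the recomputed sum."""
    rows_ok = all(
        sum(counts) == STATED_ROW_TOTALS[name]
        for name, counts in M3_COUNTS.items()
    )
    # zip(*rows) transposes the table so each tuple is one column.
    column_sums = [sum(column) for column in zip(*M3_COUNTS.values())]
    columns_ok = column_sums == STATED_COLUMN_TOTALS
    grand_ok = sum(column_sums) == STATED_GRAND_TOTAL
    return rows_ok and columns_ok and grand_ok


if __name__ == "__main__":
    print("M3 table is internally consistent:", check_totals())
```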
Rob Hyndman's Time Series Data Library
--------------------------------------
A collection of over 500 time series on subjects including agriculture,
chemistry, crime, demography, ecology, economics & finance, health,
hydrology & meteorology, industry, physics, production, sales, simulated
series, sport, transport & tourism, and tree-rings can be found at
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/
11. Financial data
++++++++++++++++++
http://chart.yahoo.com/d?s=
http://www.chdwk.com/data/index.html
12. USENIX Faces
++++++++++++++++
The USENIX faces archive is a public database, accessible by ftp, that
can be of use to people working in the fields of human face recognition,
classification and the like. It currently contains 5592 different faces
(taken at USENIX conferences) and is updated twice each year. The images
are mostly 96x128 greyscale frontal images and are stored in ASCII files
in a way that makes it easy to convert them to any usual graphic format
(GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided.
Each image file takes approximately 25K.
For further information, see http://facesaver.usenix.org/
According to the archive administrator, Barbara L. Dijker
(barb.dijker@labyrinth.com), there are no restrictions on their use.
However, the image files are stored in separate directories corresponding
to the Internet site to which the person represented in the image
belongs, with each directory containing a small number of images (two on
average). This makes it difficult to retrieve even a small part of the
database by ftp, as you have to get each image individually.
A solution, as Barbara suggested to me, would be to compress the whole
set of images (in separate files of, say, 100 images each) and maintain
them as a specific archive for research on face processing, similar to
the ones that already exist for fingerprints and others. The whole
compressed database would take some 30 megabytes of disk space. I
encourage anyone willing to host this database on his/her site, available
for anonymous ftp, to contact her for details (unfortunately I don't have
the resources to set up such a site).
Please consider that UUNET has graciously provided the ftp server for the
FaceSaver archive and may discontinue that service if it becomes a
burden. This means that people should not download more than maybe 10
faces at a time from uunet.
A last remark: each file represents a different person (except for
isolated cases). This makes the database quite unsuitable for training
neural networks, since for proper generalisation several instances of the
same subject are required. However, it is still useful as a test set for
a trained network.
13. Linguistic Data Consortium
++++++++++++++++++++++++++++++
The Linguistic Data Consortium (URL:
http://www.ldc.upenn.edu/ldc/noframe.html) is an open consortium of
universities, companies and government research laboratories. It creates,
collects and distributes speech and text databases, lexicons, and other
resources for research and development purposes. The University of
Pennsylvania is the LDC's host institution. The LDC catalog includes
pronunciation lexicons, varied lexicons, broadcast speech, microphone
speech, mobile-radio speech, telephone speech, broadcast text,
conversation text, newswire text, parallel text, and varied text, at
widely varying fees.
Linguistic Data Consortium
University of Pennsylvania
3615 Market Street, Suite 200
Philadelphia, PA 19104-2608
Tel (215) 898-0464 Fax (215) 573-2175
Email: ldc@ldc.upenn.edu
14. Otago Speech Corpus
+++++++++++++++++++++++
The Otago Speech Corpus contains speech samples in RIFF WAVE format that
can be downloaded from
http://divcom.otago.ac.nz/infosci/kel/software/RICBIS/hyspeech_main.html
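Files in RIFF WAVE format can usually be inspected with Python's standard
wave module. The sketch below assumes the samples are uncompressed PCM
(the only encoding that module reads), which the corpus description does
not state; the function name describe_wav is a hypothetical helper, not
part of the corpus software.

```python
import sys
import wave
from typing import Dict


def describe_wav(path: str) -> Dict[str, float]:
    """Return basic header facts for an uncompressed PCM WAVE file."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate_hz": rate,
            "n_frames": n_frames,
            "duration_s": n_frames / rate,
        }


if __name__ == "__main__":
    # Pass the path of a downloaded sample on the command line.
    if len(sys.argv) == 2:
        print(describe_wav(sys.argv[1]))
```

Checking the sample rate and sample width first is useful because both
determine how the raw frames must be decoded before they are fed to a
network.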
15. Astronomical Time Series
++++++++++++++++++++++++++++
Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D. Feigelson
(Pennsylvania State University):
o Detection of variability in photon counting observations 1
(QSO1525+337)
o Detection of variability in photon counting observations 2 (H0323+022)
o Detection of variability in photon counting observations 3 (SN1987A)
o Detecting orbital and pulsational periodicities in stars 1 (binaries)
o Detecting orbital and pulsational periodicities in stars 2 (variables)
o Cross-correlation of two time series 1 (Sun)
o Cross-correlation of two time series 2 (OJ287)
o Periodicity in a gamma ray burster (GRB790305)
o Solar cycles in sunspot numbers (Sun)
o Deconvolution of sources in a scanning operation (HEAO A-1)
o Fractal time variability in a Seyfert galaxy (NGC5506)
o Quasi-periodic oscillations in X-ray binaries (GX5-1)
o Deterministic chaos in an X-ray pulsar? (Her X-1)
URL: http://xweb.nrl.navy.mil/www_hertz/timeseries/timeseries.html
16. Miscellaneous Images
++++++++++++++++++++++++
The USC-SIPI Image Database:
http://sipi.usc.edu/services/database/Database.html
CityU Image Processing Lab:
http://www.image.cityu.edu.hk/images/database.html
Center for Image Processing Research: http://cipr.rpi.edu/
Computer Vision Test Images:
http://www.cs.cmu.edu:80/afs/cs/project/cil/ftp/html/v-images.html
Lenna 97: A Complete Story of Lenna:
http://www.image.cityu.edu.hk/images/lenna/Lenna97.html
17. StatLib
+++++++++++
The StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon
University has a large collection of data sets, many of which can be used
with NNs.
------------------------------------------------------------------------
Next part is part 5 (of 7). Previous part is part 3.
--
Warren S. Sarle SAS Institute Inc. The opinions expressed here
saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.
