User expectations in the VO era

Storing hundreds of TB of data on spinning disk is not expected to be a major
technological challenge, though it will not be a trivial expense. If Moore's law
continues, storage will become easier as time goes by. The expensive part of an
archive system is the human effort in operation, curation, calibration and
documentation. But what about software development for the on-line archive service?
This rather depends on the level of service offered, and the expectations of users
are increasing considerably as we enter the Virtual Observatory (VO) era. One can
distinguish three levels of on-line archive service: data access, complex queries,
and database manipulations. We expect to provide at least the first two for WFCAM,
and in collaboration with AstroGrid, the UK VO project, we hope to provide service
at the third level.

Straightforward data access is the standard today. One can offer either
distinct data subsets such as plates or frames, chosen from a browsable list, or
pixel maps and catalogues over user-defined areas, created from some survey
database. Typically
this is done through a web-browser and CGI-script interface, with the user seeing a
JPEG image and then being offered download of a FITS file. For such data-subset
access, serving a large total volume makes little difference, as long as the data
are sensibly indexed. Access to arbitrary small datasets normally takes only a few
seconds.
Download of larger datasets to users is limited by network bandwidth, not by service
at the data centre. The most common technical
solution is storage on a RAID disk array speaking to a web server. The data can be
stored in and interrogated through a Data Base Management System (DBMS), but
actually flat files and home-grown software are often used and work very fast. (Such
systems, however, don't transfer well to the next database.)
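
To make the flat-file approach concrete, here is a minimal sketch in Python of indexed access to a catalogue stored as plain files. The zone-per-file layout, file names and column names are all assumptions for illustration, not a description of any particular archive.

```python
import numpy as np

# Hypothetical flat-file layout: the catalogue is split into one file per
# 1-degree declination zone, so a small sky region can be read without
# scanning the whole archive. Field names (ra, dec) are illustrative.

def load_zone(zone):
    """Load one declination zone from its own flat file (assumed layout)."""
    return np.load(f"catalogue_zone_{zone:+03d}.npy")

def query_box(ra_min, ra_max, dec_min, dec_max):
    """Return catalogue rows inside a user-defined sky box."""
    rows = []
    for zone in range(int(np.floor(dec_min)), int(np.floor(dec_max)) + 1):
        cat = load_zone(zone)
        sel = ((cat["ra"] >= ra_min) & (cat["ra"] <= ra_max) &
               (cat["dec"] >= dec_min) & (cat["dec"] <= dec_max))
        rows.append(cat[sel])
    return np.concatenate(rows)
```

Because only the zones overlapping the requested box are read, response time scales with the size of the request rather than the size of the archive, which is the point made above.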

With the second level of service, complex queries, one can construct questions
in SQL or a similar language, along the lines of ``give me a list of objects redder
than X in this area of sky, with measurable proper motion, that have such-and-such
quality flag better than Y (unless R-mag is brighter than Z, in which case accept
anything), and that were found on a Tuesday''. This kind of service is slowly
becoming more common. An example is the SDSS science archive, used either with the
user-installed sdssQT query tool or through a web-browser interface (see
http://archive.stsci.edu/sdss). Another example is the 2dF Galaxy Redshift Survey
web page run from Mount Stromlo (http://www.mso.anu.edu.au/2dfGRS). This
requires a proper DBMS of some kind (Objectivity for SDSS, soon changing to MS SQL
Server, and miniSQL for the 2dFGRS) and some sort of user tool interfacing between
the
astronomer and the DBMS. Answering arbitrary complex queries on very large
multi-parameter datasets efficiently requires intelligent indexing and caching. Even
then, there will
always be occasions when very large numbers of table rows need to be searched
through sequentially and only brute force will do. Most such searches are limited by
CPU-to-disk I/O bandwidth, and sometimes by seek time. Even at 100 MB/s, a 10 TB
database takes over a day to search. The increasingly accepted solution is to have a
cluster with many CPUs searching in parallel. Such a search engine can provide
the catalogue storage and the CPU power at the same time.
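
To show what such a query looks like in practice, here is a sketch of the quoted request expressed in SQL through Python's standard sqlite3 module. The table layout, column names and thresholds are hypothetical, invented to mirror the English description above.

```python
import sqlite3

# Hypothetical object table mirroring the query quoted above. The column
# names (colour, pm, pm_err, quality, rmag, obs_weekday) and the numeric
# thresholds are invented for illustration.
conn = sqlite3.connect("survey.db")  # assumed catalogue database

query = """
    SELECT id, ra, dec, colour, pm
    FROM objects
    WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ?  -- this area of sky
      AND colour > ?                      -- redder than X
      AND pm > 3.0 * pm_err               -- measurable proper motion
      AND (quality > ? OR rmag < ?)       -- flag better than Y, unless brighter than Z
      AND obs_weekday = 'Tuesday'         -- found on a Tuesday
"""

for row in conn.execute(query, (150.0, 160.0, -5.0, 5.0, 1.5, 0.9, 17.0)):
    print(row)
```

An index on (ra, dec) would let the DBMS satisfy the spatial clause quickly; the remaining clauses are exactly the kind that can force a brute-force sequential scan.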

The leading edge now, which can be seen as Level-2a, is federated queries: the
ability to make joint queries of arbitrary databases distributed around the world,
e.g. ``give me all the objects in the UKIDSS LAS
survey
which were not seen in the SDSS but do have an X-ray ID in either a Chandra or an
XMM observation, and check the list against the ESO VLT observing log''. This is the
key problem being tackled by the various VO projects worldwide, and involves
standardisation of data and metadata, but also use of new standardised ``web
service'' data-exchange methods with XML-formatted data, SOAP message wrappers, and
service description with WSDL. It also needs some kind of
astronomical registry service, and a standardised method of ``single sign-on'' using
some kind of digital certificate rather than a multiplicity of passwords. The
prospects of the necessary technological solutions being in place by the end of
WFCAM's first year of
operation are good, so we anticipate participating in such federated queries, and
indeed will work closely with the AstroGrid project in particular to
make the WFCAM archive and associated services VO-ready.
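
As a flavour of the web-service machinery involved, the sketch below assembles a SOAP-wrapped query message using only the Python standard library. The service namespace, operation name and parameters are invented for this example; a real VO service would advertise them in its WSDL description.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard); the service namespace is invented.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
VO_NS = "http://example.org/vo-archive"  # hypothetical archive service

# Build an Envelope/Body/ConeSearch message with RA, DEC, SR child elements.
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
search = ET.SubElement(body, f"{{{VO_NS}}}ConeSearch")  # invented operation
for name, value in (("RA", "180.0"), ("DEC", "-1.5"), ("SR", "0.1")):
    ET.SubElement(search, f"{{{VO_NS}}}{name}").text = value

# The serialised message would be POSTed to the service endpoint over HTTP.
print(ET.tostring(envelope, encoding="unicode"))
```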

The third level of service involves large database manipulations. What we have
in mind are things like calculation of correlation functions, cluster analysis in N-D
parameter space, making statistical digests so that one can find objects 5σ
outside the main clump, visualisation and exploration of multi-faceted datasets, and
so on. Today, such data-intensive calculations are the province of specialised
``power users'' on their own machines, but we expect that such
calculations will increasingly be provided as a standard, fast service at the data
centre, and that it will become common to do exploratory analysis this
way as well as rigorous calculations. Furthermore, because of the increasingly large
archive volume and network
limitations, it will be more practical to use a service provided by a data centre
than to download huge amounts of data and hack your own code. Such data-intensive
calculations usually need N² or N log N algorithms, so PC farms, with slow
interconnect between nodes, are too slow, and one needs a proper multi-processor SMP
machine. In other words, to offer this kind of service one needs, as well as a
PC-farm search engine, an expensive analysis engine, and facility-class data analysis
software to go with it. This is a major challenge but one we hope to work towards in
combination with AstroGrid and other VO projects.
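
As a toy example of the statistical digests mentioned above, the sketch below flags objects lying more than 5σ from the main clump in a four-dimensional parameter space. The mock data and the simple per-axis scaling are stand-ins; a real analysis would use robust covariance estimation and a proper clustering step.

```python
import numpy as np

# Mock catalogue: 100 000 objects in a 4-D parameter space, with a small
# population of genuine outliers injected far from the main clump.
rng = np.random.default_rng(42)
params = rng.normal(size=(100_000, 4))
params[:100] += 8.0  # injected outliers

# Digest of the main clump: a robust centre and a per-parameter spread.
centre = np.median(params, axis=0)
scale = np.std(params, axis=0)

# Normalised distance of every object from the clump centre.
dist = np.sqrt((((params - centre) / scale) ** 2).sum(axis=1))

outliers = np.nonzero(dist > 5.0)[0]
print(f"{outliers.size} objects lie more than 5 sigma from the main clump")
```

A brute-force two-point correlation function over the same catalogue would be an N² calculation of exactly the kind argued above to need an SMP machine rather than a loosely coupled farm.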