David Cournapeau wrote:
> Well, I guess once scipy is modularized and can be installed package by
> package, having a dataset package à la R would be nice. For now, I have a
> small Python script which converts those datasets to HDF5, so they can be
> read easily from Python, and if including them in scipy is OK
> license-wise, I can easily add the data as a package for distribution
> (the compressed, pickled, related data takes ~ 100 kb).
I'm fiddling around with a convention for data packages. Let's suppose we have a
namespace package scipydata. Each data package would be a subpackage under
scipydata. It would provide some conventionally-named metadata to describe the
dataset (`__doc__` to describe the dataset in prose, `source`, `copyright`,
etc.) and a load() callable that would load the dataset and return a dictionary
with its data. The load() callable could do whatever it needs to load the data.
It might just return objects that are defined in code (e.g. numpy.array([...]))
if they are small enough. Or it might read a CSV, NetCDF4, or HDF5 file that is
included in the package. Or it might download something from a website or FTP site.
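To make the convention concrete, here is a minimal sketch of what one such subpackage might look like. The package name, metadata values, and data are all invented for illustration; a real package would fill in its own:

```python
# Hypothetical contents of scipydata/iris/__init__.py (the package
# name and the data below are made up for illustration).
"""A small example dataset, described in prose in this docstring."""

source = "where the data came from (URL or citation)"
copyright = "terms under which the data may be redistributed"

def load():
    """Load the dataset and return a dict mapping names to data."""
    # A dataset this small can be defined directly in code; in
    # practice these would be numpy arrays, and larger datasets
    # would be read from a CSV, NetCDF, or HDF5 file shipped in
    # the package, or fetched from the network on first use.
    return {
        'sepal_length': [5.1, 4.9, 4.7, 4.6],
        'sepal_width': [3.5, 3.0, 3.2, 3.1],
    }
```

A consumer would then just do `from scipydata import iris; data = iris.load()` regardless of how the package stores its data.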
The scipydata.util package would provide some utilities to help with writing
scipydata packages. In particular, it would provide utilities to read some
kind of configuration file or environment variable that establishes a cache
directory, so that large datasets can be downloaded from a website once and
loaded from disk thereafter.
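A rough sketch of what those cache utilities might look like; the SCIPYDATA_CACHE environment variable, the ~/.scipydata fallback, and both function names are my assumptions, not an existing API:

```python
# Hypothetical helpers for scipydata.util: resolve a cache directory
# and download remote data into it exactly once.
import os
import urllib.request

def cache_dir():
    """Return the directory where downloaded datasets are cached.

    Honors a SCIPYDATA_CACHE environment variable (assumed name);
    otherwise falls back to ~/.scipydata in the user's home directory.
    """
    path = os.environ.get(
        'SCIPYDATA_CACHE',
        os.path.join(os.path.expanduser('~'), '.scipydata'))
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

def cached_fetch(url, filename):
    """Download url into the cache on first use; reuse it thereafter."""
    target = os.path.join(cache_dir(), filename)
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
    return target
```

A data package's load() could then call cached_fetch() for its large files and read the returned local path with whatever reader it needs.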
The scipydata packages could then be distributed extremely easily as eggs, and
getting your dataset would be as simple as
$ easy_install scipydata.cournapeaus_data
Does that sound good to you?
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco