Vox Populi

Bill Howe

My position is this: The set of operations on unstructured grids we wish to support should drive the choice of representation, rather than the other way around.

Structured grids were misleadingly easy: the data model (Cartesian products of coordinate variables) immediately implies a representation (multidimensional arrays), an API (reading and writing subslabs), and an efficient implementation of the API (iterate over dimensions of the array).

Unstructured grids are more difficult: An appropriate characterization of the data model itself is not universally accepted; a very general (and hopefully non-controversial) characterization is a collection of cells linked by an incidence relationship (cf. our GridField model (1)). There are many possible representations for this model:

a multidimensional array, potentially with missing values, as found in the CF Conventions

Each of these representations supports different operations. The correct choice of representation should be informed by the operations one wishes to support. For example, an M×3 array representing triangles can efficiently support extraction of the 100th through the 150th triangles. However, since the order in computational coordinates does not correspond to any physical ordering, it is unlikely that this kind of query is very useful.
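The asymmetry can be shown in a couple of lines. Below is a toy M×3 triangle table (as a Python list of index triples): extraction by computational index is a cheap slice, while a geometric query gets no help from this representation at all:

```python
# Triangles as an M x 3 table of node indices (toy data for illustration).
triangles = [(i, i + 1, i + 2) for i in range(200)]

# Extracting the 100th through the 150th triangles is a constant-cost slice:
subset = triangles[100:151]

# By contrast, "triangles intersecting this box" gains nothing from the
# array layout: without an index, it forces a scan of every triangle.
```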

I propose that the ability to derive a self-consistent subgrid using a geometric bounding box is the "killer app" for any unstructured grid representation, and is the closest analog to subslab operations on structured grids. Other important operations include aggregation over time and depth, vector calculus, and regridding. Any representation should efficiently support subsetting, and should not confound future efforts on aggregation, vector calculus and regridding.

To implement subsetting efficiently, an index is required. At the workshop, there were some very elegant ad hoc solutions presented (links to come). More traditional data structures such as quad-trees, k-d-trees, and R-trees are also relevant. We should not bind ourselves to flat array-based representations just because they are simple.
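As a minimal illustration of the kind of index involved, here is a point quadtree supporting rectangular queries. It is a sketch of the general technique, not a proposal for the exchange format itself (capacity, splitting policy, and payloads are all arbitrary choices here):

```python
class QuadTree:
    """A point quadtree: leaves hold up to `capacity` points, then split."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []       # (x, y, payload) triples
        self.children = None   # four sub-quadrants after a split

    def insert(self, x, y, payload):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False       # point lies outside this node's box
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y, payload))
                return True
            self._split()
        return any(c.insert(x, y, payload) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my, self.capacity),
                         QuadTree(mx, y0, x1, my, self.capacity),
                         QuadTree(x0, my, mx, y1, self.capacity),
                         QuadTree(mx, my, x1, y1, self.capacity)]
        pts, self.points = self.points, []
        for (x, y, p) in pts:
            any(c.insert(x, y, p) for c in self.children)

    def query(self, qx0, qy0, qx1, qy1, out=None):
        """Collect payloads of points inside the query box, pruning
        subtrees whose bounding boxes miss the box entirely."""
        if out is None:
            out = []
        x0, y0, x1, y1 = self.box
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return out
        for (x, y, p) in self.points:
            if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                out.append(p)
        if self.children:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, out)
        return out
```

The pruning test in `query` is the whole point: a bounding-box subset touches only the subtrees overlapping the box, rather than scanning every node in the grid.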

The bigger picture is this: The file-oriented model of scientific data exchange does not scale. As datasets grow, downloading a file for local analysis becomes infeasible. We will eventually be forced to move the computation to the data, rather than move the data to the computation. This tendency is well-documented in High-Energy Physics, Biology, and Astronomy; Oceanography is on the brink (witness 200 million surface nodes in QUODDY grids). I believe the appropriate first step is to allow server-side subsetting of data.