Chapter 9 Discrete Sampling Geometries

Chapter 5 explains how to specify Coordinate Systems for data arranged in a multidimensional rectangular spatiotemporal grid. This chapter extends and modifies that framework for data with discrete sampling geometries in space and time, meaning that the spatiotemporal dimensions are not all independent, and gridpoints do not exist for all possible combinations of space and time coordinates.

9.1.1 Feature types

The types of discrete sampling geometry, called feature types, that are specified by this chapter are:

point: a collection of data points with no structure in time and space

timeSeries: a series of data points at the same location, with varying time

trajectory: a series of data points along a curve in time and space

profile: a set of data points along a vertical line

timeSeriesProfile: a series of profiles at the same location, with varying time

trajectoryProfile: a set of profiles which originate from points along a trajectory

A single instance of each feature type can be formally described and distinguished by the coordinates and dimensions it involves (with dimensions shown in CDL order):

point

data(i)

x(i) y(i) [z(i)] t(i)

timeSeries

data(i,o)

x(i) y(i) [z(i)] t(i,o)

trajectory

data(i,o)

x(i,o) y(i,o) [z(i,o)] t(i,o)

profile

data(i,o)

x(i) y(i) z(i,o) t(i)

timeSeriesProfile

data(i,p,o)

x(i) y(i) z(i,p,o) t(i,p)

trajectoryProfile

data(i,p,o)

x(i,p) y(i,p) z(i,p,o) t(i,p)

where x y z t are the spatiotemporal coordinates, [] indicate optional coordinates, i is the subscript identifying the instance of the feature type, while o and p are subscripts of the data values that compose that instance. For example, in a collection of timeSeries features, each timeSeries instance i has data values at each o time index. The dimension which runs over instances of the feature type (timeSeries, profiles, etc.) should be the outer dimension, i.e. the leading dimension in CDL, as shown by i in the table in 9.1.1. We call this the "instance dimension".

The aim of this chapter is to provide efficient ways of storing many instances of a given feature type in each data variable. There may be more than one data variable in the file, but in this version of CF the data variables must all be of the same feature type. Future versions of CF may generalize this to allow multiple feature types in a file.

For CF-1.5, the global attribute CF:featureType is required, except where it is omitted to preserve backwards compatibility with the earlier examples (sections 5.4 and 5.5). New file writers are strongly encouraged to add CF:featureType in all cases, and to follow the newer conventions as described in this chapter.

9.1.2 Representations

There are two approaches for representing data with a discrete sampling geometry in CF:

the multidimensional (rectangular array) representation is simpler but requires that the same amount of space be reserved for each feature stored in the file

the ragged array representation allows different features to be stored with different lengths in the file

The multidimensional representation requires the coordinate variable for each dimension to contain the union of all the values taken for that dimension by all the instances. If this results in a sparsely populated data variable then this representation is inefficient in space. For example, if there are several timeSeries, the time coordinate variable must include all of the sampling times of all of the timeSeries; that wastes a lot of space if the timeSeries do not have sampling times in common. If, however, they have the same set of time coordinates, then this is a reasonable encoding.
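The space cost described above can be made concrete with a small sketch. The station names, sampling times and values below are invented for illustration, and NaN is used as a stand-in for whatever missing value a real file would declare.

```python
# Hypothetical illustration: two timeSeries with no sampling times in common.
FILL = float("nan")  # stand-in for a missing value

series = {
    "station_a": {0: 10.0, 6: 11.0, 12: 12.0},  # time -> temperature
    "station_b": {3: 20.0, 9: 21.0},
}

# The shared time coordinate must hold the union of all sampling times.
time_axis = sorted(set().union(*(s.keys() for s in series.values())))

# Rectangular data(station, time): absent samples become missing values.
data = [[s.get(t, FILL) for t in time_axis] for s in series.values()]

filled = sum(len(s) for s in series.values())
total = len(series) * len(time_axis)
print(time_axis)
print(f"{filled} of {total} cells hold data")
```

With disjoint sampling times, half of the rectangular array in this sketch is padding; with identical sampling times none of it would be.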

In the ragged representations, variables which have both the instance dimension (i) and dimensions running over data elements of the feature (o and p) become one-dimensional, with a size equal to the total number of data elements in all instances, as shown in the examples in following subsections. We call this the "sample dimension". Variables which have only the instance dimension provide metadata that describe the instance (timeSeries, trajectory, etc.). In general, we call these "instance variables". The instance dimension may be larger than the number of instances which are currently present in the data, with the unused instances having missing values in the instance variables. This is sometimes needed to preallocate the number of instance variables before you know how many will end up in the file.
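As a rough sketch of how the sample dimension arises, the following flattens per-instance data of unequal length into a single one-dimensional array. The nested list and the name `counts` are hypothetical; they are not taken from the examples in this chapter.

```python
# Hypothetical per-instance data: three timeSeries of different lengths.
instances = [
    [1.0, 2.0, 3.0],
    [4.0],
    [5.0, 6.0],
]

# Ragged form: one flat array along the sample dimension ...
samples = [value for inst in instances for value in inst]

# ... whose size is the total number of data elements in all instances.
counts = [len(inst) for inst in instances]
sample_dim_size = sum(counts)

print(samples)
print(sample_dim_size)
```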

A single instance of the point feature type is zero-dimensional, so a collection of them is one-dimensional. For that feature type, only the multidimensional representation is used (following chapter 5), because a one-dimensional array cannot be ragged, and the ragged representations would needlessly take up extra space.

For other feature types, if there is only a single instance, then there is no need for an instance dimension; the data will therefore be one-dimensional, and again the multidimensional representation can be used efficiently. The multidimensional representation can also be used if there are multiple instances, and the multivalued coordinates have the same values for each instance. In other cases, the ragged representations are recommended. The following subsections detail each feature type and show examples of the possible ragged representations of each.

The two ragged representations have distinct advantages and structure:

The contiguous ragged array representation is the most efficient storage method but can be used only if each instance can be written all at once. It stores each instance as a set of adjacent elements in the data variable. The canonical use case for this is when all the data to be written is accessible at the same time, and you expect that the common pattern will be to read all the data at once from each instance. In this representation, the data for each timeSeries will be contiguous on disk. This representation is identifiable by the presence of a CF:ragged_row_count attribute on the count variable, which names the sample dimension being counted. This count variable must be of type integer and must have the instance dimension as its sole dimension.

The indexed ragged array representation stores the instances interleaved in the data variables, so they can be written incrementally. The canonical use case is when writing real-time data streams that contain reports from many sources; the data can be written as it arrives. If the sample dimension is the netCDF unlimited dimension, new data can be appended to the file. This representation is identifiable by the presence of a CF:ragged_row_index attribute on the index variable, which names the instance dimension being indexed. The values of the index variable contain zero-based indices that assign each sample to one of the feature instances. This index variable must be of type integer, and must have the sample dimension as its single dimension.
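The count variable and the index variable carry the same information about which sample belongs to which instance. A minimal sketch of converting between the two follows; the function names are hypothetical, and plain Python lists stand in for the netCDF variables.

```python
def counts_to_index(row_count):
    """Expand per-instance counts into one zero-based instance index per sample."""
    index = []
    for instance, n in enumerate(row_count):
        index.extend([instance] * n)
    return index


def index_to_counts(row_index, n_instances):
    """Count how many samples are assigned to each instance."""
    counts = [0] * n_instances
    for instance in row_index:
        counts[instance] += 1
    return counts


print(counts_to_index([3, 0, 2]))
print(index_to_counts([2, 0, 0, 2, 0], 3))
```

Going from an index to counts discards the interleaved ordering, so the data variables would also have to be reordered before an indexed file could be rewritten in contiguous form.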

9.1.3 Coordinates

It is required that the data can be located by the space and time coordinates needed by each feature type (see table in section 9.1.1) using information contained entirely in the file. Therefore:

The coordinates attribute must identify the auxiliary coordinate variables needed to locate the data.

The location must be unambiguous, and so the coordinates attribute must not point to multiple variables of the same spatiotemporal type.

The lat, lon and time coordinates must always exist; a vertical coordinate may exist.

If there is a vertical coordinate variable, it must be identified as specified in chapter 4.3. The use of the attribute axis="Z" is recommended for clarity. A standard_name attribute (see section 3.3) that clarifies the vertical coordinate is recommended, e.g. "altitude", "height", "height_above_reference_ellipsoid", "geopotential_height", or "surface_altitude". See CF Standard Name Table for details.

It is strongly recommended to include an instance variable which uniquely identifies the instance, with a standard name of station_id, trajectory_id or profile_id as appropriate. These ids must have unique values, but may have any data type.

There may optionally be other instance variables describing stations, trajectories or profiles, as appropriate to make your file "self describing".

Coordinate bounds may optionally be used, following section 7.1.

9.1.4 Missing Data

Auxiliary coordinates may use missing values to indicate that the sample should be skipped. The data variables that use these coordinates should also have missing values wherever the auxiliary coordinate does, although a reader may check just the coordinate values to infer missing data.
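A writer could check this rule before closing a file, along the following lines. This is only a sketch: NaN and None stand in for the file's declared missing value, and the function names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (stand-ins for a declared missing value)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def inconsistent_samples(coord, data):
    """Sample positions where the coordinate is missing but the data is not."""
    return [i for i, (c, d) in enumerate(zip(coord, data))
            if is_missing(c) and not is_missing(d)]


def valid_samples(coord, data):
    """What a reader that checks only the coordinate would keep."""
    return [(c, d) for c, d in zip(coord, data) if not is_missing(c)]


nan = float("nan")
time = [0.0, nan, 2.0, nan]
temp = [1.0, nan, 3.0, 9.9]  # last value violates the recommendation

print(inconsistent_samples(time, temp))
print(valid_samples(time, temp))
```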

9.2 Point Data

To represent data at scattered, unconnected locations, both data and coordinates use the same, single dimension. The 'coordinates' attribute is used on the data variables to unambiguously identify the time, lat, lon, and (optional) vertical auxiliary coordinate variables.

The humidity(s,i) and temp(s,i) data are associated with the coordinate values time(s,i), lat(s), lon(s), and optionally vertical(s). The station dimension may be the unlimited dimension or not.

The time coordinate may use a missing value, which indicates that data is missing for that location and obs index. This allows one to have a variable number of observations at different stations, at the cost of some wasted space. The data variables may also use missing data values, to indicate that just that data variable is missing. If all the time values are identical for all timeSeries, you may use time(obs) to indicate this.
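The padding scheme above can be read back as follows. This sketch assumes a hypothetical missing-value flag of -999.0 and uses nested lists in place of the time(station, obs) and temp(station, obs) variables; all values are invented.

```python
MISSING = -999.0  # hypothetical missing-value flag

# time(station, obs) and temp(station, obs), padded to a common length.
time = [[0.0, 1.0, 2.0],
        [0.5, MISSING, MISSING]]
temp = [[10.0, MISSING, 12.0],
        [20.0, MISSING, MISSING]]


def station_series(s):
    """Return (time, temp) pairs for station s, skipping slots with missing time."""
    return [(t, v) for t, v in zip(time[s], temp[s]) if t != MISSING]


print(station_series(0))  # temp alone is missing at the second observation
print(station_series(1))  # only one real observation; the rest is padding
```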

Note that this is a generalization of Example 5.4, which assumes that all the timeSeries have observations with the same time coordinates.

9.3.2 Ragged array (contiguous) representation

When the number of samples at each location varies, one can use the 'contiguous ragged array' representation, provided one can completely control the order in which the observations are written. The canonical use case for this is when rewriting raw data, and you expect that the common read pattern will be to read all the data from each time series.

Here, station is the instance dimension, and obs the sample dimension. The sample dimension could be the netCDF unlimited dimension, but that is not required. The auxiliary coordinate variables lat, lon, alt and station_name are station variables.

The row_size variable contains the length of each timeSeries, and is identified by having an attribute with name CF:ragged_row_count whose value is the sample dimension being counted. All variables having the obs dimension as their outer dimension are described by this row_size variable.

The row_size variable must have the instance dimension as its single dimension, and must be type integer.
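One plausible way for a reader to use row_size is to accumulate it into start offsets along the obs dimension and then slice. The values below are invented, and the helper name `station_samples` is hypothetical.

```python
from itertools import accumulate

row_size = [3, 1, 2]                          # row_size(station)
temp = [10.0, 11.0, 12.0, 20.0, 30.0, 31.0]   # temp(obs)

# start[s] is the obs index of the first sample belonging to station s.
start = [0] + list(accumulate(row_size))[:-1]


def station_samples(var, s):
    """Slice the contiguous block of samples that belongs to station s."""
    return var[start[s]:start[s] + row_size[s]]


print(start)
print(station_samples(temp, 2))
```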

9.3.3 Ragged array (indexed) representation

When the number of samples at each location varies, and the samples cannot be written in order, one can use the 'indexed ragged array' representation. The canonical use case is when writing real-time data streams that contain reports from many stations. The data can be written as it arrives; if the sample dimension is the unlimited dimension, this will effectively append to the file.

The humidity(i) and temp(i) data are associated with the coordinate values time(i), lat(s), lon(s), and alt(s), where s = stationIndex(i). Thus, time(0), humidity(0) and temp(0) belong to the element of the station dimension that is indicated by stationIndex(0); time(1), humidity(1) and temp(1) belong to element stationIndex(1) of the station dimension, etc.

The stationIndex variable is identified by having an attribute with name "CF:ragged_row_index" whose value is the instance dimension. It must have the sample dimension as its single dimension, and must be type integer. Each value in the stationIndex variable is the zero-based index of the station to which that observation belongs.

The single dimension of the stationIndex variable is the sample dimension. All variables having this sample dimension as their outer dimension are described by this stationIndex variable.
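The lookup s = stationIndex(i) can be sketched as below, with lists standing in for the variables. All values and the two helper names are hypothetical.

```python
stationIndex = [1, 0, 1, 2, 0]           # stationIndex(obs), zero-based
time = [0.0, 0.0, 1.0, 1.0, 2.0]         # time(obs)
temp = [21.0, 11.0, 22.0, 31.0, 12.0]    # temp(obs)
lat = [45.0, 50.0, 55.0]                 # lat(station)
lon = [-10.0, 0.0, 10.0]                 # lon(station)


def location_of_sample(i):
    """Follow stationIndex from a sample to its station coordinates."""
    s = stationIndex[i]
    return lat[s], lon[s]


def samples_for_station(s):
    """Collect the interleaved samples of one station, in written order."""
    return [(time[i], temp[i]) for i, k in enumerate(stationIndex) if k == s]


print(location_of_sample(3))
print(samples_for_station(0))
```

Reading one station requires a scan of the whole index variable, which is the price paid for being able to append samples in arrival order.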

9.3.4 Single timeSeries

When there is a single timeSeries in the file, one can use the multidimensional representation with number of stations = 1. One can also use scalar coordinates. This case is identified when the lat and lon coordinates are scalar. In this case, no connecting variable between station and observations is required, since they all belong to the same station.

The NO3(t,i) and O3(t,i) data are associated with the coordinate values time(t,i), lat(t,i), lon(t,i), and alt(t,i). The trajectory dimension may be the unlimited dimension or not. All variables that have trajectory as their only dimension are considered to be information about that trajectory.

The time coordinate may use a missing value, which indicates that data is missing for that trajectory and obs index. This allows one to have a variable number of observations for different trajectories, at the cost of some wasted space. The data variables may also use missing data values.

9.4.2 Single Trajectory

When a single trajectory is stored in a file, one can use a variation of 9.4.1 which removes the trajectory dimension:

The NO3(n) and O3(n) data is associated with the coordinate values time(n), z(n), lat(n), and lon(n). When the time coordinate is ordered, it is appropriate to use a coordinate variable for time, i.e. time(time). The time dimension may be unlimited or not.

Note that structurally this looks like unconnected point data as in example 9.2.1. The presence of the CF:featureType = "trajectory" global attribute indicates that in fact the points are connected along a trajectory.

Note that this is the same as Example 5.5.

9.4.3 Ragged array (contiguous) representation

When the number of samples for each trajectory varies, and one can control the order of writing, one can use the contiguous ragged array representation. The canonical use case for this is when rewriting raw data, and you expect that the common read pattern will be to read all the data from each trajectory.

The O3(i) and NO3(i) data are associated with the coordinate values time(i), lat(i), lon(i), and alt(i). All samples for one trajectory are contiguous along the obs dimension. All variables that have trajectory as their single dimension are considered to be information about that trajectory. The obs dimension may use the unlimited dimension or not.

The row_size variable contains the number of samples for each trajectory, and is identified by having an attribute with name "CF:ragged_row_count" whose value is the sample dimension being counted. It must have the trajectory dimension as its single dimension, and must be type integer. The observations are associated with the trajectory using the same algorithm as in 9.3.2.

9.4.4 Ragged array (indexed) representation

When the number of samples for each trajectory varies, and the samples cannot be written in order, one can use the indexed ragged array representation. The canonical use case is when writing real-time data streams that contain reports from many trajectories. The data can be written as it arrives; if the obs dimension is the unlimited dimension, this will effectively append to the file.

The O3(i) and NO3(i) data are associated with the coordinate values time(i), lat(i), lon(i), and alt(i). All samples for one trajectory will have the same trajectory index value. The obs dimension may use the unlimited dimension or not. All indices are zero based.

The trajectory_index variable is identified by having an attribute with name "CF:ragged_row_index" whose value is the trajectory dimension name. It must have the sample dimension as its single dimension, and must be type integer. Each value in the trajectory_index variable is the zero-based index of the trajectory to which that sample belongs.

9.5 Profile Data

A series of connected observations along a vertical line, like an atmospheric or ocean sounding, is called a profile. The lat, lon locations are factored out into the profile.

Some assumptions are common to all profile representations:

It is strongly recommended that there always be a variable (of any type) with standard_name attribute "profile_id", whose values uniquely identify the profile.

The outer dimension of the profile_id variable is the 'instance dimension' or 'profile dimension'.

All variables that have the profile dimension as their only dimension are considered to be information about that profile.

The profile_id variable may use missing values. This allows one to reserve more space than is needed.

9.5.1 Multidimensional representation

When storing multiple profiles in the same file, and the numbers of vertical levels in each profile are the same, one can use the multidimensional representation:

The pressure(p,i), temperature(p,i), and humidity(p,i) data is associated with the coordinate values time(p), alt(p,i), lat(p), and lon(p). If the vertical coordinates are the same for all profiles, one can use z(z) instead of alt(profile,z). The time coordinate may depend on z also, e.g. time(profile,z).

When there are a variable number of observations for different profiles, use alt(profile, z) with missing values.

9.5.2 Single Profile

When a single profile is stored in a file, one can use a variation of 9.5.1 which removes the profile dimension:

The pressure(i), temperature(i), and humidity(i) data is associated with the coordinate values time, alt(i), lat, and lon. The time coordinate may depend on z also, e.g. time(z).

9.5.3 Ragged array (contiguous) representation

When the number of vertical levels for each profile varies, one can use the contiguous ragged array representation. One stores the set of observations for each profile contiguously along the obs dimension. The canonical use case for this is when rewriting raw data, and you expect that the common read pattern will be to read all the data from each profile.

The pressure(i), temperature(i), and humidity(i) data is associated with the coordinate values time(p), z(i), lat(p), and lon(p), where p is found by reading the rowSize variable values as in 9.3.2. The time coordinate may depend on z also, e.g. time(p,z).
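Finding p for a given sample i amounts to locating i among the cumulative row sizes. One possible sketch uses a binary search; the values and the name `profile_of` are hypothetical, and a list stands in for the rowSize variable.

```python
from bisect import bisect_right
from itertools import accumulate

rowSize = [2, 3, 1]                # number of levels in each profile
ends = list(accumulate(rowSize))   # one past the last obs index of each profile


def profile_of(i):
    """Return the zero-based profile index p that owns sample i."""
    if not 0 <= i < ends[-1]:
        raise IndexError(f"sample index {i} is outside the obs dimension")
    return bisect_right(ends, i)


print([profile_of(i) for i in range(ends[-1])])
```

Because bisect_right skips over repeated end values, a profile with zero levels is never returned for any sample.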

9.5.4 Ragged array (indexed) representation

When the number of vertical levels for each profile varies, and one cannot write them contiguously, one can use the indexed ragged array representation. The canonical use case is when writing real-time data streams that contain reports from many profiles, arriving randomly.

The pressure(i), temperature(i), and humidity(i) data is associated with the coordinate values time(p), z(i), lat(p), and lon(p), where p=parentIndex(i). The time coordinate may depend on z also, e.g. time(p,z). All indices are zero based.

9.6 Time Series of Profiles

When profiles are taken at a set of stations, one gets a time series of profiles at each station, called a timeSeriesProfile.

The same assumptions are made as with timeSeries data:

The outer dimension of the latitude and longitude coordinates (which must agree) is the 'station dimension'.

All variables that have the station dimension as their outer dimension are considered to be station information, and are called 'station variables'.

It is strongly recommended that there always be a station variable (of any type) with standard_name attribute "station_id", whose values uniquely identify the station.

The station_id variable may use missing values. This allows one to reserve more space than is needed for stations.

There may be station variables with standard_name attribute "station_desc", "surface_altitude", and "station_WMO_id".

9.6.1 Multidimensional representation

When storing time series of profiles at multiple stations in the same file, if there are the same number of time points for all timeSeries, and the same number of vertical levels for every profile, one can use the multidimensional representation:

The pressure(s,p,i), temperature(s,p,i), and humidity(s,p,i) data is associated with the coordinate values time(s,p), z(s,p,i), lat(s), and lon(s).

The time coordinate may depend on z also, e.g. time(station,profile,z). If all of the profiles use the same z coordinate, alt(station, profile, z) may be factored out into z(z).

When there is a varying number of profiles for different stations, use time(station, profile) with missing values. When there is a varying number of levels for different profiles, use alt(station, profile, z) with missing values.

9.6.2 Profile time series at a single station

If there is only one station in a file, one can use a variation of 9.6.1 which removes the station dimension:

The pressure(p,i), temperature(p,i), and humidity(p,i) data are associated with the coordinate values time(p), alt(p,i), lat, and lon. The time coordinate may depend on z also, e.g. time(profile,z). If all of the profiles use the same z coordinate, alt(profile, z) may be factored out into z(z).

9.6.3 Ragged array of profile time series

When the number of profiles and levels for each station varies, one can use the ragged array representation. This uses the contiguous ragged array representation for profiles (9.5.3), and adds the (factored out) station information with station indexes (9.3.3). The canonical use case is when writing real-time data streams that contain profiles from many stations, arriving randomly. However, the data for an entire profile is written all at once, and contiguously.

The profile is associated with a station using the station_index(profile). For each profile, the observations must be written contiguously, and the number of obs for each profile written in row_size(profile).

The pressure(i), temperature(i), and humidity(i) data is associated with the coordinate values time(p), z(i), lat(s), and lon(s), where s = station_index(p). The time coordinate may depend on z also, e.g. time(obs) instead of time(profile).
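The two-step association (sample to profile through row_size, then profile to station through station_index) can be sketched as follows. The values and the helper name `locate` are hypothetical, and lists stand in for the netCDF variables.

```python
from bisect import bisect_right
from itertools import accumulate

station_index = [0, 1, 0]                  # station_index(profile)
row_size = [2, 1, 3]                       # row_size(profile)
lat = [45.0, 60.0]                         # lat(station)
lon = [5.0, 15.0]                          # lon(station)
time = [0.0, 0.5, 6.0]                     # time(profile)
z = [10.0, 20.0, 10.0, 10.0, 20.0, 30.0]   # z(obs)

ends = list(accumulate(row_size))          # cumulative sample counts per profile


def locate(i):
    """Resolve sample i to its profile p, its station s, and its coordinates."""
    p = bisect_right(ends, i)   # contiguous step: which profile owns sample i
    s = station_index[p]        # indexed step: which station owns profile p
    return {"station": s, "profile": p, "time": time[p],
            "z": z[i], "lat": lat[s], "lon": lon[s]}


print(locate(2))
print(locate(4))
```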

9.7 Trajectory of Profiles

When profiles are taken along a trajectory, one gets a time series of profiles called a trajectoryProfile. This looks like a collection of profiles (see 9.5), except that the profile locations are assumed to be a connected set of points along a trajectory. A single file may contain one or more such trajectoryProfile features.

Some assumptions are common to all trajectoryProfile representations:

It is strongly recommended that there always be a variable (of any type) with standard_name attribute "trajectory_id", whose values uniquely identify the trajectory.

The outer dimension of the trajectory_id variable is the 'trajectory dimension'.

All variables that have the trajectory dimension as their only dimension are considered to be information about that trajectory.

The trajectory_id variable may use missing values. This allows one to reserve more space than is needed.

9.7.1 Trajectory Profile multidimensional representation

If there are the same number of profiles for all trajectories, and the same number of vertical levels for every profile, one can use the multidimensional representation:

The pressure(s,p,i), temperature(s,p,i), and humidity(s,p,i) data is associated with the coordinate values time(s,p), alt(s,p,i), lat(s,p), and lon(s,p).

The time coordinate may depend on z also, e.g. time(trajectory,profile,z). If all of the profiles use the same z coordinate, alt(trajectory, profile, z) may be factored out into z(z).

When there is a varying number of profiles for different trajectories, use time(trajectory, profile) with missing values. When there is a varying number of levels for different profiles, use alt(trajectory, profile, z) with missing values.

9.7.2 Single Trajectory in the file

If there is only one trajectory in the file, one can use a variation of 9.7.1 which removes the trajectory dimension:

9.7.3 Ragged array of trajectoryProfile data

When the number of profiles and levels for each trajectory varies, one can use the ragged array representation. This uses the contiguous ragged array representation for profiles (9.5.3), and adds trajectory information with trajectory indexes. The canonical use case is when writing real-time data streams that contain profiles from many trajectories, arriving randomly. However, the data for an entire profile is written contiguously all at once.

The profile is associated with a trajectory using the trajectory_index(profile). The observations for each profile must be written contiguously, and the number of obs in each profile is stored in row_size(profile).

The pressure(i), temperature(i), and humidity(i) data is associated with the coordinate values time(p), z(i), lat(p), and lon(p). The time coordinate may depend on z also, e.g. time(obs) instead of time(profile).

9.8 Other changes

9.8.1 New standard names

station_id : variable of any data type, containing unique values identifying the station

station_desc : variable of type CHAR, containing a description of the station

station_WMO_id : variable of type CHAR or int, containing the WMO identifier of the station

9.8.2 New variable attributes

9.8.3 New global attributes

CF:featureType can take one of these values:

point

timeSeries

trajectory

profile

timeSeriesProfile

trajectoryProfile
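A reader might validate the attribute along these lines. This is a sketch only: the dictionary of global attributes and the function name are hypothetical, and exact, case-sensitive matching is an assumption that the text above does not state.

```python
FEATURE_TYPES = frozenset({
    "point", "timeSeries", "trajectory",
    "profile", "timeSeriesProfile", "trajectoryProfile",
})


def check_feature_type(global_attrs):
    """Return the CF:featureType value, or None if the attribute is absent.

    Assumes exact, case-sensitive matching against the listed values.
    """
    value = global_attrs.get("CF:featureType")
    if value is None:
        return None  # e.g. an older file written before this chapter
    if value not in FEATURE_TYPES:
        raise ValueError(f"unrecognised CF:featureType: {value!r}")
    return value


print(check_feature_type({"CF:featureType": "timeSeries"}))
print(check_feature_type({}))
```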

9.8.4 Modifications to other chapters

In section 5, third paragraph, change:

"The dimensions of an auxiliary coordinate variable must be a subset of the dimensions of the variable with which the coordinate is associated (an exception is label coordinates (Section 6.1, “Labels”) which contain a dimension for maximum string length)"

to

"The dimensions of an auxiliary coordinate variable must be a subset of the dimensions of the variable with which the coordinate is associated (with two exceptions: 1) label coordinates (see Section 6.1, “Labels”) contain a dimension for maximum string length, and 2) the Point Observation indexed and contiguous representations (see Section 9, “Point Observations”) allow special kinds of coordinates which are connected in a different way than by the dimension)"

In section 5.4, first paragraph, add at the end:

It is strongly recommended that new data writers use the "Discrete Sampling" Conventions in Chapter 9.3 and 9.6, which provide more extensive options for writing Time Series Data.

In section 5.5, first paragraph, add at the end:

It is strongly recommended that new data writers use the "Discrete Sampling" Conventions in Chapter 9.4, which provide more extensive options for writing Trajectory Data.