An overview of the DART 'preprocess' program

First and foremost, check out the
DART/preprocess/preprocess.html document
for detailed information.

The preprocess program actually builds source code to
be used by all the remaining modules. It is imperative to
actually runpreprocess before building
any executables. This is how the same code can assimilate state
vector 'observations' for the Lorenz_63 model and real radar reflectivities
for WRF without needing to specify a set of radar operators for the
Lorenz_63 model!

preprocess combines multiple 'obs_def' modules into one
obs_def_mod.f90 that is then used by the rest of DART.
Additionally, a new obs_kind_mod.f90 is built that will
provide support for associating the specific observation TYPES
with corresponding (generic) observation KINDS.
More on that later.
The list of source codes is contained in the
&preprocess_nml namelist and they ultimately
determine what observations and operators are supported.
If you want to add another 'obs_def' module, you must rerun
preprocess and recompile the rest of your project.
preprocess is designed to abort if
the files it is supposed to build already exist. For this reason, it is necessary
to remove a couple files (if they exist) before you run the preprocessor.
It is just a good habit to develop.

preprocess will combine
DEFAULT_obs_def_mod.F90,
obs_def_gps_mod.f90,
obs_def_QuikSCAT_mod.f90,
obs_def_GWD_mod.f90,
obs_def_altimeter_mod.f90, and
obs_def_reanalysis_bufr_mod.f90, into
obs_def_mod.f90 -
which can be used by the rest of the project.

Building and Running 'preprocess'

preprocess is an executable, so it should come as
no surprise that it must be built in the normal DART fashion. The
DART/mkmf/mkmf.template must be correct for your
environment, and the input.nml must have your desired
preprocess_nml set correctly. Given that ...

csh mkmf_preprocess
make
./preprocess

will build and run preprocess.

The first command generates an appropriate Makefile and
the input.nml.preprocess_default file.
The second command results in the compilation of a series of
Fortran90 modules which ultimately produces an executable file:
preprocess.
The third command actually runs preprocess - which builds the new
obs_kind_mod.f90 and
obs_def_mod.f90 source code files. The rest of DART
may now be built.

A little background for 'preprocess'

IMPORTANT:
Since each 'observation kind' may require different amounts of metadata to
be read or written; any routine to read or write an observation sequence
must be compiled with support for those particular observations.
The supported observations are listed in the
input.nml&obs_kind_nml block.
This is the whole point of the 'preprocess' process ...

An overview of the observation sequence

Observation sequences are complicated, there's just no better
way to describe it. Trying to automatically accomodate a myriad
of observation file formats, structure, and metadata is simply not
an easy task. For this reason, DART has its own format for observations
and a set of programs to convert observations from their original formats
to DART's format. There are definitely some things to know ...

An obs_seq.in file actually contains
no observation quantities. It may be best thought of as a
perfectly-laid-out notebook - just waiting for an observer to fill
in the actual observation quantities. All the rows and columns are
ready, labelled, and repeated for every observation time and platform.
This file is generally the start of a "perfect model"
experiment. Essentially, one instance of the model is run through
perfect_model_obs - which applies the
appropriate forward operators to the model state and 'writes them down'
in our notebook. The completed notebook is renamed
obs_seq.out.

An obs_seq.out file contains a linked list
of observations - potentially (and usually) observations from different
platforms and of different quantities - each with their own error
characteristics and metadata. These files arise from running
perfect_model_obsOR from any number
of converter programs. The creation of observation sequences from real
observations is not automatic and an email to the DART team asking
for advice for your specific types of observations is perfectly within
reason.

There is something called an obs_seq.final
file - which contains everything in the obs_seq.out
file as well as a few additional 'copies' of the observation.
Remember, DART is an ensemble algorithm. Each ensemble member must
compute its own estimate of the observation for the algorithm.
The obs_seq.final file may
contain each of these estimates (namelist controlled). Minimally, the
mean and spread of the ensemble estimates is recorded in the
obs_seq.final file. The best method of determining
the performance of your 'real world' experiment is to compare in
observation-space since we can never know the
model state that perfectly represents the real world.

IMPORTANT:
Since each 'observation kind' may require different amounts of metadata to
be read or written; any routine to read or write an observation sequence
must be compiled with support for those particular observations.
The supported observations are listed in the
input.nml&obs_kind_nml block.
This is the whole point of the 'preprocess' process ...

observation sequence file structure

explanation

obs_seq.out

obs_seq.final

There are extensible parts of the observation sequence
file; for example, the number of observation kinds contained in the
file, whether the locations have 1 or more components, how
many quality control values are available for each
observation, where those quality control values come from,
how many 'copies' of each observation there
are ... et cetera. The images to the right are links to
full-size images. They are from entirely separate
experiments. They are just meant to show the flexibility
of the file format.

Creating observations and sequences.

It is strongly encouraged that you use a single observation to test
a new model implementation.

Experience has shown that starting 'simple' is the fastest way to good results.
Starting with a single observation will exercise a sufficient portion
of the procedure and provide insight into where to spend more effort. Starting with
a single synthetic observation will allow you to focus on the more interesting
parts of the DART scheme without getting bogged down in the world of
observation data formats.

Creating a synthetic observation sequence.

There are several steps to create an observation sequence file, which
follows directly from the modular nature of the DART programming
philosophy.

Decide what observations you want to investigate and edit the
input.nml&obs_kind_nml block.

Build and run preprocess to create code that supports
the observations you want.

Build and run create_obs_sequence to define the specifics about the observation you want.

Build and run create_fixed_network_sequence to replicate those specifics through time.

Build and run perfect_model_obs to create an observation consistent with the model state and specified error distribution at the requested times and locations.

Example: generating observations for the Lorenz '63 model.

1) There are no 'real' observations for the Lorenz '63 model, so the appropriate
namelist settings are:

3) create_obs_sequence
creates an observation set definition
(typically named set_def.out),
the time-independent part of an observation sequence. It may help
to think of it as trying to define what sorts of observations will
be taken at one 'reading' ... you walk out to the box and take
temperature, humidity, and wind observations all at the same time
and place, for example. You can think of it as one page in an observer's notebook,
and only contains the location, type,
and observational error characteristics
(normally just the diagonal observational error variance)
for a related set of observations. There are no actual observation values,
nor are there any times associated with the definition. The program
is interactive and queries the user for the information it needs.
Begin by creating a minimal observation set definition
in which each of the 3 state variables of L63 is directly observed
with an observational error variance of 1.0 for each observation.
To do this, use the following input sequence
(the text including and after # is a comment and does
not need to be entered):

The following is a screenshot
(much of the verbose logging has been left off for clarity),
the user input looks like this.

[unixprompt]$ ./create_obs_sequence
Starting program create_obs_sequence
Initializing the utilities module.
Trying to log to unit 10
Trying to open file dart_log.out
Registering module :
$url: http://squish/DART/trunk/utilities/utilities_mod.f90 $
{ ... }
Input upper bound on number of observations in sequence
4
Input number of copies of data (0 for just a definition)
0
Input number of quality control values per field (0 or greater)
0
input a -1 if there are no more obs
0
Registering module :
$url: http://squish/DART/trunk/obs_def/DEFAULT_obs_def_mod.F90 $
{ ... }
initialize_module obs_kind_nml values are
-------------- ASSIMILATE_THESE_OBS_TYPES --------------
RAW_STATE_VARIABLE
-------------- EVALUATE_THESE_OBS_TYPES --------------
------------------------------------------------------
Input -1 * state variable index for identity observations
OR input the name of the observation kind from table below:
OR input the integer index, BUT see documentation...
1 RAW_STATE_VARIABLE
-1
input time in days and seconds
0 0
Input error variance for this observation definition
1.0
input a -1 if there are no more obs
0
{ this gets repeated ... until you tell it to stop ... }
input a -1 if there are no more obs
-1
Input filename for sequence ( set_def.out usually works well)
set_def.out
write_obs_seq opening formatted file set_def.out
write_obs_seq closed file set_def.out

Rest assured that if you requested to assimilate more realistic
observation types, you will be queried for appropriate information by
create_obs_sequence.
Below is a table that explains all of the input you should need to supply
for observations of the L63 model state.

4

# upper bound on num of observations in sequence

0

# number of copies of data (0 for just a definition)

0

# number of quality control values per field (0 or greater)

0

# -1 to exit/end observation definitions

-1

# observe state variable 1

0 0

# time -- days, seconds

1.0

# observational variance

0

# -1 to exit/end observation definitions

-2

# observe state variable 2

0 0

# time -- days, seconds

1.0

# observational variance

0

# -1 to exit/end observation definitions

-3

# observe state variable 3

0 0

# time -- days, seconds

1.0

# observational variance

-1

# -1 to exit/end observation definitions

set_def.out

# Output file name

4) create_fixed_network_sequence takes
the observation set definition and repeats it in time, essentially
making multiple pages in our notebook.
Again, the program is interactive and queries the user for information.
You should be able to simply follow the prompts.
The table below represents the input needed for the L63 example:

set_def.out

# Input observation set definition file

1

# Regular spaced observation interval in time

1000

# 1000 observation times

0, 43200

# First observation after 12 hours (0 days, 12 * 3600 seconds)

0, 43200

# Observations every 12 hours

obs_seq.in

# Output file for observation sequence definition

5) perfect_model_obs advances the model from the state
defined by the initial conditions file specified in the
input.nml and 'applies the forward operator' to harvest
observations to fill in the observation sequence specified in
obs_seq.in. The observation sequence finally has
values for the observations and is saved in a file generally named
obs_seq.out.
perfect_model_obs is namelist-driven, as opposed to
the previous two (whose input is a lot harder to specify in a namelist).
Take a look at (and modify if you like) the
input.nml&perfect_model_obs_nml
section of the namelist.

The End. Not only should you have an observation sequence file
(usually obs_seq.out) , you also have a file
containing the exact evolution of the model consistent with
those observations - the True_State.nc.

Real Observations - Converting to a DART-compatible format.

Real observations come in a mind-boggling diversity of formats.
We have converters for some formats in the
DART/observations directory. Many of the
formats require their own libraries (like HDF), and require intimate
knowledge of the data format to extract the portions required for the
DART observation sequence file.
Please feel free to browse the converters and their companion documentation.
Feel free to donate converters for formats we don't already support!
We like that kind of stuff.

The DART framework enforces a clean separation between observations
and the models used for assimilation. The same observations can be
used in any model which understands how to generate a value for the
requested type of observation from the models' state-space values
(i.e. the forward observation operator must exist - DART provides
many for the most common state variables).

In many cases, the original datasets are in a standard scientific
format like netCDF, HDF, or BUFR, and library routines for those
formats can be used to read in the original observation data.
The DART software distribution includes Fortran subroutines and
functions to help create a sequence of observations in memory,
and then a call to the DART observation sequence write routine
will create an entire obs_seq
file in the correct format.

In many cases, a single, self-contained program can convert
directly from the observation location, time, value, and error
into the DART format. In other cases, especially those linking with
a complicated external library (e.g. BUFR), there is a two-step process
with two programs and an ASCII intermediate file. We are currently
leaning towards single-step conversions but either
approach can be used for new programs.

The DART system comes with several types of location modules for
computing distances appropriately. The two most commonly used
are for data in a 1D system and for data in a 3D spherical coordinate
system. All the programs in the DART/observations
directory assume the
location/threed_sphere/location_mod.f90
3D sphere location module is being used.

With the myriad of observation file formats, HDF, Grib,
BUFR, netCDF, ... we simply have not had the time nor need to
support all of them.
The converters are a work in progress. There are currently about
10 other observation sources and types which we are in the process
of collecting information and conversion programs for and which will
eventually be added to this directory. In the meantime, if you have
converters for data or interest in something that is not in the
repository, please email the DART group.
Your best bet is to contact our group at dart@ucar.edu with
a specific request and we can steer you to the most similar process.

Manipulating observation sequences.

First and foremost, check out the
DART/obs_sequence/obs_sequence_tool.html
document for detailed information and examples.

obs_sequence_tool is the primary tool for manipulating
observation sequence files. Observations sequence files are linked lists of
observations organized by time. That is to say, the observations may appear
in any order in the file, but traversing the linked list will result in
observations ordered by time. There are tools for querying the linked list to
extract the 'keys' to the list items for a particular date range, for example.
obs_sequence_tool
can be used to combine observation sequences, convert from ASCII to
binary or vice-versa, extract a subset of observations, etc.

For testing, it is terribly useful to extract a small number of
observations (like ONE) from an existing observation sequence file.

The difference between observation TYPE and observation KIND.

Broadly speaking, observation TYPES are specific instances of a
generic observation KIND. The distinction is useful for several reasons,
not the least of which is to evaluate observation platforms.
Zonal wind observations from QuikSCAT vs. radiosondes, for example.
They are both observations of zonal winds (what we call KIND_U_WIND_COMPONENT),
but they are different observation TYPES; QKSWND_U_WIND_COMPONENT,
and RADIOSONDE_U_WIND_COMPONENT, respectively.
The forward observation operators are implemented based on observation KIND.
When requested, the model generates a KIND_U_WIND_COMPONENT,
it doesn't need to know that it will be compared to a QuikSCAT value
or a radiosonde value.

However, it is usually scientifically very interesting to
be able to compare the assimilations one TYPE of observation vs. another.
One observation sequence file can have lots of types of observations; DART
has the capability to assimilate (or evaluate) any combination of
observation types without getting bogged down in dataset management. The
same observation sequence can be used for experiments that include/exclude
certain observation types - ensuring that you are performing the experiment
you THINK you are performing ...

The National Center for Atmospheric Research is sponsored by the National Science Foundation.
Any opinions, findings and conclusions or recommendations expressed in this publication are those
of the author(s) and do not necessarily reflect the views of the National Science Foundation.