Monday, July 26, 2010

The publication, collation of results, and presentation of audit trails white paper is now available for discussion. When making posts, please remember to follow the house rules. Please also take the time to read the full PDF before commenting and, where possible, refer to section titles, page numbers, and line numbers so your comment can easily be cross-referenced with the document.

The recommendations are reproduced below:

1. When releasing data products, we recommend that the following information be provided about the process of product generation, in addition to the data itself, for the product to be considered an output of the project:
(1) a listing of the source data (databank version, stations, period) along with the methodological rationale;
(2) a file describing the quality-control method and quality-control metadata flags;
(3) a homogenized and/or gridded version of the data;
(4) a quality-assessment report produced by running against at least a minimum set of the common test cases described in previous white papers;
(5) a paper on the data-construction method and related products, published in a peer-reviewed journal recognized by ISI;
(6) publication of an audit trail describing all intermediate processing steps, with a strong preference for inclusion of the source code used.

2. Datasets should be served, or at the very least mirrored, in a common format through a common portal akin to the CMIP portal, to improve their utility.

3. Consideration should be given to utility tools that manipulate these data in the ways end users require.

4 comments:

This is excellent. I think it should be even stronger on source code. The paper mentions ownership of source code (e.g. by an employer) as a possible reason not to release it. Could we say that when source code is not released, the reason must be given and the owners of the source-code copyright must be identified?

I wholeheartedly agree on the necessity and importance of setting minimum requirements that would ensure the scientific value and credibility of climate products worldwide.

Having said that, I think I need to point out that the level of those requirements should be worked out with the utmost prudence, because I am afraid that overly strict requirements might discourage some NMHSs (especially in developing countries) from embarking on producing a homogenized, quality-controlled dataset. Given the importance of allowing climate-service users to hand-pick products that best suit their individual purposes, as well as of properly informing the general public about uncertainties in temperature reconstruction, more attention should be paid to keeping as much diversity as possible in the products expected to come out of this project.

Hence I strongly wish that every NMHS be properly and meticulously consulted before we try to draw up specifications on this matter.

Metadata is CRUCIAL: it allows the end user to be aware of a whole host of considerations that could introduce a bias into the data and yet might never come to mind. Some of these biases are surprising. Here are some closing remarks that I made at the 38th AMS Broadcast Meteorology Conference this June:

There are three biases that I reported on at the 14th Symposium on Meteorological Observation and Instrumentation that are perhaps unexpected and that should be part of the metadata.

The first is a daily temperature range nearly 2 °F lower on ASOS/AWIS instrumentation than on coop instruments. This is due to a smoothing effect in the temperatures reported by ASOS, which uses a 5-minute running average of measurements made every 10 seconds to determine the high and low of the day, whereas coop instruments report the highest/lowest instantaneous reading. This doesn’t make either reading “wrong”: the ASOS averaging scheme may produce a temperature more representative of the current temperature of a 3 km grid box to be assimilated into an NWP model. The problems arise when numbers from both platforms are used for the same purpose, such as generating the high temperature of the day. This is why metadata is important.

The other two biases relate to the protocol for moving a measurement to the display screen read by an observer. The units from the 1980s had a negative bias of 0.08 °F due to the use of a lookup table intended for a 0.1 °C display: the value was truncated (not rounded) before being converted to °F, then truncated again to be displayed. The newer units use a lookup table centered on each 0.1 °F. This means that a display of, say, 86.5, which an observer rounds up to 87, can correspond to a true temperature as low as 86.45, resulting in a 0.05 °F positive bias.
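Two of the effects described above lend themselves to a quick numerical check. The sketch below is purely illustrative, not ASOS or coop specification: the diurnal curve, the noise magnitude, and the observer behaviour are all assumptions. The first function shows how a 5-minute running average compresses the daily range relative to instantaneous extremes; the second reproduces the roughly +0.05 °F bias from an observer rounding a nearest-0.1 °F display up at the half degree.

```python
import math
import random

def range_compression(seed=0):
    """Daily range from instantaneous 10-second readings (coop-style)
    vs. a 5-minute running average of the same readings (ASOS-style).
    The diurnal curve and noise level are invented for illustration."""
    random.seed(seed)
    n = 8640  # 24 hours of 10-second readings
    readings = []
    for i in range(n):
        t = i / n
        diurnal = 70 + 10 * math.sin(2 * math.pi * (t - 0.25))  # smooth cycle, degF
        readings.append(diurnal + random.gauss(0, 0.8))         # turbulent noise

    coop_range = max(readings) - min(readings)  # instantaneous extremes

    w = 30  # 5 minutes = 30 samples at 10-second spacing
    smoothed = [sum(readings[i:i + w]) / w for i in range(n - w + 1)]
    asos_range = max(smoothed) - min(smoothed)  # extremes of the running mean
    return coop_range, asos_range

def observer_rounding_bias(n=200_000, seed=1):
    """Mean bias when the display shows the nearest 0.1 degF and the
    observer rounds halves up to a whole degree: a true value as low
    as 86.45 displays as 86.5 and is logged as 87."""
    random.seed(seed)
    total = 0.0
    for _ in range(n):
        true_f = 80 + 10 * random.random()
        display = round(true_f * 10) / 10   # nearest 0.1 degF on the display
        logged = math.floor(display + 0.5)  # observer rounds .5 upward
        total += logged - true_f
    return total / n  # averages out to about +0.05 degF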

We find this the most opaque of the white papers, which is regrettable given the importance of publication in the process of disseminating high-level data products. (We take this to be the scope of the paper, although the title it is linked under, and the discussion areas identified, could suggest a wider scope.) From an audit-trail point of view, it is unfortunate that, for example, the WP PDFs do not contain their white paper ID number or version.

Overall, we would stress at this point that there is no need for this WP (or others) to reinvent the wheel. As well as easing the burden on the authors, this would help place the surface temperature project in a supportive “web” of best practice. We would point to the 2009 report of the Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age (National Academies Press, 2009, and online). The ongoing discussion in the combustion kinetics community also serves as a useful comparator with respect to community development and the data-model interface: see Frenklach, Transforming data into knowledge — Process Informatics for combustion chemistry, Proceedings of the Combustion Institute 31 (2007) 125–140. Lastly, if the data products are to meet challenges from outside academia, e.g. business, we would plead that the project adhere where possible to established quality-management-system guides, particularly ISO 9000, albeit with as light a touch as possible. Connecting into established best practice is important for two reasons: (i) it gives the project the best chance of spreading (see WP 16), and (ii) it makes participation by small groups easier, because they can be reassured that the practices they adopt for surface temperature data will be transferable to their other activities.

The 'Publication of Methodology' section and recommendation 5 appear to assume that publication of a process and algorithm in a journal recognized by ISI maintains the audit trail.
However, such a publication can never capture the nuances of implementation that can have significant effects on output, so to be truly transparent, the source code of a process/algorithm must also be captured.

'Presentation of Data Products': if the data products are to include graphics, then the graphic file must include sufficient information to place it in the audit trail. A file name, as suggested, may be sufficient at the point of supply, but users can rename files, and we want to help them continue the audit trail. The simplest way to continue transparency in graphics is to print version stamps and other information directly onto the graphic. Inclusion of such information requires referees, sub-editors, and publishers to accept that, although they may at first glance appear irrelevant or distracting, these stamps should be kept in the final published version of graphics.