National Human Genome Research Institute
Trace Repository Workshop

July 25, 2000

Workshop Report

The purpose of this workshop was to develop recommendations on the need for and feasibility of establishing a sequence trace repository. There were two components to this discussion, and the participants were encouraged to maintain the appropriate distinctions between them: 1) establishment of an archive for sequence traces from new projects, such as the sequencing of the mouse genome, and 2) establishment of an archive for "legacy" data, i.e., data already generated and archived at the centers as part of the human sequencing project.

1. Data From New Projects

John Bouck opened the discussion with an overview of the findings of the Mouse Trace Warehousing Working Group. The group had been organized last winter and charged with considering the mouse genome sequencing project's need for a trace archive, developing specifications for such an archive, and estimating the cost of setting it up and running it for mouse data. The working group's final report* focused on an archive whose primary function would be to serve the sequencing centers as a resource for assembly and finishing activities. The needs identified for such a resource included large withdrawals of Bacterial Artificial Chromosome (BAC)-sized data sets; regular requests; and staged withdrawals, which would allow a near-line system. The working group also considered a secondary role for the archive in meeting the needs of a limited number of investigators from the wider biological community. For this mode, the requirements would be different: such a resource would be expected to handle small as well as large withdrawals, accept irregular requests, and have a rapid response time, which would require an on-line system.

The working group recommended that incoming data be in SCF3 format (to be as close to the original data as possible), with ancillary information such as lab, vector used and sequencing chemistry included in an associated file. It also recommended that a number of searching capabilities and outgoing data formats be included in the system. Additional unresolved considerations raised by the group include: 1) the response time needed for the different requests, 2) the longevity of the archive, 3) search specifications, and 4) open-source policies for the algorithms.

Discussion Points:

There was general agreement that a trace archive is a good idea; in addition to serving the sequencing centers, the data will be valuable for a number of other purposes (discussed further below).

Traces should be archived rather than the reads with quality scores; quality information can be recovered from the trace data, but traces cannot be reconstructed from quality data.

The databases can generate a variety of views (for different kinds of potential users), as derivative products, from the traces.

Different levels of access will be needed for different types of users/requests.

Some restrictions may need to be made on inputs and outputs. Further discussion will be needed once an archive is established.

There is a potential problem of trying to over-reach in setting up a system. Initial efforts should be focused.

The trace archive itself should be fixed and static. On top of it can be built a more flexible system that reflects the evolving, changing assembly.

Mike Zody reported on the progress of the recently convened Nomenclature Working Group, made up of representatives from the G5 labs. In anticipation of the Trace Archive Workshop, this group discussed how reads are named in the different centers and how the differences would affect submission to an archive. The group compiled a list of current naming conventions and concluded that the read names should be treated only as a unique identifier; each trace should also have a separate, attached file, with a defined format, containing the necessary ancillary data about the trace. The group recommended that the archive be capable of accepting multiple trace file formats and of converting them to a standard format. A draft format* for the ancillary data had been formulated and circulated to the working group; it was also made available to the workshop participants. It was agreed that another iteration is needed for the working group to finalize the report, which should then be distributed to participating large-scale sequencing centers for additional comment. This should be done in the next two weeks.

Discussion Points:

It was suggested that vector sequences be deposited in GenBank, and then the vector accession number be included in the ancillary data file of the archived traces.

A subset of the information in the ancillary file should be required for each type of sequence read submitted; the required elements will have to be tailored to the type of read (WGS, EST, finishing, etc.).

Further discussion is needed to agree on which information is essential.
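
The idea of an ancillary file with required elements that depend on read type can be illustrated with a minimal sketch. The field names, read-type labels and required-element sets below are hypothetical placeholders chosen for illustration; they are not the working group's draft format, which was still being finalized.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical required elements per read type. The actual list was
# still under discussion at the workshop.
REQUIRED_BY_READ_TYPE: Dict[str, List[str]] = {
    "wgs": ["center", "chemistry", "vector_accession"],
    "est": ["center", "chemistry", "vector_accession", "library"],
    "finishing": ["center", "chemistry", "target_clone"],
}


@dataclass
class AncillaryRecord:
    """Ancillary data attached to one trace (illustrative only)."""
    read_name: str  # treated only as a unique identifier
    read_type: str  # e.g. "wgs", "est", "finishing"
    fields: Dict[str, str] = field(default_factory=dict)


def missing_elements(record: AncillaryRecord) -> List[str]:
    """Return the required elements that are absent or empty."""
    try:
        required = REQUIRED_BY_READ_TYPE[record.read_type]
    except KeyError:
        raise ValueError(f"unknown read type: {record.read_type!r}")
    return [name for name in required if not record.fields.get(name)]


if __name__ == "__main__":
    rec = AncillaryRecord(
        read_name="example_read_0001",
        read_type="wgs",
        fields={"center": "EXAMPLE", "chemistry": "dye-terminator"},
    )
    print(missing_elements(rec))
```

Keeping the read name opaque and placing everything else in named fields mirrors the recommendation that read names serve only as unique identifiers, so each center can keep its own naming convention.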

Jean Thierry-Mieg presented a National Center for Biotechnology Information (NCBI) proposal for a trace repository. He offered the following reasons why a repository is needed and why it should archive traces:

It will guarantee that the original data will be available in the future.

It will assist in identifying regions of clone overlap in the resulting deep coverage of the human sequence and hence diminish the sequencing required in human finishing.

A repository limited to reads and quality scores would save only disk space, so it would not be significantly cheaper.

The proposed archive would store compressed SCF files and accept multiple formats via FTP or tape from the sequencers. The system would cost roughly $250,000 per 30 million sequence reads to set up, with added cost for ongoing operations. It would be part of the currently available NCBI resources, and multiple query tools and export formats would be developed to make it usable by the community.

As a guide to thinking about the design of a trace repository, the following table was constructed. It lists the foreseeable uses, the types of data sets, and the response times needed for different requests that could be made for whole genome shotgun and BAC-by-BAC sequence data from mouse, rat and other organisms.

Use/Users                           Type of Data                       Size (of demand)  Response Needed

Human Annotation                    Now: FASTA files w/Quality Scores  all               Batch data with delay
  Large Scale                       Later: piecemeal access to traces  3% - 5%           Nearly real time, with local caching of data
  Small Scale                       Traces                             small             Rapid access

Sequence Variation
  Large Scale                       Now: FASTA files w/Quality Scores  all               Batch data with delay
                                    Later: access to traces            small             Nearly real time, with local caching of data
  Small Scale                       Later: All traces                  all               Batch data with delay
                                    Traces                             small             Rapid access

Finishing
  Large Scale                       Traces                             large             Batch data with delay
  Small Scale                       Traces                             small             Rapid access

Development of WG Assembly Methods  Traces                             all               Batch data with delay

Discussion Points:

Requests that are more interactive and therefore more time consuming could be queued and addressed via FTP.

Assembly data and ancillary data need to be searchable.

Traces should be the primary data and remain static. Assembly data can point to the traces.

The logistics of deposition, the burden on the genome centers and potential effects on their pipelines must be considered.

Bandwidth may be the limiting factor to the deposition and recovery of traces. At present, the TSC archive solves this problem by taking only the ancillary data via FTP while the trace files themselves are periodically shipped on tape. Transferring all files via FTP requires only a one-time set-up, but will be limited by bandwidth. The shipment of tapes avoids the bandwidth issue, but requires constant human intervention. NCBI will work with each center to arrive at the best solution to deposition for each.
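
The bandwidth concern can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how long a network deposit would take; the average compressed trace size (50,000 bytes) and the link speed (10 megabits per second) are assumptions chosen purely for illustration and were not figures given at the workshop. Only the batch size of 30 million reads comes from the NCBI proposal above.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(n_reads: int, bytes_per_trace: int,
                  link_megabits_per_s: float) -> float:
    """Estimate days needed to move n_reads trace files over a link.

    Assumes the link is fully and continuously available, which is
    optimistic for a shared connection.
    """
    if link_megabits_per_s <= 0:
        raise ValueError("link speed must be positive")
    total_bits = n_reads * bytes_per_trace * 8
    seconds = total_bits / (link_megabits_per_s * 1_000_000)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 30 million reads is the batch size quoted in the NCBI proposal;
    # the trace size and link speed are hypothetical.
    days = transfer_days(30_000_000, 50_000, 10.0)
    print(f"{days:.1f} days")
```

Under these assumed values the transfer takes roughly two weeks of uninterrupted use of the link, which suggests why shipping tapes was considered alongside FTP.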

The above table generally applies to the user needs for the human trace data as well as the mouse, but there are, in addition, unique near-term issues that need to be considered in the case of the human sequence data. The most urgent of these is the possibility of collecting and using the trace data to improve the Golden Path in time to meet publication deadlines. It was agreed that such data would be valuable for this purpose, but the issue is one of feasibility. Representatives of each of the G5 labs estimated that it would take a month or two to de-archive and prepare the legacy human draft data for submission to an archive. However, such efforts would involve some of the same staff, and would therefore compete with the ongoing effort within each center to clean up the data going into the Golden Path. The workshop participants agreed that the latter effort was of higher priority, but referred the issue to the G5 principal investigators (PIs) for further discussion. It will probably also need to be taken to the G16.

Beyond the whole genome shotgun data, the participants agreed that EST and BAC-end sequence reads should also be included in the archive. Each of these will have some unique properties (in terms of ancillary data), and the format for collecting these data will have to be worked out. Because these reads can be added to the archive later, the establishment of the archive should not be delayed to work out this issue. NCBI will formulate a proposal for dealing with EST reads that have already been accessioned and provide feedback to the nomenclature group.

The participants also raised the question of archiving finishing reads, which are unique and need to be clearly identified as such. This issue was referred back to the Finishing Working Group, which will develop a proposal for archiving finishing reads.

Major Action Items

The Nomenclature Working Group will refine the proposal for a format for ancillary data to accompany trace files, and then circulate the proposal to the larger group of sequencing centers for comment and ratification.

A Trace Archive will be set up immediately at NCBI. Once a format for ancillary data is adopted, centers can begin to deposit mouse traces and readily available human traces. The goal for beginning to accept trace data is September 1, 2000.

A more in-depth discussion will take place with the sequencing centers involved in the human working draft to assess the tradeoffs between the effort required to de-archive and submit the human draft data to the Trace Archive and the effort to clean up the working draft sequence data being used to construct the Golden Path.

A data release policy for WGS data must be formulated. The National Human Genome Research Institute (NHGRI) agreed to draft a proposal that will be circulated to the sequencing centers and the relevant funding agencies for comment.

*The two documents referred to above can be made available to anyone interested. Please contact Kris Wetterstrand to request a copy.