Data Release Principle and Standard

The NHGRI is committed to the principle of rapid data release to the scientific
community. This principle was initially implemented during the Human Genome Project
and has been recognized as one of the most effective ways of promoting
the use of the human genome sequence to advance scientific knowledge. At a meeting
in Ft. Lauderdale co-sponsored by the Wellcome Trust and NHGRI in January 2003,
the concept of rapid data release by genomic sequence data producers was reaffirmed,
and the attendees strongly recommended applying the practice to other types of
data produced by "community resource projects". The attendees recognized,
however, that different issues, particularly with respect to data validation,
would be involved in the development of appropriate release practices for different
types of data. Since they also recognized that sustaining the practice of rapid,
prepublication data release by community resources requires that the interests
of all involved - including the data producers, data users, and funding agencies
- be addressed, they emphasized the need to develop a tripartite system of responsibility.
The report of the Ft. Lauderdale meeting is available on the Wellcome
Trust website.

The NHGRI has identified the Encyclopedia of DNA Elements (ENCODE) Project,
designed to comprehensively identify functional elements in the human genome
sequence, as a community resource project. ENCODE has begun as a pilot effort
to test and compare methods for the exhaustive identification and validation
of functional sequence elements in a limited (~1% or 30 Mb) amount of the human
genome. In practice, the ENCODE data release policy will be affected by two
important considerations: (1) several different data types will be generated,
as a variety of experimental approaches will be taken in the Project to identify
functional sequence elements, and (2) the criteria for verification for each
data type, which will vary, need to be taken into account in developing appropriate
data release standards for each data type.

At the outset of the project, the ENCODE Consortium considers it relevant to
distinguish between data verification and data validation. 'Data verification'
is understood to refer to assessing the reproducibility of an experiment, while
'data validation' is understood to refer to confirmation by other, independent
methods. As outlined below, the Consortium believes that early deposit of data
in public databases is important, and that deposit should happen as soon as the data are
verified - even if they have not yet been validated. For each data type, the Consortium
is attempting to identify the minimal verification standard necessary for public
release. The Consortium members will also identify additional
levels of validation that will be applied in subsequent analyses of the data
or with additional experimentation where appropriate. When possible, estimates
of the false positive and false negative rates for the particular experimental
approach will be included in the data releases as a measure of data validation.
The data will be deposited in public databases, such as GenBank or ENCODE Consortium
databases, and will be available for all to use without restriction (see Appendix A).
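
The release rule described above (deposit once data are verified, whether or not they have been validated) can be sketched in code. This is only an illustration under stated assumptions: the names `Dataset`, `ready_for_release`, and `error_rates` are hypothetical and are not part of any ENCODE software. The rate definitions are the conventional ones (false positives over all true negatives plus false positives, and false negatives over all true positives plus false negatives), not definitions specified by the Consortium.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Dataset:
    """Hypothetical record for one data submission (not an ENCODE schema)."""
    data_type: str
    verified: bool            # reproducibility standard for this data type met
    validated: bool = False   # confirmed by another, independent method


def ready_for_release(ds: Dataset) -> bool:
    # Release depends on verification only; validation may follow later.
    return ds.verified


def error_rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Conventional false positive and false negative rates from validation counts.

    Returns None for a rate whose denominator is zero (no estimate possible).
    """
    fpr = fp / (fp + tn) if (fp + tn) else None
    fnr = fn / (fn + tp) if (fn + tp) else None
    return fpr, fnr
```

In this sketch a dataset that is verified but not yet validated is still eligible for release, and the error-rate estimates are optional metadata attached when the counts needed to compute them exist.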

ENCODE Publication Policy/Intellectual Property Considerations

As recommended at the Ft. Lauderdale meeting for a community resource project,
the ENCODE Consortium has published an initial manuscript, a so-called "marker
paper", describing the goals of the project, its data release practices,
and the publication policies that it intends to follow.

As noted, the main goal of the ENCODE pilot project is to compare the ability
of a set of research methods to identify comprehensively all sequence-based
functional elements in genomic DNA. Thus, the final product of the Consortium,
which it intends to publish in a peer-reviewed journal, is planned to be an
overall analysis of the different methods tested by the Consortium members,
an annotated version of the full set of selected ENCODE target sequences, with
all of the functional elements identified by the Project, and a recommendation
for how to expand the ENCODE project to annotate the entire human genome. The
Consortium expects to submit this manuscript or manuscripts for publication
within six months of the end of the pilot project. In addition to group publication(s),
all of the individual research groups in the ENCODE Consortium are free to publish
the results of their own efforts in independent publications at any time. In
these individual papers, Consortium participants will not be restricted to describing
the methods developed for the project, but can and should expand into describing
biological insights that arise from their analyses. To facilitate comparison
of data between different groups involved in ENCODE, all publications by Consortium
members should, when possible, include data on a common reference set of reagents
agreed upon by the Consortium, e.g., a common cell line or a common antibody,
as applicable.

Users of Consortium data, whether members of the Consortium or not, should
be aware of the publication status of the data they use and treat them accordingly.
For example, all investigators, including other Consortium members, should obtain
the consent of the data producers before using unpublished data in their individual
publications.

Consortium members will not have privileged access to data from other members
of the Consortium. Rather, all data shared by the Consortium members will be
obtained from the data that have been released to public databases.

Investigators outside of the ENCODE Consortium are free to use the ENCODE Consortium
data, either en masse or specific subsets, but are asked to follow the guidelines
developed at the Ft. Lauderdale meeting. Specifically, data users should cite
the source of the data (referencing the initial ENCODE marker paper) and should
acknowledge the data producers from the ENCODE Consortium. In addition, the
data users are asked to recognize the interests of the data producers to publish
reports on the generation and analysis of their data. The ENCODE data are released
to public databases as pre-publication data and remain unpublished until they
appear in peer-reviewed publications. Outside investigators who perform an in-depth
analysis of data from the ENCODE Consortium and are interested in publishing
a report before the data producers do so should discuss their results with the
data producer(s) and are encouraged to establish collaborations. However, the
ENCODE Consortium members are not required to collaborate with any outside investigators.
All investigators, through their roles as journal and grant reviewers, should
enforce a high standard of respect for the scientific contribution of the data
producers.

This discussion of the ENCODE data release policy has been primarily directed
at issues concerning the use of ENCODE data in scientific publications. The
intent of the policy is to accelerate the use of the data by the scientific
community. To facilitate this goal, the data producers agree not to restrict
the use of the data by others while the data users are encouraged to act in
a manner that is consistent with this unrestricted access policy. The associated
issue of intellectual property as it pertains to the ENCODE data is addressed
in Appendix B.

Appendix A: Data Release Standard for the First Level of Verification

The Data Sharing/Release working group has recommended that the ENCODE Consortium
establish a well-articulated description of a first-level verification standard
for each data type produced by Consortium members: ENCODE labs should release,
to an appropriate public database, data obtained in experiments when this standard
has been met. In most cases, it is anticipated that additional efforts for further
verification and validation of the data will be carried out, but these should
not delay the initial release of data. The working group acknowledges that releasing
preliminary data may not be the first choice of the data producers. However,
on the assumption that such data can be useful to the scientific community,
NHGRI has adopted the policy for the ENCODE Project to make such data available
in a timely manner. This policy is consistent with the Institute's commitment
to rapid data release to the scientific community.

All of the data generated by the ENCODE project will be linked to the human
genome sequence. Data from the ENCODE Project that can be directly displayed
on the human genome sequence will be stored and delivered by the University
of California, Santa Cruz (UCSC) Genome Browser; other Project data will be
stored and delivered by the appropriate databases to be coordinated by the NHGRI
Genome Technology Branch. All ENCODE data must have the associated information
on how the experiment was performed and how the raw data were analyzed to generate
the conclusions (i.e., sequence elements) to be displayed. As data are deposited
into public databases, individual tracks will be created to display these data
on the UCSC Browser. Where applicable, the primary data underlying any sequence
elements will be linked directly to the browser track. Participating labs are
encouraged to submit their data rapidly even if they conflict with data from
other groups. As additional data validations are performed, the investigators
can modify the submitted data or even withdraw the data if further tests call
into question the validity of the released data. All data will be accompanied
by prominent caveats to notify users of the level of verification of the data
and that frequent data releases and updates will be forthcoming as further validation
and analyses are performed.

Appendix B: ENCODE Intellectual Property Issues

Since the inception of the Human Genome Project, NHGRI policy has encouraged
the rapid release and ready accessibility of genomic data to the broad research
community. A related issue of availability pertains to any intellectual property
rights that might be sought by data generators, and the effect that the exercise
of such rights has on access to the data.

The Bayh-Dole Act of 1980 provides a statutory mandate to NIH grantees and
contractors to seek patent protection, when appropriate, on inventions made
using government funds and to license those inventions with the goal of promoting
their utilization, commercialization and public accessibility. While the NHGRI
has, in accordance with that law, encouraged grantees to seek patent protection
for genomic technologies that have been developed with grant funds, the Institute
has been concerned about patent claims, and the exercise of those claims, in the case
of large-scale genomic data sets because of the Institute's belief that broad
accessibility to the data is of paramount importance, and that such data are
generally pre-competitive, i.e., a considerable amount of work would need to
be performed beyond the initial data production to demonstrate utility. For
genomic sequence data, for example, NHGRI indicated its opinion that raw data,
in the absence of additional experimental biological information, lack demonstrated
specific utility and therefore are inappropriate materials for patent filing.
The grantees participating in the NHGRI large-scale sequencing program have
been monitored for whether they filed patent claims and, to date, none have.

In the case of the HapMap Project, the participants (including the NHGRI grantees)
agreed not to file for patents on the bulk data from the Project. However, there
was a complication because the raw data produced by the Project (SNPs and individual
genotypes) had to be processed to generate the Project's ultimate output (haplotypes).
In considering the issue of data release, HapMap participants were concerned
about the possibility that researchers outside of the Project could add some
of their own data to the raw Project data, develop haplotypes prior to the Project's
ability to do so, file patent claims based on the combined data, and then potentially
restrict access by others to the HapMap data (a so-called parasitic patent).
To deal with this concern, a click-wrap license was imposed on the individual
genotype data; to gain access to the data, researchers were required to agree
not to restrict the access of others to the data and not to share the data with
anyone who had not agreed to the click-wrap license. In December of 2004, this
click-wrap license restriction was lifted to allow HapMap data to be incorporated
into other public genomic databases.

In some respects, the cases of genomic sequence data and haplotype data were
relatively easy to deal with because the data themselves do not have "utility"
(in the patent law sense of the term). As a result, grantees did not express
concern about the NHGRI policies on data release. In the case of the ENCODE
Project, however, the applicability of this argument is not as obvious. The
ENCODE Consortium will include both members funded by NHGRI ENCODE grants and
those funded by other sources. The purpose of the ENCODE Project is to generate
data that identify or define genomic DNA sequence elements that have biological
function, and therefore might be considered to have utility and to be
patentable. Therefore, the use of patents in ways that might restrict access to
large amounts or broad categories of data, e.g., all transcription factor binding
sites, is an issue that needs to be addressed.

NHGRI's primary interest is to ensure the widespread availability of all information
and any inventions that are generated during the ENCODE Project. NHGRI, therefore,
encourages all ENCODE data producers to consider placing all information generated
from their project-related efforts in the public domain and to address the NIH
guidelines [ott.od.nih.gov] on the sharing of research tools.
In the cases in which the Consortium members elect to exercise their intellectual
property rights, NHGRI encourages consideration of maximal use of non-exclusive
licensing of patents to allow for broad access and stimulate the development
of multiple products. As a criterion for joining the ENCODE Consortium, investigators
have agreed to abide by the Project's data release policy.

NHGRI also encourages users of the ENCODE data to act responsibly and share
the effort involved in maintaining unrestricted access to the data. Thus, for
example, if a data user were to incorporate ENCODE data into an invention, the
subsequent license should not restrict the access of others to the ENCODE data.
For this purpose, the term "data users" is meant to include both researchers
who are members of the ENCODE Consortium and researchers who are not.

The ENCODE pilot phase, during which data corresponding to only 1% of
the human genome will be produced, will provide NHGRI with an opportunity to
observe data producer and data user practices with respect to intellectual property
and the ENCODE Project. NHGRI grantees are reminded that the grantee institution
is required to disclose each subject invention to the Federal Agency providing
research funds within two months after the inventor discloses it in writing
to grantee institution personnel responsible for patent matters. NHGRI will
monitor grantee activity in this area to learn whether or not attempts are being
made to patent large amounts of information derived from the ENCODE Project.
If, in the future, circumstances arise that convince NIH that additional measures
are needed to achieve the goal of widespread access to the results of the Project,
the Institute reserves the right to consider a determination of exceptional
circumstance to restrict or eliminate the right of parties, under future grants,
to elect to retain title. Similarly, NHGRI will monitor the activity of data
users to attempt to determine whether access to the ENCODE data is being encumbered
by any restrictive licenses. If the policy of reliance on data user responsibility
to maintain unrestricted data access is not effective, the NHGRI will consider
adopting a click-wrap license similar to that used by the HapMap Project to
protect the ENCODE data and to ensure unrestricted access to, and use of, these
data.