These data are significant because they are among the first to provide
labels that formalize the genome-wide peak detection problem, an
important problem for biomedical / epigenomics researchers.
These labels can be used to train and test supervised
peak detection algorithms, as explained below.

Each coverage.bedGraph.gz file represents a vector defined on all
genomic positions in one problem region; for example, the file for the
tcell sample named ERS358697 in the H3K9me3_TDH_BP data set covers
positions 48135600 to 86500000 on chr8. To save disk space the vectors
are saved using a run-length encoding; for example, the first three
lines of that file are

chr8	48135599	48135625	0
chr8	48135625	48135629	1
chr8	48135629	48135632	2

which means that the first 26 entries of the vector are 0, the next
four entries are 1, and the following three entries are 2. Note that
start positions are 0-based but end positions are 1-based, so the
first line means the value 0 at all positions from 48135600 to
48135625 (excluding the 0-based start position 48135599, for which we
have no information).
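The run-length encoding can be decoded into a dense per-position vector. The sketch below assumes whitespace-separated chrom/start/end/count columns and gzip compression, and follows the 0-based-start / 1-based-end convention just described:

```python
import gzip
import numpy as np

def decode_coverage(path, first_pos, last_pos):
    """Return the count at every position first_pos..last_pos inclusive.

    bedGraph start positions are 0-based (exclusive) and end positions
    are 1-based (inclusive), so a line (chrom, s, e, c) assigns count c
    to positions s+1 .. e.
    """
    counts = np.zeros(last_pos - first_pos + 1, dtype=np.int64)
    with gzip.open(path, "rt") as f:
        for line in f:
            chrom, start, end, count = line.split()
            # clip each run to the requested window
            lo = max(int(start) + 1, first_pos)
            hi = min(int(end), last_pos)
            if lo <= hi:
                counts[lo - first_pos : hi - first_pos + 1] = int(count)
    return counts
```

Applied to the three example lines above, this yields 26 zeros, four ones, and three twos.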

The goal is to learn a function that takes the coverage.bedGraph.gz
file as input, and outputs a binary classification for every genomic
position. The positive class represents peaks (typically large counts)
and the negative class represents background noise (typically small
counts).
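A trivial baseline for this goal, assuming nothing beyond the text above, is to threshold each position's count; varying the threshold gives a family of predictions of the kind used for the ROC curves later in this document:

```python
import numpy as np

def threshold_peaks(counts, lam):
    """Binary classification per position: 1 (peak) where count > lam,
    else 0 (background noise)."""
    return (np.asarray(counts) > lam).astype(int)
```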

Weak labels are given in labels.bed files, each of which marks
several regions of the genome with or without peaks. Each labeled
region has one of four annotations:

noPeaks: all of the predictions in this region should be negative /
background noise. For example, given a vector x_i of count data for
i=30028083 to i=103863906, a noPeaks label on positions 33111787 to
33114894 means that the desired function should predict negative /
background noise, f(x_i)=0, for all i from 33111787 to 33114894. If
positive / peaks are predicted, f(x_i)=1 for any i in this region,
that is counted as a false positive label.

peakStart: there should be exactly one peak start predicted in this
region. A peak start is defined as a position i such that a peak is
predicted there, f(x_i)=1, but not at the previous position,
f(x_{i-1})=0. The exact position is unspecified; any position is fine,
as long as there is only one start in the region. Predicting exactly
one peak start in this region results in a true positive; more than
one start is a false positive, and zero starts is a false negative.

peakEnd: there should be exactly one peak end predicted in this
region. A peak end is defined as a position i such that a peak is
predicted there, f(x_i)=1, but not at the next position, f(x_{i+1})=0.
The exact position is unspecified; any position is fine, as long as
there is only one end in the region. Predicting exactly one peak end
in this region results in a true positive; more than one end is a
false positive, and zero ends is a false negative.

peaks: there should be at least one peak predicted somewhere in this
region (anywhere is fine). Zero predicted peaks in this region is a
false negative. If there is a predicted peak somewhere in this region,
that is a true positive.

For a particular set of predicted peaks f(x), the total number of
incorrect labels (false positives + false negatives) can be computed
as an evaluation metric (smaller is better). Typically the peak
predictions are also stored using a run-length encoding; the error
rates can be computed using the reference implementation in the R
package PeakError, [Web Link]
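The four label rules can be sketched as follows; this is an illustrative re-implementation of the rules stated above, not the PeakError reference implementation:

```python
import numpy as np

def label_errors(pred, first_pos, labels):
    """Count false positives and false negatives for one problem region.

    pred      -- binary vector; pred[i] = 1 means a peak at position first_pos + i
    first_pos -- genomic position of pred[0]
    labels    -- list of (start, end, annotation), positions 1-based inclusive
    """
    pred = np.asarray(pred)
    # a peak start is a 0 -> 1 transition, a peak end is a 1 -> 0 transition
    diff = np.diff(np.concatenate(([0], pred, [0])))
    starts = np.flatnonzero(diff == 1) + first_pos      # first position of each peak
    ends = np.flatnonzero(diff == -1) - 1 + first_pos   # last position of each peak
    fp = fn = 0
    for start, end, ann in labels:
        region = pred[start - first_pos : end - first_pos + 1]
        if ann == "noPeaks":
            fp += int(region.any())      # any predicted peak is a false positive
        elif ann == "peakStart":
            n = int(((starts >= start) & (starts <= end)).sum())
            fp += int(n > 1)             # more than one start in the region
            fn += int(n < 1)             # no start at all
        elif ann == "peakEnd":
            n = int(((ends >= start) & (ends <= end)).sum())
            fp += int(n > 1)
            fn += int(n < 1)
        elif ann == "peaks":
            fn += int(not region.any())  # no predicted peak anywhere in the region
    return fp, fn
```

The total number of incorrect labels is then fp + fn.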

Receiver Operating Characteristic curves can be computed for a family
of predicted peaks f_lambda(x), where lambda is some significance
threshold, intercept parameter, etc. Compute the TPR and FPR as follows:

TPR = (total number of true positives)/(total number of labels that could have a true positive)
= (number of correct peaks, peakStart, peakEnd labels)/(number of peaks, peakStart, peakEnd labels)

FPR = (total number of false positives)/(total number of labels that could have a false positive)
= (
number of peakStart/peakEnd labels with two or more predicted starts/ends +
number of noPeaks labels with predicted peaks
)/(number of peakStart, peakEnd, and noPeaks labels)
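These two definitions can be written directly in code; `outcomes` below is a hypothetical per-label summary (annotation plus false-positive and false-negative flags, following the rules defined above) for one value of lambda:

```python
def roc_point(outcomes):
    """One ROC point (TPR, FPR) for one threshold lambda.

    outcomes -- list of (annotation, false_positive, false_negative),
    one entry per label.
    """
    # labels that could have a true positive: peaks, peakStart, peakEnd
    can_tp = [o for o in outcomes if o[0] in ("peaks", "peakStart", "peakEnd")]
    # labels that could have a false positive: noPeaks, peakStart, peakEnd
    can_fp = [o for o in outcomes if o[0] in ("noPeaks", "peakStart", "peakEnd")]
    # a label is correct when it has neither a false positive nor a false negative
    tpr = sum(not fp and not fn for _, fp, fn in can_tp) / len(can_tp)
    fpr = sum(fp for _, fp, _ in can_fp) / len(can_fp)
    return tpr, fpr
```

Computing one point per lambda traces out the ROC curve.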

Suggested fold ID numbers for four-fold cross-validation experiments
can be found in data/*/folds.csv files. For example,
data/H3K36me3_TDH_other/folds.csv indicates that problems
chr16:8686921-32000000 and chr16:60000-8636921 should be considered
fold ID 1, chr21:43005559-44632664 should be considered fold ID 2,
etc. This
means that for data set H3K36me3_TDH_other, the fold ID 2 consists of
all data in
data/H3K36me3_TDH_other/samples/*/*/problems/chr21:43005559-44632664
directories.
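Reading the fold assignments and forming a train/test split can be sketched as below; the folds.csv column names used here (problem, fold) are an assumption, so check the actual files:

```python
import csv
from collections import defaultdict

def read_folds(path):
    """Map fold ID -> list of problem names, e.g. 'chr16:60000-8636921'."""
    folds = defaultdict(list)
    with open(path) as f:
        for row in csv.DictReader(f):
            folds[int(row["fold"])].append(row["problem"])
    return dict(folds)

def train_test_split(folds, test_fold):
    """All problems outside test_fold form the train set."""
    train = [p for k, probs in folds.items() if k != test_fold for p in probs]
    return train, folds[test_fold]
```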

There are several types of learning settings that could be used with
these data. Here are four examples.

Unsupervised learning. Train models only using the
coverage.bedGraph.gz files. Only use the labels for evaluation (not
for training model parameters).

Supervised learning. Train models only using the coverage.bedGraph.gz
and labels.bed files in the train set. Use the labels in the test set
to evaluate prediction accuracy.

Semi-supervised learning. Train models using the coverage.bedGraph.gz
and labels.bed files in the train set. You can additionally use the
coverage.bedGraph.gz files in the test set at training time. Use the
labels in the test set to evaluate prediction accuracy.

Multi-task learning. Many data sets come from different experiment
types and so have different peak patterns. For example,
H3K4me3_TDH_immune is an H3K4me3 histone modification (sharp peak
pattern) and H3K36me3_TDH_immune is an H3K36me3 histone modification
(broad peak pattern). Therefore models are not expected to generalize
between data sets. However, the data sets do have something in common:
in each one the peak / positive class corresponds to large values,
whereas the noise / negative class corresponds to small values.
Therefore multi-task learning may be interesting. To compare a multi-task
learning model to a single-task learning model, use the suggested
cross-validation fold IDs. For test fold ID 1, train both the
multi-task and single-task learning models using all other folds, then
make predictions on all data with fold ID 1.
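The comparison protocol above can be sketched as a loop over test folds; the fit / predict / errors callables here are hypothetical placeholders for the user's own model interface, not part of these data:

```python
def compare_models(folds, models, fit, predict, errors):
    """Cross-validation comparison: for each test fold, train every model
    on all other folds and evaluate its predictions on the held-out fold.

    folds  -- {fold_id: [problem, ...]}
    models -- {name: model object}
    fit, predict, errors -- user-supplied callables (hypothetical interface)
    """
    results = {}
    for test_fold in folds:
        train = [p for k in folds if k != test_fold for p in folds[k]]
        test = folds[test_fold]
        for name, model in models.items():
            fitted = fit(model, train)
            # e.g. errors() could return the number of incorrect labels
            results[(name, test_fold)] = errors(predict(fitted, test), test)
    return results
```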

Attribute Information:

Each attribute is a non-negative integer representing the number of DNA sequence reads that aligned at that particular position of the genome. Larger values are more likely to be peaks / positive; smaller values are more likely to be noise / negative.

Relevant Papers:

The labeling method, and details on how to compute the number of incorrect labels, are described in