Navigation

The development of bedtools was motivated by a need for fast, flexible tools with which to compare large sets of genomic
features. Answering fundamental research questions with existing tools was either too slow or required modifications to the
way they reported or computed their results. We were aware of the utilities on the UCSC Genome Browser and Galaxy websites, as
well as the elegant tools available as part of Jim Kent’s monolithic suite of tools (“Kent source”). However, we found that
the web-based tools were too cumbersome when working with large datasets generated by current sequencing technologies.
Similarly, we found that the Kent source command line tools often required a local installation of the UCSC Genome Browser.
These limitations, combined with the fact that we often wanted an extra option here or there that wasn’t available with
existing tools, led us to develop our own from scratch. The initial version of bedtools was publicly released in the spring of
2009. The current version has evolved from our research experiences and those of the scientists using the suite over the last
year. The bedtools suite enables one to answer common questions of genomic data in a fast and reliable manner. The fact that
almost all the utilities accept input from “stdin” allows one to “stream / pipe” several commands together to facilitate more
complicated analyses. Also, the tools allow fine control over how output is reported. The initial version of bedtools
supported solely 6-column BED files. However, we have subsequently added support for sequence alignments inBAMformat, as well as for features inGFF , “blocked” BED format, andVCFformat.
The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets. This manual seeks to describe the behavior and
available functionality for each bedtool. Usage examples are scattered throughout the text, and formal examples are
provided in the last two sections, we hope that this document will give you a sense of the flexibility of
the toolkit and the types of analyses that are possible with bedtools. If you have further questions, please join the bedtools
discussion group, visit the Usage Examples on the Google Code site (usage, advanced usage), or take a look at the nascent
“Usage From the Wild” page.

Throughout this manual, we will discuss how to use bedtools to manipulate, compare and ask questions of genome “features”. Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g.
SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that
an individual lab or researcher defines (e.g., my novel gene or variant).

The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the
feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable.

The two most widely used formats for representing genome features are the BED (Browser Extensible Data) and GFF (General Feature Format) formats. bedtools was originally written to work exclusively with genome features
described using the BED format, but it has been recently extended to seamlessly work with BED, GFF and VCF files.

Two genome features (henceforth referred to as “features”) are said to overlap or intersect if they share at least one base in common.
In the figure below, Feature A intersects/overlaps Feature B, but it does not intersect/overlap Feature C.

The previous section briefly introduced a fundamental naming convention used in bedtools. Specifically, all bedtools that compare features contained in two distinct files refer to one file as feature set “A” and the other file as feature set “B”. This is mainly in the interest of brevity, but it also has its roots in set theory.
As an example, if one wanted to look for SNPs (file A) that overlap with exons (file B), one would use bedtools intersect in the following manner:

bedtools intersect –a snps.bed –b exons.bed

There are two exceptions to this rule: 1) When the “A” file is in BAM format, the “-abam” option must bed used. For example:

bedtools intersect –abam alignedReads.bam –b exons.bed

And 2) For tools where only one input feature file is needed, the “-i” option is used. For example:

bedtools users are sometimes confused by the way the start and end of BED features are represented. Specifically, bedtools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1)
In other words, bedtools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base:

chr1 0 1 first_base

Why, you might ask? The advantage of storing features this way is that when computing the length of a feature, one must simply subtract the start from the end. Were the start position 1-based,
the calculation would be (slightly) more complex (i.e. (end-start)+1). Thus, storing BED features this way reduces the computational burden.

In contrast, the GFF format uses 1-based coordinates for both the start and the end positions. bedtools is aware of this and adjusts the positions accordingly.
In other words, you don’t need to subtract 1 from the start positions of your GFF features for them to work correctly with bedtools.

The VCF format uses 1-based coordinates. As in GFF, bedtools is aware of this and adjusts the positions accordingly.
In other words, you don’t need to subtract 1 from the start positions of your VCF features for them to work correctly with bedtools.

Whenever a bedtool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B.
Therefore to minimize memory usage, one should set the smaller of the two files as the B file. One salient example is the comparison of aligned sequence reads from a
current DNA sequencer to gene annotations. In this case, the aligned sequence file (in BED format) may have tens of millions of features (the sequence alignments),
while the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.

In an effort to allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”, all bedtools allow features
to be passed to them via standard input. Only one feature file may be passed to a bedtool via standard input.
The convention used by all bedtools is to set either file A or file B to “stdin” or “-”. For example:

In addition, all bedtools that simply require one main input file (the -i file) will assume that input is
coming from standard input if the -i parameter is ignored. For example, the following are equivalent:

To allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”,
most bedtools report their output to standard output, rather than to a named file. If one wants to write the output to a named file, one can use the UNIX “file redirection” symbol “>” to do so.
Writing to standard output (the default):

Some of the bedtools (e.g., genomecov, complement, slop) need to know the size of
the chromosomes for the organism for which your BED files are based. When using the UCSC Genome
Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working.
The way you do this for bedtools is to create a “genome” file, which simply lists the names of the
chromosomes (or scaffolds, etc.) and their size (in basepairs).
Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans):

chrI 15072421
chrII 15279323
...
chrX 17718854
chrM 13794

bedtools includes predefined genome files for human and mouse in the /genomes directory included
in the bedtools distribution. Additionally, the “chromInfo” files/tables available from the UCSC
Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz

We have defined a new file format (BEDPE) to concisely describe disjoint genome features, such as
structural variations or paired-end sequence alignments. We chose to define a new format because the
existing BED block format (i.e. BED12) does not allow inter-chromosomal feature definitions. Moreover,
the BED12 format feels rather bloated when one want to describe events with only two blocks.

bedtools will typically reject BED features that contain negative positions. In special cases, however,
BEDPE positions may be set to -1 to indicate that one or more ends of a BEDPE feature is unaligned.

bedtools will ignore headers at the beginning of BED and GFF files. Valid header lines begin with a
“#” symbol, the work “track”, or the word “browser”. For example, the following examples are valid
headers for BED or GFF files:

intersect, coverage, and genomecov will optionally handle “split” BAM and/or
“blocked” BED by using the -split option. This will cause intersects or coverage to be computed only
for the alignment or feature blocks. In contrast, without this option, the intersects/coverage would be
computed for the entire “span” of the alignment or feature, regardless of the size of the gaps between
each alignment or feature block. For example, imagine you have a RNA-seq read that originates from
the junction of two exons that were spliced together in a mRNA. In the genome, these two exons
happen to be 30Kb apart. Thus, when the read is aligned to the reference genome, one portion of the
read will align to the first exon, while another portion of the read will align ca. 30Kb downstream to the
other exon. The corresponding CIGAR string would be something like (assuming a 76bp read):
30M*3000N*46M. In the genome, this alignment “spans” 3076 bp, yet the nucleotides in the sequencing
read only align “cover” 76bp. Without the -split option, coverage or overlaps would be reported for the
entire 3076bp span of the alignment. However, with the -split option, coverage or overlaps will only
be reported for the portions of the read that overlap the exons (i.e. 30bp on one exon, and
46bp on the other).

Using the -split option with bamToBed causes “spliced/split” alignments to be reported in BED12
format. Using the -split option with bed12tobed6 causes “blocked” BED12 features to be reported in
BED6 format.

When working with a large BAM file using a complex set of tools in a pipe/stream, it is advantageous
to pass uncompressed BAM output to each downstream program. This minimizes the amount of time
spent compressing and decompressing output from one program to the next. All bedtools that create
BAM output (e.g. intersect, window) will now optionally create uncompressed BAM output
using the -ubam option.

bedtools was implemented in C++ and makes extensive use of data structures and fundamental
algorithms from the Standard Template Library (STL). Many of the core algorithms are based upon the
genome binning algorithm described in the original UCSC Genome Browser paper (Kent et al, 2002).
The tools have been designed to inherit core data structures from central source files, thus allowing
rapid tool development and deployment of improvements and corrections. Support for BAM files is
made possible through Derek Barnett’s elegant C++ API called BamTools.