The IRanges and GRanges objects are core components of the Bioconductor infrastructure for defining integer ranges in general (IRanges), and specifically for addressing locations in the genome and hence including chromosome and strand information (GRanges). Here we will briefly explore what these objects are and a subset of the operations which manipulate IRanges and GRanges.

IRanges

First we load the IRanges package. This is included in the base installation of Bioconductor, but as with all Bioconductor packages it can be installed with biocLite. R will print out a bunch of messages when you load IRanges about objects which are masked from other packages. This is part of the normal loading process. Here we print these messages, although on some other book pages we suppress the messages for cleanliness:

library(IRanges)

The IRanges function defines interval ranges. If you provide it with two numbers, these are the start and end of a inclusive range, e.g. , which has width 6. When referring to the size of a range, the term width is used, instead of length.

We will continue to work with the single range though. We can look up a number of intra-range methods for IRanges objects, which mean that the operations work on each range independently. For example, we can shift all the ranges two integers to the left. By left and right, we refer to the direction on the integer number line: . Compare ir and shift(ir, -2):

Those same operations plotted in a single window. The red bar shows the shadow of the original range ir. The best way to get the hang of these operations is to try them out yourself in the console on ranges you define yourself.

There are also a set of inter-range methods. These are operations which work on a set of ranges, and the output depends on all the ranges, thus distinguishes these methods from the intra-range methods, for which the other ranges in the set do not change the output. This is best explained with some examples. The range function gives the integer range from the start of the leftmost range to the end of the rightmost range:

# full details on the inter-range methods:# ?"inter-range-methods"(ir<-IRanges(start=c(3,5,17),end=c(10,8,20)))

The disjoin function breaks up the ranges in ir into discrete ranges. This is best explained with examples, but here is the formal definition first:

returns a disjoint object, by finding the union of the
end points in ‘x’. In other words, the result consists of a range
for every interval, of maximal length, over which the set of
overlapping ranges in ‘x’ is the same and at least of size 1.

Note that this is not a comprehensive list. Check the man pages we listed above, and the best way to get the hang of the functions is to try them out on some ranges you construct yourself. Note that most of the functions are defined both for IRanges and for GRanges, which will be described below.

GRanges

GRanges are objects which contain IRanges and two more important pieces of information:

the chromosome we are referring to (called seqnames in Bioconductor)

the strand of DNA we are referring to

Strand can be specified as plus “+” or minus “-“, or left unspecified with “*”. Plus strand features have the biological direction from left to right on the number line, and minus strand features have the biological direction from right to left. In terms of the IRanges, plus strand features go from start to end, and minus strand features go from end to start. This may seem a bit confusing at first, but this is required because width is defined as end - start + 1, and negative width ranges are not allowed. Because DNA has two strands, which have an opposite directionality, strand is necessary for uniquely referring to DNA.

With an IRange, a chromosome name, and a strand, we can be sure we are uniquely referring to the same region and strand of the DNA molecule as another researcher, given that we are using the same build of genome. There are other pieces of information which can be contained within a GRanges object, but the two above are the most important.

library(GenomicRanges)

Let’s create a set of two ranges on a made-up chromosome, chrZ. And we will say that these ranges refer to the genome hg19. Because we have not linked our genome to a database, we are allowed to specify a chromosome which does not really exist in hg19.

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 2 out-of-bound ranges located on sequence
## chrZ. Note that only ranges located on a non-circular sequence
## whose length is not NA can be considered out-of-bound (use
## seqlengths() and isCircular() to get the lengths and circularity
## flags of the underlying sequences). You can use trim() to trim
## these ranges. See ?`trim,GenomicRanges-method` for more
## information.

GRangesList

Especially when referring to genes, it is useful to create a list of GRanges. This is useful for representing groupings, for example the exons which belong to each gene. The elements of the list are the genes, and within each element the exon ranges are defined as GRanges.

The length of the GRangesList is the number of GRanges object within. To get the length of each GRanges we call elementLengths. We can index into the list using typical list indexing of two square brackets.

If we ask the width, the result is an IntegerList. If we apply sum, we get a numeric vector of the sum of the widths of each GRanges object in the list.

width(grl)

## IntegerList of length 2
## [[1]] 31 36
## [[2]] 41 41 41

sum(width(grl))

## [1] 67 123

We can add metadata columns as before, now one row of metadata for each GRanges object, not for each range. It doesn’t show up when we print the GRangesList, but it is still stored and accessible with mcols.

findOverlaps returns a Hits object which contains the information about which ranges in the query (the first argument) overlapped which ranges in the subject (the second argument). There are many options for specifying what kind of overlaps should be counted.

Rle and Views

Lastly, we give you a short glimpse into two related classes defined in IRanges, the Rle and Views classes. Rle stands for run-length encoding, which is a form of compression for repetitive data. Instead of storing: , we would store the number 1, and the number of repeats: 4. The more repetitive the data, the greater the compression with Rle.

We use str to examine the internal structure of the Rle, to show it is only storing the numeric values and the number of repeats

Note that the internal structure of the Views object is just the original object, and the IRanges which specify the windows. The great benefit of Views is when the original object is not stored in memory, in which case the Views object is a lightweight class which helps us reference subsequences, without having to load the entire sequence into memory.