Calculate the average distance between a given DNA motif within DNA sequences in R

Suppose that we want to calculate the expected distance of a DNA motif within a DNA target sequence, if we know the composition bias or the probability distribution (multinomial model) we can compute it just fine.

Download the R code here

FIRST PART

For example, supose that we want to compute the expected distance of the motif “GATC” in a sequence composed of 10,000 bases given that the whole sequence follows a probability distribution of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3.

So open an R prompt and load the functions:

source("motifOccurrence.R")

Then, enter the initial values:

lengthSeq motif probDistr And finally, compute the average expected distance of every occurrence of the motif inside the target sequence. Using the following formula (this computation corresponds to the “computeExpectedDistance” function of the R script “motifOccurrence.R”):

expectedDistance = lengthDNAseq/(lengthDNAseq*p))

where “p” stands for the joint probability of the motif, in other words: p = p(GATC) = p(G)*p(A)*p(T)*p(C) = 0.2*0.3*0.3*0.2 = 0.0036

To easily compute the expected distance in R, type:

expectedDist As you see, it returns the value of “277”, this number means that, for the “GATC” motif inside a sequence of 10,000 bases with a composition bias of p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, p(T) = 0.3 we may expect a distance of 277 bases between each “GATC”.

Or, a little more graphic:

277 bases | GATC | 277 bases | GATC | …..

SECOND PART

Another way to compute this (even though it involves more computations) we can simulate X number of DNA sequences with a fixed length with an equal probability distribution per sequence and extract the coordinates of the motif within each sequence to finally compute the average distance of the motif.

There is a function titled “iterateComputeDistance” to do the calculations for you. Add the next parameter to the R environment:

iterations Compute the average distance of the “GATC” motif within 100 DNA sequences (the other parameters remain equal)

expectedDistWithSimSeqs As we expected, the results of the two approaches are highly similar (ouuu yeah!)

THIRD PART

But, what happens when we already have a sequence and want to know the expected distance of that motif inside of it?.

Just like “Hey dude, I have an E.coli plasmid DNA sequence and want to know the average distance of the GATC motif”.

Ok, use the “ape” library to import the sequence to the R environment (if this library is not installed, type: install.packages(“ape”))

NOTE: the GenBank sequences are in lowecase, so it will be needed to use a motif in lowercase to do the right computations.

Import the library:require("ape")

Import the sequence:plasmid plasmidDNA plasmidDNA motifEcoli Get the coordinates:plasmidDNAcoord Get the average distance between the motif occurrences.plasmidDNAmotifDistance > plasmidDNAmotifDistance [1] 270

Compute the number of occurrences of the motif among the plasmid (the result is 151 occurrences):

The number of occurrences of the motif in R is:(length(plasmidDNAcoord)-1)

So, the main distance between the motif “gatc” finally is 270 bases and we are done 😀

# This function returns the mean distance # of a given motif given X number of DNA sequences given a multinomial model # (probability distribution of each base).

# So, it will generate X number of DNA sequences using a given # probability distribution and then it will compute the distance among # that mofit within the total set of sequences to finally returns # the average distance of the motif.

result for(i in 1:numberOfIterations){ currentGenome currentCoordinatesOfMotif currentSequence lengthDNAseq,rep=T, prob=multinomialDNAmodel) currentCoordinatesOfMotif result[i] cat(" *** Iteration number: ",i," completed *** | average distance = " ,result[i],"\n") } result cat(" \n*** Computation status: DONE ***\n\n") return(result) }##############################################################################coordMotif # This function returns the coordinates of the motif of study in a target # DNA sequence. In other words, if I found the motif, tell me exactly in # which position of the DNA sequence is.

lengthMotif lengthTargetSeq motif motif res for(i in 1:lengthTargetSeq){ currentTargetSeq currentTargetSeq currentTargetSeq if(currentTargetSeq == motif){ res[(length(res)+1)] } } return(res)}##############################################################################computeDistance # This function returns the mean distance # of a given motif given its coordinates within a target DNA sequence. # In other words, If I already got a list with the coordinates where the # motif is inside a DNA sequence, tell me the average distance between # this coordinates to get the expected distance of that motif.