|
This module contains a collection of functions for calculate hash values from nucleotide words.
(Actually, it mostly packs nucleotide words from into integers, so it's a one-to-one relationship.)
It is intended to be used for indexing large sequence collections.
\begin{code}

moduleBio.Sequence.HashWordwhereimportBio.Sequence.SeqDataimportData.List(sort)importData.Char(toUpper)importData.IntimportqualifiedData.ByteString.Lazy.Char8asB-- | This is a struct for containing a set of hashing functionsdataHashFk=HF{hash::SeqData->Offset->Maybek-- ^ calculates the hash at a given offset in the sequence,hashes::SeqData->[(k,Offset)]-- ^ calculate all hashes from a sequence, and their indices,ksort::[k]->[k]-- ^ for sorting hashes}-- | Adds a default "hashes" function to a @HashF@, when "hash" is defined.genkeys::HashFk->HashFkgenkeyskf=kf{hashes=gkeys}wheregkeyss=mkkeys(hashkf$s)[0..B.lengths-1]mkkeysf(p:ps)=casefpofNothing->mkkeysfpsJustk->(k,p):mkkeysfpsmkkeys_[]=[]

-- | Contigous constructs an int\/eger from a contigous k-word.contigous::Integralk=>Int->HashFkcontigousk'=letk=fromIntegralk'c_keysi|k+i>B.lengths=Nothing|otherwise=lets'=B.takek$B.dropisinifB.findisNs'==NothingthenJust(n2kk's')elseNothingc_keyss=c_keys'0sc_keys'cs|k>B.lengths=[]|otherwise=caseB.findIndexisNw0ofNothing->letcur=n2k(fromIntegralk)sin(cur,c):c_keys''curc(B.dropks)Justi->c_keys'(c+i+1)(B.drop(i+1)s)wherew0=B.takeksc_keys''curps|B.nulls=[]|isN(B.heads)=c_keys'(p+k+1)(B.tails)|otherwise=letnew=(cur`mod`4^(k-1))*4+val(B.heads)in(new,p+1):c_keys''new(p+1)(B.tails)inHF{hash=c_key,hashes=c_keys,ksort=Data.List.sort}

\end{code}
\begin{code}

-- | Like 'contigous', but returns the same hash for a word and its reverse complement.{-# SPECIALIZE rcontig :: Int -> HashF Int #-}rcontig::Integralk=>Int->HashFkrcontigk'=letk=fromIntegralk'c_keysi|k+i>B.lengths=Nothing|B.findisNs'/=Nothing=Nothing|otherwise=Just$min(n2kk's')(n2kk'rs')wheres'=B.takek$B.dropisrs'=B.reverse$B.mapcomplements'c_keyss=c_keys'0sc_keys'cs|k>B.lengths=[]|otherwise=caseB.findIndexisNw0ofNothing->letc1=n2k(fromIntegralk)w0c2=n2k(fromIntegralk)w0rin(minc1c2,c):c_keys''(c1,c2)c(B.dropks)Justi->c_keys'(c+i+1)(B.drop(i+1)s)wherew0=B.takeksw0r=B.reverse$B.mapcomplementw0c_keys''(c1,c2)ps|B.nulls=[]|isN(B.heads)=c_keys'(p+k+1)(B.tails)|otherwise=letn1=(c1`mod`4^(k-1))*4+val(B.heads)n2=c2`div`4+4^(k-1)*(val.complement)(B.heads)in(minn1n2,p+1):c_keys''(n1,n2)(p+1)(B.tails)inHF{hash=c_key,hashes=c_keys,ksort=Data.List.sort}

\end{code}
\section{454-hashes}
454 sequencing results in relatively few substitution errors, but has problems
with mulitplicity of mononucleotides. One option is to ignore multiplicity, and
simply hash nucleotide changes.
Note that genkeys won't work on this one.
\begin{code}

-- what about Ns?compact::SeqData->[SeqData]compactbs|B.nullbs=[]|otherwise=let(h,t)=B.break(/=B.headbs)bsinh:compactt-- | Like @rcontig@, but ignoring monomers (i.e. arbitrarily long runs of a single nucelotide-- are treated the same a single nucleotide.rcpacked::Integralk=>Int->HashFkrcpackedk'=letk=fromIntegralk'c_keysi|k>B.lengths'=Nothing|B.anyisNs'=Nothing|otherwise=Just$min(n2kk's')(n2kk'rs')wheres1=takek'$compact$B.dropiss'=fromStr$mapB.heads1rs'=B.reverse$B.mapcomplements'c_keys=c_keys'0.compact-- :: Integral k => SeqData -> [(k,Offset)]-- c_keys' :: Offset -> [SeqData] -> [(k,Offset)]c_keys'iss|k'>lengthss=[]|anyisN$mapB.head$takek'ss=c_keys'(i+(B.length.head)ss)(tailss)|otherwise=lets'=fromStr$mapB.head$takek'sskey1=n2kk's'key2=n2kk'(B.reverse$B.mapcomplements')in(minkey1key2,i):c_keys''(key1,key2)(i+(B.length.head)ss)(tailss)c_keys''(c1,c2)i[]=[]c_keys''(c1,c2)i(s:ss)|isN$B.heads=c_keys'(i+(sum.mapB.length)(s:takek'ss))ss|otherwise=lets1=B.headsn1=(c1`mod`4^(k-1))*4+vals1n2=c2`div`4+4^(k-1)*(val.complement)s1in(minn1n2,i):c_keys''(n1,n2)(i+B.lengths)ssinHF{hash=c_key,hashes=c_keys,ksort=Data.List.sort}

\end{code}
\section{Gapped words}
Shape uses a gapped shape specified via a string
Note that this will be challenging to get symmetric, since
an 'N' character may appear in the mask in only one direction.
\begin{code}

typeShape=Stringgapped::Integralk=>Shape->HashFkgapped=error"gapped"

\end{code}
\subsection{Key arithmetic}
A hash is an integer representing packed k-words.
This is currently limited to the nucleotide alphabet, but could easily(?)
be extended to arbitrary alphabets. Ideally, it should also handle
wildcards (i.e. inserting a position with multiple hashes)
\begin{code}