{- |
Module : Data.Set.BKTree
Copyright : (c) Josef Svenningsson 2007-2010
(c) Henning Günter 2007
License : BSD-style
Maintainer : josef.svenningsson@gmail.com
Stability : Alpha quality. Interface may change without notice.
Portability : portable
Burkhard-Keller trees provide an implementation of sets which apart
from the ordinary operations also has an approximate member search,
allowing you to search for elements that are of a distance @n@ from
the element you are searching for. The distance is determined using
a metric on the type of elements. Therefore all elements must
implement the 'Metric' type class, rather than the more usual
'Ord'.
Useful metrics include the manhattan distance between two points,
the Levenshtein edit distance between two strings, the number of
edges in the shortest path between two nodes in an undirected graph
and the Hamming distance between two binary strings. Any euclidean
space also has a metric. However, in this module we use int-valued
metrics and that's not compatible with the metrics of euclidean
spaces which are real-values.
The worst case complexity of many of these operations is quite bad,
but the expected behavior varies greatly with the metric. For
example, the discrete metric (@distance x y | y == x = 0 |
otherwise = 1@) makes BK-trees behave abysmally. The metrics
mentioned above should give good performance characteristics.
-}moduleData.Set.BKTree(-- The main typeBKTree-- Metric,Metric(..)--,null,size,empty,fromList,singleton,insert,member,memberDistance,delete,union,unions,elems,elemsDistance,closest#ifdef DEBUG,runTests#endif)whereimportData.Set.BKTree.InternalimportqualifiedData.IntMapasMimportqualifiedData.ListasLhiding(null)importPreludehiding(null)importData.Array.IArray(Array,array,listArray,(!),assocs)importData.Array.Unboxed(UArray)#ifdef DEBUGimportqualifiedPreludeimportTest.QuickCheckimportText.PrintfimportSystem.Exit#endif-- | A type is 'Metric' if is has a function 'distance' which has the following-- properties:---- * @'distance' x y >= 0@---- * @'distance' x y == 0@ if and only if @x == y@---- * @'distance' x y == 'distance' y x@---- * @'distance' x z <= 'distance' x y + 'distance' y z@---- All types of elements to 'BKTree' must implement 'Metric'.---- This definition of a metric deviates from the mathematical one in that it-- returns an integer instead of a real number. The reason for choosing -- integers is that I wanted to avoid the rather unpredictable rounding-- of floating point numbers.classEqa=>Metricawheredistance::a->a->IntinstanceMetricIntwheredistanceij=abs(i-j)-- Fishy instance. Maybe I shouldn't have it. -- Or generalize Metric to use integer?instanceMetricIntegerwheredistanceij=fromInteger(abs(i-j))instanceMetricCharwheredistanceij=abs(fromEnumi-fromEnumj)hirschberg::Eqa=>[a]->[a]->Inthirschbergxs[]=lengthxshirschbergxsys=letlxs=lengthxslys=lengthysstart_arr::UArrayIntIntstart_arr=listArray(1,lys)[1..lys]in(L.foldl'(\arr(i,xi)->letnarr::UArrayIntIntnarr=array(1,lys)(snd$L.mapAccumL(\(s,c)((j,el),yj)->letnc=minimum[s+(ifxi==yjthen0else1),el+1,c+1]in((el,nc),(j,nc)))(i-1,i)(zip(assocsarr)ys))innarr)start_arr(zip[1..]xs))!lysinstanceEqa=>Metric[a]wheredistance=hirschberg-- ---------- BKTrees-- ---------- | Test if the tree is empty.null::BKTreea->Boolnull(Empty)=Truenull(Node___)=False-- | Size of the tree.size::BKTreea->Intsize(Empty)=0size(Node_s_)=s-- | The empty tree.empty::BKTreeaempty=Empty-- | The tree with a single elementsingleton::a->BKTreeasingletona=Nodea1M.empty-- | Inserts an element into the tree. If an element is inserted several times-- it will be stored several times.insert::Metrica=>a->BKTreea->BKTreeainsertaEmpty=Nodea1M.emptyinserta(Nodebsizemap)=Nodeb(size+1)map'wheremap'=M.insertWithrecursed(Nodea1M.empty)mapd=distanceabrecurse_tree=insertatree-- | Checks whether an element is in the tree.member::Metrica=>a->BKTreea->BoolmemberaEmpty=Falsemembera(Nodeb_map)|d==0=True|otherwise=caseM.lookupdmapofNothing->FalseJusttree->memberatreewhered=distanceab-- | Approximate searching. @'memberDistance' n a tree@ will return true if-- there is an element in @tree@ which has a 'distance' less than or equal to-- @n@ from @a@.memberDistance::Metrica=>Int->a->BKTreea->BoolmemberDistancenaEmpty=FalsememberDistancena(Nodeb_map)|d<=n=True|otherwise=any(memberDistancena)(M.elemssubMap)whered=distanceabsubMap=caseM.split(d-n-1)mapof(_,mapRight)->caseM.split(d+n+1)mapRightof(mapCenter,_)->mapCenter-- | Removes an element from the tree. If an element occurs several times in -- the tree then only one occurrence will be deleted.delete::Metrica=>a->BKTreea->BKTreeadeleteaEmpty=Emptydeleteat@(Nodeb_map)|d==0=unions(M.elemsmap)|otherwise=Nodebszsubtreeswhered=distanceabsubtrees=M.update(Just.deletea)dmapsz=sum(L.mapsize(M.elemssubtrees))+1-- | Returns all the elements of the treeelems::BKTreea->[a]elemsEmpty=[]elems(Nodea_imap)=a:concatMapelems(M.elemsimap)-- | @'elemsDistance' n a tree@ returns all the elements in @tree@ which are -- at a 'distance' less than or equal to @n@ from the element @a@.elemsDistance::Metrica=>Int->a->BKTreea->[a]elemsDistancenaEmpty=[]elemsDistancena(Nodeb_imap)=(ifd<=nthen(b:)elseid)$concatMap(elemsDistancena)(M.elemssubMap)whered=distanceabsubMap=caseM.split(d-n-1)imapof(_,mapRight)->caseM.split(d+n+1)mapRightof(mapCenter,_)->mapCenter-- | Constructs a tree from a listfromList::Metrica=>[a]->BKTreeafromListxs=L.foldl'(flipinsert)emptyxs-- | Merges several treesunions::Metrica=>[BKTreea]->BKTreeaunionsxs=fromList$concat$mapelemsxs-- | Merges two treesunion::Metrica=>BKTreea->BKTreea->BKTreeauniont1t2=unions[t1,t2]-- | @'closest' a tree@ returns the element in @tree@ which is closest to-- @a@ together with the distance. Returns @Nothing@ if the tree is empty.closest::Metrica=>a->BKTreea->Maybe(a,Int)closestaEmpty=Nothingclosestatree@(Nodeb__)=Just(closeLoopa(b,distanceab)tree)closeLoopacandidateEmpty=candidatecloseLoopacandidate@(_,d)(Nodex_imap)=L.foldl'(closeLoopa)newCand(M.elemssubMap)wherenewCand=ifj>=dthencandidateelse(x,j)j=distanceaxsubMap=caseM.split(d-j-1)imapof(_,mapRight)->caseM.split(d+j+1)mapRightof(mapCenter,_)->mapCenter-- Helper functionsonrelfxy=rel(fx)(fy)#ifdef DEBUG-- Testing-- N.B. This code requires QuickCheck 2.0{- Testing using algebraic specification. The idea is that we have this
naive inefficient distance function. But instead of comparing it to our actual
implementation we take each clause in the definition and make it into an
equation. We also change each occurrence of the name naive to a call to the
distance function.
naive [] ys = length ys
naive xs [] = length xs
naive (x:xs) (y:ys) | x == y = naive xs ys
naive (x:xs) (y:ys) = 1 + minimum [naive (x:xs) ys
,naive (x:xs) (x:ys)
,naive xs (y:ys)]
For example, the third clause becomes:
distance (x:xs) (x:ys) == distance xs ys
That way we can construct a quickCheck property from it. So, one property for
each equation in the naive algorithm. Pretty sweet! Credits go to Koen.
-}-- Way too inefficient!-- prop_naive xs ys = distance xs ys == naive xs (ys :: [Int])prop_naiveEmptyxs=distance[]xs==lengthxs&&distancexs[]==length(xs::[Int])prop_naiveConsxxsys=distance(x:xs)(x:ys)==distancexs(ys::[Int])prop_naiveDiffxyxsys=x/=y==>distance(x:xs)(y:ys)==1+minimum[distance(x:xs)(ys::[Int]),distance(x:xs)(x:ys),distancexs(y:ys)]-- ------------------------------------------------------ Semantics of BKTrees. Just a boring list of integerssemtree=L.sort(elemstree)::[Int]-- For testing functions that transform treestransfxs=sem(f(fromListxs))invariantt=inv[]tinvdictEmpty=Trueinvdict(Nodea_imap)=all(\(d,b)->distanceab==d)dict&&all(\(d,t)->inv((d,a):dict)t)(M.toListimap)-- Tests for individual functionsprop_emptyn=not(member(n::Int)empty)prop_nullxs=null(fromListxs)==Prelude.null(xs::[Int])prop_singletonn=elems(fromList[n])==[n::Int]prop_fromListxs=sem(fromListxs)==L.sortxsprop_fromListInvxs=invariant(fromList(xs::[Int]))prop_insertnxs=trans(insertn)xs==L.sort(n:xs)prop_insertInvnxs=invariant(insertn(fromList(xs::[Int])))prop_membernxs=membern(fromListxs)==L.elem(n::Int)xsprop_memberDistancedistnxs=letd=dist`mod`5ref=L.any(\e->distancene<=d)xsincollectref$memberDistancedn(fromListxs)==L.any(\e->distancene<=d)(xs::[Int])prop_deletenxs=trans(deleten)xs==L.sort(removeFirst(xs::[Int]))whereremoveFirst[]=[]removeFirst(a:as)|a==n=as|otherwise=a:removeFirstasprop_deleteInvnxs=invariant(deleten(fromList(xs::[Int])))prop_elemsxs=L.sort(elems(fromListxs))==L.sort(xs::[Int])prop_elemsDistancedistnxs=letd=dist`mod`5inL.sort(elemsDistancedn(fromListxs))==L.sort(filter(\e->distancene<=d)(xs::[Int]))prop_unionsxss=sem(unions(mapfromListxss))==L.sort(concat(xss::[[Int]]))prop_unionsInvxss=invariant(unions(mapfromList(xss::[[Int]])))prop_unionxsys=sem(union(fromListxs)(fromListys))==L.sort(xs++(ys::[Int]))prop_unionInvxsys=invariant(union(fromList(xs::[Int]))(fromList(ys::[Int])))-- Error case : 0 [1073741824,0]-- QuickCheck 2.1 finds this easily. -- The above error case hit the limit of Int. -- Maybe I should use Integer after all?prop_closestnxs=-- Some arbitrary level so that we don't hit the limit of Intall(\x->absx<100000)xs==>case(closestn(fromListxs),xs)of(Nothing,[])->True(Just(_,d),ys)->d==minimum(map(distancen)(ys::[Int]))_->False-- Testing the relations between operationsprop_insertDeletenxs=trans(deleten.insertn)xs==L.sort(xs::[Int])prop_sizeEmpty=sizeempty==0prop_sizeFromListxs=size(fromListxs)==length(xs::[Int])prop_sizeSuccnxs=size(insert(n::Int)tree)==sizetree+1wheretree=fromListxsprop_sizeDeletenxs=size(delete(n::Int)tree)==sizetree-(ifn`member`treethen1else0)wheretree=fromListxsprop_sizeUnionxsys=size(uniontreeXstreeYs)==sizetreeXs+sizetreeYswhere(treeXs,treeYs)=(fromListxs,fromList(ys::[Int]))prop_sizeUnionsxss=size(unionstrees)==sum(mapsizetrees)wheretrees=mapfromList(xss::[[Int]])prop_unionsMemberxss=all(\x->memberxtree)(concat(xss::[[Int]]))wheretree=unions(mapfromListxss)prop_fromListMemberxs=all(\x->memberxtree)(xs::[Int])wheretree=fromListxs-- All the testsdataTestCase=forallprop.Testableprop=>TcStringproptests=[Tc"empty"prop_empty,Tc"null"prop_null,Tc"singleton"prop_singleton,Tc"fromList"prop_fromList,Tc"fromList inv"prop_fromListInv,Tc"insert"prop_insert,Tc"insert inv"prop_insertInv,Tc"member"prop_member,Tc"memberDistance"prop_memberDistance,Tc"delete"prop_delete,Tc"delete inv"prop_deleteInv,Tc"elems"prop_elems,Tc"elemsDistance"prop_elemsDistance,Tc"unions"prop_unions,Tc"unions inv"prop_unionsInv,Tc"union"prop_union,Tc"union inv"prop_unionInv,Tc"closest"prop_closest,Tc"size/empty"prop_sizeEmpty,Tc"size/fromList"prop_sizeFromList,Tc"size/succ"prop_sizeSucc,Tc"size/delete"prop_sizeDelete,Tc"size/union"prop_sizeUnion,Tc"size/unions"prop_sizeUnions,Tc"insert/delete"prop_insertDelete,Tc"fromList/member"prop_fromListMember,Tc"unions/member"prop_unionsMember,Tc"naiveEmpty"prop_naiveEmpty,Tc"naiveCons"prop_naiveCons,Tc"naiveDiff"prop_naiveDiff]runTests=mapM_runTesttestswhererunTest(Tcsprop)=doprintf"%-25s :"sresult<-quickCheckResultpropcaseresultofSuccess_->return()GaveUp__->return()_->exitFailure#endif