Hi,
I'm working on a librairie to analyse repport of the prelude IDS on a MySQL
database. I need set operations (- and intresection) on quite huge liste
(several thousand of elements) and i've discovered that the - operator seems
to be always in O(n*m) (where length of list are n and m).
In fact when the list is already sorted, this operation should be O(n+m)
which is much better.
I've realised a little Set class which ilustrate this pb, i've code a O(n+m)
solution in pure Ruby code and this solution is really faster than the Ruby
one when the length is important.
This is an example, i've created two list t1 and t2 of 10000 random number,
i apply sort! and uniq! and them i test the two algo. (source code is at the
end of mail).
cedric@lagavulin:~/projet/prelude$ ./set.rb
sorting... 0.053775
size t1:9535, t2:9523
diff... 69.653139
my new diff... 0.263852
check if we get the same result with operator <=>
ok
cedric@lagavulin:~/projet/prelude$ ruby -v
ruby 1.6.7 (2002-03-19) [i386-linux]
So the hard coded - operation spends 69s and my one 0.26 s.
So i propose few solution:
1) In the - method, test if the list are allready sorted (which is a O(n+m)
operation), them adapt the algo (ie O(n+m) or O(n*m)).
2) In the - method do a sort first (which is a O(n*ln n+m*ln m) operation)
and them do the - with the O(n+m) algo. Because the - operator is a set
operator, the order doesn't really make sense...
3) Add a Set class (like my one but hard coded) where the order doesn't make
a sense and so what we manipulate is internaly alway sorted. - operation is
alway in O(n+m).
Sorry for my english, it's not my mother language.
Regards
remark: Union operation should be in O(n+m) because list are sorted (like in
a merge sort). I did a O((n+m) ln (n+m)) because i'm lazzy but it's quite
trivial to improve that.
#!/usr/bin/ruby
class Set
def initialize(tab)
@set = tab.sort
@set.uniq!
end
def [](i)
return @set[i]
end
def length
return @set.length
end
def -(aSet)
res = []
i = 0
for j in 0..aSet.length-1
while i < @set.length and @set[i] < aSet[j] do
res << @set[i]
i+=1
end
i += 1 if @set[i] == aSet[j]
end
res.concat(@set[i.. / set.length-1]) if i<@set.length
return Set.new(res)
end
def +(aSet)
tmp = @set+aSet.to_a
tmp.sort!
tmp.uniq!
return Set.new(tmp)
end
def &(aSet)
return (self-aSet)+(aSet-self)
end
def to_a
return @set
end
def to_s
return "{"+@set.join(', ')+"}"
end
end
t1 = []
t2 = []
for i in 0..10000
t1[i]=rand(100000)
t2[i]=rand(100000)
end
print "sorting... "
time1 = Time.new
t1.sort!
t1.uniq!
t2.sort!
t2.uniq!
time2 = Time.new
print time2-time1,"\n"
puts "size t1:#{t1.length}, t2:#{t2.length}"
print "diff... "
tmp1 = t2-t1
time3 = Time.new
print time3-time2,"\n"
print "my new diff... "
tmp2 = (Set.new(t2)-Set.new(t1)).to_a
time4 = Time.new
print time4-time3,"\n"
puts "check if we get the same result with operator <=>"
puts "ok" if (tmp1 <=> tmp2) == 0
--
Cedric Foll
courriel: cedric dot foll at laposte dot net