Hi James,
> This "Divide and Conquer" approach is very effective when only a few
> words are banned, but I believe there are better approaches for many
> banned words...Anyone try anything closer to the suggested
> 10% blocked from the quiz?
I agree that if we know that 10% of the words are banned, we can do
better than this. If we have 1000 words with 10% banned, there's little
point in calling clean? on 1000, then the first 500, the last 500,
250, 250, 250, 250, 125, ... since there's a high probability that clean?
will return false on almost all of these. Since 10% is 1 in 10, it makes
sense to skip these larger chunks and start by calling clean? on about
10 words. At that size, true and false are both reasonably likely answers,
so each call tells us far more than a call whose result we could have
predicted.
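As a rough check on the chunk size, here is a small sketch that assumes each word is banned independently with probability 0.1. That is a simplification (the quiz bans a fixed number of words), and the helper name p_clean is made up for illustration:

```ruby
# Probability that a chunk of n words contains no banned word,
# assuming each word is banned independently with probability p.
def p_clean(n, p = 0.1)
  (1.0 - p)**n
end

[5, 7, 10, 20].each do |n|
  printf("chunk of %2d: P(clean) = %.3f\n", n, p_clean(n))
end
```

Under this simplified model a chunk of 10 comes back clean roughly 35% of the time, and a chunk of 7 roughly 48%, so chunk sizes a little below 10 may also be worth trying.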
I've changed my original algorithm to start with chunks of 10, and then proceed
as before: if clean? on a chunk of 10 returns false, subdivide into 5 and 5,
etc. (The one changed routine is copied below.) This results in a moderate
improvement, about 14% fewer mails (596 down to 514):

Original algorithm:
  Runs: 10
  Words: 1000
  Banned Words: 100
  Average
    Seconds: 51.9328
    Mails: 596
  Verified: 10/10
  Rebuilt Banned Words

Start with chunks of 10 words:
  Runs: 10
  Words: 1000
  Banned Words: 100
  Average
    Seconds: 50.29
    Mails: 514
  Verified: 10/10
  Rebuilt Banned Words

Of course, this requires knowing in advance the percentage of banned words.
If we didn't know this in advance we could try an adaptive algorithm that
figures out the percentage as it's running. Perhaps first make one pass
dividing by two each time: 1000, 500, 250, 125, 63, 32, 16, 8, 4, 2, 1.
(Starting the chunk at 0 each time: [0...1000], [0...500], etc.)
If there are really 10% banned words, then maybe clean? of 16 would be false,
but clean? of 8 would be true. From this, we _guess_ that the fraction of
banned words is between 1/16 and 1/8. We average these to get 0.09375, or 1
in every 10.6666. Rounding this to 1 in 11, we pick an initial chunk size
of 11, and do as in my algorithm above, but with chunks of 11. But we also
keep track of the fraction of banned words we've seen so far, and if it
moves outside the range of 1 in 10.5 to 1 in 11.5, we adjust the chunk size
down or up, accordingly.
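The initial guess could be sketched roughly as follows. This is untested against the quiz harness: FakeFilter is a hypothetical stand-in for the real filter, and guess_chunk_size is a name invented for this sketch.

```ruby
# Hypothetical stand-in for the quiz's filter: clean?(words) is true
# when none of the given words is banned.
class FakeFilter
  def initialize(banned)
    @banned = banned
  end

  def clean?(words)
    (words & @banned).empty?
  end
end

# Halve the prefix words[0...size] until clean? returns true, then
# average 1/smallest_dirty and 1/size to guess the banned fraction.
# Returns the suggested chunk size (about 1 / fraction).
def guess_chunk_size(words, filter)
  size = words.size
  smallest_dirty = nil
  until filter.clean?(words[0...size])
    smallest_dirty = size
    return 1 if size == 1            # even the first word alone is banned
    size = (size / 2.0).ceil         # 1000, 500, 250, 125, 63, 32, 16, ...
  end
  return words.size if smallest_dirty.nil?   # the whole list is clean
  fraction = (1.0 / smallest_dirty + 1.0 / size) / 2
  (1 / fraction).round
end

words  = (0...1000).map { |i| "w#{i}" }
filter = FakeFilter.new(["w12", "w500"])   # first banned word at index 12
puts guess_chunk_size(words, filter)       # prefix of 16 dirty, 8 clean => 11
```

With only one probe pass the guess is crude (here the list really has 2 banned words in 1000), which is why the running estimate matters.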
So, we quickly make a guess at the percentage, then as the algorithm runs we
improve our guess, and pick the corresponding chunk size.
Adaptive algorithms like this sometimes need damping, so that the chunk size
doesn't change wildly if we happen to hit a run of banned words, followed by
a run of non-banned words. But I'm not sure how much damping is needed in
this particular case, as the continually increasing denominator in our
percentage calculations will provide some damping.
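A minimal sketch of the running estimate, assuming we update it after each chunk has been fully resolved. The class name ChunkSizer and its record method are invented for illustration and are not part of my posted solution:

```ruby
# Tracks how many words have been resolved and how many were banned,
# and keeps a chunk size of about 1 / (banned fraction).
class ChunkSizer
  attr_reader :chunk_size

  def initialize(initial_chunk_size)
    @chunk_size = initial_chunk_size
    @seen = 0
    @banned = 0
  end

  # Record the outcome of one resolved chunk.
  def record(words_seen, banned_found)
    @seen += words_seen
    @banned += banned_found
    return if @banned.zero?              # no estimate possible yet

    one_in = @seen.to_f / @banned        # "1 in one_in" words is banned
    # Only move when the estimate leaves chunk_size +/- 0.5.
    if one_in < @chunk_size - 0.5 || one_in > @chunk_size + 0.5
      @chunk_size = [one_in.round, 1].max
    end
  end
end

sizer = ChunkSizer.new(11)
sizer.record(11, 1)      # 1 in 11.0, inside 10.5..11.5: unchanged
puts sizer.chunk_size
sizer.record(11, 2)      # 3 banned in 22 words, 1 in 7.33: shrink
puts sizer.chunk_size
```

Early on a single unlucky chunk moves the size a lot, as the second call shows; once @seen is in the hundreds, one chunk barely shifts the ratio, which is the built-in damping mentioned above.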
Wayne
============================================
def run()
  if @words.empty?
    []
  else
    biggestChunkSize = 10   # Use this for 10%
    aBanned = []
    iStart = 0
    iEnd = iStart + biggestChunkSize
    # Check each full chunk of biggestChunkSize words
    while iEnd <= @words.size
      aBanned += findBanned(@words[iStart...iEnd])
      iStart = iEnd
      iEnd += biggestChunkSize
    end
    # Check any leftover words in one final, shorter chunk
    if iStart < @words.size
      aBanned += findBanned(@words[iStart..-1])
    end
    aBanned
  end
end
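
For anyone who wants to experiment without the quiz harness, here is a self-contained sketch of the same idea. CountingFilter is a hypothetical stand-in for the mail filter, and find_banned is a plain halving routine written for this sketch, so it may differ from the findBanned used in the measured runs above.

```ruby
# Hypothetical stand-in for the mail filter; counts calls to clean?.
class CountingFilter
  attr_reader :calls

  def initialize(banned)
    @banned = banned
    @calls = 0
  end

  def clean?(words)
    @calls += 1
    (words & @banned).empty?
  end
end

# Return the banned words in `words` by repeatedly halving dirty chunks.
def find_banned(words, filter)
  return [] if words.empty? || filter.clean?(words)
  return words if words.size == 1      # a dirty single word is banned

  mid = words.size / 2
  find_banned(words[0...mid], filter) + find_banned(words[mid..-1], filter)
end

# Scan the list in fixed-size chunks, like run() above.
def run_chunked(words, filter, chunk_size = 10)
  words.each_slice(chunk_size).flat_map { |chunk| find_banned(chunk, filter) }
end

words  = (0...1000).map { |i| "word#{i}" }
banned = words.sample(100, random: Random.new(1))   # fixed seed, 10% banned
filter = CountingFilter.new(banned)
found  = run_chunked(words, filter)
puts "found #{found.size} banned words in #{filter.calls} calls to clean?"
```

The call count from this sketch is not directly comparable to the Mails figures above, since the harness and the subdivision details may differ.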