On Fri, 30 Jan 2004, Joe Landman wrote:
> On Fri, 2004-01-30 at 11:37, Robert G. Brown wrote:
>> > Note that dwc finds 67505 "distinct words". This is very interesting,
>> This is what I find in the Perl version as well. I wish I could get
> that Java code compiling on this platform so I could compare the two.
The awk script I posted earlier counts tokens separated by [:blank:] and I
guess that is what the C++ version also does -- tokens not words as rgb
carfully pointed out. I was trying to mimic the C++ code, so case and
punctuations were not ignored. Anyway, there is no easy answer to what a
word is -- e.g, simplistic removal of punctuations makes 3 "words" out of
100,000,000. But all that is beside the point and the point itself has
nothing to do with beowulf.
It is still tempting to to change the awk script to match RGB's C
version ;-) My first attempt with 3 extra lines of awk gives:
[selva at scala distinct_words]$ /usr/bin/time ./dwc.awk < shaks12.txt
Total number of words = 902299
Number of distinct words = 31384
2.04user 0.03system 0:02.06elapsed 100%CPU
Only 10% slower than the earlier version that was not
punctuation/capitalisation-agnostic.
The total number of words is still off by 4 and unique words are about
6000 smaller than the C version (see footnote). It may not be hard to hunt
down the difference, but that does not appear worthwhile. I think the
point has been made -- use your favourite scripting language for quick
answers or even for testing algorithms. "For everything else there is C"
-- be it for extensibility or performance or all the other reasons
eloguantly expressed by rgb. I can not see any reason to learn or use C++
simply for sets or hashes or trees in STL.
By the way, the performance of C++ STL appears to vary by up to a factor
of 10 between different versions of g++. The number I quoted (2.8 s) was
the fastest I could get and that was with g++ 2.96. For some reason
version 3.0.x runs much slower here..
Selva
Footnotes:
1. Googling shows that Shakespeare used 31,534 different
words and a total of 884,647 words in his complete works
(http://www-math.cudenver.edu/~wbriggs/qr/shakespeare.html)
And some statistical projections indicate he probably
knew that many more words but did not use. In any case, the awk
result is surprisingly close to 31,534 inspite of the
much smaller sample we used -- there is some over counting
due to numeric tokens and other "junk" in the Gutenberg text,
but I am still surprised by the close agreement. Also I wonder
why the C version comes up with as high as 37,000+ unique words.
2. FWIW, here is the new script:
#!/bin/awk -f
# Count number of distinct or unique words in a file
# Usage: dwc.awk inputfile
{
$0 = tolower($0); # be case insensitive
gsub(/[\r\n\t,.!?;"*<>@\[\](){}+=/\\\#:]/, " "); # remove punctuations etc..
# the above list is chosen to match the C version -- ', ` etc missing
$0 = $0; # hack to re-split $0
for(i = 1; i <= NF; ++i) {
# printf "line:%i word:%i %s\n", NR, i, $i; # DEBUG
++nwords; # increment total number of words
if (words[$i]) continue; # word is not new - skip
words[$i] = 1;
++uwords; # increment number of unique words
}
}
END {
printf("Total number of words = %i\n", nwords);
printf("Number of distinct or unique words = %i\n", uwords);
}
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf