On Wed, 28 Jan 2004, Jakob Oestergaard wrote:
> I have *tons* of work to do before saturday, where I board a plane for
> france, to spend next eight days skiing in the alps. I don't think I
> will be able to give you a proper followup until after the vacation.
> Say, in about one and a half week from now.
> However, once back, I'll be fully energized and ready to take on the
> world - my asbestos underwear will be ready, my tinfoil hat polished up,
> and by then I think my lithium prescription should have run out again.
>
> All in all, I'll be back! :)
Awwwww, that hash-table example has tons of unnecessary overhead. Are
you gonna make me actually WRITE a C variant to compete? Maybe I should
start by posting a related program I wrote just for fun. I like to play
hangman, and strategically a knowledge of letter frequency in the
English language seemed like a useful thing to have. After all,
everybody knows that 'e' is the most common letter, but what about the
second, third, twenty-third most common letter? So one day, bored, I
wrote letcount, which takes a filename as an argument, opens the file,
reads it, and counts the letters. It was the first program I ever wrote
that compiled and worked perfectly the first try! Not a lot of lines,
but still...:-)
Note that if all one wishes is to count the words, this program could be
trivially modified with a snippet to determine word boundaries and
increment a counter. If one wishes to count the words and save them
(non-uniqued) in order, an additional snippet that moves each
non-word-break character into a word buffer, together with a realloc and
a malloc (realloc the **char word array, malloc the space for the word)
would yield a scalable and storage-minimal solution. If one REALLY
wants performance (and doesn't care about memory usage, sensible on a
large-memory modern system) one could realloc in largish blocks or
simply malloc "more than enough" space (storing a million words less
than or equal to
rgb at lilith|T:144>echo supercalifragilisticexpialadotious | wc
1 1 35
in length requires a whopping 36 MB of memory). A hard allocation of
char words[1000000][36];
would yield the fastest possible access with no indirected pointer
address resolution at all at the expense of being horribly wasteful in
space that is probably idle anyway. Note that this wouldn't be
efficient for smaller files as it probably takes the kernel some time to
free up a block of memory that size; page size requests would presumably
be much faster up to a point.
Again, in addition to being perhaps the fastest possible code (note that
scanning the entire stream of characters inline in this way probably
would beat using strtok(), although not by much since this is likely
pretty much what strtok does:-) the code (included below) is READABLE!
At least I think so. I didn't even comment it the way I usually do --
open the file, stream it in, count the letters, trivially and
inefficiently sort the count vector, print the sorted result in place.
As the result of running the code on the linux words dictionary shows,
it is pretty fast -- it literally returns a count as my finger is
lifting from depressing the enter key, and the bulk of the DELAY is
writing to stdout at 9600 baud or whatever an xterm supports, not the
0.03 seconds it actually takes to run.
So there are LOTS of ways to solve your challenge problem in C; static
memory allocation (wasteful but fast, especially for big files/streams),
dynamic memory allocation (probably optimal, especially if one allocates
roughly a page at a time on demand as the previous page is filled and
maintains one's own table of word boundaries as pointer offsets in an
otherwise linear buffer), dynamic allocation the simpler and probably
more readable/portable way (a word at a time in a char **words list),
dynamic allocation of the words as "simple objects" (build a linked list
struct of some sort to hold words and perhaps indices for use in a later
sort or uniqueing step), and of course the hash. In each case the
performance tradeoffs for using one memory management approach or the
other are fairly transparent, and in each case one can choose between
using a tool like e.g. strtok or doing it yourself. Of course, some of
these have significant risks associated with them in the hands of a
novice programmer or result in code that isn't horribly portable --
writing the incoming stream as rapidly as possible into a
resize-on-demand buffer and recording word offsets in a linear integer
(pointer) vector requires visualizing the way the memory space is laid
out and is very close to the way one would likely do this in assembler,
but accessing a word as (char *) buffer + offset[13] isn't everybody's
cup of tea (including mine, unless the code HAD to be really fast).
So sure, give it your best shot in C++ using C++ constructs, but I
guarantee you won't beat the best one can do in C and probably won't
come terribly close unless you write the C++ code, after all, in C.
I'll see if I can spare an hour or so this afternoon to hack letcount
into wordcount using one or another of the buffer storage mechanisms I
mentioned above -- the buffer[offset[]] actually seems kinda cute, come
to think of it, but I may use strtok() instead of my code as a base to
make the code appear shorter. Especially since it RETURNS the offset in
buffer -- I think my entire program would be reading the file into a big
buffer and a single while loop where strtok extracts the offsets, puts
them into a vector, and counts.
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu

rgb at lilith|T:141>/usr/bin/time ./letcount /usr/share/dict/linux.words
Opening /usr/share/dict/linux.words for counting
e: 42715
i: 31459
s: 29652
a: 28984
r: 27064
n: 26981
t: 24614
o: 21613
l: 19484
c: 15018
d: 13857
u: 11723
g: 10342
p: 10073
m: 9807
h: 7812
b: 7371
y: 6006
f: 4932
v: 3973
k: 3209
w: 3073
z: 1631
x: 1054
j: 727
q: 682
Command exited with non-zero status 7
0.03user 0.00system 0:00.40elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (93major+14minor)pagefaults 0swaps
%< snip snip =======================================================
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char filename[128], buf[256];
    int i, j, biggest, bigind, count[26];
    FILE *infile;

    if (argc != 2) {
        fprintf(stderr, "Usage: letcount filename\n");
        exit(1);
    }
    argv++;
    snprintf(filename, sizeof(filename), "%s", *argv);
    printf("Opening %s for counting\n", filename);
    infile = fopen(filename, "r");
    if (infile == NULL) {
        fprintf(stderr, "Cannot open %s\n", filename);
        exit(1);
    }
    for (i = 0; i < 26; i++) count[i] = 0;
    while (fgets(buf, 256, infile) != NULL) {
        i = 0;
        while (buf[i] != 0) {
            /* count letters; 'A' is 65, 'Z' is 90, 'a' is 97, 'z' is 122 */
            if (buf[i] > 64 && buf[i] < 91) count[buf[i] - 65]++;
            if (buf[i] > 96 && buf[i] < 123) count[buf[i] - 97]++;
            i++;
        }
    }
    fclose(infile);
    /* trivially and inefficiently "sort" the count vector in place
       by repeatedly extracting and zeroing the biggest count */
    for (i = 0; i < 26; i++) {
        biggest = 0;
        bigind = 0;
        for (j = 0; j < 26; j++) {
            if (count[j] > biggest) {
                biggest = count[j];
                bigind = j;
            }
        }
        printf("%c: %d\n", bigind + 97, biggest);
        count[bigind] = 0;
    }
    return 0;
}
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf