> After spending about 2 hours searching for ready-made
> solutions in Perl and realizing that the available stuff does
> not meet my needs, or that it perhaps does but I can't seem
> to figure out how... I decided to commit the horrible sin of
> making my own version of it. Just to understand how
What horrible sin? What fun is programming if you can't re-invent the
wheel sometimes? Or perhaps the only viable form of programming is doing
the undone and producing the first truly artificially intelligent
computer :) ?
> The reason I am bothering you with this code is that my
> input text is rather big: about 280MB of text.
>
> I run the following one-liner:
>
> find archive/tknz/ -type f -name "*.tknz" -exec cat {} \; | bin/ngram.pl
> which basically takes all the text files located in/under some
> directory tree and shoves them through a pipe into the
> standard input of my script...
>
Is it important for all the input files to be concatenated? If not, you
can just fetch the file names and then process one file at a time.
If it is, you can still process them one at a time and then sum the
results from all files into one (although then sequences at the end of
one file will not be joined to the start of the next).
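A minimal sketch of the one-file-at-a-time approach. The counting sub
here is a stand-in unigram counter, since I don't know what ngram.pl
actually tallies, and the glob pattern is a placeholder for your real
file list (swap in File::Find if you need to recurse like your `find`
invocation does):

```perl
use strict;
use warnings;

# Stand-in for the real n-gram logic: count whitespace-separated
# tokens from one handle into a shared hash.
sub count_tokens {
    my ($fh, $counts) = @_;
    while (my $line = <$fh>) {
        $counts->{$_}++ for split ' ', $line;
    }
}

# Instead of cat-ing everything into one stream, open each file in
# turn and let the counts accumulate in one hash:
my %counts;
for my $file (glob 'archive/tknz/*.tknz') {
    open my $fh, '<', $file or die "can't open $file: $!";
    count_tokens($fh, \%counts);
    close $fh;
}
```

This way only one file's worth of input is in flight at a time, and the
hash is the only thing that grows.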
> And I run it on a powerful server with 515920K of memory plus 408008K of swap space.
>
> Once the 'find' finishes reading the text and the Perl
> script is left to run alone, it comes to a point where the
> swap space is reduced to 0 and the available RAM is less
> than 10%. At that point a series of memory allocation errors
> (probably requests for a large chunk of memory which isn't
> available due to fragmentation) cause the script to be
> killed by the operating system. Another possibility is that
> the script was asking for more memory than it was allowed
> to.
>
> I wonder if you guys see any room for doing anything which
> can contribute to reducing the memory used by this script,
> hoping that it will enable it to live long enough to
> actually finish all the loops.
>
Oooh! A chance for some low-level programming in Perl, fun!
Inspired by the code of Acme::Bleach, I found a way to reduce the size
of your hash keys by packing them with single bits instead of space
characters.
The script below adds a set bit at the beginning of every character but
the first in a token, and an unset bit at the start of every token but
the first. This gives a great saving for one-character tokens, with
diminishing returns for longer tokens - it is actually worse for tokens
longer than 8 characters - but how many words are that long?
Other gotchas: it only applies to ASCII-derived encodings, and it is
probably very CPU intensive.
# best case - saves 3 bytes:
my @toks = ('I', 'a', 'n', 'a', 'm');
# avg. case - saves 2 bytes:
my @toks = ('I', 'am', 'not', 'afraid', 'man');
# worst case - costs 4 bytes:
my @toks = ('International', 'association', 'of_notoriously',
            'associated', 'meritocrats');
sub convtok {
    # 8 bits per character, joined by a '1' bit inside the token,
    # with a '0' bit appended to mark the end of the token.
    return (join '1',
            map { unpack 'b*', $_ }
            split //, shift) . '0';
}

my $key = join '', map { convtok($_) } @toks;
chop $key;  # drop the trailing token-end bit
# pad to a whole number of bytes before packing
$key .= '0' x (8 - length($key) % 8) if length($key) % 8;
$key = pack 'b*', $key;
# use this to see how much you saved
print length($key), "\n -- $key\n";
# Now let's get back something printable:
my $back = unpack 'b*', $key;

sub deconv {
    # Read one token starting at $curroffs: 8 bits per character,
    # then check the marker bit - '1' means more characters follow.
    my ($str, $curroffs) = @_;
    my $toke = '';
    do {
        $toke .= pack 'b*', substr($str, $curroffs, 8);
        $curroffs += 9;
    } while (substr($str, $curroffs - 1, 1));
    return ($toke, $curroffs);
}
my @new_words;
my $offs = 0;
# Stop when fewer than 8 bits remain - those are only padding bits,
# not a real character.
while ($offs + 8 <= length $back) {
    my $deconved;
    ($deconved, $offs) = deconv($back, $offs);
    push @new_words, $deconved;
}
# Use this to write to files:
my $record_name = join ' ', @new_words;
print "rec -- $record_name\n";
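To convince yourself that the pack and unpack halves really invert each
other, here is the whole thing folded into one self-contained round-trip
check (note the loop guard, which skips the padding bits at the end of
the packed key):

```perl
use strict;
use warnings;

sub convtok {
    return (join '1', map { unpack 'b*', $_ } split //, shift) . '0';
}

sub deconv {
    my ($str, $curroffs) = @_;
    my $toke = '';
    do {
        $toke .= pack 'b*', substr($str, $curroffs, 8);
        $curroffs += 9;
    } while (substr($str, $curroffs - 1, 1));
    return ($toke, $curroffs);
}

my @toks = ('I', 'am', 'not', 'afraid', 'man');

# pack the tokens into a bit-string key...
my $key = join '', map { convtok($_) } @toks;
chop $key;
$key .= '0' x (8 - length($key) % 8) if length($key) % 8;
$key = pack 'b*', $key;

# ...and unpack them again
my $back = unpack 'b*', $key;
my @new_words;
my $offs = 0;
while ($offs + 8 <= length $back) {   # ignore the trailing padding bits
    my $deconved;
    ($deconved, $offs) = deconv($back, $offs);
    push @new_words, $deconved;
}

print "@new_words\n";
die "round trip failed" unless "@new_words" eq "@toks";
```

For these five tokens the packed key is 17 bytes against 19 bytes for
the space-joined original, which is where the "avg. case" figure comes
from.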
> Please note that the run time is not the part which bothers
> me. Anyway -- most of the time consumed by this script is
> due to operating system thrashing while trying desperately to
> swap data between the disk and the RAM...
>
> I suppose it would be nice to speed it up -- but memory is
> my biggest concern at the time.
>
Yeah, count on the above script to cost you a lot of time.
>
> Thanks.
You're welcome :)
Yosef Meller.
BTW, re: the recent OSS debate - this script is _free_ of charge (and in
spirit), _open_ for view and use, and GPL'ed (copyright by me). As the
creator of this FOSS code I qualify as an open-source advocate, so I
guess I'm a Perl programmer && OSS advocate... Use Linux! Use Apache
(bless you)! Use Subversion! use strict;
--
-------------
To verify my electronic signature or send me encrypted mail, use my
public key at:
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=0x3D2CA0A8
If you don't know anything about that, try:
http://www.gnupg.org/gph/en/manual.html for a start on encryption and
signing.