is a prime,
currently set to the greatest prime less than 2^32.
Each quantized value is mapped to I integers in the range
0..I, where I is an integer less than I

,
currently 2^17, using a family of hash functions,
computed by the C function.
Each hashed value is used as the index in a large bit vector.
Bits corresponding to terms present in the document are set to
1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice,
by different terms. It follows that, if the document contains
I distinct terms, in the resulting bit vector at most
I bits are set to 1.
The resulting bit string is a very compact representation of the
presence/absence of terms in the document, and is therefore
characterised as a I. Moreover, it does not
depend on a pre-set dictionary of terms.
The signature may be used for:
=over 4
=item *
testing whether a given set of terms is present in the document,
=item *
computing which fraction of terms are common to two documents.
=back
The bit representation may be written to and read from a file.
C prepends a header to the bit stream proper;
moreover, whenever the package C is available,
the bit vector is compressed, so that disk space requirements
are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter;
the reference implementation uses a radix representation of
strings. Each term must therefore match the regular
expression C[0-9a-z]+/>.
There are quite a few viable alternatives, which can be pursued
by subclassing and redefining the method C.
=head1 FORESEEN REUSE
The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package). In the
latter case the methods which are foreseen to be redefined are
those ending with a C suffix. Redefining other methods
will require greater attention.
=head1 CLASS METHODS
=head2 new
The constructor. No arguments are required.
$b = Text::Bloom->new();
=head2 NewFromString
Take a string written by C (see below)
and create a new C with the same contents;
call C whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library in unavailable.
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way,
this is an alternative contructor.
The string should have been written by C;
you may of course tweak the string contents, but
at this point you're entirely on you own.
=head2 NewFromFile
Utility function that reads a binary file and performs a C
on its content; see its counterpart, C.
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
=head1 INSTANCE METHODS
=head2 Size
Set and get the size of the filter, in bits. The default size
is currently 128K.
print 'size is ' . $b->Size() . "\n";
$b->Size( 65536 );
The C method must be called before the C method
in order to have effect.
=head2 Compute
Compute the Bloom signature from the given set of words
and store it internally.
$b->Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the C method.
=head2 QuantizeV
Convert a term into an integer; must return
an integer in the range 0 .. C.
It is called as
my $hash = $b->QuantizeV( $term );
The current version is designed for strings matching
C[a-z0-9]+/>. Other characters do not cause errors,
but degrade the hash function performance.
This function is a likely candidate for redefinition.
=head2 HashV
Convert an integer to a (smaller) integer, according
to one of a class of similar functions.
It is internally called as:
my $index = $b->HashV( $order, $value );
The C must belong to the interval
0..C, while the index must
lie in 0..I. C is
a small integer from 0 to I.
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of I and I are taken from the array
C; the form of the function
is taken from [2].
=head2 WriteToString
Convert the Bloom signature into a string which can be saved and
later restored with C. C must have
been called previously.
my $str = $b->WriteToString();
The string begins with a header which encodes the
originating package, its version, the parameters
of the current instance.
Whenever possible, C is used in order to
compress the bit vector in the most efficient way.
On systems without C, the bit string is
saved uncompressed.
=head2 WriteToFile
These convenience functions just call their String counterparts
and read/write the file specified in the argument.
$b->WriteToFile( 'foo.sig' );
=head1 AUTHORS
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
=head1 BIBLIOGRAPHY
=over 4
=item [1]
Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors",
I, B<13>, 7, July 1970, pages 422-426. (available
electronically from ACM Digital Library).
=item [2]
M. V. Ramakrishna, "Practical Performance of Bloom FIlters
and Parallel Free-Text Searching",
I, B<32>, 10, October 1989, pages 1237-1239.
(available electronically from ACM Digital Library).
=back
=head1 BUGS
On Win32 we have experienced some instabilities when dealing
with a large number of signatures; in this case Perl crashes
without apparent explanation. The main suspect is Bit::Vector,
but without any evidence.
=head1 HISTORY
2001-11-02 - initial revision