Re: Prefix-only search?

> OK, it sounds like in your application it's valuable for some reason you
didn't state
there are lot of application where you can't use null terminated string and
ordering of the key is necessary:
- Intervall searching (ex. IP-range search)
- Longest prefix matching (ex. IP6)
- multi-field indexing (ex.combination of integer+string)
- prefix matching
- Longest prefix matching in binary files
- Match finder in data compression
1 - About variable length associated keys:
- Maybe it is hard to implement this with the current Judy arrays concept.
I think the Judy arrays functions (ordered, not JudyHS) can be extended to fixed length keys where you
can pass the fixed length as parameter.
Ex.
#define JLEN 18
char memptr[JLEN];
JudyMIns(judy, JLEN, memptr)
JudyMGet(judy, JLEN, memptr)
actually, you have only JudyL functions with the length = 4 (and 8 in the 64bits version).
2 - About optimizing Judy?Next
i have already proposed the cursor version. A simpler extension (without cursor) is
to return all (up to 256) leaf keys in 2 parameter arrays.
Ex. JudySLANext(judy, s, strArray[], pvaluesArray[]) witch means search for the next strings after s.
The number of the returned keys in strArray[] and PvaluesArray[] is variable (all keys found in the leaf nodes ).
After the call to JudySLANext, you can do:
for(i=0; pvaluesArray[i]; i++) printf("%d,%s\n", pvaluesArray[i], strArray[i] );
________________________________
Von: Alan Silverstein <ajs@...>
An: schaeferhaus@...
CC: judy-devel@...
Gesendet: 2:11 Mittwoch, 21.November 2012
Betreff: Re: Prefix-only search?
> yes, lexicograhically sorting is needed.
OK, it sounds like in your application it's valuable for some reason you
didn't state.
> This is necessary if for multi-field keys in databases (the null
> character can occurs within the key).
OK, but don't confuse length-associated (so the variable-length key can
contain any byte value) with lexigraphical sorting. The latter is only
important if for some reason you want to use First/Next (or converse) to
visit the keys in lexicographic order, rather than some other order.
> The sorted by length like in JudyHS function cannot be used.
OK.
> The big advantage of the judy arrays is the lexicograhically ordering.
For some people it's a complete don't-care, for others libJudy isn't
good enough because it's not "real" sorting (multibyte, various keys,
etc, see sort(1)), but for some applications a simple sorted order
(other than by length) does have some value.
> Otherwise you can simply use hashing, witch is faster.
Not always! Hashing is hard to beat for simple/fast hash algorithms
with good spectral properties (short synonym chains) for the particular
sparse data being saved. The problem with hashing is it's "fragile" in
real-world applications where the programmer doesn't understand it very
well, tune it up well to match the data, or (most often) keep it
maintained well as the dataset grows. I've personally sped up old
(buried and forgotten) hashing code by 10x just by retuning a few
parameters.
>> Storing variable-length, length-associated keys sorted
>> lexicographically is actually a fairly hard problem.
> I don't know the internals of the judy arrays but ithink, you can
> store the common prefixes as actually done and store the suffixes as
> variable length array (length as varbyte + the byte array).
Well I wish I could point you to the old paper where I put a lot of
thought into this. Consider that any possible substring/superstring, of
any length, is possible. While "decoding" the next 4/8 bytes of the key
string (including any binary byte values in the general case), you must
be able to efficiently distinguish (in the 4-byte case) all of these
possibilities:
- string has only 1 more byte (256 possible values)
- string has 2 more bytes (256^2 possible values)
- string has 3 more bytes (256^3 possible values)
- string has 4 more bytes => each of leads to 1 or more substring values
You can't do this with the present JudySL code... You must dig into the
internals to have some new kind of tree node that tells you how many and
which strings are present that terminate at this level, along with
all superstrings that go deeper. Say you store:
abcde
abcdef
abcdefxyz
After decoding "abcd" through the first JudyL array, you can't just
point to another JudyL array for the rest of the string(s). Something
has to sort by remaining length (1,2,3,4+) first and lead you off to the
children trees for each length. And you gotta do this while maintaining
lexicographical order too, can't have "abcdf" (shorter key) appear
before "abcdef" (longer key) while walking the list.
Simply listing N (large) substrings in lexicographical order is very
inefficient (in both space and CPU time) if there are more of them than
fit in 1-2 cache lines! And common substrings are problematic too.
I did dream up the most efficient (space and timewise) data structure I
could think of to solve this problem, but I've forgotten it, and it's
buried in the "variable size keys" paper I wrote over 10 years ago.
Cheers,
Alan
------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov
_______________________________________________
Judy-devel mailing list
Judy-devel@...
https://lists.sourceforge.net/lists/listinfo/judy-devel