Take a moment, if you dare, to look over the perldoc for vec. Like most perl hackers, you'll probably see some Perl 4-isms and some particularly unhelpful code on that page, and think: gee, this is right up there with Larry using ' instead of ::. I don't need that garbage. Please, read the above again, and see how simple vec is to use.

Recently, I've been working with obscenely large arrays. One of the problems I've bumped into: one of our user-developers occasionally has an array so large that two copies of it will not fit into memory. Additionally, any time you iterate over such an array, passing references to it and accessing elements individually is time-consuming.

Enter the vector.

It is helpful to think of a vector as an array joined on the empty string (''). Essentially, it is as if you had done something like:
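A minimal sketch of that idea, with a made-up @array:

```perl
use strict;
use warnings;

# A "vector" here is just the array's elements laid end to end
# in one string (example data made up for illustration).
my @array  = ('a' .. 'f');
my $vector = join '', @array;

print "$vector\n";   # abcdef
```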

Wouldn't you rather use substr() (note: substr is fast!) to refer to individual bytes of that "array" than pass references around and dereference them, or point at individual elements of it? String functions can be a lot faster than list operations. In fact, this is so important that many computers now ship with "vector units" (Crays and Macs among them). vec() is essentially a pretty wrapper around those C vector operations, with a little-endian assumption. (If you've compiled your perl with, say, -faltivec, you'll have access to those units from within perl.)
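For instance, under the same joined-string picture, element access becomes a byte offset. A sketch, assuming 1-byte elements (the data is made up):

```perl
use strict;
use warnings;

my @array  = ('a' .. 'f');
my $vector = join '', @array;

# With fixed-width (here 1-byte) elements, substr at the right
# offset gives the same answer as indexing the array.
my $i = 2;
my $from_array  = $array[$i];               # 'c'
my $from_vector = substr $vector, $i, 1;    # 'c'

die "mismatch" unless $from_array eq $from_vector;
```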

Vectors also allow you to perform bitwise operations on entire vectors at once, rather than walking the array (see vec). That is substantially faster.
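For example, a single & over two bit strings replaces a loop over array elements. A hedged sketch (not the article's own code; the bit patterns are made up):

```perl
use strict;
use warnings;

# Build two one-byte bit vectors with vec(), then AND them with a
# single string operation instead of walking element by element.
my ($x, $y) = ('', '');
vec($x, $_, 1) = 1 for 0, 2, 4, 6;   # bits 0, 2, 4, 6 set
vec($y, $_, 1) = 1 for 0, 1, 2, 3;   # bits 0 .. 3 set

my $and = $x & $y;                   # only bits 0 and 2 survive
print vec($and, 2, 1) ? "bit 2 set\n" : "bit 2 clear\n";
```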

If you don't think this is a substantial improvement, think about asking perl to run my @foo = @{ $bar }, and how much CPU that takes on a 30-million-element array.

Providing benchmarks is kind of silly. I have a bunch of machines here with vector processors; my results will be different from yours. I would hope, though, that by looking at the above you can see that it might be worthwhile to actually try using a vector. In many cases, it is more efficient, both CPU- and memory-wise, than an array. And not much harder to use.

My own gains with vectors have been between 5 and 20 percent in memory savings, and 30 to 500 percent increases in speed. My goal is not to convince people that vectors are better -- they are sometimes, and not others -- but rather to convince people to try them. I rarely find a perl hacker who knows what a vector is, or why it might be much better than an array for their application. vec is a perl builtin. Don't be afraid.

First of all, very interesting text. I have never looked very closely at Perl's vec() function, but it can be very important for a big amount of data.

Well, since vec is a CORE function that works directly on the bits of a string's content, it will be much faster. But don't forget that a vector is fixed: each element has the same size, and it is just an array of chars, while a Perl ARRAY is an array of SCALARS, which are much more complex.

There's nothing in principle forcing a vector to have elements all of the same size; you just have to be a bit careful with the indexing if they are not.

What I do occasionally find frustrating is the lack of certain useful primitives when dealing with vectors, particularly vectors of booleans, since the resulting code can often become somewhat ugly and non-intuitive. Consider for example testing whether all(a_i & b_i) == 0:
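One way that test often comes out in perl5, sketched here under the assumption that the vectors are plain bit strings (the sub name and test data are made up):

```perl
use strict;
use warnings;

# AND the two strings in one operation, then check that no
# non-NUL byte survives -- i.e. every (a_i & b_i) is zero.
sub all_and_zero {
    my ($u, $v) = @_;
    return ( ($u & $v) !~ /[^\0]/ );
}

my ($p, $q, $r) = ('', '', '');
vec($p, 0, 8) = 0b1010;
vec($q, 0, 8) = 0b0101;   # shares no bits with $p
vec($r, 0, 8) = 0b0010;   # shares bit 1 with $p

print all_and_zero($p, $q) ? "disjoint\n" : "overlap\n";   # disjoint
```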

I look forward to the promise of perl6 - where you could have a vector class, for example, "just like a string" except that it has a different concept of which strings are true. I think such a class would be difficult to achieve cleanly in perl5 without going down the path of tying/overloading, and losing most of the efficiency gains that caused you to turn to vectors in the first place.

Notice that PP_Elias achieves identical compression to Elias in C (as you'd expect :), but that it's twice as slow at packing, and nearly 10 times slower when unpacking.

If you want to continue with using Elias, and can't build the C version yourself, I could let you have it pre-built (for win32), but I have to wonder why you would when W-BER is faster and achieves better compression?

Also, I wonder if you saw Re^8: Byte allign compression in Perl.., where I demonstrated that you can have the DB do the selection for you, using the schema I suggested way back when, in 0.312 of a second? For the record, I found a small optimisation to the schema that reduces that by a factor of 10, to 31 milliseconds.

So the DB does the selection, sends you just the data you need to do your proximity calculations, and does it all faster than you could pack a single integer. Interested?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

No my friend, I didn't miss the Tachyon comment about the vec function, but deprecated's code helped me to understand what is going on with this function.

For the SQL command that you propose, I want to ask which server it is appropriate for, because MySQL has no command for the intersection. (I tried some inner joins, but the performance was very, very slow for a 1GB dataset: 250000 pages, average document length 602 words, 907806 unique words.)

About your code for the Elias technique, I have to say that it is 3 times faster than mine (thanks one more time!!!). The only reason I wanted to use compression in my index was performance. That was my thinking until now, but it seems I wasn't right :(, since with the vec function I can decode 3000000 doc ids in 2 seconds, and 10 million in 6 seconds, as the below code shows.

In the above code I used 4 bytes for each doc id. I tried a vector where I saved the same number of doc ids but with only 1 byte per doc id (I saved only small numbers), and the time was exactly the same. I can't understand why. Does anyone?
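For what it's worth, here is a small sketch (with made-up ids) showing that the element width passed to vec changes the storage size fourfold, while each read or write is still one vec call doing similar work -- which may be why the decode times look the same:

```perl
use strict;
use warnings;

# Made-up doc ids, small enough to fit in one byte.
my @ids = (3, 17, 250, 9);

my ($wide, $narrow) = ('', '');
vec($wide,   $_, 32) = $ids[$_] for 0 .. $#ids;   # 4 bytes per id
vec($narrow, $_,  8) = $ids[$_] for 0 .. $#ids;   # 1 byte per id

printf "wide: %d bytes, narrow: %d bytes\n",
    length $wide, length $narrow;   # wide: 16 bytes, narrow: 4 bytes
```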