Pedro wrote at 04/18/2012 02:54 PM:
> So to put it in a simple way, I need to tokenize all my data and
> create an index which I load into memory...?
>
That's a simple way that might do everything you want.
If you do this, and then find you want it to work better, then I suggest
hitting an IR textbook.
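The "tokenize and index" idea above is just an inverted index: map each term to the set of documents containing it, then intersect the sets at query time. The original thread doesn't give code, so this is a minimal Python sketch under that assumption (the `tokenize` rule and document representation are illustrative choices, not anything from the thread):

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Build an inverted index: term -> set of ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Held entirely in a hash table like this, lookups are fast; the refinements an IR textbook adds (ranking, compression, phrase queries) layer on top of the same structure.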
Regarding whether keeping everything in memory will work: You can do the
arithmetic on how much memory you'll need, once you know how many terms
and documents you need to support. Then see whether you'll have enough
free RAM for twice that number; if you exhaust RAM and the OS starts
paging your GC-managed heap to disk at random, you're going to have a
bad time.
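That arithmetic is just multiplying out the counts and per-entry costs. A back-of-envelope sketch, where every figure is an assumed example (your term counts and per-entry overheads will differ, and GC/boxing overhead varies by runtime):

```python
# Back-of-envelope memory estimate for an in-memory inverted index.
# All figures are illustrative assumptions, not measurements.
num_terms = 100_000          # distinct terms in the vocabulary
avg_postings_per_term = 50   # average number of docs listed per term
bytes_per_posting = 16       # doc id plus pointer/boxing overhead
bytes_per_term = 80          # hash-table entry plus the term string

postings_bytes = num_terms * avg_postings_per_term * bytes_per_posting
term_bytes = num_terms * bytes_per_term
total = postings_bytes + term_bytes

print(f"estimated index size: ~{total / 2**20:.0f} MiB")
print(f"RAM budget (twice that): ~{2 * total / 2**20:.0f} MiB")
```

If the doubled figure fits comfortably in free RAM, the all-in-memory approach is fine; if not, that's the point to consider an on-disk index.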
> Is this how it is usually done? For example, does my browser (firefox)
> keep an index of all the words present in urls and page titles on
> memory at any given time?
>
I would guess so, though that might be indirectly, such as through an
SQLite cache.
Neil V.
--
http://www.neilvandyke.org/