Announcement (2017-05-07): www.ruby-forum.com is now read-only since I
unfortunately do not have the time to support and maintain the forum any
more. Please see rubyonrails.org/community and ruby-lang.org/en/community
for other Rails- and Ruby-related community platforms.

Hi,
Ferret is not only faster as the data gets larger (I have benchmarked
this a few times), it is also more accurate because of its query
analyser (you can use Google-like search queries). There are two
options: you can store everything in Ferret (and no longer need a
database), or store only the index (the fields you need to index) and
retrieve the other values from MySQL.
At the moment I am trying to write a better plugin for Ferret so that
you can specify what needs to be indexed and use find (instead of a
special method) with additional options. It would also automatically
query the database for additional fields.
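A minimal sketch of the second option (index only, rows fetched from the database), in plain Ruby. It does not use Ferret's API: the Hash named ARTICLES stands in for a MySQL table, and tokenize, build_index and search are hypothetical helpers made up for illustration.

```ruby
# Toy stand-ins: a Hash plays the role of a MySQL table, and a Hash of
# term => ids plays the role of the Ferret index. Neither is Ferret's API.
ARTICLES = {
  1 => { title: "Ferret on Rails", body: "Full text search with Ferret" },
  2 => { title: "MySQL fulltext",  body: "Built-in fulltext search in MySQL" },
  3 => { title: "Gardening",       body: "How to grow tomatoes" }
}

# Split text into lowercase word terms.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Build an inverted index that maps each term to the ids containing it.
def build_index(rows)
  index = Hash.new { |hash, term| hash[term] = [] }
  rows.each do |id, row|
    tokenize("#{row[:title]} #{row[:body]}").uniq.each { |term| index[term] << id }
  end
  index
end

# The index only answers "which ids match every query term?";
# the remaining columns come from the database stand-in.
def search(index, rows, query)
  ids = tokenize(query).map { |term| index.fetch(term, []) }.reduce(:&) || []
  ids.map { |id| rows.fetch(id).merge(id: id) }
end

INDEX = build_index(ARTICLES)
results = search(INDEX, ARTICLES, "fulltext search")
puts results.map { |row| row[:title] }.inspect
```

Storing everything in the index instead would mean keeping the title and body next to the terms, so that the final lookup in ARTICLES disappears.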

On 12/13/05, Abdur-Rahman A. <removed_email_address@domain.invalid> wrote:
> Hi,
>
> Ferret is not only faster (as I have benchmarked a few times) as data
> gets larger but it's also more accurate because of its query analyser
> (you can use Google-like search queries).
This is great to know. I'm surprised. Ferret is going to be much, much
faster soon. I'm rewriting it all in C.
>
> At the moment I am trying to write a better plugin for Ferret so that
> you can specify what needs to be indexed and use find (instead of a
> special method) with additional options. It would also automatically
> query the database for additional fields.
Please keep us updated as to how this is going. I'd like to add more
stuff like this to the Ferret Wiki. You might like to look at this
page if you haven't already;
http://ferret.davebalmain.com/trac/wiki/FerretOnRails
Far from a perfect solution so please feel free to add to it. :-)
Cheers,
Dave

Hi Onur,
I can't offer any input on speed comparisons between Ferret and MySQL
fulltext search. I will say this though. If the results that MySQL
fulltext search returns are good enough then use it. But if you care
about the relevancy of your results and you want to be able to run
advanced queries like boolean queries or phrase queries, you'll want
to go with Ferret, and it should be fast enough.
As for having to query the database, that will depend on how you want to
use Ferret. You can store the data in the Ferret index if you like, in
which case you won't have to query the database. I think it's better
just to keep the data in one spot though.
HTH,
Dave

Hi David,
I think you should be careful about using Ferret as a replacement for a
database until it is really mature. (Indexes can be recreated at any
time from the original data.) MySQL has proven itself as a mature
database solution and has many tools for maintenance and management.
Ferret, in my opinion, can't replace that (I don't even think Lucene
can). It lacks certain management tools that a database needs. On the
other hand, current databases lack advanced query parsers (and that's
good, because they would only make the database more complex). I know
of Lucene being linked to existing databases with very good results.
That should be possible with Ferret too, shouldn't it?

Agreed. I meant it's probably not worth storing the data in Ferret.
Just use it for the indexing and keep your data in the database.
((On a side note, it is possible for some applications to do away with
the database and use Ferret as the only data store. I think that's how
Erik H.'s blog software Blogscene works.))

On 12/13/05, Abdur-Rahman A. <removed_email_address@domain.invalid> wrote:
> databases with very good results. That should be possible with Ferret
> too, shouldn't it?
Sure. I wouldn't replace a database with Ferret in most instances and
probably not in a Rails app since Rails makes it so easy to use a
database. I was just trying to say it was possible to use Ferret or
Lucene as a data store. :-)

David,
Are you trying to make a Lucene-compatible project, or just a similar
project? I ask because I think that, with the possibilities of Ruby, in
time it would be possible to go beyond what is possible in Java.
It's a really great project and I hope to be able to contribute. My C
skills are a little old (10 years or so), but maybe I can help you out
on the Ruby end with improvements...

On 12/13/05, Abdur-Rahman A. <removed_email_address@domain.invalid> wrote:
> David,
>
> Are you trying to make a Lucene-compatible project, or just a similar
> project? I ask because I think that, with the possibilities of Ruby, in
> time it would be possible to go beyond what is possible in Java.
Very good question. At the moment I'm trying to stay compatible. But
if I get enough contributors I'll consider forking off. Lucene is
quite a large project with a lot of contributors so it might be hard
to push ahead of them.
> It's a really great project and I hope to be able to contribute. My C
> skills are a little old (10 years or so), but maybe I can help you out
> on the Ruby end with improvements...
Any help is appreciated. Just recommending Ferret is going to help the
project in the long run so I thank you for that. Also contributing to
the wiki is very important.
Thanks,
Dave

On Dec 13, 2005, at 8:04 AM, David B. wrote:
> ((On a side note, it is possible for some applications to do away with
> the database and use Ferret as the only data store. I think that's how
> Erik H.'s blog software Blogscene works.))
If only I had that e-mail-to-blog gateway, I'd be blogging all the time!
Yes, http://www.blogscene.org/erik is powered entirely by a Lucene
index, a servlet, and some Velocity templates. The original blog
entries reside in blosxom-style text files, but at runtime only
Lucene is used.
It really depends on the scenario, but in general I don't recommend
using Lucene (or Ferret) as the definitive data source. The primary
reason is that an index is optimized for how it is going to be
searched, and you may later want to change how text is tokenized and
thus what terms are indexed. Having the original data around to be
able to re-index with different settings is a good thing. It's also
possible to store the original data in Lucene and pull it out for
reindexing purposes - but that is trickier.
Erik
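The re-indexing argument can be sketched in plain Ruby. This is an illustration under assumptions, not Lucene or Ferret code: SOURCE_DOCS stands in for the original data, and the two analyzers (SPLIT_HYPHENS and KEEP_HYPHENS) are hypothetical tokenizers invented for the example.

```ruby
# Original data kept outside the index (text files or database rows).
SOURCE_DOCS = {
  1 => "Re-indexing made easy",
  2 => "Indexing e-mail archives"
}

# Two hypothetical analyzers: one splits on every non-letter, the other
# keeps hyphenated words together as a single term.
SPLIT_HYPHENS = ->(text) { text.downcase.scan(/[a-z]+/) }
KEEP_HYPHENS  = ->(text) { text.downcase.scan(/[a-z]+(?:-[a-z]+)*/) }

# Build a term => ids index from the originals with the given analyzer.
def build_term_index(docs, analyzer)
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |id, text|
    analyzer.call(text).uniq.each { |term| index[term] << id }
  end
  index
end

old_index = build_term_index(SOURCE_DOCS, SPLIT_HYPHENS)
new_index = build_term_index(SOURCE_DOCS, KEEP_HYPHENS)

# The first index holds "e" and "mail" as separate terms; the rebuilt one
# holds "e-mail" as a single term.
puts old_index.keys.sort.inspect
puts new_index.keys.sort.inspect
```

The rebuild is only this simple because SOURCE_DOCS still exists. An index that kept nothing but the old terms could not recover where the hyphens were.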

On Dec 13, 2005, at 9:30 AM, Abdur-Rahman A. wrote:
> Are you trying to make a Lucene-compatible project, or just a similar
> project? I ask because I think that, with the possibilities of Ruby, in
> time it would be possible to go beyond what is possible in Java.
Could you elaborate in what ways you feel Ferret could go beyond what
is possible with Java Lucene? How does Java hold Lucene back?
Genuinely curious,
Erik

Erik,
I am sorry, I am just excited about Ruby in general. But with a
language like Ruby and a project like Lucene, it's my personal opinion
that LOC makes a difference. Things like mixins and the way you
program in Ruby make things just a bit easier. It took me 4 or 5 days
to understand and work with Lucene (great book, by the way), and it
only took me 10 days to learn most of edge Rails and many other plugins
by reading code (yes, not docs, code, LOL)...
Lucene is a great product and will continue on Java (you can't kill
Java, it's really usable for many things). But Ruby just makes it easy
to program, and with the integration with C, things can be optimized. I
have only been using Ruby for about 20 days, but it amazes me how much
of a difference a language can make...
So I have to revise my statement a bit, but I think that, in time,
merging Ferret and ActiveRecord together could make it a better product
than Lucene : ) But that's talk for the future...
Well, I am amazed to see you here : ) What is your opinion?
Abdur-Rahman

On Dec 13, 2005, at 11:28 AM, Abdur-Rahman A. wrote:
> I am sorry, I am just excited about Ruby in general. But with a
> language like Ruby and a project like Lucene, it's my personal
> opinion that LOC makes a difference. Things like mixins and the way
> you program in Ruby make things just a bit easier. It took me
> 4 or 5 days to understand and work with Lucene (great book, by the
> way), and it only took me 10 days to learn most of edge Rails and
> many other plugins by reading code (yes, not docs, code, LOL)...
A full-text search engine and a web framework are not quite
comparable.
Lucene is optimized heavily - its code is more C-like than Java-like.
Making Lucene more OO or taking advantage of all the fancy
Ruby ways of method trickery is likely to slow things down. The
entire idea of a full-text search engine is to be fast! (Oh, and to
be easy on resources as well.)
> Lucene is a great product and will continue on Java (you can't
> kill Java, it's really usable for many things). But Ruby just makes
> it easy to program, and with the integration with C, things can be
> optimized. I have only been using Ruby for about 20 days, but it
> amazes me how much of a difference a language can make...
The folks that would be coding under the covers of Ferret or Lucene
are a highly specialized group of folks. Likewise with the core code
of Rails. Most users don't need to see what is underneath - it just
works.
Indeed the language makes a difference, but also the goal of the
effort. A full-text search engine has some very specialized needs
and even the most basic data structures in high level languages like
Hash and Array are only used if they are fast enough, otherwise
alternatives are created. This is definitely the case with Lucene.
> So I have to revise my statement a bit, but I think that, in time,
> merging Ferret and ActiveRecord together could make it a better
> product than Lucene : ) But that's talk for the future...
Well, in all fairness to Lucene, it is orthogonal to the database
concern entirely. Of course Ferret + ActiveRecord > just Lucene, but
to make the comparison more fair, how about Lucene + Hibernate?
There are hooks for Hibernate to index with Lucene, even using Java
annotations to mark the fields to be indexed, and how they are to be
indexed. I see ActiveRecord + Ferret to be a great path to go, and
the acts_as_ferret initial implementation is on the right track. I
hope to delve into this area more myself in the future (though my
work does not currently involve relational databases, but will soon).
> Well, I am amazed to see you here : ) what is your opinion?
I've been a Ruby fan for ages, ever since catching a Dave T.
presentation in '02. I've dreamed of RubyLucene for years, creating
the rubylucene (formerly rucene) project at RubyForge once upon a
time but not doing much with it beyond some low-level I/O proof of
concept tests.
I'm ecstatic that Ferret exists! I do have some reservations on the
effort to port it all to C, as I'd really like the effort to aim
towards the architecture PyLucene has, where it uses GCJ against Java
Lucene, and then wraps it, using SWIG, into a Pythonic API. In order
to avoid porting every time Java Lucene changes (which is where the
guru creator Doug Cutting spends his effort), it would be a simple
recompilation (and perhaps some API glue).
Erik

I just got done reviewing some of the info in the Ferret wiki. It looks
like some great work - thanks!
I'm building an app that is going to have some search capability and I
was planning on using MySQL with fulltext searches, but looking at
Ferret has got me wondering if there might not be a better way.
Specifically, I was wondering about the idea of using an in-memory index
for increasing the speed of searches.
The data I'm storing will be most utilized when it is relatively new.
After it's a few days old, people won't need it as much. So putting all
this data in the same database may not make sense (if it's relatively
easy to split it into 'fresh' and 'stale' databases).
Would it make sense to consider using an in-memory cache of documents
for the newest data while having a disk-based index for when people want
to search for older documents? Or would the performance gains not be
worth the effort?
-kevin

I just wanted to add that I think the ideal solution would be for me to
be able to define a single index that did both -- that is, one that
would cache documents in memory while keeping the full index on disk.
It would be great as well if I could specify how I wanted the cache to
work -- say, by giving it a regular expression or some query to tell it
what should be cached in memory. Maybe I could also specify a limit on
the total memory it should use for cache.
I might, for example, want to have it cache documents based on a certain
user or customer id rather than cache them by date. Maybe whenever a new
user logs in I modify the cache settings to include their documents in
the cache -- and whenever someone logs out I flush theirs.
The value of this is that it hides the complexity from developers/users
and makes it easy to use.
Sorry for the 'stream of consciousness' design reqs -- I'm just dumping
the idea now since I was thinking about it...

Erik H. wrote:
> A full-text search engine and a web framework are not quite
> comparable.
>
> Lucene is optimized heavily - its code is more C-like than Java-like.
> Making Lucene more OO or taking advantage of all the fancy
> Ruby ways of method trickery is likely to slow things down. The
> entire idea of a full-text search engine is to be fast! (Oh, and to
> be easy on resources as well.)
The Java version is really heavy at the moment (just to mention it ;)),
but you're quite right: search queries can't be cached very easily, so
writing optimized code is very important.
> Well, in all fairness to Lucene, it is orthogonal to the database
> concern entirely. Of course Ferret + ActiveRecord > just Lucene, but
> to make the comparison more fair, how about Lucene + Hibernate?
> There are hooks for Hibernate to index with Lucene, even using Java
> annotations to mark the fields to be indexed, and how they are to be
> indexed. I see ActiveRecord + Ferret to be a great path to go, and
> the acts_as_ferret initial implementation is on the right track. I
> hope to delve into this area more myself in the future (though my
> work does not currently involve relational databases, but will soon).
I am busy at the moment creating a plugin for Rails that will make it
easy to extend ActiveRecord. I am trying to combine the database and
Ferret with a new method that builds upon find (search): it only
queries Ferret if a query is present, and then fetches the rows using
find.
> to avoid porting every time Java Lucene changes (which is where the
> guru creator Doug Cutting spends his effort), it would be a simple
> recompilation (and perhaps some API glue).
That's a very good idea, but compiling Java sounds weird :). David, have
you considered this? I wonder how well it would integrate...

On Dec 13, 2005, at 6:15 AM, David B. wrote:
>> existing
>> databases with very good result, this should be possible with
>> ferret or not?
>
> Sure. I wouldn't replace a database with Ferret in most instances and
> probably not in a Rails app since rails makes it so easy to use a
> database. I was just trying to say it was possible to use Ferret or
> Lucene as a data store. :-)
I treat the data I store in the Ferret index as a denormalized table
tuned for the queries it answers.
jeremy

On 12/14/05, Abdur-Rahman A. <removed_email_address@domain.invalid> wrote:
> > the acts_as_ferret initial implementation is on the right track. I
> > the rubylucene (formerly rucene) project at RubyForge once upon a
>
> That's a very good idea, but compiling Java sounds weird :). David, have
> you considered this? I wonder how well it would integrate...
Yes, Erik and I have discussed it already. It might be a better way to
do it but I can't find the motivation. It's a lot more interesting and
motivating for me trying to create something that runs faster than
Lucene. Besides being slightly faster, C is also lighter on resources
and makes for a much smaller download. I was and still am interested
in desktop search so these are all important to me. Speaking of Doug
Cutting, he has some words to say on this too;
http://nutch.sourceforge.net/blog/2005/02/open-sou...
So those are my reasons for taking the route I am, and since I'm
currently doing the work, I get to choose. ;-) If anyone wants to get
stuck into porting the PyLucene stuff I'm more than willing to lend
a hand. It's definitely worth doing but it's not really my cup of
tea.

Hi Kevin,
I can't quite tell from your description. Do you actually want to
store and retrieve the documents from a Ferret index? Or do you just
want to run the search on the index and then retrieve the results from
the database? Also, how large a document set are you expecting? If you
still have to retrieve the documents from the database I think Ferret
should be fine as is without the caching. If you are running into
performance problems after it's implemented I could certainly help you
set up some caching.
Cheers,
Dave

> http://nutch.sourceforge.net/blog/2005/02/open-sou...
>
> So those are my reasons for taking the route I am, and since I'm
> currently doing the work, I get to choose. ;-) If anyone wants to get
> stuck into porting the PyLucene stuff I'm more than willing to lend
> a hand. It's definitely worth doing but it's not really my cup of
> tea.
Haha : ) Well, you're doing a great job, I'll continue to use Ferret! I
don't have a client request at the moment for taking on such a project.
Maybe in a couple of months...

> So those are my reasons for taking the route I am, and since I'm
> currently doing the work, I get to choose. ;-) If anyone wants to get
> stuck into porting the PyLucene stuff I'm more than willing to lend
> a hand. It's definitely worth doing but it's not really my cup of
> tea.
My kudos for these honest words!! A motivated developer is often the
most important thing.
Even in this early stage the rails community owes a great deal of
compliment to the ongoing efforts on ferret.
regards
Jan

On 12/14/05, Jan P. <removed_email_address@domain.invalid> wrote:
> Even in this early stage the rails community owes a great deal of
> compliment to the ongoing efforts on ferret.
Especially the logos. ;-)
Thanks.

I'm not sure yet what's best. I haven't built that part of my app yet
and am still working through the design. I'm just trying to think
through the best approach for now. Do you have pointers to docs that can
provide some basic 'rules of thumb' for design - like when to store docs
in a database and run a search on the index -v- when to store docs in
the index directly?
I used Verity for search on an e-commerce site I helped build a few
years ago. We stored the actual docs in a database (product
descriptions, actually) but used Verity for searching - it worked fine,
but was a pain since updating the product catalog tables and the Verity
search index had to be closely coordinated or you'd find search results
for products that weren't in the database...
Also, regarding creating an index in memory -v- creating it on disk --
are there significant performance differences (e.g., 20% - 50% faster or
more) when using an in-memory index? Has anyone published test results?
Thanks again for your help and your efforts. My needs aren't pressing,
I'm just trying to figure out how using Ferret might benefit the app I'm
building.
-k
Quoting David B. <removed_email_address@domain.invalid>:

On Dec 13, 2005, at 12:54 PM, Abdur-Rahman A. wrote:
> The Java version is really heavy at the moment (just to mention it ;)),
> but you're quite right: search queries can't be cached very easily, so
> writing optimized code is very important.
What do you mean by "heavy"? I guess I'm being a bit defensive
about Java Lucene. I'm not understanding your negatives to Java
Lucene other than your preference for Ruby. It still remains to be
seen how performant and optimized Ferret can be compared to Java
Lucene. My hunch is that porting to C will make it slightly faster
in spots, but whether it is worth the headaches of maintaining the
port is my question.
>> Pythonic API. In order to avoid porting every time Java Lucene
>> changes (which is where the guru creator Doug Cutting spends his
>> effort), it would be a simple recompilation (and perhaps some API
>> glue).
>
> That's a very good idea, but compiling Java sounds weird :). David,
> have you considered this? I wonder how well it would integrate...
PyLucene is *fast*. Super fast.
Erik

On Dec 13, 2005, at 1:58 PM, Jan P. wrote:
> Even in this early stage the rails community owes a great deal of
> compliment to the ongoing efforts on ferret.
Hear hear! Kudos to Dave for Ferret and I fully encourage him to
choose the development path he wants to go on. I hope he succeeds in
making a faster Lucene, for sure, regardless of what language he
creates it for.
Erik

On 12/14/05, Kevin B. <removed_email_address@domain.invalid> wrote:
> I'm not sure yet what's best. I haven't built that part of my app yet and am
> still working through the design. I'm just trying to think through the best
> approach for now. Do you have pointers to docs that can provide some basic
> 'rules of thumb' for design - like when to store docs in a database and run a
> search on the index -v- when to store docs in the index directly?
I don't know if you caught the other thread on Ferret but as we were
discussing, it's usually better to store the documents in the database
and use Ferret for finding the relevant documents. In Rails, the way
to go is probably to use something like this:
http://ferret.davebalmain.com/trac/wiki/FerretOnRails
The main reason you'd store stuff in the index is to allow result
sorting. For example, if you wanted to sort your search results by
create_date then you'd need to store create_date in the index. There
are a few other times I can think of that you might want to store
documents in an index but they don't apply to a Rails app.
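A rough sketch of why a stored create_date helps, in plain Ruby. The entry layout below (ENTRIES with :terms and :create_date) and the helper search_newest_first are assumptions made up for illustration. They are not Ferret's index format or API.

```ruby
require "date"

# Toy index entries: the terms are searchable and create_date is "stored"
# alongside them. This layout is an assumption, not Ferret's format.
ENTRIES = [
  { id: 1, terms: %w[ferret search rails], create_date: Date.new(2005, 12, 1) },
  { id: 2, terms: %w[mysql search],        create_date: Date.new(2005, 12, 13) },
  { id: 3, terms: %w[ferret wiki],         create_date: Date.new(2005, 11, 20) }
]

# Return matching ids, newest first, using only data held in the index.
def search_newest_first(entries, term)
  entries
    .select { |entry| entry[:terms].include?(term) }
    .sort_by { |entry| entry[:create_date] }
    .reverse
    .map { |entry| entry[:id] }
end

puts search_newest_first(ENTRIES, "search").inspect
```

Without the stored date, the hits would have to be looked up in the database first and sorted afterwards.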
> I used Verity for search on an e-commerce site I helped build a few years ago.
> We stored the actual docs in a database (product descriptions, actually) but
> used verity for searching - it worked fine, but was a pain since updating the
> product catalog tables and the verity search index had to be closely
> coordinated or you'd find search results for products that weren't in the
> database...
You need to be careful of this with Ferret too. This is the problem
the acts_as_ferret ActiveRecord hooks are trying to solve. It still
requires a bit of work. I haven't played with Rails for a while now
but when I get the chance I'll try and come up with something better.
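The idea behind such hooks can be sketched without Rails. The classes below are toy stand-ins: ProductStore is not ActiveRecord, SearchIndex is not Ferret, and the hook names after_save and after_destroy are hypothetical names chosen for this example. The point is only that every write to the store also updates the index, so the two cannot drift apart.

```ruby
# A toy record store with save/destroy hooks; it is not ActiveRecord and
# the hook names are made up for this sketch.
class ProductStore
  def initialize
    @rows = {}
    @after_save = []
    @after_destroy = []
  end

  def after_save(&hook)
    @after_save << hook
  end

  def after_destroy(&hook)
    @after_destroy << hook
  end

  def save(id, description)
    @rows[id] = description
    @after_save.each { |hook| hook.call(id, description) }
  end

  def destroy(id)
    @rows.delete(id)
    @after_destroy.each { |hook| hook.call(id) }
  end

  def find(id)
    @rows[id]
  end
end

# A toy search index keyed by record id.
class SearchIndex
  def initialize
    @terms_by_id = {}
  end

  def update(id, text)
    @terms_by_id[id] = text.downcase.scan(/[a-z0-9]+/).uniq
  end

  def remove(id)
    @terms_by_id.delete(id)
  end

  def search(term)
    @terms_by_id.select { |_id, terms| terms.include?(term.downcase) }.keys
  end
end

store = ProductStore.new
index = SearchIndex.new
store.after_save    { |id, text| index.update(id, text) }
store.after_destroy { |id| index.remove(id) }

store.save(1, "Red wool sweater")
store.save(2, "Blue wool scarf")
store.destroy(1)

puts index.search("wool").inspect
```

Every id the index returns still exists in the store, which is the property the Verity setup described in the quote lacked.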
> Also, regarding creating an index in memory -v- creating it on disk -- are there
> significant performance differences (eg, 20% - 50% faster or more) when using an
> in-memory index? Has anyone published test results?
>
> Thanks again for your help and your efforts. My needs aren't pressing, I'm just
> trying to figure out how using Ferret might benefit the app I'm building.
This is kind of a catch-22. If you can store your index in memory then
it is probably small enough that it won't need to be stored in memory.
With the C version I'm working on the difference is only about 20%-30%
so not worth worrying about in my opinion.
HTH,
Dave

> What do you mean by "heavy"? I guess I'm being a bit defensive
> about Java Lucene. I'm not understanding your negatives to Java
> Lucene other than your preference for Ruby. It still remains to be
> seen how performant and optimized Ferret can be compared to Java
> Lucene. My hunch is that porting to C will make it slightly faster
> in spots, but whether it is worth the headaches of maintaining the
> port is my question.
I think I am sounding more negative than I am : ) I repeat, I like
Lucene for most projects, but for something like a large-scale search
engine I think it may be better to have a C implementation. On one
project we used CLucene or lucene4c (I don't remember which, I was the
project leader) and it was much faster than using Lucene. I was only
mentioning the C port because it may be faster to implement this way.
> PyLucene is *fast*. Super fast.
Erik, you are the expert, I am just trying to learn as I go along...
Thanks for your feedback : )

Thanks - all this info is right on. Great!
Quoting David B. <removed_email_address@domain.invalid>:
> This is kind of a catch-22. If you can store your index in memory then
> it is probably small enough that it won't need to be stored in memory.
> With the C version I'm working on the difference is only about 20%-30%
> so not worth worrying about in my opinion.
My situation is potentially different. The data I am storing is
text-based and somewhat time sensitive. That is, the newest data is what
most users will be interested in.
However, I need to allow the ability to search for *all results* -- both
new data and old. Once the database is large, then this "new data" may
be only 1% or less of the overall database. The new data may consist of
several thousand documents.
I'm wondering if it might be useful to store *all data* in a disk-based
index while *also* storing the newest data in an in-memory index. This
would allow me to offer faster results when searching only the new data
(which is what most people will likely use) while still allowing people
to search the entire dataset if they want to.
Of course, this is only a good idea if it provides a significantly
faster response time for searching the in-memory index.
-k

On Dec 13, 2005, at 3:11 PM, Abdur-Rahman A. wrote:
> scale search engine I think it may be better to have a C
> implementation. On one project we used CLucene or lucene4c (I
> don't remember which, I was the project leader) and it was much faster
> than using Lucene. I was only mentioning the C port because it may be
> faster to implement this way.
Java Lucene is powering search in some very very heavy duty places,
not to mention some top secret ones.
For example, Doug is using Nutch (an open source "Google", with
Lucene as a core component) to revamp the infrastructure behind The
Internet Archive. Yahoo Research Labs and others have funded Doug's
Nutch efforts. I just want to be clear about Java Lucene being as
"enterprise" savvy as anyone needs. CLucene was a valiant effort,
and supposedly is slightly speedier in some cases, but also not up to
date with the latest Java Lucene API. lucene4c hasn't gotten off the
ground.
Java Lucene is the most up to date version available and has many
features not found in the ports that haven't kept up. PyLucene just
released a version up to date with Java Lucene's Subversion trunk
(mostly by just recompiling, though there were some tweaks to the GCJ/
SWIG pieces apparently as well). All the ports, Ferret included,
will always be playing catch-up with Java Lucene. If the maintainers
of the ports take a break, they will be behind.
I don't want to discourage folks from porting Lucene at all. But I'm
guardedly optimistic about a port being as good as Java Lucene. It
truly is one of the few gems in the Java open source world with very
little quality competition.
Erik

On 12/14/05, Kevin B. <removed_email_address@domain.invalid> wrote:
> interested in.
>
> However, I need to allow the ability to search for *all results* -- both new
> data and old. Once the database is large, then this "new data" may be only 1%
> or less of the overall database. The new data may consist of several thousand
> documents.
So if I do the math, you're expecting to have several hundred thousand
documents? Ok, you've got my attention now.
> I'm wondering if it might be useful to store *all data* in a disk-based index
> while *also* storing the newest data in an in-memory index. This would allow me
> to offer faster results when searching only the new data (which is what most
> people will likely use) while still allowing people to search the entire
> dataset if they want to.
In-memory or not, it will certainly be faster to search a smaller
document set so splitting the index in two might not be a bad idea.
Perhaps you could have a daily process which reindexes the recent
document set.
> Of course, this is only a good idea if it provides a significantly faster
> response time for searching the in-memory index.
The in memory part won't make the big difference. Having a smaller
index might. I'd recommend doing the simplest thing possible and
refactoring if necessary. It shouldn't be hard to add a second
in-memory index later. Up to you though.
Dave
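The split suggested above can be sketched in plain Ruby. Everything here is an assumption for illustration: RECENT_DAYS is an arbitrary cutoff, the :age_days field is invented, and index_documents, build_indexes and search_docs are hypothetical helpers, not Ferret calls. Both indexes are ordinary Hashes, since the gain comes from the smaller document set rather than from where the index lives.

```ruby
RECENT_DAYS = 3 # assumed cutoff for "fresh" documents

# Build a term => ids index over the given documents.
def index_documents(docs)
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |id, doc|
    doc[:text].downcase.scan(/[a-z0-9]+/).uniq.each { |term| index[term] << id }
  end
  index
end

# Rebuild both indexes; the small one is the candidate for a daily job.
def build_indexes(docs)
  recent = docs.select { |_id, doc| doc[:age_days] <= RECENT_DAYS }
  { recent: index_documents(recent), all: index_documents(docs) }
end

# Search the small index by default and the full one only on request.
def search_docs(indexes, term, include_old: false)
  indexes[include_old ? :all : :recent].fetch(term.downcase, [])
end

documents = {
  1 => { text: "Server outage report", age_days: 0 },
  2 => { text: "Outage postmortem",    age_days: 2 },
  3 => { text: "Old outage notes",     age_days: 40 }
}

indexes = build_indexes(documents)
puts search_docs(indexes, "outage").inspect
puts search_docs(indexes, "outage", include_old: true).inspect
```

Starting with only the :all index and adding :recent later would change build_indexes and the default in search_docs, which fits the advice to do the simplest thing first.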

On 12/14/05, Abdur-Rahman A. <removed_email_address@domain.invalid> wrote:
> What you could consider is using something like cacheAR for the latest
> queries or for popular queries..
I'm not really sure but I think you'd probably just use cacheAR to
cache the popular documents. I don't know if I mentioned already but I
haven't had enough time to work with much of the rails stuff yet.
Soon. :-)