jackrabbit-dev mailing list archives

Hi,
I noticed that the time a query takes scales with the total size of the query
result. Digging into the code, I found the following lines in
org.apache.jackrabbit.core.query.lucene.QueryImpl:
result = index.executeQuery(this, query, orderProperties, ascSpecs);
ids = new ArrayList(result.length());
scores = new ArrayList(result.length());
for (int i = 0; i < result.length(); i++) {
    NodeId id = NodeId.valueOf(result.doc(i).get(FieldNames.UUID));
    // check access
    if (accessMgr.isGranted(id, AccessManager.READ)) {
        ids.add(id);
        scores.add(new Float(result.score(i)));
    }
}
The first line, where the Lucene query is executed, is not the problem. The
problem apparently starts in the loop, where the UUID of _every_ matching
document is fetched. If you get a search result with 10,000+ documents, which
we do, and you only need the first 20, this becomes a bottleneck.
Two possible solutions come to mind:
1. Use a lazy QueryResultImpl that keeps a reference to the result and only
fetches the UUIDs of requested nodes. This requires that the access check be
done in the QueryResultImpl, and the size returned by size() may then vary if
you don't have access to some nodes (which it already does if a node in the
result gets deleted). The real problem is how to trigger result.close(), which
closes the index. I'm not even sure whether it causes problems if indexes are
not closed as soon as possible.
2. Use a reverse DocNumberCache. To be really effective, this cache has to
hold all docNum-to-UUID mappings, because even if just 500 out of 10,000 are
uncached, this already causes a performance hit. So a cache with a fixed size
wouldn't be sufficient.
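To make option 1 a bit more concrete, here is a rough sketch of the lazy idea.
None of these names are Jackrabbit's actual classes: LazyQueryResult,
resolveUuid and isGranted are made-up stand-ins for the index lookup and the
AccessManager check, just to show that only the documents actually visited
get resolved:

```java
import java.util.*;
import java.util.function.*;

// Hypothetical sketch: UUIDs are resolved only when a node is actually
// requested, and the access check happens lazily at that point.
class LazyQueryResult {
    private final int[] docNums;                   // raw Lucene doc numbers
    private final IntFunction<String> resolveUuid; // docNum -> UUID (index read)
    private final Predicate<String> isGranted;     // stand-in for AccessManager
    private final Map<Integer, String> cache = new HashMap<>();

    LazyQueryResult(int[] docNums, IntFunction<String> resolveUuid,
                    Predicate<String> isGranted) {
        this.docNums = docNums;
        this.resolveUuid = resolveUuid;
        this.isGranted = isGranted;
    }

    // Fetch up to 'limit' accessible UUIDs starting at 'offset'; only the
    // documents actually visited hit the index, so asking for the first 20
    // out of 10,000 resolves roughly 20 UUIDs instead of all of them.
    List<String> fetch(int offset, int limit) {
        List<String> out = new ArrayList<>();
        for (int i = offset; i < docNums.length && out.size() < limit; i++) {
            String uuid = cache.computeIfAbsent(docNums[i], resolveUuid::apply);
            if (isGranted.test(uuid)) {
                out.add(uuid);
            }
        }
        return out;
    }
}
```

As noted above, this still leaves open when to call result.close(), since the
underlying index has to stay open for as long as the lazy result is alive.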
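And for option 2, a minimal sketch of what a reverse cache could look like.
ReverseDocNumberCache is a made-up name, not an existing Jackrabbit class;
as argued above it would have to be unbounded rather than a fixed-size LRU,
and entries would need to be invalidated when the index merges or reassigns
doc numbers (merge handling is not shown here):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical unbounded docNum -> UUID cache; a miss means the caller
// falls back to reading the stored UUID field from the index.
class ReverseDocNumberCache {
    private final Map<Integer, String> docNumToUuid = new ConcurrentHashMap<>();

    // Populate when a document is added to the index.
    void put(int docNum, String uuid) {
        docNumToUuid.put(docNum, uuid);
    }

    // Invalidate when a doc number becomes stale (delete/merge).
    void remove(int docNum) {
        docNumToUuid.remove(docNum);
    }

    // Returns the cached UUID, or null on a miss.
    String get(int docNum) {
        return docNumToUuid.get(docNum);
    }
}
```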
Any thoughts?
Cheers,
Christoph