lucene-java-user mailing list archives

Hello again,
At http://people.level7.ro/mile.rosu/small_index.zip you can find a couple of
documents from our index, which might give you a better overview of our
problem (both the separate indexes and the merged version).
Our problem remains the date index: a date record has additional
fields used for date range searches which, unfortunately, cannot be merged
into one index, or at least we do not see a solution.
Thank you,
Mile
-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
Sent: Tuesday, June 13, 2006 8:45 PM
To: java-user@lucene.apache.org
Cc: Mircea Pop
Subject: RE: Using more than one index
: A document (in our case an xml that has many metadata) can have more
: than one date, each date with 2 attributes:
: <date type="document" art="geburt">00-00-1886</date>
:
: In the date index I have for every <date> in the input xml a document
: with fields: type (document | other), date, art (birthday | deportation |
: death...). For example if I merge all the dates that correspond to a
: document then the new type field will contain all the values. So if I
So make a separate field for each type of date. Lucene is very good at
supporting documents with heterogeneous sets of fields -- and it's even
better if you use OMIT_NORMS (which makes perfect sense for a date field,
where norm values are meaningless anyway).
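As a rough illustration of the "one field per type of date" idea, here is a minimal plain-Java sketch. It does not use the Lucene API: `MergedDateDoc`, the `date_<type>_<art>` field naming, and the "death" value are all hypothetical, and the DD-MM-YYYY input format is assumed from the `<date>` example quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one merged document: field name -> values. Not Lucene API. */
class MergedDateDoc {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Converts "DD-MM-YYYY" (zeros allowed for unknown parts) to sortable "YYYYMMDD". */
    static String toSortable(String ddMmYyyy) {
        String[] parts = ddMmYyyy.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected DD-MM-YYYY: " + ddMmYyyy);
        }
        return parts[2] + parts[1] + parts[0];
    }

    /** Stores the date under a field named after its two attributes (assumed naming). */
    void addDate(String type, String art, String ddMmYyyy) {
        String fieldName = "date_" + type + "_" + art;
        fields.computeIfAbsent(fieldName, k -> new ArrayList<>())
              .add(toSortable(ddMmYyyy));
    }
}
```

With this layout, the date `00-00-1886` with `type="document"` and `art="geburt"` ends up as the value `18860000` in the field `date_document_geburt`, so dates of different kinds never share a field even though they live in the same document.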
: A couple of suggestions...
:
: 1) Don't use multiple indexes. Create one index, with one document per
: "thing" you want to return (in this case it sounds like books), and index
: all of the relevant data about each thing in that doc. If multiple people
: worked on a book, add all of their names to the same field. Add all of
: the dates to the book doc -- if you need to distinguish the different
: types of dates, make a separate field for each type.
:
: If you *must* cross-reference...
:
: 2) Make sure you aren't using the Hits API to iterate over all the
: results when gathering IDs -- use a lower-level API (like a
: HitCollector).
:
: 3) Use the FieldCache to get the IDs instead of the stored Document
: fields.
:
: 4) Don't extract full ID lists from all of the indexes and then search
: on one of the indexes again with the ID list ... use the ID lists
: generated from the supporting indexes (people and dates) to build a
: Filter that you can use when searching the main index.
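Suggestions 2 to 4 can be modeled in plain Java as follows. This is only a sketch of the data flow, not Lucene code: `IdFilterSketch` and its methods are hypothetical names, the `int[]` of hit document numbers stands in for what a HitCollector would gather, the `String[]` indexed by document number stands in for a FieldCache-style array of @ID@ values, and the `BitSet` stands in for the bits a Filter would expose for the main index.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of suggestions 2-4; all names are hypothetical, not Lucene API. */
class IdFilterSketch {
    /** Builds the inverse of a per-document ID array: ID -> document number. */
    static Map<String, Integer> invert(String[] idByDoc) {
        Map<String, Integer> docById = new HashMap<>();
        for (int doc = 0; doc < idByDoc.length; doc++) {
            docById.put(idByDoc[doc], doc);
        }
        return docById;
    }

    /** Marks every main-index document whose ID was hit in a supporting index. */
    static BitSet toMainBits(int[] hitDocs, String[] supportIdByDoc,
                             Map<String, Integer> mainDocById, int mainMaxDoc) {
        BitSet bits = new BitSet(mainMaxDoc);
        for (int doc : hitDocs) {
            // Array lookup by document number: no stored document is loaded.
            Integer mainDoc = mainDocById.get(supportIdByDoc[doc]);
            if (mainDoc != null) {
                bits.set(mainDoc);
            }
        }
        return bits;
    }

    /** A main-index document is allowed only if both supporting indexes matched it. */
    static BitSet combine(BitSet personBits, BitSet dateBits) {
        BitSet allowed = (BitSet) personBits.clone();
        allowed.and(dateBits);
        return allowed;
    }
}
```

The point of the design is that the main index is searched only once, with the combined bit set restricting which documents may match, instead of being searched a second time with a long list of IDs.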
:
:
:
: : Date: Mon, 12 Jun 2006 12:22:30 +0300
: : From: Mile Rosu <mile.rosu@level7.ro>
: : Reply-To: java-user@lucene.apache.org
: : To: java-user@lucene.apache.org
: : Subject: Using more than one index
: :
: : Hello,
: :
: : We have an application dealing with historical books. The books have
: : metadata consisting of event dates and person names, among others.
: : The FullText, Person and Date indexes were split until we realized that,
: : for a larger number of documents (400K), combining the sequential
: : search hits took far too long to complete (15 min).
: : The date index was built using the suggestion found at:
: : http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
: : (big thanks for the hint)
: :
: : Is there a recommended approach to combining results from different
: : indexes (with different fields)?
: :
: : The index structure:
: : MainIndex:
: :   Fields:
: :   @ID@ - keyword (document id)
: :   @FULLTEXT@ - tokenized (used for full text search)
: :   Ptitle - tokenized (used for full text publication title search)
: :   Dtitle - tokenized (used for full text document title search)
: :   Type - keyword (used for document type)
: :
: : PersonIndex:
: :   @ID@ - keyword (document id == mainIndex.@ID@)
: :   Person - tokenized (full text person name search)
: : DateIndex:
: :   @ID@ - keyword (document id == mainIndex.@ID@)
: :   Date - date as YYYYMMDD - keyword
: :   Type - type of date (document date, birth day, etc...)
: :   @YYYY@ - year of date
: :   @YYYYMM@ - year and month of date
: :   @DDD@ - decade
: :   @CC@ - century of date
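The coarser DateIndex fields can all be derived from the YYYYMMDD value. The following is a minimal sketch in plain Java, not the poster's actual indexing code: `DateBuckets` is a hypothetical name, and it assumes that @DDD@ holds the first three digits of the year and @CC@ the first two, which the message does not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that derives the coarser date fields from a YYYYMMDD string. */
class DateBuckets {
    static Map<String, String> bucketFields(String yyyymmdd) {
        if (!yyyymmdd.matches("\\d{8}")) {
            throw new IllegalArgumentException("expected YYYYMMDD: " + yyyymmdd);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Date", yyyymmdd);
        fields.put("@YYYY@", yyyymmdd.substring(0, 4));   // year
        fields.put("@YYYYMM@", yyyymmdd.substring(0, 6)); // year and month
        fields.put("@DDD@", yyyymmdd.substring(0, 3));    // decade (assumed: first 3 digits)
        fields.put("@CC@", yyyymmdd.substring(0, 2));     // century (assumed: first 2 digits)
        return fields;
    }
}
```

A wide range query can then match a handful of century, decade, year and month terms instead of enumerating every individual day in the range.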
: :
: :
: : E.g.:
: : If I want to search for documents that contain person "John", full
: : text "book" and a date before 06/12/2005:
: : Step 1: search personIndex for John - retrieve all @ID@ from the hit
: : list
: : Step 2: search DateIndex for documents that have dates before
: : 06/12/2005 - retrieve @ID@ from the hit list
: : Step 3: search mainIndex for "book" - retrieve all @ID@
: : Step 4: combine all the lists
: : Step 5: search mainIndex for documents with the @ID@ from the combined
: : id list
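Step 4 can be sketched in plain Java as below, assuming that "combine" means intersection, since the example query requires the person, the full text and the date to all match. `IdListCombiner` is a hypothetical name and the IDs are plain strings; no Lucene API is involved.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for step 4: keep only the IDs present in every hit list. */
class IdListCombiner {
    /** Returns the common IDs, in the order in which they appear in the first list. */
    static Set<String> intersect(List<List<String>> idLists) {
        if (idLists.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> result = new LinkedHashSet<>(idLists.get(0));
        for (List<String> ids : idLists.subList(1, idLists.size())) {
            // Copy into a HashSet so each retainAll lookup is O(1) on average.
            result.retainAll(new HashSet<>(ids));
        }
        return result;
    }
}
```

The intersection itself is cheap; in this model the cost described in the message comes from producing the input lists, which means reading @ID@ for every hit in every index.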
: :
: : Each search takes less than 1 second, but retrieving @ID@ from the index
: : takes a lot more - the time increases with the number of hits. This is
: : because when retrieving a field value from a document hit, the Lucene
: : engine loads all the fields from the index (the entire document). So if
: : one search returns 300,000 hits, I have to iterate through all of them
: : and retrieve the @ID@ field value - this takes a lot of time.
: :
: : Regards,
: : Mile Rosu
: :
: :
---------------------------------------------------------------------
: : To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: : For additional commands, e-mail: java-user-help@lucene.apache.org
: :
:
:
:
: -Hoss
:
:
-Hoss