I am just starting out with Lucene, and I'm trying to index a database so I can perform searches on the content. There are 3 tables that I am interested in indexing:

1. Image table - this is a table where each entry represents an image. Each image has a unique ID and some other info (title, description, etc).

2. People table - this is a table where each entry represents a person. Each person has a unique ID and other info (name, address, company, etc).

3. Credited table - this table has 3 fields (image, person, and credit type). Its purpose is to associate people with an image as the credits for that image. Each image can have multiple credited people (there's the director, photographer, props artist, etc). Also, a person can be credited on multiple images.

I'm trying to index these tables so I can perform some searching using Lucene but as I've read, I need to flatten the structure.

The first solution that came to me would be to create a Lucene document for each combination of Image/Credited Person. I'm afraid this will create a lot of duplicate content in the index (all the details of an image/person would have to be duplicated in each Document for each person that worked on the image).

Is there anybody experienced with Lucene that can help me with this? I know there is no generic solution to denormalization, that is why I provided a more specific example.

Thank you, and I will gladly provide more info on the database if anybody needs it.

PS: Unfortunately, there is no way for me to change the structure of the database (it belongs to the client). I have to work with what I have.

1 Answer

You could create a Document for each person with all the associated images' descriptions concatenated (either appended to the person info or in a separate Field).
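As a rough sketch of this first option, the snippet below flattens the three tables into one record per person. Plain Python dicts stand in for Lucene Documents here, and the sample rows, the field names (`images_text` and so on) and the `build_person_docs` helper are all made up for illustration; in a real setup each dict would become a Document whose fields you add to the index.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the three database tables.
images = {
    1: {"title": "Sunset", "description": "Beach at dusk"},
    2: {"title": "Portrait", "description": "Studio headshot"},
}
people = {
    10: {"name": "Ann Lee", "company": "Acme"},
    11: {"name": "Bob Roy", "company": "Initech"},
}
credited = [  # (image_id, person_id, credit_type)
    (1, 10, "photographer"),
    (1, 11, "director"),
    (2, 10, "photographer"),
]


def build_person_docs(images, people, credited):
    """Return one flat record per person, with image text concatenated."""
    # Group image IDs by the person credited on them.
    image_ids_by_person = defaultdict(list)
    for image_id, person_id, _credit_type in credited:
        image_ids_by_person[person_id].append(image_id)

    docs = []
    for person_id, person in people.items():
        # Concatenate title and description of every credited image.
        texts = [
            f"{images[i]['title']} {images[i]['description']}"
            for i in image_ids_by_person.get(person_id, [])
        ]
        docs.append({
            "person_id": str(person_id),
            "name": person["name"],
            "company": person["company"],
            "images_text": " ".join(texts),  # the separate searchable field
        })
    return docs


person_docs = build_person_docs(images, people, credited)
```

Image text is repeated once per credited person, so the index grows with the number of credits rather than the number of images.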

Or, you could create a minimal Document for each person, create a Document for each image, put the creators' names and credit info in a separate field of the image Document, and link them by putting the person ID (or person Document id) in a third, non-indexed field. (Lucene is geared toward flat document indexing, not relational data, but relations can be defined manually.)
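A minimal sketch of this second layout, again with plain Python dicts standing in for Lucene Documents. The sample rows, the `type` field used to tell the two kinds of record apart, and the helper names are assumptions for illustration, not anything Lucene prescribes.

```python
# Hypothetical sample rows standing in for the three database tables.
images = {
    1: {"title": "Sunset", "description": "Beach at dusk"},
    2: {"title": "Portrait", "description": "Studio headshot"},
}
people = {
    10: {"name": "Ann Lee", "company": "Acme"},
    11: {"name": "Bob Roy", "company": "Initech"},
}
credited = [  # (image_id, person_id, credit_type)
    (1, 10, "photographer"),
    (1, 11, "director"),
    (2, 10, "photographer"),
]


def build_linked_docs(images, people, credited):
    """Return minimal person records plus image records that link to them."""
    docs = []
    for person_id, person in people.items():
        # Minimal person record: just enough to search and display.
        docs.append({"type": "person", "id": str(person_id),
                     "name": person["name"]})

    for image_id, image in images.items():
        credits = [(p, c) for (i, p, c) in credited if i == image_id]
        docs.append({
            "type": "image",
            "id": str(image_id),
            "title": image["title"],
            "description": image["description"],
            # Searchable text: creators' names with their credit type.
            "credits": "; ".join(f"{people[p]['name']} ({c})"
                                 for p, c in credits),
            # Link field: kept for lookups, not meant for full-text search.
            "person_ids": [str(p) for p, _c in credits],
        })
    return docs


def people_for_image(docs, image_id):
    """Follow the manual link from one image record to its person records."""
    image = next(d for d in docs
                 if d["type"] == "image" and d["id"] == str(image_id))
    return [d for d in docs
            if d["type"] == "person" and d["id"] in image["person_ids"]]


docs = build_linked_docs(images, people, credited)
names = [p["name"] for p in people_for_image(docs, 1)]
```

Following the link is a second lookup done in application code, which is the manual part of defining the relation.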

This is really a matter of what you want to search for, images or persons, and whether each contains enough keywords for search to function. Try several options, see if they work well enough and don't exceed the available space.

The credit table will probably not be a good candidate for Document construction, though.

Thanks for the answer. If possible, I would like to search both (images and persons) and based on relations between them. I've read that manually defining relations between Documents forces me to do additional merging on the results.
– Andrei Stanescu, Mar 11 '11 at 16:58

But I was just thinking: what if I create 2 indexes, one containing each image and the people that worked on it (concatenated), and another containing each person and the images that he worked on? Would it be too much to always make 2 searches and then somehow choose the best results from each and merge them?
– Andrei Stanescu, Mar 11 '11 at 16:58

@Andrei: making two indexes and doing two searches complicates matters, as you'll have to devise a way to do the merging. It's ok to have two kinds of documents in a single index, though; I'd go that route if you want to have both.
– larsmans, Mar 11 '11 at 17:07