Task 2: Parsing Office Documents

Because DotLucene can index only plain text, we need to parse the Office documents and extract the text from them. Reading their binary structure isn't an easy job. However, on Windows 2000 and later we can use the IFilter interface, which is part of the Windows Indexing Service. The service is installed by default on all Windows 2000+ systems (no Office installation is required).

The IFilter API is also used by Windows Desktop Search (MSN Search Toolbar) and Lookout, so we are not relying on anything obscure to parse the documents. It can also parse other file types if you install the appropriate filter.

Working with the IFilter interface requires a lot of COM interop, which is a bit tricky. After tweaking the samples available on the web, I finally had code that worked correctly in most cases.

Hi All,
Can anyone help me with using different IFilters dynamically?
What I mean is: if a new filter is installed on the system, how can I make use of that filter without changing my code?

Is there any way I can write my code to pick up the file and parse it if a filter is available on the system?
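
One possible approach, offered as a sketch rather than as the article's actual code: look up the file extension in the registry at run time and see whether a filter is registered for it. Any filter installed later is then picked up without code changes. The class and method names below (FilterLookup, FindFilterClsid, ReadDefault) are hypothetical. IsParseable borrows its name from a snippet posted elsewhere in this thread, but its real implementation is not shown there. The lookup is also simplified: it only checks the PersistentHandler key directly under the extension, whereas some file types register their handler through a ProgID or CLSID instead.

```csharp
using System;
using System.IO;
using Microsoft.Win32;

static class FilterLookup
{
    // Interface ID of IFilter, used as a subkey name under PersistentAddinsRegistered.
    const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // Returns the CLSID of the filter registered for the file's extension,
    // or null if none is found. Simplified: only the extension key is checked.
    public static string FindFilterClsid(string fileName)
    {
        string ext = Path.GetExtension(fileName);
        if (string.IsNullOrEmpty(ext))
            return null;

        // HKEY_CLASSES_ROOT\<.ext>\PersistentHandler holds the handler GUID.
        string handler = ReadDefault(ext + @"\PersistentHandler");
        if (string.IsNullOrEmpty(handler))
            return null;

        // HKEY_CLASSES_ROOT\CLSID\<handler>\PersistentAddinsRegistered\<IID_IFilter>
        // holds the CLSID of the filter itself.
        return ReadDefault(@"CLSID\" + handler +
                           @"\PersistentAddinsRegistered\" + IFilterIid);
    }

    // Hypothetical stand-in for the IsParseable check mentioned in this thread.
    public static bool IsParseable(string fileName)
    {
        return !string.IsNullOrEmpty(FindFilterClsid(fileName));
    }

    // Reads the default (unnamed) value of a key under HKEY_CLASSES_ROOT.
    static string ReadDefault(string subKey)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(subKey))
        {
            return key == null ? null : key.GetValue("") as string;
        }
    }
}
```

With something like this in place, the indexer could call FilterLookup.IsParseable(path) for every file it encounters instead of matching against a fixed list of extensions.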

Hi All,
Does this desktop search application support natural search? I tried to search for a word by giving only part of it, and I got no results. It currently works only if I give the full word.

Is it possible to make the search a natural search? For example, if the word "testing" is in my file and I search for "test", it should still bring me the file that contains "testing".

How this works depends on the Analyzer you used when creating the index (and when searching it). Look into the Porter stemmer (built into Lucene) or Snowball (separate, but available from the Lucene.Net author); I preferred the latter. KStemmer is a new one I've seen, but I've not used it yet.
http://www.dotlucene.net/download/
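
To illustrate why the analyzer matters, here is a small self-contained sketch that does not use Lucene at all. It builds a toy term index twice: once with the words as they are, and once with a deliberately crude suffix stripper applied both when indexing and when searching. The names (StemmingDemo, Stem, BuildIndex, Search) and the sample documents are made up for illustration, and the stemmer is only a stand-in for what Porter or Snowball do properly.

```csharp
using System;
using System.Collections.Generic;

static class StemmingDemo
{
    // Hypothetical, crude stemmer: strips a few common English suffixes.
    // A real analyzer would use the Porter or Snowball algorithm instead.
    static string Stem(string word)
    {
        foreach (string suffix in new[] { "ing", "ed", "s" })
        {
            if (word.Length > suffix.Length + 2 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Builds a term -> file names map, normalizing every token on the way in.
    static Dictionary<string, HashSet<string>> BuildIndex(
        Dictionary<string, string> docs, Func<string, string> normalize)
    {
        var index = new Dictionary<string, HashSet<string>>();
        foreach (KeyValuePair<string, string> doc in docs)
        {
            string[] tokens = doc.Value.ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string token in tokens)
            {
                string term = normalize(token);
                HashSet<string> files;
                if (!index.TryGetValue(term, out files))
                {
                    files = new HashSet<string>();
                    index[term] = files;
                }
                files.Add(doc.Key);
            }
        }
        return index;
    }

    // Looks up one query word, normalized the same way as the indexed terms.
    static int Search(Dictionary<string, HashSet<string>> index,
                      string query, Func<string, string> normalize)
    {
        HashSet<string> files;
        return index.TryGetValue(normalize(query.ToLowerInvariant()), out files)
            ? files.Count
            : 0;
    }

    static void Main()
    {
        var docs = new Dictionary<string, string>
        {
            { "a.txt", "unit testing guide" },
            { "b.txt", "release notes" }
        };

        var exact = BuildIndex(docs, w => w);
        var stemmed = BuildIndex(docs, Stem);

        Console.WriteLine(Search(exact, "test", w => w));  // 0: "test" is not the term "testing"
        Console.WriteLine(Search(stemmed, "test", Stem));  // 1: both reduce to "test"
    }
}
```

The point is that the same normalization has to run at index time and at query time. Separately, Lucene's query syntax accepts a trailing wildcard such as test*, which matches by prefix without changing the analyzer.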

It's available from the same place as Lucene.Net:
http://sourceforge.net/projects/dotlucene/

Snowball is an alternative Analyzer. I couldn't explain it in one comment, so it would be best if you read the Lucene documentation or got the book "Lucene in Action" (http://www.lucenebook.com/), which is superb, albeit based on the original Java implementation.

Thank you, friend, for giving more information on this topic.
I will go through the article.

Do you have any idea how I can get a licence for the DotLucene DLL?

Please help me if you have any information about this.

I read that somebody suggested the Apache License 2.0, but when I went to the site I could not find any information on how to get the DLL or how much it costs. Also, I am a .NET programmer and Apache sounds like Java, so how do I get it for .NET?

Just follow the instructions for installing the Visfilt.exe file you downloaded from Microsoft. After you install it, you will automatically be able to extract raw text from Visio files. Then add the Visio file types to the patterns string array:

try
{
    // a method that checks the Registry to see if a file type has a filter associated with it
    if (IsParseable(fileRef))
    {
        ifilt = loadIFilter(fileRef);
        uint i = 0;
        IFilter.STAT_CHUNK ps = new IFilter.STAT_CHUNK();

Hi Inspector/All,
Can you help me use IFilters for files like TIFFs and PDFs?
Can anyone send me the code for implementing the new IFilters?

From the conversation I understood that somebody has written code to get text from the IFilters ("ExtractRawText"). Could you send that code to me too, as I am working on the same task?

Hi MickeyB,
Did you find a solution for deleting from an existing index?
If so, can you help me out? I am also working on the same sort of application, where I need to delete, update and create documents in an index file that has already been generated.

I ran a test with your demo app, building an index on three files. Although all three files include the word I was searching for, your app returned only two matches. Why is that? Isn't DotLucene indexing words reliably? Any explanation is welcome...