What's in an IFilter?

The IFilter interface was designed by Microsoft for use in its Indexing Service. Its main purpose is to extract text from files so the Indexing Service can index them and later search them. Some versions of Windows come with IFilter implementations for Office files, and there are free and commercial filters available for other file types (the Adobe PDF filter is a popular one). The IFilter interface is used mainly for non-text files like Office documents, PDF documents etc., but it is also used for text files like HTML and XML, to extract only the important parts of the file. Although the IFilter interface can be used for general purpose text extraction from documents, it is generally used in search engines. Windows Desktop Search uses filters to index files. For more information on IFilter, see the Links section.

So what else is new?

There are already quite a few articles and pieces of information on how to use the IFilter interface in .NET (see the Links section), so why write another article, you ask? Well, there are some problems with the implementations offered in those articles (details below), which caused me to take a different approach to using and loading filters. I'm currently using this implementation in a new product I'm developing (more details will be revealed here), and since it's working great, I decided to share it with you (yes, You!).

Issues with the current implementations

These are the issues that I and others have found with the current implementations. I'll discuss each in detail below:

Extracting text from very large files.

COM threading issues.

Adobe PDF filter crashing the application when it's closed.

Extracting text from very large files

All of the sample code I found on using IFilters in C# provided a method that extracts the entire text of a document and returns that as a string. Usually, it's something like this:

public static string GetTextFromFile(string path)

Now, this might be OK for some uses, but for a general purpose indexer, I find that it isn't the most scalable way to extract text from documents. Some documents may be very large (30 MB PDFs or Word documents are not uncommon), and extracting the entire text at once can have negative effects on the garbage collector, since such large strings will be stored in the .NET "Large Object Heap" (see the Links section for more information).

COM threading issues

Since filters are essentially COM objects, they carry with them all the COM threading model issues that we all love to hate. See the Links section for some of the reported problems. To make a long story short, some filters are marked as STA (Adobe PDF filter), some as MTA (Microsoft XML filter), and some as Both (Microsoft Office filter). This means MTA filters will not load into C# threads that are marked with [STAThread], and STA filters will not load into [MTAThread] threads. Some people recommend manually changing the registry to mark "problematic" filters as Both, but this isn't something you want to do during the installation of a product, nor can you reliably do it, because you don't know which filters are installed on the customer's machine. We basically need a way to load an IFilter and use it no matter what its threading model or our threading model is.

Adobe PDF filter crashing the application when it's closed

There are quite a few reports about problems with the Adobe PDF filter v6. See this and this for some examples. I researched this issue for some time, and I believe I found what the problem is. It seems Adobe forgot (or not..) to export the DllCanUnloadNow function from their PDFFILT.dll. Since a filter is implemented as a COM object, it should export this function to let COM know when it can unload this library. It seems that this causes problems for C# applications because the .dll is never unloaded while the application is running, and when it finally is unloaded, it's probably a bit late.

In a previous version of my application, I managed to work around this issue by specifically unloading the PDFFILT.dll library. In the current implementation, this workaround is not needed.

How does my implementation solve these issues?

Implement a FilterReader

I decided to go the hard way and implement a TextReader derived class called FilterReader. This solves issue #1 because we don't have to extract the entire text at once. Instead, you can simply use the reader to get a buffer at a time. If you still want to get the entire text as a string, use the ReadToEnd() method.
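To illustrate the buffer-at-a-time idea, here is a minimal sketch of a consumer that works against any TextReader. The ChunkedReading class and its CountWords method are hypothetical names made up for this example, and a StringReader can stand in for a FilterReader when trying it out:

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the article's library.
static class ChunkedReading
{
    // Counts whitespace-separated words while holding only one small
    // buffer in memory, no matter how long the underlying text is.
    public static int CountWords(TextReader reader, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        int words = 0;
        bool inWord = false; // carried across buffers so split words count once
        int read;

        // TextReader.Read returns 0 when the end of the text is reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                bool isSpace = char.IsWhiteSpace(buffer[i]);
                if (!isSpace && !inWord)
                {
                    words++;
                }
                inWord = !isSpace;
            }
        }
        return words;
    }
}
```

Assuming FilterReader can be constructed from a file path, the same method could be handed a FilterReader instead of a StringReader. With a buffer of a few thousand characters, no single allocation should come close to the size that lands on the Large Object Heap.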

Bypassing COM

In order to get an IFilter instance, you should call the LoadIFilter API and pass it a file name. LoadIFilter eventually calls CoCreateInstance() to actually instantiate the filter, and thus abides by COM rules. To avoid the threading issues, I decided to bypass COM and instantiate the filter COM class myself. This has the following implications and assumptions:

I needed to find the correct COM class that implements the filter for a specific file type.

I needed to dynamically load the COM DLL that implements that COM class and call the DllGetClassObject function that is exported from that .dll.

I didn't want to re-implement all of the COM infrastructure, so in order to solve the issue of unloading COM DLLs only when they're not needed, I decided to keep the DLLs loaded during the entire application lifetime and only unload them when the application dies. Note that this essentially solves issue #3 since we manually unload the PDFFILT.dll.

An IFilter should not be used by multiple threads since it is no longer protected by COM.

I assumed that STA filters will behave correctly when called from an MTA thread when COM is not involved. Until now, I didn't encounter any problem with this approach. If you find a filter that behaves badly when used this way, please let me know.

The details

Finding the correct COM class

Since I've decided not to use LoadIFilter, I needed to find a way to locate the correct DLL and class ID of the object implementing the filter for the file whose text we're interested in. This was a simple task, thanks to the excellent RegMon utility from SysInternals. I simply called LoadIFilter and traced which registry keys were read during that operation. I then used the same logic in my own implementation. The details can be found in the FilterLoader class. When a class\DLL pair is found for a certain file extension, this information is cached to avoid traversing the registry again.

While researching how LoadIFilter works, I came across a utility called IFilter Explorer that shows which filters are installed on your computer. From that tool, I also learned that some indexing engines use methods not implemented in LoadIFilter to find filters. One of these methods uses the content type registered for that extension. My version of LoadIFilter also handles loading filters for files that have no filter registered for them but do have a filter registered for their content type.
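As a rough sketch of what such a registry walk can look like, the code below follows one commonly documented chain: the extension's PersistentHandler, then that handler's PersistentAddinsRegistered entry for the IFilter interface ID, then the filter class's InprocServer32. This is an assumption about the general shape of the lookup, not a copy of the FilterLoader class, and it leaves out the content type fallback and the caching. The FilterLookupSketch class and the readDefault delegate are hypothetical names. The delegate is expected to return the default value of a key under HKEY_CLASSES_ROOT, or null if the key does not exist, which keeps the logic independent of the actual registry API.

```csharp
using System;

// Hypothetical sketch, not the article's FilterLoader.
static class FilterLookupSketch
{
    // Interface ID of IFilter, used as a sub key name in the registry.
    public const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // Walks extension -> persistent handler -> filter class -> DLL path.
    // Returns false as soon as one link in the chain is missing.
    public static bool TryFindFilter(
        string extension,
        Func<string, string> readDefault,
        out string dllPath,
        out string filterClsid)
    {
        dllPath = null;
        filterClsid = null;

        // Example key: .pdf\PersistentHandler
        string handler = readDefault(extension + @"\PersistentHandler");
        if (handler == null)
        {
            return false;
        }

        // Example key: CLSID\{handler}\PersistentAddinsRegistered\{IFilter IID}
        filterClsid = readDefault(
            @"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);
        if (filterClsid == null)
        {
            return false;
        }

        // Example key: CLSID\{filter class}\InprocServer32
        dllPath = readDefault(@"CLSID\" + filterClsid + @"\InprocServer32");
        return dllPath != null;
    }
}
```

On Windows, readDefault could be a small wrapper around Registry.ClassesRoot.OpenSubKey(path) followed by GetValue(null). In a unit test it can simply be backed by a dictionary.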

Loading the DLL and instantiating the filter implementation

OK, so we have the name of the DLL and the ID of the class implementing our filter. How do we create an instance of that class? Most of the work is handled by the ComHelper class. The steps needed to accomplish that are:

Load the DLL using the LoadLibrary Win32 API.

Call the GetProcAddress Win32 API to get a pointer to the DllGetClassObject function.

Use Marshal.GetDelegateForFunctionPointer() to convert that function pointer to a delegate. Note: this is only available in .NET 2.0. For an equivalent method in .NET 1.1, see the Links section.
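The three steps above can be sketched roughly as follows. This is not the ComHelper class itself: ComDllLoaderSketch, its method names, and the choice to receive the class object as a raw IntPtr are all assumptions made for this illustration. The code only works on Windows, and the DLL is deliberately never freed, in line with keeping filter DLLs loaded for the lifetime of the application.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

// Hypothetical sketch, not the article's ComHelper.
static class ComDllLoaderSketch
{
    // Well-known interface ID of IClassFactory.
    public static readonly Guid IID_IClassFactory =
        new Guid("00000001-0000-0000-C000-000000000046");

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    // GetProcAddress only exists in an ANSI flavor.
    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true, SetLastError = true)]
    private static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);

    // Managed signature for: HRESULT DllGetClassObject(REFCLSID, REFIID, LPVOID*).
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    public delegate int DllGetClassObjectDelegate(
        ref Guid rclsid, ref Guid riid, out IntPtr ppv);

    // Steps 1 to 3: load the DLL, find the export, wrap it in a delegate.
    public static DllGetClassObjectDelegate GetClassObjectEntryPoint(string dllPath)
    {
        IntPtr module = LoadLibrary(dllPath);
        if (module == IntPtr.Zero)
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        IntPtr proc = GetProcAddress(module, "DllGetClassObject");
        if (proc == IntPtr.Zero)
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        return (DllGetClassObjectDelegate)Marshal.GetDelegateForFunctionPointer(
            proc, typeof(DllGetClassObjectDelegate));
    }

    // Calls the export and returns a raw pointer to the class factory.
    public static IntPtr GetClassFactory(string dllPath, Guid filterClsid)
    {
        DllGetClassObjectDelegate getClassObject = GetClassObjectEntryPoint(dllPath);
        Guid iid = IID_IClassFactory;
        IntPtr factory;
        int hr = getClassObject(ref filterClsid, ref iid, out factory);
        Marshal.ThrowExceptionForHR(hr); // throws for failure HRESULTs
        return factory;
    }
}
```

The pointer returned by GetClassFactory refers to an IClassFactory. Turning it into an actual filter instance is outside the scope of this sketch.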

Hi,
A very useful article. I tried to convert it into an .ASPX application. It works well on .DOC files, but I got a "No Filter Defined" error on .PDF files even though I installed the PDF IFilter v6. Do you know why?

Some of the IFilters we use (which are both home grown and 3rd party) are implemented in .NET.

The different implementations can use different IFilter interface signatures. The main culprit tends to be GetText, where the second parameter is sometimes passed as an IntPtr, sometimes as a char* and sometimes as a char[].

This works fine when calling the IFilter via COM, but when using your code, it tries to apply your IFilter interface, and when it finds that they mismatch, it fails.

The solution we have come up with is using reflection inside an adapter class to connect the different IFilter interfaces, but this isn't really ideal.

Have you come up against this problem yet and/or thought up any solutions?

This issue has come up before here, and back then I couldn't find a solution for it.
I guess that if you're developing your own .NET filters, you could define the IFilter interface in a separate assembly and reference that assembly in both your filter implementation and in my library.
If you're using .NET filters from 3rd party vendors, then I don't think there's much you can do.
Let me know if this helps or if you find another solution for this.

Had a bit more of an investigation today (in between installing my new dev box) and we've thrown out the idea of using reflection and emit completely.

You can also (as you probably know) access the indexing server as a data source with ADO.NET, but obviously that's not suitable in this situation.

That leaves the wonderful world of native code. That gives us two routes: first, just create a plain native wrapper and then P/Invoke it, or go the funkier route and use C++/CLI to integrate the native and managed sides. Since that is exactly what C++/CLI was made for, it looks like I'll be using that to first load up the IFilters via native COM and then bring it all together in the managed section of the app.

Thanks for the good job.
Is there any way of calling IFilters directly without registering them first?
I have some IFilters and want to upload them to the host where my site is, put them in the same folder as my site assembly (ASP.NET 2), and call them from that assembly without the need to register them on the host.
Is it possible?

I tested your solution and it is quite OK with registered IFilters. Can it be customized in a way that could accept the IFilter DLL path manually?

Yes, that's possible, but the DLL path is not enough; you also need to know the GUID of the class that implements the filter inside that .dll. You can get that GUID by stepping through the FilterLoader.LoadIFilter method and checking the value of the variable filterPersistClass after the call to GetFilterDllAndClass().
After you get that information, you can make the following changes to your code in order to load DLLs directly without them being registered:
1. FilterLoader class:
Add a LoadAndInitIFilter() method:

I've been searching for low pass, high pass and band pass filter design tutorials for the last 3 weeks and have failed miserably to find any good source (LP, BP, HP filter source code or classes). My third year mini-project requires data analysis in C# through the design of proper filters as mentioned above. I know that there are a lot of geniuses in this forum. Can anyone help me please? I'm very desperate...
Thanks in advance.

This is a long shot, but the state variable filter provides lowpass, highpass, bandpass, and band reject outputs simultaneously. Take a look at the musicdsp archives for filter algorithms. However, this is for audio applications. I have no idea if it can be applied to "data analysis."

Apologies if this isn't what you're looking for or if it is off-topic to the article.

On a Windows XP SP2 machine with Adobe Reader 8 installed (with and without IFilter 6.0 installed), a 'No filter is defined for somepdffile.pdf' exception is thrown from the FilterReader constructor, even though the PDF extension is listed in IFilter Explorer.

I installed Acrobat Reader 8 and I'm seeing the same problem. The fact is that although the .dll is correctly registered (which my library detects correctly), it can't be loaded because some .dll it needs is missing. I loaded AcroRdIF in Dependency Viewer and it confirmed that the .dll cannot be loaded. Sorry, but there's not much I can do about it.

Thank you for looking into it. I found a resolution, or rather a workaround. You have to make sure that the related Acrobat DLLs are on the path. Adding [drive]\Program Files\Adobe\Reader 8.0\Reader to my path solved the problem.