Full-text search returns incomplete result list

We are connecting to the Alfresco repository (Community Edition 3.2r) using the Web-Service API. We are currently facing the problem that a full-text search doesn't return a complete result list. We have several documents that contain, for example, the word "clogging". The search returns some documents containing this word, but other documents (mostly PDFs) are missing from the result list. The missing documents were not scanned in, so it should not be an OCR problem. The search query we are firing is:

PATH:"/app:company_home/cm:Documents/cm:Member//*" AND (@cm\:content.mimetype:application/* AND (TEXT:clogging))

As I mentioned before, we have this problem with PDF files as well, so it can't result from filtering the content by the MIME type application/*. Also, some PDF documents are found, but some (expected ones) are not …

I am already thinking about repository index configuration issues. Currently the configuration is the default one. Do you have any other ideas about what the cause could be?

Some PDF files are simply images wrapped up in the PDF format, in which case text extraction becomes difficult.

Indexing is a two-stage process: the first stage extracts the text, and the second indexes that plain text. One simple test is to transform the PDF file to plain text in Alfresco and see if you get anything sensible.
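To make the two stages concrete, here is a toy sketch in Python. It is not Alfresco's or Lucene's actual code; `extract_text` and `build_index` are hypothetical names, and documents are modelled as plain dicts. It only illustrates why a document whose extraction stage yields no text can never match a TEXT query, regardless of how the index is configured.

```python
import re
from collections import defaultdict

def extract_text(doc):
    """Stage 1: stand-in for a real text extractor.

    Assumption for this sketch: an image-only PDF is modelled as a
    document with no 'text' entry, so extraction returns an empty string.
    """
    return doc.get("text", "")

def build_index(docs):
    """Stage 2: index the extracted plain text as term -> set of document names."""
    index = defaultdict(set)
    for name, doc in docs.items():
        # Lower-case the text and split it into simple alphanumeric terms.
        for term in re.findall(r"[a-z0-9]+", extract_text(doc).lower()):
            index[term].add(name)
    return index

docs = {
    "report.pdf": {"text": "Filter clogging caused by fuel sulphur content"},
    "scan.pdf": {},  # image-only PDF: extraction returns nothing
}
index = build_index(docs)
print(sorted(index["clogging"]))
```

If a document that should match behaves like `scan.pdf` here, the extraction stage is the place to look before suspecting the index itself.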

Thanks for the hint, but the documents we are expecting to find are not images. If you open the documents in Adobe Reader, you can select the text. I still think this might be a problem with the index, because we have many documents in the repository (several gigabytes).

I think it may be because some of the documents are indexed with a different locale than you expect. Try a wildcard search like clogg* and see if you get what you expect. If so, then most likely Lucene has indexed the docs with different locales, and that can give unexpected results.

Thanks for your reply and idea. Is this possible if we are hosting only documents written in English? The wildcard searches "clog*" and "*clog*" do not work either.

The document I am searching for also contains the words "fuel sulphur content". I tried a full-text search for the exact phrase with no result (except the other documents which contain this string). I also tried "fuelsulphurcontent" in case the spaces are removed, but got no result. And the document isn't a special one. It only contains text in a simple structure which can be selected and searched in the PDF reader program.

But(!), if I specify a path closer to the space where the document is stored (e.g. PATH:"/app:company_home/cm:Documents/cm:Member/cm:Working_Group/cm:Marine//*" AND (@cm\:content.mimetype:application/* AND (TEXT:"clogging"))), the search is successful! But(!), only if I use one of the latest versions of Alfresco Labs! No result with r3.2.
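To compare the broad and the narrow scope side by side, both query strings can be generated from the same parts. The following is a minimal Python sketch; `build_query` is a hypothetical helper and not part of the Alfresco API. It only assembles the string, so the query still has to be submitted through whatever client you use.

```python
def build_query(path, term, mimetype="application/*"):
    """Compose the search string from its parts so the parentheses always balance."""
    path_clause = f'PATH:"{path}//*"'
    mime_clause = f"@cm\\:content.mimetype:{mimetype}"
    text_clause = f'TEXT:"{term}"'
    return f"{path_clause} AND ({mime_clause} AND ({text_clause}))"

base = "/app:company_home/cm:Documents/cm:Member"
broad = build_query(base, "clogging")
narrow = build_query(base + "/cm:Working_Group/cm:Marine", "clogging")
print(broad)
print(narrow)
```

Running both strings against the same repository makes it easy to see whether a document is missing only from the broad scope.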

Additionally, I re-checked with the new Alfresco version, and to correct my last post: the document is also found when not specifying the space where it is located! So the conclusion is that the cause is likely the old Alfresco version 3.2 we were using.

Same problem for me. When I run the full-text search in the node browser in /alfresco, the result set is not empty. If I try to perform the same query through a web script, the result set is empty. Did adding locales work for you? This is my code: