Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
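As a rough illustration of the idea of showing each user only the results they are allowed to see, here is a minimal Python sketch. It is not the Flax implementation (Tom's slides describe the real method). It assumes each indexed document stores the set of users and groups allowed to read it, and that the user's group memberships have already been looked up, for example from LDAP. The names `IndexedDoc`, `principals_for` and `search` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """A document as stored in a toy in-memory index (hypothetical structure)."""
    path: str
    text: str
    # Principals allowed to read the file, captured at crawl time from
    # permissions/ACLs, e.g. "u:alice" for a user or "g:projects" for a group.
    allowed: frozenset


def principals_for(user, groups):
    """Build the set of principals a logged-in user carries.

    Assumption: `groups` has already been fetched from the directory server.
    """
    return frozenset({f"u:{user}"} | {f"g:{g}" for g in groups})


def search(index, query, principals):
    """Return paths of documents that match the query AND are readable.

    Matching is a naive substring test; a real engine would apply the
    permission check as a filter on the query itself.
    """
    q = query.lower()
    return [
        doc.path
        for doc in index
        if q in doc.text.lower() and doc.allowed & principals
    ]


index = [
    IndexedDoc("/docs/thunderclap.txt", "Project Thunderclap plan",
               frozenset({"g:projects"})),
    IndexedDoc("/docs/canteen.txt", "Canteen menu and project notices",
               frozenset({"g:staff"})),
]

alice = principals_for("alice", ["projects", "staff"])
bob = principals_for("bob", ["staff"])

alice_hits = search(index, "project", alice)
bob_hits = search(index, "project", bob)
```

In this sketch Alice sees both documents while Bob sees only the canteen notice, because the permission test is applied to every candidate result before anything is returned.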

You can read more about the project in a case study (PDF), and Tom’s presentation slides (PDF) explain the method we used to index the security information.

5 thoughts on “Open source intranet search over millions of documents with full security”

How did you handle user access to things like spelling suggestions? You can’t suggest the term “Project Thunderclap” if the user can’t see any files on that topic. “Sorry, nothing matches Project Thunderclap”.

We don’t currently do anything to restrict the spelling suggestions. In theory it is possible that a user could be shown a suggestion from a document they don’t have access to. For this client we’re pretty sure this isn’t a problem.

I feel that single word spelling corrections are not nearly as risky as phrase completion, which this system doesn’t do. Yes, you could find whether “GlobalMegaCorp” was in the index, but you wouldn’t know anything about the context (apart from whether you have access to docs containing it). For this installation, this is an appropriate level of security.
