Mashup of Projects

Elasticsearch and FSCrawler with Owncloud

If you have ever wanted a search engine for your own files, you can now have one, thanks to a few open source projects. This project uses Elasticsearch and FSCrawler to provide search and file indexing; FSCrawler uses Tesseract for text recognition. For the front end, I initially thought of writing my own PHP/HTML, but handling file permissions (to get any application-layer security) would have become very complicated. If your files are only used internally and shared among everyone on the LAN, a simple search interface would work great. If you have multiple users with personal files, ownCloud is the way to go. ownCloud is a ready-made platform with application security already implemented, as well as a search function (and a lot of other great features).

I modified ownCloud to use the Elasticsearch index for searches. This adds text recognition from images and massively speeds up searches (they take less time than a search on your favorite commercial search engine). What I have done is very simple and could use improvement, and I wouldn’t recommend it in a large environment, because something will probably not work and people will be upset with you. For professional environments, you can purchase the enterprise version of ownCloud and get its improved search functionality.
To do this, just follow the steps below, assuming you have Ubuntu:
1. Install ownCloud using docker-compose
Install Docker

Now modify the ownCloud search function to use Elasticsearch. If you ran docker-compose up in the section on installing ownCloud, it will already be running. To modify it, we need bash/shell access to the container. To find the container, type:

docker ps

If ownCloud is the only thing you have running on Docker, there should be 3 containers. We will be changing the ownCloud container. To do so, copy the container ID of the container using the “owncloud/server:” image, and type the following with your own container ID:

docker exec -it dfbf7fa1b53d /bin/bash
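
If you would rather not copy the ID by hand, the lookup can be scripted. The following is a minimal sketch in Python, assuming the docker CLI is on the PATH and that the image name starts with owncloud/server as described above. The helper names find_container_id and list_containers are my own for illustration and are not part of Docker or ownCloud.

```python
import subprocess

# Assumption: the ownCloud container uses an image named "owncloud/server:<tag>".
IMAGE_PREFIX = "owncloud/server"


def find_container_id(ps_output, image_prefix=IMAGE_PREFIX):
    """Return the ID of the first container whose image starts with image_prefix.

    ps_output is expected to hold one "<id> <image>" pair per line.
    Returns None when no line matches.
    """
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip().startswith(image_prefix):
            return parts[0]
    return None


def list_containers():
    """Run `docker ps` and return "<id> <image>" lines for running containers."""
    completed = subprocess.run(
        ["docker", "ps", "--format", "{{.ID}} {{.Image}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Calling find_container_id(list_containers()) returns the ID to pass to docker exec -it, or None if no matching container is running.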

That should put you on the container as root, at /var/www/owncloud. We will modify the code for the search function so that it queries Elasticsearch rather than the original ownCloud mechanism, and then maps the results into the format that ownCloud expects. The way I have done it is quite simple, so it removes some of the ownCloud search result functionality. I only return one type of result: files.
A word about security. The whole point of using ownCloud for this is that it controls access to the files. However, Elasticsearch will now hold a copy of all the file content. Access to it can be restricted with user accounts, and/or by having Elasticsearch listen only on the loopback interface and the Docker interface used by ownCloud. If you used the docker-compose script above, the ownCloud address should be 172.18.0.4, and the interface on the host machine should be 172.18.0.1, so you can tell Elasticsearch to listen on that address. Each user can then have their own index and FSCrawler job, and ownCloud can query a different index depending on who is logged in. In summary, this code gets search results from Elasticsearch, using the username as the index name, and maps them to the format ownCloud uses.
Edit /var/www/owncloud/lib/private/Search/Provider/File.php
Make it look like this:

<?php
/**
 * @author Andrew Brown <andrew@casabrown.com>
 * @author Bart Visscher <bartv@thisnet.nl>
 * @author Jakob Sack <mail@jakobsack.de>
 * @author Jörn Friedrich Dreyer <jfd@butonic.de>
 * @author Morris Jobke <hey@morrisjobke.de>
 * @author Thomas Müller <thomas.mueller@tmit.eu>
 *
 * @copyright Copyright (c) 2018, ownCloud GmbH
 * @license AGPL-3.0
 *
 * This code is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License, version 3,
 * as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License, version 3,
 * along with this program. If not, see <http://www.gnu.org/licenses/>
 *
 */
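
As a rough illustration of the flow described above (pick the index from the username, query Elasticsearch, map the hits to file results), here is a minimal sketch. It is written in Python rather than PHP and is not the code that goes into File.php. It assumes Elasticsearch is reachable on 172.18.0.1 at its default port 9200, and that the index was built by FSCrawler, which stores extracted text in content, the file name in file.filename, and the path relative to the crawled folder in path.virtual. The function names and the shape of the result dictionaries are my own for illustration; ownCloud expects its own result objects instead.

```python
import json
import urllib.request

# Assumption: host-side Docker interface from the text, default Elasticsearch port.
ES_URL = "http://172.18.0.1:9200"


def index_for_user(username):
    """Use the username as the index name (index names must be lowercase)."""
    return username.lower()


def build_query(term, size=20):
    """Build a search body matching the term against file text and file name."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["content", "file.filename"],
            }
        },
    }


def map_hits(response):
    """Reduce an Elasticsearch response to a flat list of file results."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        results.append({
            "type": "file",  # only one result type is returned: files
            "name": source.get("file", {}).get("filename", ""),
            "path": source.get("path", {}).get("virtual", ""),
            "score": hit.get("_score"),
        })
    return results


def search(username, term, es_url=ES_URL):
    """Query the user's own index and return the mapped file results."""
    url = f"{es_url}/{index_for_user(username)}/_search"
    request = urllib.request.Request(
        url,
        data=json.dumps(build_query(term)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as reply:
        return map_hits(json.load(reply))
```

Because each user only ever queries the index named after their own login, one user’s search never touches another user’s documents. Usernames containing characters that are not valid in index names would need extra handling that this sketch leaves out.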

There you have it. It’s quite complicated to set up, but once it’s running, it works great! The few improvements I would like to make are to run FSCrawler as a service and to automate adding an FSCrawler instance for each user. It would also be nice to improve the search results display, to regain the original ownCloud functionality that displayed different items for media.