Introduction

SMILA is primarily intended as a framework into which you can plug your own high-performance, highly scalable components (e.g. for storage). Nevertheless, it is also possible to set up SMILA out of the box on a cluster by using its default implementations. This permits horizontal scaling: importing and processing jobs/tasks are shared across the cluster nodes. (Remark: There is also vertical scaling on each cluster machine, but this is not new, because a single-node SMILA provides it as well.)

The following steps describe how to set up SMILA on multiple cluster nodes.

Install external Solr server

If you want to use Solr for indexing, you need to set up a separate Solr server, because the Solr instances embedded in SMILA cannot be shared with the other SMILA instances.

Distributed server

For larger data volumes you will need to set up Solr in a distributed way, too. However, using a distributed Solr setup is not yet fully supported by the SMILA integration (especially during indexing).

Configuring SMILA on cluster nodes

On each cluster node, you have to make the following SMILA configuration changes.

When running on Linux, you can use either an NFS or an SMB/CIFS directory (mounted via Samba) for the objectstore. First tests seem to indicate that using an SMB/CIFS directory is much faster, especially if lots of small files are written (as is the case during crawling processes by the Delta or Visited Links service). Also, we had stability issues with an NFS mount, where a lot of "stale NFS file handle" errors occurred.

Of course, the results may largely depend on your environment and could be completely different in your network.
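Because the results depend on the environment, it may be worth measuring small-file write performance on your own mounts before choosing one. The following is a minimal sketch, not part of SMILA: the function name, file count and file size are assumptions chosen for illustration, and it only approximates the "lots of small files" write pattern mentioned above.

```python
import os
import tempfile
import time


def time_small_file_writes(directory, count=1000, size=512):
    """Write `count` files of `size` bytes into `directory`; return elapsed seconds.

    Hypothetical helper for comparing mounts; not a SMILA API.
    """
    payload = b"x" * size
    # Work in a scratch subdirectory so existing objectstore data is never touched.
    # The scratch directory is removed again when the block is left.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        start = time.perf_counter()
        for i in range(count):
            path = os.path.join(scratch, f"probe-{i}.bin")
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the write through to the mount
        return time.perf_counter() - start
```

Calling the function once with the NFS mount point and once with the SMB/CIFS mount point (paths of your own choosing) gives two timings that can be compared directly. Repeating the runs a few times helps to smooth out caching and network effects.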

Solr configuration

You have to point SMILA to the Solr server that you have set up above.

Configuration file:

configuration/org.eclipse.smila.solr/solr.properties

solr.embedded=false
...
solr.serverUrl=http://<SOLR-HOST>:8983/solr
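Since this change has to be repeated on every cluster node, a small sanity check can catch a node that was missed. The sketch below is an assumption-laden illustration, not a SMILA tool: the function names are hypothetical, and the parser handles only plain key=value lines rather than the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks, comments and other lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_external_solr(props):
    """Return a list of problems; an empty list means the two settings look right."""
    problems = []
    if props.get("solr.embedded", "").lower() != "false":
        problems.append("solr.embedded must be false")
    url = props.get("solr.serverUrl", "")
    if not url.startswith(("http://", "https://")):
        problems.append("solr.serverUrl must be an http(s) URL")
    elif "<" in url or ">" in url:
        # e.g. the <SOLR-HOST> placeholder was never replaced
        problems.append("solr.serverUrl still contains a placeholder")
    return problems
```

To use it, read configuration/org.eclipse.smila.solr/solr.properties on each node, pass the text to parse_properties, and report any node for which check_external_solr returns a non-empty list.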

Jetty configuration

To monitor the cluster node, you have to make the SMILA HTTP server accessible from outside the machine.
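Once the configuration is changed, you can verify from another machine that the HTTP port actually accepts connections. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the default port 8080 is assumed from the URLs used elsewhere on this page; adjust it if your Jetty setup differs.

```python
import socket


def is_reachable(host, port=8080, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A successful TCP connect only shows that something is listening on the port. It does not prove that the SMILA REST API itself responds, so a follow-up request in a browser is still advisable.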

Running jobs

There, you should see a list of cluster nodes and, for each of them, output like the following. (The given sample output means that 6 tasks are currently being processed on the given cluster node.)

stat: ...
data: "6"

You can also count the inprogress tasks under http://CLUSTER-NODE:8080/smila/tasks, which gives the number of tasks currently being processed in the whole cluster. Compare this number with the maxScaleUp setting for a worker in clusterconfig.json, which is the maximum number of tasks allowed to be processed on one node (see also Taskmanager REST API).
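The comparison described above can be sketched as follows. This is an illustration only: the function names and the input dictionary are hypothetical, the per-node values are assumed to have been copied by hand from the "data" field shown in the sample output, and the counts are assumed to refer to the same worker whose maxScaleUp is used. No SMILA REST response format is parsed here.

```python
def node_load_report(in_progress_by_node, max_scale_up):
    """Summarise per-node task counts against a worker's maxScaleUp limit."""
    report = {}
    for node, raw_count in in_progress_by_node.items():
        count = int(raw_count)  # the counter is shown as a string, e.g. "6"
        report[node] = {
            "in_progress": count,
            "free_slots": max(max_scale_up - count, 0),
            "at_limit": count >= max_scale_up,
        }
    return report


def cluster_total(in_progress_by_node):
    """Sum the per-node counts to get the cluster-wide number of tasks in progress."""
    return sum(int(value) for value in in_progress_by_node.values())
```

If every node is reported as being at its limit, the cluster is fully loaded for that worker. If the total stays well below the number of nodes times maxScaleUp while tasks are waiting, the work is probably not being distributed as expected.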