Preconditions

Supported Platforms

JRE

You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5. You may either:

add the path of your local JRE executable to the PATH environment variable or

add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini. Make sure that -vm is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:

-vm
d:/java/jre6/bin/java
...
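The second option can also be scripted. The following is a minimal sketch, not part of SMILA: the helper name ensure_vm_argument is hypothetical, and it assumes SMILA.ini is a plain text file with one argument per line, as in the excerpt above. It removes any existing -vm entry and writes -vm plus the JRE path as the first two lines.

```python
from pathlib import Path


def ensure_vm_argument(ini_path, jre_executable):
    """Make '-vm' and the JRE path the first two lines of the ini file.

    Hypothetical helper; assumes one argument per line in the file.
    """
    path = Path(ini_path)
    lines = path.read_text().splitlines()

    # Drop an existing -vm entry: the flag itself and the path line after it.
    cleaned = []
    skip_next = False
    for line in lines:
        if skip_next:
            skip_next = False
            continue
        if line.strip() == "-vm":
            skip_next = True
            continue
        cleaned.append(line)

    # -vm has to be the first argument, with a line break before the path.
    path.write_text("\n".join(["-vm", jre_executable] + cleaned) + "\n")
```

A call such as ensure_vm_argument("SMILA.ini", "d:/java/jre6/bin/java") would then produce a file that starts like the excerpt above.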

Linux

When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh
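If you prefer to check the permissions from a script before starting SMILA, the idea can be sketched as follows. This is only an illustration using the Python standard library; the function names are hypothetical, and make_executable behaves roughly like chmod a+x.

```python
import os
import stat


def missing_exec_permission(paths):
    """Return the paths that the current user is not allowed to execute."""
    return [p for p in paths if not os.access(p, os.X_OK)]


def make_executable(path):
    """Add the execute bit for user, group and others (roughly chmod a+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Run from the SMILA directory, missing_exec_permission(["./SMILA", "./jmxclient/run.sh"]) returns an empty list when both files are already executable.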

MacOS

On Mac OS X, switch to SMILA.app/Contents/MacOS/ and make the SMILA file executable by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files, and run the SMILA executable. Wait until the engine has fully started. SMILA is ready when the following line is printed in the console window: HTTP server started successfully on port 8080. You can then access SMILA's REST API at http://localhost:8080/smila/.

On Mac OS X, navigate to SMILA.app/Contents/MacOS/ in a terminal, then start SMILA with ./SMILA
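Instead of watching the console, a script can wait until the REST API answers. The sketch below polls the URL mentioned above until it responds. It assumes that an HTTP 200 answer means SMILA is ready, which this page does not state. The function names and the probe parameter are hypothetical, the latter being there only so that the check can be replaced in tests.

```python
import time
import urllib.error
import urllib.request


def wait_for_smila(url="http://localhost:8080/smila/", timeout=120.0,
                   interval=2.0, probe=None):
    """Return True once the probe succeeds for url, False after timeout seconds."""
    if probe is None:
        probe = _http_ok
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def _http_ok(url):
    """Assumption: an HTTP 200 answer means the engine has fully started."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False
```

Calling wait_for_smila() with its defaults checks every two seconds and gives up after two minutes.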

Further information: The "indexUpdate" workflow uses the PipelineProcessorWorker, which executes the synchronous "AddPipeline" BPEL workflow. The synchronous "AddPipeline" BPEL workflow is therefore embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and the "indexUpdate" job definitions, see the configuration files (SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json). For more information about job management in general, please check the JobManager documentation.

Start the crawler

Now that the indexing job is running, we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages, which we are going to start right now. We need to start this job in the so-called runOnce mode, a special mode in which tasks are generated by the system rather than by an input trigger and in which the job run finishes automatically. For more information on why this is the case, see Importing Concept. For more information on jobs and tasks, visit the JobManager manual.

To start the job run, POST the following JSON fragment with your REST client to SMILA:

This starts the job crawlSmilaWiki, which crawls the SMILA Eclipsepedia starting with http://wiki.eclipse.org/SMILA and following only links that have the same prefix. All pages crawled matching this prefix will be pushed to the import job.

If you like, you can monitor both job runs with your REST client at the following URIs:

Crawling the wiki pages will take some time. Once all pages are processed, the status of the crawlSmilaWiki job run changes to SUCCEEDED. In the meantime, you can have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.
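Waiting for the SUCCEEDED status can be automated with a small polling loop. The sketch below is an illustration only: the monitoring URI has to be supplied by you, and the JSON field name "state" in fetch_state is an assumption about the job run description, not something this page confirms. Both function names are hypothetical.

```python
import json
import time
import urllib.request


def fetch_state(job_run_url, field="state"):
    """GET the job run description and return one field of it.

    Assumption: the response is a JSON object and the status is stored
    under the given field name; adjust it to what your REST client shows.
    """
    with urllib.request.urlopen(job_run_url, timeout=10) as response:
        return json.load(response).get(field)


def wait_until_succeeded(get_state, interval=5.0, max_polls=120):
    """Poll get_state() until it returns SUCCEEDED; False if it never does."""
    for _ in range(max_polls):
        if get_state() == "SUCCEEDED":
            return True
        time.sleep(interval)
    return False
```

For example, wait_until_succeeded(lambda: fetch_state(job_run_url)) polls the job run URI stored in job_run_url every five seconds for up to ten minutes.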

Search the index

Since SMILA uses Solr's autocommit feature (configured in solrconfig.xml to commit after 60 seconds or 1000 documents, whichever comes first), it might take some time until newly indexed documents show up in search results.
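The "whichever comes first" rule can be written down as a small predicate. This is a simplified model of the autocommit behaviour described above, not Solr code. The function and parameter names are made up for the illustration, and the defaults are the two values quoted from solrconfig.xml.

```python
def commit_due(pending_docs, seconds_since_first_pending,
               max_docs=1000, max_seconds=60):
    """Return True when a simplified autocommit rule would trigger a commit.

    A commit is due as soon as either limit is reached, provided there is
    at least one uncommitted document.
    """
    if pending_docs <= 0:
        return False
    return pending_docs >= max_docs or seconds_since_first_pending >= max_seconds
```

Under this model, a handful of documents added during a small crawl becomes searchable after at most about 60 seconds, while a large crawl triggers a commit every 1000 documents.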

To search the index created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar. The Default stylesheet shows a reduced search form with text fields such as Query, Result Size, and Index, which is adequate for querying the full-text content of the indexed documents. The Advanced stylesheet provides a more detailed search form with text fields for metadata search, for example Path, MimeType, Filename, and other document attributes.

For files other than plain text and HTML, you cannot search inside the document's text (at least not right now, but you might have a look at the Aperture Pipelet, which addresses this problem).

Start your jobs

If you have already stopped the indexUpdateJob, start it again (see Start indexing job run). If it is still running, that's fine.

Start your crawlFilesAtData job in the same way as described in Start the crawler, but use the job name crawlFilesAtData instead of crawlSmilaWiki. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data is actually in your rootFolder.

Search for your new data

After the job run has finished, wait a bit, then check whether the data has been indexed (see Search the index for help).

It is also a good idea to check the log file for errors.

5 more minutes to change the workflow

The 5 more minutes to change the workflow tutorial shows how you can configure the system so that data from different data sources goes through different workflows and pipelines and is indexed into different indices.