Machine Learning Models in Fusion

Fusion provides the following tools required for the model training process:

Solr can easily store all your training data.

Spark jobs perform the iterative machine learning training tasks.

Fusion’s blob store makes the final model available for processing new data.

Training Models

Note

The approach for training models explained in this section still works in Fusion 4.0. An alternative approach introduced in Fusion 3.1 lets you create model-training jobs in the Fusion UI. See Machine Learning in Lucidworks Fusion for more information.

An example Scala script to train an SVM-based sentiment classifier for tweets is provided in the spark-solr repository.
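To make the training step concrete, here is a minimal, pure-Python sketch of the kind of iterative work that Spark job performs: training a linear SVM with hinge loss by subgradient descent on a toy set of tweet feature vectors. This is an illustration of the technique only, not the spark-solr Scala script itself; the feature encoding, data, and hyperparameters are all illustrative assumptions.

```python
# Illustrative linear-SVM training loop. In Fusion, this kind of iteration
# runs as a Spark job over training data stored in Solr; here it is shown
# in miniature, in pure Python, on hand-made toy data.

def train_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Learn weights w for a linear SVM via subgradient descent.

    labels must be +1 (positive sentiment) or -1 (negative sentiment).
    Objective per sample: lam * ||w||^2 / 2 + max(0, 1 - y * w.x)
    """
    dim = len(samples[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(dim):
                # Subgradient: regularizer term always; hinge term only
                # when the sample violates the margin.
                grad = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w

def predict(w, x):
    """Classify a feature vector: +1 (positive) or -1 (negative)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy "tweet" features: [positive-word count, negative-word count].
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y = [1, 1, -1, -1]
w = train_svm(X, y)
```

A strongly positive input such as `[4.0, 0.0]` then classifies as `+1`. The real Spark job applies the same idea at scale, with text vectorization and distributed iteration handled by Spark.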

The following diagram depicts this process:

Model Prediction

Fusion’s blob store requires that all stored objects have a unique ID.
Once the model is stored in the Fusion blob store, it is available to Fusion’s index and query
Machine Learning pipeline stages, which use the model to make predictions for new data in
pipeline documents and queries.
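The upload itself is an HTTP PUT of the model bytes to a blob-store URL that ends in the blob’s unique ID. The sketch below builds such a request with Python’s standard library; the host, port, endpoint path, blob ID, and payload are assumptions to verify against the REST API reference for your Fusion version, and the request is deliberately constructed without being sent, since no running Fusion instance is assumed.

```python
# Sketch of publishing a trained model to the Fusion blob store.
# Host, port, path, and blob ID are illustrative assumptions; check them
# against your Fusion version's blob store API documentation.
import urllib.request

blob_id = "tweet-sentiment-svm"            # hypothetical unique blob ID
url = f"http://localhost:8764/api/apollo/blobs/{blob_id}"
model_bytes = b"...serialized model bytes..."  # placeholder payload

request = urllib.request.Request(
    url,
    data=model_bytes,
    method="PUT",
    headers={"Content-Type": "application/octet-stream"},
)
# To actually upload (with authentication configured):
#     urllib.request.urlopen(request)
```

Because the blob ID is the object’s unique key, re-sending a PUT to the same URL replaces the stored model rather than creating a duplicate.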
The following diagram shows how this process works:

Model Checking

To test the quality of your model in Fusion,
first create either a document index pipeline
or a query pipeline that contains a Machine Learning stage that uses your
model to make predictions on your data,
and then send a document or query through that pipeline
containing data for which you know what the predicted value should be.
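Conceptually, the check amounts to comparing the pipeline’s predicted value for that document against the known answer, within some tolerance. A minimal sketch, assuming the response document carries the prediction in a "sentiment_d" field (as in the example below) and using an arbitrarily chosen tolerance:

```python
# Known-answer check for a model's prediction. The field name matches the
# sentiment example in this section; the tolerance is an arbitrary choice.
import math

def check_prediction(response_doc, field="sentiment_d", expected=1.0, tol=0.1):
    """Return True if the predicted value is close to the known answer."""
    predicted = response_doc[field]
    return math.isclose(predicted, expected, abs_tol=tol)

# Hypothetical pipeline response for a highly positive tweet:
doc = {"id": "tweet-1", "sentiment_d": 0.97}
```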
For example, given a trained sentiment classifier and an index stage configured to use it,
the following document should be classified as a highly positive tweet, with a value of (close to) 1.0 in the "sentiment_d" field: