In my last post, Solr Document Processing with Apache Camel - Part 1, I made the case for using Apache Camel as a document processing platform. In this article, our objective is to create a simple Apache Camel standalone application for ingesting products into Solr. While this example may seem a bit contrived, it is intended to provide a foundation for future articles in the series.

Our roadmap for today is as follows:

Set up Solr

Create a Camel Application

Index sample products into Solr via Camel

Downloading, Installing and Running Solr

In this section we will perform a vanilla SolrCloud deployment and start out with the schemaless gettingstarted collection.

To begin, open a terminal and move into a directory where we can install Apache Solr.
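If you don't already have Solr installed, something along these lines will get a local SolrCloud cluster running. (The version number below is only an example; use whichever release you prefer.)

```shell
# Download and extract Solr (version is an assumption -- substitute your own).
curl -O https://archive.apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz
tar xzf solr-6.6.0.tgz
cd solr-6.6.0

# Start the SolrCloud "getting started" example. Accepting the interactive
# defaults creates a two-node cluster with a schemaless gettingstarted
# collection, which is all we need for this post.
bin/solr start -e cloud
```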

Adding Apache Camel Dependencies

As I mentioned in the last post, one of the key drivers for selecting Camel over other integration frameworks and libraries is the relatively light set of dependencies. For our application we only need three external dependencies:
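Assuming a Maven build, the dependency section looks something like this. The artifact IDs are the standard Camel ones; pin `camel.version` to whichever Camel release you are using:

```xml
<!-- Core Camel runtime plus the Gson and Solr components. -->
<dependencies>
    <dependency>
        <groupId>org.apache.camel</groupId>
        <artifactId>camel-core</artifactId>
        <version>${camel.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.camel</groupId>
        <artifactId>camel-gson</artifactId>
        <version>${camel.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.camel</groupId>
        <artifactId>camel-solr</artifactId>
        <version>${camel.version}</version>
    </dependency>
</dependencies>
```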

Our Product Data Source

Now that we have the basic structure for our application in place, let's talk about our data source and data model. To make things simple, we'll assume our products are available in one or more JSON files stored on the file system.

We’ll keep the product JSON simple for now and only concern ourselves with a few fields.
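A products file might look like the following. (The exact field names and values here are illustrative, based on the Best Buy data set's product ID, name and SKU fields.)

```json
{
  "products": [
    { "productId": "1234567", "name": "Example Product", "sku": "1234567" },
    { "productId": "7654321", "name": "Another Product", "sku": "7654321" }
  ]
}
```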

Note: I extracted the full Best Buy product data set, grabbed the first 9 and then filtered out all fields except: product ID, name and SKU. (For those interested, I use jq for performing JSON transformations such as this.) In a later article, we’ll work with the entire data set and full data model from Best Buy.

Thinking in Terms of EIPs

Now that we know what our source data looks like, let’s think about how we can use Camel to solve our product ingestion problem.

Read one or more product JSON files from the file system.

Unmarshal the JSON with Gson (preferably into a POJO that uses SolrJ's @Field annotations).

Submit each product POJO to the Solr indexer.

We can even translate these steps into Enterprise Integration Patterns (EIPs).

The diagram above mirrors our initial ingestion steps.

Consume JSON files from the filesystem using a polling consumer. Essentially, monitor a directory for the addition of files. In our case, we can think of this as a hot folder where files can be processed as they are written to the data directory (i.e., data/solr). Each file that is added to this directory will produce one Camel message.

Split each JSON product object into a separate Camel message.

Submit each product to the Solr endpoint for indexing.

In subsequent articles we will talk about how we can really flex our investment in Camel by performing various content enrichment, transformation and data cleansing steps after our splitter and prior to indexing. For now, we simply index the products without modification.

Implementing our Camel Application

Let’s start by creating our POJOs. Essentially, we need our POJOs to do two things: allow us to unmarshal the JSON into a graph of Java objects, and annotate our products with SolrJ's @Field annotation so that we can index them as JavaBeans.

If we look at our JSON we have an array of product objects specified by products. Naturally, we will need a POJO called com.gastongonzalez.blog.camel.Products to represent our array of Product objects.
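Here is a sketch of the two POJOs. To keep the sketch compilable on its own, the SolrJ @Field annotations are shown as comments; in the real application you would uncomment them (and add `import org.apache.solr.client.solrj.beans.Field`) so that the Solr component can index each product as a bean. The String field types are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Top-level wrapper: Gson maps the "products" JSON array onto this list
// because the field name matches the JSON key.
public class Products {
    private List<Product> products = new ArrayList<Product>();

    public List<Product> getProducts() { return products; }
    public void setProducts(List<Product> products) { this.products = products; }
}

// One product. Uncomment the @Field annotations (SolrJ) in the real app.
class Product {
    // @Field("id")
    private String productId;

    // @Field("name")
    private String name;

    // @Field("sku")
    private String sku;

    public String getProductId() { return productId; }
    public void setProductId(String productId) { this.productId = productId; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getSku() { return sku; }
    public void setSku(String sku) { this.sku = sku; }
}
```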

Not too bad. It looks pretty expressive in just about 30 lines, but let's walk through it anyway.
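For reference, here is a reconstruction sketch of what the main class might look like. The class name is an assumption, and it requires camel-core, camel-gson, camel-solr and solr-solrj on the classpath:

```java
package com.gastongonzalez.blog.camel;

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.gson.GsonDataFormat;
import org.apache.camel.component.solr.SolrConstants;
import org.apache.camel.impl.DefaultCamelContext;

public class CamelSolrIngest {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();

        // Unmarshal incoming JSON files into our Products POJO.
        final GsonDataFormat gson = new GsonDataFormat(Products.class);

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("file:data/solr?noop=true")
                    .unmarshal(gson)
                    // Replace the body with the list of Product objects.
                    .setBody(simple("${body.products}"))
                    // One Camel message per product.
                    .split(body())
                    // Tell the Solr component to index the body as a bean.
                    .setHeader(SolrConstants.OPERATION,
                               constant(SolrConstants.OPERATION_ADD_BEAN))
                    .to("solrCloud://localhost:8983/solr"
                        + "?zkHost=localhost:9983&collection=gettingstarted");
            }
        });

        context.start();
        Thread.sleep(10000); // crude wait for in-flight messages
        context.stop();
    }
}
```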

The CamelContext is essentially the Camel runtime and takes care of all the plumbing.

GsonDataFormat defines the class that is used for unmarshalling the JSON.

The routing engine and routes are a core Camel concept. They are responsible for routing messages. All routes are registered with the Camel Context by calling addRoutes().

RouteBuilder is used to create one or more routes. In our simple use case we only have one route that begins with from() and ends with to().

from() - Here we make use of the out of the box Camel File component. Among other things, components are responsible for creating endpoints. All Camel endpoints are expressed using a URI. We define our file endpoint with two pieces of information: which directory to monitor (data/solr) and specify that we do not want to do anything to files in the directory after Camel has processed them (noop=true). There are many URI configuration options. For example, if we wanted to delete any files after they have been processed by the file endpoint, we could replace noop=true with delete=true.

unmarshal() - Unmarshal converts our JSON file to an instance of our Products POJO. At this point, the Camel message body is no longer a File but a Java object.

setBody() - Set body allows us to change the contents of the message body. We use Camel's simple expression language to obtain the Product ArrayList and set the ArrayList as the message body.

split() - We use the split EIP to take our current message body (a list of Product POJOs) and split each item in the list into a separate Camel message.

setHeader() - In addition to body, a message also contains headers which can be inspected by processors, endpoints, etc. At a minimum, the Solr endpoint requires that we specify the type of message operation to perform. In our case, we want to index the POJO that is in the message body. Refer to the Solr component documentation for other operations.

to() - Similar to our file endpoint, we define a Solr endpoint using a URI. Here we specify that we want to use SolrCloud (solrCloud). Other supported options include solr, if you need to index a standalone Solr instance, and solrs, if the standalone Solr instance is available via HTTPS. We also define the Solr host and port. In the case of SolrCloud, this value is not considered since we are using ZooKeeper. ZooKeeper is required for SolrCloud and is therefore specified as a URI option using zkHost. Lastly, we specify gettingstarted as the collection to index.

context.start() - Starts the Camel runtime.

Thread.sleep() - Since this is only a sample application, we sleep for 10 seconds, which is more time than is needed to ensure that our messages are processed. In the next post, we will revisit the application's life cycle. In fact, Camel has a pluggable shutdown strategy, so even if we shut down Camel too soon (while messages are still in-flight), they will still be processed.

context.stop() - Shuts down Camel gracefully.

Indexing Solr

Now that we have Solr running and have our Camel application completed, let’s compile and run it.
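Assuming a Maven project with the exec plugin available, something along these lines will do; the main class name here is whatever you chose for your application:

```shell
# Compile and run the Camel application (main class name is an assumption).
mvn compile exec:java -Dexec.mainClass=com.gastongonzalez.blog.camel.CamelSolrIngest
```

Once the route is running, copy your product JSON file(s) into data/solr. After a few seconds you should be able to query the products from the gettingstarted collection in the Solr admin UI.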