Apache Solr – Indexing using data import handler

In the last post, we saw how to set up Solr. In this post, we will see how to set up the data import handler to import data from a database. Here, I am using a sample product catalog database for demonstration. The table structure is given below.
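The original table definitions are not reproduced here, so the sketch below assumes a minimal layout for the products and retailers tables used later in this post; the column names are illustrative.

```sql
-- Hypothetical product catalog schema (column names are assumptions)
CREATE TABLE retailers (
  id   INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE products (
  id          INT PRIMARY KEY,
  name        VARCHAR(100),
  price       DECIMAL(10, 2),
  retailer_id INT,
  FOREIGN KEY (retailer_id) REFERENCES retailers(id)
);
```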

Step 1 :

Now, we will prepare a document structure for Solr by creating a db-data-config.xml file.

The root element of db-data-config.xml is <dataConfig>. The dataSource element holds all the database-related details. In this example, I am using MySQL as the database, so I configured the MySQL driver class name, connection URL, user name, and password. Copy the MySQL driver jar file into the lib directory under solr_home.

From Solr's point of view, a document is nothing but a row in a database table. Here, we are going to collect data from multiple tables; in this example, the products and retailers tables. The entities are wrapped inside a <document> element, and every SQL query becomes an <entity> under it. The entity element has name and query attributes. The name attribute can take any value; in this example, I have given product as the value.

The query attribute holds the SQL query that selects the necessary columns. The column information is given in <field> elements: for each column there is an associated field with column and name attributes. The column attribute must match the column name in the SQL query, while the name attribute can be any suitable name.
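Putting the pieces above together, a db-data-config.xml along these lines would cover this example; the connection details and column names are placeholders to adapt to your own database.

```xml
<dataConfig>
  <!-- Database connection details (placeholder URL, user and password) -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/catalog"
              user="db_user"
              password="db_password"/>
  <document>
    <!-- Each entity maps one SQL query onto Solr documents -->
    <entity name="product"
            query="SELECT p.id, p.name, p.price, r.name AS retailer_name
                   FROM products p JOIN retailers r ON p.retailer_id = r.id">
      <!-- column must match the SQL result column; name is the Solr field -->
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="price" name="price"/>
      <field column="retailer_name" name="retailer_name"/>
    </entity>
  </document>
</dataConfig>
```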

Step 2 :

Now, we will configure the schema.xml file. This file contains all the field type definitions along with their analysis configuration.

Solr supports various field types. A field type is similar to a data type; it defines how a field's content is interpreted and analyzed. For example, a boolean type holds "true" or "false", and a whitespace-tokenized text type splits only on whitespace, for exact matching of words.
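For reference, these two types are defined in the stock schema.xml roughly as follows:

```xml
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

<!-- A text field that only splits on whitespace, for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```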

After configuring the field types, we need to declare the fields that we are going to use for indexing.

Each field is declared with name, type, indexed, stored, and required attributes. The name of each field must match the name we gave in db-data-config.xml. The attributes are explained below.

type → the field type (one of the field types we defined earlier)

indexed → fields with indexed=true are searchable

stored → fields with stored=true are stored as part of the index and are returned as part of the search response

required → fields with required=true are mandatory fields (every document must supply a value for them)
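A field section matching the example data config could look like this; the field names mirror the sketch from Step 1, and the string and float types are assumed to be defined among the field types in schema.xml.

```xml
<!-- Field names must match the name attributes in db-data-config.xml -->
<field name="id"            type="string"  indexed="true" stored="true" required="true"/>
<field name="name"          type="text_ws" indexed="true" stored="true"/>
<field name="price"         type="float"   indexed="true" stored="true"/>
<field name="retailer_name" type="text_ws" indexed="true" stored="true"/>
```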

Step 3 :

Now, we will configure solrconfig.xml. Here, we are going to configure the data import handler to import the data, the request handler for search, and the query response writer.

The data import handler is registered as a request handler in solrconfig.xml. Notice that the configured file name matches the db-data-config.xml file we created in Step 1.
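The standard registration looks like this:

```xml
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- The data config file created in Step 1 -->
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
```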

Next, we configure the request handler in solrconfig.xml that serves search requests.


Next, we configure the query response writers.

If no response writer is configured in solrconfig.xml, the XML response writer is enabled by default. In a search request, we can choose the response writer using the wt query parameter, and the search response is returned in the corresponding data format.
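For example, the XML and JSON writers can be registered as follows; XML stays the default unless another writer is marked default="true".

```xml
<!-- XML is the default response writer; JSON is selected with wt=json -->
<queryResponseWriter name="xml" class="solr.XMLResponseWriter" default="true"/>
<queryResponseWriter name="json" class="solr.JSONResponseWriter"/>
```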

Now, all the necessary configuration is done. It's time to index the data using the data import handler. We can invoke the data import handler from the admin console, or we can send an HTTP request to it directly.
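Assuming the default port and a single-core setup (adjust the host, port, and core path for your installation), a full import can be triggered with a URL of this form:

```
http://localhost:8983/solr/dataimport?command=full-import
```

The command parameter also accepts delta-import for incremental updates and status to check the progress of a running import.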