Apache Solr High Speed Data Integration Plugin

Guest blog post from The Digital Group. The Digital Group (T/DG) has been working on Talend-based data source integration for quite some time. T/DG recently launched the 3RDi (Third Eye) Enterprise Search Discovery and Analytics Platform, which uses Talend for all of its data integration layers. 3RDi is a comprehensive suite of products that addresses Enterprise Search needs, with solutions for Content Discovery, Semantic Enrichment, Governance, Analytics, Relevancy Management and Automated Testing. You can read more about it here: http://www.3rdisearch.com/

Problem

3RDi uses Apache Solr, the open source enterprise search platform, as the backbone for its search features. Apache Solr’s major features include full-text search, hit highlighting, faceted search, real-time indexing, and dynamic clustering. With distributed search and index replication, Solr is designed for scalability and fault tolerance.

As part of its evaluation and testing of the Solr platform, the T/DG team faced many challenges in attempting to index millions of large, complex XML documents at high speed and with minimal errors. Recognizing these limitations, the team saw that it could help users by cleansing, transforming and enriching XML documents with semantic information before they are indexed in Solr. Apache Solr provides integration tools such as Data Import Handlers and RESTful APIs, but these have their own challenges. Solr’s Data Import Handler has to be loaded in the same JVM as Solr, which makes that JVM’s footprint heavy. For large data transfers this becomes an even bigger challenge, and it calls for a mature integration (or rich ETL) solution like Talend that supports higher data volumes with excellent error handling. Talend also offers greater flexibility at the data integration layer: it provides out-of-the-box components to read data from various databases, read files in formats such as XML and CSV, transform data, and process data in parallel and in batches. The challenge, however, is that Talend 6 does not provide a direct way to integrate with Apache Solr.

Solution

The solution? New high-speed data integration plugins that make it possible to use Talend’s capabilities for Apache Solr data integration use cases. Today, Talend Exchange hosts three free plugins for Talend-Solr integration.

These plugins are developed and contributed by T/DG. Together, the three plugins cover most cases of migrating data in different forms from various data sources to Apache Solr. They run in Talend’s environment, so they do not consume memory in Solr’s JVM. They are built on top of the latest Apache Solr release line (5.x) and use Apache Solr’s concurrent APIs to transfer data at high speed, in batches and across concurrent threads.
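The batching-plus-threads idea described above can be sketched in a few lines. This is only an illustration in Python, not the plugins' actual code (they are Talend components): the names `batched`, `index_concurrently` and `fake_send` are hypothetical, and `fake_send` merely stands in for whatever call would really ship a batch to Solr, for example an HTTP POST of a JSON array of documents to an update handler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import json


def batched(docs, batch_size):
    """Yield successive lists of at most batch_size documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def index_concurrently(docs, send_batch, batch_size=500, threads=4):
    """Send batches through send_batch on a pool of worker threads.

    Returns (indexed_count, failed_batches), so one bad batch is
    recorded instead of aborting the whole run.
    Note: all batches are submitted up front, which is fine for a
    sketch but would need back-pressure for very large inputs.
    """
    indexed, failed = 0, []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [(pool.submit(send_batch, batch), batch)
                   for batch in batched(docs, batch_size)]
        for future, batch in futures:
            try:
                future.result()
                indexed += len(batch)
            except Exception as exc:
                failed.append((batch, exc))
    return indexed, failed


def fake_send(batch):
    """Hypothetical sender: serializes the batch and rejects documents without an id."""
    payload = json.dumps(batch)  # a JSON array of documents
    if any("id" not in doc for doc in batch):
        raise ValueError("document without id")
    return len(payload)
```

A call such as `index_concurrently(docs, fake_send, batch_size=1000, threads=8)` shows the two knobs that matter here: batch size and concurrency level. Collecting failed batches separately is one simple way to keep error handling from stopping the rest of the transfer.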

Users can now define complex workflows in Talend Open Studio by plugging in any of these components. A workflow can perform complex data transformations, process data concurrently in batches, and finally push the results to, or pull data from, Apache Solr.
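To make the transformation step of such a workflow concrete, here is a minimal sketch of cleansing an XML record and flattening it into a field map ready for indexing. It is written in Python purely for illustration and does not reflect how the Talend components are implemented; the function name `xml_to_doc` and the sample element names are assumptions.

```python
import xml.etree.ElementTree as ET


def xml_to_doc(xml_text):
    """Flatten the direct children of the root element into a dict.

    Whitespace is normalized, empty elements are dropped, and
    repeated elements become a list (a multi-valued field).
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        text = " ".join((child.text or "").split())  # collapse runs of whitespace
        if not text:
            continue  # skip empty elements instead of indexing blanks
        fields.setdefault(child.tag, []).append(text)
    # Single-valued fields are stored as plain strings.
    return {tag: vals[0] if len(vals) == 1 else vals
            for tag, vals in fields.items()}


sample = (
    "<article><id> A1 </id><title>Solr  and\n Talend</title>"
    "<author>Ann</author><author>Bob</author><note/></article>"
)
doc = xml_to_doc(sample)
```

A real workflow would typically add further steps after this one, such as semantic enrichment through an external service, before the documents are sent to Solr.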

These components were benchmarked against older plugins and outperformed them in speed, error handling and flexibility of integration. The data ingestion workflow used for benchmarking involved complex data transformation and external web service calls for semantic enrichment of data before indexing. The plugins are easy for any Talend developer to use, and they provide advanced settings to control the concurrency level and other Solr parameters. They are maintained by The Digital Group, which provides technical support for any issues with them. Below are the screenshots of these plugins in action.