ApacheSolrSink


Description

Some use cases need near real time full text indexing of data through Flume into Solr, where a Flume sink can write directly to a Solr search server. This is a scalable way to provide low latency querying and data acquisition. It complements (rather than replaces) use cases based on Map Reduce batch analysis of HDFS data.

Apache Solr has a client API that uses REST to add documents to a Solr server, which in turn is based on Lucene. A Solr sink can extract documents from Flume events and forward them to Solr.
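As a rough sketch of what forwarding an event to Solr's REST update endpoint could involve, the snippet below builds the XML `<add>` payload that Solr accepts at `/update`. The field names and sample values are illustrative assumptions, not the actual sink implementation; a real sink would derive fields from the Flume event headers and body, and would also need to escape XML special characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrAddXml {
    // Builds the <add><doc>...</doc></add> XML that Solr's REST
    // /update endpoint accepts. Field names here are illustrative;
    // a real sink would derive them from the Flume event.
    static String toSolrAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(e.getValue())
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "event-001");
        fields.put("body", "sample log line");
        // prints <add><doc><field name="id">event-001</field><field name="body">sample log line</field></doc></add>
        System.out.println(toSolrAddXml(fields));
    }
}
```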


Israel Ekpo added a comment - 26/Mar/13 20:48

I think this is a cool idea. This could be a great alternative to the ElasticSearchSink.

There are some folks who have experience with Apache Solr but do not necessarily understand how to get ElasticSearch up and running. Having a SolrSink as an alternative could be very helpful in creating a user interface for searching through event and log data collected with Flume using Apache Solr.

In ElasticSearch, the data sent to the sink can be partitioned by date (yyyy-MM-dd). With the SolrSink, the captured data can be partitioned by date in a similar manner via the CREATE feature of CoreAdmin: http://wiki.apache.org/solr/CoreAdmin#CREATE

The only downside is that unlike ElasticSearch, where no pre-existing schemas are required, with Apache Solr a new core can only be created from a pre-existing instanceDir containing solrconfig.xml and schema.xml files.
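The date partitioning described above could be sketched like this: compose a core name from a yyyy-MM-dd suffix and issue a CoreAdmin CREATE request for it. The host, core-name prefix, and the choice to reuse the core name as instanceDir are illustrative assumptions; only the URL construction is shown here, not the HTTP call itself.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class CoreAdminCreateUrl {
    // Builds a CoreAdmin CREATE URL for a date-partitioned core,
    // mirroring ElasticSearch-style yyyy-MM-dd index partitioning.
    static String createCoreUrl(String solrBase, String corePrefix, LocalDate date) {
        String coreName = corePrefix + "-"
                + date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        // As noted above, the instanceDir must already exist and contain
        // solrconfig.xml and schema.xml -- Solr cannot create a core without them.
        return solrBase + "/admin/cores?action=CREATE"
                + "&name=" + coreName
                + "&instanceDir=" + coreName;
    }

    public static void main(String[] args) {
        System.out.println(createCoreUrl("http://localhost:8983/solr",
                "flume-events", LocalDate.of(2013, 3, 26)));
    }
}
```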



Mike Percy added a comment - 25/Apr/13 10:21

Israel, cool patch! I have some high-level feedback and some nitpicky feedback.

High level:

Can we abstract the SolrEventSerializer concept a bit more broadly into a SolrIndexer? The idea is that people may want to do more than simply map one event to one document, as well as use implementations other than ConcurrentUpdateSolrServer. To support more complex indexing use cases in the future, one way to do it could be adding an interface like:

public interface SolrIndexer extends Configurable {
  public void configure(Context ctx);
  public void init();
  public void load(Event event) throws IOException, SolrServerException;
  public void beginSolrTransaction() throws IOException, SolrServerException;
  public void commitSolrTransaction() throws IOException, SolrServerException;
  public void rollbackSolrTransaction() throws IOException, SolrServerException;
  public void shutdown();
}

So stuff like docs.add(eventSerializer.prepareInputDocument(event)) would be abstracted into indexer.load(event), and solrServer.add(docs) + solrServer.commit() would be abstracted into indexer.commitSolrTransaction().

Thoughts?

Aside from this suggestion, could you also do the following?

- Attach a .patch file that compiles instead of a jar
- Ensure indentation is consistent and kept to 2 spaces
- Add some unit tests

Regards,
Mike
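To make the proposed abstraction concrete, here is a minimal, self-contained sketch of a SolrIndexer-style implementation. The Event class below is a stub standing in for Flume's Event, and the in-memory lists stand in for SolrJ's server calls, so the sketch runs without Flume or SolrJ on the classpath; it only illustrates how load() and commitSolrTransaction() would absorb the docs.add(...) and solrServer.add(docs) + solrServer.commit() steps.

```java
import java.util.ArrayList;
import java.util.List;

// Stub stand-in for Flume's Event, so this sketch compiles on its own.
class Event {
    final String body;
    Event(String body) { this.body = body; }
}

public class SimpleSolrIndexer {
    // Pending documents, buffered between begin and commit; an in-memory
    // list stands in for SolrJ's server here.
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // load() replaces docs.add(eventSerializer.prepareInputDocument(event)):
    // the indexer itself decides how an event maps to documents.
    public void load(Event event) {
        pending.add(event.body);
    }

    // commitSolrTransaction() replaces solrServer.add(docs) + solrServer.commit().
    public void commitSolrTransaction() {
        committed.addAll(pending);
        pending.clear();
    }

    // rollbackSolrTransaction() discards everything loaded since the last commit.
    public void rollbackSolrTransaction() {
        pending.clear();
    }

    public int committedCount() { return committed.size(); }

    public static void main(String[] args) {
        SimpleSolrIndexer indexer = new SimpleSolrIndexer();
        indexer.load(new Event("log line 1"));
        indexer.load(new Event("log line 2"));
        indexer.commitSolrTransaction();
        indexer.load(new Event("log line 3"));
        indexer.rollbackSolrTransaction(); // dropped, never committed
        System.out.println(indexer.committedCount()); // prints 2
    }
}
```

The point of the interface is exactly this separation: the sink drives the transaction boundaries, while the indexer owns both the event-to-document mapping and the server interaction.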


Israel Ekpo added a comment - 24/Jun/13 02:12

Thanks Mike. I did get a chance to review your comments. I think that is a good idea.

I would like to add a load(List<Event> events) method that accepts multiple events at once.

I was unavailable these last few weeks, so it is probably too late now to make 1.4. I can submit an updated patch in a week that we can put in the next release.

Another thing I am working on is an HTTP client that we can use for both the Solr and ElasticSearch sinks, so that we are not tightly coupled to dependencies that can break when the server version differs from what the client/sink is using.
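The proposed batch overload could look like the sketch below: load(List<Event>) delegates to the single-event load() by default, so an implementation only needs to override the batch form when it can do better (for example, one bulk request to the server). The Event class is again a stub so the sketch runs standalone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stub stand-in for Flume's Event.
class Event {
    final String body;
    Event(String body) { this.body = body; }
}

public class BatchLoadSketch {
    private final List<String> loaded = new ArrayList<>();

    public void load(Event event) {
        loaded.add(event.body);
    }

    // Proposed addition: accept multiple events at once. Delegating to the
    // single-event form keeps the two overloads consistent by construction.
    public void load(List<Event> events) {
        for (Event e : events) {
            load(e);
        }
    }

    public int loadedCount() { return loaded.size(); }

    public static void main(String[] args) {
        BatchLoadSketch sketch = new BatchLoadSketch();
        sketch.load(Arrays.asList(new Event("a"), new Event("b"), new Event("c")));
        System.out.println(sketch.loadedCount()); // prints 3
    }
}
```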

Gopal Patwa added a comment - 16/Jul/13 20:08

I was just curious to know how this feature compares with "Flume Morphline Solr Sink": https://issues.apache.org/jira/browse/FLUME-2070

Should I use "Morphline Solr Sink" or "ApacheSolrSink" for generating a Solr index using Flume?


wolfgang hoschek added a comment - 16/Jul/13 20:23

My understanding of FLUME-1687 is that it simply forwards the Flume headers as-is to Solr, i.e. it essentially expects an upstream component to send Flume events that conform to, and are formatted exactly as required by, Solr. I think it also doesn't support SolrCloud.

In contrast, the Morphline Solr Sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform, and load the same data into Solr. In particular, the Morphline Solr Sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to search applications. The ETL functionality is customizable via a morphline configuration file that defines a chain of pluggable transformation commands that pipe event records from one command to another. The Morphline Solr Sink also supports SolrCloud, transactional batching for more scalability, and Solr collection aliases (e.g. for transparent expiry of old index partitions).

The Morphline Solr Sink can do everything that FLUME-1687 can do, and more. It would be nice to merge those two efforts into one.