Solr indexing performance

Hi, we are experiencing some performance issues with Solr batch indexing. We have a cluster composed of 4 workers, each equipped with 32 cores and 256GB of RAM. YARN is configured to use 100 vCores and 785.05GB of memory. The HDFS storage is managed by an EMC Isilon system connected through a 10Gb interface. Our cluster runs CDH 5.8.0 with Solr 4.10.3, and it is Kerberized.

With the current setup, speaking of compressed data, we can index about 25GB per day and 500GB per month using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15GB of compressed data. In particular, MorphlineMapper jobs last approximately 5 hours and TreeMergeMapper jobs last about 6 hours.

Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?

The MorphlineMapper phase takes all available resources, while the TreeMergeMapper takes only a couple of containers.

We don't need to run queries for the moment; we just need to index historical data. We are wondering whether there is a way to speed up indexing and then optimize the collections for searching once indexing is complete.

Re: Solr indexing performance

Hi, yes, I forgot to say that I'm a colleague of the guy in the first post. Thanks for your help, I'll try to be as clear as possible.

Usually, the "java.opts" parameters are set to 80% of the "memory.mb" one. Is there some specific reason you have set it to only 50%?

We didn't find good documentation about this; we just set these parameters based on our experience. Are there resources on the subject?
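For what it's worth, the usual 80% rule of thumb just leaves headroom between the JVM heap (`-Xmx` in `mapreduce.*.java.opts`) and the YARN container size (`mapreduce.*.memory.mb`) for non-heap JVM memory. A minimal sketch of the arithmetic (the 0.8 factor is common Hadoop guidance, not an official Solr recommendation):

```python
def java_opts_for_container(memory_mb, heap_fraction=0.8):
    """Suggest an -Xmx value for mapreduce.*.java.opts given the
    YARN container size (mapreduce.*.memory.mb). The remaining ~20%
    is left for non-heap JVM overhead (metaspace, thread stacks,
    direct buffers), so the container is not killed by YARN."""
    heap_mb = int(memory_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"

# e.g. an 8192MB reducer container -> -Xmx6553m
print(java_opts_for_container(8192))
```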

We upgraded our cluster since the first post. We now have 8 workers, with a total of 200 vcores and 1.5TB of memory.

How is your collection configured? How many shards (and how many replicas per shard)?

Our collections have 12 shards with 2 replicas per shard. The workload is balanced across machines, so each machine hosts 3 cores.

In the first post, it is said that the mapper phase takes 5 hours (for roughly 15GB of compressed data). What is the processing time of a single mapper task? How many mapper tasks are launched in this phase? Are the CPUs of the worker nodes overused during an indexing run, or are they idle? How does Solr handle the load at the end of the indexing process (when Solr is loading the data)?

Our source produces about 20GB of compressed data every day, split into about 550 compressed files. The number of map tasks is the same as the number of input files, and we run one indexing job per day. A single MorphlineMapper map task takes about 20 minutes to complete. Considering the total number of cores, on a fully unloaded cluster the map phase takes about 1 hour to complete. During this phase, the workers' CPUs are almost 100% loaded.
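The map-phase figure above can be sanity-checked with simple "wave" arithmetic: tasks run in batches limited by the number of concurrent slots. A small sketch using the numbers from this thread (treating the 200 vcores as the slot count is an assumption):

```python
import math

def map_phase_minutes(num_tasks, task_minutes, concurrent_slots):
    """Rough map-phase wall time: tasks run in 'waves' of at most
    concurrent_slots tasks, and each wave takes task_minutes."""
    waves = math.ceil(num_tasks / concurrent_slots)
    return waves * task_minutes

# 550 gzip files -> 550 map tasks, ~20 min each, ~200 slots available
print(map_phase_minutes(550, 20, 200))  # -> 60 minutes (3 waves)
```

This matches the observed ~1 hour, which suggests the map phase is already close to fully parallel and the bigger win is elsewhere (the reduce/merge phases).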

The reduce phase takes almost 4 hours to complete. We tried two different approaches here. First, we started indexing without setting the "--reducers" parameter. In this case, the reduce phase takes 24 cores and almost 3 hours; when it ends, the TreeMergeMapper job starts and takes almost 2 hours to complete. As far as I know, during this phase 24 "virtual shards" are created and then merged into the final 12 desired shards.

Second, to avoid the TreeMergeMapper job, we tried setting the number of reducers to 12 (the same as the number of shards). In this case, however, the MorphlineMapper reduce phase takes 12 cores and almost 5 hours to complete. So we can't see any significant improvement with either strategy.
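The trade-off between the two settings above can be sketched as a tree merge: more reducers means more parallelism in the reduce phase but extra merge passes to get down to the final shard count. This is only an illustration of the idea; the real tool's constraints on "--reducers" and its merge fanout may differ:

```python
import math

def merge_passes(reducers, shards, fanout=2):
    """Number of merge passes needed to combine the intermediate
    'virtual shards' written by the reducers down to the final
    shard count, merging up to 'fanout' segments per pass.
    (Sketch of the tree-merge idea, not the tool's exact logic.)"""
    passes = 0
    while reducers > shards:
        reducers = math.ceil(reducers / fanout)
        passes += 1
    return passes

print(merge_passes(24, 12))  # 1 merge pass: 24 -> 12
print(merge_passes(12, 12))  # 0 passes: no separate merge job needed
```

With 24 reducers you pay one merge pass (the TreeMergeMapper job); with 12 reducers you skip it, but each reducer indexes twice as much data, which is consistent with the longer 5-hour reduce phase you observed.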

When the MorphlineMapper job (and, if it runs, the TreeMergeMapper one) ends, the "Current" indicator in the Solr web UI's Statistics tab turns red, meaning something is still going on. We can't keep track of this activity in YARN, and CPU and memory usage is not very high during this stage. What is it about? After 4-5 hours the indicator turns green again and the collection is fully available.

What is the compression algorithm used? Is it efficient (is the trade-off between compression ratio and performance acceptable)? 15GB of compressed data: how many lines does it represent, and how many fields per line?

Gzip is used to compress the records; every compressed file contains a single txt file. This is the way our source sends data. For 20GB of data we have about 550 compressed files of roughly 37MB each. Every file contains about 320,000 records, and every record is made of 23 text fields, some of which are dynamic:

Is there some room to improve the processing time of the morphline script? Is it efficient enough? Is there no "loadSolr" instruction in the morphline?

Yes, we have a "loadSolr" instruction in the morphline:

1 - Try increasing the "mapreduce.reduce.java.opts" parameter to 80% of 8GB. This might improve the reduce phase processing time.

2 - 550 files for 20GB of data means an average size of 37MB per file. I guess your block size is higher (64 or 128MB). Having fewer, bigger files might help the mapper phase.
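If combining the small input files is an option, note that gzip streams can be concatenated byte-for-byte and still decompress to the concatenated payload, so merging the files does not require recompressing anything. A minimal demonstration (whether fewer, larger gzip files actually helps is workload-dependent, since gzip is not splittable and you would get fewer, longer map tasks):

```python
import gzip

# Two small "input files" compressed independently, as the source sends them
part1 = gzip.compress(b"record 1\nrecord 2\n")
part2 = gzip.compress(b"record 3\n")

# Concatenating the compressed bytes yields a valid multi-member gzip stream
combined = part1 + part2

# Decompressing the combined stream returns the concatenated payloads
assert gzip.decompress(combined) == b"record 1\nrecord 2\nrecord 3\n"
print("combined stream decompresses correctly")
```

In practice this means a simple `cat` of groups of .gz files into larger ones is enough to cut the number of map tasks.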

3 - I don't think you need the "loadSolr" instruction inside the morphline.

From my understanding, the HBaseIndexerTool is in charge of loading Solr at the end of the processing. Having this particular instruction inside the morphline means you load Solr twice (as far as I understand, at least this is the case when reading from HBase).