Tuning to Complete During Setup

Some tuning is best completed during the setup of your system, because changing it later may require re-indexing.

Configuring Lucene Version Requirements

You can configure Solr to use a specific version of Lucene. This can help ensure that the Lucene version that Search uses includes the latest features and bug fixes. At the time that a
version of Solr ships, Solr is typically configured to use the appropriate Lucene version, in which case there is no need to change this setting. If a subsequent Lucene update occurs, you can
configure the Lucene version requirements by directly editing the luceneMatchVersion element in the solrconfig.xml file. Versions are
typically of the form x.y, such as 4.4. For example, to specify version 4.4, you would ensure the following setting exists in
solrconfig.xml:

<luceneMatchVersion>4.4</luceneMatchVersion>

Designing the Schema

When constructing a schema, use data types that most accurately describe the data that the fields will contain. For example:

Use the tdate type for dates. Do this instead of representing dates as strings.

Consider using the text type that applies to your language, instead of using the string type. For example, you might use text_en. Text types support returning results for subsets of an entry. For example,
querying on "john" would find "John Smith", whereas with the string type, only exact matches are returned.

Configuring the Heap Size

General Tuning

The following tuning categories can be completed at any time. It is less important to implement these changes before beginning to use your system.

General Tips

Enabling multi-threaded faceting can provide better performance for field faceting. When multi-threaded faceting is enabled, field faceting tasks
are completed in parallel, with a separate thread working on each field faceting task simultaneously. Performance improvements do not occur in all cases, but improvements are likely when all of the
following are true:

The system uses highly concurrent hardware.

Faceting operations apply to large data sets over multiple fields.

There is not an unusually high number of queries occurring simultaneously on the system. Systems that are lightly loaded or that are mainly
engaged with ingestion and indexing may be helped by multi-threaded faceting; for example, a system ingesting articles and being queried by a researcher. Systems heavily loaded by user queries are
less likely to be helped by multi-threaded faceting; for example, an e-commerce site with heavy user-traffic.

Note: Multi-threaded faceting only applies to field faceting and not to query faceting.

Field faceting identifies the number of unique entries for a field. For example, multi-threaded faceting could be used to simultaneously facet
for the number of unique entries for the fields "color" and "size". In such a case, there would be two threads, and each thread would work on faceting one of the two fields.

Query faceting identifies the number of unique entries that match a query for a field. For example, query faceting could be used to find the
number of unique entries in the "size" field that are between 1 and 5. Multi-threaded faceting does not apply to these operations.

To enable multi-threaded faceting, add the facet.threads parameter to queries. For example, setting facet.threads=1000 allows up to 1000 threads.

If facet.threads is omitted or set to 0, faceting is single-threaded. If facet.threads is set to a negative value, such as -1,
multi-threaded faceting uses as many threads as there are fields to facet, up to the maximum number of threads possible on the system.
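
A minimal sketch of a field-faceting request that allows up to 1000 threads. The host, port, collection name (collection1), and the field names color and size are illustrative assumptions, not values from your deployment. In the request URL the parameter is spelled facet.threads.

```shell
# Hypothetical endpoint and collection; replace with your own.
SOLR_SELECT_URL="http://localhost:8983/solr/collection1/select"

# Facet on two fields (color and size) with up to 1000 faceting threads.
# rows=0 suppresses document results so only facet counts are returned.
FACET_PARAMS="q=*:*&rows=0&facet=true&facet.field=color&facet.field=size&facet.threads=1000"

REQUEST_URL="${SOLR_SELECT_URL}?${FACET_PARAMS}"
echo "${REQUEST_URL}"

# To send the request against a running Solr instance, uncomment:
# curl --silent "${REQUEST_URL}"
```

With two facet fields, at most two threads would actually be used, one per field, regardless of the higher limit.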

If your environment does not require Near Real Time (NRT), turn off soft auto-commit in solrconfig.xml.

In most cases, do not change the default batchSize
setting of 1000. If you are working with especially large documents, you may consider decreasing the batch size.

To help identify any garbage collector (GC) issues, enable GC logging in production. The overhead is low, and the JVM supports GC log rolling as
of Java 1.6.0_34.
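
A sketch of HotSpot options that enable GC logging with log rolling. These flags apply to Java 8 and earlier JVMs. The log path, file count, and file size are hypothetical values, and the variable name GC_LOG_OPTS is only illustrative; how the options are passed to the Solr JVM depends on your deployment.

```shell
# Hypothetical log location; choose a directory the Solr process can write to.
GC_LOG_FILE="/var/log/solr/gc.log"

# Enable detailed, timestamped GC logging.
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
GC_LOG_OPTS="${GC_LOG_OPTS} -Xloggc:${GC_LOG_FILE}"

# Roll the log across 10 files of at most 10 MB each (illustrative values).
GC_LOG_OPTS="${GC_LOG_OPTS} -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M"

echo "${GC_LOG_OPTS}"
```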

Solr and HDFS - the Block Cache

Warning: Do not enable the Solr HDFS write cache, because it can lead to index corruption.

Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache has been implemented using Least Recently Used (LRU) semantics. This
enables Solr to cache HDFS index files on read and write, storing the portions of the file in JVM direct memory (off heap) by default, or optionally in the JVM heap.

Batch jobs typically do not use the cache, while Solr servers (when serving queries or indexing documents) should. When running indexing using MapReduce, the MR jobs themselves do not
use the block cache. Block write caching is turned off by default and should be left disabled.

Tuning of this cache is complex and best practices are continually being refined. In general, allocate a cache that is about 10-20% of the amount of memory available on the system. For
example, when running HDFS and Solr on a host with 96 GB of memory, allocate 10-20 GB of memory using solr.hdfs.blockcache.slab.count. As index sizes grow you may need
to tune this parameter to maintain optimal performance.

Note: Block cache metrics are currently unavailable.
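
The sizing guidance above can be turned into a slab count with simple arithmetic. This sketch assumes the default 8 KB block size and 16384 blocks per slab given in the parameter descriptions later in this section, and uses a hypothetical target of 10 GB (the low end of the range for the 96 GB example host).

```shell
BLOCK_SIZE_KB=8          # block size used by the block cache
BLOCKS_PER_SLAB=16384    # default solr.hdfs.blockcache.blocksperbank
TARGET_CACHE_GB=10       # hypothetical target cache size

# Size of one slab in MB: block size times blocks per slab.
SLAB_SIZE_MB=$(( BLOCK_SIZE_KB * BLOCKS_PER_SLAB / 1024 ))

# Number of slabs needed to reach the target cache size.
SLAB_COUNT=$(( TARGET_CACHE_GB * 1024 / SLAB_SIZE_MB ))

echo "slab size: ${SLAB_SIZE_MB} MB"
echo "solr.hdfs.blockcache.slab.count=${SLAB_COUNT}"
```

With these inputs each slab is 128 MB and the result is 80 slabs; a 20 GB target would give 160.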

Configuration

The following parameters control caching. They can be configured at the Solr process level by setting the respective Java system property or by editing solrconfig.xml directly. For more information on setting Java system properties, see Setting Java System Properties for Solr.

If the parameters are set at the collection level (using solrconfig.xml), the first collection loaded by the Solr server takes precedence, and block cache
settings in all other collections are ignored. Because you cannot control the order in which collections are loaded, you must make sure to set identical block cache settings in every collection's
solrconfig.xml. Block cache parameters set at the collection level in solrconfig.xml also take precedence over parameters set at the process
level.

Each parameter is listed with its Cloudera Manager setting, default value, and description.

Parameter: solr.hdfs.blockcache.global
Cloudera Manager Setting: Not directly configurable. Cloudera Manager automatically enables the global block cache. To override this setting, you must use the Solr Service Environment Advanced Configuration Snippet (Safety Valve).
Default: true
Description: If enabled, a single HDFS block cache is shared by all collections on a host. If blockcache.global is disabled, each SolrCore on a host creates its own private HDFS block cache. Enabling this parameter simplifies managing HDFS block cache memory.

Parameter: solr.hdfs.blockcache.enabled
Cloudera Manager Setting: HDFS Block Cache
Default: true
Description: Enable the block cache.

Parameter: solr.hdfs.blockcache.read.enabled
Cloudera Manager Setting: Not directly configurable. If the block cache is enabled, Cloudera Manager automatically enables the read cache. To override this setting, you must use the Solr Service Environment Advanced Configuration Snippet (Safety Valve).
Default: true
Description: Enable the read cache.

Parameter: solr.hdfs.blockcache.write.enabled
Cloudera Manager Setting: Not directly configurable. If the block cache is enabled, Cloudera Manager automatically disables the write cache. Warning: Do not enable the Solr HDFS write cache, because it can lead to index corruption.
Default: false
Description: Enable the write cache.

Parameter: solr.hdfs.blockcache.direct.memory.allocation
Cloudera Manager Setting: HDFS Block Cache Off-Heap Memory
Default: true
Description: Enable direct memory allocation. If this is false, heap is used.

Parameter: solr.hdfs.blockcache.blocksperbank
Cloudera Manager Setting: HDFS Block Cache Blocks per Slab
Default: 16384
Description: Number of blocks per cache slab. The size of the cache is 8 KB (the block size) times the number of blocks per slab times the number of slabs.

Parameter: solr.hdfs.blockcache.slab.count
Cloudera Manager Setting: HDFS Block Cache Number of Slabs
Default: 1
Description: Number of slabs per block cache. The size of the cache is 8 KB (the block size) times the number of blocks per slab times the number of slabs.

Note:

Increasing the direct memory cache size may make it necessary to increase the maximum direct memory size allowed by the JVM. Each Solr slab allocates 128 MB of memory by default,
plus some additional direct memory overhead. Therefore, ensure that MaxDirectMemorySize is set comfortably above the value expected for slabs
alone. The amount of additional memory required varies according to multiple factors, but for most cases, setting MaxDirectMemorySize to at least 20-30% more than the
total memory configured for slabs is sufficient. Setting MaxDirectMemorySize to exactly the number of slabs multiplied by the slab size does not provide enough memory.
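
As a worked example of the note above, this sketch computes a MaxDirectMemorySize value with 30% headroom over the slab total. The slab count of 80 is a hypothetical value, and 30% is simply the upper end of the 20-30% guidance.

```shell
SLAB_COUNT=80            # hypothetical solr.hdfs.blockcache.slab.count
SLAB_SIZE_MB=128         # default slab size (8 KB x 16384 blocks)
OVERHEAD_PERCENT=30      # headroom on top of the slab total

# Memory used by the slabs alone.
SLAB_TOTAL_MB=$(( SLAB_COUNT * SLAB_SIZE_MB ))

# Slab total plus headroom for additional direct memory overhead.
MAX_DIRECT_MB=$(( SLAB_TOTAL_MB * (100 + OVERHEAD_PERCENT) / 100 ))

echo "-XX:MaxDirectMemorySize=${MAX_DIRECT_MB}m"
```

Here the slabs alone need 10240 MB, so the resulting JVM option is -XX:MaxDirectMemorySize=13312m.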

Garbage Collection

Choose different garbage collection options for best performance in different environments. Some garbage collection options typically chosen include:

Concurrent low pause collector: Use this collector in most cases. This collector attempts to minimize "Stop the
World" events. Avoiding these events can reduce connection timeouts, such as with ZooKeeper, and may improve user experience. This collector is enabled using the JVM option -XX:+UseConcMarkSweepGC.

Throughput collector: Consider this collector if raw throughput is more important than user experience. This
collector typically causes more "Stop the World" events, which may negatively affect user experience and cause connection timeouts, such as missed ZooKeeper heartbeats. This collector is enabled using the JVM
option -XX:+UseParallelGC. If UseParallelGC "Stop the World" events create problems, such as ZooKeeper timeouts, consider using
the UseParNewGC collector as an alternative collector with similar throughput benefits.

You can also affect garbage collection behavior by increasing the Eden space to accommodate new objects. With additional Eden space, garbage collection does not need to run as frequently
on new objects.
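
A sketch of how the collector choice and a larger young generation might be combined into one set of JVM options. The variable names and the 2 GB young generation size are hypothetical. The -Xmn option sizes the young generation, which contains the Eden space.

```shell
# Choose "latency" for the concurrent low pause collector or
# "throughput" for the throughput collector.
GC_PRIORITY="latency"

case "${GC_PRIORITY}" in
  latency)    GC_OPTS="-XX:+UseConcMarkSweepGC" ;;
  throughput) GC_OPTS="-XX:+UseParallelGC" ;;
  *)          echo "unknown priority: ${GC_PRIORITY}" >&2; GC_OPTS="" ;;
esac

# Hypothetical young generation size; tune it against your heap size.
YOUNG_GEN_SIZE="2g"
GC_OPTS="${GC_OPTS} -Xmn${YOUNG_GEN_SIZE}"

echo "${GC_OPTS}"
```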

Replication

You can adjust the degree to which different data is replicated.

Replication Settings

Note: Do not adjust HDFS replication settings for Solr in most cases.

To adjust the Solr replication factor for index files stored in HDFS:

Cloudera Manager:

Go to Solr service > Configuration > Category > Advanced.

Click the plus sign next to Solr Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml to add a new property with the following
values:

Name: dfs.replication

Value: 2

Click Save Changes.

Restart the Solr service (Solr service > Actions > Restart).

Unmanaged:

Configure the solr.hdfs.confdir system property to refer to the Solr HDFS configuration files. Typically the value is /etc/solrhdfs/. For information on setting Java system properties, see Setting Java System Properties for Solr.

Set the DFS replication value in the HDFS configuration file at the location you specified in the previous step. For example, to set the replication value to 2, you would change the dfs.replication setting as follows:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Restart the Solr service:

sudo service solr-server restart

Replicas

If you have sufficient additional hardware, add more replicas for a linear boost of query throughput. Note that adding replicas may slow write performance on the first replica, but
otherwise this should have minimal negative consequences.

Configure the transaction log replication factor for a collection by modifying the tlogDfsReplication setting in solrconfig.xml. The tlogDfsReplication is a new setting in the updateLog settings area. An excerpt of the solrconfig.xml file where the transaction log replication factor is set is as follows:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Enables a transaction log, used for real-time get, durability, and
       solr cloud replica recovery. The log can grow as big as
       uncommitted changes to the index, so use of a hard autoCommit
       is recommended (see below).
       "dir" - the target directory for transaction logs, defaults to the
       solr data directory. -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
    <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  </updateLog>
  <!-- Other updateHandler settings omitted from this excerpt. -->
</updateHandler>

The default replication level is 3. For clusters with fewer than three DataNodes (such as proof-of-concept clusters), reduce this number to the number of DataNodes in the cluster.
Changing the replication level applies only to new transaction logs.
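
The excerpt above reads the value from the solr.ulog.tlogDfsReplication placeholder, so one way to apply the guidance is to pass that name as a Java system property. This is a sketch that assumes solrconfig.xml keeps that placeholder; the DataNode count of 2 is a hypothetical proof-of-concept cluster.

```shell
DATANODE_COUNT=2               # hypothetical number of DataNodes
DEFAULT_TLOG_REPLICATION=3     # default from the solrconfig.xml excerpt

# Use the default unless the cluster has fewer DataNodes than that.
if [ "${DATANODE_COUNT}" -lt "${DEFAULT_TLOG_REPLICATION}" ]; then
  TLOG_REPLICATION="${DATANODE_COUNT}"
else
  TLOG_REPLICATION="${DEFAULT_TLOG_REPLICATION}"
fi

echo "-Dsolr.ulog.tlogDfsReplication=${TLOG_REPLICATION}"
```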

Initial testing shows no significant performance regression for common use cases.

Shards

In some cases, oversharding can help improve performance, including intake speed. If your environment includes massively parallel hardware and you want to use these available resources,
consider oversharding. You might increase the number of replicas per host from 1 to 2 or 3. Making such changes creates complex interactions, so you should continue to monitor your system's
performance to ensure that the costs of oversharding do not outweigh the benefits.

Commits

Changing commit values may improve performance in some situations. These changes result in tradeoffs and may not be beneficial in all cases.

For hard commit values, the default value of 60000 milliseconds (60 seconds) is typically effective, though changing this value to 120 seconds may improve
performance in some cases. Note that setting this value higher, such as to 600 seconds, may result in undesirable performance tradeoffs.
