Apache Solr Search with Solr > 4.7

13 December 2014

Solr 4.x brings a plethora of improvements over 3.x and 1.x. All our new projects use 4.x, and we try to upgrade existing client implementations where and when possible. Last week we upgraded another client. The transition was smooth, except for odd entries in the indexing log and, as it turned out, nodes missing from the index.

java.lang.Thread.run(Thread.java:745)
Caused by:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="sm_field_body"(whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[32, 67, 97, 116, 104, 101, 114, 105, 110, 101, 32, 66, 101, 97, 114, 100, 115, 104, 97, 119, 32, 67, 97, 116, 104, 101, 114, 105, 110, 101]...', original message: bytes can be at most 32766 in length; got 108809
[...]
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 108809
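
The "prefix of the first immense term" in the message is a list of UTF-8 byte values rather than readable text. Decoding it is a quick way to see which content triggered the error. A minimal Python sketch, using the byte values from the log above (the variable names are only illustrative):

```python
# Byte values copied from the "prefix of the first immense term" in the log.
prefix = [32, 67, 97, 116, 104, 101, 114, 105, 110, 101, 32, 66, 101, 97, 114,
          100, 115, 104, 97, 119, 32, 67, 97, 116, 104, 101, 114, 105, 110, 101]

# The log truncates the term, so a multi-byte character could be cut off at
# the end; errors="replace" keeps the decode from failing in that case.
decoded_prefix = bytes(prefix).decode("utf-8", errors="replace")
print(repr(decoded_prefix))
```

Searching the site content for the decoded text should point to the node whose body field is too large.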

Prior to Solr 4.8, terms that exceeded Lucene’s MAX_TERM_LENGTH were silently ignored when indexing documents. Beginning with Solr 4.8, an error will be generated when attempting to index a document with a term that is too large. If you wish to continue to have large terms ignored, use solr.LengthFilterFactory in all of your Analyzers. See LUCENE-5472 for more details.

Field names in the Drupal Apache Solr module carry a short prefix that maps them to a Solr dynamic field and follows the Solr naming convention. For example, ss_ means “single-value string field” and sm_ means “multi-value string field”.

In our case, sm_field_body, like every sm_* field, is declared as a solr.StrField, which is not analyzed: the whole value is indexed as a single term, exactly as is. Previously, values larger than the 32,766-byte limit were silently skipped; now they raise an error and the document is not indexed.
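
Note that the limit is measured in UTF-8 bytes, not characters, so text with many non-ASCII characters reaches it sooner. As a rough illustration (not part of the module; the function names are hypothetical), a Python sketch for spotting values that would exceed it:

```python
# Lucene's maximum term length in bytes, as reported in the error message.
MAX_TERM_BYTES = 32766


def utf8_length(value: str) -> int:
    """Size of the value once encoded as UTF-8, which is what Lucene counts."""
    return len(value.encode("utf-8"))


def oversized_values(values):
    """Return the values whose UTF-8 encoding exceeds MAX_TERM_BYTES."""
    return [v for v in values if utf8_length(v) > MAX_TERM_BYTES]
```

For example, a string of 20,000 "é" characters is well under 32,766 characters long but encodes to 40,000 bytes, so it would be rejected.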

However, since sm_* fields are not analyzed, a filter such as solr.LengthFilterFactory has nothing to act on, so we need a different solution, one that does not involve modifying the core Solr configuration. It comes as a simple hook implementation in a custom module.
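
Whatever the hook does with an oversized value, it has to work in bytes rather than characters. One possible approach, shown here only as a language-neutral sketch in Python and not as the module's actual code, is to truncate the value to the limit without cutting a multi-byte character in half (truncate_utf8 is a hypothetical helper name):

```python
def truncate_utf8(value: str, max_bytes: int = 32766) -> str:
    """Shorten value so that its UTF-8 encoding fits in max_bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Cutting at a byte offset may split a multi-byte character;
    # errors="ignore" drops the incomplete trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Whether truncating, skipping the field, or indexing it under a different field type is appropriate depends on how the field is used in searches.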