This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/

4.
Poor Man’s Entity Extraction
• acronyms: a searchable/filterable/facetable (but not stored) field containing every all-caps acronym of three or more letters
• key_phrases: a searchable/filterable/facetable (also not stored) field containing any key phrases that match a provided list
• links: a stored field containing http(s) links
• extracted_locations: lat/long points mentioned in the document content, indexed as geospatial points and stored on the document for easy use in maps or elsewhere

5.
example_data.txt
The DUB airport is at 53.421389,-6.27
See also:
http://en.wikipedia.org/wiki/Dublin_Airport
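
As a rough illustration of what the links and extracted_locations fields would hold for this sample, here is a minimal Python sketch. It runs outside Solr and is not the session's implementation; the regular expressions and the function names `extract_links` and `extract_locations` are assumptions made for this example only.

```python
import re

# The sample document from the slide, reproduced as a string.
EXAMPLE = (
    "The DUB airport is at 53.421389,-6.27\n"
    "See also:\n"
    "http://en.wikipedia.org/wiki/Dublin_Airport\n"
)

# Hypothetical patterns, tuned only for this sample text.
LINK_RE = re.compile(r"https?://\S+")
LATLON_RE = re.compile(r"(-?\d{1,2}\.\d+),\s*(-?\d{1,3}\.\d+)")


def extract_links(text):
    """Return every http(s) link found in the text."""
    return LINK_RE.findall(text)


def extract_locations(text):
    """Return "lat,lon" strings for decimal pairs within valid ranges."""
    points = []
    for lat, lon in LATLON_RE.findall(text):
        # Discard pairs that cannot be real coordinates.
        if -90 <= float(lat) <= 90 and -180 <= float(lon) <= 180:
            points.append(f"{lat},{lon}")
    return points


links = extract_links(EXAMPLE)
locations = extract_locations(EXAMPLE)
print(links)
print(locations)
```

The "lat,lon" string form is used here because it is easy to pass on to a geospatial field; real content would likely need stricter patterns than these.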

8.
Extracting with copyField
• copyField content => acronyms
– Note that the destination of a copyField generally should not be stored (stored="false")
• "caps" field type
– PatternCaptureGroupFilterFactory with pattern="((?:[A-Z]\.?){3,})"
• "The Dublin airport, DUB, is at…" => DUB
• Results could be suitable for faceting, searching, and boosting, but the results are not "stored" values (only indexed terms)
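
The behaviour of the capture pattern can be tried outside Solr. The Python sketch below only imitates what the filter would emit. It assumes the whole field value arrives as a single token, and that the optional period in the pattern is meant as a literal, escaped period so that dotted acronyms also match. The function name `caps_terms` is hypothetical.

```python
import re

# Same pattern as on the slide, with the period treated as a literal (assumption).
CAPS_RE = re.compile(r"((?:[A-Z]\.?){3,})")


def caps_terms(text):
    """Return each run of three or more capitals, with optional periods."""
    return [m.group(1) for m in CAPS_RE.finditer(text)]


terms = caps_terms("The Dublin airport, DUB, is at 53.421389,-6.27")
print(terms)
```

Capitalised words such as "The" and "Dublin" are skipped because only their first letter is upper case, so the run never reaches three capitals.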