Our Goal: Use Less Memory

Our Solr application runs on client machines, so it is important for us to use as little memory as possible. In Solr we have fields that are used only for search, and one way to reduce the index size and memory usage is to remove terms longer than a threshold, for example 50 characters. The reasoning is that users are very unlikely to search on such large terms, so why keep them in the index?

The Definition of the content Field
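The original field definition is not reproduced here. As a hedged illustration only (the field type name and surrounding analyzer chain are assumptions), a search-only field can drop oversized terms at index time with solr.LengthFilterFactory:

<!-- Sketch: drop any term longer than 50 characters so it never reaches the index -->
<fieldType name="text_search_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="50"/>
  </analyzer>
</fieldType>
<field name="content" type="text_search_only" indexed="true" stored="false"/>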

Docker is an open platform for building, shipping, and running distributed applications. There are a lot of Docker images with different OSes, bundled with different applications such as Hadoop and MongoDB. When we want to learn or try some tool, we can just call docker run with the specific image, for example: docker run --name some-mongo -d mongo. This does not mess up our host environment, and when we are done we can just call docker kill to kill the running container. We can also use Docker to create a consistent environment which can be run on any Docker-enabled machine. In this article, I would like to introduce how to run Hadoop and Solr in Docker.

Install the Hadoop Image and Run It

Search for Hadoop in the Docker registry: https://registry.hub.docker.com. I chose the most popular image, sequenceiq/hadoop-docker, and ran the following command on my Ubuntu host:

docker run -i -t sequenceiq/hadoop-docker /etc/bootstrap.sh -bash

This downloads the hadoop-docker image and starts it. After several minutes, it drops into a bash shell inside the hadoop-docker container.

Install Solr in the Hadoop Container

Run the following commands; they download Solr 4.10.1 and unpack it:

mkdir -p /home/lifelongprogrammer/src/solr; cd /home/lifelongprogrammer/src/solr
curl -O http://mirrors.advancedhosters.com/apache/lucene/solr/4.10.1/solr-4.10.1.tgz
tar -xf solr-4.10.1.tgz
cd /home/lifelongprogrammer/src/solr/solr-4.10.1/example

Then run the following command; it starts Solr on HDFS on the default port 8983:

java -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.data.dir=hdfs://$(hostname):9000/solr/datadir \
  -Dsolr.updatelog=hdfs://$(hostname):9000/solr/updateLog -jar start.jar

Run Solr in the Background on Startup

Edit /etc/bootstrap.sh and add the following command after $HADOOP_PREFIX/sbin/start-yarn.sh:

cd /home/lifelongprogrammer/src/solr/solr-4.10.1/example && nohup java -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.data.dir=hdfs://$(hostname):9000/solr/datadir \
  -Dsolr.updatelog=hdfs://$(hostname):9000/solr/updateLog -jar start.jar &

Commit Changes and Create Our Own Docker Image

First run docker ps to get the container id:

CONTAINER ID    IMAGE
2cd8fadba668    93186936bee2

Then let's commit the change and create our own Docker image:

docker commit 2cd8fadba668 hadoop_docker_withsolr
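To verify the new image, we can start a container from it the same way we started the original image (assuming the bootstrap script is otherwise unchanged):

docker run -i -t hadoop_docker_withsolr /etc/bootstrap.sh -bash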

The Problem

Usually we know the UNC path (like \\server\share\file_path) and need to get the local physical path, so we can log in to that machine, go to that path, and make changes.

The Solution

We can use WMI (Windows Management Instrumentation) to get and operate on Windows management information. Win32_Share represents a shared resource on a computer system running Windows. In PowerShell, we can use get-wmiobject to get the WMI class and -filter to specify the share name. So the solution is just one command line:

get-wmiobject -class "Win32_Share" -namespace "root\cimv2" -computername "computername" -filter "name='sharename'" | select name,path
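For example, with a hypothetical server fileserver01 and share reports (the output shown is illustrative):

get-wmiobject -class "Win32_Share" -namespace "root\cimv2" -computername "fileserver01" -filter "name='reports'" | select name,path

name     path
----     ----
reports  D:\shares\reports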

The Problem

Extend the UIMA Regex Annotator to allow users to run custom regexes dynamically. The Regular Expression Annotator lets us easily define entity names (such as credit card or email) and the regexes used to extract those entities. But we can never define all useful entities up front, so it is good to allow customers to add their own entities and regexes and have the UIMA Regular Expression Annotator run them dynamically. We could create and deploy a new annotator, but we decided to just extend the UIMA RegExAnnotator.

How It Works

Client Side

We create one type, org.apache.uima.input.dynamicregex, with the features types and regexes. In our HTTP interface, the client specifies the entity names and their regexes: host:port/nlp?text=abcxxdef&customTypes=mytype1,mytype2&customRegexes=abc.*,def.*. The client then adds a feature structure with org.apache.uima.input.dynamicregex.types=mytype1,mytype2 and org.apache.uima.input.dynamicregex.regexes=abc.*,def.*.
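As a hedged sketch (not the actual service code), the client side could attach these values to the CAS with the standard CAS API before calling the analysis engine; the type and feature names match those above:

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;

public class DynamicRegexClient {
  // Attach the user-supplied entity names and regexes to the CAS as an
  // org.apache.uima.input.dynamicregex feature structure.
  public static void addDynamicRegexInput(CAS cas, String customTypes, String customRegexes) {
    Type inType = cas.getTypeSystem().getType("org.apache.uima.input.dynamicregex");
    FeatureStructure fs = cas.createFS(inType);
    fs.setStringValue(inType.getFeatureByBaseName("types"), customTypes);     // e.g. "mytype1,mytype2"
    fs.setStringValue(inType.getFeatureByBaseName("regexes"), customRegexes); // e.g. "abc.*,def.*"
    cas.getIndexRepository().addFS(fs);
  }
}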

Define the Feature Structures in RegExAnnotator.xml

org.apache.uima.input.dynamicregex is used as the input type; the client can specify values for its features types and regexes. org.apache.uima.output.dynamicregex is the output type.
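The exact descriptor is not shown in the post; a hedged sketch of the two type definitions (placed inside the descriptor's <types> element, with an assumed "entity" feature on the output type) could look like this:

<typeDescription>
  <name>org.apache.uima.input.dynamicregex</name>
  <description>Input: user-supplied entity type names and regexes.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>types</name>
      <description>Comma-separated entity type names.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>regexes</name>
      <description>Comma-separated regexes, parallel to types.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
<typeDescription>
  <name>org.apache.uima.output.dynamicregex</name>
  <description>Output: entities extracted by the custom regexes.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>entity</name>
      <description>The entity type name that produced this match.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>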

Run the Custom Regexes and Return Extracted Entities in RegExAnnotator

Next, in the RegExAnnotator.process method, we get the values of the input types and regexes, run the custom regexes, and add the found entities to the CAS indexes.
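A hedged sketch of that logic (not the shipped RegExAnnotator source; the "entity" output feature is an assumption carried over from the descriptor sketch above):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.text.AnnotationFS;

// Sketch: read the dynamic-regex input feature structure, run each custom regex
// over the document text, and index the matches as output annotations.
public class DynamicRegexSketch extends CasAnnotator_ImplBase {

  @Override
  public void process(CAS cas) throws AnalysisEngineProcessException {
    TypeSystem ts = cas.getTypeSystem();
    Type inType = ts.getType("org.apache.uima.input.dynamicregex");
    Type outType = ts.getType("org.apache.uima.output.dynamicregex");
    Feature typesF = inType.getFeatureByBaseName("types");
    Feature regexesF = inType.getFeatureByBaseName("regexes");
    Feature entityF = outType.getFeatureByBaseName("entity"); // assumed output feature

    FSIterator<AnnotationFS> it = cas.getAnnotationIndex(inType).iterator();
    if (!it.hasNext()) {
      return; // the client did not supply any dynamic regexes
    }
    AnnotationFS input = it.next();
    String[] types = input.getStringValue(typesF).split(",");
    String[] regexes = input.getStringValue(regexesF).split(",");

    String text = cas.getDocumentText();
    for (int i = 0; i < types.length && i < regexes.length; i++) {
      Matcher m = Pattern.compile(regexes[i]).matcher(text);
      while (m.find()) {
        AnnotationFS match = cas.createAnnotation(outType, m.start(), m.end());
        match.setStringValue(entityF, types[i]);
        cas.getIndexRepository().addFS(match);
      }
    }
  }
}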

The Problem

We want to know the start and end offsets of a named group, but Matcher.start() and end() in JDK 7 do not accept a group name as a parameter. JDK 7 added support for named groups:
(1) (?<NAME>X) defines a named group "NAME".
(2) \k<NAME> back-references the named group "NAME".
(3) ${NAME} references the captured group in the matcher's replacement string.
We can use matcher.group(String name) to return the input subsequence captured by the given named group, but the matcher's start() and end() do not accept a group name as a parameter.

The Solution

Check the JDK code and look at how matcher.group(String name) is implemented:
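The JDK listing is not reproduced here; in short, group(String name) resolves the name to a group index through the pattern's internal namedGroups() map and then delegates to the int-based methods. That suggests a workaround on JDK 7: look up the index reflectively and call start(int)/end(int). This is only a sketch and relies on non-public JDK internals (JDK 8 later added Matcher.start(String) and end(String)):

import java.lang.reflect.Method;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupOffsets {
  // Resolve a named group to its numeric index via Pattern's package-private namedGroups().
  @SuppressWarnings("unchecked")
  static int groupIndex(Pattern pattern, String groupName) throws Exception {
    Method namedGroups = Pattern.class.getDeclaredMethod("namedGroups");
    namedGroups.setAccessible(true);
    Map<String, Integer> groups = (Map<String, Integer>) namedGroups.invoke(pattern);
    return groups.get(groupName);
  }

  public static void main(String[] args) throws Exception {
    Pattern p = Pattern.compile("(?<area>\\d{3})-(?<num>\\d{4})");
    Matcher m = p.matcher("call 555-1234 now");
    if (m.find()) {
      int idx = groupIndex(p, "area");
      // Offsets of the named group, obtained through the int-based start()/end().
      System.out.println(m.group("area") + " at [" + m.start(idx) + "," + m.end(idx) + ")");
    }
  }
}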

The Problem

In a previous post, Using ResultSpecification to Filter Annotator to Boost Opennlp UIMA Performance, I introduced how to use a ResultSpecification to make OpenNLP.pear run only the needed annotators. Recently we changed our content analyzer project to use UIMA-AS for better scale-out, but UIMA-AS does not support specifying a ResultSpecification on the client side, so we had to find another solution. Luckily UIMA provides a more general mechanism, feature structures, which allows us to control an annotator's behavioral characteristics.

Using a Dedicated Feature Structure to Control Server Behavior

This time we will take RegExAnnotator.pear as the example, as we have defined more than 10 regexes and entities in RegExAnnotator, and the client specifies which entities it is interested in. The client specifies values for the feature org.apache.uima.entities:entities, such as "ssn,creditcard,email", and RegExAnnotator checks that setting and runs only the needed regexes.

Specify the Feature Value on the Client Side

First we have a properties file, uima.properties, which defines the mapping from entity name to UIMA type:

regex_type_ssn=org.apache.uima.ssn
regex_type_CreditCard=org.apache.uima.CreditCardNumber
regex_type_Email=org.apache.uima.EmailAddress

Check the Feature Value in RegExAnnotator

RegExAnnotator gets the value of org.apache.uima.entities:entities. If it is set, it checks all configured regex concepts and adds a concept to runConcepts only if that concept produces one of the UIMA types mapped from the requested entities.
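A hedged sketch of that selection logic (not the actual RegExAnnotator internals; Concept is a stand-in for the annotator's configured regex-concept objects):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

public class ConceptSelector {

  interface Concept {
    List<String> getProducedTypes();
  }

  static List<Concept> selectRunConcepts(String entitiesFeatureValue,
                                         Properties uimaProperties,
                                         List<Concept> allConcepts) {
    // Map each requested entity name (e.g. "ssn") to its UIMA type via uima.properties.
    // A real implementation may need a case-insensitive lookup of the property keys.
    Set<String> wantedTypes = new HashSet<String>();
    for (String entity : entitiesFeatureValue.split(",")) {
      String type = uimaProperties.getProperty("regex_type_" + entity);
      if (type != null) {
        wantedTypes.add(type);
      }
    }
    // Keep only the concepts that produce at least one of the wanted types.
    List<Concept> runConcepts = new ArrayList<Concept>();
    for (Concept concept : allConcepts) {
      for (String produced : concept.getProducedTypes()) {
        if (wantedTypes.contains(produced)) {
          runConcepts.add(concept);
          break;
        }
      }
    }
    return runConcepts;
  }
}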

The Problem

Today I was asked to take a look at one query issue: when the user searches for file, files, or file*, Solr returns matches correctly, but if the user searches for files*, Solr returns no match.

The Solution

A Google search found the solution on this page: Stemming not working with wildcard search.

Wildcards and stemming are incompatible at query time: you need to manually stem the term before applying your wildcard. Wildcards are not supported in quoted phrases; they will be treated as punctuation and ignored by the standard tokenizer or the word delimiter filter. In this case it is a PrefixQuery, which works similarly to a wildcard query. The solution is to add KeywordRepeatFilterFactory and RemoveDuplicatesTokenFilterFactory around the stem filter factory:
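The schema snippet itself is not included above; a hedged sketch of such an analyzer chain (the field type name and the other filters are assumptions) is:

<!-- KeywordRepeatFilterFactory emits both the original and the stemmed token;
     RemoveDuplicatesTokenFilterFactory drops the duplicate when stemming left
     the token unchanged. With the original token kept in the index, the prefix
     query files* can match again. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>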

The Problem

Today, after adding some dependencies to Maven, I found that Maven refused to compile. In the Problems view, it shows the error:

The container 'Maven Dependencies' references non existing library 'C:\Users\administrator\.m2\repository\jdk\tools\jdk.tools\1.6\jdk.tools-1.6.jar'

I checked my pom.xml; there is no direct dependency on jdk.tools-1.6.jar, so I used the Maven dependency:tree tool to figure out which library depends on it transitively:

mvn dependency:tree -Dverbose -Dincludes=jdk.tools
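The dependency:tree output is not shown here. One common workaround for this particular error, not necessarily the fix used in this case, is to declare jdk.tools explicitly as a system-scoped dependency pointing at the local JDK's tools.jar so the 'Maven Dependencies' container can resolve it:

<!-- Hedged sketch of a common workaround; adjust the version and systemPath to the local JDK. -->
<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <version>1.6</version>
  <scope>system</scope>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>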

The Problem

Today when I ran our Solr application on one machine, it reported a warning during startup:

Oct 6, 2014 7:25:15 PM org.eclipse.jetty.server.AbstractConnector doStart
WARNING: insufficient threads configured for SelectChannelConnector@0.0.0.0:12345

Trying an HTTP request in the browser gave no response; it just hung forever. I inspected the Solr server in VisualVM. The Threads tab showed 238 live threads, with a lot of selectors (128) and acceptors (72). This looked very suspicious:

qtp1287645725-145 Selector127
java.lang.Thread.State: BLOCKED
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:273)
at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:255)
at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:136)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked (a sun.nio.ch.Util$2)

Then I checked the code: when starting Jetty, the code sets the number of acceptors to twice the number of CPU cores. This machine has 64 cores, which causes Jetty to start 64*2 = 128 selectors and acceptors:

connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors());

The default number of acceptors is (Runtime.getRuntime().availableProcessors()+3)/4, which would be 16 in this case:

setAcceptors(Math.max(1,(Runtime.getRuntime().availableProcessors()+3)/4));

So to fix this issue, I just commented out the custom acceptors code: connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors());
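A hedged sketch of the corrected startup code against the Jetty 8 SelectChannelConnector API (the real application code is not shown in full):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.nio.SelectChannelConnector;

public class StartJetty {
  public static void main(String[] args) throws Exception {
    Server server = new Server();
    SelectChannelConnector connector = new SelectChannelConnector();
    connector.setPort(12345);
    // Problematic on big machines: 2 * 64 cores = 128 acceptors plus their selectors,
    // which exhausts the thread pool and triggers the "insufficient threads" warning.
    // connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors());
    // Leaving setAcceptors alone keeps Jetty's default, (cores + 3) / 4.
    server.addConnector(connector);
    server.start();
    server.join();
  }
}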

TermRangeFilter matches only documents containing terms within a specified range of terms.
It’s exactly the same as TermRangeQuery, without scoring.
NumericRangeFilter

FieldCacheRangeFilter
FieldCacheTermsFilter

QueryWrapperFilter turns any Query into a Filter, by using only the matching documents
from the Query as the filtered space, discarding the document scores.

PrefixFilter
SpanQueryFilter

CachingWrapperFilter is a decorator over another filter, caching its results to increase
performance when used again.
CachingSpanFilter
FilteredDocIdSet allows you to filter a filter, one document at a time. To use it, you
must first subclass it and define the match method in your subclass.
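As a hedged sketch against the Lucene 4.x API (the predicate is a toy example), a filter can wrap another filter's DocIdSet and keep only the documents accepted by match():

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredDocIdSet;
import org.apache.lucene.util.Bits;

// Wraps another filter and keeps only the documents accepted by match().
public class EvenDocsFilter extends Filter {
  private final Filter inner;

  public EvenDocsFilter(Filter inner) {
    this.inner = inner;
  }

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
    DocIdSet innerSet = inner.getDocIdSet(context, acceptDocs);
    if (innerSet == null) {
      return null; // the wrapped filter matched nothing in this segment
    }
    return new FilteredDocIdSet(innerSet) {
      @Override
      protected boolean match(int docid) {
        return docid % 2 == 0; // toy predicate for illustration
      }
    };
  }
}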

A good way to start is to read the code of Lucene's built-in collectors to learn how to build our own:

TotalHitCountCollector: just counts the number of hits.
public void collect(int doc) { totalHits++; }

PositiveScoresOnlyCollector: only passes a doc on to the wrapped collector if its score is positive.
if (scorer.score() > 0) { c.collect(doc); } // only include the doc if its score > 0

TimeLimitingCollector: uses an external counter and compares against the timeout in collect, throwing TimeExceededException if the allowed time has passed.
long time = clock.get();
if (timeout < time) { throw new TimeExceededException(timeout - t0, time - t0, docBase + doc); }

TestTimeLimitingCollector.MyHitCollector is also an example of a custom collector.

FilterCollector: a collector that filters out incoming doc ids that are not in the filter; used by grouping.
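As a hedged sketch against the Lucene 4.x Collector API, a minimal custom collector that simply gathers the global ids of all matching documents could look like this:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects the global (index-wide) ids of all matching documents, without scoring.
public class DocIdCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed for this collector
  }

  @Override
  public void collect(int doc) {
    docIds.add(docBase + doc); // convert the segment-relative id to a global id
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    docBase = context.docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}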