Now it is ready to be configured in the Eclipse GWT plugin. Open your Eclipse preferences, navigate to Google → Web
Toolkit, click the Add… button, browse to your new gwt-0-0-0 directory, and give it a name like GWT-TRUNK.

Now we are set up to run the trunk samples, but for some reason we still can't directly open them in Eclipse. A workaround I'm using is to create a new project with the Eclipse GWT plugin, then delete all of the files and copy in the ones from the sample you're trying to run. For example, to run the Showcase sample, do the following:

Create a new Web Application Project (the first entry in the "g" drop down menu in your toolbar). Call it samples-showcase, give it any package name, pick the GWT TRUNK SDK we set up earlier, deselect Use Google App Engine and Generate project sample code, then click Finish.

This will create the samples-showcase directory in your workspace directory. Change to this directory, delete the folders it created, and copy in the folders from the Showcase sample:
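Assuming a hypothetical trunk checkout and workspace location (adjust GWT_TRUNK and WORKSPACE to your own paths), the delete-and-copy step might look something like this in a terminal:

```shell
# Hypothetical paths - adjust to your own checkout and workspace locations.
GWT_TRUNK="${GWT_TRUNK:-$HOME/src/gwt-trunk}"
WORKSPACE="${WORKSPACE:-$HOME/workspace}"

# Only proceed if both locations actually exist.
if [ -d "$GWT_TRUNK/samples/showcase" ] && [ -d "$WORKSPACE/samples-showcase" ]; then
  cd "$WORKSPACE/samples-showcase"
  rm -rf src war                  # delete the folders the wizard generated
  cp -R "$GWT_TRUNK/samples/showcase/src" \
        "$GWT_TRUNK/samples/showcase/war" .
fi
```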

Go back to Eclipse and Refresh your samples-showcase project. At this point you might get an error in Eclipse - a red X mark on the war/WEB-INF folder, and in the Problems panel you'll see the message "The GWT SDK JAR gwt-servlet.jar is missing in the WEB-INF/lib directory". Right-click this message and choose "Quick Fix" to get the following dialog box:

Click Finish and it should be fixed.

You should now be able to run the sample (yay!). Right-click the samples-showcase project in the Package Explorer, choose Run As → Web Application, and the local URL should appear in your Development Mode view tab. Double click the URL to open it in your default browser, or right-click → Copy, and paste it into the URL bar of your preferred browser.

Tuesday, 24 January 2012

Our current project is using ElasticSearch for, er, search.
Hierarchical categorization of content found itself pinned to our sprint planning board this week and so I decided to take a look into what options were available.
Our classification is a pretty basic hierarchy dividing and redividing things into groups, where each new group is a sub-species of its parent group.
Essentially, at this stage we have two main requirements:

1. Hierarchical Search

By classifying X as:

/a/b/c/d

We must be able to search for and find X for the following paths:

/a
/a/b
/a/b/c
/a/b/c/d

(I'm using /a/b/c/d for brevity, but imagine /Computer Peripherals/Hard Drives/External/USB for example).

2. Drill Down with facets

To allow for searching from the more general to the more specific, we want to be able to return the children of a given facet.
e.g. when we ask for the direct children of /a as facets, the system should return /a/b.

A useful starting point came not from the ElasticSearch documentation but from the Solr wiki, which is a worthwhile read in my opinion. Much of what follows is explained in fuller detail there.

This works pretty well; however, using the PathHierarchyTokenizer alone makes it difficult to constrain the depth of the classification (Requirement 2). We therefore decided to follow the seemingly hacky but certainly workable idea (see facet.prefix in the Solr article) of storing depth information along with the path.

So our indexed category terms will now be stored as:

0/a
1/a/b
2/a/b/c
3/a/b/c/d

i.e. 1/a/b represents the fact that a/b is 1 level below /a.
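The encoding above can be sketched in plain Java. This is a stdlib-only model (class and method names are mine, not from the article's listings) of what the tokenizer-plus-filter combination emits for a path, along with the regex trick used later to pick out only the terms at a given depth:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DepthEncodedPaths {
    // Model of the tokens the PathHierarchyTokenizer plus a depth-encoding
    // filter would produce: "/a/b/c" -> ["0/a", "1/a/b", "2/a/b/c"].
    static List<String> encode(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;        // skip the leading slash
            current.append('/').append(part);
            tokens.add(depth++ + current.toString());
        }
        return tokens;
    }

    // Select only the tokens at a given depth beneath a prefix, e.g.
    // depth 1 under "/a" matches "1/a/b" but not "0/a" or "2/a/b/c".
    static List<String> atDepth(List<String> tokens, int depth, String prefix) {
        Pattern p = Pattern.compile(depth + Pattern.quote(prefix) + "/[^/]+");
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (p.matcher(t).matches()) out.add(t);
        }
        return out;
    }
}
```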
With this information we can now use a regex pattern to get facets for a particular depth - more on this later in the article.

Encoding the depth

We gave some thought to how we might go about encoding depth along with our categories and opted for writing a custom TokenFilter with input provided from the token stream of the PathHierarchyTokenizer.

The remainder of this article is intended to illustrate how to write and test a TokenFilter as well as how to deploy and reference it for use with ElasticSearch. The full source code is available on GitHub.

We are using Maven 3 for managing this customisation, so having created a new Maven (jar) project in my IDE, I set up the required dependencies (Full POM is here).

On lines 3-8 we create an Analyzer for use with the test.
TokenCountFilter is the filter that we are going to write to amend the tokens produced by the instance of PathHierarchyTokenizer, which is passed as an argument to the constructor on line 6.

The javadoc for the testTokens method describes how the test will work. Note that most of the work is managed by BaseTokenStreamTestCase on line 38. This Lucene test class provides the static method assertAnalyzesTo, which accepts an Analyzer to test, an input string to analyze, e.g. "/a/b/c", and an array of expected output tokens, e.g. ["0/a", "1/a/b", "2/a/b/c"], which should be emitted after tokenization. Internally BaseTokenStreamTestCase uses the JUnit framework to make test assertions based on the input and expected output.

Step 2 Write the TokenFilter

Note: This may not be the most efficient way of achieving our goal, but it performs well enough in our tests. It is important to note that token filters operate on every token produced for a stream, so one must consider their use carefully. Our category paths will seldom contain more than 5 elements, and consequently this filter will not greatly affect performance when used with the PathHierarchyTokenizer. Any enhancements we make will be posted in a future article, and of course we welcome any advice from your own experience.

For each call to input.incrementToken (line 15) that returns true we end up with a new token in the termAtt buffer (declared on line 3). The remainder of the method resizes the termAtt buffer and refills it with the value of count followed by the original chars from the buffer, giving us the required depth encoded before the term in the index.
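Since the original listing isn't reproduced here, a stdlib-only model of that buffer rewrite may help (names are mine; the real filter manipulates a Lucene CharTermAttribute via methods such as resizeBuffer and copyBuffer):

```java
class DepthPrefix {
    // Model of the rewrite inside incrementToken(): prepend the current
    // depth count to the chars already sitting in the term buffer, so
    // e.g. "/a/b" at count 1 becomes "1/a/b".
    static char[] prependDepth(char[] termBuffer, int length, int count) {
        String prefix = Integer.toString(count);
        char[] resized = new char[prefix.length() + length];
        // Write the depth digits first...
        prefix.getChars(0, prefix.length(), resized, 0);
        // ...then copy the original term chars in after them.
        System.arraycopy(termBuffer, 0, resized, prefix.length(), length);
        return resized;
    }
}
```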

Running our test class will now give us a green bar.

Step 3 Create a factory to provide the TokenFilter to ElasticSearch

This code is fairly self-explanatory and is essentially boilerplate that follows the same pattern as the other filter factories in ElasticSearch, i.e. a factory method creating a new instance of our TokenCountFilter around the TokenStream.

Step 4 Make the new code available to ElasticSearch

Running:

mvn clean package

will compile the code and generate a jar file. This jar should then be added to $ES_HOME/lib. A restart of ElasticSearch will ensure that our classes get loaded.

Step 5 Try it out

First we can set up a new 'test' index which will use our TokenFilter as part of its analysis:
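The original settings listing isn't shown here, but a sketch of what the index-creation request body might look like follows (the fully qualified package org.example.analysis is a placeholder for wherever the factory class actually lives):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "depth_path_hierarchy": {
          "type": "custom",
          "tokenizer": "path_hierarchy",
          "filter": ["token_count"]
        }
      },
      "filter": {
        "token_count": {
          "type": "org.example.analysis.TokenCountFilterFactory"
        }
      }
    }
  }
}
```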

Lines 9-11 define our new "token_count" filter. Note that we actually define the type as TokenCountFilterFactory, which is the class responsible for creating our TokenFilter.
Lines 6-8 define a "depth_path_hierarchy" analyzer. Note that we specify the "path_hierarchy" tokenizer, which will provide the tokens for our "token_count" filter.

Next we set up a mapping for a type called 'content' which we shall be storing in our 'test' index.

A mapping defines how a document should be mapped to the search engine. In this case we have specified that the "depth_path_hierarchy" analyzer should be used when indexing our "categories" property. Note: for this analyzer to be used solely for indexing (i.e. not for search) we used the key "index_analyzer" in our request. When creating a mapping for a property it is also possible to specify which analyzer is to be used when searching by adding the key/value "search_analyzer" : "<analyzer name>", or one can simply use the single key/value "analyzer" : "<analyzer name>", which will use the same analyzer for both indexing and searching the property. Further details regarding mapping can be found here.
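A sketch of such a mapping request body for the 'content' type, using the "index_analyzer" key as described, might look like:

```json
{
  "content": {
    "properties": {
      "categories": {
        "type": "string",
        "index_analyzer": "depth_path_hierarchy"
      }
    }
  }
}
```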

We are now in a position to add some test content to our 'test' index. (Take note of the categories)
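The original documents aren't reproduced here; two sketch documents (titles made up), one per index request, consistent with the categories that appear in the facet results later in the post, might be:

```json
{ "title": "Content one", "categories": "/foo/bar" }

{ "title": "Content two", "categories": "/foo/baz" }
```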

The search shown below verifies that we can find both pieces of content beneath the root "0/foo" category, which means that Requirement 1 is satisfied.
Note that we use a term query, which is not analyzed: we want to search for an entire term, e.g. /a/b/c/d, without that term being tokenized in any way. Otherwise a search for /a/b/c/d would itself be tokenized into path elements (if we used the "depth_path_hierarchy" analyzer, for example) and could therefore match something categorized more generally, /a/b/c for example, which we don't want.
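A sketch of such a term query against the "categories" field (the "0/foo" value is from the post; the surrounding request body is mine):

```json
{
  "query": {
    "term": { "categories": "0/foo" }
  }
}
```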

Note the regex on line 9. Here we are limiting the facets returned to those that start with 1/foo, i.e. a depth of 1 beneath /foo.
The response shows that we get the required facet terms, '1/foo/baz' and '1/foo/bar'.
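Since the original facet request isn't shown, a sketch of what it might look like using the terms facet's regex parameter follows (the facet name "categories" and the match_all query are my assumptions; the "1/foo.*" regex is the one described in the post):

```json
{
  "query": { "match_all": {} },
  "facets": {
    "categories": {
      "terms": {
        "field": "categories",
        "regex": "1/foo.*"
      }
    }
  }
}
```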

So that wraps up this rather long post. No doubt there are numerous approaches to solving this kind of problem, but this article was intended partly as a solution and more generally as an attempt to illustrate some features of ElasticSearch that you may not be familiar with.

Tuesday, 13 December 2011

MacPorts was hanging during my install of the rabbitmq-server port.
Running port -v showed me that an error was occurring during Erlang compilation (RabbitMQ is written in Erlang).
The issue, it seems, is due to the optimisation settings used by llvm-gcc-4.2.
The solution is to patch the Erlang Portfile.
Edit:

Wednesday, 30 November 2011

I'm sure I'm just way behind the times but I never knew about the Java Virtual Machine Process Status Tool even though it's been around since Java 1.5.
Running jps on the command line with no options shows me that my Hadoop processes are running.

Do note, though, that what looks like the OS process id is actually the local VM identifier, which according to the jps docs "is typically, but not necessarily, the operating system's process identifier for the JVM process".