After looking at the results of my brief foray into sentiment analysis of tweets a couple of weeks ago, and reading about the problem, it became clear that pre-processing may well help clean up the data and improve training. The goal is to reduce the number of possible features. Put simply, there are too many different words, and a lot of them are too noisy!

There are various techniques to do this, such as removing stop words ("and", "the" etc. — that is, words that don't add to the sentiment), and stemming to reduce variants of the same word (eg plurals and other endings) to a single token.

In Java the Lucene libraries help a great deal here. Here's how to remove stop words using Lucene's StopFilter:
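A minimal sketch, against the Lucene 3.6 analysis API (the relevant classes moved around in later Lucene versions, so adjust the imports to taste):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        String tweet = "I love the whole world and everything in it";

        // Tokenize, normalise, then drop the English stop words
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36, new StringReader(tweet));
        stream = new StandardFilter(Version.LUCENE_36, stream);
        stream = new StopFilter(Version.LUCENE_36, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString());  // surviving tokens only
        }
        stream.end();
        stream.close();
    }
}
```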

Note that this assumes that the language is English; you'll have to find your own list of stop words for other languages. This example also uses the StandardFilter, which is also useful for tokenization - it recognises things like email addresses so they are tokenized correctly.

Stemming can also be achieved with the help of Lucene, via the PorterStemmer:
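A sketch along the same lines: Lucene's PorterStemFilter applies the Porter stemming algorithm to a token stream. It expects lower-cased input, hence the LowerCaseFilter in front of it:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StemmingDemo {
    public static void main(String[] args) throws Exception {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36,
                new StringReader("The cats are chasing the mice"));
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        stream = new PorterStemFilter(stream);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // variants collapse to a common stem, eg "cats" -> "cat"
            System.out.println(term.toString());
        }
        stream.close();
    }
}
```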

I've been reworking several Google App Engine Java apps recently to introduce Google Guice and Objectify to them. Guice is a lightweight dependency injection framework, and Objectify is a superb replacement for JDO/JPA in your Java GAE projects.

If you're familiar with Spring then you'll find this a doddle. There's no XML in sight - Guice concentrates pretty much only on dependency injection, and the Java-based configuration classes one uses instead of XML seem perfectly adequate for this.

It also works nicely with Objectify, a data access API for App Engine. Take a look at the examples; they are extremely straightforward:
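A sketch in the Objectify 3 style (version 4 later moved to a fluent ofy().save() API); the Car entity here is my own illustration rather than one from their docs:

```java
import javax.persistence.Id;

import com.googlecode.objectify.Objectify;
import com.googlecode.objectify.ObjectifyService;

public class Car {
    @Id Long id;       // auto-generated on put()
    String colour;

    public static void example() {
        ObjectifyService.register(Car.class);   // once, at startup

        Objectify ofy = ObjectifyService.begin();

        Car car = new Car();
        car.colour = "red";
        ofy.put(car);                           // save; populates car.id

        Car fetched = ofy.get(Car.class, car.id);
    }
}
```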

The API lets you upload a set of training data. It will then create a model which you can interrogate. Training data is stored in Google Cloud Storage, and the API is accessible via REST, secured by OAuth in the usual Google style.

To get a good idea of what's involved I recommend reading the Hello Prediction! tutorial. I pretty much followed their example, except instead of detecting the language I used it to detect sentiment.

I had to refine my aforementioned training data into a form suitable for the API. In this case that just means it has to be a CSV file like so:

"positive","I love the whole world and everything in it"
"negative","You guys suck"
"neutral","Cheese is a kind of dairy product"

After following the steps described in the tutorial I was then in a position to query the model with an actual example taken from the positive data set.
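For context, the query itself, sketched against the v1.5 google-api-services-prediction client (the model id "sentiment" and the construction of the authorised Prediction client are my own assumptions), looks roughly like this; the returned Output carries the chosen label plus a score per category:

```java
import java.util.Arrays;

import com.google.api.services.prediction.Prediction;
import com.google.api.services.prediction.model.Input;
import com.google.api.services.prediction.model.Output;

public class SentimentQuery {
    // 'prediction' is an authorised Prediction client, built elsewhere
    public static Output predict(Prediction prediction, String text) throws Exception {
        Input.InputInput inputInput = new Input.InputInput();
        inputInput.setCsvInstance(Arrays.<Object>asList(text));

        Input input = new Input();
        input.setInput(inputInput);

        // "sentiment" is whatever id the model was trained under
        return prediction.trainedmodels().predict("sentiment", input).execute();
    }
}
```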

Note that it doesn't give a unanimous positive vote, although it clearly chooses positive as the most likely category. I suspect this is because there is a lot more neutral data in the training set than either positive or negative, so that there is always a tendency to treat things as neutral. This is a useful quality where borderline cases are involved.

The other thing worth noting is the suspicious looking 2/3 and 1/3 score values themselves. Playing around with different queries always shows this 1/3 to 2/3 split, never any other numbers. I don't know what the cause of this is.

I need to spend some more time with this model, and probably get some more training data. One thing I will say is that it's both easy to use and fast. In Java terms the google-api-java-client covers a lot of ground here. I will post more on developing with the Prediction API, and on how well it performs, in future posts.

One of the great things about Google App Engine is, if you stay inside the box, so to speak, many things are a doddle. So much so that I was able to create this new app, Googomi, in a day or two, most of which involved fiddling with and learning about the Google+ API.

The Googomi app is a very simple beast with only one purpose: it will take your public Google+ stream and turn it into an RSS feed.

I've put a modicum of processing into it, so that it should correctly guess the most appropriate title for each RSS item, eg choosing the annotation, or the remote URL's title, where appropriate.

I personally had a use case for this (apart from learning about various Google APIs) whereby I wanted to export Google+ posts to other services automatically. For example, with this I can post from Google+ to Buffer and then beyond automatically.

I've been playing with what the Google+ API has to offer and I've found it quite easy to integrate into my Google App Engine apps using the google-api-java-client.

I initially followed the Quick start for Java tutorial with regard to creating the OAuth tokens and so forth, but the google-api-java-client has some good tutorials regarding making the actual OAuth calls. See for example this section about how to make the calls from a Google App Engine app. The library handles all the plumbing for you.

I only had to make one amendment to their example. I found that the refresh token wasn't being returned along with the access token after it was granted. However, this was simply fixed by adding a call to setApprovalPrompt("force") on the GoogleAuthorizationCodeFlow.Builder, like so:
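A sketch of the builder (the transport/JSON factory choices and the clientId/clientSecret variables are illustrative; note that setAccessType("offline") is also typically needed for a refresh token to be issued at all):

```java
import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleAuthorizationCodeFlow;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.jackson.JacksonFactory;

// clientId and clientSecret come from the Google API console
GoogleAuthorizationCodeFlow flow = new GoogleAuthorizationCodeFlow.Builder(
        new NetHttpTransport(), new JacksonFactory(),
        clientId, clientSecret,
        Collections.singleton("https://www.googleapis.com/auth/plus.me"))
    .setAccessType("offline")      // ask for a refresh token...
    .setApprovalPrompt("force")    // ...and return it on every grant
    .build();
```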

I've been delving into some twitter sentiment analysis and have been casting about for some useful training data. I've found various sources but few have any neutral data, which I think is important for any training as a sort of control.

One useful source is Sanders Analytics, which has a source of tweet ids and a script to download the actual tweets from the ids (Twitter's terms & conditions do not allow the tweets themselves to be distributed).

This script takes a couple of days to download all the tweets because it has to honour Twitter's API limits.

I found one issue in the script which is easily fixed. It could cope with the presence of "error" in the response, but not "errors", eg:

{"errors":[{"message":"Sorry, that page does not exist","code":34}]}

The simple fix is to add this to the parse_tweet_json function, after the error check:

if 'errors' in tweet_json:
    raise RuntimeError('errors in downloaded tweet')

When the script finishes it will produce a file called full-corpus.csv. The final data has this format:
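Schematically, with placeholders standing in for the per-tweet values, each line looks like:

```
"apple","positive","<tweet id>","<date>","<tweet content>"
```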

That is, the subject, the sentiment, the tweet id, the date and the tweet content.

The subject is what the tweet is about. This is important: the sentiment refers to the subject (in this case "apple"), and not to anything else in the tweet content.

Regardless, for my purposes I actually need the tweet content without the subject. This can be achieved simply with grep and awk. Eg to extract the neutral tweets:
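A sketch of that extraction, assuming the five-field layout above (the quoting is naive, so it will mangle any tweet that happens to contain the `","` separator):

```shell
# keep only the neutral rows, then print the fifth field (the tweet text)
grep '","neutral","' full-corpus.csv \
  | awk -F'","' '{print $5}' \
  | sed 's/"$//' > neutral-tweets.txt
```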

A couple of years ago or so Heroku and Facebook teamed up to make creating Facebook apps a doddle. Indeed one can do so with a few clicks from the app creation centre in Facebook if you already have a Heroku account.

Here are pretty comprehensive instructions from Heroku on how to do this, and I can attest that it all works well.

I've added to this setup with a staging instance for team testing purposes using the facility Heroku has for managing different environments by pushing to different remotes. See this handy guide for full details.

To create a staging remote called staging:

heroku create --remote staging

And to add Facebook app credentials for the staging version of your app just do:
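Something like the following, assuming the FACEBOOK_APP_ID and FACEBOOK_SECRET config vars that the Heroku Facebook template uses (values elided; heroku config:add was later renamed config:set):

```shell
heroku config:add FACEBOOK_APP_ID=<staging app id> FACEBOOK_SECRET=<staging app secret> --remote staging
```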

One of the fiddly steps in setting up and maintaining a web app is managing all the various JavaScript libraries your pages use. But it's quite easy to manage resources like jQuery in Maven thanks to WebJars. Here's how to use it in Dropwizard.

If you take a look at WebJars you'll see all sorts of supported libraries. I'll use jQuery in this example.
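First add the org.webjars:jquery dependency to your pom. WebJars put each library on the classpath under META-INF/resources/webjars, so serving them from Dropwizard is a single AssetsBundle. A sketch against the 0.6-era com.yammer packages (MyConfiguration stands in for your own configuration class; package names shifted between Dropwizard releases):

```java
import com.yammer.dropwizard.Service;
import com.yammer.dropwizard.assets.AssetsBundle;
import com.yammer.dropwizard.config.Bootstrap;
import com.yammer.dropwizard.config.Environment;

public class MyService extends Service<MyConfiguration> {

    @Override
    public void initialize(Bootstrap<MyConfiguration> bootstrap) {
        // serve everything inside the webjars at /webjars/*
        bootstrap.addBundle(new AssetsBundle("/META-INF/resources/webjars", "/webjars"));
    }

    @Override
    public void run(MyConfiguration configuration, Environment environment) {
        // register resources as usual
    }
}
```

jQuery is then available to your pages at a path like /webjars/jquery/1.9.1/jquery.min.js, matching the version of the artifact in your pom.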

Twitter4J is an excellent Java library for all sorts of Twitter work. I've been using it recently to connect to the "garden hose", ie Twitter's streaming API. Here's how to follow a particular user with it.
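A sketch, assuming Twitter4J 3.x with OAuth credentials supplied via twitter4j.properties; the user id is a placeholder, since filter() follows numeric user ids rather than screen names:

```java
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class FollowUser {
    public static void main(String[] args) {
        TwitterStream twitterStream = new TwitterStreamFactory().getInstance();

        twitterStream.addListener(new StatusListener() {
            @Override public void onStatus(Status status) {
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
            @Override public void onDeletionNotice(StatusDeletionNotice notice) {}
            @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) {}
            @Override public void onScrubGeo(long userId, long upToStatusId) {}
            @Override public void onStallWarning(StallWarning warning) {}
            @Override public void onException(Exception e) { e.printStackTrace(); }
        });

        FilterQuery query = new FilterQuery();
        query.follow(new long[] { 12345L });  // placeholder: the numeric id of the user
        twitterStream.filter(query);          // opens the streaming connection
    }
}
```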

I've been playing with the Google App Engine Backend service recently. Google's documentation on this is clear but it could really do with a tutorial. This post isn't anything like that desired tutorial, but rather a recap of what I discovered and ended up with while trying to address my particular need, which may be helpful!

I've been using the Backend in a game engine I'm writing. I'm using Google's Channel API to send messages to the client, but the game needs a background thread to handle the game engine itself. More precisely, each game process is likely to run longer than the 60-second limit normally imposed on GAE requests. This is where the backend comes in.
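The core of what I ended up with looks roughly like this (a sketch; the per-game task parameters are omitted):

```java
import com.google.appengine.api.backends.BackendService;
import com.google.appengine.api.backends.BackendServiceFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// resolve the address of the "zengine" backend instance
BackendService backendService = BackendServiceFactory.getBackendService();
String backendAddress = backendService.getBackendAddress("zengine");

// enqueue a task that will be routed to the backend via the Host header
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder
        .withUrl("/zengine")
        .header("Host", backendAddress));
```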

This will create a GAE queue task which will execute the path /zengine in the backend environment called zengine. The code backendService.getBackendAddress("zengine") creates an address that works both in the production and local development environments. This is crucial, because the dynamic instances have an otherwise random and unguessable address - in dev environments they are on a random port!

By passing in the Host header with the address of the backend my /zengine servlet will run in the zengine backend environment instead of the normal, 60-second-limited one.

The actual task item will in practice contain parameters that specify which game process to start, ie for what user and what game. My /zengine path is mapped to the game process itself. It merely needs to start the process. Using the supplied queue task parameters I can get a handle on the channel for communication back to the client, and start the game.

Searching about, I found a similar problem in previous versions of the plugin from a couple of years ago. Unfortunately the solutions given there no longer work, because neither the appengine-local-runtime nor the appengine-tools-api artifact is available any more, at least from the central Maven repository.

However, a bit of trial and error gave me this solution: include appengine-tools-sdk instead, as follows:
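That is, something along these lines in the plugin's dependencies section (the version should match your installed SDK; 1.7.0 here is just illustrative):

```xml
<dependency>
  <groupId>com.google.appengine</groupId>
  <artifactId>appengine-tools-sdk</artifactId>
  <version>1.7.0</version>
</dependency>
```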

I found this blog posting by Gunawan Deng very useful in getting me started on using Maven to create Google App Engine projects.

Here's another tip. I ran into an error when trying to run the app locally with gae:run:

[ERROR] Failed to execute goal net.kindleit:maven-gae-plugin:0.9.5:run (default-cli) on project extropymvnspike: ${gae.home} is not a directory: ${gae.home} is not a directory:
[ERROR] Please set the sdkDir configuration in your pom.xml to a valid directory. Make sure you have correctly extracted the app engine sdk.

The error was easy to fix: one can add a gae.home property in your pom.xml which should point to the installation directory of your App Engine SDK. However, I found it more useful to add it to my settings.xml, as I imagined I would want it for all my GAE projects:
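In settings.xml the property needs to live in an active profile; a sketch with a placeholder path:

```xml
<profiles>
  <profile>
    <id>gae</id>
    <properties>
      <gae.home>/path/to/appengine-java-sdk</gae.home>
    </properties>
  </profile>
</profiles>

<activeProfiles>
  <activeProfile>gae</activeProfile>
</activeProfiles>
```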

Dropwizard is an incredibly useful framework for creating REST APIs very quickly. One thing that it doesn't come with out of the box (yet) is support for session-based security, that is, holding principal information in a plain old HttpSession.

There are excellent reasons not to do this, particularly for REST APIs. Session management can be a fiddly business that isn't particularly scalable. However, sometimes you need it anyway.

Here's how to add a simple annotation-based scheme, drawing heavily on the very useful posting by Antoine Vianey.

Code

Let's add an annotation to represent a logged in user, i.e. a user that's been set into the session as an attribute:
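A sketch of the annotation itself (the Jersey injectable provider that resolves it from the HttpSession, as in Antoine's post, is wired up separately):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Marks a resource method parameter to be injected with the
 * user previously stored in the HttpSession.
 */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
public @interface SessionUser {
}
```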

Conclusion

This is an easy and flexible way to add session management to your resources when you need it. As Antoine notes, one can create further annotations, eg a @SessionAdminUser, if you want to restrict different methods to different roles.

I've been trawling about looking for old Inform (Z-Machine) games to play, and I came across this rather pleasant example of Inform 7 code. This produces an actual running game, of a sort:

Will's Study is a room. The desk is here.
A hastily handwritten note is on it.
Description is "It's from your friend Shakespeare:
'I've gone to lunch. You'll have to write the sonnet.'"
Composing is an action applying to nothing.
The quill is a thing that is in the study.
Understand "write sonnet" as composing.
Description of the quill is "Old and cruddy".
Instead of composing when the player
has no quill, say "You have not got the quill."
Instead of composing, say "And... done. 'Heya',
says Will, returning. You say, 'Hello, Will!'
Says Shakespeare, 'Thank you for the time you've taken!
You really are a pal, Sir Francis Bacon.'"

I might have to give this a go myself. Maybe some interactive system documentation...