Is there any published research of decent quality linking news or other unstructured information to asset returns? I know that Thomson Reuters offers its Machine Readable News (MRN) feed, so somebody must be using it, but I can't find much in the public domain.

The selling point is that they provide a sentiment reading per news item so the user doesn't have to do any NLP.

If you have a Reuters sales rep, or contact one, they can send you several interesting research/white papers. Here are the ones I have been able to find online (my sales rep has provided me with better ones, but I didn't save them):

Deutsche Bank's Quantitative Strategy (US) team put together the following piece on this topic (note: their research is available only to clients, but I found that somebody had uploaded the piece to a sketchy website). In case the link dies, some of the academic papers they cite are:

This month we tackle another new dataset: news sentiment. Regular readers of our research
will know that this is a topic we find particularly interesting, and one that we have already
done a lot of work in. In this particular report, we take what we think is an innovative
approach to studying the predictive power of news sentiment; instead of using standard
linear models, we focus on three non-linear, “learning” type models: classification and
regression trees, forests of classification and regression trees, and multivariate adaptive
regression splines. All three of these models are unique in that they allow us to take a data-centric
approach to our analysis. Instead of predefining a hypothetical relationship and then
testing it, we allow the data to determine the form of the model. This allows us to better
understand which variables within our dataset are most important in determining post-event
abnormal returns. It also allows us to model complex non-linear relationships that may not be
apparent at first glance.

Overall we find that news sentiment, in conjunction with non-linear models, can generate
alpha. Even better, we find this alpha is relatively uncorrelated with the more traditional quant
factors. Of course, there is also a downside. The predictive ability of news sentiment is short-lived;
the best results are obtained when forecasting only the next five days. Therefore, for
some quantitative investors, the signal on its own may have too much turnover to be viable.
Nonetheless, we do show that there are ways for even lower-frequency investors to use
news sentiment data to enhance their stock-selection process.

Results
First of all, the author shows that there is, as expected, a statistical and economical
difference in the returns on news days compared to non-news days. Also, while the direction
of the difference is in accordance with the sentiment, the magnitude of the difference
doesn’t relate to the news being positive or negative. These differences in returns between
news and no-news days are actually heterogeneous among stocks: small and illiquid stocks
tend to react more strongly, as do low book-to-market and high volatility stocks. From an
industry point of view, the reactions also differ substantially, while still being significant, in
each group. Interestingly, Dzielinski finally finds that there is a risk premium attached to news
sensitivity, and that this phenomenon remains after controlling for well-known risk factors.
The monthly return on the hedge portfolio is significantly different from zero and stands at
0.95% on average. The strategy still exhibits some significant loadings on some risk factors,
as could have been expected from the panel regressions in sub-samples.
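
To make the first result above concrete, here is a minimal sketch of how one might test whether mean returns differ between news days and non-news days. It uses Welch's t-statistic, which is my own choice for illustration and not necessarily the test used in the paper. The return series are made-up numbers, and the names `welch_t`, `news_days` and `no_news_days` are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Made-up daily returns in percent, purely illustrative (not real data).
news_days = [0.8, 1.1, -0.2, 0.9, 1.4, 0.6]
no_news_days = [0.1, -0.1, 0.0, 0.2, -0.3, 0.1]

t_stat = welch_t(news_days, no_news_days)
print(f"Welch t-statistic (news vs. no-news days): {t_stat:.2f}")
```

A real study would of course use abnormal rather than raw returns and would split the news days by sentiment sign before comparing.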

A cautionary tale on all these approaches is told by Tim Loughran and Bill McDonald in the Journal of Finance, 2011 (When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks).

In their analysis they show that the commonly used Harvard Psychosociological Dictionary is inadequate for sentiment classification in a financial context. Their findings are specific to the analysis of 10-Ks, but are probably also indicative of the general difficulties with NLP in finance. Some findings:

Most misclassifications simply introduce noise in the estimates;

Some misclassifications introduce false positives (e.g. "cancer" is normally negative, but in a financial context it is neutral, since it most likely refers to an industry sector);

A simple long-short strategy based on positive/negative word counts yields small (positive) alphas that are not statistically significant.
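
The misclassification problem is easy to reproduce with a word-count scorer. The sketch below is only an illustration: both word lists are tiny toy sets I invented, not the actual Harvard or Loughran-McDonald dictionaries, and `negative_share` is a hypothetical helper name.

```python
import re
from collections import Counter

# Toy word lists, invented for illustration only. They are NOT the actual
# Harvard or Loughran-McDonald dictionaries.
GENERIC_NEGATIVE = {"liability", "cancer", "cost", "tax", "loss", "decline"}
FINANCE_NEGATIVE = {"loss", "decline", "impairment", "restated"}


def negative_share(text, negative_words):
    """Return the fraction of tokens in `text` found in `negative_words`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in negative_words)
    return hits / len(tokens)


# A neutral sentence of the kind that appears in a filing.
sample = "The company reported tax liability and cost of cancer drug research."

generic = negative_share(sample, GENERIC_NEGATIVE)
finance = negative_share(sample, FINANCE_NEGATIVE)
print(f"generic list: {generic:.2f}, finance-specific list: {finance:.2f}")
```

The generic list flags "tax", "liability", "cost" and "cancer" in a sentence that carries no negative information, while the finance-specific list flags nothing.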

There are of course several caveats:

This approach is "mainstream" academic finance, with all its pros and cons (pros: clean approach, reproducible, simplicity suggests a low chance of data-snooping; cons: not strictly speaking quantitative and, in this case, it doesn't use cutting-edge technology);

The results are based on long horizon portfolio returns (buy/short and hold strategy on a 12-month horizon);

The textual analysis is limited to low frequency information (10-Ks) as opposed to medium/high frequency information provided by news feeds.