Archive for the ‘Radical Syndication’ Category

It’s all very well publishing a research paper that describes the method for, and results of, analysing a dataset in a particular way, or a news story that contains a visualisation of an open dataset, but how can you do so transparently and reproducibly? Wouldn’t it be handy if you could “View Source” on the report to see how the analysis was actually done, or how the visualisation was actually created from an original dataset? And furthermore, how about if the actual chart or analysis results were created directly as a result of executing the script that “documents” the process used?

As regular readers will know, I’ve been dabbling with R – and the RStudio environment – for some time, so here’s a quick review of how I think it might fit into a reproducible research, data journalism or even enterprise reporting process.

The first thing I want to introduce is one of my favourite apps at the moment, RStudio (source on github). This cross-platform application provides a reasonably friendly environment for working with R. More interestingly, it integrates with several other applications and services:

RStudio offers support for the git version control system. This means you can save R projects and their associated files to a local, git controlled directory, as well as managing the synchronisation of the local directory with a shared project on Github. Library support also makes it a breeze to load in R libraries directly from github.
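For example, here's a minimal sketch of installing and loading a package straight from Github using the devtools library (the repo named here is purely illustrative, and note that the install_github() calling convention has varied across devtools versions):

# Install a package directly from a Github repo via devtools
install.packages("devtools")
library(devtools)
install_github("hadley/stringr")  # user/repo form
library(stringr)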

R/RStudio can pull in data from a wide variety of sources, mediated by a variety of community developed R libraries. So for example, CSV and XML files can be sourced from a local directory, or a URL; the RSQLite library provides an interface to SQLite; RJSONIO makes it easy to work with JSON files; wrappers also exist for many online APIs (twitteR for Twitter, for example, RGoogleAnalytics for Google Analytics, and so on).
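By way of a quick sketch of the sort of thing that's possible (the URLs, database and table names below are all placeholders):

# CSV from a URL
df <- read.csv("http://example.com/data.csv")

# JSON via RJSONIO
library(RJSONIO)
j <- fromJSON(paste(readLines("http://example.com/data.json", warn = FALSE), collapse = ""))

# A local SQLite database via RSQLite
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "local.sqlite")
head(dbGetQuery(con, "SELECT * FROM mytable"))
dbDisconnect(con)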

RStudio provides built-in support for two “literate programming” style workflows. Sweave allows you to embed R scripts in LaTeX documents and then compile the documents so that they include the outputs from/results of executing the embedded scripts in a final PDF. (So if the script produces a table of statistical results based on an analysis of an imported data set, the results table will appear in the final document. If the script is used to generate a visual chart, the chart image will appear in the final document.) The raw script “source code” that is executed by Sweave can also be embedded explicitly in the final PDF, so you can see the exact script that was used to create the reported output (stats tables of results, or chart images, etc). If writing LaTeX is not really your thing, RMarkdown allows you to write Markdown documents and again embed executable R code, along with any outputs directly derived from executing that code. Using the knitr library, the RMarkdown+embedded R code can be processed to produce an HTML output bundle: an HTML page plus supporting files (image files, javascript files, etc). Note that if the R code uses something like the googleVis R library to generate interactive Google Visualisation Charts, knitr will package up the required code into the HTML bundle for you. And if you’d rather generate an HTML5 slidedeck from your RMarkdown, there’s always Slidify (eg check out Christopher Gandrud’s course “Introduction to Social Science Data Analysis” – Slidify: Things are coming together fast, example slides and “source code”).
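To make the RMarkdown route concrete, here's a minimal sketch (the filename is illustrative): given an example.Rmd file containing markdown prose plus embedded R code chunks, knitr will run the chunks and weave their output into an HTML page:

# Knit an RMarkdown file to an HTML bundle (example.html plus
# supporting files); code chunks are executed and their output embedded
library(knitr)
knit2html("example.Rmd")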

In a recent addition, RStudio now integrates with RPubs.com, which means 1-click publishing of RMarkdown/knitr’d HTML to a hosted website is possible. Presumably, it wouldn’t be too hard to extend RStudio so that publication to other online environments could be supported. (Hmm, thinks… could RStudio support publication using Github pages maybe, or something more general, such as SWORD/Atom Publishing?!) Other publication routes have also been demonstrated – for example, here’s a recipe for publishing to WordPress from R.

Oh, and did I mention that as well as running cross-platform on the desktop, RStudio can also be run as a service and accessed via a web browser? So for example, I can log into a version of RStudio running on one of OU/KMi’s servers and access it through my browser…

Here’s a quick doodle of how I see some of the pieces hanging together. I had intended to work on this a little more, but I’ve just noticed the day’s nearly over, and I’m starting to flag… But as I might not get a chance to work on this any more for a few days, here it is anyway…

PS I guess I should really have written and rendered the above diagram using R, and done a bit of dogfooding by writing this post in RMarkdown to demonstrate the process, but I didn’t… The graph was actually rendered from a .dot source file using Graphviz. Here’s the source, so if you want to change the model, you can… (I’ve also popped the script up as a gist):

It’s generally taken as read that folk hate doing documentation*. This is as true of documenting data and APIs as it is of code. I’m not sure if anyone has yet done a review of “what folk want from published datasets” (JISC? It’s probably worth a quick tender call…?), but there have certainly been a few reports around what developers are perceived to expect of an API and its associated documentation and community support (e.g. UKOLN’s JISC Good APIs Management Report and API Good Practice reports, and their briefing docs on APIs).

* this is one reason why I think bloggers such as myself, Martin Hawksey and Liam Green Hughes offer a useful service: we do quick demos and getting started walkthroughs of newly launched services, demonstrating their application in a “real” context…

At a recent technical advisory group meeting in support of the Resource Discovery Taskforce UK Discovery initiative (which is aiming to improve the discoverability of information resources through the publication of appropriate metadata, and hopefully a bit of thought towards practical SEO…) I suggested that a Q and A site might be in order to support developer activities: content is likely to be relevant, pre-SEOd (blending naive language questions with technical answers), and maintained and refreshed by the community:-)

In much the same way that JISCPress arose organically from WriteToReply, the ad hoc initiative between myself and Joss Winn, I suggested that the question and answer site with a focus on data that I set up with Rufus Pollock might provide a running start for a UK Discovery Q&A site: GetTheData.

API connections to OSQA, the codebase that underpins GetTheData, are still lacking, but there are mechanisms for syndicating content via RSS feeds (for example, it’s easy enough to get a feed of tagged questions out, or of questions and answers relating to a particular search query); which is to say – we could pull ukdiscovery tagged questions and answers into the UK Discovery website developers’ area.
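As a sketch of the consuming end of that syndication (NB the feed path below is an assumption – check the site's actual feed URLs before relying on it), tagged questions could be pulled into R along these lines:

# Pull the RSS feed of questions for a given tag from an OSQA site
library(XML)
feed   <- xmlParse("http://getthedata.org/feeds/rss/?tags=ukdiscovery")  # assumed path
titles <- xpathSApply(feed, "//item/title", xmlValue)
links  <- xpathSApply(feed, "//item/link", xmlValue)
data.frame(title = titles, link = links)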

Another issue relates to whether or not developers would actually engage in the asking and answering of questions around UK Discovery technical issues. Something I’ve been mulling over is the extent to which GetTheData could actually be used to provide QandA styled support documentation for published data or data APIs, concentrating a wide range of data related Q&A content on GetTheData (and hence helping build community/activity through regularly refreshed content and a critical mass of active users) and then syndicating specific content to a publisher’s site.

So for example: if a data/api publisher wants to use GetTheData as a way of supporting their documentation/FAQ effort, we could set them up as an admin and allow them rights over the posting and moderation of questions and answers on the site. (Under the current permissions model, I think we’d have to take it on trust that they wouldn’t mess with other bits of the site in a reckless or malevolent way…;-)

API/data publishers could post FAQ style questions on GetTheData and provide canned, accepted (“official”) answers. Of course, the community could also submit additional answers to the FAQs, and answers that improve on the official one could be promoted to accepted answers. Through syndication feeds, maybe using a controlled tag filtered through a question submitter filter (i.e. filtering questions by virtue of who posted them), it would be possible to get a “maintained” list of questions out of GetTheData that could then be pulled in via an RSS feed into a third party site – such as the FAQ area of a data/api publisher’s website.

Additional activity (i.e. community sourced questions and answers) around the data/API on GetTheData could also be selectively pulled in to the official support site. (We may also be able to pull out the lists of people who are active around a particular tag???) In the medium term, it might also be possible to find a way of supporting remote question submission that could be embedded on the API/data site…

If any data/API publishers would like to explore how they might be able to use GetTheData to power FAQ areas of their developer/documentation sites, please get in touch:-)

And if anyone has comments about the extent to which GetTheData, or OSQA, either is or isn’t appropriate for discovery.ac.uk, please feel free to air them below…:-)

One of the many RSS related feature requests I put in when we were working on the JISCPress project was the ability to get a page level RSS feed out where each paragraph was represented as a separate item in the page feed.

WordPress already delivers a single item RSS feed for each page containing just the substantive content of the page (i.e. the content without the header, footer and sidebar fluff), which means you can do things like this, but what I wanted was for the paragraphs on each page to be atomised as separate feed elements.

Eddie implemented support for this, but I didn’t do anything with it at the time, so here’s an example of just why I thought it might be handy – paragraph level search.

At the moment, searching a document on WriteToReply returns page level results – that is, you get a list of search results detailing the pages on which the search term(s) appear. As you might expect with WordPress, we can get access to these results as a feed by shoving feed in the URI, like this: http://ouseful.wordpress.com/feed?s=test

First of all, grab the search feed for a particular query on a particular document into a Yahoo Pipe:

Rewrite the URI of each page linked to in the results feed as the full-fat, itemised paragraph feed for the page, and emit those items (that is, replace each original search results item with the set of paragraph items from that page).

The next step is to filter those paragraph feed items for just the paragraphs that contain the original search terms:

We need to rewrite the link because (at the time of writing) the page paragraphs feed doesn’t link to each paragraph; it links to the parent page (a bug report has been made;-)

Note that at the time of writing, there’s also a problem with the paragraph number reported in the link (again a report has been made), a workaround patch for which is included in this pipe.

What this means is that we now have a workaround for indexing into individual paragraphs using a search term. If we tag content at the paragraph level (e.g. by running a page-level paragraph feed, or a double-dipped search results feed, through OpenCalais), we can generate related search links into the document, or other documents on the platform, at a paragraph level, increasing the relevance, or resolution (in terms of increased focus), of the returned results.
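If Pipes aren't your thing, the same workaround can be sketched in R; note that the document URL and, in particular, the paragraph-level feed path below are placeholder assumptions rather than the real WriteToReply paths:

# 1. Grab the page-level search results feed for a query;
# 2. expand each result page into its paragraph-level feed;
# 3. keep only the paragraphs containing the original search term.
library(XML)
q <- "test"
search <- xmlParse(paste0("http://writetoreply.org/somedoc/feed/?s=", q))
pages  <- xpathSApply(search, "//item/link", xmlValue)

hits <- c()
for (p in pages) {
  paras <- xmlParse(paste0(p, "feed/paragraphlevel/"))  # assumed feed path
  for (item in getNodeSet(paras, "//item")) {
    txt <- xmlValue(item[["description"]])
    if (grepl(q, txt, ignore.case = TRUE))
      hits <- c(hits, xmlValue(item[["link"]]))
  }
}
hits  # links to the matching paragraphs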

Just by the by, the approach shown above is based on a search, expand and filter pattern (cf. a search within results pattern), in which a search query is used to obtain an initial set of results, which are then expanded to give higher resolution detail over the content, and then filtered using the original search query to deliver the final results. If a patent doesn’t already exist for this, then if I worked for Google, Yahoo, etc etc, you could imagine it being patented. B*****ds.

I don’t often do posts where I just link to or re-present content that appears elsewhere on the web, but I’m going to make an exception in this case, with an extended preview of a link on Martin Hawksey’s MASHe blog…

Anyway, whilst I was watching Virtual Revolution over the weekend (and pondering the question of Broadcast Support – Thinking About Virtual Revolution) I started thinking again about replaying twitter streams alongside BBC iPlayer content, and wondering whether this could form part of a content enrichment strategy for OU/BBC co-productions.

Which leads to a how-to post on Twitter powered subtitles for BBC iPlayer, in which Martin “come[s] up with a way to allow a user to replay a downloaded iPlayer episode subtitling it with the tweets made during the original broadcast.”

This builds on my Twitter powered subtitling pattern to create a captions file for downloaded iPlayer content using the W3C Timed Text Authoring Format. A video on Martin’s post shows the twitter subtitles overlaying the iPlayer content in action.
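For a flavour of what's involved, here's a hedged sketch that writes a couple of made-up tweets out as a minimal Timed Text captions file; the timings are invented and real iPlayer captions may need further attributes, so treat it as a toy:

# Write tweets out as a minimal W3C Timed Text (TTML) captions file
tweets <- data.frame(
  begin = c("00:00:05", "00:00:12"),
  text  = c("Here we go...", "Interesting point about the web"),
  stringsAsFactors = FALSE
)
tt <- c('<tt xmlns="http://www.w3.org/ns/ttml">',
        ' <body><div>',
        sprintf('  <p begin="%s" dur="00:00:05">%s</p>', tweets$begin, tweets$text),
        ' </div></body>',
        '</tt>')
writeLines(tt, "twitter_subtitles.xml")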

It is said that “fortune favours the prepared mind”, or at least the mildly obsessing one, so when I saw @danbri’s post on Local Video for Local People and realised that it was trivial to get hold of geocoded Youtube videos within a certain distance of a specified location via a simple Youtube API call, a quick DeliTV channel pipe was the obvious next step.
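The call itself was embedded in the original post; as a hedged reconstruction, the GData v2 API of the time supported location and location-radius parameters along these lines (the coordinates and search term are made up):

# Fetch geotagged Youtube videos within 10km of a lat/long,
# optionally filtered by a search term, via the GData v2 API
library(XML)
url <- paste0("http://gdata.youtube.com/feeds/api/videos",
              "?location=52.0406,-0.7594&location-radius=10km&q=market")
feed <- xmlParse(url)
# titles of the returned videos (namespace-agnostic XPath for brevity)
xpathSApply(feed, "//*[local-name()='entry']/*[local-name()='title']", xmlValue)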

To view the channel in Boxee, enter a location, and optionally a topic, and then either:

– run the pipe, and subscribe to the RSS feed directly in Boxee;
– bookmark the URI of the pipe. Enter the URI in your browser location bar according to the following pattern: http://pipes.yahoo.com/ouseful/delitv_local?l=required,location&q=optional search terms
Hit return, check the location and optional search terms are correct and the pipe is giving a plausible output, and then bookmark that page to a DeliTV tag on delicious. (When you bookmark the page, any spaces in your search terms should be replaced by %20, so the above would be bookmarked containing the characters optional%20search%20terms; a small R helper sketching this is given below.) If you then subscribe to that DeliTV channel via a DeliTV pipe that you have hooked up to your Boxee account, you will be able to watch the channel through Boxee.
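By way of illustration, here's a minimal R helper that builds a bookmarkable pipe URL of that form (the pipe address is as given above; the location and search terms are made up):

# Build a bookmarkable DeliTV local-channel pipe URL, %20-encoding
# any spaces in the location and optional search terms
delitv_url <- function(location, q = "") {
  u <- paste0("http://pipes.yahoo.com/ouseful/delitv_local",
              "?l=", gsub(" ", "%20", location))
  if (nzchar(q)) u <- paste0(u, "&q=", gsub(" ", "%20", q))
  u
}

delitv_url("Milton Keynes", "market stalls")
# http://pipes.yahoo.com/ouseful/delitv_local?l=Milton%20Keynes&q=market%20stalls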

(Hmm, I wonder, should these be sorted by relevance or recency? I think the default is relevance?)

If you want to define a variety of different topic channels around a particular location, or a set of channels on the same topic from different locations, bookmark each channel to delicious and subscribe to them all through the same DeliTV pipe :-)

As recent readers may know, I’ve been blogging lately over on the Arcadia Project Blog, a site I have authoring permissions on but not admin rights. At the moment, comments on the site seem to be disabled except to project team members (I’m not sure how they are whitelisted?), which is a bit of a pain because I want wider comments on the site.

So what to do? The blog is hosted on Blogspot, which means I can add embed codes and javascript to a post and hence embed a Disqus comment thread on each post I write.

Alternatively (or additionally), commenters who are running the enhanced Google Toolbar can comment on the page using Google Sidewiki.

Sidewiki is all well and good (or, errr, not – maybe it’s really evil…?) but unless you’re logged in to Google and running the Google toolbar, or you’ve got a Greasemonkey script or bookmarklet to check for Sidewiki comments related to a page, you’re probably not going to see the Sidewiki comments.

Handily, though, Sidewiki entries for a page are also available as a feed; if memory serves, the Sidewiki Data API exposes them at a URL along the lines of http://www.google.com/sidewiki/feeds/entries/webpage/webpageUri/full, where webpageUri is the URI of the page you want to see comments for, suitably encoded. In Javascript, I think encodeURIComponent(window.location) should do the trick…

How to get a JSON version of the feed, wrapped in a callback function that can be used to display the feed, is documented on the GData API site: Using JSON with Google Data APIs – just add ?alt=json-in-script&callback=myFunction.

Reusing the sample code on the GData site, it was easy enough to create a function to display a Sidewiki comment feed for a particular page:
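The original function was Javascript, but the same idea works server side too; here's a hedged R sketch (the Sidewiki feed URL pattern is an assumption on my part, so check it against the Sidewiki Data API docs before use):

# Fetch the Sidewiki comment feed for a page as JSON and print the
# entry titles; the feed URL pattern is an assumption
library(RJSONIO)
page <- URLencode("http://example.com/some/page", reserved = TRUE)
url  <- paste0("http://www.google.com/sidewiki/feeds/entries/webpage/",
               page, "/full?alt=json")
feed <- fromJSON(paste(readLines(url, warn = FALSE), collapse = ""))
# GData's JSON mapping puts text content in a "$t" element
for (entry in feed$feed$entry) cat(entry$title[["$t"]], "\n")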

I’m just such a glutton for punishment… the slightest external interest in things that might be OUseful, and like a whotsit chasing a doo dah, I can’t but bite… So for example: in Videographics from the Economist last week(?!), @deburca wrote:

The Economist now has an interesting section on videographics, each of which can be downloaded or embedded into blogs, teaching resources etc.
…
An RSS feed is also available which may be a useful channel addition for Tony Hirst’s Delitv project

After a brief sigh that they don’t make an OPML feed available, I produced a quick pipe that scrapes the page to generate a feed containing links to each of the different video ‘programme’ feeds, rewriting the http:// part of the URL to the rss:// protocol that Boxee expects:
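The pipe itself just does a bit of screenscraping and string surgery; the protocol rewrite at the heart of it is a one-liner, which in R would look something like this (the feed URL is a placeholder):

# Rewrite http:// feed URLs to the rss:// protocol that Boxee expects
feeds <- c("http://example.economist.com/videographics.rss")  # placeholder
gsub("^http://", "rss://", feeds)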