Status of this Blog

This blog was used to support the JISC PoWR project, which ran from April 2008 to November 2010. The project has delivered its outputs and is now complete. The blog has been frozen and we do not intend to publish any new posts.

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?

Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc.), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about WordPress is the number of uses to which its simple blog model has been put. Under the hood it is based on a remarkably simple database schema of about 10 tables, plus a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (WordPress MU, BuddyPress).
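To give a flavour of how simple that schema is, here is an illustrative sketch of the core tables of a standard WordPress install (the exact set varies a little by version, and the "wp_" prefix is configurable):

```python
# Illustrative sketch of the core WordPress tables, not a version-exact schema.
WORDPRESS_CORE_TABLES = {
    "wp_posts": "posts, pages and revisions",
    "wp_comments": "comments attached to posts",
    "wp_users": "registered user accounts",
    "wp_usermeta": "per-user metadata",
    "wp_postmeta": "per-post custom fields",
    "wp_commentmeta": "per-comment metadata",
    "wp_terms": "categories and tags",
    "wp_term_taxonomy": "the taxonomy type of each term",
    "wp_term_relationships": "links posts to terms",
    "wp_options": "site-wide settings",
    "wp_links": "the blogroll (the Links feature FeedWordPress builds on)",
}
```

Everything a blog displays — posts, comments, attribution, categorisation — reduces to rows in these few tables, which is what makes the platform so malleable.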

Another possibility, exploited by a third-party plugin, is using WordPress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

The plugin in question is called FeedWordPress. It uses the Links feature of WordPress as the basis of a list of feeds which it checks regularly, importing new content when it finds it, as Posts within WordPress.

I installed FeedWordPress a while ago on ULCC’s DA Blog, and set it up to import all of the ULCC-contributed posts to JISC-PoWR, i.e. those by Ed Pinsent, Kevin Ashley and myself. I did this because I felt that these contributions warrant being part of ULCC’s institutional record of its activities, and that DA Blog was the best place to do this, as things stand.

JISC-PoWR also runs on WordPress, so I knew that, thanks to WordPress’s REST-like interface and Cool URIs, it is easy to select not only an individual author’s posts (/author/kevinashley) but also the RSS feed thereof (/author/kevinashley/feed). This, for each of the three author accounts, was all I needed to set up FeedWordPress in DA Blog to take an automatic copy each time any of us contributed to JISC-PoWR. The “author” on the original post has been mapped to an author in DA Blog, so posts are automatically (and correctly) attributed. The import also preserves, in custom fields, a considerable amount of contextual information about the posts in their original location.
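The mechanics are easy to sketch. The snippet below (a minimal illustration, not FeedWordPress’s actual code: the base URL, author slugs and author mapping are placeholders) builds the per-author feed URLs and shows how the dc:creator element in a WordPress RSS item can be mapped to a local author account:

```python
import xml.etree.ElementTree as ET

BLOG = "https://blog.example.org"  # placeholder for the source blog's base URL
AUTHORS = ["edpinsent", "kevinashley", "rorydavis"]  # illustrative account slugs

# WordPress exposes a per-author RSS feed at /author/<slug>/feed
feed_urls = [f"{BLOG}/author/{slug}/feed" for slug in AUTHORS]

# An abbreviated RSS 2.0 item of the kind those feeds return.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog - posts by kevinashley</title>
    <item>
      <title>Example post</title>
      <link>https://blog.example.org/2009/03/example-post/</link>
      <dc:creator>kevinashley</dc:creator>
      <guid>https://blog.example.org/?p=123</guid>
    </item>
  </channel>
</rss>"""

DC = "{http://purl.org/dc/elements/1.1/}"
item = ET.fromstring(SAMPLE_RSS).find("channel/item")
post = {
    "title": item.findtext("title"),
    "link": item.findtext("link"),
    "guid": item.findtext("guid"),
    # Map the source author slug to a local account, as FeedWordPress does.
    "author": {"kevinashley": "Kevin Ashley"}.get(item.findtext(f"{DC}creator")),
}
print(post["author"])  # → Kevin Ashley
```

Anything in the item that doesn’t map onto a WordPress post field (source URL, source feed, and so on) can be stowed in custom fields, which is how the contextual information mentioned above survives the import.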

In many cases, I’ve kept the imported posts private in DA Blog. “Introductory” posts for the JISC-PoWR project blog, for example: as editor of DA Blog, I didn’t feel we needed to trouble our readers there with them; nevertheless they are stored in the blog database as part of “the record” of our activities.

This is, admittedly, a very small-scale test of the approach, but the kind of system I’ve described is unquestionably a rudimentary blog archive that can be set up relatively easily using WordPress and FeedWordPress – no coding necessary. Content is then searchable, sortable and exportable (SQL, RSS, etc.). (Note, by the way, what happens when you use the Search box on the JISC-PoWR blog copy in UKWAC: that won’t happen with this approach!)

For organisations with many staff blogging on diverse public platforms this would be one approach to ensuring that these activities are recorded and preserved. UKOLN, for example, manages its own blog farm, while Brian and Marieke have blogs at WordPress.com (as well as contributing to this one), and Paul Walk appears to manage his own blog and web space. This kind of arrangement is not uncommon, nor is the problem of how an institution gets a grasp on material in all these different locations (it has been at the heart of many JISC-PoWR workshop discussions). A single, central, self-hosted, aggregating blog, automatically harvesting the news feeds of all these blogs, might be a low-cost, quick-start approach to securing data in The Cloud, and safeguarding the corporate memory.
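The harvesting logic at the heart of such an aggregator is not complicated. Here is a minimal sketch in the spirit of FeedWordPress (not its actual implementation): poll a list of feeds, and import only the items whose RSS guid has not been seen before, keeping a note of which feed each item came from. The `fetch_items` argument is a hypothetical stand-in for downloading and parsing a feed.

```python
def harvest(feed_urls, fetch_items, archive):
    """Merge new items from each feed into `archive`, a dict keyed by guid."""
    imported = 0
    for url in feed_urls:
        for item in fetch_items(url):
            guid = item["guid"]
            if guid not in archive:  # skip items already harvested
                archive[guid] = {**item, "source_feed": url}  # keep provenance
                imported += 1
    return imported

# Example with stubbed feeds; note the duplicate item in the second feed.
feeds = {
    "https://a.example/feed": [{"guid": "a1", "title": "Post A1"}],
    "https://b.example/feed": [{"guid": "b1", "title": "Post B1"},
                               {"guid": "a1", "title": "Post A1"}],
}
archive = {}
n = harvest(feeds, lambda url: feeds[url], archive)
print(n, sorted(archive))  # → 2 ['a1', 'b1']
```

Run on a schedule (FeedWordPress hooks into WordPress’s own scheduling for this), that loop quietly accumulates a central, deduplicated record of everything the institution’s bloggers publish, wherever they publish it.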

There are more issues to address. What of comments or embedded images? Can it handle Twitter tweets as well as blog posts? Does it scale? What of look-and-feel, individual themes, etc? Now we start needing some more robust tests and decisions, maybe even a developer or two to build a dedicated suite of “ArchivePress” plugins. But thanks to the power and Open-ness of WordPress, and the endless creativity of its many users, we have a promising and viable short-term solution, and a compelling place to start further exploration.

This entry was posted on Monday, March 23rd, 2009 at 3:01 pm and is filed under Preservation, Software, Web 2.0.

4 Responses to “Set a blog to catch a blog…”

Interesting post! FeedWordPress sounds like a very useful tool, particularly for institutions that want a central ‘repository’ of blog postings. I’ve seen it employed on other blogs and wondered how it was done.

I’d probably refer to this kind of approach as an archive of blog posts rather than a blog archive – a subtle difference, but it results in quite a different end product, I think? A blog is more than just a collection of posts: it’s a platform for interaction and exchange of ideas with the community. Comments are usually enabled and can become an important part of the blog, enriching posts and expanding on ideas in the original post. Thought goes into the design and appearance of the blog. Links and widgets provide wider context. All these things make a blog more than just a series of articles, so for the sake of integrity I’d probably argue for a wider capture than the one described above. That said, you can probably set it to capture comments as RSS feeds too… so what this really comes back to is clarity of requirements for preservation: establishing exactly what it is you want to preserve, and what it is necessary to preserve to ensure the integrity of your target is maintained and it can be re-used as per your re-use requirements.

Collecting just the posts is useful, but it restricts re-use of the archive – e.g. it would be difficult to use it to illustrate how you as an institution engage with the community if you don’t keep your comments. However, I do agree that this is a useful and ‘low-cost, quick-start approach to securing data in The Cloud, and safeguarding the corporate memory.’ After all, institutions that want to do this have to start somewhere, and it’s better they capture the posts than capture nothing at all. It will be interesting to see how this particular plug-in develops further.

Maureen.

PS – What’s a blog farm? Is it like a worm farm – but with less squirming?!

Yes, as I suggested, this is just a beginning, but IMO a more satisfactory one than with BlogBackUpOnLine. The idea was to see how far you can get without hacking the code: extending the given functionality of the plugin to pull in comments too, and marry them up to the post, is probably not rocket surgery. Maybe it does that already, I’ll check!

As for look-and-feel… I still remember a certain gentleman, at Erpanet in Urbino, asking: “What if I suggested preserving look-and-feel was a load of dingo’s kidneys?” Or words to that effect.

Perhaps some kind of tiered system for classifying web archives would be useful, a la WCAG: e.g. Level 1 preserves information content, Level 2 preserves info+functionality, Level 3 preserves the info+fn+look-and-feel. Then organisations can define their objectives from the outset and not get sidetracked or scope-creep…

[...] Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post. [...]