January 03, 2005

Using Dublin Core in RSS feeds

Revised and updated 5 January 2005.

Warning: This is a lengthy and highly technical post. If you are not interested in RSS and/or metadata standards, you can safely ignore it.

Ever since the advent of RSS 1.0 (the RDF-based flavour, a.k.a. RDF Site Summary), Dublin Core elements have been used in RSS feeds as metadata descriptors. They even turned up in Movable Type's very own implementation of RSS 2.0, before dying a quick death with the arrival of Atom.

Due to Atom, much of what I am going to say here may seem obsolete. Still, it seems important to talk about the use of Dublin Core (DC) with weblogs in general and RSS feeds in particular, as it could be useful for the scalability and interchangeability of weblog content, and to point out the following facts:

DC would be useful for weblog description;

DC is possibly problematic to implement with weblogs;

DC needs to be implemented correctly, or not at all;

RSS Feed readers should be able to parse DC correctly.

Much of what I'm writing here comes from the implementation of DC where I work and from my experiences in implementing DC in the RSS 1.0 feed of my other weblog The Evil Empire. Here are my observations and humble opinions:

Dave Winer's RSS 2.0 standard is very strict about what its tags denote; therefore, and for the sake of clarity, I will compare the DC terms with the respective tags of RSS 2.0. I would also like to point out that Movable Type's implementation of DC is faulty and not recommended as a model, as use of incorrect DC elements undermines the standard. If you are using the Movable Type RSS 2.0 feed, get rid of it now and use the standard instead.

Namespace

First of all, you must include the correct namespace declaration. For RSS 1.0, it looks like this:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/">

Or, in RSS 2.0:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">

Title

<dc:title> is really synonymous with <title> in Winer's RSS 2.0. Within <channel> it denotes the title of the weblog; within <item> it denotes the title of the posting.

<dc:title>Using Dublin Core in RSS feeds</dc:title>
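To make the two levels concrete, here is a minimal sketch of an RSS 2.0 feed that carries <dc:title> on both the channel and the item. The weblog name and the example.org URLs are placeholders invented for illustration; only the item title is taken from this post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <!-- channel level: title of the weblog (hypothetical name) -->
    <title>Example Weblog</title>
    <dc:title>Example Weblog</dc:title>
    <link>http://example.org/</link>
    <description>A placeholder weblog used for illustration.</description>
    <item>
      <!-- item level: title of the posting -->
      <title>Using Dublin Core in RSS feeds</title>
      <dc:title>Using Dublin Core in RSS feeds</dc:title>
      <link>http://example.org/archives/dublin-core.html</link>
    </item>
  </channel>
</rss>
```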

Creator

<dc:creator> is much more problematic. According to the DC specs, it denotes "An entity primarily responsible for making the content of the resource". The key term here is content. In the context of a weblog, this can mean two things:

If you are writing original weblog entries, then you are <dc:creator>.

However, if you are merely linking to an article elsewhere and are not adding significant material to the link (e.g. on a linkblog), then <dc:creator> is always the author of the original article.

With some linkblogs, it may be hard to decide which of the two to pick — you have to decide whether your entry is mostly original writing or mostly referring. I also know no weblog software that handles this correctly — usually simply the name of the weblog author, i.e. you, is inserted (it's difficult to implement; basically you'd need some kind of field to enter the Creator's name if it's not you, or your news aggregator would have to hand the original author on to your blogging software). In the case of The Evil Empire, which is a very strict linkblog and only uses text from the original sources, using my name as <dc:creator> would certainly not be correct. With a hack, I managed to tweak the template so that <dc:creator> always refers to the original author.

Ideally, the names used for <dc:creator> should be normative and taken from an authoritative thesaurus, such as LoC-NA or PND. Unlike <author> in Winer's RSS, the name rather than the e-mail address of the author is required.

Don't think that you are automatically <dc:creator> for all your weblog entries simply because this is your weblog. According to the DC specs, you are not. See also the notes on Publisher and Contributor below.
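As a rough sketch of the two cases described above, the fragment below shows one original entry and one linkblog entry. All names are hypothetical and are not taken from any authority file; a real implementation would ideally use the normative form of each name.

```xml
<channel xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Original entry: the weblog author wrote the content -->
  <item>
    <title>My thoughts on metadata</title>
    <dc:creator>Doe, Jane</dc:creator>
  </item>
  <!-- Linkblog entry: the content comes from someone else's article,
       so the original author is the creator and the weblog author
       only appears as the publisher -->
  <item>
    <title>An article found elsewhere</title>
    <dc:creator>Roe, Richard</dc:creator>
    <dc:publisher>Doe, Jane</dc:publisher>
  </item>
</channel>
```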

Subject

DC defines <dc:subject> as "A topic of the content of the resource ... [ideally] a value from a controlled vocabulary or formal classification scheme". This means that the standard practice in RSS 1.0 and MT's RSS 2.0 to use it for weblog categories is wrong — unless you organise your categories using LCSH or SWD (or even DDC, if you feel so inclined), none of which I've ever seen on any weblog so far.

This means that <dc:subject> makes sense as part of the <channel> description for topical weblogs with a strong thematic focus, such as

But do bear in mind that <dc:subject> is not at all synonymous with <category> in Winer's RSS 2.0. If you do not name your categories according to LCSH or SWD, use <category> instead.
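The difference could look roughly like this. The subject headings are the LCSH headings used in the Pepys meta tag example near the end of this post; the channel title and the category name are assumptions made up for illustration.

```xml
<channel xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Pepys' Diary</title>
  <!-- terms from a controlled vocabulary (LCSH): suitable for dc:subject -->
  <dc:subject>Pepys, Samuel, 1633-1703 -- Diaries</dc:subject>
  <dc:subject>Great Britain -- History -- Charles II, 1660-1685 -- Sources</dc:subject>
  <item>
    <title>Monday 30 December 1661</title>
    <!-- free-form weblog category (hypothetical): use category instead -->
    <category>Diary entries</category>
  </item>
</channel>
```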

Description

<dc:description> can include any free-text account of the content of the resource, ideally a summary, but there are no real restrictions here — you can also include the full text. This makes it synonymous with <description> in Winer's RSS 2.0 and with both(!) <summary> and <content> in Atom.

Publisher

<dc:publisher> is the "entity responsible for making the resource available". Notice the difference from <dc:creator>: if person A is creating a website and person B writes an article that is published on that website, then person A is <dc:publisher>, and person B is <dc:creator>. In terms of implementation, this is usually easy: since in most cases you are the one publishing your weblog, this is you.

There is no similar term in Winer's RSS 2.0; perhaps <managingEditor> is most closely related.

Contributor

<dc:contributor> is used for somebody "making contributions to the content of the resource". This is probably rarely used with weblog entries, where each article tends to have its own clear-cut author, but it can make sense for the channel description of multi-author weblogs:

Notice the difference from <dc:creator> and <dc:publisher>: the following example is for an entry in Phil Gyford's Samuel Pepys diary weblog, which contains Pepys' text in the translation of Mynors Bright; this entry also contains several annotations by various readers.
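A sketch of how such an item could be described, using the values from the Pepys meta tag example near the end of this post. Each reader who added an annotation would get a further <dc:contributor>; their names are left out because they are not known here.

```xml
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Pepys' Diary: Monday 30 December 1661</dc:title>
  <!-- author of the original text -->
  <dc:creator>Pepys, Samuel (1633-1703)</dc:creator>
  <!-- responsible for the English text used on the site -->
  <dc:contributor>Bright, Mynors (1818-1883)</dc:contributor>
  <!-- the person who runs the weblog -->
  <dc:publisher>Phil Gyford</dc:publisher>
</item>
```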

The distinction between <dc:creator> and <dc:contributor> is to decide what is the main creative work. If person A writes an article and person B takes a few photographs to illustrate the article, then A is <dc:creator> and B is <dc:contributor>. If person B takes a photograph and person A writes a brief explanatory note for it, then B is <dc:creator> and A is <dc:contributor>.

As you can see, this element has a very broad scope — depending on the topic it can even include people who post comments. There is no similar term in Winer's RSS 2.0.

Date

<dc:date> is "associated with the creation or availability of the resource"; this can be the posting date on your weblog, which is the best and easiest way to implement, but, if you are linking to another article elsewhere, can also be that article's date. Best practice, and implemented correctly in the MT templates, is to use W3CDTF (a profile of ISO 8601).

<dc:date>2004-12-28T11:42:40+01:00</dc:date>

This element is somewhat broader than <pubDate> in Winer's RSS 2.0, but it can be used in the same manner. Notice, however, that <pubDate> uses a specific date format that is not W3CDTF, whereas <dc:date> should preferably use W3CDTF, but can also use other formats.
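For comparison, here is the same instant written in both formats. This is only a sketch; the item title is a placeholder.

```xml
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Example entry</title>
  <!-- RSS 2.0 pubDate: RFC 822 style date -->
  <pubDate>Tue, 28 Dec 2004 11:42:40 +0100</pubDate>
  <!-- the same instant in W3CDTF -->
  <dc:date>2004-12-28T11:42:40+01:00</dc:date>
</item>
```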

Type

Not implemented in any template that I know of, although this could potentially be very useful. <dc:type> describes "nature or genre of the content of the resource". DC suggests a specific Type vocabulary (DCT1). For original articles, <dc:type> will probably always be Text, but for audioblogs, it may well be Sound, for photoblogs StillImage, and for videoblogs MovingImage. For links to other online resources, almost any other Type is possible, depending on what you are linking to.

Format

<dc:format> is used for the "physical or digital manifestation of the resource"; this is simplified in the context of weblogs insofar as we are talking almost exclusively about online resources, and we can thus simply use the Internet Media Types (MIME). Again, this is not implemented in any RSS template that I know of.
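Type and Format are easiest to see side by side. The sketch below assumes a weblog that mixes ordinary text entries with audio posts; the titles are placeholders, and the media types are simply the usual ones for HTML pages and MP3 files.

```xml
<channel xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- an ordinary text entry served as an HTML page -->
  <item>
    <title>A written entry</title>
    <dc:type>Text</dc:type>
    <dc:format>text/html</dc:format>
  </item>
  <!-- an audioblog entry served as an MP3 file -->
  <item>
    <title>An audio entry</title>
    <dc:type>Sound</dc:type>
    <dc:format>audio/mpeg</dc:format>
  </item>
</channel>
```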

Identifier

<dc:identifier> is the "unambiguous reference to the resource within a given context". Since the "given context" is your weblog, this makes it synonymous with <guid> in Winer's RSS 2.0. This means that if your permalinks are permanent and unambiguous (i.e. each item can be found via a unique URL), it is safe to use your permalink here.

Source

<dc:source> comes in whenever a weblog entry is not entirely original. According to the DC definition, it is a "Reference to a resource from which the present resource is derived".

Basically, it is needed whenever your weblog entries are "derived from" some other resource (rather than being original entries, or new entries that are merely "based on" other resources). Sometimes this distinction may be hard to make; best practice is to include <dc:source> in case of doubt. With linkblogs, this is always the URL of the original article. Notice the difference between <dc:source> and <dc:identifier>:

Winer's RSS 2.0 has <source>, which is similar, but contains the name rather than the URL of the source.

<dc:identifier> is the local identifier of the current weblog entry; <dc:source> shows where the material in the current entry came from.

This is similar to, but somewhat stricter than <link> in Winer's RSS 2.0. Whereas <link> contains any article that you link to, <dc:source> should be used both for an article that you link to and for an article that your current article is derived from. This means that you will have to use <dc:source> more often than <link>.
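A linkblog item might therefore carry both elements, roughly as sketched here. Both URLs are invented placeholders: the first stands for the entry's own permalink, the second for the article it was derived from.

```xml
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>An article found elsewhere</title>
  <!-- local permalink of this weblog entry -->
  <guid isPermaLink="true">http://example.org/archives/000123.html</guid>
  <dc:identifier>http://example.org/archives/000123.html</dc:identifier>
  <!-- where the material in this entry came from -->
  <dc:source>http://example.com/news/original-article.html</dc:source>
</item>
```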

Language

<dc:language> denotes the language of the content according to RFC3066, which itself is based on ISO 639. This can be done on the <channel> level if the entire weblog is in the same language (which is usually implemented correctly in most default feeds), or on the <item> level in the case of a multilingual weblog (which is, sadly, not really implemented anywhere).
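For a multilingual weblog, an item-level implementation could look like the following sketch. The titles are placeholders, and each item states its language explicitly, since it is only an assumption that a feed reader would fall back to a channel-level value.

```xml
<channel xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Example Weblog</title>
  <item>
    <title>A post in English</title>
    <dc:language>en</dc:language>
  </item>
  <item>
    <title>Ein Beitrag auf Deutsch</title>
    <dc:language>de</dc:language>
  </item>
</channel>
```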

Relation

In previous versions of RSS, links to related entries (such as Trackback pings) were implemented via the <trackback:about> model. As you can see, simple DC would have sufficed.

Of course you can also use <dc:relation> to manually add URLs to web pages that you consider of related interest. Notice that there is a difference between <dc:relation> (related content) and <dc:source> (related content that was the basis for your entry) — see the entry on Source above.
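As a sketch of that difference, the item below names one page as its basis and another as merely related. Both URLs are hypothetical.

```xml
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>An article found elsewhere</title>
  <!-- the article this entry was derived from -->
  <dc:source>http://example.com/news/original-article.html</dc:source>
  <!-- a page of related interest, e.g. one that sent a Trackback ping -->
  <dc:relation>http://example.net/blog/a-response.html</dc:relation>
</item>
```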

Coverage

<dc:coverage> is used for the spatial or temporal "extent or scope of the content of the resource", ideally using terms from a controlled vocabulary such as the TGN for places and W3CDTF for dates.

It is perhaps most useful for weblogs with a specific geographic and/or historical focus.

Notice that there is a difference between <dc:date>, which is about when the entry was made available, and <dc:coverage>, which is about the time covered by the entry. So if I publish an article about what I did on New Year's Eve a couple of days later, it looks like this:

<dc:title>What I did on New Year's Eve</dc:title>
<dc:date>2005-01-03T12:29:04+01:00</dc:date>
<dc:coverage>2004-12-31</dc:coverage>

One other use that comes to mind is for monthly or weekly weblog archives, although that would probably mostly apply to web pages, and not RSS feeds.

(For further details on including DC elements in meta tags of web pages see below.)

Rights

<dc:rights> is used for any "Information about rights held in and over the resource". This works on the <channel> level as well as the <item> level, but may be harder to implement on the latter if no distinction is made between <dc:creator> and <dc:publisher>, as you — the publisher — do not automatically own the rights to an article if you are not also the creator.
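The following sketch shows why the distinction matters: the channel carries the publisher's own notice, while an item written by someone else carries a separate one. The names and the wording of the notices are made up for illustration.

```xml
<channel xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Example Weblog</title>
  <dc:publisher>Doe, Jane</dc:publisher>
  <!-- channel level: rights held by the publisher -->
  <dc:rights>Copyright 2005 Jane Doe</dc:rights>
  <item>
    <title>A guest article</title>
    <dc:creator>Roe, Richard</dc:creator>
    <!-- item level: rights remain with the creator -->
    <dc:rights>Copyright 2005 Richard Roe. Reproduced with permission.</dc:rights>
  </item>
</channel>
```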

DC in meta tags

Apart from RSS feeds, DC elements can also be included in meta tags of web pages. This is probably only useful if you are generating a separate web page for each individual weblog entry, and can be a real pain to do correctly as most weblog software will not allow you to easily create all of these meta tags without further, often complicated, hacks. So merely to show you what it could be like, here's what a full set of DC meta tags for this page, if it existed, would look like:

Explanation: DC.Title: the title of the weblog entry. - DC.Creator: the author of the text, spelt according to the authoritative heading in PND. - DC.Subject: two subject headings according to LCSH. - DC.Description: a summary of the text. - DC.Publisher: the person who runs the weblog. - DC.Date: date of publication on the weblog, formatted according to W3CDTF. - DC.Type: resource type according to DCT1. - DC.Format: Internet Media Type (IMT) of the online resource. - DC.Identifier: the local permalink URI of the weblog entry. - DC.Source: included because this article could be seen as a reinterpretation of DC for weblogs, hence the URI of that page. - DC.Language: language of the text according to RFC3066. - DC.Relation: the URI of a website that sent a Trackback ping to this entry. - DC.Rights: Copyright notice. - DC.Contributor and DC.Coverage do not apply and were thus left out. The final <link rel> points to the DC element set for reference purposes.

To demonstrate the use of all DC elements in the description of a weblog entry, I made a sample description for an entry in Phil Gyford's Samuel Pepys diary weblog:

<meta name="DC.Title" content="Pepys' Diary: Monday 30 December 1661" />
<meta name="DC.Creator" scheme="LoC-NA" content="Pepys, Samuel (1633-1703)" />
<meta name="DC.Subject" scheme="LCSH" content="Pepys, Samuel, 1633-1703 -- Diaries" />
<meta name="DC.Subject" scheme="LCSH" content="Cabinet officers -- Great Britain -- Diaries" />
<meta name="DC.Subject" scheme="LCSH" content="Great Britain -- Social life and customs -- 17th century -- Sources" />
<meta name="DC.Subject" scheme="LCSH" content="Great Britain -- History -- Charles II, 1660-1685 -- Sources" />
<meta name="DC.Subject" scheme="DDC21" content="941.066092" />
<meta name="DC.Description" content="At the office about this estimate and so with my wife and Sir W. Pen to see our pictures, which do not much displease us, and so back again, and I staid at the Mitre, whither I had invited all my old acquaintance of the Exchequer to a good chine of beef..." />
<meta name="DC.Publisher" content="Phil Gyford" />
<meta name="DC.Contributor" scheme="LoC-NA" content="Bright, Mynors (1818-1883)" />
<meta name="DC.Date" scheme="W3CDTF" content="2004-12-30" />
<meta name="DC.Type" scheme="DCT1" content="Text" />
<meta name="DC.Format" scheme="IMT" content="text/html" />
<meta name="DC.Identifier" scheme="URI" content="http://www.pepysdiary.com/archive/1661/12/30/index.php" />
<meta name="DC.Source" scheme="URI" content="http://www.gutenberg.org/etext/4130" />
<meta name="DC.Language" scheme="RFC3066" content="en" />
<meta name="DC.Relation" scheme="URI" content="http://blogs.msdn.com/mcreasy/archive/2004/12/31/344960.aspx" />
<meta name="DC.Coverage" scheme="TGN" content="London" />
<meta name="DC.Coverage" scheme="W3CDTF" content="1661-12-30" />
<meta name="DC.Rights" content="The main diary entries, the footnotes in the right-hand sidebar, the text in the Diary Introduction section, and the main text on the People and Places pages are taken from the Project Gutenberg version of Pepys' diary and as such are free of copyright restrictions. All annotations added by users in the Diary section (attached to the diary entries and People and Places pages) and the rest of the site are available under a Creative Commons Attribution-NonCommercial-ShareAlike license. Any material posted in the annotations by users that is quoted from elsewhere retains its original copyright status." />
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" title="Dublin Core" />

Explanation: DC.Title: the title of the weblog entry. - DC.Creator: the author of the original text, spelt according to the authoritative heading in LoC-NA. - DC.Subject: several subject headings according to LCSH, one according to DDC. - DC.Description: a brief excerpt from the text. - DC.Publisher: the person who runs the weblog. - DC.Contributor: the person who translated the diary from Pepys' secret script into English, spelt according to the heading in LoC-NA. - DC.Date: date of publication on the weblog, formatted according to W3CDTF. - DC.Type: resource type according to DCT1. - DC.Format: Internet Media Type (IMT) of the online resource. - DC.Identifier: the local permalink URI of the weblog entry. - DC.Source: the URI where the original text is located. - DC.Language: language of the text according to RFC3066. - DC.Relation: the URI of a website that sent a Trackback ping to this entry. - DC.Coverage: the covered place spelt according to TGN, the covered time formatted according to W3CDTF. - DC.Rights: Copyright notice from the weblog.

The problem?

The main reason why most of the 15 DC elements have not been properly implemented in weblogs, neither in RSS feeds nor in meta tags, is that no weblog software offers enough fields to enter all the necessary metadata (or is intelligent enough to create at least some of it automatically). Even if there were one that did, I cannot think of a user interface that would not confuse the average user, who does not even know that such a thing as Dublin Core exists.

Conclusion

Why talk about Dublin Core now that everybody is using Atom anyway?

Because many people still use RSS feeds with incorrect implementations of Dublin Core. Because DC would have provided a standardised, useful vocabulary for RSS feeds if anyone had cared to listen and pay attention rather than cook their own flavours of RSS, which are now all becoming obsolete. Because Dublin Core is a widely and extensively used standard for metadata and applying it to weblogs might have been useful. Because far too few people know about it at all.

Should you implement DC in your feed(s)?

No. I'm just pointing out that it's possible and what it would look like.

Should you include DC elements in the meta tags of your weblog pages?

No. Of course, in an ideal world, every web page would use DC meta tags. But then this is no ideal world, so you don't have to use them.

Horst, how come you've got such an in-depth knowledge of these matters? Aren't you an English lit major turned librarian? Or is knowledge about Dublin Core mandatory for a librarian? Or am I simply way off track and you're actually a webdeveloper and I should have read your whole weblog archives before posting that comment? I hope not.

Dublin Core is more common with archives than with libraries, but yes, it is part of our training, and it is also something I'm professionally and semi-professionally interested in.

Joanne Harrington said on June 1, 2005 10:28 PM:

My project team and I are going to use RSS to enable the get and pull of government and NGO links among Web sites that are part of what is called the Collaborative Seniors' Portal Network.
This is our way of making No Wrong Door happen so that consistent search results related mostly to government programs and services are presented to visitors to any member of the CSPN. Sounds like a simple thing to accomplish but it isn't when you are working with partners across different governments and NGO's. Our consultant David Megginson - who you may have heard about recommended RSS and now - finally the point of my comment is that for years the federal government standard has been to employ DC elements to meta data and we are going to be working with our partners at the provincial and community level to adopt a standard use of DC elements.
I am not an IM expert- I'm one of those odd program owners that believes that just like any good mechanic will say as a car owner you have to know what is going on under the hood - at least to some degree and that is how I view IM - a little bit of knowledge about IM is a GOOD thing.
Found your blog very easy to read and understandable. Thank you.