Semantifying Delicious

Introduction

The tags2con dataset was created manually by a group of human annotators who linked del.icio.us tags to their actual meaning.

The tags2con dataset was built from a subset of a Delicious dump: a set of 1681 user-bookmark pairs was selected, and all the tags used in these pairs were manually cleaned and disambiguated to WordNet synsets.

The dataset we have built includes annotations from users who have fewer than 1000 tags and have used at least ten different tags in five different website domains. The upper bound was chosen because Delicious is also subject to spamming: as the original authors of the crawled data assumed, users with more than one thousand tags could be spammers, or their tags could be machine generated.

Furthermore, only user-bookmark pairs that have at least three tags (to provide diversity in the gold standard) and no more than ten tags (to allow timely manual evaluation) were selected. Only URLs that have been used by at least ten users were considered, in order to provide enough overlap between users. After retrieving all the user-bookmark pairs that comply with these constraints, we randomly selected 1681 pairs as follows: 500 pairs were selected purely at random, and 1181 pairs were selected at random, in equal proportion, from the pairs that overlapped with one of the following DMOZ topics: Top/Home/Cooking, Top/Recreation/Travel or Top/Reference/Education. Table 1 summarizes the characteristics of the resulting subset of the dataset.

Table 1. Statistics about the Dataset

Item                               Count
<r, u> pairs                       1681
total number of tags               7323
average number of tags per pair    4.35
unique tags                        1569
unique URLs                        739
unique users                       1397
website domains                    603
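The selection constraints described before Table 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the dataset: the `Pair` record, the function names and the reading of the user constraint (distinct tags counted per user, at least five distinct domains) are all hypothetical, and the counts are computed over whatever list of pairs is passed in.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Pair:
    """One user-bookmark pair with the tags the user gave to the URL."""
    user: str
    url: str
    tags: tuple


def eligible_pairs(pairs, max_user_tags=1000, min_user_tags=10,
                   min_user_domains=5, min_pair_tags=3, max_pair_tags=10,
                   min_url_users=10):
    """Return the pairs that satisfy the user, URL and pair constraints."""
    user_tags = defaultdict(set)
    user_domains = defaultdict(set)
    url_users = defaultdict(set)
    for p in pairs:
        user_tags[p.user].update(p.tags)
        user_domains[p.user].add(urlparse(p.url).netloc)
        url_users[p.url].add(p.user)

    def user_ok(user):
        # Assumed reading: fewer than 1000 distinct tags, at least ten
        # distinct tags, spread over at least five website domains.
        n_tags = len(user_tags[user])
        return (min_user_tags <= n_tags < max_user_tags
                and len(user_domains[user]) >= min_user_domains)

    return [p for p in pairs
            if user_ok(p.user)
            and len(url_users[p.url]) >= min_url_users
            and min_pair_tags <= len(p.tags) <= max_pair_tags]


def sample_pairs(candidates, k, seed=0):
    """Draw k pairs uniformly at random (all of them if fewer exist)."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))
```

Under these assumptions, the 500 purely random pairs would correspond to `sample_pairs(eligible_pairs(all_pairs), 500)`, and the topic-restricted pairs to the same call applied to the eligible pairs of each DMOZ topic.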

Each tag has been split into lemmatized tokens, each of which has then been linked to its meaning in the WordNet 3.0 ontology.
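To make the shape of such an annotation concrete, here is a minimal Python sketch of the tag, token and sense structure. In the dataset the splitting and disambiguation were done manually; the greedy splitter, the toy vocabulary, the lemma table and the sense labels below are all illustrative assumptions, and the labels merely stand in for WordNet 3.0 synset identifiers.

```python
# Toy vocabulary and lemma table; purely illustrative, not taken from the dataset.
VOCABULARY = {"java", "island", "islands", "travel", "recipe", "recipes"}
LEMMAS = {"islands": "island", "recipes": "recipe"}

# Hypothetical sense labels standing in for WordNet 3.0 synset identifiers.
SENSES = {
    ("javaislands", "java"): "java (island in Indonesia)",
    ("javaislands", "island"): "island (land mass surrounded by water)",
}


def split_tag(tag, vocabulary=VOCABULARY):
    """Split a tag into known words by greedy longest-prefix matching.

    Returns None when the tag cannot be fully covered by the vocabulary
    (the greedy strategy does not backtrack).
    """
    tag = tag.lower()
    tokens, start = [], 0
    while start < len(tag):
        for end in range(len(tag), start, -1):
            if tag[start:end] in vocabulary:
                tokens.append(tag[start:end])
                start = end
                break
        else:
            return None
    return tokens


def lemmatize(token, lemmas=LEMMAS):
    """Map a token to its lemma, leaving unknown tokens unchanged."""
    return lemmas.get(token, token)


def annotate(tag):
    """Return (lemma, sense) pairs for a tag; sense is None if not annotated."""
    tokens = split_tag(tag) or [tag.lower()]
    lemmas = [lemmatize(t) for t in tokens]
    return [(lemma, SENSES.get((tag.lower(), lemma))) for lemma in lemmas]
```

The sense table is keyed by the tag as well as the lemma because the same word can take different meanings in different tags, which is the reason the disambiguation step is needed at all.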

In Figure 1, we propose an extension to Newman's ontology in which a tags:Tag can be split into a number of tags2con:Token instances, each of which then links to its actual meaning in a knowledge organisation system (in this case a SKOS:Concept) that can be used in reasoning.
For compatibility with existing RDF models, which widely use Newman's tags:Tag class, this proposal also uses that class.
However, we believe that this conflates the linguistic layer of the folksonomy with its conceptual layer, which can lower the accuracy of reasoning services based on this data. We would therefore recommend dropping this compatibility in the future.
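The tag, token and concept chain described above can be sketched as a handful of triples. The example below uses plain Python tuples rather than an RDF library. The class names come from the text, while the property names `hasToken` and `hasMeaning`, the `example.org` namespaces and the helper functions are hypothetical placeholders rather than the vocabulary actually published with the dataset. The `tags` namespace is the one commonly given for Newman's Tag Ontology and should be checked against the distributed files.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TAGS = "http://www.holygoat.co.uk/owl/redwood/0.1/tags/"  # Newman's Tag Ontology
SKOS = "http://www.w3.org/2004/02/skos/core#"
T2C = "http://example.org/tags2con#"   # placeholder namespace (assumption)
DATA = "http://example.org/data/"      # placeholder namespace (assumption)


def describe_tag(tag_id, token_concepts):
    """Build triples for one tag, its tokens and the concept each token denotes.

    token_concepts is a list of (token_id, concept_uri) pairs.
    """
    tag = DATA + "tag/" + tag_id
    triples = [(tag, RDF_TYPE, TAGS + "Tag")]
    for token_id, concept in token_concepts:
        token = DATA + "token/" + token_id
        triples += [
            (tag, T2C + "hasToken", token),        # hypothetical property
            (token, RDF_TYPE, T2C + "Token"),
            (token, T2C + "hasMeaning", concept),  # hypothetical property
            (concept, RDF_TYPE, SKOS + "Concept"),
        ]
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)
```

Keeping the token as its own resource is what separates the linguistic layer (the string the user typed) from the conceptual layer (the concept it denotes).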

In addition to the tags2con extension (rdf, n3, Ontology Browser), the main ontologies that we use to distribute the dataset are:

Linked Open Data (LOD) Cloud

The dataset that we distribute here follows the Linked Data principles and also tries to fulfill the requirements of the LOD Cloud:

There must be resolvable http:// (or https://) URIs. All resources of the dataset are available through http://.

They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples). The dataset is available as RDF/XML and N-Triples.

The dataset must contain at least 1000 triples. The dataset currently contains 18675 triples, and a planned update should add almost double that amount.

The dataset must be connected via RDF links to a dataset that is already in the diagram. This means that either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links. The tags2con dataset links to the WordNet 3.0 resources, with around 262 links.

Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint. The dataset is available through RDF crawling, and an RDF dump can be provided on request.
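The two numeric requirements (at least 1000 triples and at least 50 links to another dataset) could be checked against an N-Triples dump with a few lines of Python. This is a rough sketch with a deliberately naive line-based parser. The function names are hypothetical, and `external_prefix` is an assumed parameter standing for whichever namespace the linked dataset uses.

```python
def lod_stats(ntriples_text, external_prefix):
    """Count triples and links whose object URI starts with external_prefix."""
    triples = links = 0
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Drop the terminating " ." and split into subject, predicate, object.
        parts = line.rstrip(" .").split(None, 2)
        if len(parts) != 3:
            continue  # not a well-formed triple line for this naive parser
        triples += 1
        if parts[2].startswith("<" + external_prefix):
            links += 1
    return triples, links


def meets_lod_thresholds(triples, links, min_triples=1000, min_links=50):
    """Check the triple-count and link-count requirements quoted above."""
    return triples >= min_triples and links >= min_links
```

With the figures reported above, `meets_lod_thresholds(18675, 262)` evaluates to `True`.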