Where does this co-reference data come from?

Well, this is rather a long story, as we did not set out to provide this service.
But if I tell you the story, you might be able to assess the utility of it in your context.

As part of the RKBExplorer work,
we needed to be able to manage co-reference between triplestores (see related publications).
We had an existing infrastructure for doing this, the Co-Reference Service (CRS),
and we populated these CRSes with the co-reference data we were generating on RKBExplorer.com.
As the RKBExplorer application became more sophisticated, we needed co-reference information
linking our data to other sites such as DBpedia and http://data.semanticweb.org/.
This enabled us to use information such as descriptions from Wikipedia/DBpedia, and the information on conferences and FOAF relationships.

However, long ago we discovered that getting things even slightly wrong can cause serious problems
once the "network effect" that we are seeking comes into play. A seemingly trivial problem, such as a source
telling us that two different people with the same name are the same person, can result in the network of
relationships between the entities related to them being badly misrepresented.
Such problems would not arise if the raw data were simply being presented.

So I set out to gather co-referent information from sources I thought were sufficiently accurate for my purposes.

I started with the data we already had, and indeed are still generating.
I then went to the Linked Data cloud, and harvested from the RDF dumps and SPARQL endpoints that I deemed to be satisfactory.
In addition, I approached some people who were not already publishing in a form I could easily harvest,
such as David Baxter of OpenCyc, and asked them to provide the data to me directly.

I have avoided spidering the web for arbitrary data, and indeed would suggest that other Semantic Web search engines are a much
better source for this than I can possibly provide.

The question of which predicates I might have used now arises. There is what I consider a deep irony here.
For many years, we have been arguing (not always with great success) that the issue of co-reference is much more complicated
than can be captured by a simple predicate such as owl:sameAs.
On undertaking this task, I found that there are many predicates coming into existence that address this question.
In assembling this site, I have used at least the following:

I accepted the idea of co-reference for each of these on a per source basis.
The <sameAs> service currently has a single concept of co-reference,
and publishes the data it has in a single way, for example using the owl:sameAs predicate.
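
To make the idea of a single concept of co-reference concrete, here is a minimal sketch of how pairs asserted with different predicates could be collapsed into bundles of co-referent URIs. It is only an illustration under stated assumptions, not a description of how <sameAs> or the CRS is actually implemented: the function name build_bundles, the example.org URIs and the ex:equivalentTo predicate are all hypothetical, and the union-find grouping is simply one common way to do this.

```python
from collections import defaultdict


def build_bundles(triples):
    """Group URIs into bundles (connected components of co-reference links).

    `triples` is an iterable of (subject, predicate, object) tuples. The
    predicate is deliberately ignored: every accepted predicate is treated
    as the same single notion of co-reference (an assumption of this sketch).
    """
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root, then walk up to the root,
        # shortening the path as we go.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, _predicate, right in triples:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    # Sort for a stable, readable result.
    return sorted((sorted(b) for b in bundles.values()), key=lambda b: b[0])


# Hypothetical input: two predicates, three sources, two real-world people.
triples = [
    ("http://example.org/a/alice", "owl:sameAs", "http://example.org/b/alice"),
    ("http://example.org/b/alice", "ex:equivalentTo", "http://example.org/c/alice"),
    ("http://example.org/a/bob", "owl:sameAs", "http://example.org/b/bob"),
]
bundles = build_bundles(triples)

# One mistaken link from a source is enough to merge both people into one bundle.
bad_link = ("http://example.org/c/alice", "owl:sameAs", "http://example.org/a/bob")
merged = build_bundles(triples + [bad_link])
```

The second call shows why accuracy of the sources matters so much here: because co-reference is treated as transitive, a single wrong assertion joins two bundles that were otherwise correct.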

I have to say it does raise the question of why there should be so many vocabularies that mint new URIs for these concepts.

So what sources? Here is a non-exhaustive list of places the data came from:

Finally, please be aware that the data is changing all the time.
As people browse using RKBExplorer, the system examines the results and establishes co-reference as appropriate;
thus the results provided by RKBExplorer are intended to improve as time goes by, and the <sameAs.org> reflection of them will change too.

I hope that helps - I confess that in the early days I was simply getting data I needed, rather than preparing to document it.

Helping us

There is currently no public service to enable arbitrary contribution to the contents of <sameAs>.
If you have significant data you would be prepared to give us, then please contact us at the email below.
On the other hand, if you have time to help us provide such services, then please feel free to offer your help.

License and Re-use

We believe that Linked Data needs to develop clear, focussed services
that only do one or two things, so that they can be composed and utilised by
the more complex services, as well as facilitating re-use. We hope that
<sameAs> fits into that category, and that Linked Data application
builders will find it an appropriate and useful service for the important
task of discovering co-referent URIs.

In addition, by providing formats oriented towards non-Linked Data
applications, we hope that the use of Linked Data can be spread even wider.
If you really want, there are a number of sameAs logos available.

There are currently roughly 200M URIs, with an average of about 3 URIs per bundle.

To the extent possible under law,
the person who associated CC0
with this work has waived all copyright and related or neighboring
rights to this work.