All the Perl that's Practical to Extract and Report

It looks like an awfully verbose way of saying some very simple things. And I expect that for it to be useful for users they'll need to do XML voodoo. Which is HARD. I just don't see the point of using an obfuscatory format like RDF/RSS/XML/whatever it's called this week, rather than (eg) the output from Data::Dumper or YAML. Maybe I'm missing something.

RDF may be more suitable and appropriate for aggregation of the various metadata files relating to a single distribution. Much of it will be primarily for PAUSE, indexers like search.cpan, and the various tools people already use, like CPAN.pm, so users generally won't ever need to look at the raw metadata unless they really want to.

I should probably have explained this a little more. I got really confused and all negative about RDF until recently. The main problem is that it's all in XML and that scares everyone, but RDF is really all about triples: subject, predicate, object. It just so happens that the most common serialisation format at the moment is in XML.
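
To make the triple idea concrete without any XML, here is a minimal Python sketch. It treats a triple as a plain (subject, predicate, object) record and finds triples by pattern. The `Triple` type, the `match` helper, and the `example:mimetype` predicate with its value are all made up for illustration; only the Acme-Buffy-1.2 / cpan:id / LBROCARD triple comes from this thread.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


triples = [
    # From the discussion: LBROCARD is the author ID for Acme-Buffy-1.2.
    Triple("Acme-Buffy-1.2", "cpan:id", "LBROCARD"),
    # Hypothetical predicate and value, here only to give the store a second row.
    Triple("Acme-Buffy-1.2", "example:mimetype", "application/x-gzip"),
]


def match(store: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that agrees with the fields that were given.

    A field left as None acts as a wildcard.
    """
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


authors = match(triples, predicate="cpan:id")
everything_about_dist = match(triples, subject="Acme-Buffy-1.2")
```

Nothing here depends on how the triples were serialised, which is roughly the point: RDF/XML, N3 or anything else would load into the same three-column shape.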

So an interesting triple would be "LBROCARD" "is the author of" "Acme-Buffy-1.2". Or, in the RDF fragment about Acme-Buffy-1.2: "<cpan:id>LBROCARD</cpan:id>". Noti

You still have to make guesses about what a is, surely? At some point, a human has to decide that LBROCARD is the person who wrote Acme::Buffy, and that it's not some other random identifying feature like an ASCII-fied checksum.

Still needs a human to read, parse and understand the fact that <foo> represents a FOO in the real world, and to write the code to handle FOOs correctly. That is, it requires just as much work as understanding what 'author' means in a structure such as:

Oh, OK, maybe I didn't follow your meaning. I wasn't meaning to imply that using RDF (and in the vocabulary itself, OWL [w3.org]) would actually define what the data is. But yes, isn't that always going to be the case, until we have smart computers? At the moment, the closest thing to "encapsulated meaning" we have is Cyc [opencyc.org], and that's a long way off from being the real thing. RDF vocabularies, as you say, are good for defining relationships between things.

I always try to either use something that is explicitly designed to be human-readable, like Data::Dumper (with purity and indent style 2) or more recently YAML; or something which cares not about being human-readable, such as Storable or some other binary format. RDF/RSS/XML, because it's ASCII, looks like it's meant to be human-readable, so I try to read it and get irritated.

Wow, I really like this idea. Is the idea to serialize CPAN metadata in a similar way to how the Open Directory Project [dmoz.org] makes their data [dmoz.org] available? Speaking as an ex-librarian, your use of RDF and DublinCore is commendable. People in the library and information science communities have been getting all excited about RDF and DublinCore for years, and it's very cool to see someone putting it to practical use. I bet the semantic web folks [w3.org] would also be very interested to hear about your experiments.

On a somewhat related note: while it's kind of eclectic, the Open Archives Initiative [openarchives.org] has developed a protocol [openarchives.org] for sharing large sets of metadata. The OAI-PMH provides a very simple framework for building data providers and data harvesters using a set of 6 verbs over XML/HTTP: Identify(), ListIdentifiers(), GetRecord(), ListMetadataFormats(), ListRecords(), ListSets(). While it might not be of direct use, it could be of interest if you are looking for ideas on how to allow people to update their local copies of CPAN metadata without grabbing the whole lot each time. The OAI-PMH has its roots in the arxiv [arxiv.org] pre-print server at Los Alamos, and is currently being used by quite a mix [arxiv.org] of data providers. Oh, and I wrote Net::OAI::Harvester [cpan.org] for interacting with repositories :-)
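
For a feel of what those verbs look like on the wire, here is a rough Python sketch that only builds the request URLs (it doesn't fetch anything). The base URL and the `oai_request_url` helper are hypothetical; the verb names, the `metadataPrefix` and `from` arguments, and the `oai_dc` prefix are from the OAI-PMH spec as I understand it. The `from` argument is what would let a harvester ask only for records changed since its last run.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# The six OAI-PMH verbs listed above.
OAI_VERBS = {
    "Identify",
    "ListIdentifiers",
    "GetRecord",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(base_url: str, verb: str,
                    params: Optional[Dict[str, str]] = None) -> str:
    """Build a GET URL for one OAI-PMH request.

    `params` is a dict rather than keyword arguments because `from`
    is a reserved word in Python.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(params or {})
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, for illustration only.
BASE = "http://example.org/oai"

identify_url = oai_request_url(BASE, "Identify")
incremental_url = oai_request_url(
    BASE,
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2004-01-01"},
)
```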

This snippet doesn't look entirely kosher. The urn::filesize and urn::mimetype elements need to be placed into a proper namespace.

The RDF format is rather, um, ugly to behold. It's good for interchange between apps, but greatly obfuscates the meaning for wetware parsers. I think the following is a faithful interpretation of the above example in Notation 3 [w3.org]:

It was just a fragment, so it had no namespaces. Thanks for the feedback, it does now. Also I added Author ID and MD5 Checksum. More metadata from CPANTS and META.yml to come soon. I used RDF/XML as it was the simplest thing possible at the time and RDF::Simple was, well, simple. Anyway, you can check it out at: http://www.cpan.org/authors/id/L/LB/LBROCARD/cpan.rdf.gz
(autrijus is hacking PAUSE so I can replace the file instead of releasing new versions all the time).

First off, XML isn't the only possible serialization of RDF.
Second, and more importantly, I think it's reasonable for CPAN metadata to be stored/provided as YAML... so long as it can be unambiguously mapped to RDF for those applications that need/want it.
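
As a sketch of what "unambiguously mapped" could mean in practice: if the YAML parses to a flat mapping, each key/value pair becomes one triple about the distribution. The Python below assumes the YAML has already been parsed into a dict (no YAML library is used), and the `metadata_to_triples` helper, the `cpan:` prefix rule, and the `keywords` field with its values are assumptions for illustration, not an agreed vocabulary.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, List[str]]


def metadata_to_triples(subject: str,
                        metadata: Dict[str, Value],
                        prefix: str = "cpan:") -> List[Tuple[str, str, str]]:
    """Turn a flat metadata mapping into (subject, predicate, object) triples.

    Each key becomes a predicate by adding `prefix`; a list value yields
    one triple per item, so nothing is lost in the mapping.
    """
    triples = []
    for key, value in metadata.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, prefix + key, str(item)))
    return triples


# Stand-in for parsed YAML. "id" is from this thread; "keywords" is made up.
parsed = {
    "id": "LBROCARD",
    "keywords": ["acme", "buffy"],
}

result = metadata_to_triples("Acme-Buffy-1.2", parsed)
```

The mapping is only unambiguous if everyone agrees what each key means, which is a vocabulary question rather than a syntax one.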

I would argue that the world that uses XML/RDF is larger than the world that uses YAML. I have no statistics to back this up, it is just a gut feel. Safety in numbers is not really a good argument, but I guess the main thing is that the data is *available* (thanks Acme), rather than what format it is in.

Actually, I'd argue with equal conviction that CPAN Metadata should be canonically stored in N3.

My cat would argue even more strongly that we should design a database schema and shove all the data into {SQLite|MySQL|PostgreSQL}. Even cats can understand third normal form. ;-)

The one thing we really need is to agree on the triples and the meaning of the assertions that describe CPAN metadata. Everything else is just syntax. Mapping from one syntax or another (or deeming one syntax "preferred") is an e