Post by Ettore RIZZA
I'm looking for Wikidata bots that perform accuracy audits. For example, comparing the birth dates of persons with the same date indicated in databases linked to the item by an external-id.

This is mostly a screenscraping job, because most external databases are only accessible in unstructured or poorly structured HTML form.

Federico

"Poorly structured" HTML is not all that bad in 2018 thanks to HTML 5(which builds the "rendering decisions made about broken HTML fromNetscape 3" into the standard so that in common languages you can getthe same DOM tree as the browser)

If you try to use an official or unofficial API to fetch data from some service in 2018, you will have to add some dependencies, and you just might open a can of whoop-ass that will make you reinstall Anaconda, or maybe you will learn something you'll never be able to unlearn about how XML processing changed between two minor versions of the JDK.

On the other hand, I have often dusted off the old HTML-based parser I made for Flickr and found I could get it to work for other media collections, blogs, etc. just by changing the "semantic model" embodied in the application, which could be as simple as some function or object that knows something about the structure of the URLs of some documents.

I cannot understand why so many standards have been pushed to integrate RDF and HTML that have gone nowhere, yet nobody has promoted the clean solution of "add a CSS media type for RDF" that marks up the semantics of HTML the way JSON-LD works.

Often, though, if you look at it that way, much of the time these days matching patterns against CSS gets you most of the way there.
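As a rough sketch of what such a rule set can look like in Python (assuming requests, beautifulsoup4 and html5lib; the record URL pattern and the CSS selector are hypothetical stand-ins for whatever a given external database actually uses):

# A sketch of the "CSS rule set" approach to scraping one field.
# Assumes requests, beautifulsoup4 and html5lib; the URL pattern and
# the selector below are hypothetical and differ for every database.
from typing import Optional

import requests
from bs4 import BeautifulSoup

def record_url(record_id: str) -> str:
    # The "semantic model": one function that knows how this particular
    # database builds its record URLs (a hypothetical pattern).
    return f"https://authority.example.org/records/{record_id}"

def scrape_birth_date(record_id: str) -> Optional[str]:
    response = requests.get(record_url(record_id), timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html5lib")
    # One CSS selector is often the whole "parser" for a given site.
    node = soup.select_one("span.birth-date")
    return node.get_text(strip=True) if node else None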

I've had cases where I haven't had to change the rule sets much at all, but none of them has been more than 50 lines of code, and all of them much less.


Wikidata is obviously linked to a bunch of unusable external ids, but also to some very structured data. I'm interested for the moment in the state of the art - even based on poor scraping, why not?

I see for example this request for permission <https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Symac_bot_4> for a bot able to retrieve information from the BnF (French national library) database. It was refused because of copyright issues, but simply checking the information without extracting anything is allowed, isn't it?



Post by Ettore RIZZA
Dear all, I'm looking for Wikidata bots that perform accuracy audits. For example, comparing the birth dates of persons with the same date indicated in databases linked to the item by an external-id.

Let's have a look at the evolution of automated editing. The first step is to add missing data from anywhere. Bots importing date of birth are an example of this. The next step is to add data from somewhere with a source, or to add sources to existing unsourced or badly sourced statements. As far as I can see that's where we are right now; see for example edits like https://www.wikidata.org/w/index.php?title=Q41264&type=revision&diff=619653838&oldid=616277912 .

Of course, the next step would be to be able to compare existing sourced statements with external data to find differences. But how would the workflow be? Take for example Johannes Vermeer ( https://www.wikidata.org/wiki/Q41264 ). Extremely well documented and researched, but http://www.getty.edu/vow/ULANFullDisplay?find=&role=&nation=&subjectid=500032927 and https://rkd.nl/nl/explore/artists/80476 combined provide 3 different dates of birth and 3 different dates of death. When it comes to these kinds of date mismatches, it's generally first come, first served (the first date added doesn't get replaced). This mismatch could show up in some report. I can check it as a human and maybe make some adjustments, but how would I sign it off to prevent other people from doing the same thing over and over again?
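As a very rough sketch of that comparison step: the snippet below, in Python, pulls the date-of-birth claims for an item through the public wbgetclaims API and flags a difference against an external value; the external value is hard-coded here where a scraper or API client for the external database would go.

# A sketch of comparing an existing Wikidata statement with an external value.
# Uses the public wbgetclaims API; the external value is a hard-coded
# placeholder standing in for whatever an external database returns.
import requests

def wikidata_birth_dates(qid):
    params = {
        "action": "wbgetclaims",
        "entity": qid,
        "property": "P569",  # date of birth
        "format": "json",
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=30)
    r.raise_for_status()
    claims = r.json().get("claims", {}).get("P569", [])
    return [c["mainsnak"]["datavalue"]["value"]["time"]
            for c in claims if "datavalue" in c["mainsnak"]]

external_value = "+1632-10-31T00:00:00Z"  # hypothetical value from an external source
local_values = wikidata_birth_dates("Q41264")
# A real report would normalise precision and calendar model before comparing.
if external_value not in local_values:
    print("Mismatch to review:", local_values, "vs", external_value)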

With federated SPARQL queries it becomes much easier to generate reports of mismatches. See for example https://www.wikidata.org/wiki/Property_talk:P1006/Mismatches .
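For what it's worth, a rough sketch of how such a mismatch report could be generated programmatically against the Wikidata query service; the SERVICE endpoint and the schema.org-style predicates inside it are hypothetical placeholders, not the actual setup used on the P1006 Mismatches page:

# A sketch of a federated mismatch report run against the Wikidata
# SPARQL endpoint. The remote SERVICE endpoint and its predicates are
# hypothetical placeholders; a real query must use the external
# database's actual endpoint and vocabulary.
import requests

QUERY = """
SELECT ?item ?wikidataDate ?externalDate WHERE {
  ?item wdt:P1006 ?externalId ;
        wdt:P569 ?wikidataDate .
  SERVICE <https://sparql.example.org/query> {    # hypothetical endpoint
    ?record schema:identifier ?externalId ;       # hypothetical predicates
            schema:birthDate ?externalDate .
  }
  FILTER (STR(?wikidataDate) != STR(?externalDate))
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "mismatch-report-sketch/0.1 (example)"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["wikidataDate"]["value"], row["externalDate"]["value"])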

Thank you very much for your answer and your pointers. The page (which I did not know existed) containing a federated SPARQL query is definitely close to what I mean. It is just missing one more step: deciding who is right. If we look at the first result of the table of mismatches <https://www.wikidata.org/wiki/Property_talk:P1006/Mismatches> (Dmitry Bortniansky <https://www.wikidata.org/wiki/Q316505>) and we draw a little graph, the result is:

[image: Diagram.png]

We can see that the error comes (probably) from VIAF, which contains a duplicate, and from NTA, which obviously created an authority record based on this bad VIAF ID.

My research is very close to this kind of case, and I am very interested to know what is already implemented in Wikidata.
