Formal Metadata

CC Attribution 4.0 International: You are free to use, adapt, copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Speakers: Keith Russell (ANDS) will provide an overview of the key components of Interoperability. 2) OzNome - interoperable data as an example of the FAIR data principles: Simon Cox, David Lemon and Jonathan Yu (CSIRO) will present on how they have made the research data in the OzNome project interoperable, not only for humans but also for machines. #3 INTEROPERABLE covers: -- an overview of the three INTEROPERABLE principles, which use vocabularies for knowledge representation, standardisation, and qualified references to other metadata; -- resources to support institutional awareness and uptake of the Interoperable principles. The FAIR data principles were drafted by the FORCE11 group in 2015. The principles have since received worldwide recognition as a useful framework for thinking about sharing data in a way that will enable maximum use and reuse. This webinar series is a great opportunity to explore each of the 4 FAIR principles in depth, with practical case studies from a range of disciplines and organisations from around Australia, and resources to support the uptake of the FAIR principles.

Just to introduce myself: my name is Keith Russell, I work for the Australian National Data Service, and I'm your host for today. My colleague Suzanne Sabine is behind the scenes co-hosting the webinar with me. Just the usual little bit of background: the Australian National Data Service works with research organisations around Australia to establish trusted partnerships, reliable services and enhanced capability in the research sector. We work together with two other NCRIS-funded projects, RDS (Research Data Services) and Nectar, to create an aligned set of joint investments to deliver transformation in the research sector. So this webinar is part of a series of activities we are undertaking which aim to support the Australian research community in increasing our ability to manage our research data as a national asset. As I mentioned earlier, this is the third in a series of webinars around FAIR: we've already had the webinars on Findable and Accessible, today is Interoperable, and next week Reusable. So today I will give a

01:05

brief introduction to what 'interoperable' means as described under the FAIR data principles by FORCE11, and then I'm very grateful that Simon and Jonathan have made themselves available to talk about what they did in practice in the OzNome project to make their data interoperable. I think it's a great example of how this quite complex topic can actually be carried forward in practice.

So this is what FORCE11 says about 'interoperable', and first of all a few things to keep in mind, just reiterating points I mentioned in the very first webinar. As you look at these headings you'll see that they talk about data and metadata: interoperable applies both to the metadata describing the data collection and to the actual data itself. Another point to keep in mind is that throughout the FAIR principles they think a lot about data being usable not only for humans but also for machines. That provides huge benefits in bringing together disparate data sets, in bringing together bits of knowledge that are distributed over different data sets, and interoperability is a key element in making sure data can actually be brought together so we can get the benefits of doing so: new knowledge discovered, new relationships discovered, new patterns recognised.

So looking at the three headings listed under interoperable, the first one is that data and metadata use a formal, accessible, shared and broadly applicable language for knowledge representation. The thing to keep in mind there is that it's not only you as the researcher who created the data, but also another researcher who wants to understand and use the data, who needs to understand the language you've used, so it helps if that is a standardised language that other users can also pick up and use. That is definitely the case for the metadata, and ideally it would also apply to the actual data itself. A very basic example: if the researcher has observed a magpie, they can write 'I saw a magpie', but it's much more useful for a researcher somewhere else on the other side of the world if you write that it was an Australian magpie, and that it is Cracticus tibicen. Using that standard language means a researcher on the other side of the world will be able to better understand what you meant and what the description is about. And it's not just the actual wording, the vocabulary used: it's also useful to have a framework around that which allows the data to be machine readable, so it can be picked up, used and interpreted by machines. One obvious example, which gets mentioned quite a lot, is using RDF and ontologies. That is quite common in the life sciences, and a number of life science researchers were quite active in the FORCE11 group, but one thing they emphasise is that it doesn't just have to be through RDF and ontologies; there might be other solutions, and they don't want to make it exclusively about those technologies. So that's something to keep in mind regarding making data interoperable, and that's what I've invited Simon and Jonathan to come and talk about in much more detail.

The second point here is around using vocabularies, and they emphasise that first of all you should try to use a vocabulary that already exists and is agreed on by the community. If you have terms that are not in that vocabulary but it otherwise fits, try to get them added to that vocabulary, and only if that is not possible, then and only then, start creating your own vocabulary. So please don't go out and create vocabularies for everything; rather, look for an existing community-agreed vocabulary. Also make sure that the vocabulary itself is FAIR: findable, accessible, interoperable, reusable. In your data set you should have a reference to the vocabulary you are using, and make sure that that vocabulary can be found for just as long as your data set can be found.

The final point they make is that the data and the metadata should include qualified references to other data and metadata. What they mean there is that it shouldn't just be a reference to, for example, another data set, but also an indication of what that relationship is: not just 'it's somehow related to this other data set', but perhaps 'it is a subset of another data set' or 'it builds on another data set', using standardised terminology. A little more on qualified references: from the perspective of the metadata especially, it's valuable not only to refer to other parties or other elements around your data set, but to do that using identifiers. For example, if you are describing your data set and saying somebody was involved in creating it, provide a qualified reference stating that the person was, for example, the author of the data set, and if possible also an identifier to identify that person. That allows other relationships and further connections to be made, and that information to be picked up and used, especially when being analysed by machines. Here is a list of possible identifiers; these are just examples, there are more identifiers out there. If you're referring to an author, include their ORCID; if you're referring to a publication, use the DOI related to that publication; if you are referring to software, nowadays you can assign a DOI to a software package and refer to that DOI, et cetera.
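The qualified-reference pattern just described can be sketched in code. Below is a minimal, hypothetical metadata record (every name, title and identifier is invented for illustration) contrasting a plain name string with a qualified reference that states the role or relationship type and uses resolvable identifiers: an ORCID iD for the author and a DOI for a related data set.

```python
import json

# A hypothetical dataset description. Identifiers and titles are invented;
# the pattern (typed relationships + resolvable identifiers) is the point.
record = {
    "title": "Magpie observations, south-east Australia",
    # Qualified reference to a person: the role is explicit and the person
    # is identified by an ORCID iD, not just a free-text name.
    "creator": {
        "name": "A. Researcher",
        "role": "author",
        "identifier": "https://orcid.org/0000-0000-0000-0000",
    },
    # Qualified reference to another data set: not just "related somehow",
    # but *how* it is related, plus a DOI a machine can resolve.
    "relatedDatasets": [
        {
            "relationType": "IsDerivedFrom",
            "identifier": "https://doi.org/10.0000/example.12345",
        }
    ],
    # Standardised species name rather than only the free text "magpie".
    "subject": {"label": "Australian magpie (Cracticus tibicen)"},
}

print(json.dumps(record, indent=2))
```

The same pattern appears in DataCite metadata records, where relation types such as `IsDerivedFrom` are drawn from a controlled list rather than free text.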
I've rambled on enough for now, so I'd like to hand over to Simon and Jonathan; I'm grateful that they've made their time available. Just a brief introduction: Simon is a research scientist in CSIRO Land and Water's environmental information systems research program. He specialises in distributed architectures and information standards for environmental data, focusing on geosciences and water. Jonathan Yu is a research computer scientist specialising in information architectures, data integration, linked data, the Semantic Web, data analytics and visualisation. He's part of the environmental informatics group in CSIRO Land and Water. So together they've been

08:08

very active in applying their thinking around making data interoperable in the OzNome project. Now, one thing I want to point out is that in the OzNome project they did a whole series of work around the FAIR data principles in all their different aspects. Today I've asked them to focus especially on Interoperable, but please keep in mind that they've also done a whole bunch of other work. So without any further ado, I'd like to hand over to Simon and Jonathan; I'm very intrigued to hear how they've picked up interoperability and used it in the OzNome project. Okay, thanks Keith, and thanks for the introductions as well. Today we're presenting on some of the work we did in the OzNome initiative, particularly looking at Land and Water and the data that we have in CSIRO, and how to make that interoperable according to some of the principles that FAIR espouses. We will talk about some of the implementations that

09:17

we have explored around turning the FAIR principles into actionable questions that address how fair your data is. If you haven't come across OzNome: this is a CSIRO initiative aiming to connect information ecosystems around Australia. The OzNome name was coined echoing the Genome Project, with 'Oz' being Australia and the 'nome' being genome-inspired. But really what we're looking at here is tools, services, products, methods, approaches, practices and infrastructure to support having more connected information infrastructures, and in the previous year, as Keith mentioned, we focused on environmental information infrastructures. There are a couple of links you can follow, and today we'll be talking about an example in the water space. Okay, so as part of establishing

10:21

the OzNome architecture, our OzNome infrastructure, we felt that we needed to assist our potential data providers to understand what good data is, or, in the context of this seminar series, what FAIR data is, which we call OzNome data. Basically we developed a set of rating criteria and a tool to allow assessment by data providers of the data that they're providing. On the right-hand side of the screen here you can see a screen capture of the kick-off page of the tool. You'll also notice that we've got a slightly adapted version of the FAIR criteria: findable, accessible,

11:18

interoperable, reusable, but we also add in the last line there 'trusted', which appears to go a little bit beyond what has been conceived in FAIR until now, but which we suggest would be a useful addition. We've bundled interoperable and reusable together, as we see those as being very closely related, and obviously this is teasing out some of the issues around what it is that makes data interoperable. Keith's given a high-level overview and indicated what some of the concerns might be; we've done our own take on this, actually fairly strongly leaning on our experience over a number of years, more than a decade now, of working in the data standards communities, in particular the geospatial data standards communities, and on some of the learnings that we've got from there, which we're applying directly here. Obviously a lot of environmental data, which is where our heritage is and where we've largely been working, is geospatial, so it makes sense to be building on that. Now, just a bit of a reminder of the

12:35

FORCE11 FAIR principles: this is a summary slide from Michel Dumontier, who's one of the original authors of the papers and developers of the FAIR principles. They've got the guiding principles with the four keywords, teased out into three or four sub-principles in each case under the F, A, I and R letters; we're looking at the Interoperable set here, which Keith has already shown. It's interesting that Michel has recently done a study evaluating a number of repositories, particularly in Europe, though some of them are broader than that. Here's the list of repositories that were evaluated and scored on the FAIR principles, and the data is available in this form. Actually

13:28

this table shoots off to the right of the screen, there's lots more going on there, but looking at the summary of the results it's fairly notable that the

13:38

tallest red bar here is in the Interoperable category. What this is saying is that of the FAIR data principles, this is the one which is hardest to meet, the one that's hardest to conform to, and so really that's the

13:55

focus of of the approach that we've taken which is to to kind of lead people through how they can make their data more fair more Haas nom ik more interoperable and the particular way in which we've broken out the question of interoperability is on if you look at the the numbered terms here is it loadable is it usable is it comprehensible is it linked as well as is it licensed and I'm just going to go
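As a toy sketch of the idea (not the actual OzNome 5-star tool, whose criteria and weighting are more detailed than this), an interoperability score can be thought of as counting yes answers to the five questions just listed; the equal weighting and question set here are my simplification.

```python
# Toy interoperability score: one star per "yes" answer. An illustrative
# simplification, not the real OzNome 5-star rating tool.
QUESTIONS = ["loadable", "usable", "comprehensible", "linked", "licensed"]

def interoperability_stars(answers):
    """answers: dict mapping each question to True/False."""
    return sum(1 for q in QUESTIONS if answers.get(q, False))

# Example: a file in a standard format with a schema and a licence,
# but with text-only field labels and no outbound links.
answers = {"loadable": True, "usable": True,
           "comprehensible": False, "linked": False, "licensed": True}
print(interoperability_stars(answers))  # 3
```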

14:28

through some of the details of those. You'll see this is in a sense fairly repetitive of some of the concerns that Keith explained at the beginning, but we're putting some more concrete examples onto these criteria, just to indicate to our data providers that when we say a standard data format we mean something like CSV or JSON or XML or netCDF. These are all important file formats; towards the left they're kind of general, while netCDF is one that's used a lot in the remote sensing and environmental science communities. So we've got a bit of a ladder here of different levels of conformance which you can reach on whether a data set would be loadable. Is it in a unique file format? That means you've got to have some unique software to load it. Or is it in a standard data format? Normally that would be denoted by one of the standard MIME types. Best of all would be for data to be provided in multiple standard formats, giving a choice to the user so that, whatever their favourite platform for loading data, they can use it.

Next question: even when you've loaded it, can you use it? Even if it's loaded, if the structures within the data set are unclear then it's not going to be very usable, and that comes down to the matter of whether there is a schema provided which makes explicit the structures within the data set. A lot of traditional data, yes, there's a structure in there, but the schema is not available independently of the data; if you like, the schema is implicit, it's not formalised, and the schema is maybe different every time. A lot of spreadsheets are done that way: a spreadsheet's got a lot of boxes, but if every time you use it you add different columns and use the pages in the spreadsheet in a different way, then it takes a little while for the user to get their head around what's going on before they can use it. There are various explicit schema languages: DDL, which is largely used for relational systems; XML Schema; something coming out of the Open Knowledge world these days called Data Packages, which allows you essentially to describe a schema for a CSV file; then in the RDF and Semantic Web space, RDF Schema; and even JSON has a schema language these days, although it's not broadly used. So it's nice to provide data with a schema, but best of all would be to say 'the data I'm providing uses this community schema'. For example, the Open Geospatial Consortium provides a number of community schemas for observations, for time series, for hydrology, for geoscience, and if you're publishing or attempting to share data in any of these disciplines then it's best to go off and find a community schema.

Then, even when you've got it loaded and you understand what the structures are, you've still got the question of what the words and numbers inside the boxes are. Are the column headings explicit enough to understand, or are they just shorthand for something which the project leader, when he or she was developing the data, knew he or she would understand the next week, but which even he or she, coming back to it the next year, may not understand? Best, of course, is if the field labels are linked and do have explanations, probably in plain text, and better still is to use standard labels, for example UCUM (the Unified Code for Units of Measure) for units, or the Climate and Forecast (CF) conventions coming out of the fluid-earth community. So the ladder that we've got here asks: are you using standard labels? Are some of the field names linked to standard, externally managed vocabularies? Are all of the field names linked to standard, externally managed vocabularies? And you step up this ladder, better and better and better.

And then there's the question of how well linked your data is. Well, if it's just a file sitting on a server somewhere, with no links in or out, you're lucky to find it. For most of the data sets in this community, what we'd be expecting is that they're indexed in a catalogue or available from a landing page, and that's a situation where you've got inbound links to the data set. Best of all is when there are outbound links embedded or implicit in the data structures in the data set which say exactly how it's related, and this links in with some of the previous concerns that we had there about field names and those kinds of things. So I'm going to hand back to Jonathan to tease
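The 'schema plus linked field definitions' ladder can be made concrete with a Data Package–style descriptor for a CSV file. The descriptor below is hypothetical (the file, field names and vocabulary URIs are invented, and the `unit` and `propertyUri` keys are illustrative extensions rather than core Table Schema keywords), but it shows the pattern: each column gets an explicit type and, where possible, a link out to an externally managed definition.

```python
import json

# Hypothetical Frictionless Data–style descriptor for a CSV of daily
# evapotranspiration values. Paths and URIs are placeholders.
descriptor = {
    "name": "daily-et-example",
    "resources": [{
        "path": "daily_et.csv",
        "format": "csv",
        "schema": {
            "fields": [
                {"name": "date", "type": "date"},
                {
                    "name": "e0_avg",
                    "type": "number",
                    "unit": "mm",
                    # Outbound link: the column is defined by an external,
                    # resolvable vocabulary entry, not just a short label.
                    "propertyUri": "http://example.org/def/op/e0_avg",
                },
            ]
        },
    }],
}
print(json.dumps(descriptor["resources"][0]["schema"], indent=2))
```

A consumer who has never seen the file can now load the CSV, type its columns correctly, and resolve the property URI to find out what `e0_avg` actually measures.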

19:25

through a case study that we've got here, based on the AWRA-L (Australian Water Resources Assessment - Landscape) data sets. Jonathan? Yeah, so as mentioned earlier, in the OzNome project we looked at a practical example and case study in the AWRA-L data set. This is a continental-scale data set that has historical time series from 1911. The Bureau publishes an operational version online, and you can find that on the website, but often scientists basically have to use this data set by knowing where it is and knowing how to use it implicitly: knowing how to reference the requisite geospatial features and understanding the field name values. I've got an example, and the next slide shows the assessment of it

20:22

using our tool. Just focusing on the interoperable side of things, we have rated it: it's available as a web service, so we can get it via the web; however, the reference definitions are text only and they are very localised in the data set itself. I'll give an example in the next slide. So this is

20:44

coming out of the netCDF metadata of this data set. You can access this file online through THREDDS or with the netCDF tools, but this is a summary of the metadata that comes along with the data. We've got the long name here, 'potential evapotranspiration'; we've got the name, which is a label for the field, 'e0_avg'; units, 'mm'; and a standard name, which is a convention in netCDF to refer to the actual property, which here is 'e0_avg', following the CF conventions that are often used with this format. If you are an expert in this area and you've used this data set many times, you'll know what this is; if you are a newcomer, you have to do a little bit of work to understand what this data field actually means. In the OzNome project what we did was enrich this with external variables, so if you go to the next slide: this is the same field, and we've added the lines at the bottom here, which tease out what this particular data field means in the context of externally defined vocabularies. We've now enriched this with a scaled quantity kind identifier, 'potential evapotranspiration', and the HTTP URI where you can resolve it and get a definition; similarly for substance or taxon, unit ID and feature of interest. Just to talk about what they are: part of the project was to explore whether we could define vocabularies for these, from which we could make outbound links from the data to the definitions. This is just a summary of what we did in the context of the AWRA-L data set, and this is an example for potential
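The before/after enrichment being described can be sketched with plain Python dicts standing in for the netCDF attribute table; the attribute names chosen for the added URIs, and the URIs themselves, are illustrative assumptions rather than the exact ones the project published.

```python
# Original netCDF-style attributes for the potential evapotranspiration
# field: text-only labels, meaningful mainly to people who already know
# the data set.
attrs = {
    "long_name": "potential evapotranspiration",
    "name": "e0_avg",
    "units": "mm",
    "standard_name": "e0_avg",
}

# Enrichment: add resolvable HTTP URIs linking the field out to externally
# defined vocabulary entries. Attribute names and URIs are illustrative,
# following the pattern described in the talk.
attrs.update({
    "quantity_kind_uri": "http://example.org/def/property/potential-evapotranspiration",
    "unit_uri": "http://example.org/def/unit/mm",
    "feature_of_interest_uri": "http://example.org/def/feature/landscape-water-balance",
})

for key, value in attrs.items():
    print(f"{key}: {value}")
```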

22:44

evapotranspiration. We've got a conceptual model here with the broader notions of potential evapotranspiration and evapotranspiration, and we've got linked relationships out to things like the feature of interest, object of interest and unit of measure. So this view provides a vocabulary entry for potential evapotranspiration: not only an identifier for it, not only a description for it, but a richer model than you would get if you just had something inline, because you've got outbound relationships from this concept to its related concepts. Essentially this is a demonstration of defining the concepts externally, having them quite richly explained through this medium, and having the ability to link from the data set itself to this definition to make it more interoperable, so that if we have another data set that talked about potential evapotranspiration, it could potentially be linked and interoperable. Here is a revised OzNome maturity estimation
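The vocabulary entry being described is essentially a SKOS-style concept: an identifier, a human-readable definition, and typed links out to broader and related concepts. A minimal hypothetical rendering (the URIs and the definition wording are invented for illustration, not quoted from the project's vocabulary):

```python
# Hypothetical SKOS-like vocabulary entry for potential evapotranspiration.
concept = {
    "id": "http://example.org/def/property/potential-evapotranspiration",
    "prefLabel": "potential evapotranspiration",
    "definition": ("The evapotranspiration that would occur from a "
                   "well-watered surface under the prevailing weather."),
    # Typed outbound relationships, not just free text:
    "broader": ["http://example.org/def/property/evapotranspiration"],
    "related": {
        "featureOfInterest": "http://example.org/def/feature/landscape",
        "unitOfMeasure": "http://example.org/def/unit/mm",
    },
}
print(concept["prefLabel"])
```

Because the concept lives at a resolvable URI, any data set whose fields point at it can be matched against any other data set doing the same, which is exactly the interoperability gain being demonstrated.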

23:54

using the OzNome 5-star tool. Just focusing on the interoperable field, we see that it's now four: using the same tool and assessing it against the criteria, we've gone up from two stars to four stars in the interoperable space. The reason for that is that we now have reference definitions as linked data, with externally hosted observed-property vocabulary definitions rather than just inline labels of what it is, which provides more interoperability. If the vocabulary were standardised then we would have a higher estimation in that field, but it's just a demonstration of how we went about making something more interoperable through the OzNome project. And yeah, I'll just pick up at the end

24:46

here and just comment that when we were starting this data ratings exercise we actually didn't look at FAIR at the beginning; we developed our own set of criteria, these keywords here, and then subsequently correlated them with the FAIR principles. One of the interesting things was that there were three lines in this table, the ones in red, which didn't correlate with concerns that had been identified within FAIR. The first one might be seen as trivial, but we thought it was a question worth asking, particularly when working with research scientists and talking about making their data available: is your data intended to be used by anybody else? There's lots of data generated which is never shared. Now, that's not necessarily a good thing, and to a certain extent having the question there highlights the fact that there is a question to be asked, and that some scientists, some researchers, need to be encouraged to think about making their data available, about publishing it. So in terms of the FAIR principles this one was kind of the implicit starting point: if it's published, then yes, it's implicitly FAIR.

A couple of other rows: one concern which comes up particularly because we've worked a lot with agencies that have systematic data collection processes, with systematic curation and maintenance, where a data set is refreshed every day or every month or every year, is that that concern didn't seem to be particularly addressed in the FAIR principles as they stand. So we'd say the concern about whether the data is expected to be updated and maintained is maybe a bit more than FAIR. And the bottom row there was, if you like, an elaboration of the assessment of data that you might do, which is to get some information about how well trusted it is. A lot of that is about who else is using it; that's often the criterion used: who else is using it, how many times has it been used, what other products have been generated from this data set, and so can I trust it? Just emphasising that the Interoperable row there corresponds with the interoperability we've really been focusing on today, through the use of standards. I guess 'standards' is a funny word, and you have to be a bit careful with it: capital-S Standards, sometimes people think that's just to do with ISO or Australian Standards or whatever, but really the point about standards is that they are community agreements, agreements which are available for additional members of the community to join in with, and it's important to think of them as agreements to do things in a common way. So finally, just a slide with some

28:09

links to some of the material that we've been showing today, and we'll say thank

28:13

you for listening. Thank you Simon, thank you Jonathan, that was really interesting and a really useful way to see what it actually means in practice, because I think interoperable can be quite a complex, difficult subject, and sometimes also one that requires much more knowledge of the actual field of research that's going on, of what you're talking about. So I think this is a great example of where you've been working in

28:40

a specific field to try and make that data more interoperable. So thanks very

28:45

much for your time, and it was a really interesting discussion that really started to tease out a number of the issues and a number of the things that will probably need developing further. I've just put up a slide which links off to a number of resources, some of which Simon already mentioned. ANDS has a service, Research Vocabularies Australia, which anybody around the country, or actually internationally, can use: if you don't have your own tool to set up a vocabulary, that is a possible way of doing it, and there are also already existing vocabularies in there, so have a look at that if that's of interest. We also have an interest group that works in this space. If you are looking at the metadata, having qualified relationships within the metadata and using identifiers, there are a few links there to places where you can find information about possible identifiers. We're also trying to pull the metadata describing data sets together and share that internationally through a number of hubs; that's taking place through the Scholix project, with Research Data Australia as a sort of Australian hub contributing into that international effort, so have a look there if you're interested. We did 23 (Research Data) Things last year, and two of the Things are relevant for our discussion today: if you are interested in digging into vocabularies a little further and discovering what they mean in practice, have a go at Thing 12, or if you are more interested in identifiers and linked data, have a look at Thing 14. Finally, I'd like to thank Simon and Jonathan again for their time, for the excellent presentation and for the insights that they brought to the table. We would also like to acknowledge NCRIS, the National Collaborative Research Infrastructure Strategy program, that

30:32

provides the funding for ANDS. So thanks again, and we look forward to seeing you all next week.