One of the CLIP group of projects funded by JISC. The focus of the talk is on the domain, as some of the citation issues will be common with other projects. Concentrating on Sage and its data and some of the implications. Not covering everything done in the project.

Holy Trinity as written up in the proposal.

There is something missing from the triangle. It cuts across the 3 areas of data, process and publication, and is central to issues of attribution and credit. That’s the contributors! Will come back to them later.

This is an overview of a more complex process (simplified view). Does not show original sources which enter the process. Each part can be looked at in more detail – example coming up. Different people involved as each stage is specialized.

This is just one of the stages from the previous diagram. We have a version of this for each of the stages, but there is not enough time to go into each of them. Each stage has input and output, and tools employed, e.g. R scripts.

Previous slide was a simplified view; the actual process is broken down into a workflow, with configuration details, actual scripts, input and output.

Main contribution has been to capture this process, document it better via Taverna, and make it more understandable and re-usable. We need to understand the domain before we can ask the questions about citation. Lessons learned: publications have many gaps, the Perl scripts are not very user friendly, and we were working from a document shared by Sage Bionetworks. Based on a visit to Sage Bionetworks, funded by Sage, which built relationships and dialogue leading to data sharing.

Moving on to a slightly different perspective, from the domain to the general citation question that we start to address. Started to think about how citation happens. Scenarios on the blog. For each of these stages we will think about the questions that our example throws up. I cite others: input data is derived from somewhere. I make my work citable: main work of the project (Taverna). Credit: motivation, least addressed so far in the project.

Identifying the contributor is an issue, e.g. the geographical area of work; may need to identify the organisation that funded the work. Does the modification change the original? How do I preserve the link?

This has been the main area of work for Sagecite. The next slide shows the role of Taverna.

This was a diagram we had early on in the project. What about other types of publication?

I presented this at SOLO earlier this month [or say elsewhere, in recent & ongoing ORCID activities]. I reach for Geoff Bilder’s slides again and nick a few things. What I want to do here is replace Geoff’s silly little dude with glasses with my much, much cooler ‘academic dude’, as we call him in the office. ### SWITCH ### I want to show you in the next few slides a hypothetical scenario involving this dude, representing me, submitting a dataset to this digital repository, which is a companion to Geoff’s Psychoceramics Review journal. This will demonstrate some of the practicalities of how we might actually use ORCID in data publication.

Coming back to those people... ORCID is addressed by a presentation later on. We have focussed effort on discussions and building scenarios.

Have started discussions, no service yet. The question is how tools like myExperiment and Taverna, which are on the desktop and manage identity (not global), can work with a service like ORCID to exchange information, including for validation.

Finally, an advert...

Collaborators on some of the projects, provided some of the slides, Sage funded the visit, shared data and documentation.

5.
Sage data and processes <ul><li>The idealised Sage modelling process can be divided into 7 stages </li></ul><ul><li>A combination of phenotypic, genetic, and expression data are processed to determine a list of genes associated with diseases </li></ul><ul><li>Different people are responsible for different stages of the modelling process. One person oversees the whole process though. </li></ul>

6.
Stage 2: Statistical QC <ul><li>Actual values in data sets are validated for quality to check for experimental artifacts </li></ul><ul><li>The checks made depend on the type of data set and involve the use of R scripts and tools like Plink </li></ul><ul><li>The output is a normalised data set </li></ul>[Diagram labels: Curated data sets; Statistical QC; Validated and curated data sets]
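
To make the shape of a stage like this concrete, here is a minimal Python sketch of a QC step that takes a list of values, drops suspected artifacts and returns a normalised list. It is an illustration only, not the Sage pipeline, which uses R scripts and tools like Plink. The function name `statistical_qc`, the median-absolute-deviation outlier rule and the `max_deviations` threshold are all assumptions made for this example.

```python
from statistics import mean, median, pstdev


def statistical_qc(values, max_deviations=5.0):
    """Drop suspected artifacts, then z-score normalise what is left.

    Hypothetical illustration: the outlier rule and the threshold are
    assumptions, not the checks used in the Sage modelling process.
    """
    centre = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # No measurable spread, so nothing can be flagged as an artifact.
        kept = list(values)
    else:
        kept = [v for v in values if abs(v - centre) / mad <= max_deviations]
    mu = mean(kept)
    sigma = pstdev(kept)
    if sigma == 0:
        # All remaining values are identical; normalise them to zero.
        return [0.0 for _ in kept]
    # Output: values rescaled to zero mean and unit standard deviation.
    return [(v - mu) / sigma for v in kept]
```

Calling `statistical_qc([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])` would discard the extreme value and return five normalised values centred on zero. The point is the stage contract (curated data in, validated and normalised data out), not the particular statistics.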

10.
I cite others <ul><li>Challenges </li></ul><ul><ul><li>Tracking what data I have used </li></ul></ul><ul><ul><li>Some information may be confidential </li></ul></ul><ul><ul><li>Some data may be restricted access </li></ul></ul><ul><ul><li>What if I have modified the data? </li></ul></ul>
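
One possible way to approach the tracking and modification challenges above is to keep a small provenance record for each data set used. The Python sketch below is a hedged illustration, not something built in the project: the record fields, the function names and the identifier `example:dataset/1` are all hypothetical. It uses a SHA-256 checksum to detect whether the data has been modified, while keeping a link back to the original identifier.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest that fingerprints the data content."""
    return hashlib.sha256(data).hexdigest()


def source_record(identifier: str, data: bytes, restricted: bool = False) -> dict:
    """Record which data set was used (hypothetical record layout)."""
    return {
        "identifier": identifier,
        "sha256": checksum(data),
        # Restricted or confidential data can still be cited by identifier.
        "restricted": restricted,
    }


def derived_record(source: dict, new_data: bytes, note: str) -> dict:
    """Describe data derived from a source, preserving the link to it."""
    new_sum = checksum(new_data)
    return {
        "derived_from": source["identifier"],
        "source_sha256": source["sha256"],
        "sha256": new_sum,
        # The data counts as modified when its fingerprint differs.
        "modified": new_sum != source["sha256"],
        "note": note,
    }


original = source_record("example:dataset/1", b"gene,value\nA,1\nB,2\n")
cleaned = derived_record(original, b"gene,value\nA,1\n", "removed gene B")
```

Here `cleaned["modified"]` is `True` and `cleaned["derived_from"]` still points at the original identifier, so the modification does not break the citation link. Whether a modified data set should receive its own identifier is a separate policy question.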

14.
I make my work citable <ul><li>Challenges </li></ul><ul><ul><li>Making my work re-usable </li></ul></ul><ul><ul><li>Granularity of credit </li></ul></ul><ul><ul><li>When to assign a new identifier </li></ul></ul><ul><ul><ul><li>What type of identifier </li></ul></ul></ul><ul><ul><li>What represents intellectual input – which contributions deserve to be cited? </li></ul></ul>

18.
Different forms of publication <ul><li>As support for an article </li></ul><ul><li>Publish to a repository/archive </li></ul><ul><li>Blogs or other social networking sites </li></ul><ul><li>Micro-attribution (nano-publication) </li></ul>