Software Engineer at InterMine, in the Dept of Genetics @Cambridge_Uni. Fond of UIs, open source, veggies, running & sci-fi. http://yo-yehudi.com

Jul 25

BOSC 2017 Day 2, Part 2 #BOSC2017 #ISMBECCB

This is the part of the story in which I summarise the final BOSC lightning talks, open data panel, and FAIR bingo. You may wish to read about Day 1 of BOSC, or Part 1 of Day 2, before you read this section!

ToolDog

I left off partway through the late-breaking lightning talks in my previous post. Picking back up, we were treated to a short talk about ToolDog by Kenzo-Hugo Hillion. ToolDog provides common descriptive wrappers for tools in ELIXIR’s bioinformatics tool registry. You can see a poster about it here on F1000.

BioThings SDK

Chunlei Wu gave the second BioThings talk of the conference, focusing on the SDK, which gives you the tools to create your own BioThings API.

Standing on the shoulders of giants to fight superbugs

Kai Blin spoke about the scary prospect of antibiotics ceasing to work altogether in the not-too-distant future; as we all know, antibiotics are regularly overused and misused. He followed up by giving us a ray of hope and inspiration: there are efforts to discover new ways to fight off infection. Often, these are computational efforts.

And what software stack does science use, overwhelmingly? It’s open source. We may contribute just a line of code here, a pull request there, or maybe even maintain a larger package — but without the work of open source software, the future might be a bleaker place than it is. Three cheers for open source!

Lunch and BoFs

I was terribly torn about which BoF — “Birds of a Feather” — session to attend at lunchtime. On the one hand, there was the JOSS BoF, and the JOSS talk earlier in the day had been really interesting. On the other hand, the eLife BoF sounded so very interesting, too!

I flipped a mental coin and ended up at eLife, where Naomi Penfold ran a fantastically targeted session, discussing ways to handle scientific code appropriately, given that traditional models of publishing papers can result in lost and obfuscated data, code that can’t be re-run, and poorly examined code that may not have undergone any peer review at all.

I was fascinated by the discussions about how this could be solved, and how much of the onus is on the journal. Early on I argued that journals being strict about software quality and reproducibility was just sensible, a necessary step to create valid science, but it quickly became clear that my viewpoint was somewhat naive — if a journal is too laborious to submit your work to, many people will simply go to a less rigorous journal, because a lower quality bar is an easier bar to pass. The summary of the discussion was taken in real time by Naomi on etherpad.

Publish in the way you practise — e.g. don’t force well-formatted rMarkdown documentation to be submitted in Word format!

Afternoon Sessions

RADAR-CNS

Nivethika Mahasivam gave a strong technical overview of RADAR-CNS, an open source project which creates tooling for tracking data from personal devices such as wearables and phones. It’s designed to enable real-time and retrospective analysis of the data, helping people with depression, multiple sclerosis and epilepsy. It must be incredibly rewarding to work on a project that helps people out like this!

Wikidata

Andrew Su introduced Wikidata to the audience. You’re probably familiar with Wikipedia, which shares data in a human-readable format. Wikidata represents the same types of data, but in a format designed to be read by machines and queried via SPARQL (pronounced “sparkle”). He also pointed out that CC-BY (a Creative Commons licence which permits re-use of the content, but requires attribution of the original source) is often a sub-optimal licence and can make re-using data very difficult. It’s hard enough to comply with CC-BY that data under this licence can’t be used on Wikidata. Instead, consider using CC0, a dedicated public domain licence.
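To make "consumed via SPARQL" a little more concrete, here is a minimal sketch of what a machine-readable query against Wikidata might look like. It is an illustration rather than anything shown in the talk: the query itself (five items that are an instance of "gene"), the helper names, and the hand-written sample response are all assumptions made for this example. The code only builds the request URL and parses a response in the standard SPARQL JSON results format, so it runs without touching the network.

```python
import json
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query (an assumption for this sketch): five items that are
# an "instance of" (P31) "gene" (Q7187), with English labels.
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_query_url(query, endpoint=WIKIDATA_SPARQL_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": query.strip(), "format": "json"}
    return endpoint + "?" + urlencode(params)


def extract_labels(results):
    """Pull (item URI, label) pairs out of a SPARQL JSON results document."""
    rows = results["results"]["bindings"]
    return [(row["item"]["value"], row["itemLabel"]["value"]) for row in rows]


# Hand-written stand-in for a response; the URI and label are placeholders,
# not real Wikidata content.
SAMPLE_RESPONSE = json.loads("""
{
  "head": {"vars": ["item", "itemLabel"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE"},
     "itemLabel": {"type": "literal", "value": "example gene"}}
  ]}
}
""")

if __name__ == "__main__":
    print(build_query_url(EXAMPLE_QUERY))
    print(extract_labels(SAMPLE_RESPONSE))
```

To run the query for real, the URL could be fetched with any HTTP client (sending a descriptive User-Agent header is expected by the service), and the decoded JSON passed to `extract_labels`.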

BioCADDIE

Bioschemas

“FAIR Bingo” became a running joke at BOSC. As one of the many original creators of the FAIR data principles of Findability, Accessibility, Interoperability, and Reusability (paper), Carole Goble was a natural choice to introduce us to Bioschemas, an effort to make the biological web FAIRer. The Bioschemas working group includes quite a few organisations, including my own workplace, InterMine.

Brassica Information Portal

Image Credit: U.S. Department of Agriculture/ Flickr (public domain)

The Brassica Information Portal was introduced to us by Annemarie Eckes. Brassicas include cabbage, mustard, broccoli, and rapeseed/canola, among other delicious foodstuffs. The portal encompasses both a GUI and an API, allowing researchers to access and update data programmatically or graphically, as preferred.

Open Data — Standards, Opportunities and Challenges

This panel discussion drew together several of the speakers from BOSC: Madeleine Ball, who gave the Open Humans keynote on day 1; Carole Goble, who introduced Bioschemas earlier in the day; Nick Loman, who delivered the keynote after this panel; and Andrew Su, who discussed Wikidata and CC0 licensing. Mónica Muñoz-Torres chaired the discussion.

Some of the key points:

Attitudes toward data sharing: a generational problem?

Carole Goble shared the interesting but somewhat disappointing fact that whilst younger people (e.g. PhD students) are generally pro-open data, PIs in projects are more likely to be resistant to it, considering the time and money required to be too great to justify the effort. In an ideal world, data would be open by default (excluding scenarios where open data is inappropriate, such as personal privacy, health, etc.), creating an environment where anyone who isn’t sharing their data appears to be the odd one out.

Data “Flirting”

A phrase shared by Carole Goble: incomplete data sharing can be used as a marketing ploy, revealing just enough to interest others without actually providing anything of real use!

Open data: how do you prevent its misuse?

An insightful audience member asked the panel how to prevent open data from being misused. Madeleine Ball responded that truly open data may indeed be misused, and there isn’t really any way to stop it without making the data less open. A double-edged sword, perhaps?

Nick Loman further pointed out that Open Source also cuts both ways, with North Korea benefiting from Open Source software as much as anyone else. Indeed, when I googled this briefly, I noted that Wikipedia has an article on Red Star OS, a modified North Korean Linux distribution.

Not all data should be open

This seems surprising when coming from a panel discussing open data, but even pushing aside the issues of ethics in personal genomics, health, location, privacy, etc., there are other reasons to keep data hidden.

Conservation and ecology are good examples of this. Publishing the location of rare, interesting, or vulnerable animals could result in their harm or death, whether the people using that data are malicious or merely curious to see the animal. I actually had an experience with this myself recently: I live in the Cambridgeshire Fenlands (exact location intentionally kept vague), and whilst birdwatching I was privileged to spot a pair of cranes, a bird so rare in the UK that its single breeding site is kept secret for its own safety.

Re-digitisation

Hilmar Lapp mentioned from the audience that sometimes merely having data “open” isn’t enough — it needs to be in a sensible format. This is probably a place where the academic publishing model has room for improvement — for example, large amounts of tabular data shouldn’t be converted into PDF format, where one might have to literally re-type the entire dataset by hand in order to re-use the data.
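As a small illustration of why format matters, here is a sketch of how little code is needed to re-use a table when it is shared as plain CSV rather than locked inside a PDF. The table contents, column names, and helper functions are all made up for this example; they are not taken from any real dataset.

```python
import csv
import io

# Made-up measurements standing in for a paper's supplementary table.
TABLE_AS_CSV = """sample_id,species,height_cm
S1,Brassica napus,102.5
S2,Brassica oleracea,48.0
S3,Brassica rapa,61.2
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def mean_height(rows):
    """Average the height_cm column (values arrive as strings)."""
    return sum(float(row["height_cm"]) for row in rows) / len(rows)


if __name__ == "__main__":
    rows = load_rows(TABLE_AS_CSV)
    print(len(rows), "rows loaded")
    print("mean height (cm):", round(mean_height(rows), 2))
```

The same table embedded in a PDF would first need to be extracted or re-typed before any of this could happen.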

He further asserted that re-digitising data is a transformative enough act that it’s not only legal to license the data openly, but is in fact a duty, lest the same data be re-digitised repeatedly by different parties. (A side note: can anyone verify this claim? I’d love to hear more.)

In my mind, this problem is kin to the Digital Dark Age, the very real problem that as technology moves on, we’ll no longer have the hardware or software to read old data formats.

Thanks for reading with me this far. I’ll finish up with my final post, coming soon, which will summarise the keynote for Day 2 of BOSC: an inspiring talk delivered by Nick Loman about making a difference in epidemic-stricken countries by sequencing viral genomes during outbreaks.

Disclaimer: Any views expressed are my own, not necessarily those of PLOS.