
Research software engineering as a career path

What is an RSE (Research Software Engineer), you may ask? It’s a role that has existed for decades, but has only gone by this name for a few years. As RSEs, we tend to be software engineers who work in academia, or perhaps academics who write production-ready code – or maybe both.

A common theme seems to be universities establishing RSE groups that work in a consultancy style – academics who have code, or a need for code, approach the group and are helped through their tasks, whether that's refactoring some old/messy/slow code, providing suggestions, or writing code to make their research easier. The RSE group may also provide training in programming languages, version control, best practices, and other computational basics that ease the burden on researchers.

Whilst I think most or all of us at InterMine would consider ourselves to be RSEs, we don’t really fit this model – we all write code, we all contribute to papers, but all of our sub-projects and work centre on a single primary project. Some of us are working to make InterMine more FAIR, others to make it easy to launch InterMine on the cloud, but it’s still all InterMine. I’m sure we’re not the only group like this, and it makes me wonder if there should be names for the different flavours of RSE groups out there. Central RSE groups vs. dedicated RSE groups? Consultancy / support / advocacy RSEs vs. RSE specialist groups? I’m not sure any of these are quite right, and I’d be curious to hear what others think.

RSE 2018: a grassroots conference for research software engineers

Moving on from musing about job titles, though – a bit about the recent conference. RSE2018 is the third annual UK conference for Research Software Engineers, but it’s the first time I’ve attended, personally. It made a change to have a conference where everyone around was working in research and software development, but not all of it was open source or bioinformatics related. I relished the chance to meet and discuss career paths with others, and enjoyed it perhaps a little too much when the late-night conference dinner descended into attempts to assign poetry genres to different programming languages. Java is obviously epic poetry, but others get trickier. Terse Clojure might be a haiku, and perhaps Python, with its structured whitespace, is a form of concrete poetry?

The conference keynotes varied: there was an introduction to Oracc, a digital humanities project that hosts annotated and transcribed cuneiform; an introduction to the Microsoft HoloLens and some of the challenges and history of its creation; and a talk about Google DeepMind. I particularly enjoyed the keynote on the sustainability of research software. Given how chaotic dependencies make everything, it’s no wonder that maintaining software takes a significant amount of time and money!

You could think of software dependencies like ripples on rainy water: all spreading out and overlapping, becoming beautiful chaos as they interact with one another, says @jameshowison.

There were some hands-on tutorials and workshops, but I mostly attended RSE-community related sessions. A couple that stood out to me, in no particular order:

Diversity in recruiting RSEs. We had speakers from Microsoft talking about their efforts to make their research staffing pool more diverse, which included gruelling-sounding half-day sessions where candidates were interviewed by four different interviewers in an attempt to remove bias. Somewhat entertainingly, the room this was conducted in – the senate chamber – had red throne-like seats and eight large portraits on the walls, every single one depicting an older white male. The irony was not lost upon the session attendees!

The RSE community AGM. Rather than remaining an informal gathering of individuals, the UK RSE group will soon be re-launching as an official society that members can join for a nominal fee. The AGM gave us a chance to hear about some of their plans (you can sign up to hear about the launch date), as well as the opportunity to share our wish lists of likes, dislikes, and comments on the group's activities. I’m looking forward to interacting with the society and seeing where they head!

It’s a conference I’d definitely like to attend again. If you missed out, you can catch up with many of the relevant points on Twitter under the hashtag #RSE18.

GCCBOSC 2018

BOSC (the Bioinformatics Open Source Conference) is normally part of ISMB (Intelligent Systems for Molecular Biology), but this year, for the first time, it teamed up with the Galaxy Community Conference (GCC) instead. For us, this presented an exciting opportunity – like a regular BOSC, but with the added bonus of training days and the chance to interact with Galaxy contributors during the CollaborationFest hackathon (and throughout the rest of the conference, too).

While we did recommend that people try to install the InterMine Python client in advance, thanks to Binder we were also able to work around installation issues for anyone who didn't have it set up. You can still see the tutorial exercise notebooks and work through them, and we have the same set of notebooks with answers if you get stuck or need a hint. This was the first time we worked through the exercises interactively on screen this way, but it seemed to work well! I’m hopeful we can continue providing the API portion of our tutorial this way in the future.
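
To give a flavour of what the notebooks cover, here's a minimal sketch using the InterMine Python client – the mine URL and the query fields are illustrative examples, not necessarily what the tutorial notebooks themselves use:

```python
# A minimal example with the InterMine Python client (pip install intermine).
# The mine URL and the views/constraints below are illustrative only.
from intermine.webservice import Service

# Connect to a mine's web service (FlyMine used here as an example)
service = Service("https://www.flymine.org/flymine/service")

# Build a query over Gene objects
query = service.new_query("Gene")
query.add_view("primaryIdentifier", "symbol", "organism.name")
query.add_constraint("organism.name", "=", "Drosophila melanogaster")

# Print the first few result rows
for row in query.rows(size=5):
    print(row["primaryIdentifier"], row["symbol"], row["organism.name"])
```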

We had planned to do an R section too, but ran out of time – the tutorial was about two and a half hours in total. If an R tutorial would be of interest in the future, please do let us know! You can do so via the comments on this article, on Twitter, by popping into chat.intermine.org, or by emailing us at info – at – intermine – dot – org.

InterMine 2.0: More than fifteen years of open biological data integration

[Slides link] We were very pleased to have a talk accepted as well as the training, giving us a chance to introduce InterMine to others and talk about its history. While I was talking, I mentioned that we were sitting at just under 300 stars on our main GitHub repo, and the audience kindly helped bump it up and over 300!

One of the topics I focused on during the talk was a massive thanks for all the work our broader community does to help InterMine become and remain a great resource. Afterwards, Lorena Pantano raised the question: how do you get others to adopt your work and contribute to it?

Personally, I’ve been working at InterMine for three years now, so I certainly can’t attest to the entirety of its history – much of this is doubtless down to the team’s great work and Gos’s great vision (and grant writing!) – but I also think one of the most important parts is making it easy for others to use your work: good developer docs, tickets that explain issues clearly, help documentation for end users, and so on. I’d love to hear more thoughts about this in the comments!

Birds of a Feather sessions

Daniela and Yo both ran separate Birds of a Feather unconference-style sessions over lunch. Yo’s BoF focused on getting (and keeping) more open source contributors – Nicole Vasilevsky was kind enough to keep notes for this session. Thanks, Nicole!

Meanwhile, Daniela shared the InterMine approach to implementing stable, persistent URIs and the issues that can arise, inspired by other data integrators and the lessons learnt from the "Identifiers for the 21st century" paper; some attendees also contributed their own solutions.
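
As a rough illustration of the style of identifier that paper recommends, here's a small sketch of building compact-identifier-style URIs – the helper functions and the report-page pattern are hypothetical examples, not InterMine's actual implementation:

```python
# Sketch: compact-identifier-style permanent URIs, in the spirit of the
# "Identifiers for the 21st century" paper. The mine base URL, helper names,
# and the prefix/accession pair are hypothetical examples.

def permanent_uri(prefix: str, accession: str) -> str:
    """Build a resolver-backed compact identifier URI (e.g. via identifiers.org)."""
    return f"https://identifiers.org/{prefix}:{accession}"

def mine_report_uri(base: str, class_name: str, accession: str) -> str:
    """A hypothetical stable report-page URI pattern within a mine."""
    return f"{base}/{class_name.lower()}:{accession}"

print(permanent_uri("uniprot", "P12345"))
# https://identifiers.org/uniprot:P12345
print(mine_report_uri("https://www.flymine.org", "Protein", "P12345"))
# https://www.flymine.org/protein:P12345
```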

Hackathon

Group meeting session at CoFest. Try to spot Daniela! 😉

During the CollaborationFest hackathon, Daniela and Yo were able to complete (yeahhhh!!) the integration between Galaxy and InterMine, thanks to the invaluable help of Daniel Blankenberg.
In the next Galaxy release, the new InterMine plugin will be available, allowing users to import data from InterMine into Galaxy and to export lists of identifiers (e.g. proteins, genes) from Galaxy into InterMine by selecting a mine instance from the InterMine registry. Watch this space – we’ll hopefully arrange to get some details onto the Galaxy Training Network explaining how to run the data imports in each direction.

What is OpenCon?

OpenCon is a yearly event designed to bring together people who are dedicated to open in all its incarnations. It’s in such high demand that the only way to get in is by application, and most attendees are provided with scholarships to help with travel and accommodation costs.

We weren’t able to attend the international event, but thankfully there was a great satellite event running in Cambridge – OpenConCam.

PeerJ is an open access journal that focuses on methodological rigour when publishing, rather than favouring groundbreaking new science – something particularly important for early career researchers. One of my favourite points from the speaker's talk was the checklist that PeerJ uses to help authors disseminate their content effectively.

Many of us know from personal experience that accessing scientific publications can be frustratingly difficult even in wealthier western countries, so it’s hard to imagine how much harder it must be in developing countries. Thankfully, there are initiatives such as the Africa Information Highway, EIFL, and Hinari which aim to make data and publications more accessible. The speaker also discussed the cultural concept of ubuntu – sharing and caring for each other – as something that works hand-in-hand with the open* movement.

Bullied Into Bad Science is a campaign to support early career researchers who may be under pressure to omit or tweak their scientific results in order to gain a desired outcome or an exciting publication. Laurent was clearly passionate about this subject: sometimes systemic pressures mean that successful academics are not necessarily good scientists – and things really shouldn’t be this way.

Queen B

This session was frantic! The basic premise: the room divided into groups of four; each group nominated a “queen bee”, who presented a problem in one minute; then the group broke up and members discussed possible solutions with others in the room for three minutes, reporting back over two minutes. Lather, rinse, repeat until everyone in a group had been a queen bee. Topics I recall discussing included getting the humanities more involved in open science, open source code in science, how to inspire people to publish in journals with strict open policies when a less principled journal would be easier, and how to sell open* to the disinterested.

Danny shared something dear to our hearts: getting others involved in open. While she was specifically referring to open access, most of her points apply equally to open science, data, and source. Her focus was on getting the most “bang for buck” – that is, finding and influencing the people who will pay off the most for the least effort.

Undergrads, for example, aren’t great targets, as most don’t continue in academia; PIs and government bodies may be more useful, because they have much more influence once they’re sold on open access. Similarly, if you don’t have enough authority to impress people yourself, it can make more sense to influence decision-makers and get them to evangelise for you. Make sensible decisions, and don’t run up against brick walls repeatedly if it isn’t paying off!

Focus Groups

After lunch, we had an unconference-style set of sessions, where everyone nominated topics and added stars beside the ideas they’d like to attend. The resulting sessions were:

Self-care in Open: Many of us volunteer time outside a normal 9-5 job to help promote open, and the environment can be discouraging or rough at times – not everyone is as keen on open as we are! Suggestions presented by Kirstie Whitaker included working with micro-ambitions (breaking your work into small, achievable chunks rather than trying to conquer everything at once), and thinking of success as a spectrum. A small win is still a win!

Open + inclusive: Laurent Gatto pointed out in a blog post earlier this year that the Open movements aren’t always as… open as they should be. Sometimes Open Science falls down in the same places less open science does – failing to ensure a decent balance of ethnicities, genders, sexual orientations, and so on. Can we do better?

If you perceive yourself as open & welcoming you need to do more than just saying it 😉 – FOCUS GROUP B on inclusivity in Open Access #OpenConCam

Open source code in science: If you’re an InterMiner, you’re probably already pretty keen on open source scientific software and can see its benefits – but not everyone can. Many, many papers that use code to produce their scientific results don’t expose that code. But if the code isn’t in the paper, or openly linked to it in some way… how was it peer reviewed? If the code is wrong, so is the science it produces. I proposed this discussion topic, and really enjoyed the perspectives from my team mates. Some of the ideas generated included:

Sharing dummy data to run your code on, if the real data are proprietary or subject to privacy issues.

Encouraging journals to adopt software availability statements.

Encouraging researchers to share their code, even if it’s only a few lines. If you’ve written six lines of code to configure an R plot, it might seem insignificant – but that’s actually really easy to peer review and correct. By comparison, bigger software packages can run to hundreds, thousands, or even millions of lines of code; the thought of trying to review that (beyond quality metrics like testing, documentation, and commenting) makes me a bit scared.

Open in the humanities: This is a fascinating subject, though I don’t think many (any?) of the audience members were from the humanities. We raised a lot of questions about the shape of humanities data.

Opening the lab door (Christie Bahlai)

After the focus groups, Christie Bahlai skyped in to talk about running an open lab. She shared some of the different types of pushback against open science:

Those who consider themselves too busy to share

People who have been pushed from ‘busy’ status to actively hostile towards open science, perhaps after being asked to participate further when they didn’t wish to

The worried – people who have legitimate concerns about open science (I’m sure I’m not the only person who doesn’t really believe in “anonymised personal data”).

The unheard – those who are disadvantaged and marginalised already worry that practising open will marginalise them further. How can we protect these people?

She also talked about getting people involved in open as early as possible, including introductions to open as part of the undergrad curriculum:

.@cbahlai: We need to get open science into the curriculum. Tell the university administrators about all the hard skills you will be providing – and then when you teach, also include all the soft skills the class needs, but that are harder to sell #openconcam

This talk was an out-of-the-blue surprise. Rather than focusing on academia like most of the previous talks, Eliot shared how open videos, photos, and “facts” on the web can be verified for journalism. If you’ve heard of doxxing, you’ll know a bit about the techniques Eliot described – using social media, satellite imagery, and other online tools to track people who don’t want to be tracked, but this time for Good. He described how some of the white supremacist rally leaders were identified, and how missile attacks in Syria were verified – including who perpetrated them and who was lying about it.

This talk brought Twitter’s usually vibrant #OpenConCam discussion to a halt, probably due to the riot of emotions it induced in most of the participants. We’d been shown highly disturbing images, felt fear wondering how these techniques could be misused, and were awed by the massive importance of what we were seeing, no matter how awful it was. I’m sure I wasn’t the only person torn between wishing I’d never seen it and knowing that I had to watch, because burying our heads in the sand isn’t an option either.

Wrap-up

OpenCon 2018 hasn’t been announced yet, but satellite events like the one I attended run all around the world throughout the year. If you haven’t attended a conference about working openly before, they’re a great way to get a taste – or if you’re a die-hard enthusiast, you’ll get the chance to meet like-minded individuals and be inspired!

Neo4j Life & Health Sciences Workshop

I really enjoyed attending the Neo4j Life & Health Sciences Workshop, organised in Berlin this week by Michael and Petra: a day rich with great presentations about the application and utility of graph technology in several research areas. Here are just a few examples:

The Ontology Lookup Service, a repository for biomedical ontologies, is implemented with graph databases for storage and Apache Solr for indexing – different technologies for different purposes.

In the Lamond lab (University of Dundee), they model proteomics data with graph databases in order to understand protein behaviour across different conditions and dimensions of analysis.

Tabloid Proteome is a database of associated protein pairs derived from mass-spectrometry-based proteomics experiments, implemented using a graph database. It can also help discover proteins that are connected indirectly, or surface information you weren’t looking for!

Reactome is a pathway database which has recently migrated from MySQL to Neo4j, with significant performance improvements. You can access the data via GraphCore, an open source Java library developed with Spring Data Neo4j, or via the Neo4j browser.

I’ve lost count of how many times I heard sentences like: “Biology systems are complex and growing, and graphs are the native data model” or “Graph database technology is an effective tool for modelling highly connected data like we have in biological systems”. We already knew it, but it’s been very encouraging and promising to hear it again from so many researchers and practitioners with more experience in graph technologies than us.

In the afternoon, I attended the “Data modelling with Neo4j” workshop; starting from the data sources we usually work with, we tried to model the entities and relationships needed to answer some relevant questions. Modelling can be very challenging and, in some cases, the right model depends on the questions you have to answer!
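
To give a flavour of the exercise, here's a minimal sketch using the official Neo4j Python driver – the gene/protein model, the connection details, and the query are hypothetical examples rather than what we actually built in the workshop:

```python
# A minimal Neo4j modelling sketch (pip install neo4j). The connection
# details and the gene/protein schema are hypothetical examples.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Model a gene and the protein it encodes as two nodes plus a relationship
    session.run(
        """
        MERGE (g:Gene {primaryIdentifier: $gene_id, symbol: $symbol})
        MERGE (p:Protein {primaryAccession: $accession})
        MERGE (g)-[:ENCODES]->(p)
        """,
        gene_id="ENSG00000012048", symbol="BRCA1", accession="P38398",
    )

    # Ask a question of the model: which proteins does BRCA1 encode?
    result = session.run(
        "MATCH (g:Gene {symbol: $symbol})-[:ENCODES]->(p:Protein) "
        "RETURN p.primaryAccession AS accession",
        symbol="BRCA1",
    )
    for record in result:
        print(record["accession"])

driver.close()
```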

Before the end, I had the chance to give a short presentation about our experience with Neo4j.

The ELIXIR Bioschemas meeting

A couple of weeks ago we took part in the May ELIXIR Bioschemas meeting, along with representatives from Google, the European Bioinformatics Institute (EBI), and other participating organisations from the UK and beyond.

To give some background, Bioschemas is based on schema.org, an initiative to produce schemas that can be embedded directly in websites to give more structure to data. Search engines can understand this markup far more easily than plain text, and it’s what powers a proportion of Google snippets (those box-outs you see in Google search results when you search for something popular). For example, suppose I wanted to tell search engines more about my jazz event – I would embed something like this in the event’s webpage.
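
A minimal schema.org Event snippet in JSON-LD might look like the following (the venue, date, and price are invented for illustration):

```html
<!-- A minimal schema.org Event snippet; the details are invented for illustration -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "An Evening of Jazz",
  "startDate": "2018-06-30T19:30",
  "location": {
    "@type": "Place",
    "name": "The Corn Exchange",
    "address": "Wheeler St, Cambridge"
  },
  "offers": {
    "@type": "Offer",
    "price": "15.00",
    "priceCurrency": "GBP"
  }
}
</script>
```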

Bioschemas wants to do the same, but for biological information (genes, proteins, samples, etc.). So in InterMine, for the CHEY_BACSU protein report page in SynBioMine, we might have something like this:
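
The exact types and properties were still under discussion at the time (the proposed BiologicalEntity type is mentioned below), so treat this as an illustrative sketch rather than final markup:

```html
<!-- An illustrative Bioschemas-style snippet for a protein report page.
     The type and property names are a sketch, not the finalised profile. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BiologicalEntity",
  "name": "CHEY_BACSU",
  "description": "Chemotaxis protein CheY from Bacillus subtilis",
  "isPartOf": {
    "@type": "Dataset",
    "name": "SynBioMine"
  }
}
</script>
```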

A search engine (or a specialised life sciences search tool) can then crawl and aggregate the structures embedded across a wide range of life sciences websites (particularly those consisting of lots of small pages, such as biological samples in biobanks). The goal is to make it considerably easier for scientists to find information relevant to their research without having to visit lots of sites individually.

The job of Bioschemas is to go through the existing schema.org schemas and decide which existing ones we can use (such as Dataset) and which we need to propose as new schemas (such as BiologicalEntity). schema.org schemas are big bags of attributes with no cardinality constraints, because they need to satisfy a lot of different use cases, so another job of Bioschemas is to recommend which attributes to use and with what cardinality, both for data in general (Dataset, for example) and for specific life sciences entities, such as proteins and biological samples.

GraphConnect

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people from around the world who are enthusiastic about graph databases, including a lot of people who, like us, are prototyping or exploring Neo4j as a possible alternative to relational databases.

They announced the release of Neo4j 3.2, which promises a huge improvement in terms of performance: the compiled Cypher runtime has improved speed by around 300% for a subset of basic queries, and the introduction of native label indexes has also improved write speed.

They have also added composite indexes (which InterMine uses a lot) and support for using indexes with the OR operator. We highlighted this problem on Stack Overflow months ago and were pleasantly surprised to see it fixed – we now get to remove two items from our “What we didn’t like about Neo4j” list. We’re really happy about that!
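
For reference, here's what creating one of those composite indexes looks like in Neo4j 3.2, driven from Python – the label and property names are hypothetical examples of our own:

```python
# Creating a Neo4j 3.2 composite index from Python (pip install neo4j).
# The label and property names are hypothetical examples.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Composite index over two properties of :Gene (Neo4j 3.2 syntax)
    session.run("CREATE INDEX ON :Gene(primaryIdentifier, organismName)")

driver.close()
```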

It was a pleasure to attend Jesus Barrasa’s talk debunking some “alternative facts” about RDF versus property graphs. He demoed how an RDF resource does not necessarily have to live in a triple store, but can also be stored in Neo4j. Here are part 1 and part 2 of “Neo4j is your RDF store”, a nice pair of posts where he describes this work in more detail.

Another nice tool they have implemented is an ETL tool that imports data from a relational database into Neo4j by applying some simple mapping rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real-world problems, ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is Neo4j2Elastic, which transparently pushes data from Neo4j to Elasticsearch.

During the conference, we discovered that there is a Neo4j Startup Program that provides the Neo4j enterprise edition for free. Not sure if we count as a startup though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. We’re looking forward to meeting the Neo4j team at their London meetup and sharing our experience with the community!