The authors have gathered some really fascinating data measuring day-by-day altmetrics of papers at the journal Nature Communications, which at the time was hybrid: some articles were behind a paywall, while others were paid-for open access at a cost of $5200 to the authors/funders. (The cost of open access here is an absolute rip-off. I do not endorse or recommend outrageously priced paid-for open access outlets like Nature Communications. PLOS ONE costs just $1350 remember! PeerJ is just $99 per author!)

The paper is by no means perfect – I’m not saying it is – but the ideas behind it are good. Many on twitter have commented that it’s ironic that this paper on open access advantage is itself only made available behind a paywall at the publisher.

The good news is, Dr Xianwen Wang has responded to this and has made an ‘eprint’ copy (stripped of all publisher branding) freely available at arXiv as of 2015-03-19 (post-publication). The written English throughout the manuscript is not brilliant but I feel this reflects poorly on the journal rather than the authors – it’s remarkable that Scientometrics can charge a subscription fee to subscribers if they offer no copy-editing on accepted manuscripts! Finally, technical detail on precisely how the data was obtained is rather lacking. So that’s the critique out of the way…

But I wanted to dig deeper into the data. So I emailed the corresponding author, Xianwen, for a copy of the data behind figure 2 and he happily and quickly sent it to me. I was fairly shocked (in a good way) that he sent the data. Most of the email requests for data I’ve sent in the past have been ultimately unsuccessful. This is well documented in the field of phylogenetics *sad face*. The ‘email the author’ system simply cannot be relied upon, and is one of many reasons why I feel all non-sensitive data supporting research should be made publicly available, alongside the article, on the day of publication.

Xianwen had filtered these suspicious jumps out of his figures but neglected to mention that in the methods section, so when I informed him of this discrepancy he told me he’s going to contact the editor to sort it out. A great little example of how data sharing results in improved science? The unfiltered data looks a little bit like the plot below:

Anyway, back to the spikes/jumps in activity – they certainly aren’t an error introduced by the authors of the paper – they can also be seen via Altmetric (a service provider of altmetrics). The question is: what is causing these one-day spikes in activity?
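For anyone wanting to hunt for similar jumps in their own daily view counts, here is a minimal Python sketch. It is not the method Xianwen or Altmetric used; it simply flags days that sit far above the median of the series, using the median absolute deviation (MAD) as a robust yardstick. The `flag_spikes` name, the threshold of 5 and the sample numbers are all made up for illustration.

```python
from statistics import median


def flag_spikes(daily_views, threshold=5.0):
    """Return indices of days whose views sit far above the series median.

    The median absolute deviation (MAD) is used as the scale, so a few
    extreme days cannot hide themselves by inflating the spread.
    """
    med = median(daily_views)
    mad = median(abs(v - med) for v in daily_views)
    if mad == 0:
        # Completely flat series: any day above the median counts as a jump
        return [i for i, v in enumerate(daily_views) if v > med]
    return [i for i, v in enumerate(daily_views) if (v - med) / mad > threshold]


# Hypothetical daily view counts for one article (NOT the real data)
views = [12, 15, 11, 14, 13, 480, 12, 16, 10, 13]
print(flag_spikes(views))  # → [5]
```

Real article views tend to decay after publication, so comparing each day against a rolling window rather than the whole series would probably be a fairer test.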

I have alerted the team at Altmetric, and they have alerted/will alert Nature Publishing Group to investigate further.

Most of the spikes are likely to be accidental in cause but it would be good to know more. A downloading script gone awry? But there is still a possibility that within this dataset there is putative evidence for deliberate gaming of altmetrics, specifically: article views. I look forward to hearing more from Altmetric and Nature Publishing Group about this… the ball is very much in their court right now.

Moreover, now that these peculiar spikes have been detected, what, if anything, should be done about them?

Just a quick post to congratulate the Bill & Melinda Gates Foundation for their fabulous new research policy covering both open access & open data.

One of the key things they’ve implemented for 2017 is ZERO TOLERANCE for post-publication embargoes of research. Work MUST be made openly available IMMEDIATELY upon publication to be compliant. No ifs, no buts.

Let’s just remind ourselves why other major research funders like RCUK & Wellcome Trust allow publishers to impose an embargo on academic work before it can be made public:

Do any academics want a post-publication embargo on their work, that stops it being shared, read & re-used by the widest readership possible?

Does it benefit readers, patients, policy-makers or practitioners to have a post-publication embargo delaying their access to the very latest research?

Does it benefit research funders themselves to have a post-publication embargo on work they fund?

The only stakeholders that benefit from research funder policies that allow post-publication embargoes preventing free access to research are the legacy publishers. The fact that RCUK, Wellcome Trust and many others pander to these parasitic publishers and their laughably unfit-for-purpose business model is just WRONG and it makes me angry. JUST SAY “NO” TO POST-PUBLICATION EMBARGOES!

It’s high-time that major research funders wrote policies that ask for what WE ALL ACTUALLY WANT, instead of a bullshit compromise that minimises fiscal harm to the multi-billion dollar legacy publishers.

I admire the Gates Foundation. They understand what we all need and they’ve implemented that in a clear and appropriate policy; optimal for readers, researchers, patients, practitioners and policy-makers. We want immediate open access, and we want it NOW! The ball is now in your court Wellcome Trust, make your move!

Day 0 of OpenCon started with me missing the pre-conference drinks reception because my flight from Chicago was delayed by 2 hours. I got into Washington, D.C. (DCA) at about midnight & then had to wait half an hour for a blue line train to take me the short distance from the airport to the conference hotel – I’m a diehard for public transport! I finally arrived at the hotel past 1 o’clock in the morning. Not a great start. Sincere apologies to my excellent room mate Alfonso Sintjago, to whom I hastily introduced myself the next morning #awkward

Day 1 started with a real bang. Michael Carroll gave a short speech. Then Pat Brown gave a long but HUGELY enjoyable talk about his role in the founding of PLOS. Some excellent take-home messages from the talk:

* Write petitions & letters for change with colleagues. Even if you fail to directly achieve all the goals or immediate aims of the petition, the act of doing it, the publicity & thought-provoking it raises can have real and important positive effects.

* Sometimes you’ve got to do odd things that might be against your ethos, to support your interests in the long term e.g. the traditional review selectivity of PLOS Biology & initially, printing paper copies of PLOS Biology.

* Sometimes you have to fake it to make it (N.B. said in the context of collective action, not scientific research)

Too much valuable data is either not online, or locked-up in PDFs – we can do better! The research community, where possible, should be aiming to provide 3-star open research data. Providing data in non-proprietary forms such as .csv is not hard, and we’ve had the technology and infrastructure to do this for a while now, we just need to do it…!
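To show just how little effort a .csv takes, here is a minimal Python sketch using only the standard library. The specimen measurements, column names and the `rows_to_csv` helper are invented for illustration; swap in your own table.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of dicts (one per table row) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical measurements, standing in for a table trapped in a PDF
rows = [
    {"specimen": "A1", "length_mm": 12.4, "width_mm": 3.1},
    {"specimen": "A2", "length_mm": 11.9, "width_mm": 2.8},
]
print(rows_to_csv(rows))
```

Writing the returned text to a file with a `.csv` extension gives something any spreadsheet, R or Python script can read without proprietary software.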

I’m so glad Victoria Stodden gave the next talk after the panel, I think I was the one on the organising committee who first suggested her for a keynote slot (sorry to brag!). Victoria did not disappoint – her talk was a remarkable display of undeniable deep-thinking & scholarship. Her reminder to us all of Merton’s Scientific Norms (1942) was an excellent grounding in the basis of open research:

Communalism: scientific results are the common property of the community

Universalism: all scientists can contribute to science regardless of race, nationality, culture, or gender

Disinterestedness: act for the benefit of a common scientific enterprise, rather than for personal gain.

Originality: scientific claims contribute something new

Skepticism: scientific claims must be exposed to critical scrutiny before being accepted

This was clearly appreciated by the audience and others e.g. Lorraine have already blogged about it. I also took home from the talk that it’s important to distinguish between the 3 different types of reproducibility: Empirical Reproducibility, Computational Reproducibility, and Statistical Reproducibility, and that the Bayh-Dole Act is an awfully bad motivator for NOT opening-up research in the US (which I pointedly reflected on in a meeting at the NIH on day 3).

REAL TALK: at the end Stodden made a great point, which I hope was listened to: young academics should not be expected to martyr themselves for the cause of open scholarship, and that it should be the more senior academics leading the way – hear, hear!

Don’t martyr yourself for the cause. “Martyrdom of Saint Sebastian”. By Giovanni Bassi, 1525. Public Domain

After lunch there were parallel sessions. Uvania Naidoo led a workshop on Open Access in the Context of Developing Countries. I regret I can’t report on that session because Peter Murray-Rust and I were holding a ContentMine workshop in the alternate room at the same time. The ContentMine session was really good fun, and very interactive – you can see the discussion from the session on the etherpad here. Jure Triglav had some great ideas around mining the literature for software citations, Nic Weber chimed-in that HPC citations/mentions would be great to do too. April Clyburne-Sherin was interested in clinical trials data mining etc… I could go on. The trick now is for us to explore these ideas and see if we can make them happen after the conference. The epidemiology/ebola content mining looks like it’s definitely going to happen; many people were interested in forming a collaboration around that.

Innovative Publishing Models

I’m not going to report every session in full detail. This is one where I’m probably skimping. Meredith Niles (Harvard postdoc) moderated talks and discussion by a panel consisting of Arianna Becerril (Redalyc), Pete Binfield (PeerJ), Mark Patterson (eLife) and Martin Paul Eve (Open Library of Humanities).

Meredith Niles and myself, in my new favourite t-shirt at the evening reception, Day 1. Twitter / M. Niles. All rights reserved, copyright not mine.

Huge congratulations to the organising committee for bringing this particular panel together. These are without doubt in my mind, representatives of four of the most important, innovative organizations in academic publishing right now. They all gave excellent talks but particular kudos should go to Martin Paul Eve for delivering a swish Prezi and more importantly, a passionate, invigorating talk on the possible future of OA in the humanities.

The impact of open

The line-up alone for the next session was stellar. The conference had its first glimpse of Erin McKiernan on stage, moderating a panel consisting of Jack Andraka, Peter Murray-Rust, and Daniel DeMarte. Forgive me for a lack of detail here, it was near the end of a long day. Jack gave his usual polished speech, with humour and grace, as well as ably fielding a couple of tough but fair questions about his patent pending. As ever, a lot of people wanted to take pictures with him and he was gracious to allow everyone who wanted a photo with him.

Four people proudly pushing boundaries. Photo: mine! Licence: All rights reserved. CC-BY of course!

The day ended with a closing keynote from John Wilbanks which was really the perfect cherry-on-top of the icing of a brilliant first day. It’s only been a few days but his talk slides, ‘Open as a Platform‘ have racked-up nearly 1000 views and I’m not surprised. I’d better not blather on too much, but put it this way: Wilbanks is a hero to me. I love some of the things he’s said before and I’ve really taken them to heart in my work e.g. “The best time to plant a tree is 20 years ago. The 2nd best time is NOW” from ‘Data sharing as a means to a revolution‘. It was simply great to be able to chat to both Michael Carroll and John Wilbanks at the evening reception.

Miscellaneous Day 2 Highlights (If I don’t abbreviate this blog post soon, it’ll be book length)

Audrey Watters’ keynote talk ‘From “Open” to Justice‘ had a clear closing message: open is necessary but it’s not enough, we need meaningful political engagement, care and justice. The word ‘open’ alone does not solve all our problems (I may have paraphrased!).

Erin McKiernan‘s keynote was an inspiration to us all. ‘Being Open as an Early Career Researcher‘ was a masterclass in DOING IT THE RIGHT WAY, with an abundance of supporting evidence. I haven’t had the privilege of seeing her speak before, and had heard lots about how good a speaker she is – I wasn’t disappointed. I completely stand with Erin when she says:

If I am going to ‘make it’ in science, it has to be on terms I can live with.

I was particularly taken by Ahmed Ogunlaja‘s clever response, to the question of how he approaches OA advocacy in Nigeria:

Open Access wins all of the arguments all of the time

That in itself got a round of applause. It’s no exaggeration to say there were a lot of earnest rounds of applause that day; no polite applause.

Another such spontaneous round of applause came when Penny Andrews took the microphone to raise a really important point/question about diversity and social mobility in research in a calm, professional, clear tone. The audience, myself included, were simply floored by how erudite it was. Stunning. This is but a small sample of what Penny brought to OpenCon:

Late into the night at the ‘unconference’ session perhaps circa 11pm, Jure Triglav found out that his ScienceGist summaries are being used (in a good way!) by a researcher as sample data to test against a machine-based paper summary approach — I hope Jure blogs more about that, it seemed pretty cool to me. I’m also hoping ScienceGist might be used on PeerJ. Watch this space…

Mitar gave an excellent talk. PeerLibrary has come-on a lot since I last looked at it, and he seemed to be literally overflowing with brilliant ideas, awaiting implementation. He told me he had been considering applying for a Sloan Foundation grant to support his excellent work, but hadn’t yet applied, so without his knowledge/consent I decided to send a cheeky tweet to encourage him! If Sloan won’t fund his project(s), I’m sure Shuttleworth will!

Carolina Botero’s talk was an important closer for day 2. So so important. Sharing Research Is Not A Crime!

I’ve written a long post and most of it is glowingly, sickeningly positive. What didn’t go well?

Well… this is all my fault but I do feel the ‘How to be an open researcher’ session run by Erin & myself could have been smoother. We had technical difficulties setting-up the computer. BOTH our laptops only have HDMI connectors, no VGA, so we had to borrow Georgina‘s Mac & neither Erin nor I are particularly great Mac users (4-finger swiping between the browser and the presentation slides was challenging!). On Linux this is very easy to do: just Alt-Tab & cycle through to the window you want. I must also apologise to Erin for launching into a mini-rant about figshare without forewarning her – I have concerns about putting too much open data on a commercial platform that there simply isn’t enough space in this blog post to get into. Another time! But in principle I think double-teaming a lively workshop like this works really well, especially if we have slightly different viewpoints on some tool or strategy.

Day 3: On The Hill

Well, I learnt a little about Minnesota whilst sitting in Amy J. Klobuchar‘s office. In our short time with a legislative assistant of hers, we pitched hard for Open Access & Open Educational Resources.

I highlighted that US taxpayer-funded academics give their work for free to commercial publishers, other academics peer-review this content, for free, the publisher barely does anything aside from typesetting & putting the content online, and hence most of the big publishers are consistently making 30-40% profit margins on taxpayer-funded research. [Standard knowledge basically] I was also quick to allay any concern that it would harm US businesses – I pointed out that most of the large publishers were European – Elsevier (Dutch), Springer (German), Nature Publishing Group (UK). It was a little disappointing to have only 30 minutes but that apparently was a good innings as these things go.

Whilst I honestly have no idea what will come of the Minnesota Senator meeting, the meeting at NIH was seriously productive.

NIH was simply fabulous for all involved, including NIH if you ask me! Many of the younger early career researchers got to see the detailed & complicated concerns of the (relatively) more senior attendees e.g. Prateek Malwahar, Daniel Mietchen, Lauren Maggio, Karin Shorthouse and myself. I was worried that perhaps we might have ‘dominated’ the discussion a bit too much, but after discussing it with Shannon Evans it seems many actually really enjoyed seeing research-savvy people really dig into difficult policy issues. Natalia Norori‘s question near the end was also brilliantly appropriate, and the response rather chilling (although I should be clear, I’m not trying to shoot the messenger here!): the USA has some deep political problems if disclosing the number of people using PubMed from outside the US is a ‘bad’ thing (those who were there will know exactly what I’m talking about!). I’m also hugely excited by the prospect of the OA_Button *potentially* getting a linkout button on PubMed – Kent Anderson’ll love that, eh?

Daniel Mietchen & I gave some valuable feedback on the packaging of the PubMed OA subset – the contention was that it wasn’t seeing much visible use, and yet Daniel & I both feel this is wrong — there are many users out there — it’s just hard to publish mining research because it’s often new/interdisciplinary and how does one ‘cite’ PubMed corpus usage anyhow? — it’s clearly going to be difficult to track users.

I was hugely flattered when Neil Thakur said he’s read my blog before! wow! Hope you like this post Neil.

Swapping shirts & the super-friendly culture at OpenCon

I gave out my 2 spare ‘Boycott Elsevier’ t-shirts at OpenCon this year, and I think I’ll make shirt-swapping a regular thing if I can! First, it was my immense pleasure to swap shirts with Daniel Mutonga at the organizing committee dinner. To his credit, Daniel was the one who suggested it: ‘like football players after a game’, so I put on his MSAKE tee & he put on my ‘Boycott Elsevier’ tee. Fantastic. I think I should swap t-shirts with someone at every conference. Shannon (?) told me an interesting variation on this one, which also sounds like a good idea to implement: swapping pin badges.

I gave the other spare ‘Boycott Elsevier’ t-shirt to Erin McKiernan. We joked it would be hilarious to wear at SfN. Although, slightly concerned about how it would be received, I did make clear that I didn’t mind if she chose not to wear it at SfN. She’s since tweeted me a picture wearing it in front of the Elsevier stand – exactly what I’d do! Every penny spent on those t-shirts has been totally worth it – such a good medium for non-violent, high profile activism!

The ‘backchannel’ discussion on twitter between OpenCon attendees & remote followers of the conference was also brilliant. Lots of lively, informative, intelligent threads of discussion sparked by lots of the talks, simply excellent.

It was also great to see Celya Gruson-Daniel again – she’s a real unsung hero of open science – if you aren’t aware of her project HackYourPhD go check it out NOW. Community building is immensely important and she’s clearly very good at it. It’s immensely & deservedly popular in the Francophone world. (I wonder if there are similar wildly successful Spanish-language open science communities? Please point them in my direction if you know of one!)

I must also thank Kurtis Baude for interviewing me about open research data in one of the breaks – his enthusiasm for spreading open science is infectious – we had a great chat together.

People making change for the better

Being at OpenCon, more than at any other meeting, I was truly amongst friends. I was going to list everyone here in thanks but a list of 175 names isn’t much fun to read & I wouldn’t want to miss anyone out! Sorry to anyone I didn’t mention by name!

I have to admit, I went to OpenCon feeling a little bit low. My cranial / postcranial data comparison manuscript from my PhD had been recently rejected (again). Not on the basis that it was bad science, just that it wasn’t quite interesting enough for readers of the particular journal we (re)submitted it to. I gather this happens a lot with traditional impact-factor chasing publication strategies, and it can ruin/alter career paths before they even get started. To have spent 4 years doing a PhD & 3 years of that on/off trying(ish) to publish this particular chapter and STILL have nothing, not even a preprint to publicly show for it (don’t even ask why I can’t put up a preprint. I think preprints are a great idea myself…). I was a tad depressed – let’s not pretend this doesn’t happen to us all, folks. Real Talk.

Luckily, OpenCon has completely changed my mood for the better and reminded me of all the important things I did do during my PhD:

* I published *shrugs* in academic journals. I’m not even going to link to what I did manage to publish. I have a h-index, yada yada… I think all of the below were more important contributions, with more real-world impact to be honest:

* I gave a pretty darn good talk about content mining at the European Commission ‘Licences for Europe, Working Group 4: Text & Data Mining’ event. Which helped stave-off the unwanted imposition of ‘licensed’ content mining in Europe.

I write the above list, self-indulgently, to convince myself I’m not stupid. I can do clever stuff. I’m pretty sharp when it comes to research policy, and I have ideas and enthusiasm to help make research more open (== better). I think I’ve proved that now, time and time again.

Next week I’m meeting up with my supervisor and we’re going to work on revising & resubmitting that manuscript again. And thanks to OpenCon 2014 I’m actually in the mood to do that. Thanks Generation Open. You’re awesome.

Jeremy Miller from Naturalis is also very interested in OA Zootaxa content from the point of view of spiders. He gave a talk on Data Visualization on behalf of his team from the Leiden hackday. Luckily, with no prior ‘special’ mark-up, by searching ‘Araneae‘ I could show Jeremy the promise of what I’m doing on Flickr. Many phylogenies containing spider taxa came up in the search, many of which he immediately recognized as from his own open access publications! With a little bit of work to further mark-up the attributes he’s interested in, I might be able to provide something of real use – the ability to search figure images/captions across hundreds of open access journals, from many different publishers with just ONE search!

The Bouchout Declaration will be launched today at this meeting. I’m happy to say I facilitated the signing of this declaration by Open Knowledge. Many other organisations have signed this declaration and I hope it makes a splash – we need science to be open to do good science!

Finally, I’ve also potentially got a new research collaboration going (more of which later!).
It’s been well worth the trip!

“From Nigeria to Norway, the next generation is beginning to take ownership of the system of scholarly communication which they will inherit,” said Nick Shockey, founding Director of the Right to Research Coalition. “OpenCon 2014 will support and accelerate this rapidly growing movement of students and early career researchers advocating for openness in research literature, education, and data.”

The first event of its kind, OpenCon 2014 builds on the success of the Berlin 11 Satellite Conference for Students and Early Stage Researchers, which brought together more than 70 participants from 35 countries to engage on Open Access to scientific and scholarly research. The interest, energy, and passion from the student and researcher participants and the Open Access movement leaders who attended made a clear case for expanding the event in size and duration, and to broaden the scope to related areas of the Openness movement.

Last year, I was also part of the organizing committee for the event that this has grown from – the Berlin 11 Satellite conference:

The Berlin 11 Satellite Conference was really exciting but only a 1-day event before the ‘main’ Berlin 11 event – an assemblage of students and ECRs from literally all over the world (attending with generous full funding support), including representatives from (in no particular order) China, India, Saudi Arabia, Georgia, Tanzania, Tasmania(!), Kenya, Nigeria, Ghana, Uganda, Colombia, FYR Macedonia, Mexico, Brazil, Sweden, Holland, Denmark, Poland, Portugal, Canada, the US, the UK… So don’t worry about where you are in the world – as long as you’re a student or ECR you’ll be eligible to apply for OpenCon 2014 (places are limited though!).

As a reminder, at the event last year we had Jack Andraka and Mike Taylor amongst the guest speakers. It was such a comprehensive success that it’s been expanded into a full 3-day event this year, expanding scope too, to include Open Data and OER, not just OA (they’re all obviously inter-related problems; better to tackle the integrated set of problems rather than aspects in isolation!).

Applications for OpenCon 2014 will open in August. For more information about the conference and to sign up for updates, visit www.opencon.net

I promise you this – it’s going to be BIG and I’m stoked to be part of an international organizing committee helping to make this happen.

OpenCon 2014 is also looking for additional sponsorship, particularly for Travel Scholarships to ensure global representation at this meeting, so if you have a marketing budget to spend, or are feeling generous please do have a look at the sponsorship opportunities.

This particular data type is a cornerstone of modern evolutionary biology. You’ll find phylogenetic analyses across a whole host of journal subjects – medical, ecological, natural history, palaeontology… There are also many different ways in which this data can be re-used e.g. supertrees & comparative cladistics. Not to mention, simple validation studies &/or analyses which extend-upon or map new data on to a phylogeny. It’s really useful data and we should be archiving it for future re-use and re-analysis. To my great delight, this is what I’m being paid to attempt to do for my first postdoc; on a grant I co-wrote – finding & liberating phylogenetic data for everyone!

Why PLOS ONE?

It’s a BOAI-compliant open access journal that publishes most articles under CC BY, with a few under CC0.

This means I can openly re-publish figures online (provided sufficient attribution is given) — no need to worry about DMCA takedown notices or ‘getting sued’! This makes the process of research much easier. Private, non-public, access-restricted repositories for collaboration are a hassle I’d rather do without.

It’s a high-volume ‘megajournal’ publishing ~200 articles per day, many of which include phylogenetic analyses.

Thus it’s worthwhile establishing a regular daily or weekly method for parsing-out phylogenetic tree figures from this journal.

Killer feature: as far as I know, PLOS are the only publisher to embed rich metadata inside their figure image files.

This makes satisfying the CC BY licence trivially easy — sufficient attribution metadata is already embedded in the file. Just ensure that wherever you’re uploading the file to doesn’t wipe this embedded data, hence why I chose Flickr as my initial upload platform.
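One quick way to check whether an upload platform has wiped embedded metadata is to look for it in the raw bytes of the file before and after upload. The sketch below assumes the metadata is stored as an uncompressed XMP packet; that is an assumption, as the exact format PLOS uses is not stated here. `extract_xmp` is a hypothetical helper name, not part of any library.

```python
def extract_xmp(image_bytes):
    """Return the first embedded XMP packet as text, or None if none is found.

    ASSUMPTION: the metadata is an uncompressed XMP packet, which appears
    as plain XML between <x:xmpmeta ...> and </x:xmpmeta> in the file bytes.
    """
    start = image_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_tag = b"</x:xmpmeta>"
    end = image_bytes.find(end_tag, start)
    if end == -1:
        return None
    return image_bytes[start:end + len(end_tag)].decode("utf-8", errors="replace")


# Synthetic stand-in for an image file (NOT a real PLOS figure)
fake_image = (
    b"\x89PNG fake pixel data "
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><rights>CC BY</rights></x:xmpmeta>'
    b" more pixel data"
)
print(extract_xmp(fake_image))
print(extract_xmp(b"an image with its metadata stripped"))
```

Running this on the original file and on the copy downloaded back from the platform shows whether the attribution survived the round trip.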

What does this enable or make easier?

On its own, this collection doesn’t do much – this is still an early stage – but it gives us an important insight into the prevalence of certain types of visual display-style that researchers are using:

In my initial roadmap, the plan is to do PLOS ONE, the other PLOS journals, then BMC journals, then possibly Zootaxa & Phytotaxa (Magnolia Press). There will be a Github-based website for the project soon, lots still to do…!

Want to know more / collaborate / critique ?

Conferences:

I’ve got an accepted lightning talk at iEvoBio in Raleigh, NC later this year about the PLUTo project.

Last Saturday I went to Hack4Ac – a hackday in London bringing together many sections of the academic community in pursuit of two goals:

To demonstrate the value of the CC-BY licence within academia. We are interested in supporting innovations around and on top of the literature.

To reach out to academics who are keen to learn or improve their programming skills to better their research. We’re especially interested in academics who have never coded before

The list of attendees was stellar, cross-disciplinary (inc. Humanities) and international. The venue (Skills Matter) & organisation were also suitably first-class – lots of power leads, spare computer mice, projectors, whiteboards, good wi-fi, separate workspaces for the different self-assembled hack teams, tea, coffee & snacks all throughout the day to keep us going, prizes & promo swag for all participants…

The principal organizers; Jason Hoyt (PeerJ, formerly at Mendeley) & Ian Mulvany (Head of Tech at eLife) thus deserve a BIG thank you for making all this happen. I hear this may also be turned into a fairly regular set of meetups too, which will be great for keeping up the momentum of innovation going on right now in academic publishing.

The hack projects themselves…

The overall winner of the day was ScienceGist as voted for by the attendees. All the projects were great in their own way considering we only had from ~10am to 5pm to get them in a presentable state.

This project was initiated by Jure Triglav, building upon his previous experience with Tiris. This new project aims to provide an open platform for post-publication summaries (‘gists’) of research papers, providing shorter, more easily understandable summaries of the content of each paper.

I also led a project under the catchy title of Figures → Data, whereby we tried to provide added-value by taking CC-BY bar charts and histograms from the literature and attempting to automatically re-extract the numerical data from those plots using computer vision techniques. On my team for the day I had Peter Murray-Rust, Vincent Adam (of HackYourPhD) and Thomas Branch (Imperial College). This was handy because I know next to nothing about computer vision – I’m Your Typical Biologist ™ in that I know how to script in R, perl, bash and various other things, just enough to get by but not nearly enough to attempt something ambitious like this on my own!
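To give a flavour of the idea, here is a toy Python sketch. It is not the code our team wrote on the day. It assumes the hard parts are already done: the plot area has been cropped exactly to the axes, the image has been reduced to 1s (bar) and 0s (background), bars stand on the bottom edge and are separated by gaps. `bar_values` and the tiny 4×6 “image” are made up for illustration.

```python
def bar_values(pixels, axis_min, axis_max):
    """Estimate one data value per bar in a binary bar-chart image.

    pixels: list of rows (top row first); 1 = bar pixel, 0 = background.
    axis_min, axis_max: data values at the bottom and top edge of the image.
    """
    height = len(pixels)
    width = len(pixels[0])
    # Number of filled pixels in each column = bar height in pixels
    column_heights = [sum(pixels[r][c] for r in range(height)) for c in range(width)]
    values = []
    previous = 0
    for h in column_heights:
        # A new bar starts wherever a filled column follows an empty one
        if h > 0 and previous == 0:
            values.append(axis_min + (axis_max - axis_min) * h / height)
        previous = h
    return values


# Toy 4x6 "image": a short bar (2 px tall) and a tall bar (4 px tall)
chart = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
]
print(bar_values(chart, axis_min=0, axis_max=10))  # → [5.0, 10.0]
```

Real figures are much messier (axis ticks, error bars, touching bars, anti-aliasing), which is exactly why the classification and edge-detection steps matter so much.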

Forgive me the self-indulgence if I talk about this Figures → Data project more than I do the others but I thought it would be illuminative to discuss the whole process in detail…

In order to share links between our computers in real-time, and to share initial ideas and approaches, Vincent set-up an etherpad here to record our notes. You can see the development of our collaborative note-taking using the timeslider function below (I did a screen record of it for posterity using recordmydesktop):

In this etherpad we document that there are a variety of ways in which to discover bar charts & histograms:

figuresearch is one such web-app that searches the PMC OA subset for figure captions & figure images. With this you can find over 7,000 figure captions containing the word ‘histogram’ (you would assume that the corresponding figure would contain at least one histogram for 99% of those figures, although there are exceptions).

figshare has nearly 10,000 hits for histogram figures, whilst BMC & PLOS can also be commended for providing the ability to search their literature stack by just figure captions, making the task of figure discovery far more efficient and targeted.

Jason Hoyt was in the room with us for quite a bit of the hack and clearly noted the search features we were looking for – just yesterday he tweeted: “PeerJ now supports figure search & all images free to use CC-BY (inspired by @rmounce at #hack4ac)” [link] – I’m really glad to see our hack goals helped Jason to improve content search for PeerJ to better serve the needs (albeit somewhat niche in this case) of real researchers. It’s this kind of unique confluence of typesetters, publishers, researchers, policymakers and hackers at doing-events like this that can generate real change in academic publishing.

The downside of our project was that we discovered someone had done much of this before. ReVision: Automated Classification, Analysis and Redesign of Chart Images [PDF] was an award-winning paper at an ACM conference in 2011. Much of that work would have helped our idea, particularly the figure-classification tech. Yet sadly, as with so much of ‘closed’ science, we couldn’t find any open source code associated with it. There were comments that this kind of behaviour, not sharing code and so blocking re-use and progress, is fairly typical in computer science & ACM conferences (I wouldn’t know but it was muttered…). If anyone does know of related open source code available for this project do let me know!

So… we had to start from a fairly low level ourselves: Vincent & Thomas tried MATLAB & C based approaches with OpenCV and their code is all up on our project github. Peter tried using the AMI2 toolset, particularly the Canny algorithm, whilst I built up an annotated corpus of 40 CC-BY bar charts & histograms for testing purposes. Results of all three approaches can be seen below in their attempts to simplify this hilarious figure about dolphin cognition from a PLOS paper:

We might not have won 1st prize but I think our efforts are pretty cool, and we got some laughs from our slides presenting our day’s work at the end (e.g. see below). Importantly, *everything* we did that day is openly available on github to re-use, re-work and improve upon (I’ll ping Thomas & Vincent soon to make sure their code contributions are openly licensed). Proper full-stack open science basically!
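To give a flavour of the final step of Figures → Data, here is a minimal sketch of turning bar pixels back into numbers. It is not the code from our github (that used OpenCV, MATLAB and AMI2); it is a toy that assumes the hard parts are already done: the chart has been thresholded to a grid of 0s and 1s, the axes and labels have been stripped, and every bar sits on the bottom row. The function names and the axis calibration arguments are made up for illustration.

```python
def bar_heights(image, ink=1):
    """Return the pixel height of each bar in a binarised chart.

    `image` is a list of rows (top row first); `ink` marks a bar pixel.
    Assumes axes and text have already been removed (a big assumption).
    """
    if not image:
        return []
    n_cols = len(image[0])
    # Count the ink pixels in every pixel column.
    column_counts = [sum(1 for row in image if row[c] == ink)
                     for c in range(n_cols)]
    heights, current = [], []
    for count in column_counts:
        if count > 0:
            # Still inside a bar: remember this column's height.
            current.append(count)
        elif current:
            # An empty column closes the bar we were tracking.
            heights.append(max(current))
            current = []
    if current:
        heights.append(max(current))
    return heights


def pixels_to_values(heights, axis_pixels, axis_value):
    """Rescale pixel heights using one known point on the y-axis.

    `axis_pixels` is the pixel height that corresponds to `axis_value`.
    """
    return [h * axis_value / axis_pixels for h in heights]


# A 4-pixel-tall toy chart with two bars, each two pixels wide.
chart = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
]
heights = bar_heights(chart)
values = pixels_to_values(heights, axis_pixels=4, axis_value=10.0)
```

Real figures are much messier than this (anti-aliasing, error bars, touching bars, 3D effects), which is exactly why we needed edge detection and a test corpus on the day.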

Other hack projects…

As I thought would happen, I’ve waffled on about our project. If you’d like to know more about the other projects, hopefully someone else will blog about them at greater length (sorry!). I’ve got my thesis to write y’know!

You can find more about them all either on the Twitter channel #hack4ac or alternatively on the hack4ac github page. I’ll write a little bit more below, but it’ll be concise, I warn you!

Dan Stowell & co used NLP techniques on full-text accessible CC-BY research papers, to classify all of them in an automated way determining whether they were qualitative or quantitative papers (or a mixture of the two). The last tweeted report of it sounded rather promising: “Upstairs at #hack4ac we’re hacking a system to classify research papers as qual or quant. first results: 96 right of 97. #woo#NLPwhysure” More generally, I believe their idea was to enable a “search by methods” capability, which I think would be highly sought-after if they could do it. Best of luck!
Apologies if I missed any projects. Feel free to post epic-long comments about them below.

This post was originally published over at the LSE Impact blog, where I was kindly invited to write on this theme by the Managing Editor. It’s a widely read platform and I hope it inspires some academics to upload more of their work for everyone to read and use.

Recently I tried to explain on twitter in a few tweets how everyone can take easy steps towards open scholarship with their own work. It’s really not that hard and potentially very beneficial for your own career progress – open practices enable people to read & re-use your work, rather than let it gather dust unread and undiscovered in a limited access venue as is traditional. For clarity I’ve rewritten the ethos of those tweets below:

Step 1: before submitting to a journal or peer-review service, upload your manuscript to a public preprint server

Step 2: after your research is accepted for publication, deposit all the outputs – full-text, data & code in subject or institutional repositories

The above is the concise form of it, but as with everything in life the devil is in the detail, and there is much to explain, so I will elaborate upon these steps in this post.

Within biology it’s relatively unheard of to upload a preprint before submission but that’s likely to change this year because of an excellent well-put article advocating their use in biology and the very many different outlets available for them. My own experience of this has been illuminating – I recently co-authored a paper openly on github and the preprint was made available with a citable DOI via figshare. We’ve received a nice comment, more than 250 views and a citation from another preprint. All before our paper has been ‘published’ in the traditional sense. I hope this illustrates well how open practices really do accelerate progress.

Outside of the natural sciences the situation is also similar; Martin Fenner notes that in the social sciences (SSRN) and economics (RePEc) preprints are also common either in this guise, or as ‘working papers’ – the name may be different but the pre-submission accessibility is the same. Yet I suspect, like in biology, this practice isn’t yet mainstream in the Arts & Humanities – perhaps just a matter of time before this cultural shift occurs (more on this later on in the post…)?

There is one important caveat to mention with respect to posting preprints – a small minority of conservative, traditional journals will not accept articles that have been posted online prior to submission. You might well want to check SHERPA/RoMEO before you upload your preprint to ensure that your preferred destination journal accepts preprint submissions. There is a growing number of grass-roots campaigns to convince these journals that preprint submissions should be allowed, and some of them have already succeeded.

If even much-loathed publishers like Elsevier allow preprints, unconditionally, I think it goes to show how rather uncontroversial preprints are. Prior to submission it’s your work and you can put it anywhere you wish.

Step 2: Postprints

Unlike with preprints, the postprint situation is a little trickier. Publishers like to think that they have the exclusive right to publish your peer-reviewed work. What you can actually do will vary from journal to journal, depending on the exact terms of the copyright or licensing agreement you might have signed. Some publishers try to enforce ‘embargoes’ upon postprints, to maintain the artificial scarcity of your work and their monopoly of control over access to it. But rest assured, at some point, often just 12 months after publication, you’ll be ‘allowed’ to upload copies of your work to the public internet (again SHERPA/RoMEO gives excellent information with respect to this).

So, assuming you already have some form of research output(s) to show for your work, you’ll want these to be discoverable, readable and re-usable by others – after all, what’s the point of doing research if no-one knows about it? If you’ve invested a significant amount of time writing a publication, gathering data, or developing software, you want people to be able to read and use this output. All outputs are important, not just publications. If you’ve published a paper in a traditional subscription access journal, then most of the world can’t read it. But you can make a postprint of that work available, subject to the legal nonsense referred to above.

If it’s allowed, why don’t more people do it?

Similar to the cultural issues discussed with preprints, for some reason, researchers on the whole don’t tend to use institutional repositories (IR) to make their work more widely available. My IR at the University of Bath lists metadata for over 3300 published papers, yet relatively few of those metadata records have a fulltext copy of the item deposited with them for various reasons. Just ~6.9% of records have fulltext deposits, as published back in June 2011.

I think it’s because institutional repositories have an image problem: some are functional but extremely drab. I also hear of researchers full of disdain who say of their IRs (I paraphrase):

“Oh, that thing? Isn’t that just for theses & dissertations – you wouldn’t put proper research there”

All this is set to change though, as researchers are increasingly being mandated to deposit their fulltext outputs in IRs. One particularly noteworthy driver of change in this realm could be the newly-launched Zenodo service. Unlike Academia.edu or ResearchGate, which are for-profit operations and are really just websites in many respects, Zenodo is a proper repository – it supports harvesting of content via the OAI-PMH protocol, all metadata about the content is CC0, and it’s a not-for-profit operation. Crucially, it provides a repository for academics less well-served by the existing repository systems – not all research institutions have a repository, and independent or retired scholars also need a discoverable place to put their postprints. I think the attractive, modern look, and altmetrics to demonstrate impact, will also add that missing ‘sex appeal’ to provide the extra incentive to upload.
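For the curious, this is roughly what ‘harvesting via OAI-PMH’ means in practice: a harvester sends a plain HTTP request with a verb such as ListRecords and gets XML metadata back. The sketch below builds such a request and pulls titles out of a response. The base URL is a placeholder I’ve made up (check your repository’s documentation for its real endpoint), the helper names are hypothetical, and the sample response is hand-written for illustration rather than real repository output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Dublin Core namespace used by the standard oai_dc metadata format.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Extract every dc:title from an OAI-PMH XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC_NS + "title")]


# Placeholder endpoint: an assumption, not a real repository address.
url = list_records_url("https://repository.example.org/oai")

# Hand-written, heavily trimmed sample response (illustration only).
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example postprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
titles = titles_from_response(sample)
```

A real harvester would also need to fetch the URL and follow the resumption tokens that repositories use to page through long result lists, but the point stands: anyone can collect this metadata with a few lines of script, which is not something you can say of a for-profit profile website.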

Even though I use Academia.edu & ResearchGate myself, they’re not perfect solutions. If someone is looking for your papers, or a particular paper that you wrote, these websites do well at making your output discoverable from a simple Google search. But interestingly, for more complex queries, these simple websites don’t provide good discoverability.

The technology for searching across repositories for freely accessible postprints isn’t as good as I’d want it to be. But repository search engines like BASE, CORE and Repository Search are improving day by day. Hopefully, one day we’ll have a working system where you can paste-in a DOI and it’ll take you to a freely available postprint copy of the work; Jez Cope has an excellent demo of this here.

Open scholarship is now open to all

So, if there aren’t any suitable fee-free journals in your subject area (1), you find you don’t have funds to publish a gold open access article (2), and you aren’t eligible for an OA fee waiver (3), fear not. With a combination of preprint & postprint postings, you too can make your research freely available online, even if it has the misfortune to be published in a traditional subscription access journal. Upload your work today!