Tag: dataviz

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.

## for text mining
library(tm)
## for building a corpus
library(SnowballC)
## for making wordclouds
library(wordcloud)
## read in your tweets
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text))
## remove all the punctation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words. Means that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)

For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings, this is probably a function of the length of time we’ve been using twitter. The cloud was generated using 10.9K tweets over seven years, it might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.

—

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

I use a Garmin 800 GPS device to log my cycling activity. including my commutes. Since I have now built up nearly 4 years of cycling the same route, I had a good dataset to look at how accurate the device is.

I wrote some code to import all of the rides tagged with commute in rubiTrack 4 Pro (technical details are below). These tracks needed categorising so that they could be compared. Then I plotted them out as a gizmo in Igor Pro and compared them to a reference data set which I obtained via GPS Visualiser.

The reference dataset is black. Showing the “true” elevation at those particular latitude and longitude coordinates. Plotted on to that are the commute tracks coloured red-white-blue according to longitude. You can see that there are a range of elevations recorded by the device, apart from a few outliers they are mostly accurate but offset. This is strange because I have the elevation of the start and end points saved in the device and I thought it changed the altitude it was measuring to these elevation positions when recording the track, obviously not.

To look at the error in the device I plotted out the difference in the measured altitude at a given location versus the true elevation. For each route (to and from work) a histogram of elevation differences is shown to the right. The average difference is 8 m for the commute in and 4 m for the commute back. This is quite a lot considering that all of this is only ~100 m above sea level. The standard deviation is 43 m for the commute in and 26 m for the way back.

This post at VeloViewer comparing GPS data on Strava from pro-cyclists riding the St15 of 2015 Giro d’Italia sprang to mind. Some GPS devices performed OK, whereas others (including Garmin) did less well. The idea in that post is that rain affects the recording of some units. This could be true and although I live in a rainy country, I doubt it can account for the inaccuracies recorded here. Bear in mind that that stage was over some big changes in altitude and my recordings, very little. On the other hand, there are very few tracks in that post whereas there is lots of data here.

It’s interesting that the data is worse going in to work than coming back. I do set off quite early in the morning and it is colder etc first thing which might mean the unit doesn’t behave as well for the commute to work. Both to and from work tracks vary most in lat/lon recordings at the start of the track which suggests that the unit is slow to get an exact location – something every Garmin user can attest to. Although I always wait until it has a fix before setting off. The final two plots show what the beginning of the return from work looks like for location accuracy (travelling east to west) compared to a midway section of the same commute (right). This might mean the the inaccuracy at the start determines how inaccurate the track is. As I mentioned, the elevation is set for start and end points. Perhaps if the lat/lon is too far from the endpoint it fails to collect the correct elevation.

Conclusion

I’m disappointed with the accuracy of the device. However, I have no idea whether other GPS units (including phones) would outperform the Garmin Edge 800 or even if later Garmin models are better. This is a good but limited dataset. A similar analysis would be possible on a huge dataset (e.g. all strava data) which would reveal the best and worst GPS devices and/or the best conditions for recording the most accurate data.

Technical details

I described how to get GPX tracks from rubiTrack 4 Pro into Igor and how to crunch them in a previous post. I modified the code to get elevation data out from the cycling tracks and generally made the code slightly more robust. This left me with 1,200 tracks. My commutes are varied. I frequently go from A to C via B and from C to A via D which is a loop (this is what is shown here). But I also go A to C via D, C to A via B and then I also often extend the commute to include 30 km of Warwickshire countryside. The tracks could be categorized by testing whether they began at A or C (this rejected some partial routes) and then testing whether they passed through B or D. These could then be plotted and checked visually for any routes which went off course, there were none. The key here is to pick the right B and D points. To calculate the differences in elevation, the simplest thing was to get GPS Visualiser to tell me what the elevation should be for all the points I had. I was surprised that the API could do half a million points without complaining. This was sufficient to do the rest. Note that the comparisons needed to be done as lat/lon versus elevation because due to differences in speed, time or trackpoint number lead to inherent differences in lat/lon (and elevation). Note also due to the small scale I didn’t bother converting lat/lon into flat earth kilometres.

—

The post title comes from “Elevation” by Television, which can be found on the classic “Marquee Moon” LP.

Towards the end of 2015, I started distance running. I thought it’d be fun to look at the frequency of my runs over the course of 2016.

Most of my runs were recorded with a GPS watch. I log my cycling data using Rubitrack, so I just added my running data to this. This software is great but to do any serious number crunching, other software is needed. Yes, I know that if I used strava I can do lots of things with my data… but I don’t. I also know that there are tools for R to do this, but I wrote something in Igor instead. The GitHub repo is here. There’s a technical description below, as well as some random thoughts on running (and cycling).

The animation shows the tracks I recorded as 2016 rolled by. The routes won’t mean much to you, but I can recognise most of them. You can see how I built up the distance to run a marathon and then how the runs became less frequent through late summer to October. I logged 975 km with probably another 50 km or so not logged.

Technical description

To pull the data out of rubiTrack 4 Pro is actually quite difficult since there is no automated export. An applescript did the job of going through all the run activities and exporting them as gpx. There is an API provided by Garmin to take the data straight from the FIT files recorded by the watch, but everything is saved and tagged in rubiTrack, so gpx is a good starting point. GPX is an xml format which can be read into Igor using XMLutils XOP written by andyfaff. Previously, I’ve used nokogiri for reading XML, but this XOP keeps everything within Igor. This worked OK, but I had some trouble with namespaces which I didn’t resolve properly and what is in the code is a slight hack. I wrote some code which imported all the files and then processed the time frame I wanted to look at. It basically looks at a.m. and p.m. for each day in the timeframe. Igor deals with date/time nicely and so this was quite easy. Two lookups per day were needed because I often went for two runs per day (run commuting). I set the lat/lon at the start of each track as 0,0. I used the new alpha tools in IP7 to fade the tracks so that they decay away over time. They disappear with 1/8 reduction in opacity over a four day period. Igor writes out to mov which worked really nicely, but wordpress can’t host movies, so I added a line to write out TIFFs of each frame of the animation and assembled a nice gif using FIJI.

Getting started with running

Getting into running was almost accidental. I am a committed cyclist and had always been of the opinion: since running doesn’t improve aerobic cycling performance (only cycling does that), any activity other than cycling is a waste of time. However, I realised that finding time for cycling was getting more difficult and also my goal is to keep fit and not to actually be a pro-cyclist, so running had to be worth a try. Roughly speaking, running is about three times more time efficient compared to cycling. One hour of running approximates to three hours of cycling. I thought, I would just try it. Over the winter. No more than that. Of course, I soon got the running bug and ran through most of 2016. Taking part in a few running events (marathon, half marathons, 10K). A quick four notes on my experience.

The key thing to keeping running is staying healthy and uninjured. That means building up distance and frequency of running very slowly. In fact, the limitation to running is the body’s ability to actually do the distance. In cycling this is different, as long as you fuel adequately and you’re reasonably fit, you could cycle all day if you wanted. This not true of running, and so, building up to doing longer distances is essential and the ramp up shouldn’t be rushed. Injuries will cost you lost weeks on a training schedule.

There’s lots of things “people don’t tell you” about running. Blisters and things everyone knows about, but losing a toenail during a 20 km run? Encountering runner’s GI problems? There’s lots of surprises as you start out. Joining a club or reading running forums probably helps (I didn’t bother!). In case you are wondering, the respective answers are getting decent shoes fitted and well, there is no cure.

Going from cycling to running meant going from very little upper body mass to gaining extra muscle. This means gaining weight. This is something of a shock to a cyclist and seems counterintuitive, since more activity should really equate to weight loss. I maintained cycling through the year, but was not expecting a gain of ~3 kilos.

As with any sport, having something to aim for is essential. Training for training’s sake can become pointless, so line up something to shoot for. Sign up for an event or at least have an achievement (distance, average speed) in your mind that you want to achieve.

So there you have it. I’ll probably continue to mix running with cycling in 2017. I’ll probably extend the repo to do more with cycling data if I have the time.

—

The post title is taken from “Colours Running Out” by TOY from their eponymous LP.

Our recent paper on “the mesh” in kinetochore fibres (K-fibres) of the mitotic spindle was our first adventure in 3D electron microscopy. This post is about some of the new data analysis challenges that were thrown up by this study. I promised a more technical post about this paper and here it is, better late than never.

In the paper we describe how over-expression of TACC3 causes the microtubules (MTs) in K-fibres to become “more wonky”. This was one of those observations that we could see by eye in the tomograms, but we needed a way to quantify it. And this meant coming up with a new spatial statistic.

After a few false starts*, we generated a method that I’ll describe here in the hope that the extra detail will be useful for other people interested in similar problems in cell biology.

The difficulty in this analysis comes from the fact that the fibres are randomly oriented, because of the way that the experiment is done. We section orthogonally to the spindle axis, but the fibre is rarely pointing exactly orthogonal to the tomogram. So the challenge is to reorient all the fibres to be able to pool numbers from across different fibres to derive any measurements. The IgorPro code to do this was made available with the paper. I have recently updated this code for a more streamlined workflow (available here).

We had two 3D point sets, one representing the position of each microtubule in the fibre at bottom of our tomogram and the other set is the position at the top. After creating individual MT waves from these point sets to work with, these waves could be plotted in 3D to have a look at them.

This is done in IgorPro by using a Gizmo. Shown here is a set of MTs from one K-fibre, rotated to show how the waves look in 3D, note that the scaling in z is exaggerated compared with x and y.

We need to normalise the fibres by getting them to all point in the same direction. We found that trying to pull out the average trajectory for the fibre didn’t work so well if there were lots of wonky MTs. So we came up with the following method:

Calculate the total cartesian distance of all MT waves in an xy view, i.e. the sum of all projections of vectors on an xy plane.

Rotate the fibre.

Recalculate the total distance.

Repeat.

So we start off with this set of waves (Original). We rotate through 3D space and plot the total distance at each rotation to find the minimum, i.e. when most MTs are pointing straight at the viewer. This plot (Finding Minimum) is coloured so that hot colours are the smallest distance, it shows this calculation for a range of rotations in phi and theta. Once this minimum is found, the MT waves can be rotated by this value and the set is then normalised (you need to click on the pictures to see them properly).

Original

Finding minimum

Rotated

Now we have all of the fibres that we imaged oriented in the same way, pointing to the zenith. This means we can look at angles relative to the z axis and derive statistics.

The next challenge was to make a measure of “wonkiness”. In other words, test how parallel the MTs are.

Violin plots of theta don’t really get across the wonkiness of the TACC3 overexpressed K-fibres (see figure above). To visualise this more clearly, each MT was turned into a vector starting at the origin and the point where the vector intersected with an xy plane set at an arbitrary distance in z (100 nm) was calculated. The scatter of these intersections demonstrates nicely how parallel the MTs are. If all MTs were perfectly parallel, they would all intersect at 0,0. In the control this is more-or-less true, with a bit of noise. In contrast, the TACC3-overexpressed group have much more scatter. What was nice is that the radial scatter was homogeneous, which showed that there was no bias in the acquisition of tomograms. The final touch was to generate a bivariate histogram which shows the scatter around 0,0 but it is normalised for the total number of points. Note that none of this possible without the first normalisation step.

Parallelism

The only thing that we didn’t have was a good term to describe what we were studying. “Wonkiness” didn’t sound very scientific and “parallelness” was also a bit strange. Parallelism is a word used in the humanities to describe analogies in art, film etc. However, it seemed the best term to describe the study of how parallel the MTs in a fibre are.

With a little help from my friends

The development of this method was borne out of discussions with Tom Honnor and Julia Brettschneider in the Statistics department in Warwick. The idea for the intersecting scatter plot came from Anne Straube in the office next door to me. They are all acknowledged in our paper for their input. A.G. at WaveMetrics helped me speed up my code by using MatrixOP and Euler’s rotation. His other suggestion of using PCA to do this would undoubtedly be faster, but I haven’t implemented this – yet. The bivariate histograms were made using JointHistogram() found here. JointHistogram now ships with Igor 7.

* as we describe in the paper

Several other strategies were explored to analyze deviations in trajectory versus the fiber axis. These were: examining the variance in trajectory angles, pairwise comparison of all MTs in the bundle, comparison to a reference MT that represented the fiber axis, using spherical rotation and rotating by an average value. These produced similar results, however, the one described here was the most robust and represents our best method for this kind of spatial statistical analysis.

The aim of the book is to give a guide to data analysis. It is not meant as a comprehensive data analysis “how to”, nor is it a manual for statistics or programming. Instead it is a high-level guide: how to think about data analysis and how to go about doing it. This makes it an interesting read for anyone working with data.

I think anyone who reads the Simply Statistics blog or who has read the piece Roger and Jeff wrote for Science, will be familiar with a lot of the content in here. At the beginning of the book, I didn’t feel like I learned too much. However, I can see that the “converted” are maybe not the target audience here. Towards the end of the book, the authors walk through a few examples of how to analyse some data focussing on the question in mind, how to refine it and then how to start the analysis. This is the most useful aspect of the book in my opinion, to see the approach to data analysis working in practice. The authors sum up the book early on by comparing it to books about songwriting. I admit to rolling my eyes at this comparison (data analysis as an artform…), but actually it is a good analogy. I think many people who work with data know how to do it, in the way that people who write songs know how to do it, although they probably have not had a formal course in the techniques that are being used. Equally reading a guidebook on songwriting will not make you a great songwriter. A book can only get you so far, intuition and invention are required and the same applies to data science.

The book was published via Lean Pub who have an interesting model where you pay a recommended price (or more!) but if you don’t have the money, you can pay less. Also, you can see what fraction goes to the author(s). The books can be updated continually as typos or code updates are fixed. Roger and the Simply Stats people have put out a few books via this publisher. These books on R, programming, statistics and data science all look good and it seems more books are coming soon.

On a personal note: In 2014, I decided to try and read one book per month. I managed it, but in 2015, I am struggling. It is now November and this book is the 7th I’ve read this year. It was published in September but it took me until now to finish it. Too much going on…

—

My Blank Pages is a track by Velvet Crush. This is an occasional series of book reviews.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles) along with data on citations that occurred per year. I generated histograms to look at distribution of citations for each year. Papers published in 2004 are in the top row, papers from 2014 are in the bottom row. The first histogram shows citations in the same year as publication, in the next column, the following year and so-on. Number of papers is on y and on x the number of citations. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.

What is interesting is that the distribution changes over time:

In the year of publication, most papers are not cited at all, which is expected since there is a lag to publication of papers which can cite the work and also some papers do not come out until later in the year, meaning the likelihood of a citing paper coming out decreases as the year progresses.

The following year most papers are picking up citations: the distribution moves rightwards.

Over the next few years the distribution relaxes back leftwards as the citations die away.

The distributions are always skewed. Few papers get loads of citations, most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.

The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how the cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations, 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.

—

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

Following on from the last post about publication lag times at cell biology journals, I went ahead and crunched the numbers for all journals in PubMed for one year (2013). Before we dive into the numbers, a couple of points about this kind of information.

Some journals “reset the clock” on the received date with manuscripts that are resubmitted. This makes comparisons difficult.

The length of publication lag is not necessarily a reflection of the way the journal operates. As this comment points out, manuscripts are out of the journals hands (with the reviewers) for a substantial fraction of the time.

The dataset is incomplete because the deposition of this information is not mandatory. About 1/3 of papers have the date information deposited (see below).

Publication lag times go hand-in-hand with peer review. Moving to preprints and post-publication review would eradicate these delays.

Thanks for all the feedback on my last post, particularly those that highlighted the points above.

To see how all this was done, check out the Methods bit below, where you can download the full summary. I ended up with a list of publication lag times for 428500 papers published in 2013 (see left). To make a bit more sense of this, I split them by journal and then found the publication lag time stats for each. This had to be done per journal since PLoS ONE alone makes up 45560 of the records.

To try and visualise what these publication lag times look like for all journals, I made a histogram of the Median lag times for all journals using a 10 d bin width. It takes on average ~100 d to go from Received to Accepted and a further ~120 d to go from Accepted to Published. The whole process on average takes 239 days.

To get a feel for the variability in these numbers I plotted out the ranked Median times for each journal and overlaid Q25 and Q75 (dots). The IQR for some of the slower journals was >150 d. So the papers that they publish can have very different fates.

Is the publication lag time longer at higher tier journals? To look at this, I used the Rec-Acc time and the 2013 Journal Impact Factor which, although widely derided and flawed, does correlate loosely with journal prestige. I have fewer journals in this dataset, because the lookup of JIFs didn’t find every journal in my starting set, either because the journal doesn’t have one or there were minor differences in the PubMed name and the Thomson-Reuters name. The median of the median Rec-Acc times for each bin is shown. So on average, journals with a JIF <1 will take 1 month longer to accept your paper than journal with an IF ranging from 1-10. After this it rises again, to ~2 months longer at journals with an IF over 10. Why? Perhaps at the lower end, the trouble is finding reviewers; whereas at the higher end, multiple rounds of review might become a problem.

The executive summary is below. These are the times (in days) for delays at all journals in PubMed for 2013.

On the whole it seems that publishing in Cell Biology is quite slow compared to the whole of PubMed. Why this is the case is a tricky question. Is it because cell biologists submit papers too early and they need more revision? Are they more dogged in sending back rejected manuscripts? Is it because as a community we review too harshly and/or ask too much of the authors? Do Editors allow too many rounds of revision or not give clear guidance to expedite the time from Received-to-Accepted? It’s probably a combination of all of these factors and we’re all to blame.

Finally, this amusing tweet to show the transparency of EMBO J publication timelines raises the question: would these authors have been better off just sending the paper somewhere else?

Methods: I searched PubMed using journal article[pt] AND ("2013/01/01"[PDAT] : "2013/12/31"[PDAT]) this gave a huge xml file (~16 GB) which nokogiri balked at. So I divided the query up into subranges of those dates (1.4 GB) and ran the script on all xml files. This gave 1425643 records. I removed records that did not have a received date or those with greater than 12 in the month field (leaving 428513 records). 13 of these records did not have a journal name. This gave 428500 records from 3301 journals. Again, I filtered out negative values (papers accepted before they were received) and a couple of outliers (e.g. 6000 days!). With a bit of code it was quite straightforward to extract simple statistics for each of the journals. You can download the data here to look up the information for a journal of your choice (wordpress only allows xls, not txt/csv). The fields show the journal name and the number of valid articles. Then for Acc-Pub, Rec-Acc and Rec-Pub, the number, Median, lower quartile, upper quartile times in days are given. I set a limit of 5 or more articles for calculation of the stats. Blank entries are where there was no valid data. Note that there are some differences with the table in my last post. This is because for that analysis I used a bigger date range and then filtered the year based on the published field. Here my search started out by specifying PDAT, which is slightly different.

The data are OK, but the publication date needs to be taken with a pinch of salt. For many records it was missing a month or day, so the date used for some records is approximate. In retrospect using the Entrez date or one of the other required fields would have probably be better. I liked the idea of the publication date as this is when the paper finally appears in print which still represents a significant delay at some journals. The Recieved-to-Accepted dates are valid though.

If you are reading this, you probably know how science publication works. Feel free to skip. Otherwise, it goes something like this. After writing up your work for publication, you submit it to a journal. Assuming that this journal will eventually publish the paper (there is usually a period of submitting, getting rejected, resubmitting to a different journal etc.), they receive the paper on a certain date. They send it out to review, they collate the reviews and send back a decision, you (almost always) revise your paper further and then send it back. This can happen several times. At some point it gets accepted on a certain date. The journal then prepares the paper for publication in a scheduled issue on a specific date (they can also immediately post papers online without formatting). All of these steps add significant delays. It typically takes 9 months to publish a paper in the biomedical sciences. In 2015 this sounds very silly, when world-wide dissemination of information is as simple as a few clicks on a trackpad. The bigger problem is that we rely on papers as a currency to get jobs or funding and so these delays can be more than just a frustration, they can affect your ability to actually do more science.

The good news is that it is very straightforward to parse the received, accepted and publisheddates from PubMed. So we can easily calculate the publication lags for cell biology journals. If you don’t work in cell biology, just follow the instructions below to make your own list.

The bad news is that the deposition of the date information in PubMed depends on the journal. The extra bad news is that three of the major cell biology journals do not deposit their data: J Cell Biol, Mol Biol Cell and J Cell Sci. My original plan was to compare these three journals with Traffic, Nat Cell Biol and Dev Cell. Instead, I extended the list to include other journals which take non-cell biology papers (and deposit their data).

A summary of the last ten years

Three sets of box plots here show the publication lags for eight journals that take cell biology papers. The journals are Cell, Cell Stem Cell, Current Biology, Developmental Cell, EMBO Journal, Nature Cell Biology, Nature Methods and Traffic (see note at the end about eLife). They are shown in alphabetical order. The box plots show the median and the IQR, whiskers show the 10th and 90th percentiles. The three plots show the time from Received-to-Published (Rec-Pub), and then a breakdown of this time into Received-to-Accepted (Rec-Acc) and Accepted-to-Published (Rec-Pub). The colours are just to make it easier to tell the journals apart and don’t have any significance.

You can see from these plots that the journals differ widely in the time it takes to publish a paper there. Current Biology is very fast, whereas Cell Stem Cell is relatively slow. The time it takes the journals to move them from acceptance to publication is pretty constant. Apart from Traffic where it takes an average of ~3 months to get something in to print. Remember that the paper is often online for this period so this is not necessarily a bad thing. I was not surprised that Current Biology was the fastest. At this journal, a presubmission inquiry is required and the referees are often lined up in advance. The staff are keen to publish rapidly, hence the name, Current Biology. I was amazed at Nature Cell Biology having such a short time from Received-to-Acceptance. The delay in Review-to-Acceptance comes from multiple rounds of revision and from doing extra experimental work. Anecdotally, it seems that the review at Nature Cell Biol should be just as lengthy as at Dev Cell or EMBO J. I wonder if the received date is accurate… it is possible to massage this date by first rejecting the paper, but allowing a resubmission. Then using the resubmission date as the received date [Edit: see below]. One way to legitimately limit this delay is to only allow a certain time for revisions and only allow one round of corrections. This is what happens at J Cell Biol, unfortunately we don’t have this data to see how effective this is.

How has the lag time changed over the last ten years?

Have the slow journals always been slow? When did they become slow? Again three plots are shown (side-by-side) depicting the Rec-Pub and then the Rec-Acc and Acc-Pub time. Now the intensity of red or blue shows the data for each year (2014 is the most intense colour). Again you can see that the dataset is not complete with missing date information for Traffic for many years, for example.

Interestingly, the publication lag has been pretty constant for some journals but not others. Cell Stem Cell and Dev Cell (but not the mothership – Cell) have seen increases as have Nature Cell Biology and Nature Methods. On the whole Acc-Pub times are stable, except for Nature Methods which is the only journal in the list to see an increase over the time period. This just leaves us with the task of drawing up a ranked list of the fastest to the slowest journal. Then we can see which of these journals is likely to delay dissemination of our work the most.

The Median times (in days) for 2013 are below. The journals are ranked in order of fastest to slowest for Received-to-Publication. I had to use 2013 because EMBO J is missing data for 2014.

Journal

Rec-Pub

Rec-Acc

Acc-Pub

Curr Biol

159

99.5

56

Nat Methods

192

125

68

Cell

195

169

35

EMBO J

203

142

61

Nature Cell Biol

237

180

59

Traffic

244

161

86

Dev Cell

247

204

43

Cell Stem Cell

284

205

66

You’ll see that only Cell Stem Cell is over the threshold where it would be faster to conceive and give birth to a human being than to publish a paper there (on average). If the additional time wasted in submitting your manuscript to other journals is factored in, it is likely that most papers are at least on a par with the median gestation time.

If you are wondering why eLife is missing… as a new journal it didn’t have ten years worth of data to analyse. It did have a reasonably complete set for 2013 (but Rec-Acc only). The median time was 89 days, beating Current Biology by 10.5 days.

Methods

Please check out Neil Saunders’ post on how to do this. I did a PubMed search for (journal1[ta] OR journal2[ta] OR ...) AND journal article[pt] to make sure I didn’t get any reviews or letters etc. I limited the search from 2003 onwards to make sure I had 10 years of data for the journals that deposited it. I downloaded the file as xml and I used Ruby/Nokogiri to parse the file to csv. Installing Nokogiri is reasonably straightforward, but the documentation is pretty impenetrable. The ruby script I used was from Neil’s post (step 3) with a few lines added:

and then executed as described. The csv could then be imported into IgorPro and processed. Neil’s post describes a workflow for R, or you could use Excel or whatever at this point. As he notes, quite a few records are missing the date information and some of it is wrong, i.e. published before it was accepted. These need to be cleaned up. The other problem is that the month is sometimes an integer and sometimes a three-letter code. He uses lubridate in R to get around this, a loop-replace in Igor is easy to construct and even Excel can handle this with an IF statement, e.g. IF(LEN(G2)=3,MONTH(1&LEFT(G2,3)),G2) if the month is in G2. Good luck!

Edit 9/3/15 @ 17:17 several people (including Deborah Sweet and Bernd Pulverer from Cell Press/Cell Stem Cell and EMBO, respectively) have confirmed via Twitter that some journals use the date of resubmission as the submitted date. Cell Stem Cell and EMBO journals use the real dates. There is no way to tell whether a journal does this or not (from the deposited data). Stuart Cantrill from Nature Chemistry pointed out that his journal do declare that they sometimes reset the clock. I’m not sure about other journals. My own feeling is that – for full transparency – journals should 1) record the actual dates of submission, acceptance and publication, 2) deposit them in PubMed and add them to the paper. As pointed out by Jim Woodgett, scientists want the actual dates on their paper, partly because they are the real dates, but also to claim priority in certain cases. There is a conflict here, because journals might appear inefficient if they have long publication lag times. I think this should be an incentive for Editors to simplify revisions by giving clear guidance and limiting successive revision cycles. (This Edit was corrected 10/3/15 @ 11:04).

—

The post title is taken from “Waiting to Happen” by Super Furry Animals from the “Something 4 The Weekend” single.

I thought I’d share this piece of analysis looking at productivity of people in the lab. Here, productivity means publishing papers. This is unfortunate since some people in my lab have made some great contributions to other peoples’ projects or have generally got something going, but these haven’t necessarily transferred into print. Also, the projects people have been involved in have varied in toughness. I’ve had students on an 8-week rotation who just collected some data which went straight into a paper and I’ve had postdocs toil for two years trying to purify a protein complex… I wasn’t looking to single out who was the most productive person (I knew who that was already), but I was interested to look at other things, e.g. how long is it on average from someone joining the lab to them publishing their first paper?

The information below would be really useful if it was available for all labs. When trainees are looking for jobs, it would be worth knowing the productivity of a given lab. This can be very hard to discern, since it is difficult to see how many people have worked in the lab and for how long. Often all you have to go on is the PubMed record of the PI. Two papers published per year in the field of cell biology is a fantastic output, but not if you have a lab of thirty people. How likely are you – as a future lab member – to publish your own 1st author paper? This would be very handy to know before applying to a lab.

I extracted the date of online publication for all of our papers as well as author position and other details. I had a record of start and end dates for everyone in the lab. Although as I write this, I realise that I’ve left one person off by mistake. All of this could be dumped into IgorPro and I wrote some code to array the data in a plot vs time. People are anonymous – they’ll know who they are, if they’re reading. Also we have one paper which is close to being accepted so I included this although it is not in press yet.

The first plot shows when people joined the lab and how long they stayed. Each person has been colour-coded according to their position. The lines represent their time spent in the lab. Some post-graduates (PG) came as a masters student for a rotation and then came back for a PhD and hence have a broken line. Publications are shown by markers according to when a paper featuring that person was published online. There’s a key to indicate a paper versus review/methods paper and if the person was 1st author or not. We have published two papers that I would call collaborative, i.e. a minor component from our lab. Not shown here are the publications that are authored by me but don’t feature anyone else working in the lab.

This plot starts when I got my first independent position. As you can see it was ~1 year until I was able to recruit my first tech. It was almost another 2 years before we published our first paper. Our second one took almost another 2 years! What is not factored in here is the time spent waiting to get something published – see here. The early part of getting a lab going is tough, however you can see that once we were up-and-running the papers came out more quickly.

In the second plot, I offset the traces to show duration in the lab and relative time to publication from the start date in the lab. I also grouped people according to their position and ranked them by duration in the lab. This plot is clearer for comparing publication rates and lag to getting the first paper etc. This plot shows quite nicely that lots of people from the lab publish “posthumously”. This is thanks to the publication lag but also to things not getting finished or results that needed further experiments to make sense etc. Luckily the people in my lab have been well organised, which has made it possible to publish things after they’ve left.

I was surprised to see that five people published within ~1.5 years of joining the lab. However, in each case the papers came about because of some groundwork by other people.

I think the number of people and the number of papers are both too low to begin to predict how long someone will take to get their first paper out, but these plots give a sense of how tough it is and how much effort and work is required to make it into print.

Methods: To recreate this for your own lab, you just need a list of lab members with start and end dates. The rest can be extracted from PubMed. Dataviz fans may be interested that the colour scheme is taken from Paul Tol’s guide.

—

The post title comes from “If and When” by The dB’s from Ride The Wild Tomtom

Having recently got my head around violin plots, I thought I would explain what they are and why you might want to use them.

There are several options when it comes to plotting summary data. I list them here in order of granularity, before describing violin plots and how to plot them in some detail.

Bar chart

This is the mainstay of most papers in my field. Typically, a bar representing the mean value that’s been measured is shown with an error bar which shows either the standard error of the mean, the standard deviation, or more rarely a confidence interval. The two data series plotted in all cases is the waiting time for Old Faithful eruptions (waiting), a classic dataset from R. I have added some noise to waiting (waiting_mod) for comparison. I think it’s fair to say that most people feel that the bar chart has probably had its day and that we should be presenting our data in more informative ways*.

Pros: compact, easy to tell differences between groups

Cons: hides the underlying distribution, obscures the n number

Box plot

The box plot – like many things in statistics – was introduced by Tukey. It’s sometimes known as a Tukey plot, or a box-and-whiskers plot. The idea was to give an impression of the underlying distribution without showing a histogram (see below). Histograms are great, but when you need to compare many distributions they do not overlay well and take up a lot of space to show them side-by-side. In the simplest form, the “box” is the interquartile range (IQR, 25th and 75th percentiles) with a line to show the median. The whiskers show the 10th and 90th percentiles. There are many variations on this theme: outliers can be shown or not, the whiskers may show the limits of the dataset (or something else), the boxes can be notched or their width may represent the sample size…

A histogram is a method of showing the distribution of a dataset and was introduced by Pearson. The number of observations within a bin are counted and plotted. The ‘bars’ sit next to each other, because the variable being measured is continuous. The variable being measured is on the x-axis, rather than the category (as in the other plots).

Often the area of all the bars is normalised to 1 in order to assess the distributions without being confused by differences in sample size. As you can see here, “waiting” is bimodal. This was hidden in the bar chart and in the bot plot.

Related to histograms are other display types such as stemplots or stem-and-leaf plots.

Pros: shows multimodal data, shows normality/skewness clearly

Cons: not compact, difficult to overlay, bin size and position can be misleading

Scatter dot plot

It’s often said that if there are less than 10 data points, then best practice is to simply show the points. Typically the plot is shown together with a bar to show the mean (or median) and maybe with error bars showing s.e.m., s.d., IQR. There are a couple of methods of plotting the points, because they need to be scattered in x value in order to be visualised. Adding random noise is one approach, but this looks a bit messy (top). A symmetrical scatter can be introduced by binning (middle) and a further iteration is to bin the y values rather than showing their true location (bottom). There’s a further iteration which constrains the category width and overlays multiple points, but again the density becomes difficult to see.

These plots still look quite fussy, the binned version is the clearest but then we are losing the exact locations of the points, which seems counterintuitive. Another alternative to scattering the dots is to show a rug plot (see below) where there is no scatter.

Pros: shows all the observations

Cons: can be difficult to assess the distribution

Violin plot

This type of plot was introduced in the software package NCSS in 1997 and described in this paper: Hintze & Nelson (1998) The American Statistician 52(2):181-4 [PDF]. As the title says, violin plots are a synergism between box plot and density trace. A thin box plot is shown together with a symmetrical kernel density estimate (KDE, see explanation below). The point is to be able to quickly assess the distribution. You can see that the bimodality of waiting in the plot, but there’s no complication of lots of points just a smooth curve to see the data.

Pros: shows multimodal data, shows normality/skewness unambiguously

Cons: hides n, not familiar to many readers.

—

* Why is the bar chart so bad and why should I show my data another way?

The best demonstration of why the bar chart is bad is Anscombe’s Quartet (the figure to the right is taken from the Wikipedia page). These four datasets are completely different, yet they all have the same summary statistics. The point is, you would never know unless you plotted the data. A bar chart would look identical for all four datasets.

—

Making Violin Plots in IgorPro

I wanted to make Violin Plots in IgorPro, since we use Igor for absolutely everything in the lab. I wrote some code to do this and I might make some improvements to it in the future – if I find the time! This was an interesting exercise, because it meant forcing myself to understand how smoothing is done. What follows below is an aide memoire, but you may find it useful.

What is a kernel density estimate?

A KDE is a non-parametric method to estimate a probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate. Here, a rectangle is used to represent each observation and it gets bigger the more observations are made.

What’s wrong with using a histogram as a KDE?

The following examples are taken from here (which in turn are taken from the book by Bowman and Azzalini described below). A histogram is simplistic. We lose the location of each datapoint because of binning. Histograms are not smooth and the estimate is very sensitive to the size of the bins and also the starting location of the first bin. The histograms to the right show the same data points (in the rug plot).

Using the same bin size, they result in very different distributions depending on where the first bin starts. My first instinct to generate a KDE was to simply smooth a histogram, but this is actually quite inaccurate as it comes from a lossy source. Instead we need to generate a real KDE.

How do I make a KDE?

To do this we place a kernel (a Gaussian is commonly used) at each data point. The rationale behind this is that each observation can be thought of as being representative of a greater number of observations. It sounds a bit bizarre to assume normality to estimate a density non-parametrically, but it works. We can sum all of the kernels to give a smoothed distribution: the KDE. Easy? Well, yes as long as you know how wide to make the kernels. To do this we need to find the bandwidth, h (also called the smoothing parameter).

It turns out that this is not completely straightforward. The answer is summarised in this book: Bowman & Azzalini (1997) Applied Smoothing Techniques for Data Analysis. In the original paper on violin plots, they actually do not have a good solution for selecting h for drawing the violins, and they suggest trying several different values for h. They recommend starting at ~15% of the data range as a good starting point. Obviously if you are writing some code, the process of selecting h needs to be automatic.

Optimising h is necessary because if h is too large, the estimate with be oversmoothed and features will be lost. If h is too small, then it will be undersmoothed and bumpy. The examples to the right (again, taken from Bowman & Azzalini, via this page) show examples of undersmoothed, oversmoothed and optimal smoothing.

An optimal solution to find h is

\(h = \left(\frac{4}{3n}\right)^{\frac{1}{5}}\sigma\)

This is termed Silverman’s rule-of-thumb. If smoothing is needed in more than one dimension, the multidimensional version is

You might need multidimensional smoothing to contextualise more than one parameter being measured. The waiting data used above describes the time to wait until the next eruption from Old Faithful. The duration of the eruption is measured, and also the wait to the next eruption can be extracted, giving three parameters. These can give a 3D density estimate as shown here in the example.

The Bowman & Azzalini recommend that, if the distribution is long-tailed, using the median absolute deviation estimator is robust for \(\sigma\).

\(\tilde\sigma=median\left\{|y_i-\tilde\mu|\right\}/0.6745\)

where \(\tilde\mu\) is the median of the sample. All of this is something you don’t need to worry about if you use R to plot violins, the implementation in there is rock solid having been written in S plus and then ported to R years ago. You can even pick how the h selection is done from sm.density, or even modify the optimal h directly using hmult.

To get this working in IgorPro, I used some code for 1D KDE that was already on IgorExchange. It needed a bit of modification because it used FastGaussTransform to sum the kernels as a shortcut. It’s a very fast method, but initially gave an estimate that seemed to be undersmoothed. I spent a while altering the formula for h, hence the detail above. To cut a long story short, FastGaussTransform uses Taylor expansion of the Gauss transform and it just needed more terms to do this accurately. This is set with the /TET flag. Note also, that in Igor the width of a Gauss is sigma*2^1/2.

OK, so how do I make a Violin for plotting?

I used the draw tools to do this and placed the violins behind an existing box plot. This is necessary to be able to colour the violins (apparently transparency is coming to Igor in IP7). The other half of the violin needs to be calculated and then joined by the DrawPoly command. If the violins are trimmed, i.e. cut at the limits of the dataset, then this required an extra point to be added. Without trimming, this step is not required. The only other issue is how wide the violins are plotted. In R, the violins are all normalised so that information about n is lost. In the current implementation, box width is 0.1 and the violins are normalised to the area under the curve*(0.1/2). So, again information on n is lost.

Future improvements

Ideas for developments of the Violin Plot method in IgorPro

incorporate it into the ipf for making boxplots so that it is integrated as an option to ‘calculate percentiles’