Recently some colleagues and I published a paper in PLOS in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to share a few more things about how this paper came about – the things that don’t make it into the final scholarly article. A behind-the-scenes look, if you will.

The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement. Data available upon request. Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy. Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).

What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.

I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote that I’ll share about this paper that I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with another publication I’d previously published in PLOS. When I shared the data from the first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever published, much less accepted.

While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?

Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper. This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was to upload it to Figshare, have it generate a DOI, but keep the dataset listed as private rather than publicly release it. Our data availability statement included the DOI and was therefore on the surface in compliance, but I had a feeling that, if you went to the DOI, it would tell you that the dataset was private or wasn’t found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.

To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)

My mom got the whole family 23andme kits for Christmas this year, and I’ve been looking forward to getting the results mostly so I could play with the raw data in R. It finally came in, so I went back to a blog post I’d read about analyzing 23andme data using a Bioconductor package called GWASCat. It’s a really good blog post, but as it happens, the package has been updated since it was written, so the code, sadly, didn’t work. Since I of course have VAST knowledge of bioinformatics (by which I mean I’ve hung around a lot of bioinformaticians and talked to them and kind of know a few things but not really) and am super awesome at R (by which I mean I’m like moderately okay at it), I decided to try my hand at coming up with my own analysis, based on the original blog post.

Let me be incredibly clear – I have only a vague notion of what I’m doing here, so you should not take any of what I say to be, you know, like, necessarily fact. In fact, I would love for a real bioinformatician to read this and point out my errors so I can fix them. That said, here is what I did, and what you can do too!

To walk you through it first in plain English – what you get in your raw data from 23andme is a list of RSIDs, which are accession numbers for SNPs, or single nucleotide polymorphisms. At a given position in your genetic sequence, for example, you may have an A, which means you’ll have brown hair, as opposed to a G, which means you’ll have blonde hair. Of course, it’s a lot more complicated than that, but the basic idea is that you can link traits to SNPs.

So the task that needs to be done here is two-fold. First, I need to get a list of SNPs with their strongest risk allele – in other words, what SNP location am I looking for, and which nucleotide is the one that’s associated with higher risk. Then, I need to match this up with my own list of SNPs and find the ones where both of my nucleotides are the risk allele. Here’s how I did it!
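As a minimal sketch of the very first step, reading the raw data into a data frame might look like the following. It assumes the usual 23andme export layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, and genotype); the sample lines and the name `my_snps` are made up for illustration.

```r
# Made-up lines in the layout of a 23andme raw data export:
# '#' comment lines, then tab-separated rsid, chromosome, position, genotype.
raw_lines <- c(
  "# rsid\tchromosome\tposition\tgenotype",
  "rs1423096\t1\t1000\tGG",
  "rs0000001\t2\t2000\tAG",
  "rs0000002\tX\t3000\tCC"
)

# With a real export, swap `text = raw_lines` for `file = "path/to/your_file.txt"`.
my_snps <- read.table(
  text = raw_lines,
  sep = "\t",
  comment.char = "#",            # skip the header comments
  col.names = c("rsid", "chromosome", "position", "genotype"),
  colClasses = "character",      # keep chromosomes like "X" and genotypes as text
  stringsAsFactors = FALSE
)

str(my_snps)
```

Reading every column as character is a deliberate choice here, since chromosome labels are not all numbers and nothing in this analysis needs the positions as numbers.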

Next I need to pull in the data about the SNPs from gwascat. I can use this to match up the RSIDs in my data with the ones in their data. I’m also going to drop some other columns I’m not interested in at the moment.
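A rough sketch of that matching step, using two tiny made-up data frames rather than the real gwascat output: the column names `snp`, `risk_allele`, `trait`, and `pubmed_id` are assumptions for illustration, and the names in the actual package may well differ.

```r
# Hypothetical stand-in for my 23andme data (rsid and genotype only).
my_snps <- data.frame(
  rsid = c("rs1423096", "rs0000001", "rs0000002"),
  genotype = c("GG", "AG", "CC"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the table pulled from gwascat.
gwas <- data.frame(
  snp = c("rs1423096", "rs0000001", "rs9999999"),
  risk_allele = c("rs1423096-G", "rs0000001-A", "rs9999999-T"),
  trait = c("Trait A", "Trait B", "Trait C"),
  pubmed_id = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Keep only the RSIDs that appear in both tables.
trait_list <- merge(my_snps, gwas, by.x = "rsid", by.y = "snp")

# Drop the columns I'm not interested in at the moment.
trait_list <- trait_list[, c("rsid", "genotype", "risk_allele", "trait")]

trait_list
```

By default `merge()` does an inner join, so SNPs that are only in one of the two tables simply fall out of the result.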

Now I want to find out where I have the risk allele. This is where this analysis gets potentially stupid. The risk allele is stored in the gwascat dataset with its RSID plus the allele, such as rs1423096-G. My understanding is that if you have two Gs at that position (remembering that you get one copy from each of your parents), then you’re at higher risk. So I want to create a new column in my merged dataset that has the risk allele printed twice, so that I can just compare it to the column with my data, and only have it show me the ones where I have two copies of the risk allele (since I don’t want to dig through all 10,000+ genes to find the ones of interest).
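One way to build that doubled-allele column, sketched on a made-up two-row version of the merged data (the `risk_allele` column name is an assumption; `badsnp` and `genotype` match the names used in the filtering line further down):

```r
# Hypothetical merged data: my genotype next to the RSID-plus-allele string.
trait_list <- data.frame(
  rsid = c("rs1423096", "rs0000001"),
  genotype = c("GG", "AG"),
  risk_allele = c("rs1423096-G", "rs0000001-A"),
  stringsAsFactors = FALSE
)

# Strip everything up to and including the last hyphen, leaving just the allele.
allele <- sub("^.*-", "", trait_list$risk_allele)

# Print the allele twice so it can be compared directly with the genotype column.
trait_list$badsnp <- paste0(allele, allele)

trait_list
```

In this toy data only the first row ends up with `genotype` equal to `badsnp`; a mixed genotype like AG does not match AA, which is the intended behavior when looking for two copies of the risk allele.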

Okay, almost there! Now let’s remove all the stuff that’s not interesting and just keep the ones where I have two copies of the risk allele. I could also have it remove ones that don’t match my ancestry (European) or gender, but I’m not going to bother with it, since keeping the ones where I have a match gives me a reasonable amount to scroll through.

# Keep rows where my genotype equals the doubled risk allele; which() skips NA comparisons
my_risks <- trait_list[which(trait_list$genotype == trait_list$badsnp), ]

And there you have it! I don’t know how meaningful this analysis really is, but according to this, some traits that I have are higher educational attainment (true), persistent temperament (sure, I suppose so), migraines (true), and I care about the environment (true). Also I could be at higher risk of having a psychotic episode if I’m on methamphetamine, so I’ll probably skip trying that (which was actually my plan anyway). Anyway, it’s kind of entertaining to look at, and I’m finding SNPedia is useful for learning more.

So, now, bring on the bioinformaticians telling me how incorrect all of this is. I will eagerly make corrections to anything I am wrong about!

I don’t know if this terminology is common outside of library circles, but it seems like the “flipped classroom” has been all the rage in library instruction lately. The idea is that learners do some work before coming to the session (like read something or watch a video lecture), and then the in-person time is spent on doing more activities, group exercises, etc. As someone who is always keen to try something new and exciting, I decided to see what would happen if I tried out the flipped classroom model for my R classes.

Actually, teaching R this way makes a lot of sense. Especially if you don’t have any experience, there’s a lot of baseline knowledge you need before you can really do anything interesting. You’ve got to learn a lot of terminology, how the syntax of R works, boring things like what a data frame is and why it matters. That could easily be covered before class to save the in person time for the more hands-on aspects. I’ve also noticed a lot of variability in terms of how much people know coming into classes. Some people are pretty tech savvy when they arrive, maybe even have some experience with another programming language. Other people have difficulty understanding how to open a file. It’s hard to figure out how to pace a class when you’ve got people from all over that spectrum of expertise. On the other hand, curriculum planning would be much easier if you could know that everyone is starting out with a certain set of knowledge and build off of it.

The other reason I wanted to try this is just the time factor. I’m busy, really busy. My library’s training room is also hard to book because we offer so many classes. The people I teach are busy. I teach my basic introduction to R course as a 3-hour session, and though I’d really rather make it 4 hours, even finding a 3-hour window when I and the room are both available and people are likely to be able to attend is difficult. Plus, it would be nice if there was some way to deliver this instruction that wasn’t so time-intensive for me. I love teaching R – it’s probably my favorite thing I do in my job and I’d estimate I’ve taught close to 500 researchers how to code. I generally spend around 9 hours a month teaching R, plus another 4-6 hours doing prep, administrative stuff, and all the other things that have to get done to make a class function. That’s a lot of time, and though I don’t at all mind doing it, I’d definitely be interested in any sort of way I could streamline that work without having a negative impact on the experience of learning R from me.

For all these reasons, I decided to experiment with trying out a flipped classroom model for my introduction to R class. I had grand plans of making a series of short video tutorials that covered bite-sized pieces of learning R. There would be a bunch of them, but they’d be about 5 minutes each. I arranged for the library to get Adobe Captivate, which is very cool video tutorial software, and these tutorials are going to be so awesome when I get around to making them. However, I had already scheduled the class for today, February 28, and I hadn’t gotten around to making them yet. Fortunately, I had a recording of a previous Intro to R class I’d taught, so I chopped the relevant parts of that up into smaller pieces and made a YouTube playlist that served as my pre-class work for this session, probably about two and a half hours total.

I had 42 people who were either signed up or on the waitlist at the end of last week. I think I made the class description pretty clear – that this session was only an hour, but you did have to do stuff before you got there. I sent out an email with the link to the video reminding people that they would be lost in class if they didn’t watch this stuff. Even so, yesterday morning, the last of the videos had only 8 views, and I knew at least two of those were from me checking the video to make sure it worked. So I sent out another email, once again imploring them to watch the videos before they came to class and to please cancel their registration and sign up for a regular R class if this video thing wasn’t for them.

By the time I taught the class this afternoon, 20 people had canceled their registration. Of the remaining 22, 5 showed up. Of the 5 that showed up, it quickly became apparent to me that none of them had watched the videos. I knew no one was going to answer honestly if I asked who had watched them, so I started by telling them to read in the CSV file to a data frame. This request is pretty fundamental, and also pretty much the first thing I covered in the videos, so when I was met with a lot of blank stares, I knew this experiment had pretty much failed. I did my best to cover what I could in an hour, but that’s not much, so instead of this being a cool, interactive class where people ended up feeling empowered and ready to go write code, I got the feeling those people left feeling bewildered and like they wasted an hour. One guy who had come in 10 minutes late came up to me after class and was like, “so this is a programming language? What can you do with it?” And I kind of looked at him like….whaaaat? It turned out he hadn’t even registered for the class to begin with, much less done any of the pre-class work – he had been in the library and saw me teaching and apparently thought it looked interesting so he decided to wander in.

I felt disappointed by this failed experiment, but I’m not one to give up at the first sign of failure, so I’ve been thinking about how I could make this system work. It could just be that this model is not suited to people in the setting where I teach. I am similar to them – a busy, working professional who knows this is useful and I should learn it, but it’s hard to find the time – and I think about what it would take for me to do the pre-class work. If I had the time and the videos were decent enough quality, I think I’d do it, but honestly chances are 50-50 that I’d be able to find the time. So maybe this model just isn’t made for my community.

Before I give up on this experiment entirely, though, I’d love to hear from anyone who has tried this kind of approach for adult learners. Did it work, did it not? What went well and what didn’t? And of course, being the data queen that I am, I intend to collect some data. I’m working on a modified class evaluation for those 5 brave souls who did come to get some feedback on the pre-class work model, and I’m also planning on sending a survey out to the other 38 people who didn’t come to see what I can find out from them. Data to the rescue of the flipped class!

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon. I think this is the sixth hackathon I’ve been involved in, but this is the first time I’ve actually been a participant, i.e. a “hacker.” Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive hackathons – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize. This approach can lead to successful and innovative solutions to problems in a short time frame. However, the so-called NCBI-style hackathons that I’ve been involved in over the last couple years involve multiple teams each working on their own individual challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons. For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this was the first hackathon that I’ve actually been involved in as a participant on a team, but I’ve had a lot of fun doing some librarian-y type “consulting” for five other hackathons before this, and it’s an experience I can highly recommend for any information professional who is interested in seeing science happen in real time. There’s something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem. Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on their way out (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a lot of organizational effort and background work to make something like this successful. A lot of that work needs to be done by someone with a lot of scientific and computing expertise. However, if you are a librarian who is reading this, I’m here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code. In the past five hackathons, I’ve sort of functioned as an embedded informationist/librarian, doing things like:

basic lit searching for paper introductions and generally locating background information. These aren’t formal papers that require an extensive or systematic lit review, but it’s useful for a paper to provide some context for why the problem is significant. The hackers have a ton of work to fit into three days, so it’s silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway.

manuscript editing and scholarly communication advice. Anyone who has worked with co-authors knows that it takes some work to make the paper sound cohesive, and not like five or six people’s papers smushed together. Having someone like a librarian with editing experience to help make that happen can be really helpful. Plus, many librarians have relevant expertise in scholarly publishing, especially useful since hackathon participants are often students and earlier career researchers who haven’t had as much experience with submitting manuscripts. They can benefit from advice on things like citation management and handling the submission process. Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers. Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic. It can be helpful to have someone who hasn’t been involved with the project from the start take a look at the manuscript and point out where additional background or explanation might be beneficial to aiding general understandability.

consulting on information seeking behavior and giving user feedback. Most of the hackathons I’ve worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other types of scientists. They are all highly experienced and brilliant people, but most have a particular perspective related to their specific subject area, whereas librarians often have a broader perspective based on our interactions with lots of people from various different subject areas. I often find myself thinking of how other researchers I’ve met might use a tool in other ways, potentially ones the hackathon creators didn’t necessarily intend. Also, at least at the hackathons I’ve been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results. Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of a scope.

I think librarians can bring a lot to hackathons, and I’d encourage all hackathon organizers to think about engaging librarians in the process early on. But it’s not a one-way street – there’s a lot for librarians to gain from getting involved in a hackathon, even tangentially. For one thing, seeing a project go from idea to reality in three days is interesting and informative. When I first started working with hackathons, I didn’t have that much coding experience, and I certainly had no idea how software was actually developed. Even just hanging around hackathons gave me so much of a better understanding, and as an informationist who supports data science, that understanding is very relevant. Even if you’re not involved in data science per se, if you’re a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience. I hadn’t really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc), and my friend was like, what are you even talking about, and that was when it occurred to me that I’ve really learned a lot from hackathons! And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements. Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized. 🙂

There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

Doesn’t it seem like a lot of people died in 2016? Think of all the famous people the world lost this year. It was around the time that Alan Thicke died a couple weeks ago that I started thinking, this is quite odd; uncanny, even. Then again, maybe there was really nothing unusual about this year, but because a few very big names passed away relatively young, we were all paying a little more attention to it. Because I’m a data person, I decided to do a rather silly thing, which was to write an R script that would go out and collect a list of celebrity deaths, clean up the data, and then do some analysis and visualization.

You might wonder why I would spend my limited free time doing this rather silly thing. For one thing, after I started thinking about celebrity deaths, I really was genuinely curious about whether this year had been especially fatal or if it was just an average year, maybe with some bigger names. More importantly, this little project was actually a good way to practice a few things I wanted to teach myself. Probably some of you are just here for the death, so I won’t bore you with a long discussion of my nerdy reasons, but if you’re interested in R, Github, and what I learned from this project that actually made it quite worth while, please do stick around for that after the death discussion!

Part One: Celebrity Deaths!

To do this, I used Wikipedia’s lists of deaths of notable people from 2006 to present. This dataset is very imperfect, for reasons I’ll discuss further, but obviously we’re not being super scientific here, so let’s not worry too much about it. After discarding incomplete data, this left me with 52,185 people. Here they are on a histogram, by year.

As you can see, 2016 does in fact have the most deaths, with 6,640 notable people’s deaths having been recorded as of January 3, 2017. The next closest year is 2014, when 6,479 notable people died, but that’s a full 161 fewer people than in 2016 (which is only a 2% difference, to be fair, but still). The average number of notable people who died yearly over this 11-year period was 4,774, and the number of people that died in 2016 alone is 40% higher than that average. So it’s not just in my head, or yours – more notable people died this year.
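These comparisons are plain arithmetic on the counts quoted in this post. As a quick sketch (the variable names are made up for illustration), they can be reproduced like this:

```r
# Counts quoted in the post.
deaths_2016 <- 6640
deaths_2014 <- 6479
yearly_mean <- 4774

# Absolute gap between the two highest years.
gap <- deaths_2016 - deaths_2014

# Gap as a percentage of the 2014 count.
pct_over_2014 <- 100 * gap / deaths_2014

# How far 2016 sits above the yearly average, as a percentage.
pct_over_mean <- 100 * (deaths_2016 - yearly_mean) / yearly_mean

round(c(gap = gap, pct_over_2014 = pct_over_2014, pct_over_mean = pct_over_mean), 1)
```

The gap works out to about 2.5% of the 2014 count, and 2016 comes out a little over 39% above the quoted average, in line with the rounded 2% and 40% figures.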

Now, before we all start freaking out about this, it should be noted that the higher number of deaths in 2016 may not reflect more people actually dying – it may simply be that more deaths are being recorded on Wikipedia. The fairly steady increase and the relatively low number of deaths reported in 2006 (when Wikipedia was only five years old) suggests that this is probably the case. I do not in any way consider Wikipedia a definitive source when it comes to vital statistics, but since, as I’ve mentioned, this project was primarily to teach myself some coding lessons, I didn’t bother myself too much about the completeness or veracity of the data. Besides likely being an incomplete list, there are also some other data problems, which I’ll get to shortly.

By the way, in case you were wondering what the deadliest month is for notable people, it appears to be January:

Obviously a death is sad no matter how old the person was, but part of what seemed to make 2016 extra awful is that many of the people who died seemed relatively young. Are more young celebrities dying in 2016? This boxplot suggests that the answer to that is no:

This chart tells us that 2016 is pretty similar to other years in terms of the age at which notable people died. The mean age of death in 2016 was 76.85, which is actually slightly higher than the overall mean of 75.95. The red dots on the chart indicate outliers, basically people who died at an age that’s significantly more or less than the age most people died at in that year. There are 268 in 2016, which is a little more than other years, but not shockingly so.
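For a sense of how a boxplot decides which points count as outliers, here is a small sketch using base R’s `boxplot.stats()`. It assumes the common rule of flagging anything more than 1.5 times the box height beyond the box edges, which is R’s default; the post does not say which rule the chart used, and the ages below are made up.

```r
# Made-up ages at death for one year.
ages <- c(27, 70, 72, 75, 76, 78, 80, 81, 83, 85, 88)

stats <- boxplot.stats(ages)   # default coef = 1.5

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end.
stats$stats

# Values falling outside the whiskers, i.e. the dots drawn on the plot.
outliers <- stats$out
outliers
```

With these numbers the only flagged value is 27: everyone else falls within 1.5 box heights of the middle half of the data.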

By the way, since I’m making an effort toward doing more open science (if you want to call this science), you can find all the code for this on my Github repository. And that leads me into the next part of this…

Part Two: Why Do This?

I’m the kind of person who learns best by doing. I do (usually) read the documentation for stuff, but it really doesn’t make a whole lot of sense to me until I actually get in there myself and start tinkering around. I like to experiment when I’m learning code, see what happens if I change this thing or that, so I really learn how and why things work. That’s why, when I needed to learn a few key things, rather than just sitting down and reading a book or the help text, I decided to see if I could make this little death experiment work.

One thing I needed to learn: I’m working with a researcher on a project that involves web scraping, which I had kind of played with a little, but never done in any sort of serious way, so this project seemed like a good way to learn that (and it was). Another motivator: I’m going to be participating in an NCBI hackathon next week, which I’m super excited about, but I really felt like I needed to beef up my coding skills and get more comfortable with Github. Frankly, doing command line stuff still makes me squeamish, so in the course of doing this project, I taught myself how to use RStudio’s Github integration, which actually worked pretty well (I got a lot out of Hadley Wickham’s explanation of it). This death project was fairly inconsequential in and of itself, but since I went to the trouble of learning a lot of stuff to make it work, I feel a lot more prepared to be a contributing member of my hackathon team.

I wrote in my post on the open-ish PhD that I would be more amenable to sharing my code if I didn’t feel as if it were so laughably amateurish. In the past, when I wrote code, I would just do whatever ridiculous thing popped into my head that I thought might work, because, hey, who was going to see it anyway? Ever since I wrote that open-ish PhD post, I’ve really approached how I write code differently, on the assumption that someone will look at it (not that I think anyone is really all that interested in my goofy death analysis, but hey, it’s out there in case someone wants to look).

As I wrote this code, I challenged myself to think not just of a way, any way, to do something, but the best, most efficient, and most elegant way. I learned how to write good functions, for real. I learned how to use the %>% (which is a pipe operator, and it’s very awesome). I challenged myself to avoid using for loops, since those are considered not-so-efficient in R, and I succeeded in this except for one for loop that I couldn’t think of a way to avoid at the time, though I think in retrospect there’s another, more efficient way I could write that part and I’ll probably go back and change it at some point. In the past, I would write code and be elated if it actually worked. With this project, I realized I’ve reached a new level, where I now look at code and think, “okay, that worked, but how can I do it better? Can I do that in one line of code instead of three? Can I make that more efficient?”
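To illustrate the for loop point with a toy case (this is not code from the project, and the data frame and column names are made up), here is the same calculation written as a loop and as a single vectorised line:

```r
# Made-up birth and death years for three people.
people <- data.frame(
  born = c(1930, 1947, 1958),
  died = c(2016, 2016, 2016)
)

# Loop version: fill in one age at a time.
age_loop <- numeric(nrow(people))
for (i in seq_len(nrow(people))) {
  age_loop[i] <- people$died[i] - people$born[i]
}

# Vectorised version: one subtraction over the whole columns.
age_vec <- people$died - people$born

# Both approaches give the same answer.
identical(age_loop, age_vec)
```

The vectorised line is shorter and lets R do the work in one pass, which is the kind of “one line instead of three” rewrite described above.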

So while this little project might have been somewhat silly, in the end I still think it was a good use of my time because I actually learned a lot and am already starting to use a lot of what I learned in my real work. Plus, I learned that thing about Darwin’s tortoise, and that really makes the whole thing worth it, doesn’t it?

When RStudio crashes, it is not subtle about it. You get a picture of an old-timey bomb and the succinct, blunt message “R encountered a fatal error.” A couple hundred of my librarian friends and colleagues got to see it live during the demo I gave as part of a webinar I did for the Medical Library Association on R for librarians earlier today. At first, I thought the problem was minor. When I tried to read in my data, I got this error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'lib_data_example.csv': No such file or directory

It’s a good example of R’s somewhat opaque and not-super-helpful error messages, but I’ve seen it before and it’s not a big deal. It just meant that R couldn’t find the file I’d asked for. Most of the time it’s because you’ve spelled the file name wrong, or you’ve capitalized something that should be lower case. I double checked the file name against the cheat sheet I’d printed out with all my code. Nope, the file name was correct. Another likely cause is that you’re in the wrong directory and you just need to set the working directory to where the file is located. I checked that too – my working directory was indeed set to where my file should have been. That was when RStudio crashed, though I’m still not sure exactly why that happened. I assume RStudio did it just to mess with me. 🙂
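The two checks described above (is the name right, and is the working directory right) can also be done from the console. This is only a sketch: it builds a throwaway folder and file so it can run anywhere, and `demo_dir` and the misspelled file name are made up for illustration.

```r
# Set up a temporary folder containing the example file.
demo_dir <- tempfile("demo_")
dir.create(demo_dir)
write.csv(data.frame(x = 1:3),
          file.path(demo_dir, "lib_data_example.csv"),
          row.names = FALSE)

old_wd <- setwd(demo_dir)          # setwd() returns the previous directory

getwd()                            # where is R looking?
list.files(pattern = "\\.csv$")    # which CSV files are actually there?

# Does the exact name exist? A typo in the name will not.
found_exact <- file.exists("lib_data_example.csv")
found_typo  <- file.exists("lib_data_exmaple.csv")

# Only try to read the file once R confirms it can see it.
if (found_exact) lib_data <- read.csv("lib_data_example.csv")

setwd(old_wd)                      # put the working directory back
```

If `file.exists()` comes back FALSE for a name you are sure is right, the output of `getwd()` and `list.files()` usually shows which of the two usual suspects is to blame.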

I’m sure a lot of presenters would be pretty alarmed at this point, but I was actually quite amused. People on Twitter seemed to notice.

Having your live demo crash is not very entertaining in and of itself, but I found the situation rather amusing because I had considered whether I should do a live demo and decided to go with it because it seemed so low risk. What could go wrong? Sure, live demos are unpredictable. Websites go down, databases change their interface without warning (invariably they do this five minutes before your demo starts), software crashes, and so on. Still, the demo I was doing was really quite simple compared to a lot of the R I normally teach, and it involved using an interface I literally use almost every day. I’ve had plenty of presentations go awry in the past, but this was one that I really thought had almost 0% chance of going wrong. So when it all went wrong on the very first line of code, I couldn’t help but laugh. It’s the live demo curse! You can’t escape!

I’m sure most people who have spent any significant amount of time doing live demos of technology have had the experience of seeing the whole thing blow up. I know a lot of librarians who avoid the whole thing by making slides with screen shots of what they would show and doing sort of a mock demo. There’s nothing wrong with that, and I can understand the inclination to remove the uncertainty of the live demo from the equation. But despite their being fraught with potential issues, I’m still in favor of live demos – and in a sense, I feel this way exactly because of their unpredictability.

For one thing, it’s helpful for learners to see how an experienced user thinks through the process of troubleshooting when something goes wrong. It’s just a fact that stuff doesn’t always work perfectly in real life. If the people I’m teaching are ever actually going to use the tools I’m demonstrating, eventually they’re going to run into some problems. They’re more likely to be able to solve those problems if they’ve had a chance to see someone work through whatever issues arise. This is true for many different types of technologies and information resources, but especially so with programming languages. Learning to troubleshoot is itself an essential skill in programming, and what better way to learn than to see it in action?

Secondly, for brand new users of a technology, watching an instructor give a flawless and apparently effortless demonstration can actually make mastery feel out of reach for them. In reality, a lot of time and effort likely went into developing that demo, trying out lots of different approaches, seeing what works well and what doesn’t, and arriving at the “perfect” final demo. I’m certainly not suggesting that instructors should do freewheeling demos with no prior planning whatsoever, but I am in favor of an approach that acknowledges that things don’t always go right the first time. When I learned R, I would watch tutorials by these incredibly smart and talented instructors and think, oh my gosh, they make this look so easy and I’m totally lost – I’m never going to understand how this works. Obviously I don’t want to look like an unprepared and incompetent fool in front of a class, but hey, things don’t always go perfectly. I’m human, you’re human, we’re all going to make mistakes, but that’s part of learning, so let’s talk about what went wrong and how we fix it.

By the way, in case you’re wondering what did actually go wrong in this instance, I had inadvertently moved the data file in the process of uploading it to my Github repo – I thought I’d made a copy, but I had actually moved the original. I quickly realized what had happened, and I knew roughly where I’d put the file, but it was in some folder buried deep in my file structure that I wouldn’t be able to locate easily on the spot. The quickest solution I could think of, which I quickly did off-screen from the webinar (thank you dual monitors) was to copy the data from the repo, paste it into a new CSV and quickly save it where the original file should have been. It worked fine and the demo went off as planned after that.

More and more lately, I’m asked the question “what do you do?” This is a surprisingly difficult question to answer. Often, how I answer depends on who’s asking – is it someone who really cares or needs to know? – and how much detail I feel like going into at the moment when I’m asked. When I’m asked at conferences, as I was quite a bit at FORCE2016, I try to be as explanatory as possible without getting pedantic, boring, or long-winded. My answer in those scenarios goes something like “I’m a data librarian – I do a lot of instruction on data science, like R and data visualization, and data management.” When I’m asked in more social contexts, I hardly even bother explaining. Depending on my mood and the person who’s asking, I’ll usually say something like data scientist, medical librarian, or, if I really don’t feel like talking about it, just librarian. It’s hard to know how to describe yourself when you have a job title that is pretty obscure: Research Data Informationist. I would venture to guess that 99% of my family, friends, and even work colleagues have little to no idea what I actually spend my days doing.

In some regards, that’s fine. Does it really matter if my mom and dad know what it means that I’ve taught hundreds of scientists R? Not really (they’re still really proud, though!). Do I care if my date has a clear understanding of what a data librarian does? Not really. Do I care if a random person I happen to chat with while I’m watching a hockey game at my local gets the nuances of the informationist profession? Absolutely not.

On the other hand, there are often times that I wish I had a somewhat more scrutable job title. When I’m talking to researchers at my institution, I want them to know what I do because I want them to know when to ask me for help. I want them to know that the library has someone like me who can help with their data science questions, their data management needs, and so on. I know it’s not natural to think “library” when the question is “how do I get help with finding data” or “I need to learn R and don’t know where to start” or “I’d like to create a data visualization but I have no idea how to do it” or any of the other myriad data-related issues I or my colleagues could address.

The “informationist” term is one that has a clear definition and a history within the realm of medical librarianship, but I feel like it has almost no meaning outside of our own field. I can’t even count the number of weird variations I’ve heard on that title – informaticist, informationalist, informatist, and many more. It would be nice to get to the point that researchers understood what an informationist is and how we can help them in their work, but I just don’t see that happening in the near future.

So what do we do to make our contributions and expertise and status as potential collaborators known? What term can we call ourselves to make our role clear? Librarian doesn’t really do it, because I think people have a very stereotypical and not at all correct view of what librarians do, and it doesn’t capture the data informationist role at all. Informationist doesn’t do it, because no one has any clue what that means. I’ve toyed with calling myself a data scientist, and though I do think that label fits, I have some reservations about using that title, probably mostly driven by a terrible case of imposter syndrome.

What’s in a name? A lot, I think. How can data librarians, informationists, library-based data scientists, whatever you want to call us, communicate our role, our expertise, our services, to our user communities? Is there a better term for people who are doing this type of work?

I’m attending FORCE2016, which is my first FORCE11 conference after following this movement (or group?) for a while, and I have to say, this is one interesting, thought-provoking conference. I haven’t been blogging in a while, but I felt inspired to get a few thoughts down after the first day of FORCE2016:

I love the interdisciplinarity of this conference, and to me, that’s what makes it a great conference to attend. In our “swag bag,” we were all given a “passport” and could earn extra tickets for getting signatures of attendees from different disciplines and geographic locations. While free drinks are of course a great incentive, I think the fact that we have so many diverse attendees at this conference is a draw on its own. I love that we are getting researchers, funders, publishers, librarians, and so many other stakeholders at the table, and I can’t think of another conference where I’ve seen this many different types of people from this many countries getting involved in the conversation.

I actually really love that there are so few concurrent sessions. Obviously, fewer concurrent sessions means fewer voices joining the official conversation, but I think this is a small enough conference that there are ways to be involved, active, and vocal without necessarily being an invited speaker. While I love big conferences like MLA, I always feel pulled in a million different directions – sometimes literally, like last year when I was scheduled to present papers at two different sessions during the same time period. I feel more engaged at a conference when I’m seeing mostly the same content as others. We’re all on the same page and we can have better conversations. I also feel more engaged in the Twitter stream. I’m not trying to follow five, ten, or more tweet streams at once from multiple sessions. Instead, I’m seeing lots of different perspectives and ideas and feedback on one single session. I like us all being on the same page.

Now, those are some positives, but I do have to bring it down with one negative from this conference, and that is that I think it’s hard to constructively talk about how to encourage sharing and open science when you have a whole conference full of open science advocates. I do not in any way want to disparage anyone because I have a lot of respect for many of the participants in the session I’m talking about, but I was a little disappointed in the final session today on data management. I loved the idea of an interactive session (plus I heard there would be balloons and chocolate, so, yeah!) and also the idea of debate on topics in data sharing and management, since that’s my jam. I did debate in high school, so I can recognize the difficulty but also the usefulness of having to argue for a position with which you strongly disagree. There’s real value in spending some time thinking about why people hold positions that are in opposition to your strongly held position. And yeah, this was the last session of a long day, and it was fun, and it had popping of balloons, and apparently some chocolate, and whatnot, but I am a little disappointed at what I see as a real missed opportunity to spend some time really discussing how we can address some of the arguments against data sharing and data management. Sure, we all laughed at the straw men that were being thrown out there by the teams who were being called upon to argue in favor of something that they (and all of us, as open science advocates) strongly disagreed with. But I think we really lost an opportunity to spend some time giving serious thought to some of the real issues that researchers who are not open science advocates actually raise. Someone in that session mentioned the open data excuses bingo page (you can find it here if you haven’t seen it before). Again, funny, but SERIOUSLY I have actually had real researchers say ALL of these things, except for the thing about terrorists.
I will reiterate that I know and respect a lot of people involved with that session and I’m not trying to disparage them in any way, but I do hope we can give some real thought to some of the issues that were brought up in jest today. Some of these excuses, or complaints, or whatever, are actual, strongly-held beliefs of many, many researchers. The burden is on us, as open science advocates, to demonstrate why data sharing, data management, and the like are tenable positions and in fact the “correct” choice.

Okay, off my soap box! I’m really enjoying this conference, having a great time reconnecting with people I’ve not seen in years, and making new connections. And Portland! What a great city. 🙂

I recently read an article in The Atlantic about people who are compulsive declutterers – the opposite of hoarders – who feel compelled to get rid of all their possessions. I’m more on the side of hoarding, because I always find myself thinking of eventualities in which I might need the item in question. Indeed, it has often been the case that I will think of something I got rid of weeks or even years later and wish I still had it: a book I would have liked to reference, a piece of clothing I would have liked to wear, a receipt I could have used to take something back. Of course, I don’t have unlimited storage space, so I can’t keep all this stuff. The question of what to keep and for how long is one that librarians think about when it comes to weeding: deciding which parts of the collection to deaccession, or basically, get rid of. There are evidence-based, tried-and-true ways of thinking about weeding a library collection, but that’s not so much true when it comes to data. How is a scientist to decide what to keep and what not to keep?

I know this is a question that researchers are thinking about quite a bit, because I get more emails about this than almost any other issue. In fact, I get emails not only from users of my own library, but from researchers all over the country who have somehow found my name. What exactly do I need to keep? If I have electronic records, do I need to keep a print copy as well? How many years do I need to keep this stuff? These are all very reasonable questions, and it would be nice to be able to say, yes, there is an answer and it is….! But it’s almost never so easy to point to a single answer.

A case in point: a couple years ago, I decided to teach a class about data preservation and retention. In my naivete, I thought it would be nice to take a look through all the relevant policy and find the specific number of years that research data is required to be retained. I read handbooks and guides. I read policy documents from various agencies. I even read the U.S. Code (I do not recommend it). At the end of it, I found that not only is there not a single, definitive, policy answer to how long funded research data should be retained, but there are in fact all sorts of contradictory suggestions. I found documents giving times from 3 years to 7 years to the super-helpful “as long as necessary.”

This may be difficult to answer from a policy perspective, but I think answering this from a best practices perspective is even trickier. Let’s agree that we just can’t keep everything – storing data isn’t free, and it takes considerable time and effort to ensure that data remain accessible and usable. Assuming that some stuff has to get thrown away, how do we distinguish trash from treasure, especially given the old adage about how the former might be the latter to others? It’s hard to know whether something that appears useless now might actually be useful and interesting to someone in the future. To take this to the extreme, here’s an actual example from a researcher I’ve worked with: he asked how he could have his program automatically discard everything in the thousandth place from his measurements. In other words, he wanted 4.254 to be saved as 4.25. I told him I could show him how, but I asked why he wanted to do this. He told me that his machine was capable of measuring to the thousandth, but the measurement was only scientifically relevant to the hundredth place. To scientists right now, 4.254 and 4.252 were essentially indistinguishable, so why bother with the extra noise of the thousandth place? Fair point, but what about 5 years from now, or 10 years from now? If science evolves to the point that this extra level of precision is meaningful, tomorrow’s researchers will probably be a little annoyed that today’s researchers had that measurement and just threw it away. But then again, how can we know now when, or even if, that level of precision will be wanted? For that matter, we can’t even say for sure whether this dataset will be useful at all. Maybe a new and better method for making this measurement will be developed tomorrow, and all this stuff we gathered today will be irrelevant. But how can we know?
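To make the trade-off in the rounding example concrete, here is a small sketch in R. The measurement values are hypothetical (only 4.254 and 4.252 come from the story), and this is just one way to do it, not what the researcher's program actually did.

```r
# Hypothetical measurements recorded to the thousandth place
raw <- c(4.254, 4.252, 3.918)

# Option 1: round the stored values; the thousandth place is gone for good
rounded <- round(raw, 2)

# Option 2: keep the raw values and round only what gets reported
measurements <- data.frame(raw = raw, reported = round(raw, 2))
format(measurements$reported, nsmall = 2)
```

With option 1, the first two measurements become indistinguishable. With option 2, the reported column looks exactly the same, but the original precision is still sitting in the raw column if tomorrow's researchers turn out to want it, at the cost of storing one more column.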

These are all questions that I think are not easy to answer right now, but that people within research communities should be thinking about. For one thing, I don’t think we can give one simple answer to how long data should be retained. For one type of research, a few years may be enough. For other fields, where it’s harder to replicate data, maybe we need to keep it in perpetuity. When it comes to deciding what should be retained and what should be discarded, I think that answers cannot be dictated by one-size-fits-all policies and that subject matter experts and information professionals should work together to figure out good answers for specific communities and specific data. Eventually, I suppose we’ll probably have some of those well-defined best practices for data retention in the same way that we have those best practices from collection management in libraries. Until then, keep your crystal balls handy. 🙂

In medical education, you’ll often hear the phrase “see one, do one, teach one.” I know this not because I’m a medical librarian, but because I watched ER religiously when I was in high school. 🙂 To put it simply, to learn to do a medical procedure, you first watch a seasoned clinician doing the procedure, then you do it yourself with guidance and feedback, and then you teach someone else how to do it. While I’m not learning how to do medical procedures, I think this same idea applies to learning anything, really, and it’s actually how I’ve learned to do a lot of the cool things I’ve picked up in the last couple of years in my work at my current library.

Being sort of a Data Services department of one, I tend to put a lot of emphasis on instruction. There are many thousands of researchers at my institution, but only one of me. I can’t possibly help all of them one on one, so doing a hybrid in-person/webinar session that can reach lots and lots of people is a good use of my time. I would have to go back to look at my statistics, but I don’t think I’d be too far off base if I said I’ve taught 200 people how to use R in the last year, which I think is a pretty effective use of my time! Even better for me, teaching R has enabled me to learn way more than I would have on my own. This time a year ago, I don’t think I could do much of anything with R, but with every class I teach, I learn more and more, and thus become even more prepared to teach it.

When I came to my library two years ago, I had some ideas about what I thought people should know about data management, but I figured I should collect some data about it (I mean, obviously, right?). We did a survey. I got my data and analyzed them to see what topics people were most interested in. I put on classes on things like metadata, preservation, and data sharing, but the attendance wasn’t what I thought it would be based on the numbers from my survey. Clearly something about my approach wasn’t reaching my researchers. That’s when I decided to focus less on what I thought people should know and look at the problems they were really having. Around the same time, I was starting to learn more about data science, and specifically R, and I realized that R could really solve a lot of the problems that people had. Plus, people were interested in learning it. Lots more people would show up for a class on R than they would for a class on metadata (sad, but true).

The only problem was, I didn’t think I knew R well enough to teach it. What if really experienced people showed up and started calling me out on my inexperience, or asking questions I didn’t know the answer to? I was really nervous about teaching an R class the first time, but I decided that I could make it manageable by biting off a little chunk. I scheduled a class on making heatmaps in R, which was something I knew a lot of people wanted to learn. Mind you, when I scheduled this class, I did not myself know how to make a heatmap in R. But I put it on the instruction calendar, it went up on the website, and soon enough, I had not only a full class, but a waitlist.

Fortunately, there are many, many resources available for learning how to do things in R. Lots of them are free. That solved the “see one” problem. Next, to “do one.” I spend a long, long time putting together the hands-on exercises I create for my classes. I try out lots of different things. I mess around with the code and see what happens if I try things in different ways. I try to anticipate what questions people might ask and experiment with my code so I have an answer. Like, “what happens if you don’t put those spaces between everything in your code?” (answer, at least in R: nothing, it works fine with or without the spaces; I just like them in there because I can read it more easily).

My first few classes went well. Sometimes people asked questions I didn’t know the answers to. Even worse, sometimes I gave incorrect answers because I felt like I should say something even if I wasn’t really sure. In one of the first classes I taught, someone asked whether = was equivalent to <- (the assignment operator) in R. I’d seen <- used most often, but I thought I’d seen = used sometimes too, so I said something like, “uhhh, I don’t know, I mean, yeah, I think they’re the same, like, yeah, sure?” A woman in the back row got really annoyed at that. “They’re not the same at all,” she said, and I could feel myself turning bright red. “That’s factually incorrect,” she added. Shortly after that she got up and left in the middle of the class. I was mortified, but the class still got good evaluations, so I figured it hadn’t been all bad.
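For the record, here is a short sketch of one place where the two operators really do behave differently. At the top level, a <- 5 and a = 5 both assign, but inside a function call = names an argument while <- performs an assignment. The function name assignment_demo is made up for this illustration.

```r
# Hypothetical demo function, so the example doesn't clutter the workspace
assignment_demo <- function() {
  mean(x = 1:10)    # `=` here just names the argument x; no variable is created
  made_x <- exists("x", inherits = FALSE)

  mean(y <- 1:10)   # `<-` here creates y, then passes it to mean()
  made_y <- exists("y", inherits = FALSE)

  c(x = made_x, y = made_y)
}

assignment_demo()
```

In this sketch, only y should exist after the two calls, which is the kind of thing that is much easier to show by running the code than to answer off the cuff.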

These days, I schedule my classes based on two things: is it something I think my researchers want to learn, and is it something I want to learn. That first part is relatively easy to figure out – I just talk to people, a lot, and I implore them to give me feedback about what classes they want on my class evaluations. On the whole, they do, and this is how I end up with probably 90% of the classes I offer. Sometimes this leads to much trepidation on my part, as people ask for things that I worry I’m not going to be able to teach. For example, people had been asking for a class on statistical analysis in R. I’ve taken a few different statistics classes, but stats were still something that filled me with terror. When I submit my own articles for publication, I’m overcome with fear that I’ve made some horrible mistake in my statistical analyses and that peer reviewers are going to rip my article apart. Or worse, the peer reviewers will miss it, it’ll be published, and readers will rip me apart. The thought of actually teaching a class on how to do this seemed like a ridiculous idea, yet it was what so many people wanted.

So I went ahead and scheduled the class. A lot of people signed up. I got some very thick textbooks on statistics and statistical analysis in R and I spent many hours learning about all of this. I got some data, saw what sorts of examples would make sense to demonstrate. I painstakingly wrote out my code in R markdown, with lots of comments, so that everything would be well-explained. And then, the morning arrived when I was to give the class for the first time. Probably it was for the best that it was a webinar. I was teleworking, so I gave the webinar from my home office, wearing sweatpants and my favorite UCLA t-shirt, with some lovely roses my boyfriend had brought me on my desk and my trusty dog looking in through the French doors. I went through my examples, talking about linear regression, and tests of independence, and all sorts of other things, that, until I’d started to teach the webinar, I’d been very doubtful I had a good handle on. But suddenly, I realized I kind of actually knew what I was talking about! People typed their questions in the chat window and I knew the answers! When the two hours were up and I signed off, I felt good about it, and over the next few days, I got lots of emails from people thanking me for the great class, which was great, since my main goal had just been to not say anything too stupid. 🙂

Now, I don’t feel so nervous about offering some of these advanced classes. It’s kind of exciting to have the opportunity to stretch myself to learn things that I think are interesting. Plus, nothing will give you more incentive to learn something you’ve wanted to explore than committing yourself to teach a class on it! I’ve learned so much about so many cool things because people have said, hey, can you teach me this, and I say, sure! then scramble off to my office and check the indices of all my R books to see where I can learn how to do whatever that thing is.

The point of all this is to say that, for me at least, the “teach one” part of the old mantra is perhaps something librarians should jump on when it comes to expanding library roles in data management and data science. I’m very fortunate that I get to spend most of my time working on data and nothing else, so I recognize that not everyone can take a week to immerse themselves in statistics, but I do think that librarians can and should stretch themselves to learn new things that will benefit our patrons.

My other piece of advice, which is surely nothing new: when someone asks a question, don’t be afraid to say I don’t know. I learned quickly from that whole “= is not the same as <-” business. Now when someone asks a question and I don’t know the answer, I do one of two things. If I can, I try it out in the code right then and there. So if someone says something like, can you rearrange the order of those two things in your code? I’ll say, huh, I never thought about that – let’s find out, and then do just that. Other times, the question is something complicated, like, how do I do this random thing? In those cases, I’ll say, that’s a great question, and I don’t actually know the answer, but if you’ll send me an email after this so I have your contact info, I will find out and follow up with you. I’ve said that at least once in every class I’ve taught in the last 6 months, and the number of times someone has actually followed up with me: none. I think this is probably due to one of two reasons. One, I really emphasize troubleshooting and how to find out how to learn to do things in R when I teach, so it’s very possible that the person goes off and finds the answer themselves, which is great. Two, I think there are times when people pose an idle question because they’re just kind of curious, or they want to look smart in front of their peers, and they don’t follow up because the answer doesn’t really matter that much to them anyway.

So there you go! That’s my philosophy of getting to learn how to do cool stuff with data in order to benefit my researchers. 🙂

About

Librarian in the City is the personal blog of Lisa Federer. The thoughts expressed here are my own and do not reflect the opinion of my employer. Likewise, comments are the views of readers who submit them, and do not necessarily reflect my own opinions.