Category Archives: On science


Recently some colleagues and I published a paper in PLOS in which we analyzed about 47,000 Data Availability Statements as a way of exploring the state of data sharing in a journal with a pretty strong data availability policy. The paper has gotten a good response from what I’ve seen on Twitter, and I’m really happy with how it turned out, thanks in part to some great feedback from the reviewers. But I also wanted to share a few more things about how this paper came about – the things that don’t make it into the final scholarly article. A behind-the-scenes look, if you will.

The idea for this paper arose out of a somewhat eye-opening experience. I needed to get a hold of a good dataset – I forget why exactly, but I think it was when I was first starting to teach R and wanted some real data that I could use in the classes for the hands-on exercises. Remembering that PLOS had this data availability policy, I thought to myself, ah, no problem, I will find an article that looks relevant to the researchers I’m teaching, download the data, and use it in my demo (with proper attribution and credit, of course). So I found an article that looked good and scrolled down to the Data Availability Statement. Data available upon request. Huh. I thought you weren’t allowed to say that, but okay, I guess this one slipped through the policy. Found another one – data is within the paper, it said, except the only data in the paper were summary tables, which were of no use to me (nor would they be of use to anyone hoping to verify the study or reanalyze the data, for example).

What a weird fluke, I thought, that the first two papers I happened to look at didn’t really follow the policy. So I checked a third, and a fourth. Pretty soon I’d spent a half hour combing through recent PLOS articles and I had yet to find one with a publicly available dataset that I could easily download from a repository. I ended up looking elsewhere for data (did you know that baseball fans keep surprisingly in-depth data on a gazillion data points?) but I was left wondering what the real impact of this policy was, which was why I decided to do this study.

I’ll let you read the paper to find out what exactly it is that we found, but there’s one other behind-the-scenes anecdote about this paper that I’ll share, which I hope will be encouraging. Obviously if you’re going to write critically about data availability, you’re going to look a little hypocritical if you don’t share your own data. I fully intended to share our data and planned to do so using Figshare, which is how I’d shared a dataset associated with an earlier paper I’d published in PLOS. When I shared the data from the first article, I set it to be public immediately, though I didn’t expect anyone to want to see it before the paper was out. Unexpectedly, and unbeknownst to me, someone at Figshare apparently thought this was an interesting dataset and decided to tweet it out the same day I submitted the paper to PLOS, obviously well before it was ever published, much less accepted.

While the interest in the dataset was encouraging, I was also concerned about the fact that it was out before the paper was accepted. I figured I was flattering myself to think that someone would want to scoop me, but then, I got an email from someone I didn’t know, who told me that she had found my dataset and that she would like to write an article describing my results, and would I mind sharing my literature review/citations with her to save her the trouble? In other words, “hi, I would like to write basically the paper that you’re trying to get accepted using all of the work you did.” I want to be clear that I am all for data sharing, but this situation bothered me. Was I about to get scooped?

Obviously our paper came out, no one beat us to it, and as far as I know, no one has ever written another paper using that dataset, but I was thinking about it when I was uploading the data for this most recent paper. This dataset was way more interesting and broadly applicable than the first one, so what if someone did get a hold of it before our paper came out? So what I decided to do was to upload it to Figshare, have it generate a DOI, but keep the dataset listed as private rather than publicly release it. Our data availability statement included the DOI and was therefore on the surface in compliance, but I had a feeling that, if you went to the DOI, it would tell you that the dataset was private or wasn’t found. Obviously I could have checked this before I submitted, but to be totally honest, I just left it as it was because I was genuinely curious whether any of the reviewers would try to check it themselves and say something.

To their credit, all three of the reviewers (who by the way, were incredibly helpful and gave the most useful feedback I’ve ever gotten on peer review, which I think significantly improved the paper) did indeed point out that the DOI didn’t work. In our revisions, our Data Availability Statement included a working link to not only the data, but also the code, on OSF. I invite anyone who is interested to reuse it and hope someone will find it useful. (Please don’t judge me on the quality of my code, though – I wrote it a long time ago when I was first learning R and I would do it way better now.)

My mom got the whole family 23andme kits for Christmas this year, and I’ve been looking forward to getting the results mostly so I could play with the raw data in R. The results finally came in, so I went back to a blog post I’d read about analyzing 23andme data using a Bioconductor package called gwascat. It’s a really good blog post, but as it happens, the package has been updated since it was written, so the code, sadly, didn’t work. Since I of course have VAST knowledge of bioinformatics (by which I mean I’ve hung around a lot of bioinformaticians and talked to them and kind of know a few things but not really) and am super awesome at R (by which I mean I’m like moderately okay at it), I decided to try my hand at coming up with my own analysis, based on the original blog post.

Let me be incredibly clear – I have only a vague notion of what I’m doing here, so you should not take any of what I say to be, you know, like, necessarily fact. In fact, I would love for a real bioinformatician to read this and point out my errors so I can fix them. That said, here is what I did, and what you can do too!

To walk you through it first in plain English – what you get in your raw data from 23andme is a list of RSIDs, which are accession numbers for SNPs, or single nucleotide polymorphisms. At a given position in your genetic sequence, for example, you may have an A, which means you’ll have brown hair, as opposed to a G, which means you’ll have blonde hair. Of course, it’s a lot more complicated than that, but the basic idea is that you can link traits to SNPs.

So the task that needs to be done here is two-fold. First, I need to get a list of SNPs with their strongest risk allele – in other words, what SNP location am I looking for, and which nucleotide is the one that’s associated with higher risk. Then, I need to match this up with my own list of SNPs and find the ones where both of my nucleotides are the risk allele. Here’s how I did it!
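Step zero is just reading the raw 23andme file into R. Roughly, that looks something like the chunk below – a minimal sketch, where the file name is a placeholder for whatever your download happens to be called, and I’m assuming the usual 23andme export format of a tab-delimited text file with comment lines starting with # and columns for rsid, chromosome, position, and genotype.

# Read the raw 23andme export, skipping the "#" comment lines at the top
my_genome <- read.table("genome_export.txt", sep = "\t", comment.char = "#",
                        col.names = c("rsid", "chromosome", "position", "genotype"),
                        colClasses = c("character", "character", "numeric", "character"))

# Quick sanity check that it read in correctly
head(my_genome)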

Next I need to pull in the data about the SNPs from gwascat. I can use this to match up the RSIDs in my data with the ones in their data. I’m also going to drop some other columns I’m not interested in at the moment.
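With the caveat that the exact column names depend on the version of the catalog you pull down (so check names() on what you actually get back – the ones below are my best guess, not gospel), the merge looks roughly like this:

# Pull a current snapshot of the EBI GWAS catalog (needs an internet connection)
library(gwascat)
gwas <- makeCurrentGwascat()
gwas_df <- as.data.frame(gwas)

# Keep just the columns I care about for now (names assumed -- adjust to match yours)
gwas_df <- gwas_df[, c("SNPS", "DISEASE.TRAIT", "STRONGEST.SNP.RISK.ALLELE")]

# Match the catalog's RSIDs up with the RSIDs in my own data
trait_list <- merge(my_genome, gwas_df, by.x = "rsid", by.y = "SNPS")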

Now I want to find out where I have the risk allele. This is where this analysis gets potentially stupid. The risk allele is stored in the gwascat dataset with its RSID plus the allele, such as rs1423096-G. My understanding is that if you have two Gs at that position (remembering that you get one copy from each of your parents), then you’re at higher risk. So I want to create a new column in my merged dataset that has the risk allele printed twice, so that I can just compare it to the column with my data, and only have it show me the ones where I have two copies of the risk allele (since I don’t want to dig through all 10,000+ matched SNPs to find the ones of interest).
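Assuming the same column names as above, that looks roughly like this:

# Pull out just the allele letter after the dash, e.g. "rs1423096-G" becomes "G"
risk_allele <- sub(".*-", "", trait_list$STRONGEST.SNP.RISK.ALLELE)

# Paste it twice so it looks like a genotype ("GG") and can be compared
# directly against my own genotype column
trait_list$badsnp <- paste0(risk_allele, risk_allele)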

Okay, almost there! Now let’s remove all the stuff that’s not interesting and just keep the ones where I have two copies of the risk allele. I could also have it remove ones that don’t match my ancestry (European) or gender, but I’m not going to bother with it, since keeping the ones where I have a match gives me a reasonable amount to scroll through.

# Keep only the rows where both of my alleles match the risk allele
my_risks <- trait_list[trait_list$genotype == trait_list$badsnp,]

And there you have it! I don’t know how meaningful this analysis really is, but according to this, some traits that I have are higher educational attainment (true), persistent temperament (sure, I suppose so), migraines (true), and I care about the environment (true). Also I could be at higher risk of having a psychotic episode if I’m on methamphetamine, so I’ll probably skip trying that (which was actually my plan anyway). Anyway, it’s kind of entertaining to look at, and I’m finding SNPedia is useful for learning more.

So, now, bring on the bioinformaticians telling me how incorrect all of this is. I will eagerly make corrections to anything I am wrong about!

Somehow, shockingly, I’ve arrived at the point where I’m mere months away from finishing my coursework for my doctoral program (okay, 50 days, but who’s counting?), which means that next semester, I get down to the business of starting my dissertation. One of the interesting things about being in a highly interdisciplinary program like mine is that your dissertation research can be a lot of things. It can be qualitative, it can be quantitative. It can be rigorously scientific and data-driven or it can be squishy and social science-y (perhaps I’m betraying some of my biases here in these descriptions).

As if it weren’t enough that I had so many options available to me, this semester I’m taking two classes that couldn’t be more different in terms of methodology. One is a data collection class from the Survey Methodology department. We complete homework assignments in which we calculate response and cooperation rates for surveys, determine the disposition for 20 different categories of response/non-response/deferral, and decide which response and cooperation rate formula is most appropriate for the sample. My other class is a qualitative methods class in the communications department. On the first day of that class, I uncomfortably took down the notes “qual methods: implies multiple truths, not one TRUTH – people have different meaning.”

I count myself lucky to be in a discipline in which I have so many methodological tools in my belt, rather than having to rely on one method to answer all my questions. But then again, how do I choose which tool to pull out of the belt when faced with a problem, like having to write a dissertation?

I came into my doctoral program with a pretty clear idea of the problem I wanted to address – assessing the value of shared data and somehow quantifying reuse. I envisioned my solution involving some sort of machine learning algorithm that would try to predict usefulness of datasets (because HOW COOL WOULD THAT BE?). Then, halfway through the program, my awesome advisor moved to a new university, and I moved to a new advisor who was equally awesome but seemed to have much more of a qualitative approach. I got very excited about these methods, which were really new to me, and started applying them to a new problem that was also very close to my heart – scientific hackathons, which I’ve been closely involved with for several years. Studying hackathons this way would necessitate an almost entirely qualitative approach – I’d be doing ethnographic observation, in-depth interviews, and so on.

So now, here I find myself 50 days away from the big choice. What’s my dissertation topic? The thing I like to keep in mind is that this doesn’t necessarily mean ALL that much in the long run. This isn’t the sum of my life’s work. It’s one of many large research projects I’ll undertake. Still, I want it to be something that’s meaningful and worthwhile and personally rewarding. And perhaps most importantly of all, I want to use a methodology that makes me feel comfortable. Do I want to talk to people about their truth? I’ve learned some unexpected things using those methodologies and I’m glad I’ve learned something about how to do that kind of research, but in the end, I don’t think I want to be a qual researcher. I want numbers, data, hard facts.

I guess I really knew this was what I would end up deciding in the second or third week of my qual methods class. The professor asked a question about how one might interpret some type of qualitative data, and I answered with a response along the lines of “well, you could verify the responses by cross-checking against existing, verified datasets of a similar population.” She gave me a very odd look, and paused, seemingly uncertain how to respond to this strange alien in her class, and then responded, “You ARE very quantitative, aren’t you?”

This week I’ve been reading the first half of Bruno Latour and Steve Woolgar’s book Laboratory Life: The Construction of Scientific Facts. Like many of the other pieces I’ve been reading lately, this book argues for a social constructivist theory of scientific knowledge, which is a perspective I’m really starting to identify with. What I’m finding most interesting about this book is the ethnographic approach that was taken to observe the creation of scientific knowledge. Basically, Bruno Latour spent two years observing in a biology lab at the Salk Institute. Chapter 1 begins with a snippet of a transcript covering about 5 minutes of activity in a lab – all the little seemingly insignificant bits of conversation and activity that, taken together, would allow an outside observer to understand how scientific knowledge is socially constructed.

The authors emphasize that real sociological understanding of science can only come from an outside observer, someone who is not themselves too caught up in the science – someone who can’t see the forest for the trees, as it were. They even suggest that it’s important to “make the activities of the laboratory seem as strange as possible in order not to take too much for granted” (30). Why should we need someone to spend two years in a lab watching research happen when the researchers are going to be writing up their methods and results in an article anyway, you may ask? The authors argue that “printed scientific communications systematically misrepresent the activity that gives rise to published reports” and even “systematically conceal the nature of the activity” (28). In my experience, I would agree that this is true – a great example of it is #overlyhonestmethods, my absolute favorite Twitter hashtag of all time, in which scientists reveal the dirty secrets that don’t make it into the Nature article.

I’ve been thinking that an ethnographic approach might be an effective way to approach my research, and I’m thinking it makes even more sense after what I’ve read of this book so far. However, this research was done in the 1970s, when research was a lot different. Of course there are still clinical and bench researchers who are doing actual physical things that a person can observe, but a lot of research, especially the research I’m interested in, is more about digital data that’s already collected. If I wanted to observe someone doing the kind of research I’m interested in, it would likely involve me sitting there and staring at them just doing stuff on a computer for 8 hours a day. So I’m not sure if a traditional ethnographic approach is really workable for what I want to do. Plus, I don’t think I’d get anyone to agree to let me observe them. I know I certainly wouldn’t let someone just sit there and watch me work on my computer for a whole day, let alone two years (mostly because I’d be embarrassed for anyone else to know how much time I spend looking at pictures of dogs wearing top hats and videos of baby sloths). Even if I could get someone to agree to that, I do wonder about the problem of observer effect – that the act of someone observing the phenomenon will substantively change that phenomenon (like how I probably wouldn’t take a break from writing this post to watch this video of a porcupine adorably nomming pumpkins if someone was observing me).

This thought takes me back to something I’ve been thinking about a lot lately, which is figuring out methods of indirect observation of researchers’ data reuse practices. I’m very interested in exploring these sorts of methods because I feel like I’ll get better and more accurate results that way. I don’t particularly like survey research for a lot of reasons: it’s hard to get people to fill out your survey, sometimes they answer in ways that don’t really give you the information you need, and you’re sort of limited in what kind of information you can get from them. I like interviews and focus groups even less, for many of the same reasons. Participant observation and ethnographic approaches have the problems I’ve discussed above. So what I think I’m really interested in doing is exploring the “artifacts” of scientific research – the data, the articles, the repositories, the funny Twitter hashtags. This idea sort of builds upon the concept I discussed in my blog last week – how systems can be studied to tell us something about their intended users. I think this approach could yield some really interesting insights, and I’m curious to see what kind of “artifacts” I’ll be able to locate and use.

I don’t know if this terminology is common outside of library circles, but it seems like the “flipped classroom” has been all the rage in library instruction lately. The idea is that learners do some work before coming to the session (like read something or watch a video lecture), and then the in-person time is spent on doing more activities, group exercises, etc. As someone who is always keen to try something new and exciting, I decided to see what would happen if I tried out the flipped classroom model for my R classes.

Actually, teaching R this way makes a lot of sense. Especially if you don’t have any experience, there’s a lot of baseline knowledge you need before you can really do anything interesting. You’ve got to learn a lot of terminology, how the syntax of R works, boring things like what a data frame is and why it matters. That could easily be covered before class to save the in person time for the more hands-on aspects. I’ve also noticed a lot of variability in terms of how much people know coming into classes. Some people are pretty tech savvy when they arrive, maybe even have some experience with another programming language. Other people have difficulty understanding how to open a file. It’s hard to figure out how to pace a class when you’ve got people from all over that spectrum of expertise. On the other hand, curriculum planning would be much easier if you could know that everyone is starting out with a certain set of knowledge and build off of it.

The other reason I wanted to try this is just the time factor. I’m busy, really busy. My library’s training room is also hard to book because we offer so many classes. The people I teach are busy. I teach my basic introduction to R course as a 3-hour session, and though I’d really rather make it 4 hours, even finding a 3-hour window when I and the room are both available and people are likely to be able to attend is difficult. Plus, it would be nice if there was some way to deliver this instruction that wasn’t so time-intensive for me. I love teaching R – it’s probably my favorite thing I do in my job and I’d estimate I’ve taught close to 500 researchers how to code. I generally spend around 9 hours a month teaching R, plus another 4-6 hours doing prep, administrative stuff, and all the other things that have to get done to make a class function. That’s a lot of time, and though I don’t at all mind doing it, I’d definitely be interested in any sort of way I could streamline that work without having a negative impact on the experience of learning R from me.

For all these reasons, I decided to experiment with trying out a flipped classroom model for my introduction to R class. I had grand plans of making a series of short video tutorials that covered bite-sized pieces of learning R. There would be a bunch of them, but they’d be about 5 minutes each. I arranged for the library to get Adobe Captivate, which is very cool video tutorial software, and these tutorials are going to be so awesome when I get around to making them. However, I had already scheduled the class for today, February 28, and I hadn’t gotten around to making them yet. Fortunately, I had a recording of a previous Intro to R class I’d taught, so I chopped the relevant parts of that up into smaller pieces and made a YouTube playlist that served as my pre-class work for this session, probably about two and a half hours total.

I had 42 people either signed up or on the waitlist at the end of last week. I think I made the class description pretty clear – that this session was only an hour, but you did have to do stuff before you got there. I sent out an email with the link to the videos, reminding people that they would be lost in class if they didn’t watch this stuff. Even so, yesterday morning, the last of the videos had only 8 views, and I knew at least two of those were from me checking the video to make sure it worked. So I sent out another email, once again imploring them to watch the videos before they came to class and to please cancel their registration and sign up for a regular R class if this video thing wasn’t for them.

By the time I taught the class this afternoon, 20 people had canceled their registration. Of the remaining 22, 5 showed up. Of the 5 that showed up, it quickly became apparent to me that none of them had watched the videos. I knew no one was going to answer honestly if I asked who had watched them, so I started by telling them to read in the CSV file to a data frame. This request is pretty fundamental, and also pretty much the first thing I covered in the videos, so when I was met with a lot of blank stares, I knew this experiment had pretty much failed. I did my best to cover what I could in an hour, but that’s not much, so instead of this being a cool, interactive class where people ended up feeling empowered and ready to go write code, I got the feeling those people left feeling bewildered and like they wasted an hour. One guy who had come in 10 minutes late came up to me after class and was like, “so this is a programming language? What can you do with it?” And I kind of looked at him like….whaaaat? It turned out he hadn’t even registered for the class to begin with, much less done any of the pre-class work – he had been in the library and saw me teaching and apparently thought it looked interesting so he decided to wander in.

I felt disappointed by this failed experiment, but I’m not one to give up at the first sign of failure, so I’ve been thinking about how I could make this system work. It could just be that this model is not suited to people in the setting where I teach. I am similar to them – a busy, working professional who knows this is useful and I should learn it, but it’s hard to find the time – and I think about what it would take for me to do the pre-class work. If I had the time and the videos were decent enough quality, I think I’d do it, but honestly chances are 50-50 that I’d be able to find the time. So maybe this model just isn’t made for my community.

Before I give up on this experiment entirely, though, I’d love to hear from anyone who has tried this kind of approach for adult learners. Did it work, did it not? What went well and what didn’t? And of course, being the data queen that I am, I intend to collect some data. I’m working on a modified class evaluation for those 5 brave souls who did come, to get some feedback on the pre-class work model, and I’m also planning on sending a survey out to the other 38 people who didn’t come to see what I can find out from them. Data to the rescue of the flipped class!

This week I’ve been reading the second half of Sergio Sismondo’s An Introduction to Science and Technology Studies and I have been finding myself interested in the question of the universality of scientific knowledge and data. A single sentence that I think captures the scope of the problem I’m finding interesting: “scientific and engineering research is textured at the local level, that it is shaped by professional cultures and interplays of interests, and that its claims and products result from thoroughly social processes” (168). That is to say, the output of a scientific experiment is not some sort of universal truth – rather, data are the record of a manipulation of nature at a given time in a given place by a given person, highly contextualized and far from universally applicable.

I was in my kitchen the other day, baking a mushroom pot pie, after reading Chapter 10, specifically the section on “Tinkering, Skills, and Tacit Knowledge.” That section describes the difficulties researchers were having in recreating a certain type of laser, even when they had written documentation from the original creators, even when they had sufficient technical expertise to do so, even when they had all the proper tools – in fact, even when they themselves had already built one, they found it difficult to build a second laser. As I was pulling my pie out of the oven, I was thinking about the tacit knowledge involved in baking – how I know what exactly is meant when the instructions say I should bake till the crust is “golden brown,” how I make the decision to use fresh thyme instead of the chipotle peppers the recipe called for because I don’t like too much heat, how I know that my oven tends to run a little cold so I should set the temperature 10 degrees higher than called for by the recipe. Just having a recipe isn’t enough to get a really tasty mushroom pot pie out of the oven, just as having a research article or other scientific documentation isn’t enough to get success out of an experiment.

These problems raise some obvious issues around reproducibility, which is a huge focus of concern in science at the moment. Obviously scientific instruments are hopefully a little more standardized than my old apartment oven that runs cold, but you’d be surprised how much variation exists in scientific research. Reproducibility is especially a problem when the researcher is herself the instrument, such as in the case of certain types of qualitative research. Focus group or interview research is usually conducted using a script, so theoretically anyone could pick up the script and use it to do an interview, but a highly experienced researcher knows how to go off-script in appropriate ways to get the needed information, asking probing questions or guiding a participant back from a tangent.

More relevant to my own research: if we think about data not as representations of some sort of universal truth, but as the results of an experiment conducted within a potentially complex local and social context, can shared data be meaningfully reused? How do we filter out the noise and get to some sort of ground truth when it comes to data, or can we at all? Part of the question that I really want to address in my dissertation is what barriers exist to reusing shared data, and I think this is a huge one. Some of the problem can be addressed by standards, or “formal objectivity” (140). However, as Sismondo notes, standards are themselves localized and tied to social processes. Between different scientific fields, the same data point may be measured using vastly different techniques, and within a lab, the equipment you purchase often has a huge impact on how your data are collected and stored. Maybe we can standardize to an extent within certain communities of practice, but can we really hope to get everyone in the world on one page when it comes to standards?

If we can’t standardize, then maybe we can at least document. If I measured in inches but your analysis needs length input in centimeters, that’s okay, as long as you know I measured in inches and you convert the data before doing your analysis. That seems fairly obvious, but how do I necessarily know what I need to document to fully contextualize the data for someone else to use it? Is it important that I took the measurement on a Tuesday at 4 pm, that the temperature outside was 80 degrees with 70% humidity, that I used a ruler rather than a tape measure, that the ruler was made of plastic rather than wood? I could go on and on. How much documentation is enough, and who decides?

The concepts of reproducibility, standardization, and documentation are nothing new, but the idea of data being inextricably caught up in local and social contexts does get me thinking about the feasibility of reusing shared data. I don’t think data sharing is going to stop – there are enough funders and journals on board with requiring data sharing that I think researchers should expect that data sharing will be part of their scientific work going forward. The question then is what is the utility of this shared data. Is it just useful for transparency of the published articles, to document and prove the claims made in those publications? Or can we figure out ways to surmount data’s limited context and make it more broadly usable in other settings? Are there certain fields that are more likely to achieve that formal objectivity than others, and therefore certain fields where data reuse may be more appropriate or at least easier than others? I think this requires further thought. Good thing I have a few years to spend thinking about it!

This week I’ve been reading Sergio Sismondo’s An Introduction to Science and Technology Studies, which has given me a lot to think about in terms of theoretical backgrounds for understanding how science creates knowledge. In fact, it’s almost given me too much to think about. There are so many different theoretical bases brought into the mix here, and I can see the relative merits of each, so I find myself wondering how to make sense of it all, but also what it means to adopt a theoretical underpinning as a social scientist. Is it like a religion, where you accept one and only one dogma, and all parts of it, to the exclusion of all others? Or is it more like a buffet, where you pick a little bit of the things that seem appealing to you and leave behind the things that don’t catch your eye? I’m hoping it’s the latter, and I’m going to go on that assumption until the theory police tell me I can’t do it. 🙂 So, on that assumption, here are some ideas I’ve put on my plate from Sismondo’s buffet.

Structural Functionalism and Mertonian Norms

My favorite theoretical framework I picked up here was structural functionalism, and in particular, Robert Merton’s four guiding norms. Structural functionalism, as I understand it, argues that society is composed of institutional structures that function based on guiding norms and customs. Merton suggests that science is one such institution, the primary goal of which is “the extension of certified knowledge” (23). Merton also outlined four norms of behavior that guide scientific practice, suggesting that those who follow them will be rewarded and those who violate them will be punished. The norms are universalism (that the same criteria should be used to evaluate scientific claims regardless of the race, gender, etc. of the person making them), communism (that scientific knowledge belongs to everyone), disinterestedness (that scientists place the good of the scientific community ahead of their own personal gain), and organized skepticism (that the community should not believe new ideas until they have been convincingly proven).

Of those four norms, communism and disinterestedness speak the most to my interest in data sharing and reuse. Communism seems the most obviously related. It’s very interesting to think about what parts of science are typically thought to belong to the community and which are thought to be privately owned. For example, the Supreme Court unanimously ruled in 2013 that human genes could not be patented, a decision that seems in line with Merton’s communism norm. On the other hand, plenty of scientific ideas can be and are patented. While many scientific journals are becoming open access and making their articles freely available, many more work on a subscription model, suggesting that the ideas shared within are available for common consumption – if you are willing and able to pay the fee.

Although this example comes from an entirely different realm than science, thinking about these ideas has reminded me of the case of the artist Anish Kapoor, who purchased the exclusive rights to paint with the world’s “blackest black” so that no other artist can use it. In retaliation, another artist designed the “pinkest pink” paint and made it available for sale – to any artist except Anish Kapoor. While this episode is somewhat entertaining, it does bring up some interesting ideas about ownership in communities that are generally dedicated to the common good. Art and science are very different, but they’re also quite alike in some ways that are very relevant to the work I’m doing. They’re both activities carried out by individuals for their own reasons (artistic expression, scientific curiosity) for the common good (to share beauty with the world, to further scientific knowledge). We are outraged when we hear of a rich artist laying exclusive claim to the raw materials of art so that no one else can use them. It feels somehow petty, and it also seems like a disservice to not just the art world, but to all of us. What could others be creating for us if they had access to that black? I don’t know if we feel that same outrage when we hear of a scientist trying to lay exclusive claim to data. Of course this isn’t a perfect analogy – a big part of the work of science is gathering or creating the data, which complicates the concept of ownership. Still, I think there are some interesting ideas here to explore about how scientists think about common ownership of science – not just the ideas, but the data as well.

I started out this entry saying I was going to dip into some other theories – I have some things to say about social constructionism and actor-network theory, but now I’ve spent a long time going on and on about art and science and this is getting a bit long, so I think I’ll stop here for today. 🙂

As I’ve mentioned on this blog before, I recently started a PhD program at the University of Maryland’s iSchool, focusing on scientific researchers’ data reuse practices. There’s a great deal of attention lately on encouraging, and even requiring, researchers to share their data, but less work has been done on how researchers actually make use of that shared data (or if indeed they do at all). This semester, I’m doing an independent study with my advisor, Dr. Andrea Wiggins, with the aim of better understanding the theoretical background for this problem. I have the good fortune of working in a job that involves interacting with researchers on data questions on a pretty much daily basis, so I have plenty of opportunity to observe actual practices, but I have less background on theoretical frameworks for contextualizing and understanding why these things happen, so that’s my goal this semester! I’ve picked out several readings and am going to write weekly reflections on what I’ve read and thought, and since I have this blog, I figured, why not inflict all this on you, my readers, as well? 🙂

This week I read Paul Davidson Reynolds’ Primer in Theory Construction, which breaks down the research process and explores the scientific method and all its component parts. It is described as being designed “for those who have already studied one or more of the social, behavioral, or natural sciences, but have no formal introduction to the way theories are constructed, stated, tested, and connected together to form a scientific body of knowledge.” While I was reading it, I often was thinking to myself, “well, yeah, obviously…” but after I had a little more time to think about it, it occurred to me that it was useful to really stop and think about why research is done the way it is and what we can really determine using data, inference, and logic.

One of the things I was thinking about as I was reading this book was how we make the jump from data to knowledge, and also how to operationalize terms like “data” and “knowledge.” The NIH’s big data initiative is called Big Data to Knowledge, but what exactly does it mean to translate “big data” to “knowledge”? How do we define “big data” (as opposed to small data?) and “knowledge”? Are the ways that big data become knowledge different than the ways non-big data become knowledge? There are some good definitions of big data, but how do we define “knowledge” in the scientific, and particularly biomedical, realm?

Thinking about how researchers use data by really breaking things down to their most basic level is a little different from how I’ve thought about things before, but actually makes good sense. I suggest that the barriers to reuse of shared data are:

technological: there aren’t good tools for easily getting/reusing the data, or the data are poor quality or hard to find

social: incentive structures of science often do not reward research that reuses data – take a look at the concept of #researchparasites

educational: reusing data involves a different skill set that most researchers aren’t taught

However, I never really thought about one of the most fundamental social factors, which is how researchers in a field conceptualize data and how it is transformed into knowledge. Are there fundamental differences between the data I gather and data someone else gathers and I reuse? Obviously if I gather my own data, I know more about its context, quality, and provenance. If I reuse someone’s shared data, I don’t know how careful they were when collecting it, or other important things I might need to know about how the data were collected to be able to reuse them meaningfully. For example, I once worked with a researcher on locating a clinical dataset for reuse, and once we got the dataset, the researcher asked how patient temperature had been measured – oral, axillary, rectal? I got back in touch with the original data owner, and they didn’t know – the person who would be able to answer that question had moved on to a new position. Apparently that mattered to the methods of the researcher I was working with, so they couldn’t use that dataset. The sorts of things that seem like minor details can actually make a big difference, but there’s really no way of knowing that unless you know how a research field works with and understands data.

Some things – like knowing how temperature was measured – are probably pretty specific to a narrow field, or even just a particular research method, and it’s probably not possible to know all of the intricacies of the many fields that comprise biomedical research. However, I think there are also likely other fundamental qualities of data that would apply more broadly across many research fields, and perhaps that would be a useful approach to this question.

I had the great pleasure of spending the last few days working on a team at the latest NCBI hackathon. I think this is the sixth hackathon I’ve been involved in, but this is the first time I’ve actually been a participant, i.e. a “hacker.” Prior to working on these events, I’d heard a little bit about hackathons, mostly in the context of competitive hackathons – a bunch of teams compete against each other to find the “best” solution to some common problem, usually with the winning team receiving some sort of cash prize. This approach can lead to successful and innovative solutions to problems in a short time frame. However, the so-called NCBI-style hackathons that I’ve been involved in over the last couple years involve multiple teams each working on their own individual challenge over a period of three days. There are no winners, but in my experience, everyone walks away having accomplished something, and some very promising software products have come out of these hackathons. For more specifics about the how and why of this kind of hackathon, check out the article I co-authored with several participants and the mastermind behind the hackathons, Ben Busby of NCBI.

As I said, this was the first hackathon in which I’ve actually participated as part of a team, but I’ve had a lot of fun doing some librarian-y type “consulting” for five other hackathons before this, and it’s an experience I can highly recommend for any information professional who is interested in seeing science happen real-time. There’s something very exciting about watching groups of people from different backgrounds, with different expertise, most of whom have never met each other before, get together on a Monday morning with nothing but an often very vague idea, and end up on Wednesday afternoon with working software that solves a real and significant biomedical research problem. Not only that, but most of the groups manage to get pretty far along on writing a draft of a paper by that time, and several have gone on to publish those papers, with more on their way out (see the F1000Research Hackathons channel for some good examples).

As motivated and talented as all these hackathon participants are, as you can imagine, it takes a lot of organizational effort and background work to make something like this successful. A lot of that work needs to be done by someone with a lot of scientific and computing expertise. However, if you are a librarian who is reading this, I’m here to tell you that there are some really exciting opportunities to be involved with a hackathon, even if you are completely clueless when it comes to writing code. In the past five hackathons, I’ve sort of functioned as an embedded informationist/librarian, doing things like:

basic lit searching for paper introductions and generally locating background information. These aren’t formal papers that require an extensive or systematic lit review, but it’s useful for a paper to provide some context for why the problem is significant. The hackers have a ton of work to fit into three days, so it’s silly to have them spend their limited time on lit searching when a pro librarian can jump in and likely use their expertise to find things more easily anyway.

manuscript editing and scholarly communication advice. Anyone who has worked with co-authors knows that it takes some work to make the paper sound cohesive, and not like five or six people’s papers smushed together. Having someone like a librarian with editing experience to help make that happen can be really helpful. Plus, many librarians have relevant expertise in scholarly publishing, especially useful since hackathon participants are often students and earlier career researchers who haven’t had as much experience with submitting manuscripts. They can benefit from advice on things like citation management and handling the submission process. Also, I am a strong believer in having a knowledgeable non-expert read any paper, not just hackathon papers. Often writers (and I absolutely include myself here) are so deeply immersed in their own work that they make generous assumptions about what readers will know about the topic. It can be helpful to have someone who hasn’t been involved with the project from the start take a look at the manuscript and point out where additional background or explanation might be beneficial to aiding general understandability.

consulting on information seeking behavior and giving user feedback. Most of the hackathons I’ve worked on have had teams made up of all different types of people – biologists, programmers, sys admins, other types of scientists. They are all highly experienced and brilliant people, but most have a particular perspective related to their specific subject area, whereas librarians often have a broader perspective based on our interactions with lots of people from various different subject areas. I often find myself thinking of how other researchers I’ve met might use a tool in other ways, potentially ones the hackathon creators didn’t necessarily intend. Also, at least at the hackathons I’ve been at, some of the tools have definite use cases for librarians – for example, tools that involve novel ways of searching or visualizing MeSH terms or PubMed results. Having a librarian on hand to give feedback about how the tool will work can be useful for teams with that kind of a scope.

I think librarians can bring a lot to hackathons, and I’d encourage all hackathon organizers to think about engaging librarians in the process early on. But it’s not a one-way street – there’s a lot for librarians to gain from getting involved in a hackathon, even tangentially. For one thing, seeing a project go from idea to reality in three days is interesting and informative. When I first started working with hackathons, I didn’t have that much coding experience, and I certainly had no idea how software was actually developed. Even just hanging around hackathons gave me so much of a better understanding, and as an informationist who supports data science, that understanding is very relevant. Even if you’re not involved in data science per se, if you’re a biomedical librarian who wants to gain a better understanding of the science your users are engaged in, being involved in a hackathon will be a highly educational experience. I hadn’t really realized how much I had learned by working with hackathons until a librarian friend asked me for some advice on genomic databases. I responded by mentioning how cool it was that ClinVar would tell you about pathogenic variants, including their location and type (insertion, deletion, etc), and my friend was like, what are you even talking about, and that was when it occurred to me that I’ve really learned a lot from hackathons! And hey, if nothing else, there tends to be pizza at these events, and you can never go wrong with pizza.

I’ll end this post by reiterating that these hackathons aren’t about competing against each other, but there are awards given for certain “exemplary” achievements. Never one to shy away from a little friendly competition, I hoped I might be honored for some contribution this time around, and I’m pleased to say I was indeed recognized. 🙂

There is a story behind this, but trust me when I say it’s true, I’m the absolute worst at darts.

(Note: this is an adapted version of a final paper I wrote for one of my classes. That’s why it’s so long!)

A few weeks ago, a researcher called my office to see if we could meet to discuss our shared interest in open data. I agreed, and a week later we were sitting in my office having a lively discussion about the many problems that currently hinder more widespread data sharing and reuse in biomedical research. When I mentioned that these topics would be the focus of my doctoral dissertation work, he expressed an interest in seeing some of my research. I replied that it was only my first semester, so I didn’t have much yet, but that I’d published a few papers on my previous research. “I don’t mean papers,” he said. “I mean your data, your code. If you’re doing a PhD on data sharing, don’t you think you should share your data, too? In fact, why don’t you do an open PhD?”

Perhaps I should have immediately replied, “you’re absolutely right. I will do an open PhD.” After all, on the face of it, this suggestion seems perfectly reasonable. My research, and in fact my entire career, revolves around the premise that researchers should share their data. It should be a no-brainer that I would also share my data. In principle, I have no problem with agreeing to do so, but in the real world of research, lofty ideals like service to the community and furthering science are sometimes abandoned in favor of more practical concerns, like getting one’s paper accepted or finishing one’s dissertation before other people have a chance to capitalize on the data.

So what I ended up telling this researcher was that I found his suggestion intriguing and I’d give it some serious thought. I have done just that in the intervening weeks, and here I will reflect on the reasons for my hesitation and explore the levels of openness I am prepared to take on in my doctoral program and my academic career.

After the tweet went out, I could see from the “views” counter that people were already looking at the data. Someone retweeted the link to the data, then another person, and another. The paper hadn’t even been reviewed by anyone yet, much less accepted for publication, but my data were out there for anyone to see, with the link spreading across Twitter. The situation made me nervous. I was excited that people were interested in my data, but what were they doing with it? The views counter ticked up steadily, and people were not just viewing, but actually downloading the dataset as well.

I finally received word from PLOS that they’d accepted the paper, but they asked for major revisions; Reviewer 2 (it’s always Reviewer 2) was niggling over my statistical methods, and I was going to have to redo much of my work to respond to all the revision requests. During the revision process, I received an email from someone I’d never heard of, from an Eastern European country I can’t now recall. She had seen my data on figshare and she, too, wanted to write a paper on this topic. She asked me to send her a copy of my still-in-process paper, as well as a list of all relevant references I had found. The audacity of her request shocked me. Here was someone I’d never even met, telling me she wanted to use my data, write essentially the same paper as me, and she wanted me to give her my background research as well? I wrote an email back, politely but firmly rebuffing her request, and I never heard from her again.

In the end, everything went fine: the paper was published and it has gone on to be cited seven times and featured in PLOS’s new Open Data collection (PLOS Collections 2016). I do still believe that researchers, particularly those whose work is supported by taxpayers’ money, have a responsibility to share their data when doing so will not violate their human subjects’ privacy. However, my own experience demonstrated to me that sharing research data cannot be viewed as a black and white proposition, that you share and are “good,” or you don’t and you are “bad.” Rather, many researchers have real, valid concerns about how they share their data, when, and with whom. Though my reasons probably differ from those of many other researchers, I have my own concerns that give me pause when it comes to the idea of an “open PhD.”

I don’t think my data would be useful or interesting to anyone else.

Some datasets have near infinite value, with uses that extend far beyond the expertise or disciplinary affiliation of their original collector. New computational methodologies and analytic techniques make it possible to uncover previously undetected meaning in datasets or “mash up” disparate datasets to detect novel connections between seemingly unrelated phenomena. The ability to quickly, easily, and cheaply share massive amounts of data means that researchers around the world are able to make life-saving discoveries. For example, the National Cancer Institute’s Cancer Genomics Cloud Pilot program allows researchers to connect to cancer genome data and perform complex analyses on cloud computing platforms more powerful than any computers they could buy for their lab (National Cancer Institute Center for Biomedical Informatics & Information Technology 2016). Projects like this are exciting – they could bring about cures for cancer and vastly improve our lives. Few people would argue that sharing these kinds of datasets is important.

By comparison, my data just look silly. Personally, I find my research fascinating. I could spend hours talking about biomedical scientists’ research data sharing and reuse practices. However, I don’t flatter myself that others are clamoring to see all the thrilling survey data and titillating interview transcriptions I have collected. Beyond validating the results in my article, I see little value for these data. Of course, I have made the argument that data can have unexpected uses that their original collectors could never have imagined, so I am prepared to admit that my data may have usefulness beyond what I would expect. Perhaps I should take the 252 views and 37 downloads of my figshare dataset as evidence that my data are of interest to more people than I might expect.

I’m often embarrassed by my amateurish ways.

I’m a fan of GitHub, a site where you can share your code and allow others to collaboratively contribute to your work, but I’m also terrified of it. I spend a very significant amount of time at my job working with R, my programming language of choice; I teach it, I consult on it, and I use it for my own research. I like to think I know what I’m doing, but in all honesty, I’m pretty much entirely self-taught in R and, though I’m a quick study, I haven’t been using it for that long. I am far from an expert, and I often write code that makes this fact obvious.

Recently I wrote some R code related to a research project I hope to submit for publication soon. The work involved downloading the full text of over 60,000 articles, but since the server’s interface only allowed downloading a thousand articles at a time, I needed to write code that would download the allowed amount, then repeat itself 60 times, updating the article numbers after each iteration. I spent hours trying to figure out the best way to do it, but everything I tried failed. I could download a thousand at a time, then manually update the numbers in the code and re-run it, but doing this 60 times would have been time-consuming. In a throwing-up-your-hands moment of frustration, I wrote a command that would essentially just write those 60 lines of code for me, then ran all 60 lines.

Frankly, this approach was idiotic. Anyone who knows the first thing about programming would scoff at my code, and rightly so. However, at the time, this slipshod approach was the best I could come up with. It’s not just code that may reveal that I don’t always know what I’m doing; the more open the research process, the more opportunity for others to see the unpolished, imperfect steps that lie beneath the shiny surface of the perfected, word-smithed article.
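(For the record, the sane version is just a loop. Here’s a rough sketch of what I should have written, where download_batch() is a made-up stand-in for whatever call actually fetches a batch of articles from the server – not a real function anywhere.)

# Hypothetical stand-in for the real download call, just so this sketch runs
download_batch <- function(start, count) {
  message("would fetch articles ", start, " through ", start + count - 1)
}

batch_size <- 1000
total_articles <- 60000

# Fetch the allowed batch size, then slide the window up and repeat
for (start in seq(1, total_articles, by = batch_size)) {
  download_batch(start = start, count = batch_size)
}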

It takes time to prepare data for broader consumption.

When I teach data management classes for researchers, I emphasize how good data management practices will make submitting their data at the end of the process easy, practically effortless. Of course, having your data perfectly ready to share without any extra effort at the end of your project is about as likely as jumping out of bed and looking good enough to head off to work without taking any time to freshen up. For example, part of my to-do list for preparing the article for the project I described above for publication is figuring out how to actually write that code the right way, so I can share it without fear of being humiliated. Getting my data, code, writing, or any other scholarly output I produce into the kind of shape it would need to be in for me to be willing to put my name on it takes time. When I’m already trying to manage a demanding full-time job with a doctoral program and somehow still find the time to enjoy some sort of leisure every now and then, polishing up something to get it ready for sharing doesn’t often take enough priority to make it onto my daily schedule.

A compromise: the open-ish PhD

Though I’ve just spent five pages expounding on the reasons I cannot do a fully open PhD, I am prepared to compromise. The ideal the researcher urged me toward in our original conversation – don’t wait for your dissertation, share your data now, get your code up on GitHub today! – may not be right for me, but I do believe it is feasible to find some way to share at least some of my scholarly output, if not in real time, then at least in a timely fashion. Therefore, I propose the following tenets of my open-ish PhD:

I will do my best to write code that I am reasonably proud of (or at least not actively ashamed of) and share it on GitHub. While I do not feel comfortable immediately sharing code that corresponds to projects I am actively pursuing and seeking to publish, I will at least share it upon publication. I will also share teaching-related code immediately on GitHub, especially since doing so provides a good model for the researchers I am teaching.

I will make a more concerted effort to share my scholarly writing not just in its final, polished form as journal articles, but also in more casual settings, such as on my blog. I am also interested in exploring pre-print servers like arXiv and bioRxiv as a means of more rapid dissemination of research findings in advance of formal journal article publication.

I will attempt to collect data in a more mindful and intentional way, recognizing that I am not simply collecting my data, but that the point of my efforts is to inform others in my scholarly and research communities. Because I am a federal employee, the work that I conduct in my official capacity cannot be copyrighted: it belongs not to me, but to all the American people who pay my salary. As I go forward with my research, I will do my best to remember that I am doing it not merely to satisfy my curiosity or add to my CV, but to advance science, even in my own small way.

In the end, it probably doesn’t matter so much whether the final data I share are perfect, whether my code impresses other people with its efficiency and elegance, or whether something I write appears in Nature or on my little blog. What matters is making the effort to share, committing to the highest level of openness possible, and doing so publicly and visibly – essentially, leading by example. I can give lectures on the importance of data sharing and teach classes on open source tools until I’m blue in the face, but perhaps the most important thing I can do to convince researchers of the importance of sharing and reusing data is doing exactly that myself.
