Before you can work with data you have to get some. This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.

Преподаватели

Jeff Leek, PhD

Associate Professor, Biostatistics

Roger D. Peng, PhD

Associate Professor, Biostatistics

Brian Caffo, PhD

Professor, Biostatistics

Текст видео

In the last lecture, we talked about raw data, where you're starting from, the messy file that you're trying to extract some data from. In this lecture, we're going to be talking about tidy data. This is sort of the target, where you're going to take the raw data and turn it into a tidy data set, which you can then use to do downstream analysis that you might learn about in a regression modeling class or in the prediction end machine learning class. So the four things that you should have when you finish going from a raw data set, to a tidy data set, are the following. So you should have the raw data, obviously that's the files that you actually extracting information from. You should have a tidy data set, and we're going to be talking about in a minute, and then you should have a code book describing each variable, and it's value in the tidy data set. This code book is often called the meta data. So it's the data that surrounds the data, so-to-speak, and explains what the data is trying to say. You might manage each column in your tidy data set corresponds to one variable, and you might want to describe things like the units of those variables in the code book. And the critical part that sometimes is missing, but we're going to learn a lot about in this class, is an explicit and exact recipe that you use to go from steps one to steps two and three. In other words, you need to report the exact steps that you did, took to get from the raw data to the processed data. You can do that in a variety of different ways, but for the purposes of this class, we'll be putting together our scripts that then you could use to process data. And those will be the sort of recipe that you can hand off and say this is how I got from the raw to the tidy data. So the raw data it's important to remember that there are different levels of raw data, as we talked about it the previous lecture, but for the purposes of your looking a particular data set the raw data are the rawest form of the data that you had access to. So for example, it might be a strange binary file that whatever your measuring spits out. It could be an unformatted Excel file with ten worksheets that some company you contracted with sent you, or it could be something that was handed to you as an Excel file, or it could be complicated JSON data you get from scraping an API. Or, it can even be something like handed your numbers you got from looking through a microscope and counting cells that appeared in that, in the window. You know the raw data is in the right format if, and this is a critical one, you ran no software on the data. So, someone else might of run some software on the data, but by the time it handed to you, there was absolutely no software that was run. And, you didn't manipulate any of your numbers in the data set or move any of the data from the data set. And you didn't do any summarization to the data in any way. So this is how you know that the data is in its rawest form. And it's one component of the data that you should have, sort of the unadulterated, raw data, should be one component of, the having a tidy data set handed over to a collaborator. The tidy data is, on the other hand, the target or the end goal of the whole process, and so the idea is you, you should have each variable that you've measured be in exactly one column. So there should be one variable per column, and each different observation should be in a different row. In other words, if you've measured a particular variable on a large number of people who have been tweeting, say you measure the number of tweets that were posted by a large number of users, then, what you would get is the number of tweets in the column and then, for every single user, you would get the number of tweets in that, in a different row. There should be one table for every kind of variable. So, for example, if you collect data from Twitter and Facebook, and so forth, you might have one table for each of those. If you have multiple tables you should definitely make sure, if you're trying to match them up, that they include a column in the table that allows them to be linked together. So this is often an id, and we'll talk a little bit about merging data sets later in the class. A couple of other important tips that I've found have saved me a lot of trouble in the past is, if you include a row at the top of each file of variable names, sometimes this won't always appear in a tidy data-set, but it's much more useful if they are. It's also much more useful if the variable names are human readable. So for example, if you use something like AgeAtDiagnosis for a column name as opposed to AgeDx, which might be a little bit more confusing and harder for people to read. So its better be a little error on the side of being more explicit, and then in genera,l data should be saved in one file per table. So its a habit of some people to save data in multiple Excel spread sheets within a single Excel file, and so its a better idea to save each spread sheet in a different file. So we'll talk about that in the class. The next component is often a piece that is missing, which is the codebook. So you have this tidy data set that's very nice and clean and neat and you got that data set by doing a whole bunch of things to a broad data set. And so at the end of the day you end up with a data set that constitutes information about a bunch of different variables, and you might end up with just the variable names at the top of that file telling you what happened. But you might want to have more information about the variables. So one common example is, you might want to know the units. So the column header might be the amount of money that we made this quarter, and you'd really want to know if that was in units or thousands or millions or so forth. You might also want to know some information about the summary choices that was made. So, it might be the variable measures something about the monthly revenue. And so you want to know whether the median or mean revenue was taken. And then information about the experimental study design that you used. So you might want to know something about the way that you collected these data, whether it was just in a database, you extracted it out, or whether you performed an experiment, a randomized trial, or an A-B test or something like that. So, a common format for this document is a Word or a text file. It'd be like a Markdown file. As you've seen, Markdown files are sort of a common used format in data science. There should definitely be a section called Study Design, and that has a thorough description of how you collected the data. So this should say things like how you picked which observations to collect. What did you extract out of the database, what did you exclude, and so forth. There also should be a section called Code Book that describes all of the variables and it's units. Finally, you need the instruction list. So, even if you collected all that information and you made it available in the certain terms of the tidy data, you should be able to go back to the raw data and re process it and get the same tidy data set. If that doesn't happen, then there's something going wrong in your data processing pipeline, and so, you want to be able to identify that and fix it. So ideally, this is going to be a computer script that will do this for you. And so, for the purposes of this class, of course R, but I guess, you know, if you have to, you can do it in Python, as well. The input for the script is the, raw data. And so, the output is going to be the processed, tidy data. And really good important component of this script is that there are no parameters. In other words, you fix everything that you've done, after you've done all the processing, and you have an exact recipe that doesn't have to be twe, tweaked or modified by the end user. And they will get the same tidy data set out if they put the same raw data set in. So in some cases, it's not possible to script every step. So, for example, not everything you can do to data, you can do in R. I know it's hard to believe that I'm saying that, but it's true. And so, what you might end up having to do is have, in addition to the scri-, R scripts and the other scripts from the other programming languages you might use, you might need a set of commands that look like this where it's actually just written out in a text file. It says take the raw file, and you're going to run some software, and you're going to run version 3.1.2, and you're going to run it with these spec, specified parameters. You have to give all that information because if you don't, if you just say run this software on the raw data file, if the version changes, you might get a totally different answer out. And then you should say things like oh, I ran the software separately for each sample so that people know exactly how you produced the result. And so, if you can't write a computer script, the best case that you can do is be as explicit as possible in this rescue. Go way overboard in the amount of detail you give on how the data got from being the raw data to the processed data. So it's pretty important that you do this, and so this is actually I guess a funny example for us, but not such a funny example for Reinhart and Rogoff. These were two authors. And they wrote a paper that talked about growth in the time of debt, and so it was actually the paper that was used to justify austerity in a large number of countries and in politically. And so it turned out that this graduate student got a hold of the Excel file that they used, and he looked at the way that they had processed the data, and he found a large number of errors. And so, paying attention to how the data are collected, and analyzed, and put together, actually caused this paper to be called into serious question, which called, which called into serious question political decisions by a large number of countries in the basically, the recession. So the graduate student in question actually ended up on the Colbert Report, and so he got to actually joke around a little bit about how this Excel error brought down all this political enterprise. But it's an important safety lesson for all of us that we should pay careful attention and keep the script available that takes the raw data and turns it into the processed data that we use for the analysis.