Data Analysis

Jeff Leek

Learn about the most effective data analysis methods to solve problems and achieve insight.

Next Session:

Jan 22nd 2013 (8 weeks long)

Workload: 3-5 hours/week

About the Course

You have probably heard that this is the era of “Big Data.” Stories about companies or scientists using data to recommend movies, discover who is pregnant based on credit card receipts, or confirm the existence of the Higgs boson regularly appear in Forbes, The Economist, the Wall Street Journal, and The New York Times. But how does one turn data into this type of insight? The answer is data analysis and applied statistics. Data analysis is the process of finding the right data to answer your question, understanding the processes underlying the data, discovering the important patterns in the data, and then communicating your results to have the biggest possible impact. There is a critical shortage of people with these skills in the workforce, which is why Hal Varian (Chief Economist at Google) says that being a statistician will be the sexy job for the next 10 years.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write up data analyses. Then we will cover some of the most popular and widely used statistical methods and tools, such as linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis. You will also have the opportunity to critique and assist your fellow classmates with their data analyses.

About the Instructor(s)

Jeff Leek is an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and co-editor of the Simply Statistics Blog. He received his Ph.D. in Biostatistics from the University of Washington and is recognized for his contributions to genomic data analysis and statistical methods for personalized medicine. His data analyses have helped us understand the molecular mechanisms behind brain development, stem cell self-renewal, and the immune response to major blunt force trauma. His work has appeared in top scientific and medical journals, including Nature, Proceedings of the National Academy of Sciences, Genome Biology, and PLoS Medicine. He created Data Analysis as a component of the year-long statistical methods core sequence for Biostatistics students at Johns Hopkins. The course has won a teaching excellence award, voted on by the students at Johns Hopkins, every year Dr. Leek has taught it.

Recommended Background

Some familiarity with the R statistical programming language (http://www.r-project.org/) and proficiency in writing in English will be useful. At Johns Hopkins, this course is taken by first-year graduate students in Biostatistics.

Course Format

The course will consist of lecture videos broken into 8-10 minute segments. There will be two major data analysis projects that will be peer-graded with instructor quality control. Course grades will be determined by the data analyses, peer reviews, and bonus points for answering questions on the course message board.

FAQ

This course will focus on how to plan, carry out, and communicate analyses of real data sets. While we will cover the basics of how to use R to implement these analyses, the course will not cover specific programming skills. The separate course Computing for Data Analysis will cover some statistical programming topics that will be useful for this class, but it is not a prerequisite.

What resources will I need for this class?

A computer with internet access on which the R software environment can be installed (recent Mac, Windows, or Linux computers are sufficient).

Do I need to buy a textbook?

There is no standard textbook for data analysis. The course lectures will include pointers to free resources about specific statistical methods, data sources, and other tools for data analysis.