Exploratory Data Analysis with R

This book teaches you to use R to effectively visualize and explore complex datasets. Exploratory data analysis is a key part of the data science process because it allows you to sharpen your question and refine your modeling strategies. This book is based on the industry-leading Johns Hopkins Data Science Specialization, the most widely...

About the Book

This book covers the essential exploratory techniques for summarizing data with R. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data you have. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing informative data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

About the Author

Roger D. Peng is a Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where his research focuses on the development of statistical methods for addressing environmental health problems. He is also a co-founder of the Johns Hopkins Data Science Specialization, the Simply Statistics blog, where he writes about statistics for the general public, the Not So Standard Deviations podcast with Hilary Parker, and The Effort Report podcast with Elizabeth Matsui. He is a Fellow of the American Statistical Association and the recipient of the 2016 Mortimer Spiegelman Award from the American Public Health Association, which honors a statistician who has made outstanding contributions to public health. Roger can be found on Twitter and GitHub at @rdpeng.

Packages

The Book

This package contains just the book in PDF, EPUB, or MOBI formats.

Language: English

Formats: PDF, EPUB, MOBI, APP

Minimum price: Free!

Suggested price: $15.00

The Book + Datasets + R Code Files

This package contains the book and R code files corresponding to each of the chapters in the book. The package also contains the datasets used in all of the chapters so that the code can be fully executed.

Includes: Datasets, R Code Files

Language: English

Formats: PDF, EPUB, MOBI, APP

Minimum price: $15.00

Suggested price: $25.00

The Book + Lecture Videos (HD) + Datasets + R Code Files

This package includes the book, high-definition (720p) lecture video files corresponding to each of the chapters, and the datasets and R code files for all chapters. The videos are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Includes: Datasets, R Code Files, Lecture Videos (HD)

Language: English

Formats: PDF, EPUB, MOBI, APP

Minimum price: $30.00

Suggested price: $35.00

Table of Contents

1. Stay in Touch!

2. Preface

3. Getting Started with R

3.1 Installation

3.2 Getting started with the R interface

4. Managing Data Frames with the dplyr package

4.1 Data Frames

4.2 The dplyr Package

4.3 dplyr Grammar

4.4 Installing the dplyr package

4.5 select()

4.6 filter()

4.7 arrange()

4.8 rename()

4.9 mutate()

4.10 group_by()

4.11 %>%

4.12 Summary

5. Exploratory Data Analysis Checklist

5.1 Formulate your question

5.2 Read in your data

5.3 Check the packaging

5.4 Run str()

5.5 Look at the top and the bottom of your data

5.6 Check your “n”s

5.7 Validate with at least one external data source

5.8 Try the easy solution first

5.9 Challenge your solution

5.10 Follow up questions

6. Principles of Analytic Graphics

6.1 Show comparisons

6.2 Show causality, mechanism, explanation, systematic structure

6.3 Show multivariate data

6.4 Integrate evidence

6.5 Describe and document the evidence

6.6 Content, Content, Content

6.7 References

7. Exploratory Graphs

7.1 Characteristics of exploratory graphs

7.2 Air Pollution in the United States

7.3 Getting the Data

7.4 Simple Summaries: One Dimension

7.5 Five Number Summary

7.6 Boxplot

7.7 Histogram

7.8 Overlaying Features

7.9 Barplot

7.10 Simple Summaries: Two Dimensions and Beyond

7.11 Multiple Boxplots

7.12 Multiple Histograms

7.13 Scatterplots

7.14 Scatterplot - Using Color

7.15 Multiple Scatterplots

7.16 Summary

8. Plotting Systems

8.1 The Base Plotting System

8.2 The Lattice System

8.3 The ggplot2 System

8.4 References

9. Graphics Devices

9.1 The Process of Making a Plot

9.2 How Does a Plot Get Created?

9.3 Graphics File Devices

9.4 Multiple Open Graphics Devices

9.5 Copying Plots

9.6 Summary

10. The Base Plotting System

10.1 Base Graphics

10.2 Simple Base Graphics

10.3 Some Important Base Graphics Parameters

10.4 Base Plotting Functions

10.5 Base Plot with Regression Line

10.6 Multiple Base Plots

10.7 Summary

11. Plotting and Color in R

11.1 Colors 1, 2, and 3

11.2 Connecting colors with data

11.3 Color Utilities in R

11.4 colorRamp()

11.5 colorRampPalette()

11.6 RColorBrewer Package

11.7 Using the RColorBrewer palettes

11.8 The smoothScatter() function

11.9 Adding transparency

11.10 Summary

12. Hierarchical Clustering

12.1 Hierarchical clustering

12.2 How do we define close?

12.3 Example: Euclidean distance

12.4 Example: Manhattan distance

12.5 Example: Hierarchical clustering

12.6 Prettier dendrograms

12.7 Merging points: Complete

12.8 Merging points: Average

12.9 Using the heatmap() function

12.10 Notes and further resources

13. K-Means Clustering

13.1 Illustrating the K-means algorithm

13.2 Stopping the algorithm

13.3 Using the kmeans() function

13.4 Building heatmaps from K-means solutions

13.5 Notes and further resources

14. Dimension Reduction

14.1 Matrix data

14.2 Patterns in rows and columns

14.3 Related problem

14.4 SVD and PCA

14.5 Unpacking the SVD: u and v

14.6 SVD for data compression

14.7 Components of the SVD - Variance explained

14.8 Relationship to principal components

14.9 What if we add a second pattern?

14.10 Dealing with missing values

14.11 Example: Face data

14.12 Notes and further resources

15. The ggplot2 Plotting System: Part 1

15.1 The Basics: qplot()

15.2 Before You Start: Label Your Data

15.3 ggplot2 “Hello, world!”

15.4 Modifying aesthetics

15.5 Adding a geom

15.6 Histograms

15.7 Facets

15.8 Case Study: MAACS Cohort

15.9 Summary of qplot()

16. The ggplot2 Plotting System: Part 2

16.1 Basic Components of a ggplot2 Plot

16.2 Example: BMI, PM2.5, Asthma

16.3 Building Up in Layers

16.4 First Plot with Point Layer

16.5 Adding More Layers: Smooth

16.6 Adding More Layers: Facets

16.7 Modifying Geom Properties

16.8 Modifying Labels

16.9 Customizing the Smooth

16.10 Changing the Theme

16.11 More Complex Example

16.12 A Quick Aside about Axis Limits

16.13 Resources

17. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
