Other sites

Data Visualization – Part 2

A Quick Overview of the ggplot2 Package in R

While it will be important to focus on theory, I want to explain the ggplot2 package because I will be using it throughout the rest of this series. Knowing how it works will keep the focus on the results rather than the code. It’s an incredibly powerful package and once you wrap your head around what it’s doing, your life will change for the better! There are a lot of tools out there which provide better charts, graphs and ease of use (i.e. plot.ly, d3.js, Qlik, Tableau), but ggplot2 is still a fantastic resource and I use it all of the time.

Basically, ggplot2 allows for a lot more customization of plots with a lot less code (the rest of it is behind the scenes). Once you are used to the syntax, there’s no going back. It’s faster and easier.

Why wouldn’t you use ggplot2?

A bit of a learning curve

Lack of user interactivity with the plots

Fundamentally, ggplot2 gives the user the ability to start a plot and layer everything in. There are many ways to accomplish the same thing, so figure out what makes sense for you and stick to it.

What happened to get that?

ggplot(economics) loaded the data frame

+ tells ggplot() that there is more to be added to the plot

geom_line() defined the type of plot

aes(x = date, y = unemploy) mapped the variables

The aes() portion is what typically throws new users off but is my favorite feature of ggplot2. In simple terms, this is what “auto-magically” brings your plot to life. You are telling ggplot2, “I want ‘date’ to be on the x-axis and ‘unemploy’ to be on the y-axis.” It’s pretty straightforward in this case but there are more complex use cases as well.

Side Note: you could have achieved the same result by mapping the variables in the ggplot() function rather than in geom_line():ggplot(data = economics, aes(x = date, y = unemploy)) + geom_line()

Here’s the basic formula for success:

Everything in ggplot2 starts with ggplot(data) and utilizes + to add on every element thereafter

Note: I believe the intention of the author of the Top 50 ggplot2 Visualizations Master List was to illustrate how to use ggplot2 rather than doing a full demonstration of what important data visualization techniques are – so keep that in mind as I go through these examples. Some of the visuals do not line up with my best practices addressed in my first post on data visualization.

The Scatterplot

This is one of the most visually powerful tool for data analysis. However, you have to be careful when using it because it’s primarily used by people doing analysis and not reporting (depending on what industry you’re in).

The author of this chart was looking for a correlation between area and population.

Explanation:

geom_point(aes(col=state, size=popdensity)) +
Creates a scatterplot and maps the color and size of points to state and popdensity.

geom_smooth(method="loess", se=F) +
Creates a smoothing curve to fit the data. method is the type of fit and se determines whether or not to show error bars.

xlim(c(0, 0.1)) +
Sets the x-axis limits.

ylim(c(0, 500000)) +
Sets the y-axis limits.

labs(subtitle="Area Vs Population",

y="Population",

x="Area",

title="Scatterplot",

caption = "Source: midwest")
Changes the labels of the subtitle, y-axis, x-axis, title and caption.

Notice that the legend was automatically created and placed on the lefthand side. This is also highly customizable and can be changed easily.

The Density Plot

Density plots are a great way to see how data is distributed. They are similar to histograms in a sense, but show values in terms of percentage of the total. In this example, the author used the mpg data set and is looking to see the different distributions of City Mileage based off of the number of cylinders the car has.

You’ll notice one immediate difference here. The author decided to create a the object g to equal ggplot(mpg, aes(cty)) – this is a nice trick and will save you some time if you plan on keeping ggplot(mpg, aes(cty)) as the fundamental plot and simply exploring other visualizations on top of it. It is also handy if you need to save the output of a chart to an image file.

The Histogram

geom_histogram(bins=20) plots the histogram. If bins isn’t set, ggplot2 will automatically set one.

The Bar/Column Chart

For all intensive purposes, bar and column charts are essentially the same. Technically, the term “column chart” can be used when the bars run vertically. The author of this chart was simply looking at the frequency of the vehicles listed in the data set.

The addition of theme_set(theme_classic()) adds a preset theme to the chart. You can create your own or select from a large list of themes. This can help set your work apart from others and save a lot of time.

However, theme_set() is different than the theme(axis.text.x = element_text(angle=65, vjust=0.6)) the one used inside the plot itself in this case. The author decided to tilt the text along the x-axis. vjust=0.6 changes how far it is spaced away from the axis line.

Within geom_bar() there is another new piece of information: stat="identity" which tells ggplot to use the actual value of Freq.

You may also notice that ggplot arranged all of the data in alphabetical order based off of the manufacturer. If you want to change the order, it’s best to use the reorder() function. This next chart will use the Freq and coord_flip() to orient the chart differently.

This created a much nicer view of the information. It “auto-magically” split everything out by manufacturer!

Spatial Plots

Another nice feature of ggplot2 is the integration with maps and spatial plotting. In this simple example, I wanted to plot a few cities in Colorado and draw a border around them. Other than the addition of the map, ggplot simply places the dots directly on the locations via their longitude and latitude “auto-magically.”

Final Thoughts

I hope you learned a lot about the basics of ggplot2 in this. It’s extremely powerful but yet easy to use once you get the hang of it. The best way to really learn it is to try it out. Find some data on your own and try to manipulate it and get it plotted. Without a doubt, you will have all kinds of errors pop up, data you expect to be plotted won’t show up, colors and fills will be different, etc. However, your visualizations will be leveled-up!