Connecting R to MySQL/MariaDB

Overview

Relational databases, such as MySQL, organize data into tables consisting of rows and columns similarly to an R data.frame. While many beginners will be satisfied with R’s native RData storage format, storing data in a flat-file has its disadvantages. First, they are difficult to share among collaborators. .RData files can contain many matrices and data.frames. It’s very easy to create different versions if two collaborators are performing analyses which require cleaning the data. Second, they are insecure compared to the permissions available to administrators of SQL databases. For example, I may want a research assistant or an analyst on the data science team to have read-only access to certain tables to prevent data corruption, and restrict access entirely to sensitive, identifiable data such as participant names and email addresses. Using MySQL, I can grant SELECT to individuals that require read-only access and each individual can have separate login credentials. Thus, if an individual leaves the academic lab or resigns from the company, I can simply disable their access. Third, you can easily backup and version your database using mysqldump or automatic backups if you choose to use Amazon’s RDS. Did you merge incorrectly or accidentally wipe your data? No problem, just roll back to a previous version. Fourth, data can be efficiently streamed into R using SQL queries. As you may know, R data is stored entirely in-memory. This means to analyze a 24 GB dataset, you’ll need at least 24GB of free RAM in your computer in addition to what’s needed to perform your analysis. Instead, if I know that I only need last month’s customers to perform an analysis, I can stream only that data into R using a query such as:

1

2

SELECT *FROM table WHERE date&gt;'2015-07-01';

If I wanted, I could automatically filter out incomplete values before it gets into R:

1

2

SELECT *FROM table WHERE date&gt;'2015-07-01'ANDtest_score ISNOTNULL;

Interested? In this tutorial, you will learn how to store and retrieve data from a MySQL database with R using the RMySQL package.

Installing MySQL

Since MySQL runs on a variety of platforms, I’ll demonstrate how to install it on Mac, Linux, and Windows, as well as how to provision it using Amazon RDS. These instructions are by no means flawless – feel free to email us feedback if you find problems.

Using Homebrew on OS X

Homebrew is a package manager for OS X that enables the installation of a variety of packages from source. For our purposes, we’re going to install MySQL and run it as a daemon each time your computer starts. To install Homebrew, open Terminal.app and run the following code:

Before doing anything, run brew doctor and make sure you fix any of its complaints before proceeding. After you have a working Homebrew installing, you can now install MySQL:

1

2

3

4

5

6

7

8

# Update Homebrew

brew update

# Install MySQL

brew install MySQL

# Tell MySQL to load at startup

ln-sfv/usr/local/opt/mysql/*.plist~/Library/LaunchAgents

launchctl load~/Library/LaunchAgents/homebrew.mxcl.mysql.plist

Congrats! Try logging in:

1

mysql-uroot

You can also try mycli, which offers a more modern CLI for MySQL with features such as syntax completion. If that command fails, check that the server is actually running:

1

ps-ax|grep mysqld

Installing MySQL on Ubuntu

Chances are, you may have an existing server in your lab or on an EC2 instance that you want to use for this tutorial. If you don’t, I highly recommend you sign up for Amazon’s Free Tier and try it out. Luckily, Linux distributions such as Ubuntu and Fedora come with package managers that make installing MySQL easier for end-users. In this example, we’ll use the APT package manager to complete the installation. Remember, you’ll need sudo access to complete this installation. If you don’t have administrative access, contact your server administrator and ask them to install MySQL for you.

1

2

3

4

5

# Update APT

sudo apt-get update

# Install the server and client

sudo apt-get install mysql-server-5.6mysql-client-5.6

If you’re also using R on the same system, you can install RMySQL using APT. The benefit is that you avoid troubleshooting any compilation errors. However, you may want to add the official CRAN repo to your APT sources.list file before doing this. In my case, I verified my Ubuntu version by running lsb_release -a and added the following line to my APT sources list at /etc/apt/sources.list: deb http://cran.rstudio.org/bin/linux/ubuntu trusty/. Once you have the CRAN Ubuntu repo installed and have updated APT, you can install many CRAN plugins from APT:

1

2

sudo apt-get installr-cran-rmysql

Setting up RDS on AWS

First, sign up for AWS using your Amazon account. Once you’re in, navigate to the RDS section under “Databases”:

Next, click on “Get Started Now” to proceed to the DB type selection screen and select “MySQL” as the type.

After you’ve selected a MySQL database, select “No” for Multi-AZ Zone Deployment, depending on if you want to use the Free Tier for the first year of your RDS server:

Next, select the size and indicate the master username and password for your new database. Keep in mind that databases >= 20GB are still eligible for the RDS Free-Tier:

When the provisioning is finished, you should see your Instance and the endpoint you’ll use for connecting using RMySQL!

Installing RMySQL

Next, we’re going to install another package to extend R’s native capabilities and allow it to read and write from MySQL. First, try to install the binary package from CRAN:

1

2

3

4

install.packages("RMySQL")

# Load the library

library(RMySQL)

If you don’t get any errors, you’re all set! However, depending on your version of R and your platform, you may have to build from source. If that’s the case, don’t fret! As long as you’ve installed the MySQL headers (for example, using Homebrew) you’ll be able to specify that you want to install RMySQL from source, in which it will compile the package:

1

2

3

4

install.packages("RMySQL",type="source"&gt;

library(RMySQL)

Loading required package:DBI

Transferring Data

Next, we’ll discover how to read to and write data from a table in your MySQL database. This tutorial won’t deal with the best ways to store data, but I highly recommend the reader take a peak at Wickham, H. (2014). Tidy Data.RMySQL contains many functions, but the three we’ll focus on are 1) dbListTables(), dbReadTables(), and dbWriteTable(). These, as you’ve no doubt already guessed, allow you to 1) list all the tables available in a given database, read a MySQL table into R as a data.frame, and write to a MySQL table in a way conceptually similar to how write.csv() operates. First, let’s create the connection to the database and explore how some of the functions operate. Note that in my case, dbListTables() returns a character vector with each table name because I’m connecting to an existing database. New databases should not return anything.

Since the result of dbReadTable() is itself a data.frame, you can stream SQL directly into your plot.ly graphs and play around as usual. This functionality is fantastic for building persistent dashboards for your analysis.

1

2

3

library(plotly)

plot_ly(data=mtcars,color=as.factor(cyl),y=mpg,type="box")

Securing your Credentials in R

You’ll notice that in my example above I didn’t embed my credentials directly in my R script. Scripting usernames and passwords is a terrible idea in practice. For example, what if you upload your analysis to a public Github? Suddenly, the world knows that your password is ilovecats!. We can avoid such embarrassment by storing our credentials in a .my.cnf configuration file that RMySQL will look for and read. Values such as the database name, your username, and corresponding password can then be safely stored in your $HOME directory, away from your analysis. To do this, create a file called my.cnf in your home directory:

1

2

3

4

5

6

[group-name]

database=test

user=root

password=

host=127.0.0.1

Going Further

While this post is mainly concerned with 1) connecting R to MySQL databases and 2) reading and writing data, I encourage readers to check out the Khan Academy’s free course on SQL. Using the dbSendQuery() function, we can perform more complex queries that involve joining and subsetting data before they’re read into R: