Introduction

Back in 2015 I wrote a long blog post on importing Excel tables into R. Happily for everyone this is now a lot easier than it was. This post provides an update on importing spreadsheets into R and exporting from R to Excel. I’ll also cover reading an excel file into R from a url as that seems to be an ongoing struggle.

Import Directly from the RStudio Menu

The big change is that it is now very easy to import from Excel using the RStudio Menu: File > Import Dataset > From Excel.

Next, navigate to the file that you want to import and select it. You will then see something like this.

One point to bear in mind is that the import will often default to the name dataset so that you need to make sure you enter a meaningful name for the dataset.

If your workbook has multiple sheets then you can choose a sheet number using Sheet, choose the maximum number of rows or skip rows if you have a bunch of filler junk in the top rows. Regular Excel users may also want to select columns by Range.

You can also click on a column and choose to skip it or change the format.

It is worth bearing in mind that if you are importing a number of worksheets you can easily lose track. I sometimes use the approach of copying the import chunk into an Rmarkdown document to keep track of what I am doing and where a file came from.

When copying chunks note the small clipboard icon in the top right above the chunk that will copy the chunk to the clipboard for pasting into the console or an R markdown code chunk to document your import steps for the future. My approach when working with multiple sheets is to create an R markdown file and copy and paste the import code into chunks that I then save. That allows “future me”, to borrow from Hadley Wickham, to understand where the datasets came from.

As we can see from importing the file behind the scenes RStudio is using the readxl library to import the file.

readxl will commonly generate warning messages during the import process. For example this dataset generated a long long string of warnings that looked like this.

These warnings arise because readxl guesses the column type by reading the top 1000 rows for each column. However, where a column contains a mix of numbers or characters this can lead to an expecting logical/expecting integer type of error. A lot of the time this is not actually a problem. However, it is important to pay attention to the warnings because they may indicate an actual problem with your data (such as lines spilling across rows).

To fix this there are a number of options to try.

1. Use the guess_max argument

Use the guess_max argument to increase the number of rows that are read to guess the column type. The default is 1000 and here we reset it to 2000. In the case of our example dataset this didn’t work because the problems appeared lower down but it often will. You can add an n_max value (shown below as NULL) where you know the maximum number of rows.

library(readxl)
taxonomy

An alternative to this approach is simply to set min as the maximum number of rows. The issue here is that you would of course need to already have opened the spreadsheet to identify the number of rows, but there is no reason not to simply guess large.

If that doesn’t work for you then a third option is to work out what the format should be and pass it as a string. Arguably, this should be the first option. However, it can also be the most time consuming.

A toy example is the following data frame that we can write to excel (see below on writing files).

The documentation for read_excel (?read_excel) sets out quite a few other options. For example we could specify the format of some columns and leave the function to guess the others. That would look like this.

df

3. Convert all columns to a single type

For a dataset with a lot of columns trying to work out the column types or writing guess, logical, character can rapidly become painful. Depending on your needs it may be easier to simply use the col_types = "text" for all columns and change them where needed later using as.character(), as.logical(), as.numeric() or as.Date().

Reading an Excel file from a URL

In the 2015 post on importing Excel I wrote:

“It is faster to simply download the file to your drive, or swim the Atlantic ocean, than to successfully download an excel file on http: or, in particular https:. So maybe ask yourself what is the path of least resistance and run with that.”

As far as I can tell the situation is not radically different now. However, this is something that lots of people have logically wanted to do. By the power of Stack Overflow, a solution can be found. Luke A provided the following answer to this question on downloading excel files.

This code uses the httr package to read in a .xls file from a url that is written to disk and then passed to readxl.

We can wrap this into a small function with some adjustments. In this case we use str_detect to detect whether the file type is included in the URL. Note that this will not address those cases (such as Google Drive) where the Excel file type is not included (see the googledrive package). Nor will it detect other Excel file types such as .xlsm for macro enabled workbooks. As this suggests the task is more complex than it might at first appear. This small function addresses common use cases but will not address all use cases.

The function assumes that the file extension is contained in the URL and will spot that for us, in the case of a zip extension it will download and attempt to extract the file and if all else fails, we can provide the file extension. the ... informs us that other arguments such as col_types = can be passed to the function and will be picked up by read_excel.

readxl_online

This is not perfect, but it is a start. We can now run a test on different data types to see if it will work. These urls are all from excel files on Github. Github file urls are actually placeholders and so we need to follow the link and copy the Raw file url (see raw=true in the url). Note also that these urls are all https:

The .xls case:

dfxls

The xlsx case:

dfxlsx

The zip file case:

dfzip

It is always a good thing if functions fail fast and provide a helpful message.

error

This prints the expected message.

“Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument”

Tidying column names with janitor

One issue once you have your data in R is that column names in excel files often contain mixed case names and spaces or other characters such as brackets that can be awkward to work with in R. To solve that an easy option is to use the recent janitor package.

install.packages("janitor")

For this we need an excel worksheet with noisy names. For R coding Blue Peter fans…“Here is one we prepared earlier”.

noisycols

noisy(yes)

really_,Noisy;!

EVEN noisier !?*$!

OMG- I_can’t-***//believe?it|

these

are

not

the

noisiest

column

names

in

the

world,

just

a

tribute

NA

NANA

NANANANA

library(janitor)
noisycols1

noisy_yes

really_noisy

even_noisier

omg_i_can_t_believe_it

these

are

not

the

noisiest

column

names

in

the

world,

just

a

tribute

NA

NANA

NANANANA

This does a very good job of cleaning up names but may not always catch everything. If you have particular needs the stringr package (now installed with the tidyverse) is the go to package. Try the str_replace_all function.

Exporting to Excel

In the earlier post I wrote about using write.xlsx() from the xlsx package. That is still a very good option. However, as a personal preference I have now switched over to the writexl package as I find it easier to remember and use. It is also an ROpenSci package and I use a lot of ROpenSci packages. writexl has the added bonus that Clippy appears in the documentation to brighten up your day… or drive you insane… as the case may be. So, if you prefer to be a curmudgeon about Clippy you may want to use the xlsx package.

install.packages("writexl")

We simply specify the file and the name of the file we want to write. An additional argument col_names = TRUE is set to TRUE by default so you only need to specify that if you want the value to be FALSE.

You can also now write multiple sheets by specifying the data frames in a list and passing them to write_xlsx.

library(writexl)
tmp

Round Up

There we have it. Reading and writing Excel files in R is now way way easier than it was just a couple of years ago thanks to the dedicated work of those behind readxl (Hadley Wickham and Jenny Bryan) and writexl by Jeroen Ooms and John McNamara. Other packages will take you to the same place but these are my go to packages. Community contributions are helping to solve the mystery of reading Excel files from urls and we might hope that at some point readxl may address this problem.