The table below is a subset of data from a directory of dog rescue resources put together by Speaking of Dogs, a volunteer-based dog rescue organization in Toronto, Canada. The information is real, but for this example I garbled the original data into a particular ‘spreadsheet’ format that I struggled with recently. I chose this source of data in support of the Clear The H*ckin Shelters campaign happening this week (read more here and support dog shelters in general).

Organization

Contact name

phone

website

Organization

Contact name

phone

website

“Small Breed”

“Bulldog (English)”

Happy Tails Rescue

Judy

905-357-5096

www.happytailsrescue.ca

Homeward Bound Rescue*

Kathy

905-987-1104

www.homewardboundrescue.ca

LOYAL Rescue Inc.

Anne

888-739-1221

www.loyalrescue.com

unknown

Joan†

416-738-6059

unknown

Pomeranian and Small Breed Rescue

Shelley

416-225-6808

www.psbrescue.com

“Labrador Retriever “

Tiny Paws Rescue

Brenda

1-800-774-8315

www.tpdr.ca

Labrador Retriever Adoption Service

Laura or Karen

289-997-5227

www.lab-rescue.ca

“Senior Dogs”

Dog Rescuers Inc

Joan

416-567-6249 ‡

www.thedogrescuersinc.ca

Speaking of Dogs Rescue

Lorraine

705-444-7637

www.speakingofdogs.com

Footnotes for the table:

* includes other Flat faced dogs: Bulldogs, Boxers, Bostons, Pugs etc

† limited foster care available

‡ phone may not be up to date

This data is not ‘analysis-ready’. Notice the three main issues that need to get sorted:

The table has repeated columns. It appears that the table has been split in two (vertically) and the columns are stacked side-by-side in a sort of ‘wide’ format. We don’t really want duplicated variables because having duplicated column names is a very unnatural, complicated, and risky format for keeping data.

There are header rows sprinkled throughout the Organization column. These non-data rows are used quite often when we want to save space by having the value in one cell somehow apply to cells below (until we find the next header row used for grouping). These are easy for humans to parse, but not computers. Read more about header rows here.

Some ‘cells’ have special characters, these are used to refer to footnotes/information in the table caption, but in this case we would prefer to have this information inside the data rectangle.

rstats time

This post goes through a possible solution to reshape the table and deal with the header rows and footnotes. Make sure you have the necessary R packages installed, and once you do all the code in this block should be fully reproducible.

Start by putting the data into a character vector by simply pasting the table, delimited by tabs and line breaks.

Now, to stack the table into a long form. When I asked for advice on Twitter the consensus was to use the gather function in tidyr after sorting out the duplicated variable names (or by referring to columns by number). The sensible answer for this issue is to not have duplicated names in the first place, and there are various tools and functions for avoiding or fixing them.

However, the real world is a harsh place and duplicated variables are pretty common. I found this post on Stack Overflow for this exact problem, and SO user akrun had a pretty clever solution.
The suggestion was to:

iterate through the unique names to extract all the observations for each variable name

unlist them

put them into a data frame (with variable names)

All I did was replace lapply with map and used a tibble instead of a data frame for the output because I’m in the process of learning purrr, and because tibbles never convert strings to factors or create row names.

# stack into long formrescDFstacked<-map(unique(names(rescDF)),~unlist(rescDF[names(rescDF)==.x],use.names=FALSE))%>%as.data.frame(stringsAsFactors=FALSE)%>%set_names(unique(names(rescDF)))

The data is looking better but we still need to sort out the awkward header rows. Fortunately, there’s a function for that. Read about it here. In brief, my bumbling attempt at tidy evaluation received a makeover from Jenny Bryan and now we can define and use the untangle2 function. When that happened, it was like having Xzibit knocking at my door offering to enhance my car. Since then, the untangle2 function has been helping me shred through other people’s data because in my field everything follows a taxonomic hierarchy and everyone likes to use header rows. I feel that untangle belongs in tidyr, and maybe when I’m confident enough I’ll try to contribute to the tidyverse.

In this table, the header rows are quoted, making for smooth untangling.

The footnotes are the last major issue. To bring them into the data rectangle, I used case_when inside mutate to add the footnote text conditionally, but I’m not very happy with this approach. To figure out the columns to match with the different individual grepl statements I used map to iterate through the columns.

Ideally, I wanted to iterate though the special characters and the columns at the same time, because any given observation could have any combination of footnotes. I couldn’t figure out map2 and list columns :(

# bring footnotes into data rectanglerescDFstacked%<>%mutate(observation=case_when(grepl("\\*",Organization)~"includes other Flat faced dogs: Bulldogs, Boxers, Bostons, Pugs etc",grepl("\u0086",`Contact name`)~"limited foster care available",grepl("\u0087",phone)~"phone may not be up to date"))# how I figured out which columns contained which special charrescDFstacked%>%map(~grepl("\\*",.x))%>%map(~.x[.x==TRUE])%>%unlist()%>%names()rescDFstacked%>%map(~grepl("\u0086",.x))%>%map(~.x[.x==TRUE])%>%unlist()%>%names()rescDFstacked%>%map(~grepl("\u0087",.x))%>%map(~.x[.x==TRUE])%>%unlist()%>%names()# DIDNT WORK# map2(rescDFstacked,c("\\*","\u0086","\u0087"),~ grepl(.y,.x))

Because the footnotes were informative enough, we can wrap things up by removing all the special characters.