Introduction

The text you are reading now was actually created as an RMarkdown Notebook in RStudio. The best way to read it is to open the notebook in RStudio and follow along, evaluating the code as you go. You can get the notebook file here.

R code can be evaluated either by pressing Ctrl-Shift-Enter with the cursor inside some code or by pressing the little green arrow at the right margin of the code blocks.

Prologue

Computer languages like R that have been around for a long time go through different styles and opinionated principles. One such set of principles is expressed by the Tidyverse, which I like and advocate.

You enter the Tidyverse by loading the tidyverse library – almost. On a newly created R installation, you first need to install the libraries on your computer. This is done using the install.packages function in the following code part. Note that this is only necessary once!

#install.packages("tidyverse")
#install.packages("rlist")

Then you can load the tidyverse and some other necessary libraries for this tutorial.
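The loading step itself is not shown here. A minimal sketch, assuming the two packages from the install step (tidyverse and rlist) are the ones meant, might be:

```r
# Load the core Tidyverse packages (dplyr, stringr, tibble, and others)
library(tidyverse)

# Load rlist, which provides list.filter
# (assumed here to be one of the "other" libraries the tutorial needs)
library(rlist)
```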

Pipelines

To me, one of the things that makes R wonderful to work with is the pipe operator: %>%. This operator takes what's on the left and sends it to whatever is on the right, e.g. "The Quick Brown Fox" %>% length passes the sentence to the length function, just like writing length("The Quick Brown Fox").

You enable the pipe operator by loading the magrittr library.

library(magrittr)

Okay, so what can we do with this pipe?

Let's say we have a string of words that we want to count. Evaluating such a sentence just gives us the same thing back:

"Gollum or Frodo And Sam"

## [1] "Gollum or Frodo And Sam"

To count elements in a list, i.e. the words in the sentence, we can use the length function:

"Gollum or Frodo And Sam" %>% length()

## [1] 1

Okay, so length receives a list with one element: the sentence. Let's split that sentence into words (observe that it's okay to break pipes into multiple lines):

Now, the simplify = TRUE is needed because str_split can do a lot more than just split a sentence, but for now we just need the simple stuff.
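A sketch of the split described above, assuming str_split comes from the stringr package (part of the Tidyverse) and that the words are separated by single spaces:

```r
library(magrittr)  # provides %>%
library(stringr)   # provides str_split

# Split the sentence on spaces, then count the resulting pieces.
# A pipe can continue on the next line as long as each line ends with %>%.
n_words <- "Gollum or Frodo And Sam" %>%
  str_split(" ", simplify = TRUE) %>%
  length()

n_words
```

With simplify = TRUE the result is a character matrix with one row, so length counts the five words. Without it, str_split returns a list with one element per input string, and length would again report 1.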

Well, we're not content yet, as we don't want to count words like “or” and “and”. Such words are called stop words. In other words, we want to ignore words that belong to a list of stop words we define. In R, a list of words is defined like this:

c("or", "and")

## [1] "or" "and"

If we want to know whether a word is contained in such a list, i.e. we want to ask whether “or” is in the list (“or”, “and”), we can do this:

"or" %in% c("or", "and")

## [1] TRUE

But we really want to ask whether “or” is not in the list. In many programming languages, a true-or-false value can be negated with the ! operator.

!FALSE

## [1] TRUE

So, our list checking expression becomes

!("or" %in% c("or", "and"))

## [1] FALSE

Back to our word counting example. We can now filter the list of words with the list.filter function, using the expression above.
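One way this filtering could look is sketched below. It assumes list.filter comes from the rlist package, where a dot inside the condition stands for the current element. The lower-casing step is an addition of this sketch, needed so that “And” matches the stop word “and”.

```r
library(magrittr)  # provides %>%
library(stringr)   # provides str_to_lower and str_split
library(rlist)     # provides list.filter

stop_words <- c("or", "and")

# Lower-case the sentence and split it into a plain character vector of words
words <- "Gollum or Frodo And Sam" %>%
  str_to_lower() %>%
  str_split(" ", simplify = TRUE) %>%
  as.vector()

# Keep only the words that are not in the stop word list;
# the dot refers to each word in turn
content_words <- list.filter(words, !(. %in% stop_words))

length(content_words)
```

Only “gollum”, “frodo” and “sam” survive the filter, so the count drops from five to three.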

Data in tables

Most data come in tables in one form or another. Data could be in an Excel spreadsheet, a csv file, a database table, an HTML table, and so on. R understands all these forms and can import them into an R data table, or data frame, as they are called in R.

A very easy way to create a data frame is to use the tibble package, which is also part of the Tidyverse. The following function creates a data frame with two columns named letter_code and value:
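The original table contents are not shown here, so the sketch below uses made-up values. Only the two column names are taken from the text.

```r
library(tibble)

# Illustrative values only; the column names follow the description above
letters_tbl <- tibble(
  letter_code = c("a", "b", "c"),
  value = c(1, 2, 3)
)

letters_tbl
```

Each argument to tibble becomes one column, and all columns must have the same length (or length one).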

Getting ready for large scale

Okay, so let's take R code to the next level. R is normally developed and run on a desktop computer or laptop, but it can also run as a server with a web browser interface — and you can hardly tell the difference.

As stated in the introduction, the aim of this text is to show how to run R analysis on the Cultural Heritage Cluster. This cluster is primarily an Apache Spark cluster, and of course R, through the Tidyverse, has an interface to such a Spark cluster.

Now, let's see how that works, but be aware: we're trying to break a butterfly upon a wheel…

First ensure that the package for the Spark integration is installed:

#install.packages("sparklyr")

Now, sparklyr can work against two kinds of Spark cluster: a real cluster running on physical or virtual hardware in some server room, or a local pseudo-cluster. The latter makes it easy for us to develop the necessary analysis code before turning to the Real Big Thing.

Load the Spark library:

library(sparklyr)

If you want to run against a local pseudo instance, do this, which installs Apache Spark on your machine.

spark_install(version = "2.1.0")

## Spark 2.1.0 for Hadoop 2.7 or later already installed.

The only difference for us is how to initiate the cluster, pseudo or not:

# Sys.setenv(SPARK_HOME='/usr/hdp/current/spark2-client') # for connecting to the CHC
# sc <- spark_connect(master = 'yarn-client') # for connecting to the CHC
sc <- spark_connect(master = "local", version = "2.1.0") # for connection to a local Spark

Load the texts onto Spark

Now, the texts are on the local file system, but we want them in Spark. Remember that we are breaking butterflies on wheels here!

twain <- spark_read_text(sc, "twain", "mark_twain.txt")

Analysis

The texts now have a copy in the Spark system, cluster, machine, or whatever we should call that thing. What's important is that we can use that copy for very large scale analysis. Here, we'll just do some very simple visualization.

First, let's get the data into a tidy form, i.e. remove all punctuation, remove stop words and transform the text into a form with one word per row.

The first filter function removes all empty lines (it keeps lines where the number of characters is more than zero)

the mutate function replaces all punctuation with spaces

the ft_tokenizer function transforms each line into a list of words

the ft_stop_words_remover removes a set of pre-defined stop words

the second mutate takes the list of words on each line and transforms that list into multiple rows, one per word

the select function removes all columns except the column with the word

the last filter function removes words with only one or two letters

the compute function stores the result in the Spark cluster for easy retrieval later

and lastly save that Spark result as an R name called tidy_words
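The steps above can be sketched as a single pipeline. This is an approximation under stated assumptions, not the original code: the text column is assumed to be called line (the name spark_read_text uses), the intermediate column names word_list and wo_stop_words and the punctuation pattern are illustrative, and the helper name tidy_spark_words is hypothetical. The input_col and output_col argument names follow recent sparklyr releases and may differ in older ones.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper: takes a Spark table with a text column named `line`
# and returns a Spark table with one word per row.
tidy_spark_words <- function(lines_tbl) {
  lines_tbl %>%
    # keep only non-empty lines
    filter(nchar(line) > 0) %>%
    # replace punctuation with spaces (regexp_replace is a Spark SQL function)
    mutate(line = regexp_replace(line, "[_\"'():;,.!?\\-]", " ")) %>%
    # turn each line into a list of words
    ft_tokenizer(input_col = "line", output_col = "word_list") %>%
    # drop Spark's pre-defined stop words
    ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words") %>%
    # one row per word (explode is a Spark SQL function)
    mutate(word = explode(wo_stop_words)) %>%
    # keep only the word column
    select(word) %>%
    # drop words with one or two letters
    filter(nchar(word) > 2) %>%
    # store the result in Spark under the name "tidy_words"
    compute("tidy_words")
}

# With the connection and table from the earlier steps:
# tidy_words <- tidy_spark_words(twain)
```

Wrapping the pipeline in a function keeps the sketch independent of a running Spark connection; the commented last line shows how it would be applied to the twain table.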

Count the word frequencies

Okay, so that can be used to perform a word count. The arrange function sorts a data frame, and the desc function gives us descending order, i.e. the largest number first. n is an implicit column name created by the count function; it holds the number of occurrences of each thing counted.
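To show the count, arrange and desc combination without needing a Spark connection, the sketch below uses a small local tibble with made-up words as a stand-in for tidy_words. The name tidy_words_local is hypothetical. The same dplyr verbs can be applied to the Spark table.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for tidy_words: one word per row
tidy_words_local <- tibble(
  word = c("river", "raft", "river", "tom", "river", "raft")
)

word_count <- tidy_words_local %>%
  count(word) %>%      # adds a column n with the number of rows per word
  arrange(desc(n))     # sort so the largest n comes first

word_count
```

Here “river” ends up in the first row with n equal to 3, followed by “raft” and “tom”.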

Next steps

The DeIC National Cultural Heritage Cluster (CHC) at the Royal Danish Library has R as one of its two main interfaces, Python being the other one. R is very widespread in the data-centric communities, including the digital humanities. This blog post describes how to get started with R, with the main objective of enabling the use of R at the CHC. Still, most of the descriptions here are generic and platform agnostic.

The R Project describes R in the following way:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it of an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

We propose to use the RStudio platform for working with R. RStudio is a commercial organisation developing tools and methods for and with R, and they describe their mission as follows:

RStudio has a mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment. These tools further the cause of equipping everyone, regardless of means, to participate in a global economy that increasingly rewards data literacy.

We offer open source and enterprise ready tools for the R computing environment. Our flagship product is an Integrated Development Environment (IDE) which makes it easy for anyone to analyze data with R. We also offer many R packages, including Shiny and R Markdown, and a platform for sharing interactive applications and reproducible reports with others.

Some notes on coding in R

As R is several decades old, a lot of R code has been written in many different styles, following different principles and using many extension libraries that add functionality to base R. In recent years, the biggest movement within the R community has been the Tidyverse. The Tidyverse is, in its authors' own words:

R packages for data science

The tidyverse is an opinionated collection of R packages designed for data science.

The “tidy” in Tidyverse refers to an underlying principle for the structure of the data to be analyzed. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. This principle makes data much easier to clean, explore, visualize, analyse, and so on. An in-depth description of, and argument for, the tidy data principle can be found in Tidy data by Hadley Wickham (also published in The Journal of Statistical Software, vol. 59, 2014).
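As a small illustration of the principle, the sketch below reshapes a made-up table of word counts into tidy form. It assumes the tidyr package (part of the Tidyverse) and its pivot_longer function; the table and its values are invented for the example.

```r
library(tibble)
library(tidyr)

# Made-up word counts: one row per text, one column per word.
# This is not tidy, because the column names "river" and "raft"
# are really values of a variable (the word).
wide <- tibble(
  text  = c("text_a", "text_b"),
  river = c(3, 1),
  raft  = c(2, 0)
)

# Tidy form: each row is one observation (text, word, count)
tidy <- pivot_longer(
  wide,
  cols = c(river, raft),
  names_to = "word",
  values_to = "count"
)

tidy
```

The tidy version has four rows, one for each combination of text and word, and each of the three variables has its own column.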

Communities and online resources

If you want to learn R, make it a habit to visit R-bloggers, which has daily news and tutorials about R contributed by over 750 bloggers.

The R community is also very active on Twitter, where most R tweets are tagged with #rstats. Some important tweeters are:

Hadley Wickham (hadleywickham) is the main author of much of the Tidyverse (not so long ago it was actually called the Hadleyverse) and of ggplot, the primary plotting library for R. He is also the author of the books R for Data Science and Advanced R Programming.

Mara Averick (dataandme) tweets a lot on everything R and does so in a fun and entertaining way.

The Dane Thomas Lin Pedersen (thomasp85) tweets a lot on data visualisation and is the author of a lot of very interesting R packages.