Organizing, Clarifying and Communicating the R Data Analyses

Packt Publishing

This guide uses an inventive role-playing approach to teaching the most effective data analysis techniques using R. Entertaining and involving, it comes with examples, screenshots, and code for fast learning.

Retracing and refining a complete analysis

For demonstration purposes, it will be assumed that a fire attack was chosen as the optimal battle strategy. Throughout this segment, we will retrace the steps that led us to this decision. Meanwhile, we will make sure to organize and clarify our analyses so they can be easily communicated to others.

Suppose we determined our fire attack will take place 225 miles away in Anding, which houses 10,000 Wei soldiers. We will deploy 2,500 soldiers for a period of 7 days and assume that they are able to successfully execute the plans. Let us return to the beginning to develop this strategy with R in a clear and concise manner.

Time for action – first steps

To begin our analysis, we must first launch R and set our working directory:

Launch R.

The R console will be displayed.

Set your R working directory using the setwd(dir) function. The following code is a hypothetical example. Your working directory should be a relevant location on your own computer.

> #set the R working directory using setwd(dir)
> setwd("/Users/johnmquick/rBeginnersGuide/")

Verify that your working directory has been set to the proper location using the getwd() command:

> #verify the location of your working directory
> getwd()
[1] "/Users/johnmquick/rBeginnersGuide/"

What just happened?

We prepared R to begin our analysis by launching the software and setting our working directory. At this point, you should be very comfortable completing these steps.

Time for action – data setup

Next, we need to import our battle data into R and isolate the portion pertaining to past fire attacks:

Copy the battleHistory.csv file into your R working directory. This file contains data from 120 previous battles between the Shu and Wei forces.

Read the contents of battleHistory.csv into an R variable named battleHistory using the read.table(...) command:
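As a hedged sketch of this step, the following self-contained example writes a tiny stand-in for battleHistory.csv to a temporary file, imports it with read.table(...), and isolates the fire attacks with subset(). The toy data and the use of subset() are plausible illustrations only; the chapter's own listing may differ.

```r
# write a tiny illustrative CSV to a temporary file so the sketch is
# self-contained; the chapter's real file is battleHistory.csv in the
# working directory and holds 120 battles
csvPath <- file.path(tempdir(), "battleHistory.csv")
writeLines(c("Method,Rating",
             "fire,80",
             "ambush,55",
             "fire,30"), csvPath)

# read the file; header = TRUE because it has column headings
battleHistory <- read.table(csvPath, header = TRUE, sep = ",")

# keep only the rows whose Method value is "fire"
subsetFire <- subset(battleHistory, Method == "fire")
nrow(subsetFire)  # 2 in this toy example; the chapter's data yields 30
```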

Verify the contents of the new subset. Note that the console should return 30 rows, all of which contain "fire" in the Method column:

> #display the fire attack data subset
> subsetFire

What just happened?

We imported our dataset and then created a subset containing our fire attack data. This time, however, we used a slightly different function, called read.table(...), to import our external data into R.

read.table(...)

Up to this point, we have always used the read.csv() function to import data into R. However, you should know that there are often many ways to accomplish the same objectives in R. For instance, read.table(...) is a generic data import function that can handle a variety of file types. While it accepts several arguments, the following three are required to properly import a CSV file, like the one containing our battle history data:

file: the name of the file to be imported, along with its extension, in quotes

header: whether or not the file contains column headings; TRUE for yes, FALSE (default) for no

sep: the character used to separate values in the file, in quotes

Using these arguments, we were able to import the data in our battleHistory.csv into R. Since our file contained headings, we used a value of TRUE for the header argument and because it is a comma-separated values file, we used "," for our sep argument:

> battleHistory <- read.table("battleHistory.csv", TRUE, ",")

This is just one example of how a different technique can be used to achieve a similar outcome in R. We will continue to explore new methods in our upcoming activities.
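For instance, read.csv() simply presets the header and sep arguments of read.table(...), so the two calls below produce the same data frame. The toy file is written to a temporary location to keep the sketch self-contained.

```r
# write a tiny CSV to a temporary file so the comparison is self-contained
csvPath <- file.path(tempdir(), "newData.csv")
writeLines(c("a,b,c", "4,55,96"), csvPath)

# read.csv() presets header = TRUE and sep = ",", so these are equivalent
viaReadCsv   <- read.csv(csvPath)
viaReadTable <- read.table(csvPath, header = TRUE, sep = ",")

isTRUE(all.equal(viaReadCsv, viaReadTable))  # TRUE
```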

Pop quiz

Suppose you wanted to import the following dataset, named newData into R. Which of the following read.table(...) functions would be best to use?

4,55,96,12

read.table("newData", FALSE, ",")

read.table("newData", TRUE, ",")

read.table("newData.csv", FALSE, ",")

read.table("newData.csv", TRUE, ",")

Time for action – data exploration

To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses:

Display the numeric version of the fire attack subset. Notice that all of the columns now contain numeric data; it will look like the following:

Having replaced our original text values in the SuccessfullyExecuted and Result columns with numeric data, we can now calculate all of the correlations in the dataset using the cor(data) function:

> #use cor(data) to calculate all of the correlations in the fire attack dataset
> cor(subsetFire)

Note that the error message and NA values in our correlation output result from the fact that our Method column contains only a single value. This is irrelevant to our analysis and can be ignored.
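The recoding that made this calculation possible might be sketched as follows. The toy data frame, and the "Y"/"N" and "Victory"/"Defeat" labels, are hypothetical stand-ins for the chapter's actual values.

```r
# a toy stand-in for the fire attack subset; the chapter's exact
# text labels may differ from "Y"/"N" and "Victory"/"Defeat"
subsetFireDemo <- data.frame(
  SuccessfullyExecuted = c("Y", "N", "Y", "N"),
  Result = c("Victory", "Defeat", "Defeat", "Defeat"),
  DurationInDays = c(3, 12, 5, 9)
)

# recode the text columns to numeric: 1 for yes/victory, 0 otherwise
subsetFireDemo$SuccessfullyExecuted <-
  ifelse(subsetFireDemo$SuccessfullyExecuted == "Y", 1, 0)
subsetFireDemo$Result <- ifelse(subsetFireDemo$Result == "Victory", 1, 0)

# with every column numeric, cor(data) can compute all pairwise correlations
cor(subsetFireDemo)
```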

What just happened?

Initially, we calculated summary statistics for our fire attack dataset using the summary(object) function. From this information, we can derive the following useful insights about our past battles:

The rating of the Shu army's performance in fire attacks has ranged from 10 to 100, with a mean of 45

Fire attack plans have been successfully executed 10 out of 30 times (33%)

Fire attacks have resulted in victory 8 out of 30 times (27%)

Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory

The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052

The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333

The duration of fire attacks has ranged from 1 to 14 days with a mean of 7

Next, we recoded the text values in our dataset's Method, SuccessfullyExecuted, and Result columns into numeric form. After adding the data from these variables back into our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data:

The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle's result (0.90), but not strongly correlated with the other variables.

The execution of a fire attack has been moderately negatively correlated with the duration of the attack, such that a longer attack leads to a lesser chance of success (-0.46).

The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables.

The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model.

Pop quiz

Which of the following is a benefit of adding a text variable back into its original dataset after it has been recoded into numeric form?

Calculation functions can be executed on the recoded variable.

Calculation functions can be executed on the other variables in the dataset.

Time for action – model development

Let us continue to the most extensive phase of our data analysis, which consists of developing the optimal regression model for our situation. Ultimately, we want to predict the performance rating of the Shu army under potential fire attack strategies. From our previous exploration of the data, we have reason to believe that successful execution greatly influences the outcome of battle. We can also infer that the duration of a battle has some impact on its outcome. At the same time, it appears that the number of soldiers engaged in battle does not have a large impact on the result. However, since the numbers of Shu and Wei soldiers themselves are highly correlated, there is a potential interaction effect between the two that is worth investigating. We will start by using our insights to create a set of potentially useful models:

Use the glm(formula, data) function to create a series of potential linear models that predict the Rating of battle (dependent variable) using one or more of the independent variables in our dataset. Then, use the summary(object) command to assess the statistical significance of each model:

Our first model used only the successful (or unsuccessful) execution of battle plans to predict the performance of the Shu army in a fire attack. Our summary tells us that execution is an important factor to include in the model.
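That first model might be fit as in this sketch, where a small simulated data frame stands in for the real subsetFire:

```r
# a simulated stand-in for subsetFire (the real data come from battleHistory.csv)
set.seed(1)
subsetFireDemo <- data.frame(SuccessfullyExecuted = rep(c(0, 1), each = 15))
subsetFireDemo$Rating <- 30 + 40 * subsetFireDemo$SuccessfullyExecuted +
  rnorm(30, sd = 10)

# model 1: predict Rating from execution alone
lmFireRating_Execution <- glm(Rating ~ SuccessfullyExecuted,
                              data = subsetFireDemo)
summary(lmFireRating_Execution)
```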

Now, let us examine the impact that the duration of battle has on our model:

This time, we added the number of Shu and Wei soldiers into our model, but determined that they were not significant enough predictors of the Shu army's performance. Therefore, we elected to exclude them from our model.

Lastly, let us investigate the potential interaction effect between the number of Shu and Wei soldiers:

> #investigate a potential interaction effect between the number of Shu and Wei soldiers
> #center each variable by subtracting its mean from each of its values
> centeredShuSoldiersFire <- subsetFire$ShuSoldiersEngaged - mean(subsetFire$ShuSoldiersEngaged)
> centeredWeiSoldiersFire <- subsetFire$WeiSoldiersEngaged - mean(subsetFire$WeiSoldiersEngaged)
> #multiply the two centered variables to create the interaction variable
> interactionSoldiersFire <- centeredShuSoldiersFire * centeredWeiSoldiersFire
> #predict the rating of battle using execution, duration, and the interaction between the number of Shu and Wei soldiers engaged
> lmFireRating_ExecutionDurationShuWeiInteraction <- glm(Rating ~ SuccessfullyExecuted + DurationInDays + interactionSoldiersFire, data = subsetFire)
> #generate a summary of the model
> lmFireRating_ExecutionDurationShuWeiInteraction_Summary <- summary(lmFireRating_ExecutionDurationShuWeiInteraction)
> #display the model summary
> lmFireRating_ExecutionDurationShuWeiInteraction_Summary
> #keep the interaction between the number of Shu and Wei soldiers engaged in the model as an independent variable

We can see that the interaction effect between the number of Shu and Wei soldiers does have a meaningful impact on our model and should be included as an independent variable.

Note that some statisticians may argue that it is inappropriate to include an interaction variable between the Shu and Wei soldiers in this model, without also including the number of Shu and Wei soldiers alone as variables in the model. In this fictitious example, there is no practically significant difference between these two options, and therefore, the interaction term has been included alone for the sake of simplicity and clarity. However, were you to incorporate interaction effects into your own regression models, you are advised to thoroughly investigate the implications of including or excluding certain variables.

We have identified four potential models. To determine which of these is most appropriate for predicting the outcome of our fire attack, we will use an approach known as the Akaike Information Criterion, or AIC:

The AIC procedure revealed that our model containing execution, duration, and the interaction between the number of Shu and Wei soldiers is the best choice for predicting the performance of the Shu army.

What just happened?

We just completed the process of developing potential regression models and comparing them in order to choose the best one for our analysis. Through this process, we determined that the successful execution, duration, and the interaction between the number of Shu and Wei soldiers engaged were statistically significant independent variables, whereas the number of Shu and Wei soldiers alone were not. By using an AIC test, we were able to determine that the model containing all three statistically significant variables was best for predicting the Shu army's performance in fire attacks. Therefore, our final regression equation is as follows:

glm(...)

Each of our models in this article was created using the glm(formula, data) function. We used glm(formula, data) here to demonstrate an alternative R function for creating regression models. In your own work, the appropriate function will be determined by the requirements of your analysis.

You may also have noticed that our glm(formula, data) functions listed only the variable names in the formula argument. This is a shorthand method for referring to our dataset's column names, as demonstrated by the following code:

Notice that the subsetFire$ prefix is absent from each variable name and that the data argument has been defined as subsetFire. When the data argument is used, and the independent variables in the formula argument are unique, the dataset$ prefix may be omitted. This technique has the effect of keeping our code more readable, without changing the results of our calculations.
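The equivalence can be checked on the cars dataset that ships with R; both fits below recover the same coefficients:

```r
# with the data argument, column names can be used directly in the formula
withData   <- glm(dist ~ speed, data = cars)  # cars ships with R
withPrefix <- glm(cars$dist ~ cars$speed)

# both fits produce the same coefficient values
unname(coef(withData))
unname(coef(withPrefix))
```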

AIC(object, ...)

AIC can be used to compare regression models. It yields a series of AIC values, which indicate how well our models fit our data. AIC is used to compare multiple models relative to each other, whereby the model with the lowest AIC value best represents our data.

Similar in structure to the anova(object, ...) function, the AIC(object, ...) function accepts a series of objects (regression models in our case) as input. For example, in AIC(A, B, C) we are telling R to compare three objects (A, B, and C) using AIC. Thus, our AIC function compared the four regression models that we created:

As output, AIC(object, ...) returned a series of AIC values used to compare our models.
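A minimal sketch of the comparison, using two toy models rather than the chapter's four:

```r
# two toy models fit to the same simulated data
set.seed(2)
d <- data.frame(x = 1:20, z = rnorm(20))
d$y <- 2 * d$x + rnorm(20)

modelA <- glm(y ~ x, data = d)  # contains the real predictor
modelB <- glm(y ~ z, data = d)  # contains an unrelated predictor

AIC(modelA, modelB)  # modelA's lower AIC marks it as the better fit
```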

The glm(...) function coordinates well with AIC(object, ...), hence our decision to use them together in this example. Again, the appropriate techniques to use in your future analyses should be determined by the specific conditions surrounding your work.

Pop quiz

When can the dataset$ prefix be omitted from the variables in the formula argument of lm(formula, data) and glm(formula, data)?

When the data argument is defined.

When the data argument is defined and all of the variables come from different datasets.

When the data argument is defined and all of the variables have unique names.

When the data argument is defined, all of the variables come from different datasets, and all of the variables have unique names.

Which of the following is not true of the anova(object, ...) and AIC(object, ...) functions?

Time for action – model deployment

Having selected the optimal model for predicting the outcome of our fire attack strategy, it is time to put that model to practical use. We can use it to predict the outcomes of various fire attack strategies and to identify one or more strategies that are likely to lead to victory. Subsequently, we need to ensure that our winning strategies are logistically sound and viable. Once we strike a balance between our designed strategy and our practical constraints, we will arrive at the best course of action for the Shu forces.

We set a rating value of 80 as our minimum threshold. As such, we will only consider a strategy adequate if it yields a rating of 80 or higher when all variables have been entered into our model.

In the case of our fire attack regression model, we know that to achieve our desired rating value, we must assume successful execution. We also know the number of Wei soldiers housed at the target city. Consequently, our major constraints are the number of Shu soldiers that we choose to engage in battle and the duration of the attack. We will assume a moderate attack duration.

Subsequently, we can rearrange our regression equation to solve for the number of Shu soldiers engaged and then represent it as a custom function in R:

Use the coef(object) function to isolate the independent variables in our regression model:

> #use the coef(object) function to extract the coefficients from a regression model
> #this will make it easier to rearrange our equation by allowing us to focus only on these values
> coef(lmFireRating_ExecutionDurationShuWeiInteraction)

Rewrite the fire attack regression equation to solve for the number of Shu soldiers engaged in battle:

Use the custom function to solve for the number of Shu soldiers that can be deployed, given a rating of 80, duration of 7, success of 1.0, and 10,000 WeiSoldiers:

> #solve for the number of Shu soldiers that can be deployed given a rating of 80, duration of 7, success of 1.0, and 10,000 WeiSoldiers
> functionFireShuSoldiers(80, 1.0, 7, 10000)
[1] 3323.077

Our regression model suggests that to achieve a rating of 80, our minimum threshold, we should deploy 3,323 Shu soldiers. However, from looking at the data in our fire attack subset, a force between 2,500 and 5,000 soldiers has not been previously used to launch a fire attack. Further, four past successful fire attacks on 7,500 to 12,000 Wei soldiers have deployed only 1,000 to 2,500 Shu soldiers. What would happen to our predicted rating value if we were to deploy 2,500 Shu soldiers instead of 3,323?

Create a custom function to solve for the rating of battle when execution, duration, and number of ShuSoldiers and WeiSoldiers are known:
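Such a function might be sketched as below. Every b* default is a hypothetical stand-in for a coefficient that would come from coef(object) on the real model; only the mean soldier counts (2,052 Shu and 12,333 Wei) are taken from our summary statistics.

```r
# a sketch of functionFireRating; the b* values are hypothetical stand-ins
# for coefficients extracted from the real model with coef(object)
functionFireRating <- function(execution, duration, shuSoldiers, weiSoldiers,
                               b0 = 38, bExecution = 48, bDuration = -1.2,
                               bInteraction = 2e-07,
                               meanShu = 2052, meanWei = 12333) {
  # the interaction term was built from the mean-centered soldier counts
  interaction <- (shuSoldiers - meanShu) * (weiSoldiers - meanWei)
  b0 + bExecution * execution + bDuration * duration +
    bInteraction * interaction
}

# successful execution should raise the predicted rating, all else equal
functionFireRating(1.0, 7, 2500, 10000) > functionFireRating(0, 7, 2500, 10000)  # TRUE
```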

Use the custom function to solve for the rating of battle, given successful execution, a 7-day duration, 2,500 Shu soldiers, and 10,000 Wei soldiers:

> #What would happen to our rating value if we were to deploy 2,500 Shu soldiers instead of 3,323?
> functionFireRating(1.0, 7, 2500, 10000)
[1] 81.07
> #Is the 1.07 increase in our predicted chances for victory worth the practical benefits derived from deploying 2,500 soldiers?

By using 2,500 soldiers, our rating value increased to 81, which is slightly above our threshold of confidence for victory. Here, we have encountered a classic dilemma for the data analyst. On one hand, our data model tells us that it is safe to use 3,323 soldiers. On the other, our knowledge of war strategy and past outcomes tells us that a number between 1,000 and 2,500 would be sufficient. Essentially, we have to identify the practical benefits or detriments from deploying a certain number of soldiers. In this case, we are inclined to think that it is beneficial to deploy fewer than 3,323, but more than 1,000. The exact number is a matter of debate and uncertainty that deserves serious consideration. It is always the strategist's challenge to weigh both the practical and statistical benefits of potential decisions.

On that note, let us consider the logistics of our proposed fire attack. Our plan is to deploy 2,500 Shu soldiers over a period of 7 days to attack 10,000 Wei soldiers who are stationed 225 miles away.

> #our gold cost of 6,792 is well below our allotment of 1,000,000
> #our required provisions of 583 are well below our allotment of 1,000,000
> #our 2,500 soldiers account for only 1.25% of our total army personnel
> #yes, the fire attack strategy is viable given our resource constraints

What just happened?

We successfully used our optimal regression model to refine our battle strategy and test its viability in light of our practical resource constraints. Custom functions were used to calculate the number of soldiers necessary to yield our desired outcome, the performance rating given the parameters of our plan, and the overall gold cost of our strategy. In determining the number of soldiers to engage in our fire attack, we encountered a common occurrence whereby our data models conflicted with our practical understanding of the world. Subsequently, we had to use our expertise as data analysts to balance the consequences between the two and arrive at a sound conclusion. We then assessed the overall viability of our strategy and determined it to be sufficient in consideration of our resource allotments.

coef(object)

Prior to rewriting our regression equation and converting it into a custom function, we executed the coef(object) command on our model. The coef(object) function, when executed on a regression model, has the effect of extracting and displaying its coefficients (the intercept and the weight assigned to each independent variable). By isolating these components, we were able to easily visualize our model's equation:

> coef(lmFireRating_ExecutionDurationShuWeiInteraction)

In contrast, the output of the summary(object) function contains much more information than we need for this purpose, thus making it potentially confusing and difficult to locate our variables. This can be seen in the following:

> lmFireRating_ExecutionDurationShuWeiInteraction_Summary

Hence, in circumstances where we only care to see the independent variables in our model, the coef(object) function can be more effective than summary(object).
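The contrast is easy to see on the airquality dataset that ships with R (the model below is illustrative only):

```r
# fit a simple model to a built-in dataset; rows with missing values
# are dropped automatically by glm()
m <- glm(Ozone ~ Temp, data = airquality)

coef(m)     # just the intercept and the slope
summary(m)  # coefficients plus standard errors, significance tests, and deviance
```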

Pop quiz

Under which of the following circumstances might you use the coef(object) function instead of summary(object)?

You want to know the practical significance of the model's variables.

You want to know the statistical significance of the model's variables.

You want to know the model's regression equation.

You want to know the formula used to generate the model.

Time for action – last steps

Lastly, we need to save the workspace and console text associated with our fire attack analysis:

Use the save.image(file) function to save your R workspace to your working directory. The file argument should contain a meaningful filename and the .RData extension:

> #save the R workspace to your working directory
> save.image("rBeginnersGuide_Ch_07_fireAttackAnalysis.RData")

R will save your workspace file. Browse to the working directory on your hard drive to verify that this file has been created.

Manually save your R console log by copying and pasting it into a text file. You may then format the console text to improve its readability.

We have now completed an entire data analysis of the fire attack strategy from beginning to end using R.

The common steps to all R analyses

While retracing the development process behind our fire attack strategy, we encountered a key series of steps that are common to every analysis that you will conduct in R. Regardless of the exact situation or the statistical techniques used, there are certain things that must be done to yield an organized and thorough R analysis. Each of these steps is detailed below.

Perhaps it goes without saying that the first thing to do before beginning any R analysis is to launch R itself. Nevertheless, it is mentioned here for completeness and transparency.

Step 1: Set your working directory

Once R is launched, the first common step is to set your working directory. This can be done using the setwd(dir) function and subsequently verified using the getwd() command:

> #Step 1: set your working directory
> #set your working directory using setwd(dir)
> #replace the sample location with one that is relevant to you
> setwd("/Users/johnmquick/rBeginnersGuide/")
> #once set, you can verify your new working directory using getwd()
> getwd()
[1] "/Users/johnmquick/rBeginnersGuide/"

Comment your work

Note that commented lines, which are prefixed with the pound sign (#), appeared before each of our functions in step one. It is vital that you comment all of the actions that you take within the R console. This allows you to refer back to your work later and also makes your code accessible to others.

This is an opportune time to point out that you can draft your code in other places besides the R console. For example, R has a built-in editor that can be opened via the File | New Document/Script menu or by pressing the Command + N or Ctrl + N keys. Other free editors can also be found online. The advantages of using an editor are that you can easily modify your code and see different types of code in different colors, which helps you to verify that it is properly constructed. Note, however, that to execute your code, it must be placed in the R console.

Step 2: Import your data (or load an existing workspace)

After you set the working directory, it is time to pull your data into R. This can be achieved by creating a new variable in tandem with the read.csv(file) command:

> #Step 2: Import data (or load an existing workspace)
> #read a dataset from a csv file into R using read.csv(file) and save it into a new variable
> dataset <- read.csv("datafile.csv")

Alternatively, if you were continuing a prior data analysis, rather than starting a new one, you would instead load a previously saved workspace using the load(file) command. You can then verify the contents of your loaded workspace using the ls() command.

Step 3: Explore your data

Regardless of the type or amount of data that you have, summary statistics should be generated to explore your data. Summary statistics provide you with a general overview of your data and can reveal overarching patterns, trends, and tendencies across a dataset. Summary statistics include calculations such as means, standard deviations, and ranges, amongst others:

Also recall R's summary(object) function, which provides summary statistics along with additional vital information. It can be used with almost any object in R and will offer information specifically catered to that object:

> #generate a detailed summary for a given object using summary(object)
> summary(object)

Note that there are often other ways to make an initial examination of your data in addition to using summary statistics. When appropriate, graphing your data is an excellent way to gain a visual perspective on what it has to say. Furthermore, before conducting an analysis, you will want to ensure that your data are consistent with the assumptions necessitated by your statistical methods. This will prevent you from expending energy on inappropriate techniques and from making invalid conclusions.

Step 4: Conduct your analysis

Here is where your work will differ from project to project. Depending on the type of analysis that you are conducting, you will use a variety of different techniques. The correct techniques to use will be determined by the circumstances surrounding your work.

Step 5: Save your workspace and console files

At the conclusion of your analysis, you will always want to save your work. To have the option to revisit and manipulate your R objects from session to session, you will need to save your R workspace using the save.image(file) command, as follows:
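A sketch of the save step follows; the temporary path keeps the example self-contained, whereas in practice you would pass a meaningful filename with the .RData extension, as shown earlier.

```r
# a sketch of saving a workspace, using a temporary file so the
# example is self-contained; in a real analysis, save to your working
# directory with a meaningful name and the .RData extension
workspaceFile <- file.path(tempdir(), "fireAttackAnalysis.RData")
save.image(workspaceFile)  # writes every object in the workspace to the file

file.exists(workspaceFile)  # TRUE: the workspace file was written
```

In a later session, load(workspaceFile) restores the saved objects, and ls() lists what came back.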

To save your R console text, which contains the log of every action that you took during a given session, you will need to copy and paste it into a text file. Once copied, the console text can be formatted to improve its readability. For instance, a text file containing the five common steps of every R analysis could take the following form:

> #There are five steps that are common to every data analysis conducted in R

> #Step 1: set your working directory

> #set your working directory using setwd(dir)
> #replace the sample location with one that is relevant to you
> setwd("/Users/johnmquick/rBeginnersGuide/")

> #once set, you can verify your new working directory using getwd()
> getwd()
[1] "/Users/johnmquick/rBeginnersGuide/"

> #Step 2: Import data (or load an existing workspace)

> #read a dataset from a csv file into R using read.csv(file) and save it into a new variable
> dataset <- read.csv("datafile.csv")
> #OR load an existing workspace using load(file)

> #save your R console text by copying it and pasting it into a text file.

Pop quiz

Which of the following is not a benefit of commenting your code?

It makes your code readable and organized.

It makes your code accessible to others.

It makes it easier for you to return to and recall your past work.

It makes the analysis process faster.

Summary

In this article, we conducted an entire data analysis in R from beginning to end. While doing so, we ensured that our work was as organized and transparent as possible, thereby making it more accessible to others. Afterwards, we identified the five steps that are common to all well-executed data analyses in R and used them to conduct, organize, and refine a battle strategy for the Shu army.
