R Language for Data Science Beginners

What is R, and why learn it?

R is a statistical programming language used by data scientists for data analysis and statistical computing in nearly every industry and field. It was developed in the early 1990s, and since then numerous efforts have been made to keep improving its capabilities and user interface.

The journey of the R language from a basic text editor to the interactive RStudio, and more recently Jupyter Notebooks, has engaged data science communities across the world.

The inclusion of various high-end packages has made R more and more potent with time. Packages such as dplyr, tidyr, readr, data.table, SparkR, and ggplot2 have made data manipulation, visualization, and computation much more efficient, fast, and accurate.

R has enough potential and provisions to implement machine learning algorithms quickly and simply. By the end of this blog post, you will have enough exposure to the fundamentals to start building predictive models with machine learning on your own.

Note: No prior knowledge of data science or analytics is required here. However, previous experience with algebra and statistics will be helpful.

Benefits of using R-language

There are a plethora of benefits R extends to the data science world, but I have chalked out a few basic ones that give a fundamental idea of what R offers.

The style of coding is quite handy and uncomplicated.

It's open source, so there is no subscription fee to pay to use it.

You get instant access to over 7,800 packages customized for various computation tasks.

The community support is overwhelming; there are plenty of forums to help you out.

In the section Installers for Supported Platforms, select and click the RStudio installer for your operating system. The download should begin as soon as you click.

Wait for the download to complete, then run the installer. Keep clicking Next until you reach Finish, and click that.

Once the installation completes, click on the generated RStudio desktop icon, or use Windows search to access the program.

A quick understanding of the RStudio
interface:

R Script Code Editor: This is the space to write code. To run code, select the line(s) and press Ctrl + Enter. Alternatively, you can click the Run button located at the top right corner of the R Script pane.

R Console: This area shows the output of the code you run. You can also write code directly in the console; however, code entered there cannot be traced later on. This is where the R script comes in useful.

R Environment: This space displays the set of
external elements added, which includes data set, variables, vectors,
functions, etc. To check if the data has been successfully loaded in R, always
keep an eye on this area.

Graphical Output: This space displays the graphs generated during exploratory data analysis. The same pane also houses the Packages and Help tabs. For further detail, go through R's official documentation.

Installation
of R Packages

The real power of R lies in its incredible
packages. In R, most data handling tasks can be performed in two ways- using R
packages or using R base functions. In this post, I’ll also introduce you to
the handiest and powerful R packages. There are two ways to install packages in
R.

1. Using R Script:

To install a package from a script, type install.packages("package name")

As a first-time user, a pop-up might appear asking you to select your CRAN mirror (country server); choose accordingly and press OK.

Note: You can type this either in the console directly and press ‘Enter’ or
in the R script and click on ‘Run.’

2. Using the Package Library:

Run RStudio.

Click on the Packages tab in the bottom-right pane and then click Install. The following dialog box will appear.

In the Install Packages dialog, type the name of the package you want to install in the Packages field, then click Install. This will install the package you searched for, or give you a list of matching packages based on your search text.

Basic
Computations in R (Simple Example)

To familiarize ourselves with the R coding environment, let us start with some basic calculations. R console can be used as an interactive calculator too. Type the following in your console:
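A few basic operations to try (a minimal sketch; the comments show the result R prints for each line):

```r
# R as an interactive calculator: basic arithmetic
5 + 10        # addition           -> 15
25 - 4        # subtraction        -> 21
6 * 7         # multiplication     -> 42
22 / 7        # division           -> 3.142857
2 ^ 10        # exponentiation     -> 1024
10 %% 3       # modulo (remainder) -> 1
sqrt(144)     # square root        -> 12
```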

Similarly, you can experiment with different combinations of calculations and check the results. If you want to revisit a previous calculation, there are two ways. First, click on the R console and press the Up/Down arrow keys on your keyboard: this cycles through the previously executed commands. Press Enter to rerun one.

But, what if you have performed too many
calculations & want to check out the ones lying way before the last one? In
this case, finding out the result by scrolling using arrow keys through every
command will turn out to be too tedious.

In such situations, creating a variable is a better way. In R, you can create a variable using <- or = sign. Let’s assume I want to create a variable x to compute the sum of 10 and 15. I’ll write it as:

> x <- 15 + 10
> x
[1] 25

Once you create a variable, you no longer get the output directly on the console unless you call the variable on the next line. Always remember: variable names can be alphabetic or alphanumeric, but cannot start with a number.

Essentials of R-language

A thorough understanding of this section is
exceptionally vital. This is one of the building blocks of your R programming
knowledge. If you get this right, you will face less trouble in debugging.

R has five basic or atomic classes of objects.
But, let us first understand what an object is!

Everything you see or create in R is an
object. A vector, matrix, data frame, or even a variable is treated as an
object by R. So, R has five basic classes of objects:

Character

Numeric
(Real Numbers)

Integer
(Whole Numbers)

Complex

Logical
(True / False)

Since these classes are self-explanatory by name, I won't elaborate on them. Objects of these classes have attributes. An attribute can be thought of as an identifier: a name or a number that aptly identifies the object. An object can have the following attributes:

Names,
dimension names

Dimensions

Class

Length

Attributes of an object can be accessed using the attributes() function.

Let us now understand the concepts of objects and attributes programmatically. The most fundamental object in R is the vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same class.

For example, let's create vectors of different classes. We can create a vector using the c() (concatenate) command.
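As a sketch, here is one vector per atomic class, created with c(), along with the class(), length(), and attributes() helpers mentioned above:

```r
# One vector per atomic class, created with c()
char_vec <- c("a", "b", "c")     # character
num_vec  <- c(1.5, 2.7, 3.9)     # numeric (real numbers)
int_vec  <- c(1L, 2L, 3L)        # integer (the L suffix makes a whole number an integer)
comp_vec <- c(1 + 2i, 3 - 1i)    # complex
logi_vec <- c(TRUE, FALSE, TRUE) # logical

class(char_vec)  # "character"
class(int_vec)   # "integer"
length(num_vec)  # 3
attributes(matrix(1:6, nrow = 2))  # a matrix carries a dim attribute: 2 3
```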

Data Types in R language

R has various data types, which include vectors (numeric, integer, etc.), matrices, data frames, and lists. Let's get a brief idea of each of them.

Vector:

A vector contains objects of the same class. But what happens if you try to mix objects of different classes? When objects of different classes are combined in a vector, coercion occurs: every element is converted to a single common class. A list, by contrast, can hold objects of different classes side by side. For example:
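A minimal sketch of both behaviours; my_list here is a hypothetical list chosen so that its third element is the logical TRUE used below:

```r
# Mixing classes in a vector triggers coercion to one common class
mixed <- c(1.7, "a", TRUE)
class(mixed)   # "character" -- every element was coerced to character

# A list keeps each element's original class
my_list <- list(22, "ab", TRUE)
my_list
# [[1]]
# [1] 22
#
# [[2]]
# [1] "ab"
#
# [[3]]
# [1] TRUE
```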

The double bracket [[1]] shows the index of
the first element and so on. Hence, you can easily extract the element of lists
depending on their index. Like this:

> my_list[[3]]
[1] TRUE

Matrices:

When a vector is given dimension attributes (rows and columns), it becomes a matrix. A matrix is a two-dimensional data structure, and all of its elements must belong to the same class. Let's create a matrix of 3 rows and 2 columns:
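A sketch using matrix(), which fills column-wise by default:

```r
# A 3 x 2 matrix filled column-wise from the vector 1:6
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
my_matrix
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6

dim(my_matrix)   # 3 2 -- the dimension attribute
t(my_matrix)     # transpose: a 2 x 3 matrix
```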

Data Frame:

This is the most commonly used data type, and it is used to store tabular data. In a matrix, every element must have the same class, but in a data frame you can put together vectors of different classes: every column of a data frame acts like a list. Every time you read data into R, it will be stored as a data frame. Hence, it is important to understand the most commonly used commands on data frames:
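A minimal sketch using a small, made-up data frame (the column names here are purely illustrative):

```r
# A small data frame: each column is a vector, and columns may differ in class
df <- data.frame(
  id    = 1:4,
  name  = c("pen", "book", "bag", "cup"),
  price = c(1.5, 6.0, 12.5, 3.0)
)

# The most commonly used inspection commands
dim(df)      # 4 3 -- rows, columns
nrow(df)     # 4
ncol(df)     # 3
names(df)    # "id" "name" "price"
str(df)      # compact structure: the class of every column
head(df, 2)  # first two rows
summary(df)  # per-column summary statistics
```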

Control
Structures in R

As the name suggests, a control structure
controls the flow of the code or commands written inside a function. A function
is a set of multiple commands written to automate a repetitive coding task.
Some of the important Control Structures in R include- if/else, for, while,
etc. Let us understand these in brief:
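Before the while structure, a quick hedged sketch of how if/else and for look in practice:

```r
# if/else: run one branch depending on a condition
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
parity  # "odd"

# for: repeat a block once per element of a sequence
total <- 0
for (i in 1:5) {
  total <- total + i   # accumulate 1 + 2 + 3 + 4 + 5
}
total  # 15
```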

while:

It first tests a condition and executes the loop body only if the condition is true. After each iteration of the loop, the condition is rechecked. Hence, it is necessary to write the condition such that the loop doesn't run infinitely.
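A minimal while loop, with the update that keeps it from running forever:

```r
# while: recheck the condition before every iteration
count <- 1
while (count <= 5) {
  count <- count + 1   # without this update, the loop would never end
}
count  # 6
```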

There are a few other control structures as
well but are less frequently used than the ones explained above, as follows:

repeat – executes an
infinite loop

break – breaks the
execution of a loop

next – allows skipping an
iteration in a loop

return – helps exit a
function
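A brief sketch of all four (sign_of is a hypothetical helper function used only to illustrate return):

```r
# repeat + break: loop forever until break fires
i <- 0
repeat {
  i <- i + 1
  if (i >= 3) break   # break exits the loop
}
i  # 3

# next: skip even numbers while summing 1..10
odd_sum <- 0
for (k in 1:10) {
  if (k %% 2 == 0) next
  odd_sum <- odd_sum + k
}
odd_sum  # 25

# return: exit a function early with a value
sign_of <- function(x) {
  if (x < 0) return("negative")
  "non-negative"
}
sign_of(-2)  # "negative"
```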

Useful
R Packages

R is supported by various packages that complement the base language. Some of the most basic and commonly used packages in predictive modelling are as follows:

Importing Data: R provides a wide range of packages for importing data available in any format such as .txt, .csv, .json, .sql etc. In order to import large files of data quickly, it is advisable to install and use data.table, readr, RMySQL, sqldf, jsonlite.

Data Visualization: R provides some in-built plotting commands. These are suitable for generating simple graphical representations, but they get complicated when it comes to creating advanced graphics. Hence, you should install ggplot2.

Data Manipulation: R has a vast collection of packages for data manipulation that enable you to perform basic and advanced computations quickly. Among them are dplyr, plyr, tidyr, lubridate, and stringr.

Modelling/ Machine Learning: For modelling, the caret package in R is powerful enough to cater to every need for creating a machine learning model. However, you can install packages based specifically on algorithmic requirements, such as randomForest, rpart, gbm, etc.

Exploratory Data Analysis in R-language

From this section onwards, we're diving deeper into the primary stages of predictive modeling. Data exploration is a crucial stage: you can't build efficient, proficient models unless you learn to explore your data thoroughly. This stage forms a concrete foundation for data manipulation (the very next step). Let's understand it in R.

Before we start, you must get familiar with
these terms:

Response Variable (i.e., Dependent Variable):
In a data set, the response variable (y) is the one on which we make
predictions.

Predictor Variable (i.e., Independent
Variable): In a data set, predictor variables (Xi) are those using which we
predict the response variable.

(Img src: data set from Big Mart Sales Prediction)

Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the response variable included.

Test Data: Once the model is built, its accuracy is verified on the test data. This data usually contains fewer observations than the train data set, and it does not include the response variable.

Graphical
Representation of Variables

It is easier and clearer to understand all the variables with the help of graphical/visual aids. Using graphs, we can analyze the data in two ways: Univariate Analysis and Bivariate Analysis.

Univariate
analysis is done with one variable. It is a lot easier
to implement. Bivariate analysis is done with two variables. Let’s now
experiment by performing bivariate analysis & check out the results.

For visualization, I'll use the ggplot2 package. These graphs will help us understand the distribution and frequency of variables in the data set.
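As a hedged sketch of such a bivariate plot (the Big Mart column names Item_Visibility and Item_Outlet_Sales are assumed, and synthetic data stands in for the real train set so the snippet is self-contained; ggplot2 must be installed):

```r
library(ggplot2)

# Synthetic stand-in for the Big Mart train data (column names assumed)
set.seed(42)
train <- data.frame(
  Item_Visibility   = runif(200, 0, 0.35),
  Item_Outlet_Sales = rlnorm(200, meanlog = 7)
)

# Bivariate analysis: scatter plot of sales against visibility
p <- ggplot(train, aes(x = Item_Visibility, y = Item_Outlet_Sales)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  labs(x = "Item Visibility", y = "Item Outlet Sales")
p   # printing the object renders the plot
```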


Here we can see that the majority of sales have been obtained from products having a visibility of less than 0.2. This suggests that Item_Visibility < 0.2 must be an essential factor in determining sales.

Let’s plot another interesting sample graph in
order to strengthen our understanding of concepts.


Here we can infer that OUT027 has contributed the majority of sales, followed by OUT035. OUT010 and OUT019 probably have the least footfall, thereby contributing the least to outlet sales.

Now we have an idea of the variables and their influence on the response variable.

Let us now combine the data sets; this will save time, as we won't need to write separate code for the train and test data sets. To combine the two data frames, we must make sure they have the same columns:

Combining Columns and Rows: If two matrices have the same number of rows, they can be combined into a larger matrix using the cbind() function. In the example below, A and B are matrices.

newdata <- cbind(A, B)

Similarly, we can combine the rows of two
matrices if they have the same number of columns with the rbind() function. In
the example below, A and B are matrices.

newdata <- rbind(A, B)

Combining rows with different sets of columns: The rbind() function doesn't work when the column names in the two data sets do not match. For example, dataframe1 has three columns A, B, C, while dataframe2 has three columns A, D, E. Using rbind() here will throw an error. The smartbind() function from the gtools package will combine the shared column A and return NA where the column names do not match.
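A self-contained sketch of cbind() and rbind() on small matrices; the mismatched-column case is shown only as a comment, since it needs the gtools package installed:

```r
# Column-bind: the matrices must have the same number of rows
A <- matrix(1:4, nrow = 2)          # 2 x 2
B <- matrix(5:8, nrow = 2)          # 2 x 2
newdata_cols <- cbind(A, B)         # 2 x 4

# Row-bind: the matrices must have the same number of columns
newdata_rows <- rbind(A, B)         # 4 x 2

# Data frames: rbind() needs matching column names
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6)
df2 <- data.frame(A = 7:8, D = 9:10, E = 11:12)
# rbind(df1, df2) would throw an error here; with the gtools package,
# gtools::smartbind(df1, df2) fills the non-matching columns with NA
```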

So, we have finally come to the end of the blog post. I hope I was able to impart some basic ideas and fundamental knowledge about R, and that this experience gets you started on a detailed learning journey into data munging and modeling with the language. If you take my advice, don't jump straight to building a complex model: simple models give you fundamental learning, a benchmark score, and a threshold to work from.

In this brief tutorial, I have demonstrated the steps used in data exploration, data visualization, and data manipulation. Happy learning, folks!