Exploratory Data Analysis: Graphical Data Analysis with R

In this blog, we will discuss visualizing the most important attributes of data through graphical exploratory data analysis with R. We will also learn about the suitability of visualization in different scenarios.

We recommend that readers go through our previous blogs on Exploratory Data Analysis to better understand the concepts discussed below.

Introduction to Data Visualization

Data visualization is the presentation of data through pictures and shapes. Data visualization can be considered as a modern equivalent of visual communication. It enables decision makers to grasp difficult concepts or identify new patterns in analytics.

The primary goal of visual representation of data is to communicate information clearly and efficiently to users through statistical graphics, plots, information graphics, tables, and charts. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable, and usable.

The human brain tends to absorb information much faster when data is presented as shapes and colorful images. Charts and graphs can be used instead of spreadsheets, reports or raw numbers to visualize large amounts of complex data.

Data visualization is a quick, easy way of conveying concepts in a universal manner – and you can experiment with different scenarios by making slight adjustments.

It is often said that a picture is worth a thousand words.

Data visualization can help us in the following ways:

• Identify areas that need attention or improvement

• Clarify which factors influence customer behavior

• Understand the right placement of products

• Predict sales volume

Dataset

The dataset contains records of loans given out by a bank.

The fields are defined as follows:

loan_id:

Each new loan is identified by this number

Discrete variable, since it takes whole-number values; note that it is an identifier that labels each loan rather than a quantity to be measured

amount:

The amount that has been given as a loan

Continuous variable, since it can take fractional values within an interval, such as Rs. 2.5

Duration (in months):

Repayment period

Discrete variable, since it is counted in whole months

payments:

Total amount that has been repaid

Continuous variable, since it can take fractional values within an interval, such as Rs. 2.5

status:

The status of the loan i.e. whether the customer has paid it on time or not.

Ordinal categorical/Qualitative variable
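To make the field definitions above concrete, here is a minimal sketch of how such a data frame could be represented in R. All values are invented for illustration, the status labels "late" and "on_time" are hypothetical, and the real file evidently has more columns than the five described here, so column positions will differ.

```r
# Hypothetical rows standing in for the real loan data (values are invented)
loan <- data.frame(
  loan_id  = c(1001L, 1002L, 1003L, 1004L),      # discrete identifier
  amount   = c(30000, 120000.5, 150000, 45000),  # continuous
  duration = c(12L, 24L, 36L, 12L),              # discrete, in months
  payments = c(2500, 5000.02, 4166.67, 3750),    # continuous
  # ordinal categorical: an ordered factor with assumed level names
  status   = factor(c("on_time", "late", "on_time", "on_time"),
                    levels = c("late", "on_time"), ordered = TRUE)
)

# Inspect the type R assigned to each column
str(loan)
```

Storing status as an ordered factor lets R treat it as ordinal rather than as free text.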

We will see how the data can be visualized.

Central Tendency Measures

A central tendency measure is a value which can best describe an entire set of observations.

It can be measured by:

Mean

Median

Mode
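As a quick illustration of the three measures, the sketch below computes them on a small, made-up vector of loan amounts (in thousands). The vector and the variable names are assumptions for this example only; the mode is obtained here from a frequency table, which is just one of several ways to do it.

```r
# Hypothetical loan amounts in thousands, invented for illustration
amounts <- c(30, 30, 45, 80, 120, 150, 300)

mean_amount   <- mean(amounts)    # arithmetic average
median_amount <- median(amounts)  # middle value of the sorted data

# Mode: the most frequent value, taken from a frequency table
freq        <- table(amounts)
mode_amount <- as.numeric(names(freq)[which.max(freq)])

print(c(mean = mean_amount, median = median_amount, mode = mode_amount))
```

Even on this tiny sample the three values differ, which already hints that the data is not symmetric.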

A histogram is a good choice for visualizing the central tendency of continuous data.

Histogram

Histograms are a special form of bar chart where the data represents continuous rather than discrete categories.

There are no gaps between the columns representing the different categories.

In a bar chart, the length of the bar indicates the size of the category, but in a histogram it is the area of the bar that is proportional to the size of the category.

This difference is due to the fact that in a histogram both the x-axis and y-axis have a scale, whereas in a bar chart only the y-axis has a scale.
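The point about area can be checked with a small sketch. It assumes made-up values and deliberately unequal bin widths; with unequal bins, hist() reports densities, so the area of each bar (density times bin width) equals the share of observations in that bin.

```r
# Hypothetical continuous values, invented for illustration
x <- c(12, 35, 48, 60, 75, 90, 110, 180, 250, 290)

# Compute the histogram with unequal bin widths, without drawing it yet
h <- hist(x, breaks = c(0, 50, 100, 300), plot = FALSE)

bin_width <- diff(h$breaks)          # width of each bin on the x-axis
bar_area  <- h$density * bin_width   # share of observations in each bin

print(data.frame(count = h$counts, width = bin_width, area = bar_area))

# Draw it: bar heights are densities, so the areas are what is comparable
plot(h, main = "Histogram with unequal bins", xlab = "x")
```

The last bin holds the most observations but is drawn as the shortest bar, because it is four times as wide as the others.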

In the example below, a histogram has been used to show the average height of children of different ages in 1837. A histogram is used because age is a continuous rather than a discrete category.

Attach the data frame for use in the current environment. This lets us refer to columns directly by their names.

attach(loan)

Pick the relevant columns from the data; here they are "amount", "duration" and "status".

loan_date_loan_amt_payment_duration<-loan[,c(4:5,7)]

Here, amount is a continuous variable, duration is a discrete variable and status is an ordinal variable. Before we plot a histogram of amount, we need to scale the variable down by converting it into units of a thousand.
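The code for the scaling and the histogram itself is not shown in this post, so the following is only a sketch of what those steps might look like. It assumes synthetic, right-skewed amounts generated with rlnorm() in place of the real amount column; the name amount_in_thousand and the line colors are taken from the snippets further down, while the number of bins and the titles are arbitrary choices.

```r
set.seed(42)

# Synthetic stand-in for loan$amount (an assumption, not the real data)
amount <- round(rlnorm(500, meanlog = 11.5, sdlog = 0.7))

# Scale down to units of a thousand
amount_in_thousand <- amount / 1000

# Histogram on the density scale so a density curve can be overlaid
hist(amount_in_thousand, freq = FALSE, breaks = 30,
     main = "Loan amount (in thousands)", xlab = "Amount (thousands)",
     col = "grey90", border = "grey60")

# Density curve plus vertical lines for the mean and the median
lines(density(amount_in_thousand), col = "chocolate3", lwd = 2)
abline(v = mean(amount_in_thousand), col = "royalblue", lwd = 2)
abline(v = median(amount_in_thousand), col = "red", lwd = 2)
```

Using freq = FALSE matters here: with raw counts on the y-axis, the density curve would sit almost flat along the bottom of the plot.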

Next, we will mark the mode of our distribution. We'll first write a function that computes the mode.

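# R has no built-in function for the statistical mode, so we create one.
# Note: base R's mode() reports an object's storage type; this definition masks it.
mode <- function(v) {
  uniqv <- unique(v)
  # return the value that occurs most often
  uniqv[which.max(tabulate(match(v, uniqv)))]
}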

Now, use the above function to draw a line for the mode on the density plot.

# And a line for the mode:
abline(v = mode(amount_in_thousand), col = "green", lwd = 2)

Now, we will add a legend to the graph and then interpret it:

# We add a legend, so it will be easy to tell which line is which.
legend(x = "topright",                                        # location of legend within plot area
       legend = c("Density plot", "Mean", "Median", "Mode"),  # names of the lines that we plotted
       col = c("chocolate3", "royalblue", "red", "green"),    # color for each line
       lwd = c(2, 2, 2, 2))                                   # line width for each plotted line

Interpretation: Here, we can see that the mean, median and mode are far apart from each other, which means the loan amount is not symmetrically distributed. With the mean above the median and the median above the mode, the distribution is skewed to the right. The mean lies at around 150 thousand.

The median (middle) loan amount is around 120 thousand. The most frequently granted loan amount is around 30 thousand, as our mode value lies in that region.

For categorical data, we use a bar chart to check the frequency of each category.

Bar Chart

Bar charts are used to display and compare the density, frequency or other measure (e.g. mean) for different discrete categories of data.

There are several variations of the standard bar chart including horizontal bar charts, grouped or component charts, and stacked bar charts.

Bar charts are useful for displaying data that are classified into nominal or ordinal categories.
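For the ordinal status field, a frequency bar chart could be sketched as follows. The status values and the level names "late" and "on_time" are invented for illustration; the real column may use different codes.

```r
# Hypothetical status values, invented for illustration
status <- factor(c("on_time", "late", "on_time", "on_time", "late", "on_time"),
                 levels = c("late", "on_time"), ordered = TRUE)

# Frequency of each category
status_counts <- table(status)
print(status_counts)

# One bar per category; bar height is the number of loans
barplot(status_counts,
        main = "Loans by repayment status",
        xlab = "Status", ylab = "Number of loans",
        col = "steelblue")
```

Because status is an ordered factor, the bars appear in the order of the factor levels rather than alphabetically or by frequency.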

Types of Bar Charts

The following are the different types of bar charts.

Vertical Bar Charts

Bar charts normally have vertical bars. The taller the bar, the larger the category.

Horizontal Bar Charts

It is also possible to draw bar charts so that the bars are horizontal. The longer the bar, the larger the category.

It is useful when different categories have long titles that would be difficult to include below a vertical bar, or when there are a large number of different categories and there is insufficient space to fit all the columns required for a vertical bar chart across the page.
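A horizontal layout of this kind can be sketched with barplot() and horiz = TRUE. The long category labels and counts below are hypothetical, and the left margin value is an arbitrary choice made to leave room for the labels.

```r
# Hypothetical categories with long labels (invented for illustration)
counts <- c("Loan repaid on time"             = 4,
            "Loan repaid after the due date"  = 2,
            "Loan still running, no problems" = 5)

# Widen the left margin so the long labels fit, then restore it afterwards
old_par <- par(mar = c(4, 16, 2, 1))

# horiz = TRUE draws horizontal bars; las = 1 keeps the labels horizontal
bar_mids <- barplot(counts, horiz = TRUE, las = 1,
                    xlab = "Number of loans", col = "steelblue")

par(old_par)
```

barplot() returns the midpoint of each bar, which is handy if you later want to add text or points at the bar positions.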

Grouped Bar Charts

Grouped bar charts are used to display information about different sub-groups of the main category. A separate bar represents each of the sub-groups, and these are usually colored or shaded differently to distinguish between them.

In such cases, a legend or key is usually provided to indicate the sub-group and color that it represents.
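As a sketch of a grouped bar chart, the example below counts loans by status within each duration and draws the sub-groups side by side with a legend. The data, the status labels and the colors are all assumptions made for illustration.

```r
# Hypothetical loans (invented for illustration)
loan <- data.frame(
  duration = c(12, 12, 12, 24, 24, 24, 36, 36),
  status   = c("on_time", "on_time", "late", "on_time",
               "late", "on_time", "late", "on_time")
)

# Rows are the sub-groups (status), columns the main categories (duration)
counts <- table(loan$status, loan$duration)
print(counts)

# beside = TRUE places the sub-group bars next to each other;
# legend.text adds a key mapping each color to its sub-group
barplot(counts, beside = TRUE,
        col = c("tomato", "steelblue"),
        legend.text = rownames(counts),
        args.legend = list(x = "topright"),
        xlab = "Duration (months)", ylab = "Number of loans")
```

Leaving out beside = TRUE would give a stacked bar chart instead, with the sub-groups piled on top of each other within each duration.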

We hope this blog was useful.