Objectives

Understand what analytics is and how analysis differs from analytics.

Know the popular tools used in analytics.

Understand the role of a data scientist.

Know the processes involved in analytics.

Define a problem statement.

Collect and summarize data.

Detect and treat outliers in the data.

Analytics vs. Analysis

We know that we are studying analytics, but what is analysis? Is it different from analytics and, if so, in what ways? This section covers the basic difference between analytics and analysis.

Analytics is the science of analysis, in which we apply methods from statistics, data mining, and computer technology to carry out the analysis. Analysis, by contrast, is the process of breaking complex data down into simpler forms that are more compact and easier to understand.

In the next section, we look at data analytics in more detail.

What is Data Analytics?

So what is analytics? It is the science of deriving meaningful results from given data, using various methods and technologies to observe patterns of variation in that data.

By studying these patterns, we can anticipate future outcomes and estimate the uncertainty that a business faces. Various sophisticated models and techniques are used, such as statistical, mathematical, and economic models.

So how do you go about analytics?

First, you gather data. A commercial entity, for example, collects data from enterprise systems, social media, internet forums, and so on. The analysis then proceeds through four stages.

Stage 1: Descriptive Analytics

In this stage, data or information is gathered and summarized. It usually answers questions like “How many students dropped out last year?”

Stage 2: Diagnostic Analytics

In this stage, the data is analyzed to generate insights that explain the findings of the first stage. The question here is “Why has the dropout rate increased in the last year?”

Stage 3: Predictive Analytics

Building on the analyses done in the previous two stages, this stage tries to predict events that have not yet happened, answering questions like “Which students are most likely to drop out?”

Stage 4: Prescriptive Analytics

Finally, the last stage determines the type of action required to support or avoid the outcomes predicted in the previous stage. In this scenario, prescriptive analytics addresses queries like “Which students should I target to keep from dropping out?”
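
The four stages can be sketched on a toy version of the dropout example. This is only an illustration under stated assumptions: the student records, the attendance bands, and the 0.5 risk threshold are all hypothetical, and the "prediction" simply reuses last year's dropout rates rather than a real predictive model.

```python
from collections import defaultdict

# Hypothetical records for last year's students (illustrative data only).
last_year = [
    {"name": "A", "attendance": "low", "dropped_out": True},
    {"name": "B", "attendance": "low", "dropped_out": True},
    {"name": "C", "attendance": "low", "dropped_out": False},
    {"name": "D", "attendance": "high", "dropped_out": False},
    {"name": "E", "attendance": "high", "dropped_out": False},
    {"name": "F", "attendance": "high", "dropped_out": True},
    {"name": "G", "attendance": "high", "dropped_out": False},
]

# Stage 1, descriptive: how many students dropped out last year?
dropout_count = sum(s["dropped_out"] for s in last_year)

# Stage 2, diagnostic: how does the dropout rate differ by attendance band?
totals = defaultdict(int)
dropouts = defaultdict(int)
for s in last_year:
    totals[s["attendance"]] += 1
    dropouts[s["attendance"]] += s["dropped_out"]
rate_by_band = {band: dropouts[band] / totals[band] for band in totals}

# Stage 3, predictive: use last year's rate per band as a naive risk
# estimate for this year's (hypothetical) students.
current = [
    {"name": "X", "attendance": "low"},
    {"name": "Y", "attendance": "high"},
]
risk = {s["name"]: rate_by_band[s["attendance"]] for s in current}

# Stage 4, prescriptive: target students whose estimated risk exceeds
# an assumed threshold.
RISK_THRESHOLD = 0.5
to_target = [name for name, r in risk.items() if r > RISK_THRESHOLD]

print(dropout_count, rate_by_band, to_target)
```

In practice the predictive stage would use a fitted statistical or machine learning model. The rate lookup here only shows how each stage builds on the one before it.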

Next, we will look at a list of a few popular tools used in analytics.

Popular Data Analytic Tools

These are some of the tools commonly used in analytics to get meaningful results: Revolution R, R, RStudio, Tableau, SAP HANA, Weka, KXEN, and SAS.

Next, we will discuss the role of a data scientist.

Role of a Data Scientist

What is the role of a data scientist in data analytics? A data scientist is somebody who is inquisitive, who can look at data and spot trends. They uncover stories hidden in the data that help create more useful insights and thus help solve business problems.

Next, we will see the methodology involved in data analytics.

Data Analytics Methodology

The data analytics methodology comprises six steps:

The discovery step asks whether there is sufficient data to make an analytical plan.

Once we have the data, the data preparation step converts all unstructured data into structured data in a way that helps in planning.

Model planning is choosing an appropriate model that suits your data.

Model building is building the model that was specified in the previous step.

The next step is to come up with insights that help customers solve their problems, i.e., to deliver results.

The final step is to make the model operational and put the findings to use.

Next, we will discuss how to define the problem.

Problem Definition

Defining the problem statement is an important step in analytics and needs to be done before starting the project. At a high level, the problem statement should answer four basic questions:

What is the problem? – Find the existing problem that needs to be answered.

What is it not? – Define what the problem does not include, so that its scope is clear.

We have this problem because? – Find the root cause of the problem.

We don’t have a solution because? – This answers why the problem could not be resolved.

In the next section, we will discuss how to define a problem statement formally.

The techniques that are involved in formally defining a problem are:

State the problem in a general way – that is, state the problem in general terms concerning some practical, scientific or intellectual interest.

The next step is to understand the nature and origin of the problem, its objectives, and the environment in which the problem is to be studied.

Next, all available literature including relevant theories, reports, records, and any other form of relevant literature on the problem needs to be reviewed and examined.

The next step is optional; the researcher may discuss the problem with his/her colleagues and others related to the concerned subject to brainstorm ideas.

Finally, rephrase the problem, that is, put the problem in specific terms that are feasible and help in the development of working hypotheses.

Next, we will look at ways to summarize data.

Summarizing Data Analytics

Before summarizing data, the types of data need to be discussed. Data can be of two types – qualitative or quantitative.

Qualitative or categorical data refers to data that are descriptive. They are generally recorded in groups or in a descriptive language. Example – dividing people into high, medium and low height groups.

Quantitative or numeric data refers to data that are recorded in numbers. They could be either continuous or discrete variables. Example – the measured height of a person.
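
The height example can be made concrete with a short sketch that turns a quantitative measurement into a qualitative group. The cut-off values (160 cm and 175 cm) and the sample heights are assumptions chosen only for illustration.

```python
def height_group(height_cm: float) -> str:
    """Map a measured height to a descriptive group (cut-offs are assumed)."""
    if height_cm < 160:
        return "low"
    if height_cm < 175:
        return "medium"
    return "high"


# Quantitative data: measured heights in centimetres (hypothetical values).
heights_cm = [152.0, 168.5, 181.2, 174.9, 160.0]

# Qualitative data: the same people recorded as height groups.
groups = [height_group(h) for h in heights_cm]
print(groups)
```

Going from quantitative to qualitative data loses information: the group "medium" no longer tells us whether the person measured 160.0 cm or 174.9 cm.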

In the next section, we will discuss how to analyze this data.

Summarizing refers to procedures applied to raw data to convert it into an easily readable and analyzable format. The summaries differ according to the type of data, numerical or categorical, and fall into the descriptive and graphical categories.

Seen here is an example of the population divided by marital status. A bar chart has been used to depict the same data. In the next section, we will see the different summaries based on the categories.

Numeric data can be summarized using the mean, median, and mode values. These are discussed in detail in the following chapters, along with other methods for summarizing. Categorical data can be summarized using a frequency distribution table, that is, the number of occurrences of each distinct value.
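
As a minimal sketch of these summaries, the snippet below uses Python's standard `statistics` module for the numeric measures and `collections.Counter` for the frequency distribution. The ages and marital status values are made-up sample data, not taken from any real dataset.

```python
import statistics
from collections import Counter

# Numeric data: hypothetical ages of seven people.
ages = [23, 25, 25, 31, 38, 25, 31]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

# Categorical data: hypothetical marital status of six people.
marital_status = ["single", "married", "married", "divorced", "single", "married"]

# Frequency distribution: number of occurrences of each distinct value.
frequency = Counter(marital_status)

print(mean_age, median_age, mode_age)
for value, count in frequency.most_common():
    print(value, count)
```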

Graphical representations help in easy visualization of the data. Numeric variables can be viewed as box plots or histograms, and categorical variables can be viewed as bar charts. The visualizations will be discussed in detail in the following chapters.

Data Collection and Analysis

Data collection is the process of gathering data that is relevant to the problem statement at hand. The data aids in proving or disproving the hypothesis, or provides solutions to a research question or problem statement.

Regardless of the problem, the data collection process needs to be well defined and systematic so that the resulting data is accurate. The accuracy of the data is the primary factor in deciding the validity of the findings.

Depending on the method used to collect data, the observations need to be recorded and organized in such a way as to provide optimum usefulness.

The recording and organizing of data must also be directly related to the methods of analysis and use of the data.

We will see the methods of collecting data in the next section.

The data collection methods fall into two major categories – primary and secondary.

Observations, experiments, and surveys fall into the primary category. Observations refer to measuring the data and various attributes.

In experiments, the subjects are divided into groups that receive different external inputs, and the results are recorded. Surveys, questionnaires, and interviews help in recording feedback and are used in studying the characteristics of a population.

Data sources and reporting fall into the secondary category. Data sources refer to data that have already been gathered and are available in a published format. Reporting refers to cases where there exists a database of information that can be legally obtained.

Next, we will discuss the data dictionary.

Data Dictionary

A data dictionary, also called a metadata repository, is a file or a list of files that describes the structure and content of the data. It contains data about the database, such as:

Number of records

Name of each field

Characteristic and type of each field

Description of each field

Relationships between different fields.

In the analysis, the data dictionary helps in understanding the different variables and the relationships and dependencies among them.
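
A simple data dictionary can be derived from the records themselves. The sketch below assumes the data is held as a list of Python dictionaries. The function name `build_data_dictionary`, the field names, and the descriptions are all hypothetical. It covers the record count, field names, types, and descriptions, but not the relationships between fields, which usually have to be documented by hand.

```python
def build_data_dictionary(records, descriptions):
    """Summarize the structure of a list of dict records (illustrative helper)."""
    field_types = {}
    for record in records:
        for name, value in record.items():
            # Collect every type observed for each field.
            field_types.setdefault(name, set()).add(type(value).__name__)
    return {
        "record_count": len(records),
        "fields": {
            name: {
                "types": sorted(types),
                "description": descriptions.get(name, ""),
            }
            for name, types in field_types.items()
        },
    }


# Hypothetical student records and field descriptions.
records = [
    {"student_id": 1, "age": 19, "status": "enrolled"},
    {"student_id": 2, "age": 22, "status": "dropped"},
]
descriptions = {
    "student_id": "Unique identifier of the student",
    "age": "Age in years at enrollment",
    "status": "Current enrollment status",
}

data_dictionary = build_data_dictionary(records, descriptions)
print(data_dictionary)
```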

In the next section, we will discuss the final topic of this lesson – Outliers.

Outlier Treatment

Outliers are observation points in a distribution that deviate significantly from the other observations. Outliers can exist in a distribution because of erroneous observations or because of special circumstances, like a bulk sales order on a particular day in a store. Model-based outlier detection tests exist, and they generally look for outliers based on the mean and variance of the data.

Outliers can be excluded from the analysis, or included when using algorithms that can handle outliers themselves. Other outlier treatments exist but are out of scope for this chapter.
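
One common detection rule based on the mean and variance is the z-score: a point is flagged when it lies more than a chosen number of standard deviations from the mean. The sketch below is a minimal illustration of that rule. The threshold of 2.0, the helper name `find_outliers`, and the daily sales figures are assumptions made for this example, and other tests and thresholds are equally valid.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold (assumed rule)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can deviate
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily sales, with one bulk order on the last day.
daily_sales = [102, 98, 105, 97, 101, 99, 100, 480]

outliers = find_outliers(daily_sales)

# Treatment by exclusion: drop the flagged points before further analysis.
cleaned = [v for v in daily_sales if v not in outliers]
print(outliers, cleaned)
```

Because an extreme value inflates the mean and standard deviation it is measured against, this rule can miss outliers in very small samples, which is one reason the threshold needs to be chosen with the data size in mind.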

Summary

To quickly summarize what we have learned in this introduction to analytics tutorial, we have discussed:

What are analytics and analysis, and what are the differences between them?