The Outliers service identifies the odd profiles in a dataset: records whose target indicator differs significantly from what is expected.

This service:

Identifies outliers contained in a dataset with regard to a target indicator

Ranks the outliers to get the oddest on top

Provides the reasons why an identified outlier is odd

In general, an outlier can either result from a data quality issue to correct or represent a suspicious case to investigate.

An observation is considered an outlier if the difference between its “predicted value” and its “real value” exceeds the error bar, where the error bar is a measure of the deviation of the values around the predicted score.
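The rule above can be illustrated with a minimal sketch. It assumes the difference is taken as an absolute value and that the error bar is a single threshold supplied by the caller; how the service actually computes the error bar is not described here. The names `Observation` and `find_outliers` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    row_id: int
    real_value: float
    predicted_value: float


def find_outliers(observations, error_bar, max_outliers=None):
    """Return observations whose |predicted - real| exceeds error_bar, oddest first."""
    def gap(obs):
        return abs(obs.predicted_value - obs.real_value)

    # Keep only the observations whose difference exceeds the error bar.
    flagged = [obs for obs in observations if gap(obs) > error_bar]
    # Rank so that the largest difference comes first.
    flagged.sort(key=gap, reverse=True)
    return flagged if max_outliers is None else flagged[:max_outliers]


# Made-up values for illustration only (not taken from the Census dataset).
sample = [
    Observation(1, real_value=1.0, predicted_value=0.10),
    Observation(2, real_value=0.0, predicted_value=0.20),
    Observation(3, real_value=0.0, predicted_value=0.95),
]
outliers = find_outliers(sample, error_bar=0.5)
print([obs.row_id for obs in outliers])
```

The optional `max_outliers` argument plays the role of a "number of outliers" limit: it truncates the ranked list after sorting.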

Reasons will list the variables whose values have the most influence on the score. For each variable, its contribution to the score is compared with its contribution for the whole population. The variables whose contributions differ the most are selected as the most important reasons.
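As a rough illustration of this comparison, the sketch below ranks variables by the gap between a record's contribution and a population-level contribution. It assumes both are already available as plain numbers per variable; the way the service derives contributions is not specified here. The function name `top_reasons` and all figures are hypothetical.

```python
def top_reasons(record_contributions, population_contributions, number_of_reasons=3):
    """Rank variables by how far the record's contribution is from the population's."""
    gaps = {
        name: record_contributions[name] - population_contributions[name]
        for name in record_contributions
    }
    # The most differential contributions (in absolute terms) come first.
    ranked = sorted(gaps, key=lambda name: abs(gaps[name]), reverse=True)
    return [(name, gaps[name]) for name in ranked[:number_of_reasons]]


# Made-up contributions for one record and for the whole population.
record = {"age": 0.30, "education_num": 0.05, "capital_gain": 0.40}
population = {"age": 0.10, "education_num": 0.04, "capital_gain": 0.02}

reasons = top_reasons(record, population, number_of_reasons=2)
print([name for name, _ in reasons])
```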

Note: The target of the dataset must be either binary or continuous. Multinomial targets are not supported.

To summarize, in order to execute the outliers service, you need a dataset with:

a target variable

a set of variables that will be analyzed

Optionally, you can define the following parameters to enhance your analysis:

number of outliers : number of outliers to return

number of reasons : number of reasons to return for each outlier

weight variable: column to be used to increase the importance of a row

skipped variables: a list of variables to skip from the analysis

variable description: a more detailed description of the dataset
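To make the required and optional inputs concrete, here is a minimal sketch that gathers them into one settings dictionary and checks them against the dataset columns. The key names (`target`, `number_of_outliers`, and so on) are hypothetical and are not the service's actual request fields; the variable description is left out because its structure is not given here.

```python
import json


def build_outliers_settings(columns, target, number_of_outliers=None,
                            number_of_reasons=None, weight_variable=None,
                            skipped_variables=()):
    """Collect the analysis settings in a dict; all key names are hypothetical."""
    skipped = list(skipped_variables)
    referenced = [target, *skipped]
    if weight_variable is not None:
        referenced.append(weight_variable)
    # Every referenced column must exist in the dataset.
    for name in referenced:
        if name not in columns:
            raise ValueError(f"unknown column: {name}")
    if target in skipped:
        raise ValueError("the target variable cannot be skipped")

    settings = {"target": target}
    optional = {
        "number_of_outliers": number_of_outliers,
        "number_of_reasons": number_of_reasons,
        "weight_variable": weight_variable,
        "skipped_variables": skipped or None,
    }
    # Only keep the optional parameters that were actually provided.
    settings.update({key: value for key, value in optional.items() if value is not None})
    return settings


columns = ["id", "age", "sex", "race", "fnlwgt", "marital_status", "class"]
settings = build_outliers_settings(
    columns,
    target="class",
    number_of_outliers=10,
    number_of_reasons=3,
    weight_variable="fnlwgt",  # illustrative choice, not a recommendation
    skipped_variables=["id", "sex", "race"],
)
print(json.dumps(settings, indent=2))
```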

The dataset used in this tutorial is extracted from the sample dataset available with SAP BusinessObjects Predictive Analytics.

The Census sample data file that you will use to follow the scenarios for Regression/Classification and Segmentation/Clustering is an excerpt from the American Census Bureau database, completed in 1994 by Barry Becker.

This file presents the data on 48,842 individual Americans, of at least 17 years of age. Each individual is characterized by 15 data items. These data, or variables, are described in the following table.

| Variable | Description | Example of Values |
| --- | --- | --- |
| age | Age of individuals | Any numerical value greater than 17 |
| workclass | Employer category of individuals | Private, Self-employed-not-inc, … |
| fnlwgt | Weight variable, allowing each individual to represent a certain percentage of the population | Any numerical value, such as 0, 2341 or 205019 |
| education | Level of study, represented by a schooling level, or by the title of the degree earned | 11th, Bachelors |
| education_num | Number of years of study, represented by a numerical value | A numerical value between 1 and 16 |
| marital_status | Marital status | Divorced, Never-married, … |
| occupation | Job classification | Sales, Handlers-cleaners, … |
| relationship | Position in family | Husband, Wife, … |
| race | Ethnicity | |
| sex | Gender | Male, Female, … |
| capital_gain | Annual capital gains | Any numerical value |
| capital_loss | Annual capital losses | Any numerical value |
| native country | Country of origin | United States, France, … |
| class | Variable indicating whether the salary of the individual is greater or less than $50,000 | “1” if the individual has a salary of greater than $50,000 and “0” if the individual has a salary of less than $50,000 |

With these settings, we will get a scoring equation as SQL for HANA that predicts the probability that the class variable is 1, excluding the “id”, “sex” and “race” variables from the analysis. The settings will also adjust the dataset description.

Click on Send

Congratulations! You have just run the outliers service on the Census dataset.

We can see that 356 records out of the 48,842 are marked as outliers, meaning that the difference between the “predicted value” and the “real value” exceeds the value of the error bar. The list is sorted in descending order, so the records with the highest difference come first.

You can also play with the following parameters and check the differences:

- number of outliers: ask for 10, 50 and 100
- number of reasons: ask for 1, 5 and 10
- skipped variables: exclude “marital_status”
- variable description: for example as an ordinal variable