Provost - Chapter 3

Model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable (the target)

Estimates the value of the target variable as a function of the features

Instance / example

Represents a fact or data point (a row in a database)

Also called a feature vector because it can be represented as a fixed-length ordered collection of feature values

Target Variable

The variable whose values are to be predicted (called the dependent variable in statistics)

Deduction

Starts with general rules and specific facts, and creates other specific facts from them.

Training data

The input data for the induction algorithm (used for inducing the model)

Supervised Segmentation

Trying to segment the population into subgroups that have different values for the target variable

Selecting Informative Attributes

Selecting attributes that would best segment people / things into groups, in a way that will distinguish write-offs from non-write-offs

We want the resulting groups to be pure (homogeneous with respect to the target variable). If every member of a group has the same value for the target, then the group is pure

In practice it is very hard to find an attribute that splits into pure groups.

Complications

Attributes rarely split a group perfectly. (Even if one subgroup is pure, the other may not be)

Sometimes a condition splits the data so that one subset is pure but the other is not. Is that better than another split that doesn't produce a pure subset but increases purity overall?

Not all attributes are binary. (Attributes can have 3 or more values) How do we fairly compare splits that produce different numbers of groups?

Some attributes are numeric (continuous or integer). Segmenting on every distinct value is impractical, so how should the numeric values be segmented?
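On the last complication: a standard way to discretize a numeric attribute is to consider only the midpoints between consecutive distinct sorted values as candidate binary-split thresholds. A minimal sketch (the function name and example data are my own, not from the book):

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values: the usual
    candidate cut points for a binary split on a numeric attribute."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

# A numeric attribute (e.g. age) yields only a few candidate splits,
# not one per row.
ages = [25, 32, 32, 41, 58]
print(candidate_thresholds(ages))  # [28.5, 36.5, 49.5]
```

Each candidate threshold would then be scored with a purity measure, and the best one kept.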

Purity Measure: formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable.

Information Gain: the most common splitting criterion; it is based on a purity measure called entropy. Measures the change in entropy that results from adding new information (e.g. splitting on an attribute).

Entropy: a measure of disorder that can be applied to a set in which each member has exactly one of several properties. (eg. a mixed-up segment with lots of write-offs & non-write-offs would have high entropy)

Entropy = −p1 log(p1) − p2 log(p2) − ⋯, where pi = the probability of property i within the set, ranging from pi = 1 when all members of the set have property i, to pi = 0 when no member of the set has property i.
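The entropy formula and information gain can be sketched in a few lines of Python (the write-off example data below is illustrative, not from the book):

```python
import math

def entropy(labels):
    """Entropy of a set: -sum(p_i * log2(p_i)) over class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Toy example: an attribute that mostly separates write-offs.
parent = ["write-off"] * 5 + ["non-write-off"] * 5
left = ["write-off"] * 4 + ["non-write-off"] * 1
right = ["write-off"] * 1 + ["non-write-off"] * 4
print(entropy(parent))                          # 1.0 (maximally mixed)
print(information_gain(parent, [left, right]))  # ~0.278
```

The attribute with the highest information gain is the most informative one to split on.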

For regression problems: a natural measure of impurity for numeric values is variance
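For a numeric target, variance plays the role entropy plays for a categorical one: a good split reduces the size-weighted variance of the child segments. A minimal sketch (example numbers are illustrative):

```python
def variance(values):
    """Impurity of a numeric target: mean squared deviation from the mean."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

# Weighted variance reduction is the regression analogue of information gain.
parent = [10, 12, 30, 32]
left, right = [10, 12], [30, 32]
n = len(parent)
reduction = variance(parent) - sum(len(s) / n * variance(s) for s in (left, right))
print(reduction)  # 100.0: the split separates the low values from the high ones
```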