Classification WM

Course Roadmap:
- Introduction to data mining: introduction, preprocessing, clustering, association rules, classification, …
- Web structure mining
  - Authoritative Sources in a Hyperlinked Environment, by Jon M. Kleinberg
  - Efficient Crawling Through URL Ordering, by J. Cho, H. Garcia-Molina and L. Page
  - The PageRank Citation Ranking: Bringing Order to the Web
  - The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page
  - Graph Structure in the Web, by Andrei Broder et al.
  - Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, by Soumen Chakrabarti et al.
  - Trawling the Web for Emerging Cyber-Communities, by Ravi Kumar…
  - Building a Cyber-Community Hierarchy Based on Link Analysis, by P. Krishna Reddy and Masaru Kitsuregawa
  - Efficient Identification of Web Communities, by G. W. Flake, S. Lawrence, et al.
  - Finding Related Pages in WWW, by J. Dean and M. R. Henzinger
- Web content mining / information retrieval
- Web log mining / recommendation systems / e-commerce
Classification and Prediction:
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Summary
Classification vs. Prediction:
- Classification
  - Predicts categorical class labels
  - Constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data
- Prediction
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Treatment effectiveness analysis
Classification - A Two-Step Process:
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model first
    - The known label of each test sample is compared with the class assigned by the model
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise overfitting goes undetected and the accuracy estimate is too optimistic
Classification Process (1): Model Construction:
- A classification algorithm derives the classifier (model) from the training data
- Example model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction:
- Unseen data: (Jeff, Professor, 4). Tenured?
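The two steps can be sketched in a few lines of Python. This is only an illustration: the rule is the one shown on the model-construction slide, while the function names and the small labeled test set are hypothetical and not taken from the slides.

```python
def is_tenured(rank: str, years: int) -> str:
    """Example model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(model, labeled_samples) -> float:
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(
        1 for rank, years, label in labeled_samples if model(rank, years) == label
    )
    return correct / len(labeled_samples)


# Hypothetical, made-up test set (rank, years, known label); it must be
# independent of whatever data was used to build the rule.
test_set = [
    ("assistant professor", 2, "no"),
    ("associate professor", 7, "no"),
    ("professor", 5, "yes"),
    ("assistant professor", 8, "yes"),
]

# Step 2a: estimate accuracy on the independent test set.
print(accuracy(is_tenured, test_set))  # 0.75 (the second sample is misclassified)

# Step 2b: classify the unseen tuple (Jeff, Professor, 4).
print(is_tenured("Professor", 4))  # yes
```

Under this rule Jeff is classified as tenured because his rank is professor, even though his 4 years alone would not satisfy the second condition.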
Supervised vs. Unsupervised Learning:
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Chapter 7. Classification and Prediction:
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary
Data Preparation:
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data
Evaluating Classification Methods:
- Predictive accuracy
- Speed: time to construct the model and time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size, compactness of classification rules
Classification by Decision Tree Induction:
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction: at the start, all training examples are at the root; examples are then partitioned recursively based on selected attributes
  - Tree pruning: identify and remove branches that reflect noise or outliers
- Use of a decision tree: classify an unknown sample by testing its attribute values against the tree
Training Dataset: [table omitted]
Output: A Decision Tree for “buys_computer”: [figure omitted]
Algorithm for Decision Tree Induction:
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is used to label the leaf
  - There are no samples left
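The greedy procedure above can be sketched as a short recursive function. This is a minimal illustration, not the exact algorithm behind the slides: it assumes categorical attributes stored in dictionaries, uses information gain as the selection measure, and runs on a tiny made-up data set. All function and attribute names are hypothetical. Because the sketch only creates branches for attribute values that actually occur in the current partition, the "no samples left" case never arises here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a sample from `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on attribute `attr`."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left, so label the leaf by majority vote.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attribute": best, "branches": {}}
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


def classify(tree, row):
    """Walk from the root to a leaf, testing one attribute per internal node."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical toy data: `student` separates the classes perfectly.
rows = [
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
]
labels = ["no", "no", "yes", "yes"]

tree = build_tree(rows, labels, ["student", "credit"])
print(tree)
print(classify(tree, {"student": "yes", "credit": "fair"}))
```

The nested-dictionary representation is a design choice made for brevity; a real implementation would also need pruning and handling of attribute values unseen during training.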
Attribute Selection Measure:
- Information gain
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index
  - All attributes are assumed to be continuous-valued
  - Assumes there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
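The slide names the Gini index without giving its formula. A common definition, used in the sketch below, is gini(S) = 1 - Σ p_j², where p_j is the relative frequency of class j in S, and a binary split is scored by the size-weighted Gini of its two sides. The candidate split values (midpoints between consecutive distinct values), the function names, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class frequencies."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_split(left_labels, right_labels):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini(left_labels)
        + len(right_labels) / n * gini(right_labels)
    )


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, score)."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical continuous attribute and class labels.
ages = [22, 25, 35, 38, 45, 50]
buys = ["no", "no", "yes", "yes", "yes", "yes"]
print(best_threshold(ages, buys))  # (30.0, 0.0): splitting at 30 gives two pure subsets
```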
Information Gain:
- Select the attribute with the highest information gain.
- Let $S$ be a set consisting of $s$ data samples. Suppose the class label attribute has $m$ distinct values defining $m$ distinct classes $C_i$ (for $i = 1, \dots, m$). Let $s_i$ be the number of samples of $S$ in class $C_i$.
- The expected information needed to classify a given sample is
  $$I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i,$$
  where $p_i$ is the probability that an arbitrary sample belongs to class $C_i$, estimated by $s_i / s$.
- Let attribute $A$ have $v$ distinct values $\{a_1, a_2, \dots, a_v\}$. Attribute $A$ can be used to partition $S$ into $v$ subsets $\{S_1, S_2, \dots, S_v\}$, where $S_j$ contains those samples in $S$ that have value $a_j$ of $A$.
- If $A$ were selected as the test attribute (i.e., the best attribute for splitting), these subsets would correspond to the branches grown from the node containing the set $S$.
Information Gain:
- Let $s_{ij}$ be the number of samples of class $C_i$ in subset $S_j$. The entropy, or expected information based on the partitioning into subsets by $A$, is
  $$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s}\, I(s_{1j}, s_{2j}, \dots, s_{mj}).$$
- The smaller the entropy value, the greater the purity of the subset partitions.
- For a given subset $S_j$,
  $$I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}, \qquad p_{ij} = \frac{s_{ij}}{|S_j|},$$
  where $p_{ij}$ is the probability that a sample in $S_j$ belongs to class $C_i$.
- The encoding information that would be gained by branching on $A$ is
  $$\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A).$$
- $\mathrm{Gain}(A)$ is the expected reduction in entropy caused by knowing the value of attribute $A$.
- The algorithm computes the information gain of every attribute. The attribute with the highest information gain is chosen as the test attribute.
Information Gain (two-class case):
- Select the attribute with the highest information gain.
- Assume there are two classes, $P$ and $N$.
- Let the set of examples $S$ contain $p$ elements of class $P$ and $n$ elements of class $N$.
- The amount of information needed to decide whether an arbitrary example in $S$ belongs to $P$ or $N$ is defined as
  $$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}.$$
Information Gain in Decision Tree Induction:
- Assume that using attribute $A$, a set $S$ will be partitioned into sets $\{S_1, S_2, \dots, S_v\}$.
- If $S_i$ contains $p_i$ examples of $P$ and $n_i$ examples of $N$, the entropy, or the expected information needed to classify objects in all subtrees $S_i$, is
  $$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i).$$
- The encoding information that would be gained by branching on $A$ is
  $$\mathrm{Gain}(A) = I(p, n) - E(A).$$
Attribute selection: example:
- The class label buys_computer has two values: yes and no. Let C1 = yes and C2 = no; there are 9 samples for yes and 5 samples for no.
- Expected information needed to classify a sample: I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- Compute the entropy of each attribute. Take "age": look at the distribution of yes and no samples for each value of age, and compute the expected information for each of these distributions.
  - For age = "<=30": s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - For age = "31…40": s12 = 4, s22 = 0, I(s12, s22) = 0
  - For age = ">40": s13 = 3, s23 = 2, I(s13, s23) = 0.971
Attribute selection: example (continued):
- The entropy, or expected information needed to classify a given sample if the samples are partitioned according to age, is E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = 0.694
- So the gain in information from such a partitioning is Gain(age) = I(s1, s2) - E(age) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
- Since age has the highest information gain among the attributes, it is selected as the test attribute.
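The arithmetic in this example can be checked with a short script. It uses only the class counts quoted on the slides (9 yes / 5 no overall, and 2/3, 4/0, 3/2 for the three age groups); the function names are illustrative.

```python
from math import log2


def info(*counts):
    """I(s1, ..., sm): expected information in bits for the given class counts."""
    total = sum(counts)
    # Zero counts contribute nothing (the limit of p*log2(p) as p -> 0 is 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """E(A): size-weighted information over the subsets induced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


class_counts = (9, 5)                      # (yes, no) over all 14 samples
age_partitions = [(2, 3), (4, 0), (3, 2)]  # (yes, no) for <=30, 31…40, >40

i_s = info(*class_counts)
e_age = expected_info(age_partitions)
gain_age = i_s - e_age

print(f"I(9,5)    = {i_s:.3f}")
print(f"E(age)    = {e_age:.3f}")
print(f"Gain(age) = {gain_age:.3f}")
```

Computed at full precision the gain rounds to 0.247; the slide's 0.246 is what results from subtracting the already rounded values 0.940 and 0.694.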
Extracting Classification Rules from Trees:
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  - IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  - IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  - IF age = “31…40” THEN buys_computer = “yes”
  - IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
  - IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
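Rule extraction amounts to enumerating root-to-leaf paths. The sketch below assumes a tree stored as nested dictionaries (a hypothetical representation, not one given in the slides) and encodes the tree implied by the five example rules above.

```python
def extract_rules(tree, target="buys_computer", conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the attribute-value pairs collected on the path form a conjunction.
        antecedent = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {antecedent} THEN {target} = "{tree}"']
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(
            subtree, target, conditions + ((tree["attribute"], value),)
        )
    return rules


# Tree implied by the example rules on the slide (nested-dict encoding assumed).
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "yes", "fair": "no"},
        },
    },
}

for rule in extract_rules(tree):
    print(rule)
```

The sketch assumes the root is an internal node; a tree consisting of a single leaf would need a special case for the empty antecedent.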
Avoid Overfitting in Classification:
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy on unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”.
Approaches to Determine the Final Tree Size:
- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross validation
- Use all the data for training, but apply a statistical test to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle
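As a rough illustration of k-fold cross validation, the sketch below splits sample indices into k folds and averages a score over them. It assumes the data has already been shuffled; `train_and_score` is a hypothetical callback that would train a tree on the training indices and return its accuracy on the held-out fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k nearly equal folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by `train_and_score(train, test)` over k folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Each sample is held out exactly once across the folds.
for train, test in k_fold_indices(14, 10):
    print(len(train), test)
```

When choosing a tree size, the same loop would be run once per candidate size (or pruning level) and the size with the best average score kept.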
Enhancements to basic decision tree induction:
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases:
- Classification is a classical problem, extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - Relatively fast learning speed compared with other classification methods
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies:
- SLIQ (EDBT’96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96, J. Shafer et al.): constructs an attribute list data structure
- PUBLIC (VLDB’98, Rastogi & Shim): integrates tree splitting and tree pruning, so that tree growth stops earlier
- RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Presentation of Classification Results: [figure omitted]
Bayesian Classification: Why?:
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
Derivation of Bayes' theorem:
- Definition: the conditional probability that A is true given B is written $P(A \mid B)$ (read "probability of A given B"):
  $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad \text{or} \qquad P(A \cap B) = P(A \mid B)\,P(B).$$
- Bayes' theorem: since $P(A \cap B) = P(B \cap A)$ and $P(B \cap A) = P(B \mid A)\,P(A)$, we have
  $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
- Presented in terms of a theory $T$ and evidence $E$:
  $$P(T \mid E) = \frac{P(E \mid T)\,P(T)}{P(E)}.$$
- This can be generalized to the probability of a particular theory $T_k$ out of a collection of alternatives $T_1, T_2, \dots, T_n$:
  $$P(T_k \mid E) = \frac{P(E \mid T_k)\,P(T_k)}{\sum_{i=1}^{n} P(E \mid T_i)\,P(T_i)}.$$
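The generalized form can be illustrated numerically. The priors and likelihoods below are made-up values for three hypothetical theories, chosen only to show how the denominator normalizes the posteriors so that they sum to 1.

```python
def posterior(priors, likelihoods):
    """P(T_k | E) = P(E | T_k) P(T_k) / sum_i P(E | T_i) P(T_i) for every theory."""
    joint = {t: priors[t] * likelihoods[t] for t in priors}
    evidence = sum(joint.values())  # P(E), by the law of total probability
    return {t: j / evidence for t, j in joint.items()}


# Hypothetical numbers: P(T_i) and P(E | T_i) for three alternative theories.
priors = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
likelihoods = {"T1": 0.1, "T2": 0.4, "T3": 0.5}

post = posterior(priors, likelihoods)
for theory, p in post.items():
    print(theory, round(p, 3))
```

With these values T1 has the largest prior but T2 ends up with the largest posterior, because the evidence is four times more likely under T2.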
Bayesian Classification: Simple introduction:
"The essence of the Bayesian approach is to provide a mathematical rule explaining how you should change your existing beliefs in the light of new evidence. In other words, it allows scientists to combine new data with their existing knowledge or expertise. The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child's degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise."
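The marble bookkeeping in this story is easy to reproduce. The sketch below tracks the fraction of white marbles after each observed sunrise using exact fractions; the function name and parameters are illustrative.

```python
from fractions import Fraction


def belief_after_sunrises(sunrises, white=1, black=1):
    """Fraction of white marbles in the bag after each observed sunrise."""
    beliefs = [Fraction(white, white + black)]  # prior: one white, one black
    for _ in range(sunrises):
        white += 1  # each sunrise adds one white marble
        beliefs.append(Fraction(white, white + black))
    return beliefs


print(belief_after_sunrises(2))    # [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)]
print(float(belief_after_sunrises(365)[-1]))  # close to 1 after a year of sunrises
```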
Bayesian Classification: Simple introduction…:
- Suppose your data consist of fruits, described by their color and shape. Bayesian classifiers operate by saying "If you see a fruit that is red and round, which type of fruit is it most likely to be, based on the observed data sample? In future, classify red and round fruit as that type of fruit."
- A difficulty arises when you have more than a few variables and classes: you would require an enormous number of observations (records) to estimate these probabilities.
- Naive Bayes classification gets around this problem by not requiring lots of observations for each possible combination of the variables. Rather, the variables are assumed to be independent of one another. Therefore the probability that a fruit that is red, round, firm, 3" in diameter, etc. will be an apple can be calculated from the independent probabilities that a fruit is red, that it is round, that it is firm, that it is 3" in diameter, etc.
- In other words, Naive Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. This assumption is called class conditional independence. It is made to simplify the computation and in this sense is considered to be naïve.
Bayesian Classification: Simple introduction…:
- Class conditional independence is a fairly strong assumption and often does not hold. However, bias in the estimated probabilities often makes no difference in practice: it is the order of the probabilities, not their exact values, that determines the classification.
- Studies comparing classification algorithms have found the Naïve Bayesian classifier to be comparable in performance with classification trees and with neural network classifiers. Naïve Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
Bayes Theorem:
- Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C."
- For classification, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data record X.
- P(H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round.
- In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks.
- The posterior probability P(H|X) is based on more information (such as background knowledge) than the prior probability P(H), which is independent of X.
Bayesian Classification: Simple introduction…:
- Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know X is an apple.
- P(X) is the prior probability of X, i.e., the probability that a data record from our set of fruits is red and round.
- Bayes theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X), and P(X|H):
  P(H|X) = P(X|H)·P(H) / P(X)
Bayesian Theorem:
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes theorem:
  P(h|D) = P(D|h)·P(h) / P(D)
- Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost
Naïve Bayes Classifier (I):
- A simplifying assumption: attributes are conditionally independent given the class:
  P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- This greatly reduces the computation cost: only the class distribution needs to be counted.
Naive Bayesian Classifier (II):
- Given a training set, we can compute the probabilities [table omitted]
Bayesian classification:
- The classification problem may be formalized using a-posteriori probabilities: P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C
- E.g. P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities:
- Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- The C for which P(C|X) is maximum is the C for which P(X|C)·P(C) is maximum
- Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification:
- Naïve assumption: attribute independence
  P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as i-th attribute in class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases
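For the continuous case, one way to read "estimated through a Gaussian density function" is to fit a mean and standard deviation per class and evaluate the normal density at the observed value. The sketch below assumes that reading; it uses the sample standard deviation (n - 1 in the denominator) as a design choice, and the temperature readings are made up.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Normal density with mean `mu` and standard deviation `sigma`, evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def fit_gaussian(values):
    """Estimate (mean, sample standard deviation) from the values of one class."""
    return mean(values), stdev(values)


# Hypothetical temperature readings for the samples of one class C.
temps_in_class = [64.0, 68.0, 70.0, 72.0, 75.0]
mu, sigma = fit_gaussian(temps_in_class)

# Density used in place of P(x_i | C) for a new sample with temperature 66.
print(gaussian_pdf(66.0, mu, sigma))
```

The value returned is a density rather than a probability, which is fine for classification because only the comparison between classes matters.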
Play-tennis example: estimating P(xi|C): [table omitted]
Play-tennis example: classifying X:
- An unseen sample X = <rain, hot, high, false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
- Sample X is classified in class n (don’t play)
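The two products can be reproduced with exact fractions. The conditional probabilities and priors below are the ones quoted on the slide; the dictionary layout and the attribute names used as keys (outlook, temperature, humidity, windy) are assumptions made for readability.

```python
from fractions import Fraction as F
from math import prod

# P(x_i | C) for X = <rain, hot, high, false>, as quoted on the slide.
cond = {
    "p": {"outlook=rain": F(3, 9), "temperature=hot": F(2, 9),
          "humidity=high": F(3, 9), "windy=false": F(6, 9)},
    "n": {"outlook=rain": F(2, 5), "temperature=hot": F(2, 5),
          "humidity=high": F(4, 5), "windy=false": F(2, 5)},
}
prior = {"p": F(9, 14), "n": F(5, 14)}

# Naive Bayes score: P(X | C) * P(C), with P(X | C) as a product over attributes.
scores = {c: prior[c] * prod(cond[c].values()) for c in prior}
best = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, f"{float(s):.6f}")
print("predicted class:", best)
```

Dividing each score by their sum would give the normalized posteriors, but the ranking, and therefore the predicted class, is the same either way.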
The independence hypothesis…:
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first
Chapter 7. Classification and Prediction:
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Other classification methods
- Summary
Other classification methods:
- Neural networks
- Association rules
- K-nearest neighbor classifiers
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches
Summary:
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications, so combining classification with database techniques is a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data