Building a univariate decision tree (a single attribute is tested at each node) from a set T of training cases for a concept C with classes C1, ..., Ck.

Consider three possibilities

T contains one or more cases, all belonging to the same class Cj. The decision tree for T is a leaf identifying class Cj.

T contains no cases. The tree is a leaf, but the label is assigned heuristically, e.g. the majority class in the parent of this node


T contains cases from different classes. T is divided into subsets that seem to lead towards single-class collections of cases. A test t based on a single attribute is chosen, and it partitions T into subsets T1, ..., Tn. The decision tree consists of a decision node identifying the tested attribute and one branch for each outcome of the test. Then the same process is applied recursively to each Ti.

Choosing the test

Why not explore all possible trees and choose the simplest (Occam's razor)? Because this is an NP-complete problem; e.g. in the union example there are millions of trees consistent with the data.

Notation: S - the set of training examples; freq(Ci, S) - the number of examples in S that belong to class Ci.

The information content (in bits) of a message is -log2 of the probability of that message.

Idea: maximize the difference between the information needed to identify the class of an example in T and the same information after T has been partitioned in accordance with a test X.

The gain ratio shows the proportion of the information generated by the split that is useful for classification in the example (Witten p. 96): log(k)/log(n).

Maximize the gain ratio.
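
A minimal sketch of the information and gain-ratio computation, assuming only a list of class labels and a candidate partition of them (the names and the example split below are illustrative, not taken from the slides):

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information (entropy, in bits) needed to identify the class."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, partition):
    """Gain = info before the split minus weighted info after it;
    the gain ratio normalizes by the split information."""
    n = len(labels)
    info_after = sum(len(part) / n * info(part) for part in partition)
    gain = info(labels) - info_after
    # split information: entropy of "which branch did the case go to"
    split_info = info([i for i, part in enumerate(partition) for _ in part])
    return gain / split_info if split_info > 0 else 0.0

# Example: 9 'yes' / 5 'no' labels split by a hypothetical 3-valued attribute.
labels = ['yes'] * 9 + ['no'] * 5
partition = [['yes'] * 2 + ['no'] * 3, ['yes'] * 4, ['yes'] * 3 + ['no'] * 2]
print(info(labels), gain_ratio(labels, partition))
```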

Partition of cases and corresponding tree

In fact, learning decision trees with the gain ratio heuristic is a search

Continuous attributes

A simple trick: sort the examples on the values of the attribute considered and choose the midpoint between each two consecutive values. For m values there are m-1 possible splits, but they can be examined linearly.

cost?
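
A short sketch of the trick: the values below are illustrative, and the function simply enumerates the candidate thresholds after sorting.

```python
def candidate_splits(values):
    """Return midpoints between consecutive distinct sorted values (at most m-1)."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(candidate_splits(temperatures))  # 11 candidate thresholds for 12 distinct values
```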


From trees to rules

traversing a decision tree from root to leaf gives a rule, with the path conditions as the antecedent and the leaf as the class

rules can then be simplified by removing conditions that do not contribute to discriminate the nominated class from other classes

rulesets for a whole class are simplified by removing rules that do not contribute to the accuracy of the whole set

partition the set E of all labeled examples (examples with their classification labels) into a training set and a testing set

use the training set for learning, obtain a hypothesis H, set acc = 0

for each element t of the testing set:

apply H to t; if H(t) = label(t) then acc = acc + 1

acc = acc / |testing set|
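
A direct transcription of the accuracy loop above; `hypothesis` stands for any learned classifier (a hypothetical callable) and `test_set` for (example, label) pairs:

```python
def holdout_accuracy(hypothesis, test_set):
    acc = 0
    for x, label in test_set:
        if hypothesis(x) == label:   # apply H on t, compare with label(t)
            acc += 1
    return acc / len(test_set)       # acc <- acc / |testing set|
```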

Testing - contd

Given a dataset, how do we split it between the training set and the test set?

cross-validation (n-fold)

partition E into n groups

choose n-1 groups from n, perform learning on their union

repeat the choice n times

average the n results

usually n = 3, 5, or 10

another approach - learn on all but one example, test that example.

Leave One Out
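
A sketch of n-fold cross-validation as described above; `learn` and `evaluate` are hypothetical stand-ins for the learning algorithm and the accuracy measurement:

```python
def cross_validate(examples, learn, evaluate, n=10):
    folds = [examples[i::n] for i in range(n)]          # partition E into n groups
    scores = []
    for i in range(n):                                  # repeat the choice n times
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        scores.append(evaluate(learn(train), test))     # learn on the union of n-1 groups
    return sum(scores) / n                              # average the n results

# Leave One Out is the special case n = len(examples).
```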

Confusion matrix

                      classifier-determined    classifier-determined
                      positive label           negative label
true positive label           a                        b
true negative label           c                        d

Accuracy = (a + d) / (a + b + c + d)

a - true positives

b - false negatives

c - false positives

d - true negatives


Precision = a / (a + c)

Recall = a / (a + b)

F-measure combines Recall and Precision:

F_beta = (beta^2 + 1) P R / (beta^2 P + R)

beta reflects the importance of Recall versus Precision, e.g. F_0 = P.
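
A small sketch computing these metrics from the confusion-matrix counts a, b, c, d (the counts below are illustrative):

```python
def metrics(a, b, c, d, beta=1.0):
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return accuracy, precision, recall, f_beta

print(metrics(a=40, b=10, c=5, d=45))
```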

Cost matrix

It is like the confusion matrix, except that costs of errors are assigned to the elements outside the diagonal (misclassifications).

this may be important in applications, e.g. when the classifier is a diagnosis rule

see

http://ai.iit.nrc.ca/bibliographies/cost-sensitive.html

for a survey of learning with misclassification costs

Bayesian learning

incremental, noise-resistant method

can combine prior knowledge (the knowledge is probabilistic)

predictions are probabilistic


Bayes' law of conditional probability

results in a simple learning rule: choose the most likely (Maximum A Posteriori) hypothesis.

Example: two hypotheses: (1) the patient has cancer, (2) the patient is healthy.

Priors: 0.8% of the population has cancer.

P(cancer) = .008          P(not cancer) = .992

P(+ | cancer) = .98       P(- | cancer) = .02

P(+ | not cancer) = .03   P(- | not cancer) = .97

We observe a new patient with a positive test. How should they be diagnosed?

P(+ | cancer) P(cancer) = .98 × .008 = .0078

P(+ | not cancer) P(not cancer) = .03 × .992 = .0298

So the MAP hypothesis is: not cancer.
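
The same cancer example as code: pick the maximum a posteriori hypothesis (the denominator P(+) cancels, so unnormalized posteriors suffice):

```python
p_cancer, p_healthy = 0.008, 0.992
p_pos_given_cancer, p_pos_given_healthy = 0.98, 0.03

post_cancer = p_pos_given_cancer * p_cancer        # 0.0078
post_healthy = p_pos_given_healthy * p_healthy     # 0.0298
h_map = 'cancer' if post_cancer > post_healthy else 'not cancer'
print(post_cancer, post_healthy, h_map)            # -> not cancer
```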

Minimum Description Length

Revisiting the definition of hMAP:

hMAP = argmax_h P(D | h) P(h)

we can rewrite it as

hMAP = argmax_h [ log2 P(D | h) + log2 P(h) ]

or

hMAP = argmin_h [ -log2 P(D | h) - log2 P(h) ]

But the first log is the cost of coding the data given the theory, and the second is the cost of coding the theory.

37

Observe that:

for the data, we only need to code the exceptions; the others are correctly predicted by the theory

the MAP principle tells us to choose the theory which encodes the data in the shortest manner

the MDL principle states the trade-off between the complexity of the hypothesis and the number of errors

Bayes optimal classifier

so far, we were looking at the most probable hypothesis, given a priori probabilities. But we really want the most probable classification

this we can get by combining the predictions of all hypotheses, weighted by their posterior probabilities

In NBC, the conditional probabilities are estimated from the training data simply as normalized frequencies: how many times a given attribute value is associated with a given class.

no search!

example

m-estimate


Example (see the Dec. Tree sec. in these notes)

We are trying to predict yes or no for Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong.

P(yes) = 9/14, P(no) = 5/14, P(Wind=strong | yes) = 3/9, P(Wind=strong | no) = 3/5, etc.

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = .0053

P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = .0206

So we will predict no. Compare to 1R!
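
A sketch of the naive Bayes computation above for the play-tennis data; the counts come from the decision-tree section of these notes:

```python
data = {  # class -> attribute-value counts within that class, plus class count n
    'yes': {'n': 9, 'Outlook=sunny': 2, 'Temp=cool': 3, 'Humidity=high': 3, 'Wind=strong': 3},
    'no':  {'n': 5, 'Outlook=sunny': 3, 'Temp=cool': 1, 'Humidity=high': 4, 'Wind=strong': 3},
}
total = 14
query = ['Outlook=sunny', 'Temp=cool', 'Humidity=high', 'Wind=strong']

scores = {}
for cls, counts in data.items():
    score = counts['n'] / total                  # P(class)
    for attr_val in query:
        score *= counts[attr_val] / counts['n']  # P(attribute=value | class)
    scores[cls] = score

print(scores)                         # roughly {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))    # -> 'no'
```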

Further, we not only get a decision, but also the probability of that decision.

We rely on the estimate nc / n for the conditional probability (n = the number of training examples of the class, nc = the number of those with the given attribute value).

If the true conditional probability is very small and n is small too, then nc is likely to be 0. But an estimate of 0 biases the NBC too strongly.

So smooth the estimates (see textbook p. 85).

Instead, we will use the m-estimate (nc + m·p) / (n + m)

where p is the prior estimate of the probability,

m is the equivalent sample size. If we do not know otherwise, p = 1/k for an attribute with k values. m has the effect of augmenting the number of samples of the class.

A large value of m means that the prior p is important relative to the training data when probability estimates are computed; a small value means it is less important.
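
The m-estimate as a one-liner; the example values below are illustrative:

```python
def m_estimate(nc, n, p, m):
    """(nc + m*p) / (n + m): nc successes out of n, prior p, equivalent sample size m."""
    return (nc + m * p) / (n + m)

# E.g. an attribute value never observed in the class (nc = 0), uniform prior p = 1/2, m = 10:
print(m_estimate(0, 9, 0.5, 10))   # pulled toward the prior instead of being 0
```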

Text Categorization

Representations of text are very high dimensional (one feature for each word).

High-bias algorithms that prevent overfitting in high-dimensional space are best.

For most text categorization tasks, there are many irrelevant and many relevant features.

Methods that sum evidence from many or all features (e.g. naïve Bayes, KNN, neural-net) tend to work better than ones that try to isolate just a few relevant features (decision-tree or rule induction).

Naïve Bayes for Text

Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, ..., wm}, based on the probabilities P(wj | ci).

Equivalent to a virtual sample of seeing each word in each category exactly once.

Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci in C:
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj in V:
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)

Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category argmax over ci in C of P(ci) · Π_{i=1..n} P(ai | ci), where ai is the word occurring in the ith position of X

Naïve Bayes Time Complexity

Training time: O(|D| Ld + |C| |V|), where Ld is the average length of a document in D.

Assumes V and all Di, ni, and nij pre-computed in O(|D| Ld) time during one pass through all of the data.

Generally just O(|D| Ld) since usually |C| |V| < |D| Ld.

Test time: O(|C| Lt), where Lt is the average length of a test document.

Very efficient overall, linearly proportional to the time needed to just read in all the data.

Similar to Rocchio time complexity.

Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

Class with highest final un-normalized log probability score is still the most probable.

Naïve Bayes Posterior Probabilities

Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.

Output probabilities are generally very close to 0 or 1.

Textual Similarity Metrics

Measuring similarity of two texts is a well-studied problem.

Standard metrics are based on a bag of words model of a document that ignores word order and syntactic structure.

May involve removing common stop words and stemming to reduce words to their root form.

Vector-space model from Information Retrieval (IR) is the standard approach.

A hyperplane divides the n-dimensional space into two subspaces: one is {x : (w · x) + b > 0}, the other is complementary, {x : (w · x) + b < 0}.


Let's revisit the general classification problem.

We want to estimate an unknown function f; all we know about it is the training set (x1, y1), ..., (xn, yn).

The objective is to minimize the expected error (risk)

R(f) = ∫ l(f(x), y) dP(x, y)

where l is a loss function, e.g. l(f(x), y) = θ(-y f(x)),

and θ(z) = 0 for z < 0 and θ(z) = 1 otherwise.

Since we do not know P, we cannot measure the risk.

We want to approximate the true error (risk) by the empirical error (risk)

R_emp(f) = (1/n) Σ_{i=1..n} l(f(xi), yi).


We know from the PAC theory that conditions can be given on the learning task so that the empirical risk converges towards the true risk

We also know that the difficulty of the learning task depends on the complexity of f (VC dimension)

It is known that the following relationship between the true risk, the empirical risk, and the complexity of the language (h denotes the VC dimension of the class of f)

R(f) ≤ R_emp(f) + sqrt( (h (ln(2n/h) + 1) - ln(δ/4)) / n )

is true with probability at least 1 - δ for n > h.

SRM

Structural Risk Minimization (SRM) chooses the function class F to find a balance between the simplicity of f (a very simple f may result in a large empirical risk) and the empirical risk (a small empirical risk may require a function class with a large h).

Points lying on the margin are called support vectors; w can be constructed efficiently (a quadratic optimization problem).

Basic idea of SVM

Linearly separable problems are easy (quadratic optimization), but of course most problems are not linearly separable.

Take any problem and transform it into a high-dimensional space, so that it becomes linearly separable, but

Calculations to obtain the separability plane can be done in the original input space (kernel trick)

Basic idea of SVM

The original data is mapped into another dot-product space, called the feature space F, via a non-linear map Φ.

Then a linear classifier is applied in F.

Note that the only operations in F are dot products

Consider e.g.

Let's see that Φ geometrically, and that it does what we want it to do: transform a hard classification problem into an easy one, albeit in a higher dimension.
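
A small numeric sketch of that equivalence, using the standard degree-2 monomial map (an assumption; not necessarily the map shown on the slide): the explicit map Φ and the kernel k(x, y) = (x · y)^2 give the same dot product.

```python
import numpy as np

def phi(x):
    """Map (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0, computed in the 3-dimensional feature space
print(poly_kernel(x, y))        # 1.0, same value, computed in the input space
```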

But in general quadratic optimization in the feature space could be very expensive

Consider classifying 16 x 16 pixel pictures, and 5th order monomials

The feature space dimension in this example is O(10^10).

Here we show that the transformation from an ellipsoidal decision boundary to a linear one, requiring a dot product in the feature space, can be performed by a kernel function in the input space. In general, k(x, y) = (x · y)^d computes the feature-space dot product in the input space. Kernels replace computation in the feature space by computation in the input space; in fact, the transformation Φ need not be applied at all when a kernel is used!

Some common kernels used: using different kernels we in fact use different classifiers in the input space: Gaussian, polynomial, 3-layer neural nets, ...

Simplest kernel

is the linear kernel, (w · x) + b

But this only works if the training set is linearly separable. This may not be the case

For the linear kernel, or even

In the feature space

The solution for the non-separable case is to optimize not just the margin, but the margin plus the influence of training errors ξi.

Classification with SVMs

Convert each example x to Φ(x)

Perform the optimal hyperplane algorithm in F; but since we use the kernel, all we need to do is to compute

f(x) = sign( Σ_i αi yi k(xi, x) + b )

where (xi, yi) are training instances and the αi are computed as the solution of the quadratic programming problem.
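
A minimal sketch using scikit-learn (an assumption; the library is not part of these notes) to fit a kernel SVM and read off the support vectors and the dual coefficients αi·yi that appear in the decision function above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
print(clf.support_vectors_)      # the x_i that end up as support vectors
print(clf.dual_coef_)            # the corresponding alpha_i * y_i
print(clf.decision_function(X))  # sum_i alpha_i y_i k(x_i, x) + b
print(clf.predict([[0.9, 0.1]]))
```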

Bootstrapping using the system on images with no faces and storing false positives to use as negative examples in later training

Performance on 2 test sets: set A has 313 high-quality images with 313 faces, set B has 23 images with 155 faces. This results in >4M frames for A and >5M frames for B. The SVM achieved recall of 97% on A and 74% on B, with 4 and 20 false positives, respectively.

SVM in text classification

Impractical (too many features) for larger substrings and documents, but the kernel using such features can be calculated efficiently (the substring kernel, SSK): it maps strings (a whole document) to a feature vector indexed by all k-tuples.


The value of a feature is the sum, over the occurrences of the k-tuple, of a decay factor raised to the length of the occurrence.

Definition of SSK: Σ is an alphabet; a string is a finite sequence of elements of Σ; |s| is the length of s; s[i:j] is a substring of s. u is a subsequence of s if there exist indices i = (i1, ..., i|u|) with 1 ≤ i1 < ... < i|u| ≤ |s| such that uj = s_{ij} for j = 1, ..., |u| (u = s[i] for short).

The length l(i) of the subsequence in s is i|u| - i1 + 1 (its span in s).

The feature space mapping φ for s is defined by

φ_u(s) = Σ_{i : u = s[i]} λ^{l(i)}

for each u ∈ Σ^n (the set of all finite strings of length n); features measure the number of occurrences of subsequences in s, weighted by their lengths (λ ≤ 1).

The kernel can be evaluated in O(n |s| |t|) time (see the Lodhi paper).
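
A brute-force sketch of the SSK feature map and kernel for tiny strings (exponential in the string length; real implementations use the O(n |s| |t|) dynamic program from the Lodhi paper):

```python
from itertools import combinations
from collections import defaultdict

def ssk_features(s, n, lam=0.5):
    """phi_u(s) = sum over index tuples i with s[i] = u of lam**l(i)."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = ''.join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1            # l(i): span of the occurrence in s
        feats[u] += lam ** span
    return feats

def ssk(s, t, n, lam=0.5):
    fs, ft = ssk_features(s, n, lam), ssk_features(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items())

print(ssk('cat', 'cart', n=2))
```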

Experimental results with SSK

The method is NOT fast, so a subset of Reuters (n = 470/380) was used, and only 4 classes: corn, crude, earn, acquisition.

Compared to the BOW representation (see earlier in these notes) with stop words removed, features weighted by tf-idf = log(1 + tf) · log(n/df).

F1 was used for the evaluation, C set experimentally

Best k is between 4 and 7

Performance comparable to a classifier based on k-grams (contiguous), and also BOW

λ controls the penalty for gaps in substrings; best precision for high λ = 0.7. This seems to result in a high similarity score for docs that share the same but semantically different words - WHY?

Results on full Reuters are not as good as with BOW or k-grams; the conjecture is that the kernel performs something similar to stemming, which is less important on large datasets where there is enough data to learn the sameness of different inflections.

Classification task: does a sequence window around the ATG indicate a TIS (translation initiation site)?

Each nucleotide is encoded by 5 bits, exactly one of which is set to 1, indicating whether the nucleotide is A, C, G, T, or unknown. So the dimension n of the input space is 1000 for a window of size 100 to the left and right of the ATG sequence.

Positive and negative windows are provided as the training set.

This representation is typical for the kind of problem where SVMs do well
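
A sketch of the 5-bit-per-nucleotide encoding described above; the window content is illustrative:

```python
ALPHABET = 'ACGTN'   # N = unknown

def encode(sequence):
    """One bit set per nucleotide -> a vector of length 5 * len(sequence)."""
    vec = []
    for nt in sequence:
        bits = [0] * 5
        bits[ALPHABET.index(nt if nt in ALPHABET else 'N')] = 1
        vec.extend(bits)
    return vec

window = 'ACGTNACG' * 25          # a 200-nucleotide window around the ATG
print(len(encode(window)))        # 1000-dimensional input, as on the slide
```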


What is a good feature space for this problem? How about including in the kernel some prior domain knowledge? E.g.:

Dependencies between distant positions are not important or are known not to exist

Compare, at each sequence position, two sequences locally in a window of size 2l+1 around that position, with decreasing weight away from the centre of the window.

where d1 is the order of importance of local (within-window) correlations, and the match term is 1 for matching nucleotides at position p+j, 0 otherwise.

110

Window scores are summed over the length of the sequence, and correlations between up to d2 windows are taken into account

Also, it is known that the region downstream of the TIS is a CDS (coding sequence), and a CDS shifted by 3 nucleotides is still a CDS.

Not a learning technique on its own, but a method in which a family of weak learners (simple learners) is used for learning.

Based on the fact that multiple classifiers that disagree with one another can together be more accurate than their component classifiers.

If there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 hypotheses being wrong.
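
That binomial tail can be computed directly; the numbers below are illustrative:

```python
from math import comb

def majority_error(L, eps):
    """Probability that more than L/2 of L independent classifiers (error rate eps) are wrong."""
    return sum(comb(L, k) * eps**k * (1 - eps)**(L - k) for k in range(L // 2 + 1, L + 1))

print(majority_error(21, 0.3))   # far below the individual error rate of 0.3
```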

Boosting as an ensemble of learners

The very idea: focus on difficult parts of the example space.

Train a number of classifiers.

Combine their decisions in a weighted manner.


Bagging (Breiman) learns multiple hypotheses from different subsets of the training set, and then takes a majority vote. Each sample is drawn randomly with replacement (a bootstrap). Each bootstrap contains, on average, 63.2% of the training set.

boosting is a refinement of bagging, where the sample is drawn according to a distribution, and that distribution emphasizes the misclassified examples. Then a vote is taken.


Let's make sure we understand the makeup of the final classifier.


AdaBoost (Adaptive Boosting) uses the probability distribution. Either the learning algorithm uses it directly, or the distribution is used to produce the sample.
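
A schematic AdaBoost loop, following the standard discrete AdaBoost formulation rather than a transcription of these slides; `weak_learn` is a hypothetical learner that accepts per-example weights, and labels are assumed to be in {-1, +1}:

```python
import numpy as np

def adaboost(X, y, weak_learn, rounds=10):
    n = len(y)
    D = np.full(n, 1.0 / n)                      # the probability distribution over examples
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = weak_learn(X, y, D)                  # learner uses D directly (or via resampling)
        pred = h(X)
        err = max(np.sum(D[pred != y]), 1e-10)   # weighted error (guard against 0)
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        D *= np.exp(-alpha * y * pred)           # emphasize the misclassified examples
        D /= D.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    # final classifier: a weighted vote of the weak hypotheses
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
```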

Microarrays give us information about the rate of protein production of a gene during an experiment. These technologies give us a lot of information.

Analyzing microarray data tells us how the gene's protein production evolves.

Each data point represents log expression ratio of a particular gene under two different experimental conditions. The numerator of each ratio is the expression level of the gene in the varying condition, whereas the denominator is the expression level of the gene in some reference condition. The expression measurement is positive if the gene expression is induced with respect to the reference state and negative if it is repressed. We use those values as derivatives.

Find all itemsets that have transaction support > minsup (the large itemsets).

Associations - mining

To do that, start with individual items with large support.

In each next step, k:

use the itemsets from step k-1 to generate the new candidate itemsets Ck,

count the support of Ck (by counting the candidates which are contained in any transaction t),

prune the ones that are not large.

Associations - mining

Only keep those that are contained in some transaction.

Candidate generation

Ck = apriori-gen(Lk-1)

From large itemsets to association rules

Subset function

Subset(Ck, t) checks which itemsets of Ck are in a transaction t. It is done via a tree structure through a series of hashings: hash Ck on every item in t (itemsets not containing anything from t are ignored); if you got here by hashing item i of t, hash on all following items of t; at a leaf, check whether the itemset is contained in the transaction.

Example

L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }

C4 = { {1 2 3 4}, {1 3 4 5} }

Pruning deletes {1 3 4 5} because {1 4 5} is not in L3.
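
A sketch of apriori-gen reproducing this example; the join step here is a looser "any union of size k" than the prefix-based join in Agrawal's paper, but the prune step yields the same final candidates:

```python
from itertools import combinations

def apriori_gen(L_prev):
    L_prev = set(map(frozenset, L_prev))
    k = len(next(iter(L_prev))) + 1
    # join step: unions of two (k-1)-itemsets that produce a k-itemset
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # prune step: every (k-1)-subset of a candidate must itself be large
    return [c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))]

L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(apriori_gen(L3))   # only {1, 2, 3, 4} survives; {1, 3, 4, 5} is pruned ({1, 4, 5} not in L3)
```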

See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details.

DM result evaluation

Accuracy

ROC

lift curves

cost

but also INTERPRETABILITY

Feature Selection (sec. 7.1 in Witten, Frank)

Attribute-vector representation: the coordinates of the vector are referred to as attributes or features.

Curse of dimensionality: learning is search, and the search space increases drastically with the number of attributes.

Theoretical justification: we know from PAC theorems that this increase is exponential (discussed e.g. at slide 70).

Practical justification: with divide-and-conquer algorithms the partition sizes decrease, and at some point irrelevant attributes may be selected.

The task: find a subset of the original attribute set such that the classifier performs at least as well on this subset as on the original set of attributes.

Some foundations

We are in the classification setting; the Xi are the attributes and Y is the class. We can define the relevance of features with respect to the Optimal Bayes Classifier (OBC).

Let S be a subset of features, and X a feature not in S

X is strongly relevant if removal of X alone deteriorates the performance of the OBC.

X is weakly relevant if it is not strongly relevant AND the performance of the OBC on S ∪ {X} is better than on S.

Three main approaches

Manually - often infeasible.

Filters - use the data alone, independent of the classifier that will be used on this data (aka scheme-independent selection).

Wrappers - the FS process is wrapped around the classifier that will be used on the data.

Filters - discussion

Find the smallest attribute set on which all the instances are distinct. Problem: the cost, if exhaustive search is used.

But learning and FS are related: in a way, the classifier already includes the good (separating) attributes. Hence the idea:

Use one classifier for FS, then another on the results. E.g. use a DT and pass the data on to NB, or use 1R for a DT.

Filters contd: RELIEF (Kira, Rendell)

Initialize the weights of all attributes to 0.

Sample instances and check the similar ones.

Determine pairs which are in the same class (near hits) and in different classes (near misses).

For each hit, identify attributes with different values. Decrease their weight

For each miss, attributes with different values have their weight increased.

Repeat the sample selection and weighting (steps 2-5) many times.

Keep only the attrs with positive weight

Discussion: high variance unless the number of samples is very high.

Deterministic RELIEF: use all instances and all hits and misses.
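
A schematic RELIEF weight update; the distance measure and data layout are simplifications for illustration, not the exact Kira & Rendell formulation:

```python
import numpy as np

def relief(X, y, n_samples=100, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)                                      # 1. initialize all weights to 0
    for _ in range(n_samples):                           # 2. sample instances
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf
        same, diff = y == y[i], y != y[i]
        hit = np.argmin(np.where(same, dists, np.inf))   # nearest same-class instance
        miss = np.argmin(np.where(diff, dists, np.inf))  # nearest other-class instance
        w -= (X[i] != X[hit]).astype(float)              # 3. hits: decrease differing attrs
        w += (X[i] != X[miss]).astype(float)             # 4. misses: increase differing attrs
    return w                                             # keep attributes with w > 0
```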

A different approach

View attribute selection as a search in the space of all attributes

Search needs to be driven by some heuristic (evaluation criterion)

This could be some measure of the discrimination ability of the result of search, or

Cross-validation, on the part of the training set put aside for that purpose. This means that the classifier is wrapped in the FS process, hence the name wrapper (scheme-specific selection)


Greedy search example

A single attribute is added (forward) or deleted (backward)

Could also be done as best-first search or beam search, or some randomized (e.g. genetic) search

Wrappers

Computationally expensive (k-fold xval at each search step)

backward selection often yields better accuracy

x-val is just an optimistic estimation that may stop the search prematurely

in backward mode attr sets will be larger than optimal

Forward mode may result in better comprehensibility

Experimentally FS does particularly well with NB on data on which NB does not do well

NB is sensitive to redundant and dependent (!) attributes

Forward selection with training set performance does well Langley and Sage 94

Discretization

Getting away from numerical attrs

We know it from DTs, where numerical attributes were sorted and splitting points between each two values were considered

Global (independent of the classifier) and local (different results in ea tree node) schemes exist

What is the result of discretization? A value of a nominal attribute.

Ordering information could be used if the discretized attribute with k values is converted into k-1 binary attributes: the (i-1)th attribute being true represents the fact that the value is < i.

Supervised and unsupervised discretization

Unsupervised discretization

Fixed-length intervals (equal-interval binning), e.g. of width (max - min)/k.

How do we know k?

May distribute instances unevenly

Variable-length intervals, each containing the same number of instances (equal-frequency binning, or histogram equalization).
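
A sketch of both unsupervised schemes on a small numeric attribute (values and k are illustrative):

```python
import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85])
k = 3

# equal-interval binning: k bins of width (max - min) / k
width = (values.max() - values.min()) / k
equal_interval = np.minimum(((values - values.min()) // width).astype(int), k - 1)

# equal-frequency binning: each bin gets (roughly) the same number of instances
equal_freq = np.searchsorted(np.quantile(values, [1/3, 2/3]), values)

print(equal_interval)   # may distribute instances unevenly
print(equal_freq)
```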


Supervised discretization

Example of Temperature attribute in the play/dont play data

A recursive algorithm using the information measure. We go for the cut point with the lowest information (cleanest subsets).

Supervised discretization contd

What's the right stopping criterion?

How about MDL? Compare the information needed to transmit the label of each instance before the split with the cost of the split point (log2(N-1) bits) plus the information for the points below and the information for the points above.

Each instance costs 1 bit before the split, and slightly more than 0 bits after the split.

This is the Irani, Fayyad 93 method

Error-correcting Output Codes (ECOC)

Method of combining classifiers from a two-class problem to a k-class problem

Often when working with a k-class problem k one-against-all classifiers are learned, and then combined using ECOC

Consider a 4-class problem, and suppose that there are 7 classifiers and the classes are coded as follows:

Suppose an instance of class a is classified as 1011111 (a mistake by the 2nd classifier).

But this classification is still the closest to the codeword of class a in terms of edit (Hamming) distance. Also note that the class encodings in column 1 are not error-correcting.
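
A sketch of ECOC decoding by Hamming distance; the codewords below are the standard 7-bit code commonly used for this 4-class example (an assumption, since the slide's table is not reproduced here):

```python
codes = {
    'a': '1111111',
    'b': '0000111',
    'c': '0011001',
    'd': '0101010',
}

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

output = '1011111'   # the 2nd classifier made a mistake
print(min(codes, key=lambda c: hamming(codes[c], output)))   # -> 'a'
```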

cannot really be expressed in AV representation, but is very easy in relational representation

linked-to = { <0,1>, <0,3>, <1,2>, ..., <7,8> }

can-reach(X,Y) :- linked-to(X,Z), can-reach(Z,Y).

Another example of recursive learning

E+

boss(mary,john). boss(phil,mary).boss(phil,john).

E-

boss(john,mary). boss(mary,phil). boss(john,phil).

BK

employee(john, ibm). employee(mary,ibm). employee(phil,ibm).

reports_to_imm(john,mary). reports_to_imm(mary,phil).

h: boss(X,Y) :- employee(X,O), employee(Y,O), reports_to(Y,X).

reports_to(X,Y) :- reports_to_imm(X,Z), reports_to(Z,Y).

reports_to(X,X).

How is learning done? The covering algorithm

Initialize the training set T.

While the global training set contains examples: find a clause that describes part of the relationship Q; remove the examples covered by this clause.

Finding a clause:

initialize the clause to Q(V1, ..., Vk) :- ; while T contains examples,

find a literal L to add to the right-hand side of the clause.

Finding a literal: greedy search.


The find-a-clause loop describes a search.

We need to structure the search space:

generality - semantic and syntactic;

since logical generality is not decidable, we use a stronger property, θ-subsumption;

then search from general to specific (refinement).

Refinement

Heuristics link to he


For a small fee you can get the industry's best online privacy or publicly promote your presentations and slide shows with top rankings. But aside from that it's free. We'll even convert your presentations and slide shows into the universal Flash format with all their original multimedia glory, including animation, 2D and 3D transition effects, embedded music or other audio, or even video embedded in slides. All for free. Most of the presentations and slideshows on PowerShow.com are free to view, many are even free to download. (You can choose whether to allow people to download your original PowerPoint presentations and photo slideshows for a fee or free or not at all.) Check out PowerShow.com today - for FREE. There is truly something for everyone!