CS340 Machine learning
Lecture 4: Learning theory
Some slides are borrowed from Sebastian Thrun and Stuart Russell

Announcement
• What: Workshop on applying for NSERC scholarships and for entry to graduate school
• When: Thursday, Sept 14, 12:30-14:00
• Where: DMP 110
• Who: All Computer Science undergraduates expecting to graduate within the next 12 months who are interested in applying to graduate school

PAC Learning: intuition
• If we learn a hypothesis h on the training data, how can we be sure it is close to the true target function f if we don't know what f is?
• Any hypothesis we learn that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction.
• Thus any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong, i.e., it must be probably approximately correct.
• Learning theory is concerned with estimating the sample size needed to ensure good generalization performance.

PAC Learning
• PAC = Probably approximately correct.
• Let f(x) be the true class, h(x) our guess, and π(x) a distribution over examples. Define the error as error(h) = P_{x ~ π}(h(x) ≠ f(x)).
• Define h as approximately correct if error(h) < ε.
• Goal: find a sample size m such that, for any distribution π, if Ntrain ≥ m then with probability 1-δ the learnt hypothesis will be approximately correct.
• Test examples must be drawn from the same distribution as the training examples.
• We assume there is no label noise.

Derivation of PAC bounds for finite H
• Partition H into H_ε, an ε-"ball" around f_true, and H_bad = H \ H_ε.
• What is the probability that a "seriously wrong" hypothesis h_b ∈ H_bad is consistent with m examples (so we are fooled)? Since error(h_b) > ε, the probability that h_b agrees with a single example is at most 1 - ε, so the probability that it is consistent with all m examples is at most (1 - ε)^m.
• By the union bound, the probability of finding such an h_b is bounded by |H_bad| (1 - ε)^m ≤ |H| (1 - ε)^m.

Derivation of PAC bounds for finite H
• We want to find m such that |H| (1 - ε)^m ≤ δ. The smallest such m is called the sample complexity of H.
• We use (1 - ε) ≤ exp(-ε) to derive m ≥ (1/ε)(ln|H| + ln(1/δ)).
• If |H| is larger, we need more training data to ensure we can choose the "right" hypothesis.

PAC Learnability
• Statistical learning theory is concerned with sample complexity.
• Computational learning theory is additionally concerned with computational (time) complexity.
• A concept class C is PAC learnable if, with probability 1-δ, it can be learnt to error at most ε in time polynomial in 1/δ, 1/ε, n, and size(c).
• This implies
– polynomial sample complexity
– polynomial computational time

H = any boolean function
• Consider all 2^(2^2) = 16 possible binary functions on k = 2 binary inputs.
• If we observe (x1=0, x2=1, y=0), this removes h5, h6, h7, h8, h13, h14, h15, h16.
• Each example halves the version space.
• Still leaves exponentially many hypotheses!

H = any boolean function
• Unbiased learner: |H| = 2^(2^k), so m ≥ (1/ε)(2^k ln 2 + ln(1/δ)).
• Needs an exponentially large sample size to learn.
• Essentially has to learn the whole lookup table, since for any unseen example, H contains as many consistent hypotheses that predict 1 as predict 0.

Making learning tractable
• To reduce the sample complexity, and allow generalization from a finite sample, there are two approaches:
– restrict the hypothesis space to simpler functions
– put a prior that encourages simpler functions
• We will consider the latter (Bayesian) approach later.

H = conjunction of boolean literals
• Conjunctions of Boolean literals: |H| = 3^k, so m ≥ (1/ε)(k ln 3 + ln(1/δ)).
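A quick numerical illustration (not from the original slides) of the finite-H bound m ≥ (1/ε)(ln|H| + ln(1/δ)): the Python sketch below, with the hypothetical helper name sample_complexity, plugs in the two hypothesis spaces above for k = 10, ε = 0.1, δ = 0.05.

```python
import math

def sample_complexity(log_H, eps, delta):
    """Finite-H PAC bound: m >= (1/eps) * (ln|H| + ln(1/delta)).
    log_H is the natural log of the hypothesis-space size."""
    return math.ceil((log_H + math.log(1.0 / delta)) / eps)

eps, delta, k = 0.1, 0.05, 10

# Unbiased learner: |H| = 2^(2^k), so ln|H| = 2^k * ln 2 (exponential in k).
m_unbiased = sample_complexity((2 ** k) * math.log(2), eps, delta)

# Conjunctions of boolean literals: |H| = 3^k, so ln|H| = k * ln 3 (linear in k).
m_conj = sample_complexity(k * math.log(3), eps, delta)

print(m_unbiased)  # 7128 examples
print(m_conj)      # 140 examples
```

The gap widens rapidly with k, which is the point of restricting the hypothesis space.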
H = decision lists
• A decision list is a sequence of tests, each a conjunction of literals; the first test an example satisfies determines its predicted label.
• k-DL(n) restricts each test to contain at most k literals chosen from n attributes.
• k-DL(n) includes the set of all decision trees of depth at most k.

PAC bounds for rectangles
• Let us consider an infinite hypothesis space (axis-parallel rectangles), for which |H| is infinite, so the finite-|H| bound cannot be used.
• Let h be the most specific (tightest-fitting) hypothesis, so errors occur only in the strips between h and the true rectangle.
• Take the four strips along the sides of the true rectangle to each have probability mass ε/4; if the training sample hits all four strips, the error of h is at most ε.
• Pr that one instance misses a given strip: 1 - ε/4.
• Pr that N instances all miss a given strip: (1 - ε/4)^N.
• Pr that N instances miss at least one of the 4 strips (union bound): at most 4(1 - ε/4)^N.
• Require 4(1 - ε/4)^N ≤ δ and use (1 - x) ≤ exp(-x):
• 4 exp(-εN/4) ≤ δ, so N ≥ (4/ε) ln(4/δ).
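As a sanity check (my own addition, not in the slides), the sketch below draws N = ⌈(4/ε) ln(4/δ)⌉ points uniformly from the unit square, fits the tightest axis-parallel rectangle around the positive examples of a made-up true rectangle, and estimates its error; the true concept, the uniform distribution, and the helper names are illustrative assumptions.

```python
import math
import random

def true_label(x, y):
    # Hypothetical true concept (my own choice): the rectangle [0.2, 0.7] x [0.3, 0.8].
    return 0.2 <= x <= 0.7 and 0.3 <= y <= 0.8

def tightest_rectangle(points):
    """Most specific hypothesis: the smallest axis-parallel rectangle
    enclosing all positive training points (None if there are none)."""
    pos = [(x, y) for (x, y, lab) in points if lab]
    if not pos:
        return None
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect, x, y):
    if rect is None:
        return False
    x1, x2, y1, y2 = rect
    return x1 <= x <= x2 and y1 <= y <= y2

random.seed(0)
eps, delta = 0.1, 0.05
N = math.ceil(4 / eps * math.log(4 / delta))  # slide's bound: N >= (4/eps) ln(4/delta) -> 176

# Draw N training examples uniformly from the unit square and fit the tightest rectangle.
train = [(x, y, true_label(x, y))
         for x, y in ((random.random(), random.random()) for _ in range(N))]
h = tightest_rectangle(train)

# Estimate error(h) = P(h(x) != f(x)) on a large test sample from the same distribution.
test = [(random.random(), random.random()) for _ in range(100_000)]
err = sum(predict(h, x, y) != true_label(x, y) for x, y in test) / len(test)
print(f"N = {N}, estimated error = {err:.4f}")  # typically well below eps = 0.1
```

With ε = 0.1 and δ = 0.05 the bound asks for N = 176 examples, and the estimated error is usually far below ε, consistent with the bound being conservative.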

VC Dimension
• We can generalize the rectangle example using the Vapnik-Chervonenkis dimension.
• VC(H) is the maximum number of points that can be shattered by H.
• A set of instances S is shattered by H if for every dichotomy (binary labeling) of S there is a consistent hypothesis in H.
• This is best explained by examples.

Shattering 3 points in R^2 with circles
• Is this set of points shattered by the hypothesis space H of all circles?
• [Figure: all 8 labelings (+/-) of the 3 points, each covered by a circle.]
• Every possible labeling can be covered by a circle, so we can shatter 3 points.

Is this set of points shattered by circles?
• No, we cannot shatter any set of 4 points.

How About This One?
• We cannot shatter this particular set of 3 points, but we can find some set of 3 points that we can shatter.

VCD(Circles) = 3
• VC(H) = 3, since some set of 3 points can be shattered but no set of 4 can.

VCD(Axes-Parallel Rectangles) = 4
• Can shatter at most 4 points in R^2 with a rectangle.

Linear decision surface in 2D
• VC(H) = 3; no 4 points can be shattered, so the xor labeling of 4 points is not linearly separable.

Linear decision surface in n-d
• VC(H) = n + 1.

Is there an H with VC(H) = ∞?
• Yes! The space of all convex polygons.

PAC-Learning with VC-dim.
• Theorem: After seeing m ≥ (1/ε)(4 log_2(2/δ) + 8 VC(H) log_2(13/ε)) random training examples, the learner will with probability 1-δ generate a hypothesis with error at most ε.

Criticisms of PAC learning
• The bounds on the generalization error are very loose, because
– they are distribution-free / worst-case bounds, and do not depend on the actual observed data
– they make various approximations
• Consequently the bounds are not very useful in practice.
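To make both the theorem and the criticism concrete, here is a small Python sketch (my own addition; vc_sample_complexity is a hypothetical name) that evaluates the VC-based bound for linear decision surfaces, where VC(H) = n + 1.

```python
import math

def vc_sample_complexity(vc_dim, eps, delta):
    """VC-based PAC bound from the slide:
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Linear decision surfaces in n dimensions have VC(H) = n + 1.
for n in (2, 10, 100):
    print(n, vc_sample_complexity(n + 1, eps=0.1, delta=0.05))
# prints roughly 1899, 6393, 56954
```

Even a 2-dimensional linear classifier is "charged" nearly 2000 examples for ε = 0.1, δ = 0.05, which is the kind of looseness the criticism above refers to.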