
1 Particle Filtering
- Sometimes |X| is too big to use exact inference
  - |X| may be too big to even store B(X), e.g. when X is continuous
  - |X|^2 may be too big to do updates
- Solution: approximate inference
  - Track samples of X, not all values
  - Samples are called particles
  - Time per step is linear in the number of samples
  - But: the number of samples needed may be large
  - In memory: a list of particles
- This is how robot localization works in practice
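One update step is easy to sketch in code. Below is a minimal particle filtering step (a sketch, not the slides' code): transition_sample and emission_prob are assumed stand-ins for the model's P(X'|X) and P(E|X).

```python
import random

def particle_filter_step(particles, evidence, transition_sample, emission_prob):
    # 1. Elapse time: move each particle by sampling from the transition model P(X'|x).
    moved = [transition_sample(x) for x in particles]
    # 2. Observe: weight each particle by how well it explains the evidence, P(e|x).
    weights = [emission_prob(evidence, x) for x in moved]
    if sum(weights) == 0:
        return moved  # all particles inconsistent with the evidence; a real system would re-seed
    # 3. Resample: draw a fresh particle set in proportion to the weights.
    return random.choices(moved, weights=weights, k=len(particles))
```

Each step touches every particle once, which is where the "time per step is linear in the number of samples" claim comes from.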

3 Dynamic Bayes Nets (DBNs)
- We want to track multiple variables over time, using multiple sources of evidence
- Idea: repeat a fixed Bayes net structure at each time step
- Variables from time t can condition on those from time t-1
- DBNs with evidence at leaves are HMMs
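To make "repeat a fixed structure at each time step" concrete, here is a toy sketch; the Rain/Umbrella variable names are assumed for illustration, not taken from the slides. It unrolls a 2-time-slice template into per-step nodes and their parents.

```python
# Template: each variable at time t lists its parents (at time t or t-1).
template = {
    "Rain_t":     ["Rain_t-1"],   # transition edge between time slices
    "Umbrella_t": ["Rain_t"],     # evidence variable at the leaf (HMM-like)
}

def unroll(template, T):
    """Copy the template for t = 1..T, rewriting t / t-1 into concrete indices."""
    nodes = {}
    for t in range(1, T + 1):
        for name, parents in template.items():
            node = name.replace("_t", f"_{t}")
            nodes[node] = [p.replace("_t-1", f"_{t-1}").replace("_t", f"_{t}")
                           for p in parents]
    return nodes

print(unroll(template, 3))
# {'Rain_1': ['Rain_0'], 'Umbrella_1': ['Rain_1'], 'Rain_2': ['Rain_1'], ...}
```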

12 Acoustic Feature Sequence
- Time slices are translated into acoustic feature vectors (~39 real numbers per slice)
- These are the observations; now we need the hidden states X
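A rough sketch of how a waveform becomes one feature vector per time slice. The windowing parameters and the toy band-energy features below are assumptions for illustration; real recognizers typically use MFCCs plus delta features to get the ~39 numbers per slice.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping time slices."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def toy_features(frame, dim=39):
    """Placeholder feature extractor: average energy in 'dim' frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, dim)
    return np.array([band.mean() for band in bands])

signal = np.random.randn(16000)                  # one second of fake audio at 16 kHz
observations = [toy_features(f) for f in frame_signal(signal, 16000)]
print(len(observations), observations[0].shape)  # ~98 slices, 39 numbers each
```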

13 State Space
- P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
- P(X|X') encodes how sounds can be strung together
- We will have one state for each sound in each word
- From some state x, we can only:
  - Stay in the same state (e.g. speaking slowly)
  - Move to the next position in the word
  - At the end of the word, move to the start of the next word
- We build a little state graph for each word and chain them together to form our state space X
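A sketch of that state graph construction; the two-word vocabulary, phoneme lists, and the 0.5 probabilities are made up for illustration.

```python
words = {"hello": ["HH", "EH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

transitions = {}                              # state -> list of (next_state, probability)
word_starts = [(w, 0) for w in words]
for w, phones in words.items():
    for i in range(len(phones)):
        state = (w, i)                        # position i in word w
        self_loop = [(state, 0.5)]            # stay in the same state (speaking slowly)
        if i + 1 < len(phones):
            nxt = [((w, i + 1), 0.5)]         # advance to the next position in the word
        else:
            p = 0.5 / len(word_starts)        # word end: move to the start of some word
            nxt = [(s, p) for s in word_starts]
        transitions[state] = self_loop + nxt

print(transitions[("hello", 3)])  # last sound of "hello": self-loop or start a new word
```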

16 Decoding
- While there are some practical issues, finding the words given the acoustics is an HMM inference problem
- We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:
    x*_{1:T} = argmax_{x_{1:T}} P(x_{1:T} | e_{1:T})
- From the sequence x, we can simply read off the words
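The standard algorithm for this argmax is Viterbi. A minimal sketch, assuming dictionary-of-dictionaries tables trans[x][x'] = P(x'|x), emit[x][e] = P(e|x), and prior[x] = P(x_1):

```python
def viterbi(evidence, states, prior, trans, emit):
    # best[x] = score of the best state sequence so far that ends in x
    best = {x: prior[x] * emit[x][evidence[0]] for x in states}
    back = []                                   # backpointers, one dict per time step
    for e in evidence[1:]:
        ptr, nxt = {}, {}
        for x2 in states:
            x_best = max(states, key=lambda x: best[x] * trans[x][x2])
            ptr[x2] = x_best
            nxt[x2] = best[x_best] * trans[x_best][x2] * emit[x2][e]
        back.append(ptr)
        best = nxt
    # Follow backpointers from the best final state to recover x_{1:T}.
    x = max(states, key=lambda s: best[s])
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return list(reversed(path))
```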

17 Machine Learning
- Up until now: how to reason in a model and how to make optimal decisions
- Machine learning: how to acquire a model on the basis of data / experience
  - Learning parameters (e.g. probabilities)
  - Learning structure (e.g. BN graphs)
  - Learning hidden concepts (e.g. clustering)

18 Parameter Estimation
- Estimating the distribution of a random variable
- Elicitation: ask a human (why is this hard?)
- Empirically: use training data (learning!)
  - E.g.: for each outcome x, look at the empirical rate of that value:
      P_ML(x) = count(x) / total samples
  - This is the estimate that maximizes the likelihood of the data
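As a tiny worked example (toy data assumed): three samples with two red and one blue give P_ML(r) = 2/3 and P_ML(b) = 1/3.

```python
from collections import Counter

samples = ["r", "r", "b"]                 # three draws of a red/blue variable
counts = Counter(samples)
p_ml = {x: c / len(samples) for x, c in counts.items()}
print(p_ml)                               # {'r': 0.667, 'b': 0.333}
```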

19 Estimation: Smoothing
- Relative frequencies are the maximum likelihood estimates (MLEs):
    θ_ML = argmax_θ P(X | θ)
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

20 Estimation: Laplace Smoothing
- Laplace's estimate: pretend you saw every outcome once more than you actually did
    P_LAP(x) = (count(x) + 1) / (N + |X|)    (N samples, |X| possible outcomes)
- Can derive this as a MAP estimate with Dirichlet priors

21 Estimation: Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times
    P_LAP,k(x) = (count(x) + k) / (N + k|X|)
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition independently:
    P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
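A small sketch of add-k smoothing on the toy counts from before; note that k = 0 recovers the MLE and a large k pulls the estimate toward uniform.

```python
def laplace(counts, domain, k=1):
    """Add-k smoothed estimate: pretend every outcome in the domain was seen k extra times."""
    total = sum(counts.get(x, 0) for x in domain)
    return {x: (counts.get(x, 0) + k) / (total + k * len(domain)) for x in domain}

counts = {"r": 2, "b": 1}
print(laplace(counts, ["r", "b"], k=0))    # k=0 gives the MLE: {'r': 0.667, 'b': 0.333}
print(laplace(counts, ["r", "b"], k=1))    # {'r': 0.6, 'b': 0.4}
print(laplace(counts, ["r", "b"], k=100))  # large k pulls the estimate toward uniform
```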

22 Example: Spam Filter
- Input: an email
- Output: spam or ham
- Setup:
  - Get a large collection of example emails, each labeled "spam" or "ham"
  - Note: someone has to hand-label all this data!
  - Want to learn to predict labels of new, future emails
- Features: the attributes used to make the ham / spam decision
  - Words: FREE!
  - Text patterns: $dd, CAPS
  - Non-text: senderInContacts
  - ...
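A sketch of what extracting those features from one email might look like; the helper name extract_features and the exact patterns are assumptions for illustration.

```python
import re

def extract_features(email_text, sender, contacts):
    words = email_text.lower().split()
    return {
        "word:free!":       "free!" in words,                          # word feature
        "pattern:$dd":      bool(re.search(r"\$\d\d\b", email_text)),  # text pattern
        "pattern:CAPS":     any(w.isupper() and len(w) > 2 for w in email_text.split()),
        "senderInContacts": sender in contacts,                        # non-text feature
    }

print(extract_features("FREE! Claim your $50 prize NOW", "a@spam.com", {"b@friend.com"}))
```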

23 Example: Digit Recognition
- Input: images / pixel grids
- Output: a digit 0-9
- Setup:
  - Get a large collection of example images, each labeled with a digit
  - Note: someone has to hand-label all this data!
  - Want to learn to predict labels of new, future digit images
- Features: the attributes used to make the digit decision
  - Pixels: (6,8) = ON
  - Shape patterns: NumComponents, AspectRatio, NumLoops
  - ...

25 Naive Bayes for Digits
- Simple version:
  - One feature F_ij for each grid position <i,j>
  - Possible feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image
  - Each input maps to a feature vector of these on/off values
  - Here: lots of features, each binary valued
- Naive Bayes model:
    P(Y | F_{0,0}, ..., F_{n,n}) ∝ P(Y) ∏_{i,j} P(F_{i,j} | Y)
- What do we need to learn?
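Binarizing the pixels is a one-liner; the 28x28 grid size and the toy image below are assumed for illustration. The closing comment lists the tables the model needs, which is the answer to "what do we need to learn?".

```python
import numpy as np

image = np.random.rand(28, 28)            # grayscale intensities in [0, 1]
features = {(i, j): image[i, j] > 0.5     # F_ij is on/off for each grid position
            for i in range(28) for j in range(28)}

# To use the model we must learn:
#   P(Y)         -- the prior over digits 0..9
#   P(F_ij | Y)  -- one conditional table per pixel position, given the digit
```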

26 General Naive Bayes
- A general naive Bayes model:
    P(Y, F_1, ..., F_n) = P(Y) ∏_i P(F_i | Y)
- We only specify how each feature depends on the class
- Total number of parameters is linear in n
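As a quick sanity check on that parameter count (numbers assumed for illustration): with |Y| = 10 classes and n = 784 binary features, naive Bayes needs 10 - 1 = 9 prior entries plus 784 × 10 = 7,840 conditional entries (one free parameter per binary table), about 7,849 parameters in total, versus on the order of 10 × 2^784 entries for the full joint distribution.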

28 General Naive Bayes
- What do we need in order to use naive Bayes?
- Inference (you know this part)
  - Start with a bunch of conditionals: P(Y) and the P(F_i|Y) tables
  - Use standard inference to compute P(Y | F_1, ..., F_n)
  - Nothing new here
- Estimates of local conditional probability tables
  - P(Y), the prior over labels
  - P(F_i|Y) for each feature (evidence variable)
  - These probabilities are collectively called the parameters of the model and denoted by θ
  - Up until now, we assumed these appeared by magic, but...
  - ... they typically come from training data: we'll look at this now
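A minimal sketch of that inference step, assuming the tables are plain Python dicts; the toy spam numbers are made up for illustration.

```python
def nb_posterior(prior, cond, observed):
    """prior[y] = P(Y=y); cond[i][y][f] = P(F_i=f | Y=y); observed[i] = observed value of F_i."""
    scores = {}
    for y in prior:
        p = prior[y]
        for i, f in observed.items():
            p *= cond[i][y][f]               # multiply in P(f_i | y) for each feature
        scores[y] = p
    z = sum(scores.values())                 # normalize to get P(Y | f_1..f_n)
    return {y: s / z for y, s in scores.items()}

prior = {"spam": 0.4, "ham": 0.6}
cond = {"FREE!": {"spam": {True: 0.3, False: 0.7}, "ham": {True: 0.01, False: 0.99}}}
print(nb_posterior(prior, cond, {"FREE!": True}))   # spam becomes much more likely
```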

30 Important Concepts
- Data: labeled instances, e.g. emails marked spam/ham
  - Training set
  - Held-out set
  - Test set
- Features: attribute-value pairs which characterize each x
- Experimentation cycle
  - Learn parameters (e.g. model probabilities) on the training set
  - (Tune hyperparameters on the held-out set)
  - Compute accuracy on the test set
  - Very important: never "peek" at the test set!
- Evaluation
  - Accuracy: fraction of instances predicted correctly
- Overfitting and generalization
  - Want a classifier which does well on test data
  - Overfitting: fitting the training data very closely, but not generalizing well
  - We'll investigate overfitting and generalization formally in a few lectures
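A sketch of the experimentation cycle under those rules; train_model and evaluate are assumed helpers, and the 80/10/10 split and candidate smoothing strengths k are illustrative choices.

```python
import random

def run_experiment(data, train_model, evaluate):
    random.shuffle(data)
    n = len(data)
    train = data[:int(0.8 * n)]
    held_out = data[int(0.8 * n):int(0.9 * n)]
    test = data[int(0.9 * n):]

    best_k, best_model, best_acc = None, None, -1.0
    for k in [0, 1, 5, 25, 100]:              # hyperparameter: smoothing strength
        model = train_model(train, k)         # learn parameters on the training set
        acc = evaluate(model, held_out)       # tune k on the held-out set
        if acc > best_acc:
            best_k, best_model, best_acc = k, model, acc

    return best_k, evaluate(best_model, test)  # only now "peek" at the test set, once
```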