3
Natural Language Decisions are Structured
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
TODAY:
- How to support real, high-level natural language decisions
- How to learn models that are used, eventually, to make global decisions
- A framework that allows one to exploit interdependencies among decision variables both in inference (decision making) and in learning
  - Inference: a formulation for incorporating expressive declarative knowledge in decision making
  - Learning: the ability to learn simple models, amplifying their power by exploiting interdependencies
Learning and Inference in NLP Page 3

4
Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in […]. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
This is an Inference Problem

5
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
- In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
- As we move up the problem hierarchy (textual entailment, QA, …), not all component models can be learned simultaneously.
- We need to think about (learned) models for different sub-problems.
- Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

9
Constrained Conditional Models
    y* = argmax_y  w · φ(x, y)  −  Σ_k ρ_k d(y, 1_{C_k})
- w · φ(x, y): weight vector for local models over features and classifiers; log-linear models (HMM, CRF) or a combination.
- Σ_k ρ_k d(y, 1_{C_k}): (soft) constraints component; ρ_k is the penalty for violating constraint C_k, and d measures how far y is from a legal assignment.
How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; Cutting Planes, Dual Decomposition, and other search techniques are possible.
How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?
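The CCM objective can be made concrete with a minimal brute-force sketch. Everything here is illustrative: the three-token task, the local classifier scores, and the "at most one ARG" constraint are invented for the example; a real system would use an ILP solver rather than enumeration.

```python
from itertools import product

# Toy CCM: assign one of two labels to each of 3 tokens.
LABELS = ["ARG", "O"]

def local_score(i, label):
    # Stand-in for w . phi(x, y_i): a fixed table of local classifier scores.
    scores = {(0, "ARG"): 1.0, (0, "O"): 0.2,
              (1, "ARG"): 0.4, (1, "O"): 0.6,
              (2, "ARG"): 0.5, (2, "O"): 0.45}
    return scores[(i, label)]

def constraint_violation(y):
    # Soft constraint: at most one ARG.  d(y, 1_C) = number of extra ARGs.
    n_arg = sum(1 for label in y if label == "ARG")
    return max(0, n_arg - 1)

def ccm_argmax(rho=2.0):
    # argmax_y  sum_i local_score(i, y_i)  -  rho * d(y, 1_C), by enumeration.
    best, best_val = None, float("-inf")
    for y in product(LABELS, repeat=3):
        val = sum(local_score(i, label) for i, label in enumerate(y))
        val -= rho * constraint_violation(y)
        if val > best_val:
            best, best_val = y, val
    return best

print(ccm_argmax())         # with rho=2.0 the constraint wins: only one ARG
print(ccm_argmax(rho=0.0))  # unconstrained: each token takes its local best
```

With ρ = 0 the decision degenerates to independent local predictions (two ARGs); a large enough ρ makes the constraint effectively hard.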

10
Structured Prediction: Inference
Placing in context: a crash course in structured prediction.
Inference: given input x (a document, a sentence), predict the best structure y = {y_1, y_2, …, y_n} ∈ Y (entities & relations). Assign values to y_1, y_2, …, y_n, accounting for the dependencies among the y_i's. Inference is expressed as maximization of a scoring function:
    y* = argmax_{y ∈ Y}  w^T φ(x, y)
where φ(x, y) are joint features on inputs and outputs, w are the feature weights (estimated during learning), and Y is the set of allowed structures.
Inference requires, in principle, touching all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w.
- For some structures, inference is computationally easy, e.g., using the Viterbi algorithm.
- In general it is NP-hard (and can be formulated as an ILP).
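For the "computationally easy" case: when φ decomposes over adjacent positions (HMM / linear-chain CRF style), the argmax is exactly the Viterbi recurrence. A minimal sketch, with hypothetical emission and transition scores standing in for the learned w^T φ terms:

```python
def viterbi(n, tags, emit, trans):
    # emit[i][t]: local score of tag t at position i (stand-in for w^T phi)
    # trans[s][t]: score of the tag transition s -> t
    delta = {t: emit[0][t] for t in tags}   # best score of a prefix ending in t
    back = []                               # back-pointers for reconstruction
    for i in range(1, n):
        ptr, new_delta = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: delta[s] + trans[s][t])
            new_delta[t] = delta[prev] + trans[prev][t] + emit[i][t]
            ptr[t] = prev
        delta = new_delta
        back.append(ptr)
    last = max(tags, key=lambda t: delta[t])
    seq = [last]
    for ptr in reversed(back):              # follow back-pointers right to left
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

# Toy example: two tags over three positions, scores chosen by hand.
emit = [{"N": 1.0, "V": 0.0}, {"N": 0.0, "V": 1.0}, {"N": 1.0, "V": 0.0}]
trans = {"N": {"N": 0.0, "V": 0.5}, "V": {"N": 0.5, "V": 0.0}}
print(viterbi(3, ["N", "V"], emit, trans))  # ['N', 'V', 'N']
```

The dynamic program touches O(n · |tags|²) cells instead of all |tags|^n structures, which is exactly why this family of structures is the easy case.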

11
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each annotated example (x_i, y_i):
    w^T φ(x_i, y_i)  ≥  w^T φ(x_i, y) + Δ(y, y_i)    ∀ y ∈ Y

12
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each annotated example (x_i, y_i):
    w^T φ(x_i, y_i)  ≥  w^T φ(x_i, y) + Δ(y, y_i)    ∀ y ∈ Y
(the score of the annotated structure must beat the score of any other structure by the penalty Δ for predicting that other structure).
We call these conditions the learning constraints. In most learning algorithms used today, the update of the weight vector w is done in an on-line fashion. Think about it as Perceptron; this procedure applies to Structured Perceptron, CRFs, and Linear Structured SVMs. W.l.o.g. (almost), we can thus write the generic structured learning algorithm as follows.

13
Structured Prediction: Learning Algorithm
For each example (x_i, y_i) do (with the current weight vector w):
- Predict: perform inference with the current weight vector, ŷ = argmax_{y ∈ Y} w^T φ(x_i, y)
- Check the learning constraints: is the score of the current prediction better than that of (x_i, y_i)?
  - If yes - a mistaken prediction: update w.
  - Otherwise: no need to update w on this example.
EndFor
In the structured case, the prediction (inference) step is often intractable and needs to be done many times.
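The loop above is the structured perceptron. A minimal self-contained sketch, with an invented feature map `phi` (token/tag and tag-bigram indicators) and brute-force enumeration standing in for the inference step, which in practice would be Viterbi or an ILP call:

```python
from itertools import product

def phi(x, y):
    # Hypothetical joint feature map: (token, tag) and (previous-tag, tag)
    # indicator features, returned as a sparse dict.
    feats, prev = {}, "<s>"
    for tok, tag in zip(x, y):
        for f in ((tok, tag), (prev, tag)):
            feats[f] = feats.get(f, 0) + 1
        prev = tag
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def predict(w, x, tags):
    # Inference step: brute force over all tag sequences (stand-in for Viterbi/ILP).
    return max(product(tags, repeat=len(x)), key=lambda y: score(w, phi(x, y)))

def structured_perceptron(data, tags, epochs=5):
    w = {}
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = predict(w, x, tags)
            if list(y_hat) != list(y_gold):          # learning constraint violated
                for f, v in phi(x, y_gold).items():  # promote the gold structure
                    w[f] = w.get(f, 0.0) + v
                for f, v in phi(x, y_hat).items():   # demote the prediction
                    w[f] = w.get(f, 0.0) - v
    return w
```

Note that each update calls `predict`, i.e., a full inference problem per example per epoch; this is the cost the decomposition and amortization ideas later in the talk try to reduce.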

14
Structured Prediction: Learning Algorithm
Solution I: decompose the scoring function into EASY and HARD parts.
For each example (x_i, y_i) do:
- Predict: perform inference with the current weight vector, ŷ = argmax_{y ∈ Y} w_EASY^T φ_EASY(x_i, y) + w_HARD^T φ_HARD(x_i, y)
- Check the learning constraint: is the score of the current prediction better than that of (x_i, y_i)?
  - If yes - a mistaken prediction: update w.
  - Otherwise: no need to update w on this example.
EndFor
EASY: could be feature functions that correspond to an HMM, a linear CRF, or even φ_EASY(x, y) = φ(x), omitting the dependence on y, corresponding to classifiers. This may not be enough if the HARD part is still part of each inference step.

20
Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will

22
Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will
One inference problem for each verb predicate.

23
Constraints
- No duplicate argument classes
- Reference-Ax: if there is a Reference-Ax phrase, there is an Ax
- Continuation-Ax: if there is a Continuation-Ax phrase, there is an Ax before it
Many other possible constraints:
- Unique labels
- No overlapping or embedding
- Relations between the number of arguments; order constraints
- If the verb is of type A, no argument of type B
Any Boolean rule can be encoded as a set of linear inequalities. Constraints are universally quantified rules. Learning Based Java allows a developer to encode constraints in First Order Logic; these are compiled into linear inequalities automatically.
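As an example of encoding a Boolean rule as linear inequalities: "if there is a Reference-Ax phrase, there is an Ax" becomes, over 0-1 indicator variables, one inequality Σ_i ax_i ≥ ref_j per span j. The sketch below (with invented variable names) verifies by enumeration that the linear encoding agrees with the Boolean rule on every 0-1 assignment:

```python
from itertools import product

def rule_holds(ref_flags, ax_flags):
    # Boolean rule: "if any span is labeled Reference-Ax, some span is labeled Ax".
    return (not any(ref_flags)) or any(ax_flags)

def inequalities_hold(ref_flags, ax_flags):
    # Linear encoding: for every span j,  sum_i ax_i  >=  ref_j.
    return all(sum(ax_flags) >= r for r in ref_flags)

# The encoding is exact: the two agree on all 0-1 assignments.
n = 3
for ref in product([0, 1], repeat=n):
    for ax in product([0, 1], repeat=n):
        assert rule_holds(ref, ax) == inequalities_hold(ref, ax)
print("Boolean rule and linear encoding agree on all assignments")
```

The same pattern (implication → sum inequality) covers the Continuation-Ax rule, with an extra ordering restriction on which Ax variables appear in the sum.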

28
Verb SRL is not sufficient
John, a fast-rising politician, slept on the train to Chicago.
Relation: sleep. Sleeper: John, a fast-rising politician. Location: on the train to Chicago.
- What was John's destination? "train to Chicago" gives the answer, without verbs!
- Who was John? "John, a fast-rising politician" gives the answer, without verbs!

29
Examples of preposition relations
Queen of England
City of Chicago

30
Predicates expressed by prepositions
Ambiguity & variability (sense indices refer to the definition numbering in the Oxford English Dictionary):
- live at Conway House      at:1    Location
- stopped at 9 PM           at:2    Temporal
- cooler in the evening     in:3    Temporal
- drive at 50 mph           at:5    Numeric
- arrive on the 9th         on:17   Temporal
- the camp on the island    on:7    Location
- look at the watch         at:9    ObjectOfVerb

32
Computational Questions
1. How do we predict the preposition relations? [EMNLP, 11]
   Capturing the interplay with verb SRL? Very small jointly labeled corpus; cannot train a global model!
2. What about the arguments? [Trans. of ACL, 13]
   Annotation only gives us the predicate. How do we train an argument labeler?

33
Predicting preposition relations
- Multiclass classifier
- Uses sense disambiguation features that depend on words syntactically connected to the preposition [Hovy et al. 2009]
- Additional features based on NER, gazetteers, and word clusters
- Does not take advantage of the interactions between preposition and verb relations

34
Coherence of predictions
The bus was heading for Nairobi in Kenya.
Preposition relations: for → Destination; in → Location.
Verb SRL: Predicate: head.02; A0 (mover): The bus; A1 (destination): for Nairobi in Kenya.
Predicate arguments from different triggers should be consistent: joint constraints link the two tasks (e.g., the preposition's Destination should align with the verb's A1).

35
Joint Constraints
Hand-written constraints:
1. Each verb-attached preposition that is labeled Temporal should correspond to the start of some AM-TMP.
2. For verb-attached prepositions, some roles should correspond to a non-null argument label.
Mined constraints:
- Using Penn Treebank data for arguments that start with a preposition
- The set of allowed verb argument labels for each preposition relation, and vice versa
Constraints are written as linear inequalities.

37
Desiderata for joint prediction
Intuition: the correct interpretation of a sentence is the one that gives a consistent analysis across all the linguistic phenomena expressed in it.
1. Should account for dependencies between linguistic phenomena
2. Should be able to use existing state-of-the-art models, with minimal use of expensive jointly labeled data
- Joint constraints between tasks are easy with the ILP formulation.
- Use a small amount of joint data to re-scale scores to be in the same numeric range.
Joint inference, with no (or minimal) joint learning.

39
Example
Weatherford said market conditions led to the cancellation of the planned exchange.
- Independent preposition SRL mistakenly labels "to" as a Location.
- Verb SRL identifies "to the cancellation of the planned exchange" as an A2 for the verb "led".
- Constraints (from VerbNet) prohibit an A2 from being labeled as a Location; joint inference correctly switches the prediction to EndCondition.

40
Preposition relations and arguments
1. How do we predict the preposition relations? [EMNLP, 11]
   Capturing the interplay with verb SRL? Very small jointly labeled corpus; cannot train a global model!
2. What about the arguments? [Trans. of ACL, 13]
   Annotation only gives us the predicate. How do we train an argument labeler?
Enforcing consistency between verb argument labels and preposition relations can help improve both.

41
Indirect Supervision [Chang et al. ICML'10, NAACL'10] [Yu & Joachims '09]
In many cases we are interested in a mapping from X to Y, but Y cannot be expressed as a simple function of X, and hence cannot be learned well only as a function of X. Consider the following sentences:
S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.
Are S1 and S2 paraphrases of each other? There is a need for an additional set of variables to justify this decision. There is no supervision of these variables, and typically no evaluation of them, but learning to assign values to them supports better prediction of Y. This is a discriminative form of EM.

43
Relations depend on argument types
Our primary goal is to model preposition relations and their arguments, but the relation prediction also depends strongly on the semantic type of the arguments.
- Poor care led to her death from pneumonia.  →  Cause(death, pneumonia)
- Poor care led to her death from the flu.
How do we generalize to unseen words of the same type? Types are an abstraction that captures common properties of groups of entities.

47
Latent inference
Inference takes into account constraints among parts of the structure; it is formulated as an ILP.
- Standard inference: find an assignment to the full structure.
- Latent inference: given an example with annotated r(y*), complete the hidden part h(y).
Given that we have constraints between r(y) and h(y), this process completes the structure in the best possible way to support correct prediction of the variables being supervised.

49
Supervised Learning
Learning (updating w) is driven by the learning constraint: the score of the annotated structure must beat the score of any other structure by the penalty for predicting that other structure.
- The penalty for making a mistake must not be the same for the labeled r(y) part and the inferred h(y) part.
- Completion of the hidden structure is done in the inference step (guided by constraints).

51
Preposition relations and arguments
1. How do we predict the preposition relations? [EMNLP, 11]
   Capturing the interplay with verb SRL? Very small jointly labeled corpus; cannot train a global model!
2. What about the arguments? [Trans. of ACL, 13]
   Annotation only gives us the predicate. How do we train an argument labeler?
Enforcing consistency between verb argument labels and preposition relations can help improve both. Knowing the existence of a hidden structure lets us complete it and helps us learn.

53
Constrained Conditional Models (aka ILP Inference)
    y* = argmax_y  w · φ(x, y)  −  Σ_k ρ_k d(y, 1_{C_k})
- w · φ(x, y): weight vector for local models over features and classifiers; log-linear models (HMM, CRF) or a combination.
- Σ_k ρ_k d(y, 1_{C_k}): (soft) constraints component; ρ_k is the penalty for violating constraint C_k, and d measures how far y is from a legal assignment.
How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; Cutting Planes, Dual Decomposition, and other search techniques are possible.
How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?

54
Inference in NLP
In NLP we typically don't solve a single inference problem; we solve one or more per sentence. Beyond improving the inference algorithm, what can be done?
S1: He is reading a book.
S2: I am watching a movie.
POS: PRP VBZ VBG DT NN
S1 & S2 look very different, but their output structures are the same: the inference outcomes are the same. After inferring the POS structure for S1, can we speed up inference for S2? Can we make the k-th inference problem cheaper than the first?

55
Amortized ILP Inference [Kundu, Srikumar & Roth, EMNLP-12, ACL-13]
We formulate the problem of amortized inference: reducing inference time over the lifetime of an NLP tool. We develop conditions under which the solution of a new problem can be exactly inferred from earlier solutions without invoking the solver.
Results:
- A family of exact inference schemes
- A family of approximate solution schemes
- The algorithms are invariant to the underlying solver; we simply reduce the number of calls to the solver.
- Significant improvements, both in solver calls and in wall-clock time, in a state-of-the-art Semantic Role Labeling system.

56
The Hope: POS Tagging on Gigaword
[Plot: number of examples of a given size vs. number of tokens]

57
The Hope: POS Tagging on Gigaword
[Plot: number of examples of a given size, and number of unique POS tag sequences, vs. number of tokens]
The number of structures is much smaller than the number of sentences.

58
The Hope: Dependency Parsing on Gigaword
[Plot: number of examples of a given size, and number of unique dependency trees, vs. number of tokens]
The number of structures is much smaller than the number of sentences.

59
The Hope: Semantic Role Labeling on Gigaword
[Plot: number of examples of a given size, and number of unique SRL structures, vs. number of arguments per predicate]
The number of structures is much smaller than the number of sentences.

60
POS Tagging on Gigaword
[Plot: structure frequencies vs. number of tokens]
How skewed is the distribution of the structures? A small number of structures occur very frequently.

61
Amortized ILP Inference
These statistics show that many different instances are mapped into identical inference outcomes. How can we exploit this fact to save inference cost? We do this in the context of 0-1 LP, the most commonly used formulation in NLP:
    max   c · x
    s.t.  Ax ≤ b,  x ∈ {0,1}^n
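One of the exact conditions (in the spirit of Theorem I in Kundu, Srikumar & Roth) can be sketched directly: if two 0-1 ILPs share the same feasible set, and every objective coefficient of a variable set to 1 in the cached solution has not decreased while every coefficient of a variable set to 0 has not increased, the cached maximizer is still optimal and the solver call can be skipped. The numeric values below are illustrative only.

```python
def same_solution_guaranteed(c_old, c_new, x_star):
    # Exact amortization check: if (2*x_i - 1) * (c_new_i - c_old_i) >= 0 for
    # every variable i, the cached maximizer x_star of c_old also maximizes
    # c_new over the same feasible set, so the solver need not be invoked.
    return all((2 * x - 1) * (cn - co) >= 0
               for x, co, cn in zip(x_star, c_old, c_new))

# Cached problem: maximizer x_star for objective c_old over a fixed feasible set.
c_old = [2.0, -1.0, 0.5]
x_star = [1, 0, 1]
# New objective: 1-variables' coefficients went up, 0-variables' went down.
c_new = [2.5, -1.5, 0.5]
print(same_solution_guaranteed(c_old, c_new, x_star))              # True: reuse x_star
print(same_solution_guaranteed(c_old, [1.0, 0.0, 0.5], x_star))    # False: must solve
```

The check is O(n) per cached problem, versus an NP-hard solver call; this asymmetry is what makes amortization pay off.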

68
Exact Theorem II (Geometric Interpretation)
[Figure: objective vectors c_P1 and c_P2, the solution x*, and the feasible region.]
ILPs corresponding to all these objective vectors share the same maximizer for this feasible region: all ILPs whose objective lies in the cone share the maximizer.

69
Exact Theorem III (Combining I and II)

70
Approximation Methods
Will the conditions of the exact theorems hold in practice? The statistics we showed almost guarantee they will: there are very few structures relative to the number of instances. To guarantee that the conditions on the objective coefficients are satisfied, we can relax them and move to approximation methods. Approximate methods have the potential for more speedup than the exact theorems, and it turns out that indeed the speedup is higher, without a drop in accuracy.

71
Simple Approximation Methods (I, II)
Most Frequent Solution:
- Find the set C of previously solved ILPs in Q's equivalence class.
- Let S be the most frequent solution in C.
- If the frequency of S in C is above a threshold (support), return S; otherwise call the ILP solver.
Top-K Approximation:
- Find the set C of previously solved ILPs in Q's equivalence class.
- Let K be the set of most frequent solutions in C.
- Evaluate each of the K solutions on the objective function of Q and select the one with the highest objective value.
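The Top-K scheme can be sketched as a small cache. Everything here is a hypothetical illustration: the equivalence-class keys, the `solver` callback, and the class name are invented; the point is only the control flow (rescore the K most frequent cached solutions under the new objective, fall back to the real solver on a cache miss).

```python
from collections import Counter, defaultdict

class AmortizedCache:
    # Sketch of the Top-K approximation: cache solutions per equivalence class
    # of problems, and answer a new query by rescoring the K most frequent
    # cached solutions under the new objective.
    def __init__(self, k=3):
        self.k = k
        self.solutions = defaultdict(Counter)   # class id -> Counter of solutions

    def record(self, class_id, solution):
        self.solutions[class_id][tuple(solution)] += 1

    def query(self, class_id, objective, solver):
        cached = self.solutions[class_id]
        if cached:
            top_k = [s for s, _ in cached.most_common(self.k)]
            # Pick the cached solution with the best value under the NEW objective.
            return max(top_k, key=lambda s: sum(c * x for c, x in zip(objective, s)))
        solution = tuple(solver(objective))     # cache miss: call the real ILP solver
        self.record(class_id, solution)
        return solution

cache = AmortizedCache(k=2)
for sol in [(1, 0, 1), (1, 0, 1), (0, 1, 0)]:
    cache.record("srl:3args", sol)
print(cache.query("srl:3args", [0.0, 5.0, 0.0], solver=None))  # (0, 1, 0)
```

Note the returned solution maximizes the objective only over the cached candidates, which is exactly why this variant is approximate rather than exact.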

74
Experiments: Semantic Role Labeling
SRL: based on the state-of-the-art Illinois SRL [V. Punyakanok, D. Roth and W. Yih, The Importance of Syntactic Parsing and Inference in Semantic Role Labeling, Computational Linguistics, 2008]. In SRL, we solve an ILP problem for each verb predicate in each sentence.
Amortization experiments:
- Speedup & accuracy are measured over the WSJ test set (Section 23).
- The baseline is solving each ILP with Gurobi 4.6.
- For amortization: we collect 250,000 SRL inference problems from Gigaword and store them in a database. For each ILP in the test set, we invoke one of the theorems (exact / approximate); if a cached solution is found, we return it, otherwise we call the baseline ILP solver.

76
Summary: Amortized ILP Inference
- Inference can be amortized over the lifetime of an NLP tool.
- This yields significant speedup, due to reducing the number of calls to the inference engine, independently of the solver.
Current/future work:
- Decomposed amortized inference, possibly combined with Lagrangian Relaxation
- Approximation augmented with warm start
- Relations to lifted inference

77
Conclusion
Presented Constrained Conditional Models: an ILP-based computational framework that augments statistically learned linear models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces while maintaining the modularity and tractability of training. It is a powerful & modular learning and inference paradigm for high-level tasks: multiple interdependent components are learned and, via inference, support coherent decisions, modulo declarative constraints.
- Learning issues: exemplified some of the issues in the context of extended SRL - learning simple models; modularity; latent and indirect supervision.
- Inference: presented a first step in amortized inference - how to use previous inference outcomes to reduce inference cost.
Thank you! Check out our tools, demos, and tutorial.