Abstract

In this paper, we present a novel model for entity disambiguation that combines local contextual information with global evidence through Limited Discrepancy Search (LDS). Given an input document, we start from a complete solution constructed by a local model and search the space of possible corrections to improve the local solution from a global viewpoint. Our search uses a heuristic function to focus on the least confident local decisions and a pruning function that scores global solutions based on their local fitness and the coherence among the predicted entities. Experimental results on the CoNLL 2003 and TAC 2010 benchmarks verify the effectiveness of our model.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
1 Introduction

The goal of entity disambiguation is to link a set of given query mentions in a document to their referent entities in a Knowledge Base (KB). As an essential and challenging task in Knowledge Base Population (KBP) for text analysis (Ji et al., 2014; Heng et al., 2015), entity disambiguation has attracted many research efforts from the NLP community. Recently, deep learning based approaches have demonstrated strong performance on this task (Ganea and Hofmann, 2017; Sil et al., 2018).

A main challenge for entity disambiguation is to best identify and represent the appropriate context, which can be local or global. Different methods have been proposed to capture and represent different types of contexts. Textual context has been heavily investigated for local models that score each query’s candidates independently. Representations of the textual contexts range from weighted combination of the word embeddings based on attention (Ganea and Hofmann, 2017), to more fine-grained contextual representations using recurrent neural networks (Sil et al., 2018). The global context of other entities in the document has also been studied for a more global and joint prediction view on the problem. Ganea and Hofmann (2017) use a Conditional Random Field (CRF) based model to capture the interrelationship among entities in the same document, whereas Globerson et al. (2016) introduce a soft k-max attention model to weigh the importance of other entities in the document in making prediction for any given query.

Working with the recently proposed models, we observe that local models that employ an appropriate attention mechanism often have a solid linking performance. In a single document, there are often a small number of hard queries for which the local model fails to make a correct decision. We conjecture that if some of these mistakes can be corrected, a global model that enforces coherence among entities will be able to propagate these corrections to improve the overall solution quite effectively.

This inspires us to consider the Limited Discrepancy Search (LDS) framework (Doppa et al., 2014), which conducts a search over possible corrections to a complete output with the goal of improving the final output. Critically, LDS works well in cases where only a small number of local corrections are needed to reach a good global solution. This nicely matches our observation of the behavior of entity disambiguation models.

In this paper, we propose an LDS-based global entity disambiguation system. Given a document and its query mentions, our system first applies a local disambiguation model to produce an initial solution. We then use LDS to conduct a shallow search in the space of possible corrections (focusing on the hard, least confident mentions) to find a better solution. Evaluation on CoNLL 2003 and TAC 2010 shows that our method outperforms the current state-of-the-art models. We also conduct an extensive ablation study on different variants of our model, providing insight into the strengths of our method.

We are given a document D containing n mentions [x1, ..., xn]. We assume that all mentions are linkable to a Knowledge Base (KB), excluding NIL mentions. We are interested in finding a joint assignment Y = [y1, ..., yn] of referent entities to all mentions that maximizes the following score:

s(Y) = \sum_{i=1}^{n} \psi(x_i, y_i) + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \phi(y_i, y_j)    (1)

where the function ψ(xi, yi) gives the local compatibility score between the mention xi and its candidate yi, and ϕ(yi, yj) indicates the degree of relatedness between the assigned candidates yi and yj. Optimizing this objective, however, is NP-hard. In this work, we develop an LDS-based search strategy for optimizing this objective.
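To make the objective concrete, here is a minimal Python sketch of equation (1), with ψ and ϕ supplied as toy lookup functions rather than the learned scoring models described later:

```python
import itertools

def global_score(Y, psi, phi):
    """Objective (1): sum of local compatibility scores plus pairwise
    coherence over all ordered pairs of assigned entities."""
    local = sum(psi(i, y) for i, y in enumerate(Y))
    coherence = sum(phi(Y[i], Y[j])
                    for i, j in itertools.permutations(range(len(Y)), 2))
    return local + coherence

# Toy lookups standing in for the learned scoring functions:
psi = lambda i, y: {(0, "A"): 1.0, (1, "B"): 0.5}.get((i, y), 0.0)
phi = lambda a, b: 0.2 if {a, b} == {"A", "B"} else 0.0
print(global_score(["A", "B"], psi, phi))  # 1.0 + 0.5 + two 0.2 coherence terms
```

The double sum over ordered pairs is what makes exact optimization intractable: candidates for different mentions interact, so the assignment cannot be decomposed mention by mention.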

2.1 Overview of the Approach

Given a document with n mentions, we initialize the search with a solution acquired based solely on the local scoring function ψ(⋅,⋅). We then conduct a greedy beam search in the space of possible discrepancies (changes/corrections) to this initial solution, focusing on the mentions with the least confident local scores, in the hope of finding a better solution.

Figure 1: The general framework of our proposed model with beam size b=1.

Fig 1 shows the overview of our search framework for beam size b=1. Each state Si is a pair (I,Di) where I=(e1,e2,...,en) is the initial solution given by the local model and Di is the discrepancy set for state Si. For example, a discrepancy set {xk1:yk1,xk2:yk2} contains two discrepancies, changing the assignment for mentions xk1 and xk2 from ek1 and ek2 in I to yk1 and yk2 respectively.
Starting with the initial state (I, {}), we utilize a heuristic function h (Section 2.3.2) to sort the mentions in increasing order of their local confidence. Let (z1, ..., zn) be the ordered mentions, where z1 is the least confident mention. Each iteration of the greedy beam search explores and prunes the space of discrepancy sets as follows:

We select the least confident mention z1 and expand state (I, {}) to k new states (I, {z1: y1j}) for j = 1, ..., k, where y1j is the j-th candidate for mention z1 (we consider the top k most probable candidates). Each expanded state (I, {z1: y1j}), j = 1, ..., k, is given to a Discrepancy Propagator (DP) (Section 2.3.1), which propagates its discrepancy set throughout the document and produces an updated complete solution (o1, ..., on)j for j = 1, ..., k. Utilizing a trained pruning function p (Section 2.3.3), we then rank the k complete solutions and prune them to the top b states (b is the beam size).

The search continues with the next least confident mention z2, as shown in Fig 1. Each iteration increases the size of the discrepancy set by one. Note that a mention is not repeated in a discrepancy set. We consider two different strategies for terminating the search, depending on the heuristic function h in use (Section 2.3.3). Each strategy causes the search to terminate at a different depth (length) of the discrepancy set. The output of the search is selected by the pruning function p from the last set of complete solutions.
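The greedy beam search over discrepancy sets can be sketched as follows; the propagator, pruning score, and candidate lists are stand-in arguments here, not the paper's trained components:

```python
def lds_search(initial, ordered_mentions, candidates, propagate, score, b=1, k=3):
    """Greedy beam search over discrepancy sets (a sketch of the LDS loop).

    initial:          complete solution from the local model (dict mention -> entity)
    ordered_mentions: mentions sorted by increasing local confidence (heuristic h)
    candidates:       mention -> list of candidate entities
    propagate:        discrepancy propagator; maps (initial, discrepancies) to a
                      complete solution
    score:            pruning function p over complete solutions
    """
    beam = [dict()]                                 # each beam entry is a discrepancy set
    best = (score(initial), initial)
    for z in ordered_mentions:                      # one new discrepancy per iteration
        expanded = []
        for disc in beam:
            for y in candidates[z][:k]:             # top-k candidates for mention z
                new_disc = {**disc, z: y}
                solution = propagate(initial, new_disc)
                expanded.append((score(solution), solution, new_disc))
        expanded.sort(key=lambda t: t[0], reverse=True)
        beam = [d for _, _, d in expanded[:b]]      # prune to beam size b
        best = max(best, expanded[0][:2], key=lambda t: t[0])
    return best[1]

# Toy usage with stand-in components (not the paper's actual models):
gold = {"m1": "A", "m2": "B"}
initial = {"m1": "A", "m2": "X"}
candidates = {"m1": ["A"], "m2": ["X", "B"]}
score = lambda sol: sum(sol[m] == gold[m] for m in sol)
propagate = lambda init, d: {**init, **d}
print(lds_search(initial, ["m2"], candidates, propagate, score, b=1, k=2))
```

Note that the search explores at most depth × b × k propagations, which is what keeps the global step cheap relative to exact joint inference.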
In the following, we will first explain our local model (ψ(⋅,⋅)) for producing the initial solution. We will then introduce our LDS search framework, and its key components including the Discrepancy Propagator to propagate a set of discrepancies to other mentions in the document; the Heuristic Function to compute the local confidence of the mentions in the document and identify the least confident ones; and finally the Pruning Function to guide the search and select the final solution.

2.2 Local Model

Our local model utilizes contextual, lexical and prior evidences to compute the compatibility score ψ(xi,yi) for assigning candidate yi to mention xi. These evidences are extracted and used as follows:

2.2.1 Contextual Evidence

Given a query mention, a key challenge for the local model is to identify minimal but sufficient contextual evidence to disambiguate the query. We first extract all sentences in the document relevant to the query mention. This is achieved by applying CoreNLP (Manning and McClosky, 2014) to perform coreference resolution on all mentions in the document and extracting all sentences that contain a mention in the query's coreference chain.

These sentences are then concatenated to form the context for the query: wi = [wi1, ..., wim], where wij ∈ R^d is the embedding of the j-th word in the context. We then use the attention model introduced by Ganea and Hofmann (2017) to compress the context into a single embedding. Specifically, we define the contextual representation ci ∈ R^d for mention xi with candidate set {yi1, yi2, ..., yik} over context wi as

c_i = \sum_{l=1}^{m} \alpha_{il} w_{il}    (2)

The weight vector α is computed using the following attention considering all k entity candidates:

\alpha_i = \mathrm{softmax}\big(\big[\max_{j \in 1..k} y_{ij}^{T} A w_{i1}, \;\ldots,\; \max_{j \in 1..k} y_{ij}^{T} A w_{im}\big]\big)    (3)

where A is a learned matrix that scores the relatedness between word and entity.
This attention model computes the relatedness of each word with all of the entity candidates and takes the max as the score for each word. The scores of all context words are then passed through a softmax to compute their weights. Under this model, if a word is strongly related to one of the candidates, it is given a high weight. Subsequently, using ci we define the contextual features for candidate yij as f^{(c)}_{ij} = [y_{ij}^T B c_i; y_{ij}^T B; c_i], where B is a learned R^{d×d} matrix. Note that f^{(c)}_{ij} ∈ R^{2d+1}.
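A small pure-Python sketch of the attention mechanism in equations (2)-(3); the embeddings and the matrix A below are toy placeholders, not trained values:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def context_embedding(words, cands, A):
    """Eqs (2)-(3): attention-weighted context vector.

    words: list of m word embeddings (each a list of d floats)
    cands: list of k candidate entity embeddings
    A:     d x d relatedness matrix (list of lists)
    """
    def rel(y, w):
        # y^T A w
        Aw = [sum(A[r][c] * w[c] for c in range(len(w))) for r in range(len(w))]
        return sum(yi * awi for yi, awi in zip(y, Aw))

    # each word's score: max relatedness over all candidates (eq. 3)
    scores = [max(rel(y, w) for y in cands) for w in words]
    alpha = softmax(scores)
    d = len(words[0])
    # weighted combination of the word embeddings (eq. 2)
    return [sum(a * w[j] for a, w in zip(alpha, words)) for j in range(d)]

# Toy 2-d example: with identity A, relatedness reduces to a dot product.
words = [[1.0, 0.0], [0.0, 1.0]]
cands = [[1.0, 0.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
c = context_embedding(words, cands, A)
```

In the toy example the first word is more related to the single candidate, so it receives the larger attention weight.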

2.2.2 Lexical Evidence

The contextual features extracted above ignore any lexical/surface information between the query mention and the entity title, which can be useful. To this end, we add lexical features to our local model, including variants of the edit distance between the surface strings of the mention and the candidate title, and whether their surface strings follow an acronym pattern. The detailed list can be found in Table 1(a).
The extracted features are scalar real values. We use RBF binning (Sil et al., 2018) to transform each scalar into a 10-d vector. Hence, for each query xi and candidate yij we have a lexical feature vector f^{(l)}_{ij} ∈ R^{10·|f|}, where |f| is the number of features listed in Table 1(a).
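RBF binning itself is straightforward; below is an illustrative sketch in which the bin centers and width are assumed values, not necessarily those used by Sil et al. (2018):

```python
import math

def rbf_bin(x, centers=None, gamma=10.0):
    """Project a scalar feature to a 10-d vector of RBF activations.
    The centers and gamma here are illustrative choices."""
    if centers is None:
        centers = [i / 9.0 for i in range(10)]   # 10 evenly spaced centers in [0, 1]
    return [math.exp(-gamma * (x - c) ** 2) for c in centers]

v = rbf_bin(0.0)  # the activation at center 0.0 is maximal for x = 0
```

The binning turns each scalar into a soft one-hot vector, which lets the downstream MLP learn non-linear responses to the raw feature value.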

(a) Lexical features. Notation: mention m = [m1, ..., ma], entity title e = [t1, ..., tb]

f1: mention length, a
f2: entity title length, b
f3: \sum_{i=1}^{b} (occurrence count of t_i in the document)
f4: m is an acronym
f5: e is an acronym
f6: the acronym patterns of m and e are an exact match
f7: m and e are non-acronyms and an exact match
f8: min-edit(m, e)
f10: sum of partial min edits: \sum_{i=1}^{a} \min_{j=1..b} \text{min-edit}(m_i, t_j)

(b) Features for the heuristic h2, based on the local score distribution for mention m: l = softmax([ψ(xi, yi1), ..., ψ(xi, yik)])

f1: max(l)
f2: second-max(l)
f3: entropy(l)
f4: m is an acronym
f5: length(m)

Table 1: (a) list of lexical features in the local model (b) list of the features to learn heuristic function h2
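A few of the Table 1(a) features can be sketched as follows; min-edit is the standard Levenshtein distance, and the acronym test is a simplified first-letter pattern (the paper's exact feature definitions may differ):

```python
def min_edit(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def lexical_features(mention, title):
    """An illustrative subset of the Table 1(a) features."""
    m, e = mention.split(), title.split()
    acronym = lambda ws: "".join(w[0].upper() for w in ws)
    return {
        "f1_mention_len": len(m),
        "f2_title_len": len(e),
        "f6_acronym_match": acronym(m) == acronym(e),
        "f8_min_edit": min_edit(mention, title),
        "f10_partial_min_edit": sum(min(min_edit(mi, tj) for tj in e) for mi in m),
    }
```

The partial min-edit (f10) is more forgiving than f8 when the mention and title share words in a different order.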

2.2.3 Prior Evidence

We consider p(e|m), the prior probability that an entity e is linked to a mention string m, as prior evidence. This is computed using hyperlink statistics from Wikipedia and the alias mapping from Hoffart et al. (2011), obtained by extending the "means" tables of YAGO (Hoffart et al., 2013). p(e|m) is also transformed into a 10-d vector using RBF binning (Sil et al., 2018) to create the prior feature f^{(p)}_{ij} ∈ R^{10}.
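Estimating p(e|m) from anchor statistics amounts to normalized counting, as this toy sketch illustrates (the real counts come from Wikipedia hyperlinks and the extended YAGO means tables; the entity names below are placeholders):

```python
from collections import Counter, defaultdict

def build_prior(anchor_pairs):
    """Estimate p(e|m) from (mention string, linked entity) anchor pairs,
    as would be gathered from Wikipedia hyperlink statistics."""
    counts = defaultdict(Counter)
    for m, e in anchor_pairs:
        counts[m][e] += 1
    return {m: {e: c / sum(es.values()) for e, c in es.items()}
            for m, es in counts.items()}

prior = build_prior([("Paris", "Paris_France"), ("Paris", "Paris_France"),
                     ("Paris", "Paris_Texas")])
# prior["Paris"]["Paris_France"] is 2/3 on this toy data
```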

2.2.4 Overall Local Model

The contextual, lexical and prior features are concatenated and fed through a Multi-Layer Perceptron (MLP) with two hidden layers (ReLU, 200-d and 50-d respectively) to produce a final local score for each candidate:

\psi(x_i, y_{ij}) = \sigma\big(W_s \big[f^{(c)}_{ij};\; f^{(l)}_{ij};\; f^{(p)}_{ij}\big] + b_s\big)    (4)

where Ws and bs are the weight and bias parameters of the MLP. The learning of the model is two-tiered. We first pre-train the matrices A and B with the cross-entropy loss, using softmax([y_{ij}^T B c_i]_{j∈1..k}) as the predicted probability of each candidate for mention xi. Keeping A and B fixed, we then train the MLP weights with a dropout rate of 0.7, again minimizing the cross-entropy loss. Note that we use the entity embeddings produced by Ganea and Hofmann (2017) and the word embeddings produced by the skip-gram model of Mikolov et al. (2013).

2.3 Global Model via LDS

Our local model primarily focuses on the textual context of each mention in making its decisions and ignores the relationship between mentions. Our global model takes as input the local predictions for all the mentions in a single document and constructs a globally coherent final solution using Limited Discrepancy Search. Before we introduce our LDS search procedure, we will first describe the discrepancy propagator which is critical toward achieving an efficient search space.

2.3.1 Discrepancy Propagator

The purpose of the discrepancy propagator is to allow the influence of the local changes to propagate to other parts of the solution, thus reducing the necessary number of discrepancies needed. Our discrepancy propagator is essentially a new scoring function that evaluates each candidate for a given mention not only based on the local compatibility but also the global coherence among the predictions.

Given that not all mentions in a document need to be related to one another, for each mention xi we construct an entity context Ei using a window of 30 mentions centered at the query mention xi. Given the current entity assignment e1, e2, ..., en to the mentions in the document, the entity context for mention xi is Ei = (e_{i−15}, ..., e_{i−1}, e_{i+1}, ..., e_{i+15}).
For a given mention xi and its entity context Ei, the new score of a candidate yij is defined as:

g(x_i, y_{ij}) = \psi(x_i, y_{ij}) + \frac{1}{|E_i|} \sum_{e \in E_i} \phi(e, y_{ij})    (5)

In equation 5, the score of a candidate now considers both its local score ψ as well as its average coherence with the other predictions in its entity context. The coherence score ϕ between two entities e and yij is defined as:

\phi(e, y_{ij}) = e^{T} C\, y_{ij} + w^{T} r(e, y_{ij})    (6)

where e and yij are entity embeddings and r(e, yij) is a vector of three pairwise features between the two entities: the log-transformed counts of co-occurrences, shared in-links and shared out-links; C ∈ R^{d×d} and w ∈ R^3 are learned weights.
The score given by equation 5 is fed through a softmax activation to assign a probability to each candidate of xi. We train C and w in a mention-wise procedure using the cross-entropy loss. Specifically, for each mention xi we use the ground-truth entities for the other mentions in Ei and update C and w such that the true candidate of xi scores higher than its false candidates.
Note that we do not intend to use this scoring function to perform joint inference over all mentions. Instead, it is used within LDS to propagate discrepancies to the outputs of all related mentions as follows. Given the original solution e1, e2, ..., en and a set of discrepancies {xk: yk}, we first update ek to yk. We then re-evaluate equation 5 for all mentions that have yk in their entity context, producing a complete solution based on the new entity context containing yk.
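The propagation step can be sketched as follows, with ψ and ϕ passed in as stand-in functions; after a discrepancy is fixed, only the mentions whose entity context contains a changed assignment are re-scored with equation (5):

```python
def propagate(initial, discrepancies, mentions, candidates, psi, phi, window=30):
    """Apply a discrepancy set, then re-score (eq. 5) every mention whose
    entity context contains a changed assignment. A simplified sketch."""
    solution = dict(initial)
    solution.update(discrepancies)
    half = window // 2
    changed = set(discrepancies)
    for i, x in enumerate(mentions):
        if x in discrepancies:
            continue  # discrepancies themselves stay fixed
        ctx_mentions = [m for m in mentions[max(0, i - half): i + half + 1] if m != x]
        if not (changed & set(ctx_mentions)):
            continue  # no changed assignment in this mention's entity context
        ctx = [solution[m] for m in ctx_mentions]
        def g(y):  # eq. (5): local score plus average coherence with the context
            return psi(x, y) + sum(phi(e, y) for e in ctx) / len(ctx)
        solution[x] = max(candidates[x], key=g)
    return solution

# Toy usage with stand-in scoring functions (not the trained models):
mentions = ["m1", "m2"]
candidates = {"m1": ["A", "C"], "m2": ["B", "X"]}
initial = {"m1": "C", "m2": "X"}
psi = lambda x, y: 0.0
phi = lambda a, b: 1.0 if {a, b} == {"A", "B"} else 0.0
result = propagate(initial, {"m2": "B"}, mentions, candidates, psi, phi)
```

In the toy run, correcting m2 to "B" pulls m1 to the coherent entity "A" even though m1's local scores are uninformative, illustrating how one discrepancy can fix several mistakes.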

2.3.2 Heuristic Function

The heuristic function takes the initial solution and sorts all the mentions by their prediction confidence. In this work we consider two heuristic functions for computing the prediction confidence of the local model. For a given query xi, our first heuristic function h1 computes its confidence simply by taking the max of the local scores normalized by a softmax function:

h_1(x_i) = \max\big(\mathrm{softmax}([\psi(x_i, y_{i1}), \ldots, \psi(x_i, y_{ik})])\big)    (7)

where ψ(xi,yij) is the local score for candidate yij.

For our second heuristic function h2, we learn a binary classifier (a two-layer MLP) to produce a local confidence for each mention xi. The binary classifier utilizes the features listed in Table 1(b). Each feature value is transformed into a 10-d vector using RBF binning (Sil et al., 2018). The training samples for the binary classifier are generated from the predictions of the local model on the training set. One training example is generated for each local prediction: if the local model's decision is correct, the label is 1, and 0 otherwise. To balance the positive and negative training samples, we randomly sub-sample the positive local predictions. For each local prediction, the learned classifier outputs the probability that it is correct, which is used as the confidence score.
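Generating the balanced training data for h2 can be sketched as follows (the feature vectors here are opaque placeholders):

```python
import random

def h2_training_samples(examples, seed=0):
    """examples: list of (feature_vector, local_prediction, gold_entity).
    Returns balanced (features, label) pairs; label is 1 iff the local
    model's prediction was correct. Positives are sub-sampled to match
    the number of negatives."""
    pos = [(f, 1) for f, p, g in examples if p == g]
    neg = [(f, 0) for f, p, g in examples if p != g]
    random.Random(seed).shuffle(pos)
    return pos[:len(neg)] + neg

# Toy data: 8 correct local predictions, 2 mistakes.
examples = ([(i, "A", "A") for i in range(8)]
            + [(i, "B", "A") for i in range(2)])
balanced = h2_training_samples(examples)
```

Sub-sampling matters because a solid local model produces far more positives than negatives, and an unbalanced classifier would learn to always predict "correct".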

2.3.3 Pruning Function

At each iteration of the search, each beam entry is expanded to k new states. Following the expansion, the DP is applied to the b×k expanded nodes, re-evaluating equation 5 for all mentions whose entity contexts are updated by the discrepancies; it hence propagates the discrepancies and produces a complete solution. Subsequently, the pruning function evaluates each of the b×k complete solutions and reduces the expansion list to the beam size b.

The pruning function is similar to equation 5 but scores all the mentions collectively. Given all the mentions in the document x1,...,xn and their predicted entities denoted as o=o1,...,on, our pruning function scores the solution o as follows:

s(o) = \sum_{i} \psi(x_i, o_i) + \sum_{o_i \in E_j \,\text{or}\, o_j \in E_i} \phi_g(o_i, o_j)    (8)

where the constraint (oi ∈ Ej or oj ∈ Ei) indicates that we consider the relation between a pair of entities (oi, oj) only if one is in the entity context of the other. Here ϕg takes the same form as equation 6 but uses a different set of weights Cg and wg.

Despite the similarity in form, the pruning function serves a different purpose from equation 6 and requires different training. We train the pruning function by reducing it to a rank-learning problem. Specifically, during training we collect all the complete solutions considered in each pruning step and compute their Hamming losses with respect to the ground truth. Given the set of solutions in a pruning step, we create ranking pairs requiring the solution with the least Hamming loss to score higher than all the others. Given a ranking pair (ot, of), where ot has a lower Hamming loss than of, we use the following ranking loss for training:

\max\big(0, \Delta(o_t, o_f) - s(o_t) + s(o_f)\big)    (9)

where Δ(ot, of) is the absolute difference between the Hamming losses of ot and of. This loss function penalizes the scoring function if it fails to score ot higher than of by the margin Δ(ot, of).
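The ranking loss of equation (9) is a standard margin loss; a direct sketch:

```python
def ranking_loss(s_best, s_other, hamming_best, hamming_other):
    """Margin ranking loss (eq. 9): the better solution (lower Hamming loss)
    must outscore the other by at least the difference in Hamming losses."""
    delta = abs(hamming_best - hamming_other)
    return max(0.0, delta - s_best + s_other)

# s(o_t)=3.0, s(o_f)=2.5, Hamming losses 1 vs 4: the required margin is 3,
# only 0.5 is achieved, so the loss is 2.5.
print(ranking_loss(3.0, 2.5, 1, 4))  # → 2.5
```

Scaling the required margin by the Hamming-loss gap pushes the pruning function to separate clearly-worse solutions more aggressively than near-ties.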

Terminating the search. We consider two different strategies for terminating the search, depending on the heuristic function in use. When using h1, the search terminates on reaching a depth limit τ (a maximum of τ discrepancies). The strategy with h2 uses a flexible depth: we terminate the search once discrepancies have been introduced for all queries that h2 predicts to be incorrect.

3 Experiments

3.1 Data Sets

We use two datasets, CoNLL 2003 (Hoffart et al., 2011) and TAC 2010 (Ji et al., 2010), for evaluation. The CoNLL dataset is partitioned into train, test-a and test-b sets with 946, 216 and 231 documents respectively. Following our baselines, we only use the 27,816 mentions with valid links to the KB. The TAC 2010 dataset is another popular NED dataset, released by the Text Analysis Conference (TAC). It contains training and test sets with 1,043 and 1,013 documents respectively. As with CoNLL, we only consider linkable mentions in TAC and report our performance on the 1,020 query mentions in the test set.
To learn and tune the parameters of the local models for CoNLL and TAC, we use their own training and development splits. However, to learn and tune the parameters of the global model (the discrepancy propagator, the heuristic and pruning functions), we only use the CoNLL training and development sets.
The number of queries per document in the test sets of CoNLL and TAC is approximately 20 and 1 respectively. In order to create a global setup for TAC, we apply the CoreNLP (Manning and McClosky, 2014) mention extractor to the TAC test documents and perform joint disambiguation of the extracted mentions together with the query mentions, increasing the number of mentions to approximately 4 per document. We only report performance on the query mentions with the standard ground truth given by TAC.

3.2 Hyper-parameters and Dimensions

Our settings for the hyper-parameters and the dimensionality of the embeddings and weights are as follows. We use 300-d entity/word embeddings, learned using the mechanism proposed by Ganea and Hofmann (2017). We use two-layer MLPs for the local model and the heuristic h2, with hidden layer sizes 200×50 and 100×20 respectively. RBF binning always transforms a scalar into a 10-d vector. Although we analyze and compare different configurations for the beam size and depth limit in Section 3.4, our reported results use beam size 5 and the flexible depth limit τ with heuristic h2.

3.3 Results

To evaluate model performance we use the standard micro-averaged accuracy of the top-ranked candidate entities. We use different alias mappings for TAC and CoNLL. Specifically, for TAC we only use anchor-title alias mappings constructed from hyperlinks in Wikipedia. For CoNLL, in order to follow the experimental setup of Sil et al. (2018), we additionally use the mappings produced by Pershina et al. (2015) and Hoffart et al. (2011).
Tables 2(a) and (b) show the performance of our local and global models on the CoNLL and TAC datasets respectively, along with other competitive systems. The prior state-of-the-art performance on these two datasets was achieved by Sil et al. (2018). The results show that our global model outperforms all competitors on both CoNLL 2003 and TAC 2010. It is interesting to note that our local model is solid, but noticeably inferior to the state-of-the-art local model. With an even stronger local model such as that of Sil et al. (2018), one can potentially expect our LDS-based global model to push the state of the art even further.

3.4 Ablation Study and Performance Analysis

For ablation we analyze the impact of the features and search. We also analyze the behavior of our model based on the rarity of the entities.

Feature Analysis

We study the impact of features on the local and global models. For the local model, starting from only the contextual evidence described in Section 2.2.1, the performance steadily increases as we add the prior and lexical features. As shown in Tables 3(a) and (b), the prior and lexical features have a very strong impact on TAC. The binning technique that projects the prior and lexical features to 10 dimensions gives an average gain of 0.94 and 0.61 percentage points for TAC and CoNLL respectively.
For the global model, entity-pair compatibility is computed using both entity embeddings and the log-transformed counts of co-occurrences, shared in-links and shared out-links. Using these features leads to gains of 1.1% and 0.43% in accuracy for CoNLL and TAC respectively, compared to the global model using only the embeddings in equation 6.

(a) CoNLL 2003

models                                    In-KB acc%
local   Context only                      85.37
        Context + Prior + Lexical         90.28
        Context + Prior + Lexical + Bin   90.89
global  Our global                        94.44
        - log count features              93.34
        - LDS + 1-step global prop.       93.14
        - LDS + conv. global prop.        93.63

(b) TAC-2010

models                                    In-KB acc%
local   Context                           70.86
        Context + Prior + Lexical         84.79
        Context + Prior + Lexical + Bin   85.73
global  Our global                        87.90
        - log count features              87.47
        - LDS + 1-step global prop.       86.21
        - LDS + conv. global prop.        86.29

Table 3: The performance of variants of our Local/Global models on CoNLL 2003 and TAC-2010

Search Analysis

In the last two rows of Table 3, we list variants of our model with LDS search removed. Specifically, given the initial local solution, our discrepancy propagator (equation 5) can be applied without any discrepancy to re-evaluate the candidates for all mentions and obtain a more globally compatible solution. Naturally, one can repeat this for multiple iterations until convergence, which gives an alternative global model that does not rely on search. We consider this model and show its single-iteration performance (-LDS + 1-step global prop.) as well as its converged performance (-LDS + conv. global prop.) in Table 3. From the results we can see that a single iteration of global propagation significantly improves the initial local solution, but later iterations yield only mild gains: after the first propagation, convergence is almost immediate. In contrast, our search-based global model achieves significant further gains of 1.3% for CoNLL and 1.68% for TAC. The reason is that the local discrepancies introduced by LDS force the solution to escape the current local optimum and search for better ones.

Additionally, several parameters/choices in the search can potentially impact the performance: the beam size b, the depth of search, and the different heuristics for prioritizing the discrepancy locations. The following set of experiments explore these choices of parameters:

Beam size. We compare the model with beam size b=5 to a simple greedy model with b=1. For CoNLL 2003 and TAC-2010, the model with b=5 gives gains of 0.24% and 0.13% respectively over the greedy model. Increasing the beam size beyond 5 did not yield significant gains.

Depth of search. For heuristic h1, we use a fixed-depth strategy. In particular, given a document with n mentions, we consider two depth limits, τ = 25%·n and τ = 50%·n, which lead to average depths of 5 and 10 per document respectively. For heuristic h2, the depth is flexible and determined by the number of mentions that h2 predicts to be incorrect. This strategy leads to an average depth of 4. In Tables 4(a) and (b), we report the confusion matrices given by h1 applied to the local predictions on test-b of CoNLL 2003 with τ = 25%·n and τ = 50%·n respectively. In these tables, correct/incorrect indicates whether the local prediction matches the ground truth. Therefore the first cell, with value 1134, indicates that 1134 mentions in test-b are correctly predicted by the local model but deemed among the top 25% least confident mentions (i.e., the hard queries) by h1. The confusion matrix of a good heuristic has small diagonal values and large anti-diagonal values.

In Table 4 (c), we apply heuristic h2 to the same data with flexible depth. These results show that heuristic h2 gives the best precision as well as recall of the real mistakes made by the local model.

In Table 5, we report the performance of our model on CoNLL 2003 with different choices of heuristic, search depth, and beam size. The results show that using a beam of size 5 improves upon single greedy search, and that using heuristic h2 with flexible depth gives the best performance in terms of both prediction accuracy and efficiency (due to smaller search trees). When h1 is used, doubling the depth of the search tree brings only a marginal improvement in accuracy at the cost of doubling the prediction time.

Performance Analysis Based on Entity Rarity

We also analyze the behavior of the coherent global models based on the rarity of the true entity for a given mention. Specifically, we measure the rarity of e for a given query mention m by p(e|m), as defined in Section 2.2.3, scaled to (0, 100). We quantize the rarity measure into bins, and for each bin compute the difference in accuracy between a coherent global model and the local model. Two coherent models are considered: a global model without LDS (-LDS + conv. global prop. in Table 3) and the global model with LDS using h2. Figure 2 shows the difference in accuracy between the global and local models per bin. As shown in this figure, both global models achieve their gains mostly on bins with small p(e|m), which correspond to mentions whose true entity is rare. Additionally, the impact of the global model with LDS is more significant, especially for mentions whose true entities are rarest according to p(e|m).
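The per-bin analysis can be sketched as follows, assuming each test mention is recorded with its scaled prior and the correctness of the local and global predictions (toy data, not the paper's results):

```python
def accuracy_gain_by_rarity(records, n_bins=10):
    """records: list of (prior_pct, local_correct, global_correct) tuples,
    with the prior p(e|m) scaled to [0, 100]. Returns the per-bin
    (global - local) accuracy difference; None for empty bins."""
    bins = [[] for _ in range(n_bins)]
    for prior, loc, glob in records:
        idx = min(int(prior / 100 * n_bins), n_bins - 1)
        bins[idx].append((loc, glob))
    gains = []
    for b in bins:
        if not b:
            gains.append(None)
            continue
        loc_acc = sum(l for l, _ in b) / len(b)
        glob_acc = sum(g for _, g in b) / len(b)
        gains.append(glob_acc - loc_acc)
    return gains

# Toy records: two rare-entity mentions fixed by the global model,
# one common-entity mention already correct locally.
records = [(5, 0, 1), (5, 0, 1), (95, 1, 1)]
gains = accuracy_gain_by_rarity(records, n_bins=2)
```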

Figure 2: Global accuracy minus local accuracy per bin of rarity measure p(e|m) which is scaled to (0, 100) for two coherent global models; model without LDS and model with LDS.

4 Related Work

Deep learning has been leveraged in recent local and global models. Among local models, Sun et al. (2015) use neural tensor networks to model mention, context and entity embeddings. Ganea and Hofmann (2017) develop an attention-based model to weigh the importance of the words in the query document. The model of Sil et al. (2018) utilizes a neural tensor network, multi-perspective cosine similarity, and lexical composition and decomposition. Among global models, the early models (Milne and Witten, 2008; Ferragina and Scaiella, 2010) decompose the problem over mentions. Hoffart et al. (2011) use an iterative heuristic to prune edges between mentions and entities. Cheng and Roth (2013) use an integer linear program solver, and Lev-Arie Ratinov and Anderson (2011) apply an SVM to use relation scores as ranking features.

In recent global models, Personalized PageRank (PPR) (Jeh and Widom, 2003) is adopted by several studies (Han and Sun, 2011; He et al., 2013; Alhelbawy and Gaizauskas, 2014; Pershina et al., 2015). Yamada et al. (2016) extend the skip-gram model (Mikolov et al., 2013) to learn the relatedness of entities using the linking structure of the KB. Ganea and Hofmann (2017) use a Conditional Random Field (CRF) based model to capture the interrelationship among entities in the same document whereas Globerson et al. (2016) introduce a soft k-max attention model to weight the importance of other entities in the document in making prediction for any given query.

In our proposed global model, we address the intractable global optimization within a search framework. Specifically, we use Limited Discrepancy Search (Doppa et al., 2014). Initialized with a local solution, LDS only explores a small number of corrections to the hard queries. The propagation of these corrections enables the correction of other mistakes (even those with high local confidence) and allows us to reach high-quality solutions with shallow searches. Moreover, the heuristic for prioritizing the discrepancies can noticeably reduce the search time without hurting performance.

5 Conclusion

In this paper we study the problem of entity disambiguation. We are inspired by the observation that local models for this task tend to produce reasonable solutions, such that with only a small number of corrections and a proper propagation of these corrections throughout the document, one can quickly find superior solutions that are globally more coherent.

Based on this observation, we propose a search-based approach that starts from an initial solution produced by a local model and uses Limited Discrepancy Search (LDS) to search the space of possible corrections with the goal of improving the linking performance. The experimental results show that our global model improves the state of the art on both the CoNLL 2003 and TAC 2010 benchmarks. For future research, we are interested in further understanding the strengths and weaknesses of the LDS-based approach for different types of queries and entities. We are also interested in applying different local models to initialize the LDS search.