Saturday, February 23, 2013

This is the second part of my text processing series. In this blog, we'll look into how text documents can be stored in a form that can be easily retrieved by a query. I'll use the popular open source Apache Lucene index for illustration.

There are two main processing flows in the system ...

Document indexing: Given a document, add it to the index.

Document retrieval: Given a query, retrieve the most relevant documents from the index.

The following diagram illustrates how this is done in Lucene.

Index Structure

Both documents and queries are represented as a bag of words. In Apache Lucene, a "Document" is the basic unit for storage and retrieval. A "Document" contains multiple "Fields" (also called zones). Each "Field" contains multiple "Terms" (equivalent to words).

To control how the document will be indexed across its containing fields, a Field can be declared in multiple ways to specify whether it should be analyzed (a pre-processing step during indexing), indexed (participate in the index) or stored (in case it needs to be returned in the query result).

Keyword (Not analyzed, Indexed, Stored)

Unindexed (Not analyzed, Not indexed, Stored)

Unstored (Analyzed, Indexed, Not stored)

Text (Analyzed, Indexed, Stored)

The inverted index is the core data structure of the storage. It is organized in an inverted manner, mapping each term to the list of documents that contain it. The list (known as a posting list) is ordered by a global ordering (typically by document id). To enable faster retrieval, the list is not just a single list but a hierarchy of skip lists. For simplicity, we ignore the skip lists in the subsequent discussion.

This data structure is illustrated below, based on Lucene's implementation. It is stored on disk as segment files, which are brought into memory during processing.

The above diagram only shows the inverted index. The whole index contains an additional forward index as follows.

Document indexing

A document in its raw form is extracted by a data adaptor (this can mean calling a Web API to retrieve some text output, crawling a web page, or receiving an HTTP document upload). This can be done in a batch or online manner.

When the indexing process starts, it parses each raw document and analyzes its text content. The typical steps include ...

Tokenize the document (break it down into words)

Lowercase each word (to make the index case-insensitive, but be careful with names and abbreviations)

Remove stop words (take out high-frequency words like "the", "a", but be careful with phrases)

Stemming (normalize different forms of the same word, e.g. reduce "run", "running", "ran" to "run")

Synonym handling. This can be done in two ways: either expand the term to include its synonyms (i.e. if the term is "huge", add "gigantic" and "big"), or reduce the term to a normalized synonym (i.e. if the term is "gigantic" or "huge", change it to "big").
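The analysis steps above can be sketched as a small pipeline. The stop-word list, suffix table, and synonym map below are tiny illustrative assumptions, not Lucene's actual analyzers.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "in", "of"}
SYNONYMS = {"huge": "big", "gigantic": "big"}  # reduce to a normalized synonym

def analyze(text):
    terms = []
    for word in re.findall(r"[a-zA-Z]+", text):        # tokenize
        word = word.lower()                            # lowercase
        if word in STOP_WORDS:                         # remove stop words
            continue
        for suffix in ("ning", "ing", "s"):            # toy suffix stemming
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(SYNONYMS.get(word, word))         # synonym normalization
    return terms

print(analyze("The gigantic dog is running in the park"))
# ['big', 'dog', 'run', 'park']
```

A real analyzer (e.g. Lucene's StandardAnalyzer with a Porter stemmer) does each of these steps far more carefully; the point here is just the order of the stages.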

At this point, the document is composed of multiple terms: doc = [term1, term2 ...]. Optionally, terms can be further combined into n-grams. After that, we count the term frequency of this document. For example, in a bi-gram expansion, the document will become ...
doc1 -> {term1: 5, term2: 8, term3: 4, term1_2: 3, term2_3:1}
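The counting with bi-gram expansion can be sketched as follows (the `term1_term2` naming is just an illustrative convention):

```python
from collections import Counter

def term_frequencies(terms):
    counts = Counter(terms)                                    # unigram counts
    counts.update("_".join(p) for p in zip(terms, terms[1:]))  # add bi-grams
    return dict(counts)

print(term_frequencies(["a", "b", "a"]))
# {'a': 2, 'b': 1, 'a_b': 1, 'b_a': 1}
```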

We may also compute a "static score" based on some measure of the document's quality. After that, we insert the document into the posting list for each term (including all n-grams), creating the posting list if it doesn't exist. This builds the inverted index structure shown in the previous diagram.

A boost factor can be set on the document or field. The boost factor effectively multiplies the term frequency, thereby raising (or lowering) the importance of the document or field.

A document can be added to the index in one of the following ways: inserted, updated or deleted.
Typically the document will first be added to a memory buffer, which is organized as an inverted index in RAM.

When this is a document insertion, it goes through the normal indexing process (as described above) to analyze the document and build an inverted list in RAM.

When this is a document deletion (the client request only contains the doc id), the system fetches the document content from the forward index, then goes through the normal indexing process to analyze the document and build the inverted list. But in this case the doc object in the inverted list is labeled as "deleted".

When this is a document update (the client request contains the modified document), it is handled as a deletion followed by an insertion: the system first fetches the old document from the forward index to build an inverted list with nodes marked "deleted", and then builds a new inverted list from the modified document. (e.g. if doc1 = "A B" is updated to "A C", then the posting list will be {A: doc1(deleted) -> doc1, B: doc1(deleted), C: doc1}. After collapsing A, the posting list will be {A: doc1, B: doc1(deleted), C: doc1}.)
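This insert/delete/update flow can be sketched with a toy in-memory index. The `forward_index`/`inverted_index` structures and `(doc_id, deleted)` tombstone tuples here are assumptions for illustration, not Lucene's actual data layout.

```python
from collections import defaultdict

forward_index = {}                   # doc_id -> list of terms
inverted_index = defaultdict(list)   # term -> [(doc_id, deleted?), ...]

def insert(doc_id, terms):
    forward_index[doc_id] = terms
    for term in set(terms):
        inverted_index[term].append((doc_id, False))

def delete(doc_id):
    # fetch the old content from the forward index, then append tombstones
    for term in set(forward_index.pop(doc_id)):
        inverted_index[term].append((doc_id, True))

def update(doc_id, terms):
    delete(doc_id)   # handled as a deletion followed by an insertion
    insert(doc_id, terms)

insert("doc1", ["A", "B"])
update("doc1", ["A", "C"])
# B's posting list now ends with a tombstone; A holds both the old entry
# and the fresh one, which collapse later at segment-merge time
```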

As more and more documents are inserted into the memory buffer, it becomes full and is flushed to a segment file on disk. In the background, when M segment files have accumulated, Lucene merges them into a bigger segment file. Notice that the size of segment files at each level increases exponentially (M, M^2, M^3). This keeps the number of segment files that need to be searched per query at O(logN), where N is the number of documents in the index. Lucene also provides an explicit "optimize" call that merges all the segment files into one.

Let's look at the merging process in a bit more detail. Since the posting lists are already vertically ordered by term and horizontally ordered by doc id, merging two segment files S1 and S2 basically works as follows ...

Walk the posting lists from both S1 and S2 together in sorted term order. For the non-common terms (terms that appear in one of S1 or S2 but not both), write out the posting list to a new segment S3.

When we find a common term T, we merge the corresponding posting lists from the two segments. Since both lists are sorted by doc id, we just walk down both posting lists and write out the doc objects to a new posting list. When both posting lists contain the same doc (which happens when the document has been updated or deleted), we pick the latest doc based on time order.

Finally, the document frequency of each posting list (of the corresponding term) is computed.
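The merge can be sketched as follows, assuming each segment maps a term to a posting list of `(doc_id, timestamp)` pairs sorted by doc id, with the timestamp standing in for "time order"; this layout is an assumption for illustration, not Lucene's file format.

```python
def merge_segments(s1, s2):
    s3 = {}
    for term in sorted(set(s1) | set(s2)):
        if term not in s1:
            s3[term] = list(s2[term])          # non-common term: copy over
        elif term not in s2:
            s3[term] = list(s1[term])
        else:
            # common term: walk both doc-id-sorted lists together
            merged, i, j = [], 0, 0
            p1, p2 = s1[term], s2[term]
            while i < len(p1) and j < len(p2):
                if p1[i][0] < p2[j][0]:
                    merged.append(p1[i]); i += 1
                elif p1[i][0] > p2[j][0]:
                    merged.append(p2[j]); j += 1
                else:                          # same doc: pick the latest
                    merged.append(max(p1[i], p2[j], key=lambda e: e[1]))
                    i += 1; j += 1
            merged.extend(p1[i:]); merged.extend(p2[j:])
            s3[term] = merged
    return s3

s1 = {"A": [(1, 10), (3, 10)]}
s2 = {"A": [(1, 20)], "B": [(2, 20)]}
print(merge_segments(s1, s2))
# {'A': [(1, 20), (3, 10)], 'B': [(2, 20)]}
```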

Document retrieval

Consider a document as a vector (with each term as a separate dimension and the corresponding value being the tf-idf value), and the query as a vector as well. The document retrieval problem can then be defined as finding the top-k documents most similar to the query, where similarity is defined as the dot product or cosine distance between the document vector and the query vector.

tf-idf is a normalized frequency. TF (term frequency) represents how many times the term appears in the document (usually with a compression function such as square root or logarithm applied). IDF is the inverse of the document frequency, used to discount the significance of a term that appears in many other documents. There are many variants of TF-IDF, but generally it reflects the strength of association between the document (or query) and each term.
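As a concrete sketch, here is one common variant (log-compressed TF times log IDF); Lucene's own similarity formula differs in its details.

```python
import math

def tf_idf(term_count, doc_freq, num_docs):
    # log-compressed term frequency, zero if the term is absent
    tf = 1 + math.log(term_count) if term_count > 0 else 0.0
    # inverse document frequency: discounts terms that appear in many docs
    idf = math.log(num_docs / doc_freq)
    return tf * idf

# a term appearing 4 times in a doc, present in 10 of 1000 docs
print(round(tf_idf(4, 10, 1000), 3))
```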

Given a query Q containing terms [t1, t2], here is how we fetch the corresponding documents. A common approach is the "document at a time" approach, where we traverse the posting lists of t1 and t2 concurrently (as opposed to the "term at a time" approach, where we traverse the whole posting list of t1 before starting on the posting list of t2). The traversal process is as follows ...

For each term t1, t2 in the query, we identify the corresponding posting lists.

We walk the posting lists concurrently to return a sequence of documents (ordered by doc id). Notice that each returned document contains at least one of the terms, but may contain multiple.

We compute the dynamic score, which is the dot product of the query and document vectors. Notice that we typically don't care about the TF/IDF of the query (which is short, so the frequency of each query term hardly matters). Therefore we can just sum the TF scores from every posting list with a matching term, after factoring in the IDF score (kept at the head of each posting list). Lucene also supports query-level boosting, where a boost factor can be attached to query terms; the boost factor multiplies the corresponding term frequency.

We also look up the static score, which is based purely on the document (not the query). The total score is a linear combination of the static and dynamic scores.

Although the score in the above calculation is based on the cosine distance between the query and the document, we are not restricted to that. We can plug in any similarity function that makes sense for the domain (e.g. we can use machine learning to train a model to score the similarity between a query and a document).

After we compute the total score, we insert the document into a heap data structure where the top-k highest-scored documents are maintained.
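The document-at-a-time traversal with a top-k heap can be sketched as follows; the `(doc_id, tf)` posting entries and the per-term idf kept at the head of each list are simplifying assumptions, and static scores and boosts are omitted.

```python
import heapq

def retrieve(query_terms, index, k):
    # index: term -> (idf, [(doc_id, tf), ...] sorted by doc id)
    cursors = {t: 0 for t in query_terms if t in index}
    heap = []  # min-heap of (score, doc_id), size <= k
    while cursors:
        # the next doc id is the smallest one at the head of any cursor
        doc = min(index[t][1][cursors[t]][0] for t in cursors)
        score = 0.0
        for t in list(cursors):
            idf, postings = index[t]
            if postings[cursors[t]][0] == doc:
                score += postings[cursors[t]][1] * idf  # accumulate tf * idf
                cursors[t] += 1
                if cursors[t] == len(postings):
                    del cursors[t]                      # list exhausted
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)                         # drop the lowest score
    return sorted(heap, reverse=True)                   # best first

index = {"t1": (2.0, [(1, 1), (3, 2)]), "t2": (1.0, [(1, 3), (2, 1)])}
print(retrieve(["t1", "t2"], index, 2))
# [(5.0, 1), (4.0, 3)]
```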

Here the whole posting list is traversed. If the posting list is very long, the response latency will be long. Is there a way to avoid traversing the whole list and still find the approximate top-k documents? There are a couple of strategies we can consider.

Static score posting order: Notice that the posting list is sorted by a global order. This global ordering provides a monotonically increasing document id during the traversal, which is important for the "document at a time" traversal because it guarantees we never visit the same document twice. This global ordering, however, can be quite arbitrary and doesn't have to be the document id. So we can pick an order based on the static score (e.g. a quality indicator of the document), which is also global. The idea is that by traversing the posting list in decreasing order of static score, we are more likely to visit the documents with the highest total score (static + dynamic) early.

Cut frequent terms: We do not traverse the posting list of a term with a low IDF value (i.e. a term that appears in many documents and therefore tends to have a long posting list). This way we avoid traversing long posting lists.

TopR list: For each posting list, we create an extra posting list containing the top R documents with the highest TF (term frequency) in the original list. When we perform the search, we search this topR list instead of the original posting list.

Since we have multiple inverted indexes (in the memory buffer as well as in the segment files at different levels), we need to combine their results. If termX appears in both segmentA and segmentB, the fresher version is picked. The fresher version is determined as follows: the segment at a lower level (smaller size) is considered fresher; if the two segment files are at the same level, the one with the higher number is fresher. The IDF value, on the other hand, is the sum of the corresponding IDFs of the posting lists across the segment files (the value will be slightly off if the same document has been updated, but such discrepancy is negligible). However, consolidating multiple segment files incurs processing overhead during document retrieval. Lucene provides an explicit "optimize" call that merges all segment files into a single file, so there is no need to look at multiple segment files during document retrieval.

Distributed Index

For a large corpus (like web documents), the index is typically distributed across multiple machines. There are two models of distribution: term partitioning and document partitioning.

In document partitioning, documents are randomly spread across the partitions where the index is built. In term partitioning, the terms are spread across different partitions. We'll discuss document partitioning as it is more commonly used. Distributed indexes are provided by other technologies built on top of Lucene, such as ElasticSearch. A typical setting is as follows ...

In this setting, machines are organized into columns and rows. Each column represents a partition of documents, while each row represents a replica of the whole corpus.

During document indexing, a row of machines is first randomly selected and allocated for building the index. When a new document is crawled, a column machine from the selected row is randomly picked to host the document. The document is sent to this machine, where the index is built. The updated index is later propagated to the other rows of replicas.

During document retrieval, a row of replica machines is first selected. The client query is then broadcast to every column machine of the selected row. Each machine performs the search in its local index and returns its top-M elements to the query processor, which consolidates the results before sending them back to the client. Notice that K/P < M < K, where K is the number of top documents the client expects and P is the number of columns of machines; M is a parameter that needs to be tuned.

One caveat of this distributed index is that, as the posting list is split horizontally across partitions, we lose the global view of the IDF value, without which a machine is unable to calculate the TF-IDF score. There are two ways to mitigate that ...

Do nothing: here we assume the documents are evenly spread across the partitions, so the local IDF is a good approximation of the actual IDF.

Extra round trip: In the first round, the query is broadcast to every column, and each column returns its local IDF. The query processor collects all the IDF responses and computes their sum. In the second round, it broadcasts the query along with the IDF sum to each column of machines, which compute their local scores based on the IDF sum.
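The first round of the extra-round-trip strategy can be sketched as follows; the `Partition` class is a hypothetical stand-in for a network call to a column machine, and the round-two `search` method is left as a stub.

```python
class Partition:
    def __init__(self, local_df, num_docs):
        self.local_df = local_df      # term -> local document frequency
        self.num_docs = num_docs

    def local_dfs(self, terms):                      # round 1
        return {t: self.local_df.get(t, 0) for t in terms}

    def search(self, terms, global_df, total_docs):  # round 2
        ...  # score locally using the globally consistent document frequency

def gather_global_df(partitions, terms):
    # broadcast round 1 and sum the per-partition document frequencies
    global_df = {t: 0 for t in terms}
    for p in partitions:
        for t, df in p.local_dfs(terms).items():
            global_df[t] += df
    return global_df

parts = [Partition({"big": 3}, 100), Partition({"big": 7, "dog": 2}, 100)]
print(gather_global_df(parts, ["big", "dog"]))
# {'big': 10, 'dog': 2}
```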

Friday, February 15, 2013

Entity recognition is commonly used to parse unstructured text documents and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing tasks for understanding a text document.

I am planning to write a blog series on text processing. In this first blog of the series, I will introduce some basic algorithms for entity recognition.

Given the sentence: Ricky is living in CA USA.
Output: Ricky/SP is/NA living/NA in/NA CA/SL USA/CL

Basically we want to tag each word with an entity type, whose definition is domain specific. In this example, we define the following tags

NA - Not Applicable

SP - Start of a person name

CP - Continuation of a person name

SL - Start of a location name

CL - Continuation of a location name

Hidden Markov Model

Let's say there is a sequence of states. We can model such a sequence with a probabilistic graph, where each state depends on the previous one (a Markov chain).

However, in our tagging example, we don't directly observe the tags. Instead, we only observe the words. In this case, we can use a Hidden Markov Model (HMM).

Now the tagging problem can be structured as follows.

Given a sequence of words, we want to predict the most likely tag sequence.

Since the sequence w1, w2 ... wn is observed and constant across all tag sequences, this is equivalent to maximizing the joint probability P(t1, t2, ... tn, w1, w2, ... wn), which equals P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…, where S and E denote the start and end states.

Viterbi Algorithm

Now the problem is to find a tag sequence t1, ... tn that maximizes
P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

A naive method is to enumerate all possible tag sequences and evaluate the above probability for each. The complexity is O(|T|^n), where |T| is the number of possible tag values. Notice that this is exponential in the length of the sentence.

However, there is a more efficient Viterbi algorithm that leverages the Markov chain property together with dynamic programming.

The key element is M(k, L), the maximum probability of any length-k tag sequence that ends with tk = L. M(k, L) is computed by looking back at the different choices S for the tag at position k-1, and picking the one that maximizes M(k-1, S)*P(tk=L | tk-1=S)*P(wk|tk=L). The complexity of this algorithm is O(n*|T|^2).

To find the actual tag sequence, we also maintain a back pointer from every cell to the S that led to it. Then we can trace back the path from the max cell M(n, STOP), where STOP marks the end of the sentence.
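The recurrence and back pointers can be sketched as follows; the probability tables in the example are toy assumptions, not trained values.

```python
def viterbi(words, tags, p_trans, p_emit, p_start, p_end):
    # M[k][L]: max probability of a length-(k+1) tag sequence ending in L
    M = [{t: p_start[t] * p_emit[t].get(words[0], 0.0) for t in tags}]
    back = []
    for k in range(1, len(words)):
        row, ptr = {}, {}
        for L in tags:
            # look back over choices S for the previous tag
            best_s = max(tags, key=lambda S: M[-1][S] * p_trans[S][L])
            row[L] = M[-1][best_s] * p_trans[best_s][L] * p_emit[L].get(words[k], 0.0)
            ptr[L] = best_s
        M.append(row); back.append(ptr)
    # fold in the end-of-sentence transition, then trace the back pointers
    last = max(tags, key=lambda t: M[-1][t] * p_end[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["NA", "SP"]  # NA = not applicable, SP = start of a person name
trans = {"NA": {"NA": 0.8, "SP": 0.2}, "SP": {"NA": 0.9, "SP": 0.1}}
emit = {"NA": {"is": 0.5, "living": 0.5}, "SP": {"ricky": 1.0}}
start = {"NA": 0.5, "SP": 0.5}
end = {"NA": 0.5, "SP": 0.5}
print(viterbi(["ricky", "is"], tags, trans, emit, start, end))
# ['SP', 'NA']
```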

Notice that for rare words not observed in the training data, P(wk|tk=L) will be zero, which makes the whole term zero. Such words can be numbers, dates, etc. One way to solve this problem is to group these rare words into patterns (e.g. 3 digits, year2012 ... etc.) and then compute P(group(wk) | tk=L). However, such grouping is domain specific and has to be hand-tuned.

Tuesday, February 12, 2013

You can think of planning as a graph search problem where each node in the graph represents a possible "state" of the world. A directed edge from nodeA to nodeB represents an "action" that is available to transition stateA to stateB.

Basic Concepts

A set of relation types, e.g. [ObjectTypeX RelationTypeA ObjectTypeY] or [ObjectTypeX RelationTypeA ValueTypeY]

A "state" is composed of a set of relation instances. It can be either a "reality state" or a "required state".

A reality state contains tuples of +ve atoms, e.g. [(personX in locationA), (personX is male)]. Notice that -ve atoms do not exist in a reality state: if personX is NOT in locationB, such a tuple will simply not show up in the state.

A required state contains both +ve and -ve atoms, e.g. [(personX in locationA), NOT(personX is male)]. The required state is used to check against the reality state. The required state is reached if all of the following are true.

All +ve atoms in the required state are contained in the +ve atoms of the reality state

None of the -ve atoms in the required state are contained in the +ve atoms of the reality state

Notice that there can be a huge (or even infinite) number of nodes and edges in the graph if we were to expand the whole graph (with all possible states and actions). Normally we express only a subset of nodes and edges in an analytical way. Instead of enumerating all possible states, we describe a state as the set of relations we care about; in particular, we describe the initial state of the environment with all the things we observed, and the goal state as what we want to reach. Similarly, we do not enumerate every possible edge; instead we describe actions with variables, so that each action describes a rule covering many state transitions.

An "action" causes a transition from one state to another. It is defined as action(variable1, variable2 ...) and contains the following components.

Pre-conditions: a required state containing a set of tuples (expressed with variables). The action is feasible if the current reality state contains all the +ve atoms but none of the -ve atoms specified in the pre-conditions.

Effects: a set of +ve atoms and -ve atoms (also expressed with variables). After the action is taken, it removes all the -ve atoms from the current reality state and then inserts all the +ve atoms into it.

Cost of executing this action.

Notice that since actions contain variables but the reality state does not, before an action can be executed we need to bind the variables in its pre-conditions to specific values so that they match the current reality state. This binding propagates to the variables in the effects of the action, and new atoms are inserted into / removed from the reality state.

Planning Algorithm

This can be thought of as a search problem. Given an initial state and a goal state, our objective is to search for a sequence of actions such that the goal state is reached.

We can perform the search from the initial state, expand all the states that can be reached by taking some action, and check during this process whether the goal state has been reached. If so, terminate the process and return the path.

Forward planning builds the plan from the initial state. It works as follows ...

Put the initial state into the exploration queue, with an empty path.

Pick a state (together with its path from the initial state) from the exploration queue as the current state, according to some heuristic.

If this current state is the goal state, then return its path, which contains the sequence of actions, and we are done. Otherwise move on.

For this current state, determine which actions are possible by checking whether the current state meets the pre-conditions (i.e. contains all the +ve and none of the -ve atoms specified in the action's pre-conditions).

If the action is feasible, compute the next reachable state and its path (by appending this action to the original path), and insert the next state into the exploration queue.

Repeat the previous step for all feasible actions of the current state, then go back to picking the next state from the queue.
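The forward-planning loop above can be sketched over ground (variable-free) actions, with states as frozen sets of +ve atoms; picking the oldest queue element makes it a breadth-first search. The action encoding is an assumption for illustration.

```python
from collections import deque

def forward_plan(initial, goal_pos, goal_neg, actions):
    # each action: (name, +ve pre-conditions, -ve pre-conditions, add, delete)
    queue = deque([(frozenset(initial), [])])   # (state, path of action names)
    seen = {frozenset(initial)}
    while queue:
        state, path = queue.popleft()           # oldest element: breadth-first
        if goal_pos <= state and not (goal_neg & state):
            return path                         # goal state reached
        for name, pre_pos, pre_neg, add, delete in actions:
            # feasible: all +ve pre-conditions present, no -ve atom present
            if pre_pos <= state and not (pre_neg & state):
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None

actions = [
    ("walk_A_to_B", {"at A"}, set(), {"at B"}, {"at A"}),
    ("walk_B_to_C", {"at B"}, set(), {"at C"}, {"at B"}),
]
print(forward_plan({"at A"}, {"at C"}, set(), actions))
# ['walk_A_to_B', 'walk_B_to_C']
```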

Alternatively, we can perform the search from the goal state. We look at what needs to be accomplished and identify what actions can accomplish it (i.e. the effects of the action meet the goal state). Then we look at whether those actions are feasible (i.e. the initial state meets the action's pre-conditions). If so, we can execute the action; otherwise we take the action's pre-conditions as our sub-goal and expand our overall goal state.

Backward planning builds the plan from the goal state. It works as follows ...

Put the goal state into the exploration queue, with an empty path.

Pick a regression state (a state that can reach the goal state, which can be considered a sub-goal) from the exploration queue according to some heuristic.

If the regression state is contained in the initial state, then we are done and return the path as the plan. Otherwise move on.

From this regression state, identify all "relevant actions": those actions that have some +ve effect contained in the regression state, whose +ve effects do not overlap with the -ve atoms of the regression state, and whose -ve effects do not overlap with the +ve atoms of the regression state.

If an action is relevant, compute the next regression state by removing the action's effects from the current regression state and adding the action's pre-conditions to it, then insert the next regression state into the exploration queue.

Repeat the previous step for all relevant actions of the current regression state.

Heuristic Function

In the above algorithms, we need to pick the next candidate from the exploration queue. We can employ many strategies.

If we pick the oldest element in the queue, this is a breadth-first search

If we pick the youngest element in the queue, this is a depth-first search

We can pick the best element in the queue based on some value function.

Notice that what is "best" is very subjective and also domain specific. A very popular approach is A* search, whose value function = g(thisState) + h(thisState).

Here g(thisState) is the accumulated cost to move from the initial state to "thisState", while h(thisState) is a domain-specific function that estimates the cost from "thisState" to the goal state. It can be proved that for A* search to return an optimal solution (i.e. the least-cost path), the chosen h(state) function must not over-estimate (i.e. it must underestimate) the actual cost to move from "thisState" to the goal state.
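The A* value function can be sketched by turning the exploration queue into a priority queue ordered by f = g + h. The grid world and Manhattan-distance heuristic below (which never over-estimates with unit-cost moves) are illustrative assumptions.

```python
import heapq

def a_star(start, goal, neighbors, h):
    queue = [(h(start), 0, start, [start])]   # entries: (f, g, state, path)
    best_g = {start: 0}
    while queue:
        f, g, state, path = heapq.heappop(queue)  # pick smallest f = g + h
        if state == goal:
            return path
        for nxt, cost in neighbors(state):
            g2 = g + cost                         # accumulated cost so far
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(queue, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# 2-D grid with unit-cost moves; h = Manhattan distance (admissible)
def neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]

path = a_star((0, 0), (2, 1), neighbors, lambda p: abs(p[0] - 2) + abs(p[1] - 1))
print(len(path) - 1)   # a shortest path takes 3 moves
```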