Friday, September 30, 2011

Importance computation can be made very query specific by first retrieving the top k results (e.g. using a vector space ranking method), and then retrieving the links that the results point forward to (authorities) and also the links that point back to them (hubs). In its entirety, this graph of links is called the "base set". Additional value can be extracted by also considering the other links (new authorities) that the hubs point to, and the other links (new hubs) that point to the authorities.

Importance computation can be done at three levels: global, query-specific, and topic-specific. At the global level, each page has a single rank used for every query. At the query-specific level, each page has a rank specific to each query, and at the topic level, each page has one rank per topic.

Thursday, September 29, 2011

To incorporate importance as well as query similarity in ranking a document, the best approach is to cluster the documents in the corpus based on the topic/class they belong to and find the importance of each document with respect to its cluster. This can be pre-computed and stored for later use.

When the query arrives, assign a topic to the query, find documents based on similarity with respect to the query, and calculate PageRank as:
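The exact formula wasn't captured in these notes; one standard form (topic-sensitive PageRank, stated here as an assumption) weights the per-topic ranks by how likely each topic is given the query: rank(d | q) = sum over topics c of P(c | q) * PR_c(d), which can then be combined with the similarity sim(q, d).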

Query-specific importance computation might be more accurate compared to global importance computation, but its major drawback is that the importance is calculated after the calculation of similarity. This drastically increases the query time, which is not acceptable.

Importance calculation can be done globally or query-specifically. In the global method, A/H or PageRank is computed once on the entire corpus, whereas in the query-specific method, PageRank or A/H is computed on the query results.

While both global and query-specific approaches to link analysis have their pros and cons, we can use a combination of the two, a topic-specific approach, to take the topic of the query into account while not sacrificing any performance. We do this by assigning the query to topic(s) and using the weights in the calculation.

- Stephen Booher (Sorry for the repost; I forgot to sign my last "tweet.")

There are two times at which similarity and importance can be combined: before the query or after it. Combining them before the query is like combining apples and oranges, but after the query one can compute similarity first and then importance based on the query. The tradeoffs correspond to speed and accuracy.

Pre-computing importance for each page is quick and simple to implement, using HITS or PageRank. Unfortunately, while the president's homepage may be very important with the query "world leaders", you wouldn't want him to show up when you search for "computer science". While computing the importance based on the query itself may be helpful in terms of accuracy, it will drastically slow down the results. -Kalin Jonas

The HITS algorithm for ranking pages uses hub and authority scores to rate each page. Hub scores are given to pages that point to other pages with high authority scores. Authority scores are given to pages that are pointed to by reliable hubs. Through iterations, these scores increase for each page; then after several iterations, those pages with high authority or high hub scores are presented. Presenting both authorities and hubs increases the likelihood that one of the first few results will be useful. -Kalin Jonas

Information retrieval on the web. There are additional requirements, such as the need to crawl and maintain an index. And there are additional structural advantages, such as links, tags, metadata, and social structure. For example, certain HTML tags provide more emphasis to their contents than others do. The use of hyperlinking adds a new level of associations between documents. The words around anchor tags can be used to describe the document that is linked, and the links on a page can in turn describe itself. Unique phrases can be used to create a strong relationship to a page more easily than a common phrase. Hyperlinks can be used to establish a community of trust. -Thomas Hayden

Wednesday, September 28, 2011

Today we studied the concept of hubs and authorities, as well as the concept of a steady state. Hubs and authorities can often be seen at work in the real world. For example, in social situations an authority figure will often have many followers (hubs), which reinforces each respective member's authority and hub status. The general concept of a steady state is essentially any state which tends to remain consistent over a period of time. In web IR, this tends to be applied to graph or matrix structures which can be used to calculate this state. -Thomas Hayden

Hubs point to authorities. A hub's value is calculated as the sum of the authority scores of the pages it points to, and an authority's value is calculated as the sum of the hub scores of the pages that point to it. Since their values keep increasing in every iteration, normalize them to unit vectors.
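
A rough Python sketch of this update (the adjacency list, iteration count, and toy pages are made up; real HITS would run only on the query-specific base set):

```python
import math

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority value = sum of the hub values of the pages pointing at it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # hub value = sum of the authority values of the pages it points at
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalize to unit vectors so the values don't grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# toy example: A and B act as hubs pointing at C and D
hub, auth = hits({"A": ["C", "D"], "B": ["C", "D"], "C": [], "D": []})
print(sorted(auth, key=auth.get, reverse=True))  # C and D get the top authority scores
```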

A stable state (hub value zero) and the probability of a user spending more time on a particular web page are the factors used to estimate page rank. Though link analysis can estimate which pages are expected to be most visited, it is not easy to know how much time a user spends on any particular page.

One simple method I can think of is to analyze how often (considering the time gap) a user selects alternative links proposed by the search engine after selecting one.

Sufficient conditions for making a graph "safe":
1) Remove sinkholes by setting any column of all 0s to all 1/n, where n is the number of nodes.
2) Remove disconnected components by adding weak links to all other nodes.

The necessary conditions for the existence of a unique steady-state distribution: aperiodicity (the chain is not one big cycle) and irreducibility (each node can be reached from every other node with non-zero probability).
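
A rough power-iteration sketch tying the two posts above together (the damping factor 0.85 and the toy adjacency matrix are assumptions): the sink fix implements condition 1, and the damping/teleportation term plays the role of the weak links in condition 2.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-8):
    """adjacency[i][j] = 1 if page i links to page j."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # column-stochastic transition matrix; sink pages (no out-links) become uniform 1/n columns
    M = np.where(out_degree[:, None] > 0, A / np.maximum(out_degree, 1)[:, None], 1.0 / n).T
    rank = np.full(n, 1.0 / n)
    while True:
        new_rank = damping * M @ rank + (1 - damping) / n   # weak links to every node
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank

# toy 3-page web: page 2 is a sink with no out-links
print(pagerank([[0, 1, 1], [0, 0, 1], [0, 0, 0]]))
```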

Facts about LSI:
1. It captures synonymy, polysemy, and correlations.
2. It can't handle queries about terms that are not present (e.g. "find all docs NOT containing term X").
3. It captures only linear correlations.

Web pages are mostly HTML documents, which allow the author of a web page to control the display of page contents on the web and to express emphasis on different parts of the page. We can make use of the tag information to improve the effectiveness of a search engine.

If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high, while for more common-place keywords, you need a lot more links.

Singular value decomposition (SVD) is a method for identifying and ordering the dimensions along which data points exhibit the most variation, and once we have identified where the most variation is, it's possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction.

What makes SVD practical for NLP applications is that you can simply ignore variation below a particular threshold to massively reduce your data, but be assured that the main relationships of interest have been preserved.
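
A small numpy sketch of this idea on a made-up matrix: keep only the top-k singular values and check how close the low-rank reconstruction stays to the original.

```python
import numpy as np

A = np.random.rand(8, 5)              # made-up data: 8 points in 5 dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                 # ignore all variation below the 2nd singular value
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("rank-%d reconstruction error: %.4f" % (k, np.linalg.norm(A - A_k)))
```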

When we map each document and query vector into a lower-dimensional space, we call it latent semantic indexing (LSI).

Anchor text has higher importance than the plain text of the page it points to. And if the anchor text is very common, then it requires a lot more work and links with that text before the page gets a higher rank.

The web consists of documents/pages which falsely advertise themselves as containing certain terms so that they can be retrieved. For example, anyone can create a lookalike home page of XYZ claiming to be the original webpage. So the challenge for the retrieval engine now is not only to get relevant documents but also trustworthy information.

A result of link analysis is that now links to a page can be considered content of the page linked to. If you direct many links to a page with the same anchor text, search engines using link analysis may pick up on this and correlate the page with the anchor text.

Pages that are picked should not just be relevant but also trustworthy. Trustworthiness should be a part of page ranking. At the same time, the relevance of a page is different from the importance of that page. Both relevance and importance should be considered in deciding which pages to show to the end user. The reason is that users can decide the relevance of a page but would not know the correctness of that page. Also, importance is understood to be a fixed-point computation.

There are at least two ways to utilize the structure of the web in ranking pages for a query. One way is to use the semantics of the page markup to create weighted vectors based on the importance of the terms in the page; another way is to use the anchor text of links pointing to the page, which adds more terms to the page.

Anchor text can be considered more important than the actual contents of the page pointed to. The benefit is that a user can find a page by describing it, but doesn't have to know the actual contents of the page. The downside is that defamatory links can make a page show up where it may not deserve it. -Kalin Jonas

An analog to LSI is the Fourier transform, where the input signal is split up into the frequency domain, allowing dimension reduction to filter out unwanted frequencies. This is directly analogous to LSI, where dimensionality reduction allows us to see the correlation between terms and documents.

We need to assume that the given data is corrupted, and that the data we get after dimensionality reduction is the real data. The least important dimensions, i.e. the dimensions in which the variance is very low, are actually noise and hence can be discarded.

The documents and terms are vectors in the space of factors. We use LSI to capture variance.

d-t = (d-f) · (f-f) · (f-t). The rank of the (f-f) matrix determines the number of dimensions. The dimensions can be further reduced, which gives us a better insight into the data variance and also the similarities. LSI captures the clusters over the data. It can capture intuitions which the cos(theta) similarity would never capture, like synonymy and polysemy.

Latent Semantic Indexing allows us to capture correlations between words that we previously did not know about. LSI is accomplished through use of SVD. A number of the smallest eigenvalues are chosen to be removed, and then the decomposition is re-multiplied. This produces a lower-dimension approximation that captures the most variance possible in those dimensions. This lower-dimension approximation causes correlated documents to become "clustered" together. A simple way of finding query similarity in this space is to add the query as a pseudo-document to the original document collection before decomposition, and then use the distance to other real documents after reduction. Alternatively, you can transform the query into the LSI space by multiplying the query vector by the term-factor matrix and the inverse of the factor-factor matrix. - Thomas Hayden
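
A hedged numpy sketch of the second option, folding the query into factor space; the tiny term-document matrix and the choice of k = 2 are invented for illustration.

```python
import numpy as np

# rows = terms, columns = documents (made-up counts)
td = np.array([[2, 0, 1, 0],
               [0, 3, 0, 1],
               [1, 0, 2, 0],
               [0, 1, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(td, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T        # term-factor, factor weights, doc-factor

q = np.array([1, 0, 1, 0], dtype=float)          # query over the 4 terms
q_f = q @ Uk @ np.diag(1 / sk)                   # fold the query into factor space

# cosine similarity between the folded query and each document in factor space
sims = Vk @ q_f / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_f))
print(np.argsort(-sims))                         # documents ranked by LSI similarity
```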

Eigenvector decomposition can be thought of as fitting an ellipsoid to the data. The eigenvectors tell us about the new dimensions (independent orthonormal axes) and the eigenvalues convey the importance of these dimensions in accounting for the variance of the distribution.

We want to gather the maximum characteristics of a document, including the ones that are not obvious in the "vector in term space" representation, in a way that lets us classify that document by looking at it from the fewest angles.

Tuesday, September 20, 2011

Dimensional reduction can be helpful for making sense of data with many dimensions. Even when removing several dimensions, a large portion of the original variance can be maintained. Some dimensions must be retained for the cosine-theta distance measurement to be useful. -Kalin Jonas

LSI can uncover hidden similarities that the normal cosine-theta similarity would miss. Given our familiar 10 document, 6 term d-t matrix, suppose we have two documents D1 and D2, where D1 has the term "regression" a huge number of times, and none of the other terms, and D2 has the term "likelihood" a huge number of times, and none of the other terms. Our intuitions say that we ought to expect similarity (also validated by the cosine clusters in a previous lecture), but when we actually compute the cosine similarity, their dot product is 0. LSI, on the other hand, will place the two documents in the dimension with the "regression" and "likelihood" terms, which shows that they are indeed similar.

Lower rank matrices are probably better because they will help reduce noise. Someone has proven that the resulting d-f × f-f × f-t product will always be the "best" (as in the most similar to the original) matrix for its given rank.

A term-term matrix should generally be a diagonal matrix if the terms are truly independent. If that is not the case, then the non-diagonal entries can give us some insight into how closely correlated one term is to another.

Dimension reduction is related to a concept called feature selection, where we define things in terms of features and determine which of the features are important. Feature selection is easier than dimension reduction, because in dimension reduction we do not know which of the dimensions are important, since they are all derived terms.

A document is a vector in the space of terms; a term is a vector in the space of documents. If you consider the term-term matrix, it should be a diagonal matrix if in fact the terms are truly independent.

Alternatively, we can also do it with query logs (people typing this word also type this word, so maybe it applies to you too). Two terms are related if they have high co-occurrence in the documents.

Email messages are bags of addresses, so we can compute address correlation.

Amazon users are bags of purchases, so Amazon can compute correlation between purchases: people who buy these things often also buy those things, so would you like to buy them as well? The benefit of term-term correlation is that if terms are not independent, we can exploit their correlation.

Sunday, September 18, 2011

When we perform dimensionality reduction:
Variance in data = (sum of the eigenvalues of the top k dimensions) / (total of the eigenvalues of all dimensions)
Fraction of variance lost = 1 - variance in data (calculated above)
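
For example, with made-up eigenvalues already sorted in decreasing order:

```python
eigenvalues = [9.0, 4.0, 1.5, 0.5]           # invented, sorted in decreasing order
k = 2
variance_kept = sum(eigenvalues[:k]) / sum(eigenvalues)
print("variance kept:", variance_kept, "variance lost:", 1 - variance_kept)
```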

From the term-document matrix you compute the term-term/correlation matrix, which you normalize to obtain a new matrix (association clusters) that tells you how strongly terms are correlated, with 1.0 as the maximum value. From this normalized correlation matrix, we can compute one last matrix called scalar clusters that shows transitive correlation. In class it was shown that the correlation numbers for the terms database, SQL, and index increased with scalar clusters. When Gmail recommends people to add to an email, it's using some sort of correlation algorithm.

Scalar clusters are used to construct a thesaurus. There can be a global thesaurus (GT) or a local thesaurus (LT). A GT is constructed using all terms in all documents in the corpus, whereas LT construction is query specific and uses only terms associated with the query.

An LT is better than a GT in scenarios where a term has a specific meaning with respect to the query. For example, if we are looking for "operation" within computer science documents and construct a GT, we might get many uses of the term "operating", which will dilute the significance of the term.

Association clustering (AC) helps us determine correlation between neighbouring terms. AC lacks the transitive correlation property, i.e. that if term t1 is correlated to term t2, and term t2 is correlated to term t3, then t1 should be related to t3.

To overcome this drawback of AC, scalar clustering (SC) is used. A scalar cluster is obtained by taking the dot product of two term vectors from the association cluster matrix. SC determines neighbourhood correlation.
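
A rough numpy sketch of both steps (the term-document counts are made up, and the normalization c_uv / (c_uu + c_vv - c_uv) used for the association clusters is the common textbook form, stated here as an assumption):

```python
import numpy as np

# made-up term-document frequency matrix: rows = terms, columns = documents
td = np.array([[3, 0, 2, 0],
               [2, 1, 2, 0],
               [0, 4, 0, 3]], dtype=float)

c = td @ td.T                                   # raw term-term correlation matrix

# association clusters: normalize so each entry lies in [0, 1]
diag = np.diag(c)
assoc = c / (diag[:, None] + diag[None, :] - c)

# scalar clusters: cosine between the rows (term vectors) of the association matrix
norms = np.linalg.norm(assoc, axis=1)
scalar = assoc @ assoc.T / np.outer(norms, norms)

print(np.round(assoc, 2))
print(np.round(scalar, 2))   # transitive (neighbourhood) correlations show up here
```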

Friday, September 16, 2011

Representing the document in new dimensions such that they are functions of the original dimensions is essentially looking at the document from a different angle, so we can describe the document with fewer dimensions. From the fish example we can see that width increases as length increases (or the other way around), and we see a certain behaviour mapped. A new dimension called size was then created to capture the maximum variation.

While considering the example of fish length and width to discuss dimensionality reduction, we saw an ellipsoid-like variation in the graph. If this ellipsoid becomes a perfect circle then the eigen gap will be zero, and hence there is no way to identify a primary attribute that gives the least percentage of variation loss.

Singular value decomposition takes a rectangular matrix of gene expression data (defined as A, an n × p matrix) in which the n rows represent the genes and the p columns represent the experimental conditions. The SVD theorem states:

A(n×p) = U(n×n) S(n×p) V(p×p)^T

where

U^T U = I(n×n)

V^T V = I(p×p) (i.e. U and V are orthogonal)

Calculating the SVD consists of finding the eigenvalues and eigenvectors of A A^T and A^T A.

The eigen gap |Eval1 - Eval2| determines how fast or slow a random vector becomes parallel to the principal eigenvector under repeated multiplication by the matrix (i.e. during power iteration of the matrix against a random vector); the convergence is roughly exponential in the number of multiplications. When the eigen gap is large it takes fewer matrix multiplications, and when it is small it takes much longer, but eventually the vector becomes parallel to the principal eigenvector.

Thursday, September 15, 2011

Discovered the use of k-gram bag intersection with both the lexicon and/or the query log to either provide auto-correction or enhance user search patterns. One striking thing is that if the IDF of a word in the lexicon is less than a threshold, it could be a typo and hence we should probably not consider it when intersecting! Also an interesting aspect of the use of edit distances, where transposition and alignment could reduce distances considerably. And it was refreshing to look at Bayes rule from a different perspective in correcting query errors which are not syntactical. -- Aneeth

Adding 2 documents together to form a third document in the t-d matrix does not change the dimensionality of the matrix because the new document is just a linear combination of the other two. -James Cotter

In PCA (principal component analysis), we are trying to reduce as much as possible the number of dimensions while still retaining as much information as possible. In other words, we are finding the axes that show as much spread of the data as possible. The details have not yet been discussed, but it turns out that the eigenvectors determine the dimensions, and the corresponding eigenvalues determine the importance of the dimensions.

Under scalar clusters, if term k1 is correlated to k2 and k2 is correlated to k3, then k1 is correlated to k3. This is computed by considering terms as vectors. By taking the dot product of term vectors in the association cluster we get the scalar cluster.

Keywords/terms in a document are treated as independent dimensions (as if each keyword/term had a unique meaning in a non-redundant language), but in reality they are correlated. To exploit this, dimensionality reduction techniques are required. PCA (principal components analysis) is one technique for such dimensionality reduction. PCA applied to documents is called latent semantic indexing.

Using the terms within documents to analyze them may cause problems with synonymous words or situational words. Computing a singular value decomposition can describe documents in terms of brand new concepts while capturing synonymous and polysemic information. -Kalin Jonas

Autocorrection is implemented in some ways through k-grams, which are mapped to the words the user most likely wants. The other way is based on the edit distance, which is computed through dynamic programming.
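
A toy sketch of the k-gram half of this (bigrams with boundary markers; the tiny lexicon and the overlap threshold are made up):

```python
def kgrams(word, k=2):
    padded = "$" + word + "$"                    # boundary markers
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def candidates(misspelled, lexicon, k=2, min_overlap=0.5):
    """Return lexicon words whose k-gram sets overlap enough with the query term."""
    query_grams = kgrams(misspelled, k)
    scored = []
    for word in lexicon:
        grams = kgrams(word, k)
        overlap = len(query_grams & grams) / len(query_grams | grams)
        if overlap >= min_overlap:
            scored.append((word, round(overlap, 2)))
    return sorted(scored, key=lambda t: -t[1])

print(candidates("informtion", ["information", "informal", "nation", "retrieval"]))
```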

We can multiply the TD (term-document) matrix with its transpose, i.e. the DT (document-term) matrix, and that results in a square TT (term-term) matrix. If the resulting matrix is a diagonal matrix, we know that it is because each term matches only itself. But if it is not a diagonal matrix, then we can add the related terms to the query and enhance our search results. We can further normalize the resultant square matrix to intuitively see the correlations between the terms.

Wednesday, September 14, 2011

A power law distribution, when applied to IR, can be used to show that those without query logs can have a harder time gathering them, due to the inferior ability of a search engine without previous logs to improve searching. Edit distance - the distance between two words, determined by the number of insertions, deletions, replacements, or transpositions of letters necessary so that the two words are the same. This process, like many others discussed in class thus far, can be weighted to give differing values to each of the types of changes, and even different weights to different variations of each type of change (such as some letter replacements being more or less heavily weighted than others). When using this process, proper alignment of characters is necessary to prevent the edit distance from becoming a high-complexity problem. Correlation and co-occurrence analysis - terms that are related may be added to the results using a thesaurus-like method. -Thomas Hayden

Answer to question 1: They are linearly independent, because there is no way that I can multiply one of them by a scalar to obtain the other. They are not a basis, because no linear combination of those two vectors can produce the vector <1, 1, 1>, which is in the three dimensional Euclidean space.

The power law applies when we analyse document corpora vs query logs. The few favorite search engines have the ability to further improve by analyzing query logs (which are not readily available to others).

Tuesday, September 13, 2011

When we find the edit / weighted edit / Levenshtein distance between sequences, the alignment of the two sequences with respect to each other plays an important part. Dynamic programming is used to find the optimal alignment. The sequence alignment problem is also seen in gene sequence encoding and speech recognition.
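
A standard dynamic-programming sketch of the (unweighted) Levenshtein distance:

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # 3
```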

--> LSI (Latent Semantic Indexing) co-occurrence analysis is helpful in avoiding the problems of synonymy and polysemy. --> We use a k-gram index to retrieve vocabulary terms that have many k-grams in common with the query.

--> Levenshtein Distance: minimum number of basic operations to convert S1 to S2.

Computing Jaccard similarity: for the intersection operation, for each distinct word, take the minimum number of times it appears in either of the documents. For the union operation, take the maximum number of times that the word appears in either of the documents.
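
A small sketch of that bag-of-words Jaccard computation:

```python
from collections import Counter

def jaccard(doc1, doc2):
    """Multiset Jaccard: min counts for the intersection, max counts for the union."""
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    words = set(c1) | set(c2)
    intersection = sum(min(c1[w], c2[w]) for w in words)
    union = sum(max(c1[w], c2[w]) for w in words)
    return intersection / union if union else 0.0

print(jaccard("the cat sat on the mat", "the cat ate the rat"))   # 3/8 = 0.375
```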

The for-loop presented in the pseudo-code used to compute the cosine similarity between two documents (using tf-idf) is over words, which is the faster way. If the outer level of the for-loop were over documents, it would be slower.

The most critical time for a search engine is the time taken "when the user hits enter with their query and before the page gets shown". This is what makes inverted indexing important: we might spend a lot of time creating it, but eventually it cuts down the time to retrieve a page when the user issues the query.

Because the vectors are sparse, inverted indexing becomes more useful. Naive retrieval would touch every document, so it is inefficient. With an inverted index, documents whose similarity would be zero are not touched at all. The important point is to use the inverted index to compute the similarity metric. In traditional IR, the index is made of keywords, and the lexicon is made of keywords. Modern search engines index the full text. If you construct the index yourself, you should pay attention to stemming and stop words, which can not only reduce the index size but also improve answer relevance.
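
A toy sketch of the idea (made-up mini-corpus; raw term counts stand in for full tf-idf weights to keep it short): only documents containing at least one query term are ever touched.

```python
from collections import defaultdict, Counter

docs = {                                  # made-up mini-corpus
    "d1": "information retrieval and web search",
    "d2": "database systems and sql",
    "d3": "web information extraction",
}

# inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def score(query):
    """Term-at-a-time scoring: only the postings of the query terms are touched."""
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf          # documents with zero similarity never appear
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score("web information"))
```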

- There is an assumption that the words being considered as different dimensions of the vectors are independent of each other. However, there are techniques like latent semantic indexing that can be used to see the correlation between terms.

- Since all the weights in the vectors are either positive or zero, we will not have more than 90 degrees between vectors.

Jaccard similarity is used to find words in the lexicon that most resemble the query words whereas vector similarity is used to find the documents that should appear in the search results by comparing the document vectors with the query vector.

Saturday, September 10, 2011

As huge vectors are sparse in nature, we exploit this sparsity by using an inverted index. Any document whose similarity is equal to zero is not touched when using the inverted index. The inverted index does not change the ranking of the pages; we still rank the documents in order of decreasing vector similarity. So using an inverted index has the advantage that the words of the query fall in a very small fraction of the set of documents.

Documents are divided into barrels (ordered so that the first barrels hold a low percentage of the documents and the later ones hold more), and results are shown to the user from the first barrels initially; if the user keeps looking, the later ones are shown.

Inverted Index: An index into a set of texts of the words in the texts. The index is accessed by some search method. Each index entry gives the word and a list of texts, possibly with locations within the text, where the word occurs.

Thursday, September 8, 2011

In the inverted indexing technique, we don't deal with dictionaries but with the lexicon generated from the words in the document corpus. Having such a lexicon saves time during query processing. If the document corpus is not changing, the previously created lexicon can be reused.

Since all words are not present in every document, this matrix is mostly sparse.

The cosine similarity between a document and the given query is calculated. The disadvantage of this approach lies in the fact that many documents are not at all similar to the query, that is, they don't contain the relevant terms. In such cases the cos theta value is zero for most documents, but it is still calculated.

This issue can be resolved by using Inverted Indexing which selects documents based on query.

Stemming is the process of reducing the inflected words into base or root form. The porter stemming algorithm developed by Dr. Martin F. Porter is widely used and is a de-facto standard algorithm for English stemming.

In bioinformatics, inverted indexes are very important in the sequence assembly of short fragments of sequenced DNA. One way to find out where fragments came from is to search for them against a reference DNA sequence.

I was randomly browsing about ranking algorithms and came across this page which explains the difference between popularity and the trustworthiness of a webpage. Google or Bing may follow different page ranking algorithm to rank the web pages. Personalization would also play an important role in determining the ranking of the webpage.

The popularity of the web page is calculated in terms of mozRank

MozRank is calculated on a scale between 1 and 10 with respect to the popularity of the website. The underlying principle is very easy to understand: the more links a page receives, the more popular it is, and that increases the ranking of the site.

Along the same lines, the trustworthiness of a webpage is calculated in terms of mozTrust.

mozTrust is another important factor which influences the ranking of a page in the search engine results. This metric estimates the confidence in a page relative to the other pages found on the web. Again, the logic is quite similar: if a page gets a link from another page which is considered "trusted" by the engine, it is rated higher. Trusted sites can be things like educational institutions such as ASU, government websites, company sites, etc.

For more information and the technical aspects of these features, please refer to this site.

Three techniques for generating keywords:
1. Stop word elimination - eliminate common words in the lexicon (e.g. do not index them). In English, some examples would be "the", "an", ...
2. Noun phrase detection - combine multiple words that occur together, e.g. "data structure".
3. Stemming - remove endings of words so that the query can be matched more easily to the words indexed from documents, e.g. "walked" -> "walk".
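
A toy sketch combining techniques 1 and 3 (the stop-word list is tiny and made up, and the crude suffix-stripper merely stands in for a real stemmer such as Porter's):

```python
STOP_WORDS = {"the", "an", "a", "of", "and", "in", "to"}   # tiny made-up list

def crude_stem(word):
    """Very rough stand-in for a real stemming algorithm like Porter's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(keywords("The student walked to the data structures lecture"))
# ['student', 'walk', 'data', 'structure', 'lecture']
```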