According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement can be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word of Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaks down. This problem is intimately linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified this assertive question a bit to make the point clearer.

Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).

If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001k) per cent of the original list, and likewise culture B will know B(k) = (0.805)^(0.001k). Thus the per cent of the words that both cultures know is, by the multiplication principle,

Q = A(k)*B(k) = (0.805)^(0.001k)*(0.805)^(0.001k) = (0.805)^(0.002k)

What the glottochronologist does now is construct a list of words. The two cultures are then studied, and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives

k = 500 ln Q / ln 0.805

To understand the significance of this ratio we need to look at some examples.

Examples

Sandefur provides several examples.

Suppose that the natives of two islands have similar languages. From a list of 300 words, 180 words are understood by both groups, that is, Q = 180/300 = 0.6. Then

k = 500 ln 0.6 / ln 0.805 ≈ 1177.5

We then conclude that the natives of these two islands came from a common ancestry approximately 1200 years ago.
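The dating computation above is easy to check numerically. Here is a minimal sketch in Python (the function name is ours; the constant 0.805 is the r^1000 value the model assumes):

```python
import math

def separation_years(q, r_per_millennium=0.805):
    """Estimate the years since two cultures separated, given the
    fraction q of a word list that both cultures still know.
    From Q = r^(0.002k) with r^1000 = 0.805: k = 500 ln Q / ln 0.805."""
    return 500 * math.log(q) / math.log(r_per_millennium)

# The two-island example: 180 of 300 words shared, so Q = 0.6
print(round(separation_years(180 / 300), 1))  # -> 1177.5
```

Note that the estimate is quite sensitive to Q: because of the logarithm, small errors in counting shared words shift the date by centuries.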

Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation k for pairs of tribes in each geographical region. It can be argued that the region containing the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its new location, but the largest time separation will always be between the two tribes that remained in the original area.

Drawbacks and Pitfalls of Glottochronology

The model presumes independence (see the discussion of the multiplication principle); that is, it assumes that events co-occur by chance. But we know that

If p(A1A2) = p(A1)p(A2), the events co-occur by chance.
If p(A1A2) > p(A1)p(A2), the events co-occur more often than by chance.
If p(A1A2) < p(A1)p(A2), the events co-occur less often than by chance.
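These three cases can be illustrated with document-level co-occurrence over a handful of toy “documents” (the term sets below are made up for the example):

```python
# Hypothetical toy collection: each "document" is a set of terms.
docs = [
    {"glottochronology", "language"},
    {"language", "dating"},
    {"glottochronology", "language", "dating"},
    {"economy", "man"},
]

def p(term):
    # Marginal probability: fraction of documents containing the term.
    return sum(term in d for d in docs) / len(docs)

def p_joint(t1, t2):
    # Joint probability: fraction of documents containing both terms.
    return sum(t1 in d and t2 in d for d in docs) / len(docs)

joint = p_joint("glottochronology", "language")   # 2/4 = 0.5
indep = p("glottochronology") * p("language")     # (2/4)*(3/4) = 0.375
print(joint > indep)  # True: these terms co-occur more often than chance predicts
```

Semantically related terms almost always land in the second case (joint probability above the independence product), which is exactly why the independence assumption is suspect.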

One way terms deviate from independence is through their semantics (meaning). If the meanings of words change over time, how do we know whether all the words on a word list change by the same amount?

As noted by Sandefur:

…how do you determine if a word is the same for two cultures? If the spelling of a word or the pronunciation of a word changes ‘slightly’, we will still count it as being on the list. If the meaning of a word changes ‘significantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also, some words are more likely to change than others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.

The moral of this is that you need to be careful not to make more claims about your model than are justified.

Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is

p(A1A2) = p(A1)p(A2)

Some authors like Sandefur call this the multiplication principle.

Quoting Sandefur:

Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.

Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.

Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we would expect A and B to both know PQ per cent of the words.
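Sandefur’s Frank-and-Sue arithmetic is a one-liner; the sketch below just restates the principle (the function name is ours):

```python
def expected_common_fraction(p_a, p_b):
    # Multiplication principle: under independence, the expected
    # fraction of the list that both A and B know is the product.
    return p_a * p_b

# Frank knows 70% and Sue knows 80% of a 100-word list,
# so we expect both to know about 56% of it (56 words).
print(round(expected_common_fraction(0.70, 0.80), 2))  # -> 0.56
```

The entire glottochronological dating argument rests on this product; if Frank’s and Sue’s knowledge were correlated, the expected overlap would differ.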

In Part III we will apply this principle, with some examples, to the evolution of cultures.

Later in this series we will explain how the independence assumption affects some of the reasonings and claims behind language dating models.

In the meantime: how relevant is this model to IR? Well, assume that A and B are not Frank and Sue, but passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.

L. Venkata Subramaniam, PhD, Manager – Information Processing and Analytics, IBM India Research Lab (http://lvs004.googlepages.com) sent us an email with some great news. They are holding the Third Workshop on Analytics for Noisy Unstructured Text Data (AND) on July 23-24, 2009, in Barcelona, Spain, and asked us to disseminate the news.

A copy of the email follows.

——————–

Dear Edel,

We are organizing the Third Workshop on Analytics for Noisy Unstructured Text Data on July 23-24, 2009, at Barcelona, Spain.

We know you work in related areas and would be happy to have you submit your research work to this workshop.

Although not back-to-back, during this week I will be posting on glottochronology.

Glottochronology is a combination of Greek terms that essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the per cent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) = rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0) = 1. Therefore, at time k, A(k) = r^k(1) = r^k is the per cent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year-by-year comparison. By comparing a written language today with the same language a millennium ago, glottochronologists can estimate r^1000. This number also depends on the particular language. But glottochronologists have found that r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).
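The recurrence A(n+1) = rA(n) and its closed form can be checked against each other numerically (a minimal sketch; the function name is ours):

```python
def fraction_in_use(k_years, r_per_millennium=0.805):
    # Closed form A(k) = r^k with r^1000 = 0.805, i.e. 0.805^(0.001 k)
    return r_per_millennium ** (0.001 * k_years)

# Iterating the recurrence A(n+1) = r * A(n) for 1000 steps
# should reproduce the assumed millennium retention of 0.805.
r = 0.805 ** 0.001
a = 1.0
for _ in range(1000):
    a *= r
print(round(a, 3), round(fraction_in_use(1000), 3))  # both -> 0.805
```

In other words, after one millennium roughly 80.5 per cent of the word list survives, after two millennia about 0.805^2 ≈ 65 per cent, and so on.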

Glottochronology is one of those fields that was once popular but about which many now cast doubts, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that pair the independence assumption with IDF scores. I’m working on a paper presentation on the subject.

That’s what local newspapers in Puerto Rico are doing: insisting on old spam marketing tactics. Wake up, local webmasters. It is not 1995. Redirections, splash page ads, and keyword spamming in meta keyword tags not only do not work with search engines, but also annoy users to no end.

This is exactly what some local papers, like the El Nuevo Dia newspaper, are doing.

Today I typed http://endi.com and was redirected to a full-size splash page ad. Then I had to opt out to be redirected to a content page. Thank you for slapping those annoying ads in my face. Why chase away readers?

Adding insult to injury, when I looked at their horrible content page source code (http://www.elnuevodia.com/noticias), it was clear that those who designed the page insist on keyword spamming through meta keyword tags. How many keywords can you count in the following sample?

How dumb is that? It is clear these folks have no clue about search engine marketing or how search engines work. Otherwise, why insist on old 1995 practices proven a waste of time today? I question whether they have a clue about optimizing news stories and press releases for search engines.

Unfortunately, these types of search engines are not popular, in part because they are not practical at the scale of the Web and because they require retraining of both the search engine and its users, not to mention that these search machines are not precisely user-friendly.

Think about this: in general, average users are lazy searchers. They are also too busy to do query expansion or query reduction as we do in IR, and they are not prone to consult lookup lists, thesauri, query logs, etc., to refine their searches while surfing across databases. At any given time of the year, the mentality of non-IR searchers is: “Don’t make me think.”

Thus, building a search engine that does on-demand query reductions for the Web (and that users will use without being forced to think) is not that easy.

We would like to hear from others working on similar research, as we believe we have found a promising solution, at least partially. Ours is different from the approaches given in the above two references.

The title of this post might be a bit confusing, but we couldn’t find a better choice of words. The point is that the definitions and associations of terms can be affected as events evolve in time.

Consider the key term [man].

A good example is the meaning or perception of [man] during a good or a bad economy.

[man] means something different depending on whether you ask an employed or an unemployed [man].

Although it might be a quite depressing article (especially for those with a pink slip on their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with

“Thomas Schuler is a man.”

and almost at the end says:

“A man is what he does.”

The key words/phrases of the article can be taken as a semantic state in time attached to the [man] key term.

The current issue of IR Watch – The Newsletter will be available during the day. It consists of the following sections.

Featuring Article: Data Mining Credit Cards

In this issue of the newsletter we cover Luhn’s Algorithm, also known as the Modulus 10 or Mod-10 Test. This algorithm is used for data mining and validation of credit card numbers. Credit card fraud is a topic that never goes away.
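For readers who want to try it before the newsletter arrives, here is a minimal sketch of the Mod-10 (Luhn) check:

```python
def luhn_valid(number: str) -> bool:
    """Mod-10 (Luhn) test: double every second digit from the right,
    subtract 9 from any result above 9, sum all digits, and require
    the total to be divisible by 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # -> True (a classic Luhn test number)
```

The check catches any single-digit error and most adjacent transpositions, which is why card issuers use it as a first-line validation before any database lookup.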

QA: Types of Links

What is the difference between in-links, out-links, co-citation, and co-reference?