
prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract the meaning of words from Google's index. The pair demonstrate an unsupervised clustering algorithm which, according to New Scientist, can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return'."

These kinds of articles never seem to address a very basic problem: natural languages. English is full of words that trip up even humans. "Right" the direction versus "right" the judgement is a good example; in wartime, something as simple as that may have led to death. It's the elephant in the living room: a huge, important problem that nobody wants to talk about. There are alternatives, such as Lojban, which can be parsed like any computer program.

The article mentions English-Spanish translation. When one language is more ambiguous than the other (from the bit of Spanish I had in high school, I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many (ambiguous) English pages to Spanish.

Well, when I was in the army, it was a strict rule that nothing said over a network could have an ambiguous meaning. That's why army language sounds kind of weird at times: you are not supposed to be able to misunderstand anything.

Er. You didn't get the joke, so I will explain. The lore is that "repeat" is a command to the artillery to fire again on their last target, so you never ever say "repeat" on the radio, instead you say "say again".

The lore also contains an interesting anecdote about the '92 riots in LA. Apparently a group of Marines was dispatched to assist the police. Two officers were approaching a house when someone opened up on them with a shotgun. One officer shouted "cover me" -- so the Marines proceeded to lay down

That's so true. For a (somewhat related) example, yesterday I listened to an mp3 recording of radio conversations with a Moscow metro traffic controller [vokruginfo.ru] when one of the train drivers was incapacitated by a large dose of vodka.:) The driver stopped the train at the station and left it, only to be stopped by the police. Eventually, when another driver stopped on the opposite track at the same station, the traffic controller asked him to take over the first train and park it in on

Well, obviously the technology is not perfect yet. However, none of the problems you bring up is particularly insurmountable (as long as you aren't expecting the AI to be BETTER at parsing languages than people). Yes, words are ambiguous, and yes, humans can fail at parsing them, so computers probably will too. That's just a fact; we're not going to achieve perfection. Still, this could be a pretty major step forward (well, not that this is the first time something like this has been tried -- but the basic premise seems sound): by using Google, the elephant of a problem you mention can be partially mitigated. Google gives enough context around a word that, ideally, when the word to be translated is also surrounded by context, its meaning can be picked out from among the alternatives without producing an overly ambiguous translation.

English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example.

"Right" isn't really a good example of a word that might "trip even humans." A human (translator) will parse not just by word but will attempt to extract a word's meaning from the surrounding phrases, sentences or even paragraphs. The syntax of the language may also come into play. In spoken language, additional "clues" can be derived from the situation in which the word is spoken, and often

A goto in any context should always cause a compile error. In Java it's a reserved word, so you can't use it as a variable name, but it has absolutely no use in the language. Maybe it's a feature to be added in Java 6 (1.6).

"So I go left?"
"Right."
Ambiguity; I've had this happen moderately frequently while driving. Body language isn't a help, since I'm, well, looking at the road.
It's easy enough to disambiguate afterwards, but if I were driving a military vehicle in a combat situation, that could easily get one of us killed, yes.

But there's a difference between a programming-language word and a natural-language word. To a parser, printf or goto is more like a letter (or possibly a syllable) than a whole word. That's not really true either, but context doesn't work the same way: a parser expects a finite class of objects to come after printf, and while natural languages are technically finite, it's not quite the same thing, because communication doesn't have to be correct all the time to be understood.

The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation.

Every language has "ambiguity", but ambiguity can come in different flavors (phonological, morphological, syntactic, semantic, pragmatic). Some of the chief instigators of language change can be thought of as ambiguity on these levels. So firstly, it's hard to imagine the existence of a function mapping languages to "ambiguity levels".

The motivation for your comment about English versus Spanish probably comes from the fact that you know of more English homophones than Spanish ones. Indeed, most literate people think of their language in terms of written words, so your take on the matter is common.

(As a slight digression, your example of right the direction versus right as in 'correct, just' is pretty interesting. We can understand the semantic similarity between the two when we notice that most humans are right-handed. Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have additional meanings such as 'dexterous', 'just', 'well-guided' and so on, whereas the word meaning the direction 'left' also has meanings such as 'worthless', 'stupid'. (In fact, the word dextrous was borrowed through French from the Latin dexter meaning 'right, dexterous', or dextra meaning 'right hand'.) So the given example is one where, historically, a word had no ambiguity, but gained ambiguity because speakers started using it differently.)

Getting back to the main topic, more problematic about Section 7 of TFA is the implicit assertion that, at some point in the future, their techniques can be applied to create a function mapping words in one language to words in another. Anybody who has studied more than one language has seen cases where this is difficult to do at the word level. For instance, the French equivalent of English river is often given as rivière or fleuve. But rivière is only used by French speakers to mean 'river or stream that runs into another river or stream', whereas fleuve means 'river or stream that runs into the sea'. English breaks up river-like things by size: rivers are bigger than streams. So, in the strictest sense, there is no English word for fleuve, just as there is no French word for stream (unless there has been a recent borrowing I don't know about). This certainly does not imply that French people can't tell the difference between big rivers and small rivers; their lexicon just breaks things up differently.

These little problems can be remedied lexically, as I've just done. So fleuve is denotationally equivalent to river or stream that runs into the sea, although the latter is obviously much bulkier than its French equivalent. The real problem is that there are words in some languages whose meanings are not encoded at all in other languages. English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?

It's pretty well-known that Slashdotters' general policy is to tear apart every article we read, and half of those we don't. This is certainly not my intent here. Languages are complicated beasties, and everyone seems to understand that, including the writers of the article. So, we should interpret their result in Section 7 as them saying, "Well, maybe this has gotten us a baby-step closer to creating the hypothetical Perfect Natural Language Translator, but someone's gonna have to do a lot more work to see where this thing goes".

English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?

Plainly it means nothing significant. In most cases the distinction it makes in English contains no useful information. We think such words are useful, because a sentence sounds "wrong" without them, or because we know that a certain type of information is being lost wit

Well, I don't think there is a way to translate a text into another language without:

- understanding the text

- understanding both languages

- understanding the socio-cultural context of both languages

But we must consider the fact that most humans can't produce a decent translation either, even if they think they understand both languages. I've been professionally translating movies (EN->RU) and I know to what extent the scripts are riddled with linguistic traps. An average professional human translator would

My feeling is that what's often missed in the various AI language research programs is the problem of speech and reference. Any linguist working in the field will tell you that speech recognition is an incredible pain in the arse, and layering semantic recognition on top of that is doubly painful. Though my real concern is about things like irony and sarcasm. I'm glad somebody stepped up to point out that different languages break concepts and the world up in different ways. But how exactly can you

I'm sorry, but as a linguist/cognitive scientist, I have to call bullshit on any current approaches to AI when it comes to language. "Words" do not have "meanings." Current research centers on "chunking" and "lexical phrases": basically, humans do not typically process words' individual meanings, but rather spit out pre-formed, formulaic expressions. This is why children who cannot yet form a sentence on their own will say things like "Let's go." They don't see that as a fully grammatical utterance; they

First, you cannot assume that all pages returned by Google have an equal opportunity to be served. This is due to the facts that:

a) For any query, Google will only show the top 1,000 most relevant pages

b) "Most relevant" is determined by a number of factors, one of them being a probability distribution that assigns more weight (PR) to some pages than others

In English, what they are trying to do is define a word by context. For example, input a single word into Google and you will get all the various contexts in which it is used.

Then, by using some algorithm (serious academic hand-waving here), you place the meaning of the word via the context as determined by Google. Thus you effectively create the potential for a program that could distinguish between there and their, and do it across languages. It could also translate sayings like, the spir

Let's say you are just learning to ride a horse, and you want to know about positions to sit on the horse. You'd search for something like 'good riding positions'. A current search for this would return everything from Xavier's House Of Pleasures to Yokel's Horse Taming Ranch.

What this system does is refine your query for you, based on cached Google pages, using page popularity and keyword algorithms.

W3C's OWL [w3c.org] standard is a "language" for marking up information to make it more meaningful to machines. Machines can draw conclusions about what a word means from context. So even if two words are spelled the same, they may not mean the same thing, and the computer can draw that conclusion based on the context. It's all part of W3C's Semantic Web initiative. There is research dedicated to query languages for these kinds of files.

Meaning could, in principle, mean 'affective meaning' as in the emotional weight something carries. Maybe Google are also working on emotional search engines and the article poster doesn't want us getting confused with that.

Words can be identified as a specific part of speech, or a specific part of a sentence, or even just identified as words, for instance -- all of which I would consider non-semantic meanings.

In language theory/compilers, semantic meaning is information that cannot be obtained by a lexer (i.e. information that cannot be captured by regular expressions -- the non-regular components of a language).

The part that can be recognized by a lexer is still part of the meaning, which is the reason for the name.
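
The lexer/parser split above can be made concrete: a regular expression happily produces tokens, but a property like balanced nesting lies outside regular languages and needs extra state. A minimal sketch (names and token set are my own, purely illustrative):

```python
import re

# A regex-based lexer recognizes tokens: a regular language.
TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z]+|[0-9]+)")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

# Checking that parentheses nest properly is NOT a regular property:
# it needs unbounded memory (here, a counter), so it belongs to the parser.
def balanced(tokens):
    depth = 0
    for t in tokens:
        if t == "(":
            depth += 1
        elif t == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

The lexer's output is still part of the meaning, as the parent says; the point is only that some of the meaning (here, nesting) is invisible to it.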

Is this in any way related to the way that Google was able to decide all on its own that Scientology was crap, and thus bring Operation Clambake up to the top of the search results? (Until the Scientology people got pissed, anyway.)

Google is already starting to show signs of intelligence higher than some people.:)

I'm guessing that it's due to people posting (in their .sigs and whatnot) "scientology is a cult" and linking the word scientology or the phrase to the Clambake site. I don't know of a formal googlebomb, though.

While I think ideally you would endow computers with the same algorithmic usage of speech that is employed by human beings, as these researchers have shown, it is also possible to work with programs that do not 'parse' language but rather categorize it based on massive databases of language that has already been parsed by humans.

This obviously has its failings, but theoretically, you could use a sufficiently large database of common human language coupled with simple algorithms to perform operations like grammar checking.

An internet search would not be quite so useful for that, but I would really be interested in what would be possible with full digital access to the library of congress. I would imagine you could do things like automatically generate books based on existing material.

There needs to be an annual prize for the highest compression ratio using random pages from the web as the corpus. This would probably do more for real advancement of artificial intelligence than the Turing competitions.

followed by the explanation:

Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things about the space from which the sample was drawn. The smaller the sample and the more accurate the prediction, the greater the intelligence. This is also a short description of

By stricter do you mean narrower and incomplete? Do you think that taking something overly terse and compressed and explaining it simply, with examples and analogies etc., is a greater intellectual achievement?

Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things

It would be myopic to see it as such. The ability to communicate an idea is a closer description of what it is to be intelligent as captured

The smaller the sample, the larger the domain covered, and the quicker and more accurate the prediction, the greater the intelligence.

Good point. However, it is difficult to fold time into a single competitive metric, whereas compression ratio (where the initial and compressed sizes include the size of the algorithm/knowledge of the AI) is a single number.

Perhaps the way around this is to have different prizes for different time classes, varying by an exponential. You'd have, say, 3 competitions with tim
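
As a rough illustration of the proposed metric, here is how a compression ratio might be measured on a corpus sample, using zlib as a stand-in for a contestant's compressor (a real contest would also charge for the size of the compressor itself, as noted above):

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size; smaller means a better 'model' of the data."""
    return len(zlib.compress(data, 9)) / len(data)

# Repetitive, structured text compresses far better than arbitrary bytes,
# which is the intuition behind scoring predictive ability by compression.
sample = b"the quick brown fox jumps over the lazy dog " * 200
print(compression_ratio(sample))
```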

After consulting with the elephant in my living room, I have only one thing to say.
se·man·tic adj. (also se·man·ti·cal)
1. Of or relating to meaning, especially meaning in language.
2. Of, relating to, or according to the science of semantics.

This is a pretty nice approach. To quote the news article: "The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint." It shows that common terminology can be drawn out. In the end, though, this is a refined search routine for Google, IMHO. It would be good for scholarly searches perhaps, or even a dynamic thesaurus. But when using terms such as "does windows use linux", the der

First off, I am not an "AI" expert nor do I claim to be, however, this is how I see it.

Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?

One: intelligence is not awareness.

Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.

Therefore, I believe that we will have "artificial intelligence" soon; in fact, I'd bet Google may well be the first AI or "self-intelligent" engine.

However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.

Lastly, in regard to some of the other comments, it seems to me that this paper is about using the "intelligence" embedded in the language we use, which Google crawls. This repository is the single largest collection of semantic weighting; algorithms could therefore be developed that reflect this "intelligence", and so appear intelligent themselves, even though they are simply deterministic.

Intelligence does, however, imply the ability to perform self-directed learning. Without that, all you have is preprogrammed behavior, which is not intelligence. Given the ability to learn, an intelligent entity is likely to draw conclusions about its own existence ("I think therefore I am"), and will thus essentially be self-aware.

Of course, the builders of an artificially intelligent machine might restrict its ability to gather facts about itself - it wouldn't

"the ability to "see" its "body"" - Individual ants do not learn; they are very much like small robots. An ants' nest, on the other hand, can display a modicum of intelligence in the way that it forages and protects itself.

The other comment hinted at the distinction, but one of the hallmarks of intelligence is desire. In philosophic language, "intentionality". Without goals and self-directed moves toward those goals, you do not have intelligence. Note that intentionality is necessary, but not sufficient. In any case, Google, the machine, does not have self-directed goals.

You're quite correct that cowboy-loose definitions of terms make this a very difficult discussion to have. For example, when you say "self awareness," it's unlikely that you actually mean "self awareness" in the literal sense; after all, if a computer is capable of detecting when its processor is overheating (and perhaps turning on a fan in response), it is basically "self aware," though we wouldn't confuse that with intelligence.

Rather, I think by "self awareness" here you mean, possessing narrativity; that i

I'm so sorry, but I find I can't agree with your statement that a mechanical system is "capable of detecting when its processor is overheating". A system is not "aware" in the sense of being "conscious" since obviously machines are not capable, at this point, of being conscious. Indeed, you seem to be making exactly the mistake I was attempting to describe. That is, you've mistaken the intelligence of the created with the intelligence of the creator. The programmer was aware that at a certain temperature le

A slug is not conscious. Nothing without language is. Recommended reading: Dr. Daniel C. Dennett, Consciousness Explained and Darwin's Dangerous Idea. Richard Dawkins, The Extended Phenotype. Julian Jaynes, The Origin of Consciousness in the Breakdown of the Bicameral Mind.

Those are all more commercial works, well within the grasp of even people who've done no work in the field. For more scholarly and technical references, check their bibliographies, especially in Dennett.

"Sheesh" is a word which normally means, "I'm not very good at actually saying what I mean, so I'll just make strange noises and roll my eyes at someone who won't figure it out for me." (It's also the nick of one of my favorite Internet trolls; what ever happened to the good old days when trolls actually tried to be entertaining, instead of merely annoying?)

Anyway, okay, it's loose definitions of words that are once again getting us into trouble here. That a slug is aware of its environment, as in, capable

Even though I disagree with Google's hiring practices (i.e. preferring H-1Bs when many American engineers are unemployed), I must admit that Google's search algorithm is the best one -- even better than Yahoo! Search, which I use regularly for socio-political reasons.

I will give you an example. If you search news (i.e., either Google News [google.com] or Yahoo! News [yahoo.com]) for stories about the recent federal action (by Washington) involving Chinese companies and Iranian weapons improved by Chinese technology, you will discover that one of the popular news articles about this topic comes from the "New York Times". Several other newspapers redistributed the Times article, written by David Sanger (spelling?).

I read that article, but I also read articles from less popular Web news sites: e.g. the "Taipei Times". The "Taipei Times" article does mention that a Taiwanese company was also implicated in the sale of weapons technology to Taiwan. Yet the "New York Times" article made no mention of this fact.

Is the "Taipei Times" telling the truth? It claims that Ecoma Enterprise Company, a Taiwanese company, was one of the culprits.

Guess how long it took me on Google to find this information? Five minutes. I kid you not. Even though I hate Google's employment practices, I am quite impressed with their technology.

Using Yahoo! Search, I was not able to locate the desired information.

Apparently, Google has an algorithm that, although unsupervised (i.e. without the kind of human interaction that corrupts Yahoo! Search), captures the notion of what the typical person wants to find. The Google algorithm -- dare I say "it" -- is on the verge of acquiring human sentience. THAT is, indeed, impressive.

Pray to Buddha that the middle name of the CEO is not "666" or Beelzebub. Just kidding.

Although very clever, NGD (Normalized Google Distance) misses all higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.

More interesting are analyses of n-tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities and are not decomposable into pairwise relationships.

Another limit is that Google's estimates of the number of hits are atrocious. The actual number of hits is only a fraction (about 60%?) of the estimate, in my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as they assume.
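
For reference, the distance in question, Cilibrasi and Vitanyi's Normalized Google Distance, is computed directly from page counts: NGD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}). A minimal sketch (the hit counts below are made up for illustration, not real Google figures):

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from raw page counts.

    fx, fy: hits for each term alone; fxy: hits for the two terms together;
    n: total number of pages indexed. 0 means the terms always co-occur;
    larger values mean the terms are less related.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical hit counts, for illustration only:
print(ngd(fx=8_000_000, fy=5_000_000, fxy=2_000_000, n=8_000_000_000))
```

Note that the hit-count estimation issue raised above feeds straight into fx, fy and fxy here, which is exactly why a biased estimator would distort the distance.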

You are right that Google may be performing estimation, and this could affect results; I don't really know what sort of rounding they do at this time. Perhaps more will become apparent. But your other assertion, that there are no higher-order statistics, is incorrect; see the earlier Clustering by Compression paper for more info. Quickly, the reason is as follows:

I use NGD to convert arbitrarily large lists of search terms into feature vectors of arbitrary dimension. The only limit to this is the max query len

Thank you for the reply. I'm glad your work generalizes to longer search-term lists. Like so many other /. readers, I did not take the time to read your preprint before posting.

I've often wondered if one can use simple pairwise distance estimates to reconstruct a polytope or distorted simplex for a set of items within a multidimensional space. In theory, an N-object system with non-zero pairwise distances requires (N-1) dimensions. But in practice, many real systems don't fill the space -- being
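
The reconstruction described here is essentially classical multidimensional scaling, a standard textbook technique (not from TFA); a sketch:

```python
import numpy as np

def classical_mds(dist, dims):
    """Recover point coordinates (up to rotation/reflection) from a
    matrix of pairwise distances via classical multidimensional scaling."""
    n = dist.shape[0]
    d2 = dist ** 2
    j = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    b = -0.5 * j @ d2 @ j                   # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]   # keep the largest ones
    scale = np.sqrt(np.maximum(vals[order], 0.0))
    return vecs[:, order] * scale

# Three collinear points: although N - 1 = 2 dimensions would suffice for
# any 3-object system, this configuration collapses onto a single axis --
# the "systems that don't fill the space" case.
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
coords = classical_mds(dist, dims=2)
```

The number of significantly non-zero eigenvalues of the Gram matrix is then a direct estimate of how many dimensions the system actually fills.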

I wrote a program that gathered, analyzed and used word-pair frequency data (various situational pairings). It needs more raw data, but shows a lot of promise. I opted not to use literature, as that often has archaic and purposefully awful word usage. Some of the issues involved: case (like Fall vs. fall -- I chose to ignore case) and grammatical structure (it needs to integrate with a grammar checker). Coupling this with a thesaurus is my eventual goal; that leads to some obvious difficulties, though it has potential rewards. I had considered Google, and have run a few tests using it, but that solution was too simple, and not quite as powerful in the long run. Just had to share; sorry to waste your time.
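
The core of such a word-pair counter can be sketched in a few lines (a toy version; note that ignoring case, as described, deliberately merges homographs like Fall/fall, which is exactly the trade-off mentioned):

```python
import re
from collections import Counter

def word_pair_counts(text):
    """Count adjacent word pairs, ignoring case (so 'Fall' and 'fall' merge)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

pairs = word_pair_counts("The cat sat. The cat ran. The dog sat.")
print(pairs.most_common(2))
```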

I accept your apology for relating relevant information about the subject matter of the article.

For future reference, to avoid this, it helps not to read the article. If you must read it, you can always pick out a short phrase and take it out of context. If you are absolutely at a loss on how to comment on a story without presenting useful/interesting information, generally you can get away with "FRIsT POST!!!" or one of the popular Slashdot memes.

Why not use Wikipedia? The database is downloadable in its entirety, quite large, and contains plenty of great information about topics from advanced mathematics to pop culture; all in quite down-to-earth normal language written and refined by normal people. I think Wikipedia ought to be a tremendously great resource for computer learning research.

Actually that is the intent; I just need to parse it. I've got the curr database on my computer and it is partially parsed. I've figured out most of the relevant structure (what are discussions vs. articles, what separates articles, wikicode); I just need to sit down and parse all the wikicode and HTML. Preprocessing aside, I've estimated a couple of CPU-weeks for Wikipedia (~1.7 GB).
One thing that was interesting about wikipedia was that I counted all words in it, and looked at the ones that only occur

My company develops a data mining program for OS X (theConcept [mesadynamics.com]) that uses Google (or other search engines) to provide links to data for mining.

For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.

Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.

But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.

If you want a rough model of popular ideas then perhaps Google and the web en masse is useful (it is for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.
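
The keyphrase pass described above can be approximated with a stopword-filtered frequency count over the fetched pages (a toy sketch of the general idea only; theConcept's actual algorithms are not public as far as I know, so none of this is their code):

```python
import re
from collections import Counter

# A tiny stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "on", "for", "with", "as", "at", "by"}

def significant_keywords(pages, top=5):
    """Most frequent non-stopword terms across a set of fetched pages."""
    counts = Counter()
    for page in pages:
        counts.update(w for w in re.findall(r"[a-z]+", page.lower())
                      if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top)]

pages = ["Tom Cruise starred in Top Gun",
         "Nicole Kidman and Tom Cruise",
         "Tom Cruise filmography"]
print(significant_keywords(pages, top=3))
```

This also makes the bias problem visible: whatever a person is famous for dominates the counts, drowning out rarer but relevant terms.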

I'm glad to see you are interested in our work. I applaud parallel and different efforts like your own system; however, I think you are making at least one misleading and factually false assumption that I would like to correct. By coincidence, I have already done an unpublished experiment that involved Tom Cruise. Contrary to your assertion that it's impossible to get useful data, in fact I have already gotten the data that

They are developing an open source tool http://complearn.sourceforge.net/ [sourceforge.net] that will hopefully integrate the algorithm they describe. Right now it only supports one of their previous algorithms. More about this at the above link and in Section 5 of the paper.

Here is a more extensive exposition on current work on relationships, esp. as they can be supported in the Semantic Web context:
http://www.cs.uic.edu/~ifc/SWDB/keynote-as.html
http://lsdis.cs.uga.edu/lib/download/SAK02-TM.pdf

I checked the supplied links. The point of the Cilibrasi-Vitanyi method is not that it uses Google page counts (like the supplied links, many different approaches have done so). The point is that a particular distance (formula), based on an extensive mathematical theory and a sequence of research papers spanning over a decade, has been developed. Experimental testing shows that it works consistently, in different settings, in a relatively precise way. For example, in a massive randomized experiment the mean agreement