Posted
by
samzenpus
on Thursday September 01, 2005 @12:01AM
from the universal-translator dept.

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."

This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just rediculous, for obvious reasons:).

Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents' speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue."

In addition to child-directed language, the algorithm has been tested on the full text of the Bible in several languages

I hardly would consider transcripts of parents' speech directed at 2- or 3-year-olds "intricate". And while the algorithm may have "been tested on the full text of the Bible", it doesn't say with what percentage of accuracy or what translation. King James version or the Teen Magazine Bible?

And the "rules" of a language are NOT what children "learn". First of all, children acquire a language, they do not "learn" it. That is a large attribute to the child's ability to speak it--not whether or not they understand gerunds and the pluperfect.

Second, in a language such as English whose words for the most part lack any necessity to the order in which they're placed to understand they're meaning and, even worse, lack declension forms to distinguish subject from object of the preposition, with what success can a language recognition program have "learning" such a language when prepositions themselves mainly can be omitted? To teach a computer Latin is easy.

Third, what's the hope of the computer ever understanding something like Shakespeare, Joyce, or Dante, whose uses of language rely extensively on erudition for word placement as opposed to typical usage? While a computer might be able to learn Latin because of its rigourous rules, I doubt it could faithfully render a text from Ovid.

Perhaps a linguist could weigh in on this, but it seems to me that this kind of research is quite contrary to the Chomskian view of linguistics.

Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an an unsupervised learning system has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.

This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just rediculous, for obvious reasons:).

It's going to come from large bodies of text that exist in mmultiple langueages. Things like the Bible, the constitution, etcetera. The whole point of this technology is that by drawing conclusions from those texts, the program infers the underlying rules of the language and can therefore translate other things. Google was doing something similar.
An online dictionary is completely different. First, it has to be compiled by someone. Second, it only helps for translating words verbatim. This technology would self-teach itself to translate languages, even if none of the researchers working on the project could even speak those languages themselves. That's the beauty of it.

Linguistics has nothing to do with prescriptive grammar, except perhaps studying what influence it has on language. Something like "don't split infinitives" is not a rule in linguistics. Something like "size descriptors come before color descriptors in English" is a rule, because it's how people actually speak. Incidentally, most people are not even aware of these rules in their native language, despite obviously having mastery over them.

If there were no rules, I could write a post using random letters for random sounds in a random order, or just using a bunch of non-letters. That wouldn't convey anything. Saying "I'm writing on slashdot" is more effective than writing "(*&$@(&^$)(#*$&"

I took a linguistics class this previous year with a professor who absolutely disagreed with the Chomskyan view of linguistics (though she did acknowledge that he had contributed a great deal to the field). Some of the arguments against Chomsky include objections to the Chomskyan view of "universal grammar"--that essentially a series of nerual "switches" determine what language a person knows and that these in turn are purely grammatical in nature (the lexicon of different languages qualifying as "superficial"--in and of itself a somewhat tenable argument). While this holds reasonably well for English and closely related languages (English grammar in particular depends a tremendous amount upon word order and syntax, and thus lends itself well to this sort of computational model), in many languages the lines between nominally "superficial" categories--e.g. phonology, lexicon and syntax--become blurred, especially in, for instance, case languages. Whereas you can break down the grammatical elements of an English sentence fairly easily into "verb phrases" "noun phrases" and so on, this is largely because of English syntactical conventions. When a system of prefixes and suffixes can turn a base morpheme from a noun phrase to a verb phrase or any of various parts of speech, the kind of categories to which English morphemes and phrases lend themselves become much harder to apply. Add to this the fact that there exist languages (e.g. Chinese) in which grammatically superficial categories (in English) like phonology become syntactically and grammatically significant, and the sheer variety of lingiustic grammars either seriously undermines the theory in general or forces upon one the Socratic assumption that everyone knows every language and every possible grammar from birth and simply need to be exposed to the rules of whatever their native language is and to pickup superficialities like lexicon to become a fluent speaker. It's not all complete nonsense, but if it were truly correct then presumably computerized translation software (with the aid of large dictionary files for lexicons) would have been perfected some time ago).

Sorry about the rant, but like I said, my prof did *not* like the Chomskyan view of linguistics.

Oh, and as far as the notion of the "language module" goes, it might be premature to call it a module, but there *is* neurophysiological evidence to suggest that humans are physically predisposed towards learning language from birth, so that much at the very least is tenable.

Yes, pattern recognition is a major part of the process. However, there are other fundamental parts that are also extremely important, and lacking them you get nonsense. In particular, context matters. "aitakatta" in the middle of a business letter probably does mean "wanted to meet". By itself, said by one member of a couple to the other over drinks at a bar, it does not.

In order for a program to translating to translate accurately, it needs to know who is speaking/writing, who is the audience, what their relationship is, and their location. Some of this may be given to the computer explicitly, or easily found in the text/speech (for a human at least) but some of it may not. This is not going to be an easy problem to solve.

Writing is never free from its context. I know before I even start whether I am reading a fiction novel, a satire, a scientific journal, an email from my boss, or a text message from my date this Saturday. The meaning of the words can change a lot in those cases.

Even Google translator, which was trained on multi-lingual UN reports, could not produce comprehensible English from simple Japanese business emails.

Called Pragmatics. It can be somewhat oversimplified as saying it's the study of how context affects meaning or as figuring out what we really mean, as opposed to what we say.

For example, a classical Pragmatics scenario:

John is interested in a co worker Anna, but is shy and doesn't want to ask her out if she's taken. He asks his friend Dave if he knows if Anna is available to which Dave replies "Anna has two kids."

Now, taken literally, Dave did not answer John's question. What he literally said is that Anna has at least two children, and presumably exactly two children. That says nothing of her avalibility for dating. However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.

So that's a major problem computers hit when trying to really understand natural language. You can write a set of rules that comletely describes all the syntax and grammar. However that doesn't do it, that doesn't get you to meaning, because meaning occurs at a higher level than that. Even when we are speaking literally and directly, there's still a whole lot of context that comes in to play. Since we are quite often at least speaking partially indirectly, it gets to be a real mess.

Your example is a great one of just how bad it gets between languages. The literal meaning in Japanese was not the same as the intended meaning. So first you need to decode that, however even if you know that, a literal translation of the intended meaning may not come out right in another language. To really translate well you need to be able to decode the intended meaning of a literal phrase, translate that into an approprate meaning in the other language, and then encode that in a phrase that conveys that intended meaning accurately, and in the appropriate way.

Yeah and this didn't learn the language in any meaningful sense. It just found a statistical pattern, and then generates possible sentences from that pattern. That's a whole lot different to you and I understanding the language and generating intentional, meaningful sentences.

I'd say this is the first step to it though. Lets forget about natural language for a second and look at computer algebra systems, proof generators, etc. How is the inference that you talk about any different than a computerized proof system proving something based on bits of information it has stored away? I think it's pretty similar really, except for the part about knowing what thing you want to prove/confirm.

So how does that sort of thing work? Well, in mathematics you can have something like y=f(x) and substitute f(x) whenever you see y or vice versa. You also know various other rules that are of the same form, e.g. a(b+c) = ab+ac. Then, you can brute-force trying different combinations (or be smart about it and modularize some set of translations to create a new compound rule which is true, e.g. a lemma).

It may not be so easy in languages, but there are transformations you can apply to sentences. For instance, you can do some rearrangements like:

A is under B Under B, there is A.

And there are ways that these relations (spatial relations especially) distribute:

A is in B, B is under C -> A is under C.

So to understand 'Anna has two kids' you have to know: 1. That you want to evaluate the truth/falseness of 'is Anna available to go out' and 2. Various pieces of social information about 'going out', people who are married, people who have kids, etc.

If you have 2 you should be able to use a method in the same vein as a computer algebra system to determine how what was just said applies to your question.

If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches,...

No you don't:-)

If you want to find all singleton matches, it's enough to sort the string into ascending order (order n.log(n)), and then scan through for adjacent matches (order n). For example, sorting "the cat sat on the mat" gives "cat mat on sat the the"—where the two "the"s are now adjacent and so easily discovered.

For finding longer matches the sorting method still works, except that you sort fragments of the sentence rather than individual words. Clearly there is more work involved, but (depending on exactly what you're counting) there are still order n.log(n) comparisons to be performed.

This means that searching for substring matches can be performed relatively efficiently. I don't know about how the language-learning algorithm works, but you may be interested to know that the compression algorithm used by "bzip2" works in exactly this way (google for "Burrows-Wheeler transform" for more details!)

However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.

Funny, I read that answer as a "yes, she's available," but added additional information: don't ask her out unless you are willing to accept the entire package.

In a different language, I could still see a literal translation of the question and answer as communicating the same information. The "higher level meaning" is not embedded in the words or language. The exchange, "available?" "kids." does not mean "not available," but is more of a trinary response.