
Coming up at the Hacker Dojo we have quite the interesting Meetup. There is going to be a debate between Jeff Pollock, Monica Anderson, and Dean Allemang about the Semantic Web. Since I know Monica personally and work on the Syntience project, I can promise that this will be a heretical experience, to say the least.

Standing room only! Here’s an excerpt from the site:

“Jeff Pollock – Mr. Pollock is the author of Semantic Web for Dummies and is a Senior Director with Oracle’s Fusion Middleware group, responsible for management of Oracle’s data integration product portfolio. Mr. Pollock was formerly an independent systems architect for the Defense Department, Vice President of Technology at Cerebra and Chief Technology Officer of Modulant, developing semantic middleware platforms and inference-driven SOA platforms from 2001 to 2006.

Monica Anderson – Ms. Anderson is an artificial intelligence researcher who has been considering the problem of implementing computer based cognition since college. In 2001 she moved from using AI techniques as a programmer to trying to advance the field of “Strong AI” as a researcher. She is the founder of Syntience Inc., which was established to manage funding for her exploration of this field. Syntience is currently exploring a novel algorithm for language independent document comparison and classification. She organizes the Bay Area AI Meetup group.

At the 2007 Foresight Vision Weekend Unconference, Monica Anderson presented on the prospect of developing artificial intuition in computer hardware. Further talks are currently planned for delving into the technical details of the project and also exploring the Philosophy and Epistemology to support the theory. For more information on her see: http://artificial-int… and http://videos.syntien… or http://artificial-int…

Dean Allemang – Dr. Allemang has a formal background, with an MSc in Mathematics from the University of Cambridge, England, and a PhD in Computer Science from The Ohio State University, USA. He was a Marshall Scholar at Trinity College, Cambridge. Dr. Allemang has taught classes in Semantic Web technologies since 2004, and has trained many users of RDF, and the Web Ontology Language OWL. He is a lecturer in the Computer Science Department of Boston University.

Dr. Allemang was also the Vice-President of Customer Applications at Synquiry Technologies, where he helped Synquiry’s customers understand how the use of semantic technologies could provide measurable benefit in their business processes. He has filed two patents on the application of graph matching algorithms to the problems of semantic information interchange. In the Technology Transfer group at Swisscom (formerly Swiss Telecom) he co-invented patented technology for high-level analysis of network switching failures. He is a co-author of the Organization Domain Modeling (ODM) method, which addresses cultural and social obstacles to semantic modeling, as well as technological ones. He currently works for Top Quadrant, recently published Semantic Web for the Working Ontologist and has the blog S is for Semantics.”

While reeling from the scoop, depressed and doing some preliminary market research, I happened upon a gem of a blog post by none other than our favorite search company, Google. Before going any further, I recommend that you read the post by Steve Baker, Software Engineer @ Google. I think he does an excellent job describing the problems Google is currently having and why they need such a powerful search quality team.

Here’s what I got from the blog post: Google, much as they would like to, cannot have fully automated quality algorithms. They need human intervention…and A LOT OF IT. The question is, why? Why does a company with all of the resources, power, and money that Google has still need to hire humans to watch over search quality? Why have they, in all of their genius, not created a program that can do this?

Because Google might be using methods which sterilize away meaning out of the gate.

Strangely enough, it may be that Google’s core engineering mindset is holding them back…

We can write a computer program to beat the very best human chess players, but we can’t write a program to identify objects in a photo or understand a sentence with anywhere near the precision of even a child.

This is an engineer speaking, for sure. But I ask you: What child do we really program? Are children precise? My son falls over every time he turns around too quickly…

The goal of a search engine is to return the best results for your search, and understanding language is crucial to returning the best results. A key part of this is our system for understanding synonyms.

We use many techniques to extract synonyms, that we’ve blogged about before. Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts.

Google does this using massive dictionary-like databases. They can only achieve this because of the sheer size and processing power of their server farms of computing devices. Not to take away from Google’s great achievements, but Syntience’s experimental systems have been running “synthetic synonyms” since our earliest versions. We have no dictionaries and no distributed supercomputers.

As a nomenclatural [sic] note, even obvious term variants like “pictures” (plural) and “picture” (singular) would be treated as different search terms by a dumb computer, so we also include these types of relationships within our umbrella of synonyms.

Here’s the way this works, super-simplified: There are separate “storage containers” for “picture”, “pictures”, “pic”, “pix”, “twitpix”, etc., all in their own neat little boxes. This separation removes the very thing Google is seeking: meaning in their data. That’s why their approach doesn’t seem to make much sense to me for this particular application.

An engineer’s job here would be to write code that, in a sense, tells the computer to create a new little box and put the new word in a list of associated words. Shouldn’t the computer be able to have some sort of continuous, flowing process that lets it break out of the little boxes and do some sort of free association? Well, the answer is “Not using Google’s methods.”
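To make the “little boxes” picture concrete, here is a minimal Python sketch. It is my own toy illustration, not Google’s actual system: the names `boxes`, `associated`, `add_document`, and `lookup` are all hypothetical, and the association list is hand-written for the example.

```python
from collections import defaultdict

# Hypothetical toy index: every distinct spelling gets its own "box".
boxes = defaultdict(set)

def add_document(doc_id, text):
    """File the document under each exact word it contains."""
    for word in text.lower().split():
        boxes[word].add(doc_id)

# Hand-maintained list of associated words (an assumption for illustration).
associated = {
    "picture": {"pictures", "pic", "pix"},
}

def lookup(term, expand=False):
    """Return doc ids for a term, optionally widened by the association list."""
    terms = {term}
    if expand:
        terms |= associated.get(term, set())
    hits = set()
    for t in terms:
        hits |= boxes.get(t, set())
    return hits

add_document(1, "a picture of my son")
add_document(2, "holiday pictures from the beach")
add_document(3, "new twitpix upload")
```

In this toy, `lookup("picture")` only opens the one box, and `lookup("picture", expand=True)` reaches the “pictures” document because someone listed that variant by hand. The “twitpix” document stays invisible until a person (or a rule) adds yet another entry to the list.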

You see, Google models the data to make it easily controllable…actually for that and for many, MANY other reasons. But by doing so, they have put themselves in an intellectually mired position. Monica Anderson does a great analysis of this in a talk on the Syntience Site called “Models vs. Patterns”.

So, simply and if you please, rhetorically:

How can computer scientists ever expect a computer to do anything novel with data when there is someone (or some rule/code) telling it precisely what to do all the time?

From our new Use Case Document (v1.0) on our speculated use of Artificial Intuition (AN) technology applied to finally and truly solving Semantic Search:

We Understand.

True “Semantic Search” is the holy grail of Web Search. When indexing web pages, the pages will be fed through an Artificial Intuition based device that produces a set of “semantic tokens”. These tokens might look like large integers; they are opaque to humans. But they specify, as a group, to any compatible AN device what the web page is ABOUT. It is a trivial matter to add those tokens to the search index side by side with the words in the document, which is what is currently stored in the index.

At query time, the same algorithm is run on the user’s query. Longer queries will now become more precise queries since they allow more context to be activated. A set of semantic tokens can now be extracted from the user’s query and matched in the index lookup process just the way words are looked up today. Even short queries can generate many relevant semantic tokens in a cascading process we could call “regeneration” – when a sufficiently specific query sentence is entered, all tokens identifying the context will be regenerated from the query. [Note: This is an expected but not yet experienced effect.]

The result will be a high precision search that returns documents that perfectly match the user’s query. There will be no false positives caused by ambiguous word meanings, and some documents returned may not even contain the words in the user’s query but they will still be spot-on ABOUT what the user wanted the results to be about. All efforts that have been called “Semantic Search” to date are still syntax based. Some, like PowerSet’s technology, use grammars. But grammars are not semantics, they are describing syntax. This use of the term “Semantic Search” is a marketing parable.
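For readers who think in code, the index plumbing described in the excerpt might be sketched roughly like this in Python. Big caveat: the excerpt does not describe how the Artificial Intuition device produces its tokens, so `semantic_tokens` below is a purely hypothetical stand-in built from a hand-made cue table (`TOPIC_CUES`), which is exactly the kind of model this post argues against. It is only there so the indexing and lookup steps can run.

```python
import hashlib
from collections import defaultdict

# Hypothetical stand-in for the AN device: hand-picked cue words per topic.
# The real token generation is NOT described in the source.
TOPIC_CUES = {
    "photography": {"picture", "pictures", "pix", "camera", "snapshot"},
    "chess": {"chess", "rook", "checkmate"},
}

def semantic_tokens(text):
    """Return opaque integer tokens for the topics the text seems to be about."""
    words = set(text.lower().split())
    tokens = set()
    for topic, cues in TOPIC_CUES.items():
        if words & cues:
            digest = hashlib.sha256(topic.encode()).digest()
            tokens.add(int.from_bytes(digest[:8], "big"))
    return tokens

# One index holding words (str keys) and semantic tokens (int keys) side by side.
index = defaultdict(set)

def index_page(doc_id, text):
    for key in set(text.lower().split()) | semantic_tokens(text):
        index[key].add(doc_id)

def search(query):
    """Run the same token step on the query and look the tokens up."""
    hits = set()
    for token in semantic_tokens(query):
        hits |= index.get(token, set())
    return hits

index_page(1, "my camera takes a great snapshot")
index_page(2, "rook sacrifice leads to checkmate")
```

With these two toy pages, `search("beach pictures")` finds page 1 even though neither “beach” nor “pictures” appears in it, because the query and the page share a token. Whether a real AN device behaves this way is, as the note in the excerpt says, expected but not yet experienced.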

The final version should be available for wide distribution soon. Email me at mgusek at syntience dot com if you would like a copy.