Microsoft’s Scott Prevost Interviewed by Eric Enge

For over a decade, Dr. Scott Prevost has worked to bring natural language processing technology to the marketplace. As a graduate student at the University of Pennsylvania, he developed theoretical models of prosody for synthetic speech, as well as technology to generate dialogue for autonomous agents. In post-doctoral research at the MIT Media Lab and FX Palo Alto Lab, he integrated gestures, facial expressions, and other interactional cues into his research, creating lifelike 3D characters with speech recognition, dialogue processing, and vision capabilities.

Dr. Prevost co-founded and served as CEO of Headpedal, Inc., a software company that specialized in creating virtual character interfaces for customer-facing applications on the web. Dr. Prevost also previously served as CEO of Animated Speech Corporation, which produces interactive, animated tutors for speech and language development. Dr. Prevost was General Manager and Director of Product at Powerset, where he focused on developing the user experience for natural language search. Powerset was acquired by the Microsoft Live Search division in August 2008, where Dr. Prevost currently holds the position of Principal Development Manager.

Interview Transcript

Eric Enge: Can you provide a quick overview about yourself and Powerset?

Scott Prevost: I have been working on natural language systems, with the goal of improving information retrieval in particular, for quite a while now. Powerset was founded on the notion that we can improve search results by having a much better understanding of the meaning of the documents and of what people intend with their search queries.

The way that we do this is to apply very deep natural language processing technologies to the documents as we are creating an index. And we also apply that to the queries at runtime so we can do a better job of actually matching meaning to meaning as opposed to just finding the keywords.

Powerset was founded in 2006 and we launched our product in May of 2008, which was initially a Wikipedia search engine. Then we were acquired by Microsoft in the summer and closed the deal on August 1, 2008.

Eric Enge: Can you talk a little bit more about the goal of better understanding a searcher’s intent and the mechanics that you use after doing that?

Scott Prevost: One of the key points that I want to make is that Powerset is not just about understanding intent in queries. That’s part of the equation for getting better search results, but once you have that, you also have to have a much better understanding of what’s in the documents as well. So, it’s not enough to know that a user is looking for a certain kind of search result, you also need to be able to match that to what’s actually in the document.

So, what we propose to do is very different from what most other search engine startups do. Most search engine startups are trying to take the existing keyword search model and add some bells and whistles to it or put a new front-end on it. What we did is completely reinvent how the index is built by applying technology that we licensed from PARC, which allows us to do very deep linguistic processing.

We essentially look at a document, break it into sentences, and then analyze each sentence using a very robust linguistic parser. We extract semantic representations from that analysis, and those semantic representations are what we store in our index.

We do similar processing on queries at runtime, and then we look to match these semantic properties, along with the keyword properties and other document properties. What this means is that we can find sentences that may have the right meaning but use slightly different words. If you type "When did earthquakes hit Tokyo" into powerset.com, you will see answers that use words like strike instead of hit. Then you will see that we are actually able to highlight dates in the captions for those answers, because we've done the linguistic analysis on the sentences, not merely matched keywords.
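The two passes described here, extracting semantic representations at index time and matching meaning to meaning at query time, can be sketched in miniature. The "parser" below is a naive subject-verb-object split, a stand-in for the deep linguistic parser Powerset licensed from PARC, and the synonym table is invented purely for illustration:

```python
# Toy sketch of index-time semantic extraction and query-time matching.
# The real system used a deep linguistic parser; this naive SVO split and
# the synonym table are illustrative assumptions only.

SYNONYMS = {"hit": {"hit", "strike", "struck"}}  # hypothetical synonym table

def parse_svo(sentence):
    """Naive SVO 'parse': assumes 'Subject verb object...' word order."""
    words = sentence.rstrip(".").split()
    if len(words) < 3:
        return None
    return (words[0].lower(), words[1].lower(), " ".join(words[2:]).lower())

def build_index(documents):
    """Index-time pass: store a semantic triple per sentence, plus the text."""
    index = []
    for doc in documents:
        for sentence in doc.split(". "):
            triple = parse_svo(sentence)
            if triple:
                index.append({"triple": triple, "text": sentence})
    return index

def match(index, query_triple):
    """Query-time pass: match meaning to meaning, expanding verb synonyms."""
    subj, verb, obj = query_triple
    verbs = SYNONYMS.get(verb, {verb})
    return [e["text"] for e in index
            if e["triple"][0] == subj
            and e["triple"][1] in verbs
            and obj in e["triple"][2]]

docs = ["Earthquakes struck Tokyo in 1923. Earthquakes hit Kobe in 1995."]
idx = build_index(docs)
print(match(idx, ("earthquakes", "hit", "tokyo")))
```

The point of the sketch is that the sentence using "struck" still matches a query using "hit", because the comparison happens over extracted meaning rather than surface keywords.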

Eric Enge: So how is this different from Latent Semantic Analysis or Latent Semantic Indexing?

Scott Prevost: We are actually doing the semantic processing upfront, and we are doing all the hard work on the backend, so that’s one big difference from all the other approaches that we’ve seen out there.

Eric Enge: So you are doing some preprocessing?

Scott Prevost: Yes. We are processing the documents as we index them. We still do some analysis at query time, but natural language technology is quite expensive in terms of the compute power that's needed. So, the degree to which we can compile all of that out into the index means we can produce a runtime that's on par with a keyword search runtime in terms of latency properties.

Eric Enge: So, when we talk about the problems with traditional search engines, one of the things that I saw you focus on was the fact that they required users to speak their language?

Scott Prevost: That’s right, yes. We’ve all gone through the process of trying to find a document by figuring out the right collection of words that will pull it up. That means you have to start thinking like the author of the document, imagining how the thing that you are looking for might have been expressed.

We generally try our query a few times before we find what we are looking for. By adding the semantic analysis, we are allowing people to be a little more natural in the way they express themselves. You don’t necessarily have to worry about the specific keyword, because we are likely to find a synonym.

You also don’t have to worry about excluding stop words, or about which words are going to be matched with which words in the matching algorithm. We just want people to be able to write a natural phrase or even a question, and then let the search engine do the hard part: figuring out what the appropriate matches are.

Eric Enge: Right. In existing search engines it can be a disadvantage to have extra words that aren’t actually necessary to the query. This is a result of using a more basic method for matching up the words in the query with words on the page.

Scott Prevost: That’s right, yes. And of course it creates some interesting issues for us, because now we are trying to change users’ behavior a little bit. They have grown very accustomed to thinking of a search engine in terms of keywords and the documents that include those keywords. So now that we are messing with that interaction model, our hope is that people’s behavior will gradually change as they start to realize the power of the system we are introducing.

One thing that we have been very careful with at Powerset is trying to maintain the old model as much as possible. So, if you just type keywords into Powerset, you will still get results that are just as good as those from Google, Live Search or Yahoo.

Eric Enge: You have talked a little bit about stop words; can you expand on that? Define what they are, how they are treated by regular search engines, and why making use of them in Powerset is important.

Scott Prevost: Stop words are words that the search engine simply disregards: prepositions, or words like “what” and “where.” It’s a very salient limitation of that implementation. Basically, the idea is that those words tend to be less important in the query, because they would match so many documents. But in reality they are the linguistic glue in the query and in language. They start to tell you how the other important words in the query link together, and by processing them linguistically, we can look for those links in the document when we are matching a query. Let’s go back to the earthquake example. I am not specifically searching for the word “did,” but that word is still part of the verb complex in that query. So the parser knows that “did” and “hit” go together. Basically, we are not matching on that specific word, but we are matching verbs that semantically match. So instead of “did hit,” we can match the word “strike.”
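The contrast between discarding stop words and treating them as glue can be sketched as follows. The word lists, the fixed question patterns, and the "parse" are all invented for this illustration; a real parser handles far more variation:

```python
# Toy contrast between stop-word removal and treating function words as
# linguistic glue. The word lists and question patterns are assumptions
# made for illustration, not the actual Powerset implementation.

STOP_WORDS = {"when", "did", "do", "does", "the", "a", "in"}

def keyword_view(query):
    """Classic keyword model: stop words are simply thrown away."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def linguistic_view(query):
    """Sketch of the alternative: 'did' is kept as part of the verb complex,
    and 'when' signals that the answer should contain a date."""
    words = query.lower().split()
    wants_date = words[0] == "when"
    if words[1] == "did":  # auxiliary + subject + verb + object...
        subj, verb, obj = words[2], words[3], " ".join(words[4:])
    else:
        subj, verb, obj = words[1], words[2], " ".join(words[3:])
    return {"triple": (subj, verb, obj),
            "answer_type": "date" if wants_date else None}

q = "When did earthquakes hit Tokyo"
print(keyword_view(q))    # the glue words are lost
print(linguistic_view(q)) # the glue tells us how the content words relate
```

In the keyword view, "when" and "did" vanish; in the linguistic view, "did" binds to "hit" as one verb complex and "when" tells the engine to surface dates in the captions.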

Eric Enge: Right. So for example, you could accidentally get something like “did not hit”?

Scott Prevost: Yes. We are not currently processing negation in the parser at a really detailed level, because it is such a tricky problem. So we may actually match sentences that get the negation wrong, but that is generally useful information for the user anyway, because it is relevant to their query even if it isn’t an exact answer.

Eric Enge: Right. So, that’s an example of something that you would be working on in the future?

Scott Prevost: Oh, absolutely. That and things like sentiment analysis are all things that we will be working on in the future. For sentiment analysis, say you want to know what positive things a particular politician said about a particular topic. You would get a different set of results than if you just asked what they said about that topic.

Right now we are basically working on sentence level linguistic matching along with other broader document properties like keywords, anchor text and using all of these things to rank our results. But as the technology improves, we’ll start to look at many more of these kinds of discourse level properties so we can really understand what the most important sentences in the document are and how they relate to each other. And as we can learn from these kinds of approaches, I think we’ll see the relevance of search results improving with time.

Eric Enge: Right. For example, if someone types in “The Office,” they probably don’t just want to search the phrase “Office.” They probably mean the TV show.

Scott Prevost: Yes. And in fact if you type that into Powerset, you will get a result that’s tabbed at the very top, for The Office television show. There is also a tab for the UK television series by that name, one for the band and one for Microsoft Office. So that’s a pretty ambiguous query, but chances are you probably meant the television show by phrasing it that way. That’s the one that comes up first.

Eric Enge: Right. So let’s get back to Latent Semantic Analysis. One of the things that you do is look at the entire set of documents, and determine relationships between words by proximity and frequency. This way you might discover that doctor and physician probably mean the same thing, or at least almost the same thing. What I am getting at here is the analysis of the corpus of documents to extract relationships.

Scott Prevost: We are not using what you are thinking about as Latent Semantic Analysis. We are actually using more of a symbolic approach to the linguistic processing. That’s the first phase of what we are doing. We look at a document and break it into sentences, and then we actually parse the sentences using technology that we’ve licensed from PARC. What this does is it allows us to create fairly complex semantic representations of the meaning of those sentences.

And it also allows us to represent ambiguity in those interpretations as well. This way we can index the most likely reading of that sentence, and the other possible readings as well. What happens then is that these things become semantic features that get thrown into the mix with keyword and other document property features that are used by our retrieval and ranking systems.

We are not retrieving results based solely on meaning matches and partial meaning matches. The system throws those into the mix, and the retrieval and ranking system is a machine-learning based algorithm. In that sense we are starting to use statistical approaches, but we start with a very symbolic representation of the meaning in the document. That is then used by a machine learning algorithm to retrieve and rank the documents.
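The hybrid arrangement described here, symbolic semantic matches turned into features that a learned ranker weighs alongside keyword features, can be sketched with a simple linear scorer. The feature names and hand-set weights are invented stand-ins for whatever the actual machine-learned model produces:

```python
# Toy sketch of hybrid ranking: symbolic semantic matches become features
# alongside keyword features. The weights here are hand-set placeholders
# standing in for a machine-learned model.

WEIGHTS = {"semantic_match": 3.0,
           "partial_semantic_match": 1.5,
           "keyword_overlap": 1.0}

def score(features):
    """Linear combination of feature values, as a stand-in for the ranker."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

candidates = [
    ("doc_a", {"semantic_match": 1, "partial_semantic_match": 0, "keyword_overlap": 0.4}),
    ("doc_b", {"semantic_match": 0, "partial_semantic_match": 1, "keyword_overlap": 0.9}),
    ("doc_c", {"semantic_match": 0, "partial_semantic_match": 0, "keyword_overlap": 1.0}),
]

# Rank by score: a full semantic match outweighs stronger keyword overlap.
ranked = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
print([doc for doc, _ in ranked])
```

The design point is that keyword evidence is never discarded; semantic evidence is simply added as extra features, so documents that only match on keywords still rank, just lower than documents whose meaning matches.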

We are not pulling the relationships based on things like frequency, we are actually uncovering the linguistic and semantic relationships through symbolic approaches.

We actually do have other projects going on within the company that are looking at more statistical approaches to these problems. But I would currently characterize that system as a hybrid.

Eric Enge: What exactly does it mean to say that it’s a symbolic approach?

Scott Prevost: It means that it’s rule-based semantic processing, as opposed to just uncovering things with machine-developed approaches. For example, we have a rule in our system that says if you kill something, it dies.

Eric Enge: What are some examples of search queries that highlight the power of this approach?

Scott Prevost: Let’s start with something like Siddhartha. The first thing you will see is the summary of Wikipedia pages that are relevant and that you can tab through. You probably were looking for Siddhartha, the founder of Buddhism, when you typed it in, but there is also a film, a novel and an American rock band by that name as well. You can just click on the tabs to see those different snippets.

In the section below that, you will see something called facts from Wikipedia, and these are some of the semantic relations that we have automatically extracted using these linguistic techniques. In the second line you will see “Siddhartha renounced the world,” and if you click on world, you will see sentences from which we extracted that fact. We extracted that from three different sentences on three different Wikipedia pages, and you will see that it’s not the case that we are using proximity in the second one.

Siddhartha is actually pretty far away from the word renounced, but linguistically they are tightly tied together. It’s just that there is another phrase intervening. So this starts to show you how we are taking data that’s in Wikipedia and starting to structure it. If you click the More link at the bottom of that section, you’ll see that there are a bunch of other relationships that we’ve pulled from.

Eric Enge: They are just a little less tightly matched.

Scott Prevost: Exactly. Now you can also get to this structured information pretty directly. So, if you type in “What did Siddhartha attain,” you will see Enlightenment and Nirvana. So, in a sense, these subject-relation-object semantic triples are great for answering questions.

So, try something like “What was banned by the FDA.” Now, if you look at the right part of the screen, you will see More. If you click that, you will see the longer list. And if you click on something like “cyclamate,” you will see the sentences from which we extracted that fact.

We are basically allowing a whole new type of interaction. I type a simple subject-relation-object question, and now I get a list of answers that are supported by the text we’ve uncovered through this linguistic analysis. And you’ll also note that we can start to make distinctions between a query like “who defeated Hulk Hogan” and “who did Hulk Hogan defeat?”

If you search “who defeated Hulk Hogan,” and you click on More you will see the whole list. And if you do the other query, “who did Hulk Hogan defeat,” you will see that the lists are different because we are actually looking for these things in the correct relationship to each other in the text. We are not just looking for the keywords “Hulk,” “Hogan,” and “defeat.”
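The behavior described in these examples, answering questions from a triple index and distinguishing the two Hulk Hogan queries, comes down to which slot of the (subject, relation, object) triple the question word fills. The facts below are invented placeholders, not the actual results shown on powerset.com:

```python
# Toy triple index illustrating question answering over
# subject-relation-object triples. All facts are invented placeholders.

TRIPLES = [
    ("siddhartha", "attain", "enlightenment"),
    ("siddhartha", "attain", "nirvana"),
    ("wrestler_a", "defeat", "hulk hogan"),
    ("hulk hogan", "defeat", "wrestler_b"),
]

def answer(subject=None, relation=None, obj=None):
    """Return the values that fill the unspecified slot of matching triples."""
    results = []
    for s, r, o in TRIPLES:
        if relation == r and subject in (None, s) and obj in (None, o):
            results.append(s if subject is None else o)
    return results

# "What did Siddhartha attain?" -> the question word fills the object slot.
print(answer(subject="siddhartha", relation="attain"))
# "Who defeated Hulk Hogan?" -> the question word fills the subject slot.
print(answer(relation="defeat", obj="hulk hogan"))
# "Who did Hulk Hogan defeat?" -> same words, opposite slots.
print(answer(subject="hulk hogan", relation="defeat"))
```

A pure keyword engine sees the same bag of words for the last two queries; the triple representation keeps the direction of the relation, so the answer sets differ.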

That’s an example of a pair of queries that would be very hard for a typical search engine to distinguish between, because the key phrases are the same and the word order is what defines the difference. So let’s pick a query for the regular search results. Let’s type in “how many nuclear reactors does Japan have?” Now, here is a query with a lot of stop words, right? But it’s a query where I think it is pretty easy to tell what the user is looking for. In the very first caption we can see that Japan has 55 reactors.

We are basically interpreting the fact that you typed in “how many” as the fact that you are looking for the particular number of nuclear reactors. This is just something that you don’t get when you use Google, Yahoo or Live Search, or any of the keyword search engines.

Let’s try “Who mocked Sarah Palin?” Now obviously, the other search engines do a pretty good job of finding relevant results for this. But what I want to show you are some of the captions in the blue link results. We get things about impersonating Palin and parodies of Palin. It’s not that we are just looking for the specific words “mocked Sarah Palin”; we find words that are semantically related, and we can highlight those right in the answers.

The hope here is that we can help users better understand when one of these blue link results is actually truly relevant to them, and we can save the clickthroughs when they are not. Another thing that we can talk about is pulling data, or pulling search results from structured data. So, if you type “GM board of directors,” we actually connect with Freebase in order to produce this result at the top.

Eric Enge: Along with the pictures of each of the members.

Scott Prevost: Right. If you type in “what movies did Heath Ledger star in,” you will get the same results as if you typed in “films with Heath Ledger,” because we are actually doing semantic analysis and you are essentially looking for the same thing whether you type in the first phrase or the second.

Eric Enge: The list of movies shown didn’t change at all. There were just some subtle changes to the results below that. Those are interesting examples. Currently you are operating this on Wikipedia?

Scott Prevost: That’s right.

Eric Enge: What was the reason why you chose Wikipedia in particular?

Scott Prevost: Well, there are few reasons. First of all, as we were developing the technology, Wikipedia was a great test bed because it covers just about every topic that there is to cover. We wanted to make it very clear that our technology was about linguistic processing, and that we didn’t have to be within a specific, very narrow semantic domain for the technology to work. Some other natural language approaches have taken that very narrow approach, and that’s not what we’ve done. So the fact that Wikipedia is so broad was very appealing to us.

The second reason is that Wikipedia is well written, so it parses pretty nicely. Although, our technology is designed so that when we can’t parse something, we still index it as keywords. There is graceful degradation into the keyword world.

The final reason is that Wikipedia is prevalent in so many search results these days. It’s almost hard to find a search query that doesn’t have a Wikipedia result in the top ten. So we know it is a very valuable set of documents to index. When it came time to define a product to launch, we had some resource constraints. It takes a lot of hardware to spin an index that has as much information as the Powerset index.

So we had to find a smaller set of documents, and then it becomes a challenge to find a small set of documents that hangs together for the user in a meaningful way. So we decided initially to restrict ourselves to Wikipedia alone, rather than having Wikipedia and a few other smaller document sets that might not fit in.

But now we are expanding the index. We’ve been continually playing around with other kinds of documents. The technology is not wedded to anything specific to Wikipedia, but Wikipedia is such a valuable set of documents on the web that so many people use.

Eric Enge: So, thinking about runtime: when someone enters a query, is there reason to believe that Powerset is more or less compute-intensive than regular search?

Scott Prevost: It’s marginally more compute-intensive at runtime, but only marginally, because we do the really compute-intensive work at index time.

Eric Enge: I assume that at that time it’s probably significantly more compute-intensive.

Scott Prevost: Actually, the only thing that’s more intensive at runtime is the fact that we are parsing the query. Once we’ve parsed the query, the actual retrieval is very similar to keyword retrieval, except that we are retrieving on semantic features as well as keyword features. But it’s a very similar apparatus.

Eric Enge: Right. But you probably have a higher level of investment to build the index, because you are doing all that preprocessing?

Scott Prevost: That’s right. We are doing very deep processing on the documents as opposed to just pulling out the words.

Eric Enge: Is there any insight you can give us as to how much more difficult it is?

Scott Prevost: It depends on the degree to which we do it. It’s a very granular system and we can adjust a lot of knobs. It can be anywhere from ten to one hundred times more expensive. I am sure we could make it a thousand times more expensive if we thought we would get the benefit from it. Our goal initially has been to improve relevance while disregarding cost in some sense. But obviously, we are pragmatic when push comes to shove. The goal was to find out which of these features are most important for improving relevance. Then as we learn more, we can simplify and skip some of the computation that’s not giving us as much bang for the buck.

Eric Enge: Are there any components of Powerset that are integrated in the Live Search at this point?

Scott Prevost: We’ve integrated a few things. We’ve integrated some of our direct answers using Freebase, some improved captions and snippets under the blue links for Wikipedia. And we’ve also done some things with related searches. And of course we are working on a much more robust integration plan, although I don’t have any plans to announce anything today. But some exciting stuff will be coming down the pipe for sure.

Eric Enge: Any closing comments?

Scott Prevost: We are excited at Powerset to have the opportunity to take this technology to scale and to integrate it into a product like Live Search. We are really thrilled because it allows us to see our dream actually come to fruition. And I think that we have a lot of exciting stuff coming down the road.