Tomi Poutanen talks about social media and the problems with algorithmic search

Tomi J Poutanen is the Senior Director of Product Management, Yahoo! Social Search. Tomi Poutanen directs the product management team overseeing Yahoo’s social search products Yahoo! Answers, del.icio.us and Yahoo! Bookmarks. He has worked at Yahoo! since the company acquired Inktomi, a Web Search technology provider, in March 2003.

While at Inktomi, Poutanen was the founder and general manager of the paid inclusion group. He also oversaw Inktomi’s Web Search partnerships with AOL, MSN, Lycos and others.

Poutanen started his commercial career when he founded Data Compression Technologies Inc while a computer engineering student at the University of Waterloo in Ontario. The company specialised in high-end data compression utilities for software distribution. In 1996, Microsoft acquired exclusive rights to the company’s patented technology, which today forms the basis of the CABARC compression utility used in all Microsoft software distribution.

Poutanen holds an MBA and MASc in Computer Engineering from the University of Toronto.

Interview Transcript

Eric Enge: Can you talk a little bit about what the problems are with algorithmic search?

Tomi Poutanen: I would say that there are three challenges with algorithmic search. Number one is the very size of the web that the engines are searching. Currently, most search engines have over ten billion documents in them, and a typical web search may return millions of web pages. But ultimately, the person reading those results only looks at the top ten. And whether you are getting ten million results back or one million results back, you are really no wiser between those two. So, I think the first challenge for web search is just the human ability to absorb all those millions of results that are returned for a typical query.

The second challenge for algorithmic search is what we call subjective queries. Algorithmic search is very good with navigational queries; if users are trying to find a website, it typically nails it. But when you are asking a web search to provide you results that may be subjective in nature, for example, hotel recommendations in a city or cool lamps, web search engines, computers and algorithms are not really able to create a notion of subjective results that a human would define as interesting and appealing.

I think the third challenge with algorithmic search is the fact that it operates in a very lucrative industry, where referrals from search engines are highly valued, and the commercial benefits from those referrals drive a lot of abuse of their systems, so there is a lot of spam, and that reduces the quality of the results.

Tomi Poutanen: With an algorithm, you can always write your spam to tailor it to any particular algorithm.

Eric Enge: Right. So, in the case of the spam, why would social search be better?

Tomi Poutanen: Well, spam is a very significant issue that impacts both algorithmic search and, to an increasing extent, social search. If you define spam specifically as a set of activities intended to drive traffic to sites, where those sites then monetize that traffic, then there certainly is spam that shows up on sites like del.icio.us, a social bookmark site, and on sites like Yahoo! Answers, where participants are looking to drive users onto their sites. Now, there are different ways that social search properties can mitigate spamming of our sites.

I think fundamentally, it comes down to the return on investment in spamming an algorithmic engine versus spamming a social search property. So for an algorithmic engine, it is pretty easy to go buy thousands of domain names for a few dollars a pop and create millions of web pages, right. And so, you can create millions of items of spam very, very cheaply in an algorithmic environment. However, in a social property like del.icio.us or within Yahoo! Answers, there is a more significant cost in creating that spam. At Yahoo! Answers specifically, we limit the number of daily submissions from any user to what would be humanly possible. And we also have a large community of users who report abuse. They report abuse much more frequently than you might see spam being reported on an algorithmic search engine. Fundamentally, most spammers find out quickly that it is more costly to produce spam in social media properties, and the community is more likely to single out spam, which then gets pretty promptly removed from our properties. So, I think the economics are very different, which makes it easier to manage long term.
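As a rough illustration of the per-user daily limit described above, here is a minimal Python sketch. It is not Yahoo!'s implementation: the class name `SubmissionLimiter`, the method `try_submit`, the cap values, and the example date are all hypothetical, and the interview does not state the actual limit.

```python
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 20  # hypothetical cap; the interview gives no actual number


class SubmissionLimiter:
    """Counts each user's submissions per day and rejects any beyond the cap."""

    def __init__(self, daily_limit=DAILY_LIMIT):
        self.daily_limit = daily_limit
        self._counts = defaultdict(int)  # (user_id, day) -> submissions so far

    def try_submit(self, user_id, day):
        """Return True and record the submission if the user is under the cap."""
        key = (user_id, day)
        if self._counts[key] >= self.daily_limit:
            return False
        self._counts[key] += 1
        return True


# A user who tries five submissions in one day against a cap of three.
limiter = SubmissionLimiter(daily_limit=3)
today = date(2007, 4, 1)  # arbitrary example date
results = [limiter.try_submit("spammer", today) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Because the counter is keyed by user and day, the only way around the cap is to create more accounts, which is the extra per-item cost that the economics argument relies on.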

Eric Enge: Basically, one of the big differences is the economics.

Tomi Poutanen: Exactly.

Eric Enge: Alright. Now, how does social search help with the other issues that algorithmic engines face, such as the large index size and the subjective query problems?

Tomi Poutanen: A typical web search today is only 2.3 words, and that has actually remained somewhat constant for the last few years. Search has actually trained users. When they see a search box, people now start to think about the keywords they have to type to find the result they are looking for, and that largely drives their behavior, and they expect the search engine to behave in a certain way. But when you are only typing 2.3 words on average, that doesn’t provide a lot of detail, and as a result you get a lot of results back. On Yahoo! Answers, on the other hand, the typical question is thirteen words long, and there is an extra element of detail that one can provide. So that extra level of specificity, in addition to the user being able to include considerable added detail that provides context to their question, is what really gets to the very question you are looking to have answered, and the community will respond in a way that is much more specific to you, given the additional detail that you provided.

The other way that social search helps navigate the data is that there is implicit networking in Flickr, on del.icio.us, and in Answers. You can specify users of those services whom you trust and whose content you value, and you will be more exposed to those users’ submissions. Those users will also directly be able to see your questions, your feedback, and your answers, which will help you create a more personal experience, and make that application more specific to you and perform better for you. Because you are providing more specificity in how you pose questions or how you interact with the service, and because you are directly connecting to your network and have a personal connection to those individuals, you should get better quality, whereas searching the web is really faceless and can be impersonal.

Eric Enge: There is an interesting information processing problem here, which is that with the huge amount of information out there, you have to figure out what information you trust.

Tomi Poutanen: Correct.

Eric Enge: And, as you pointed out, once you have identified a person you trust as a resource, or a source of information, then the chances are that you are better off than in a more random environment.

Tomi Poutanen: Right. You are really hitting the very core of the issue, which is enabling individuals to create an identity and a reputation on these sites. Then services like Yahoo! Answers surface that person’s reputation. In the exchange of Q&A, it is very important that you know who is providing you an answer and that you can trust that source. So we have a number of ways in which we try to communicate the level of trust or the reputation of the source, and that is a very meaningful way of being able to consume the data and make sense of all the responses you get back to your questions. A typical question today gets around seven answers, and it is great to have those different perspectives being presented by users, but that also means that as a user you have to decide which advice or piece of knowledge you will follow.

Eric Enge: Right.

Tomi Poutanen: So, you can build your personal network, and we also provide information on the user: how long they have been part of Answers, what level they are, and basically their collective contribution to the site over time. We have badges for the people who are top answerers in a particular category, so that really signifies an area of knowledge. And we also provide more esoteric signals, such as an individual’s best answer percentage, which basically measures how often the person who asked a question, or the Answers community, has picked their answer as the best answer.

Eric Enge: Right.

Tomi Poutanen: And typically, if you have a high rate of being chosen as the best answerer, then that also signals a certain quality about you. So, those are the different ways that we surface a person’s reputation, and certainly that is an ongoing area of research.
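The best answer percentage mentioned above is simple arithmetic, and a short Python sketch makes it concrete. The function name, the user names, and the counts below are hypothetical, and the formula is only the plain ratio implied by the description, not a confirmed Yahoo! Answers calculation.

```python
def best_answer_percentage(best_answers, total_answers):
    """Share of a user's answers that were picked as best, as a percentage."""
    if total_answers == 0:
        return 0.0  # no answers yet, so no track record to report
    return 100.0 * best_answers / total_answers


# Hypothetical users: (answers picked as best, total answers given)
users = {"alice": (30, 120), "bob": (5, 10), "carol": (0, 0)}

# Rank users from highest to lowest best answer percentage.
ranked = sorted(users, key=lambda u: best_answer_percentage(*users[u]), reverse=True)
print(ranked)  # ['bob', 'alice', 'carol']
```

In this made-up data, bob outranks alice despite having far fewer answers, which suggests why a signal like this is probably best read alongside the others mentioned, such as level and time on the site.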

Eric Enge: Right. Helping people build their own trust networks seems like a key activity.

Tomi Poutanen: Correct.

Eric Enge: So are there situations where social search is better than algorithmic search, and others where algorithmic search is better?

Tomi Poutanen: Yes, the two are different and each has a purpose. At times they complement each other, and sometimes they replace each other. So I would say that in the area of navigation, in the area of deep research, etc., web search is definitely the fastest way to find what you’re looking for. And I think in the area of subjective searches, or very specific questions that you want answered that rely on personal advice or opinion, that is where social search properties are better. And I will give you a couple of examples. If you are looking for a hotel in New York City, you can type a query into del.icio.us, and you will get everything tagged “hotels” and “NYC hotel”, and you are able to see which ones have been tagged the most frequently. That certainly is a very interesting view into actual hotels and how the community perceives them, basically providing a popularity measure for those hotels. Contrast that with web search, where the results would be highly commercial in nature, largely linking to hotel reservation sites, not the actual hotels. So, that is one area that is very interesting. Take another area, the area of advice, where Answers really plays strongly: when you have a very, very specific question, where there might be people who have been in a similar situation and have that experiential knowledge, again that is where Answers is the best form for getting your question answered. For example, I was searching for a nanny contract. I did a web search on that, and got a bunch of nanny agencies that want to sell their services to you. I asked that question on Answers and somebody said, “hey, you are looking for a contract, and this is the contract I used”, and they provided it as an answer to my question. That direct connection to someone who had the knowledge was what helped me solve my task.
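The del.icio.us hotel example amounts to counting how many users saved each URL under the relevant tags. The Python sketch below shows that idea under stated assumptions: the bookmark records, the URLs, the tag spellings, and the function `popularity` are all invented for illustration and do not reflect the real del.icio.us data model or API.

```python
from collections import Counter

# Hypothetical bookmark records: one (url, tags) pair per user who saved the URL.
bookmarks = [
    ("hotel-a.example", {"hotels", "nyc"}),
    ("hotel-a.example", {"hotels", "nyc", "travel"}),
    ("hotel-a.example", {"nyc", "hotels"}),
    ("hotel-b.example", {"hotels", "nyc"}),
    ("hotel-c.example", {"hotels", "paris"}),
]


def popularity(records, required_tags):
    """Count saves per URL among records carrying every required tag."""
    counts = Counter(url for url, tags in records if required_tags <= tags)
    return counts.most_common()  # most frequently tagged first


print(popularity(bookmarks, {"hotels", "nyc"}))
# [('hotel-a.example', 3), ('hotel-b.example', 1)]
```

Each count stands for a separate person who chose to save the page, which is what makes the ranking a view of community perception rather than of link structure.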

Eric Enge: Right. Human interpretation provides great benefits to resolving a search query, and you mentioned subjective queries, but there are other kinds of queries that are not really subjective queries, but still get handled better by humans.

Tomi Poutanen: Right. It is an advice query. You are looking for somebody’s experience and knowledge.

Eric Enge: I think it is really intriguing that tagging is back. Why is it going to succeed this time, when it failed last time?

Tomi Poutanen: It succeeds on Flickr and del.icio.us for two reasons. On Flickr, there is not any text to go with the images, so people associate tags with images as a way of organizing those images for their personal recall. And on del.icio.us, the act of tagging is almost entirely motivated by personal recall: you want to type in a certain number of tags that will help you find that record in the future. You might associate a tag like “to read”, or a tag like “Amazon”, where an Amazon tag signifies that I want to buy this book on Amazon. So it is really very specifically to help your personal recall process. I think that is fundamentally why tagging will work in this sense: the primary motivation is personal, yet the collective activity of millions of people tagging content is what will build a faster production of tags in such a way that it is more generally consumable.

Eric Enge: Right. Then, it really gets back to the business of having trust networks that help you be more comfortable with the kind of input you are getting, right?

Tomi Poutanen: That is correct.

Eric Enge: Are there specific measures you take to combat spam?

Tomi Poutanen: Yes, there are a lot of resources that we put towards spam. For example, Yahoo! Answers is a communication vehicle enabling people to share knowledge in a more constructive way than other vehicles. Forums have failed to do more because they have been unable to organize all that knowledge and that communication in a way that is constructive over time, and because they have been overwhelmed by spam and abuse. That is an issue we are addressing every day, and putting a lot of resources towards. The long term answer in combating spam is to empower our community to be able to manage the abuse, and then for our systems to be able to drive attention to high quality content and reduce attention to poor quality content.

On Flickr we have “interestingness”, which really drives people’s attention to the highest quality images. The way it determines what is high quality is a very complicated set of functions, but at their heart are Flickr’s trust in a user and how well an image has been received by the community. The signals that we look for in determining the quality of user generated content are very different from the signals that a web search engine will look for in determining what is a high quality web page.

Eric Enge: Right. So how big a problem is spam for social media sites?

Tomi Poutanen: I would say it is a significant issue for social media sites in general, but it is manageable. We are putting a lot of resources towards it and building systems that will enable our community to police itself. In the long term, we need to have tools in place that absolutely make it an unappealing place for spammers to spend their time, and that better enable our community to self-police.

Eric Enge: It seems to me that one of the harder issues to deal with is long tail terms that do not have so many eyeballs on them. Of course, there is less return for a spammer in winning on those.

Tomi Poutanen: Well, right. What it comes back to, again, is that a spammer spamming a web search engine can create a million pieces of Long Tail content very cheaply. But they cannot do the same in these social search properties. So, the way to combat spam with regard to the Long Tail problem is to make it uneconomical for a spammer to target those terms, and to get the community that cares about that deep content to self-moderate it, without Yahoo! ever needing to look at it.

Eric Enge: What about the idea of creating some interesting marriage of these things, algorithmic and social search, is there a future for something like that?

Tomi Poutanen: Absolutely. We already do that today. For subjective queries like “best car,” Yahoo! Search will return algorithmic results, and below the top ten web results, we will show Answers results labeled “Shared by Yahoo!s.” We provide the questions and best answers that match those criteria. That is a first step, but there are a lot of directions in which we can take this, and there are a number of implementations around the world, where we have solutions that target the specific markets they are in. In Korea, for example, Answers-like services have been in operation since 2002, and the search experience there actually federates content from various sources. Your search experience will include web results, blog results, Answers results, image results, etc. So, it is a very different user experience, and it is well suited to that market, where the typical broadband rates are 10MB/sec. In the US we have a lighter implementation of that today.

Eric Enge: It will be interesting to see how this all evolves.

Tomi Poutanen: Certainly a lot of innovation for us.

Eric Enge: Yes, I have maintained for a long time that having human review over algorithmic search is the best way to go, but there are of course implementation challenges, because of the need to make money in the whole process.

Tomi Poutanen: There is a lot of science that goes into it. Being able to target high quality content that is produced by people and blending that with algorithmic results is a very tough scientific endeavor. Yahoo! is investing a lot of resources to foster these properties and to make that vision happen.

Eric Enge: Right. What about problems with ambiguous language? It seems to me that is also something that social search would be more able to handle: words like jaguar and orange, or surfing, words that have double meanings.

Tomi Poutanen: That is where tagging helps. Tags are ultimately created by human beings and that provides human interpretation of the different words. Flickr does a really good job in clustering images based on different content types, so for a query like jaguar, you will have everything from the animal to the car to the operating system, etc. and I think even a guitar.

You are right that social search provides a different way to interpret words. We are in a better place to be able to deal with ambiguous queries like that. There are also questions that might be hard to articulate for a search engine. It is always easy to ask on a service like Yahoo! Answers. Everybody learns to ask questions as a child; nothing comes more naturally to a human being, and posing that question on Yahoo! Answers typically will yield answers to your question very quickly.

Eric Enge: The Yahoo! Answers market share is really quite impressive. Why do you think it got to such a substantial share so fast?

Tomi Poutanen: Our latest data, just in case you do not have it, is comScore March data, which says we have 19.2 million unique American users, and on a global basis, we have over 90 million users. We are growing very, very fast, growing actually even faster internationally than in the U.S. But it is quite remarkable that we have reached that scale for a product that we only launched in the U.S. in December of 2005.

Eric Enge: Right.

Tomi Poutanen: There is a real demand for an area where individuals can share knowledge, share their wisdom, and share in a very simple Q&A format. By structuring that communication in Q&A format we are helping put more structure in that dialogue, and making that conversation searchable over time. I think very fundamentally, it is successful because it is meeting a market need, and because it is well regarded by our community. I think the second reason it is successful is that, in the area of user generated content, it is probably the simplest online service. It is so simple to ask a question or answer a question that we’ve found many people who wouldn’t otherwise publish information on the internet have become members. I can’t think of one other user generated content site on the Internet that is simpler to use than Yahoo! Answers. So, I think those are the two fundamental reasons why it has grown so fast. Also, Yahoo! is a big believer in this and is being very supportive in promoting the service on our network, and having the backing of Yahoo!, which touches half a billion users on a monthly basis, is certainly a good accelerator for any product.

Eric Enge: Right. There are a lot of other products out there. I think the last statistics I saw had Microsoft’s Live QnA service in the number two slot.

Tomi Poutanen: I think the latest stats I saw in terms of market share said that Yahoo! was at ninety-six percent market share.

Eric Enge: That was the Hitwise study done in December of 2006. I think it is just a phenomenal number.

Tomi Poutanen: Yes. Yahoo! enjoys quite a market lead in this area. It is also by virtue of being in the market first with a free site and having a large current established base of users. For that reason, it is also the place where you are most likely to find an answer to your question very quickly just by virtue of having so many people participating on our service. So, that is certainly helping us along as well, because from an effectiveness point of view, Yahoo! Answers definitely delivers.

Eric Enge: Just a little less than two months ago, you announced new social networking features that you built into Yahoo! Answers. What kind of impact has that had?

Tomi Poutanen: It has been very significant. We have had hundreds of thousands of people build up their personal Answers network, and what that enables those individuals to do is see the questions and answers that their community has asked or provided, and also for their friends to see their questions. By building up a network, you are able to tap into the knowledge of your friends and those whose knowledge you value, and to help out your direct network of individuals that you know and respect. So the feature definitely is having a large impact on Yahoo! Answers, and it is helping people enjoy Answers more and also making it more personal and more effective. We have modeled the Yahoo! Answers network after the Flickr network, where social contacts largely define the user experience. On Flickr you are really more geared towards seeing images from your community, so Flickr is more social than Answers, but Answers certainly has a very strong social element, and that is a very strong motivation for answering questions from your contacts.

Eric Enge: It helps you prioritize.

Tomi Poutanen: Exactly.

Eric Enge: There is a very strong international focus across the site, and I am interested in what has caused so much success across the whole Yahoo! network internationally.

Tomi Poutanen: Yahoo!’s social media and social search properties have done half or more of their activity overseas. And I think that the number one driver behind that is the overall size of the online population globally; the US is less than fifty percent of it.

But also, as an enabler, the services are all offered as platforms, and these platforms in many cases are independent outside of Yahoo!. So a blogger in France can tap into Flickr images through an API call. And they can get their del.icio.us links through an API call. They can build up the site in French, in this case, for their readership, but the content they are pulling from our services comes through a platform that is language insensitive. So, we are really providing the tools that enable webmasters around the world to add this social media content to their sites, through APIs, and in a form that is language agnostic.

Eric Enge: Another observation with regard to del.icio.us is that the audience is really quite affluent. According to Hitwise, thirty-six percent of your users make between 100,000 and 150,000 a year, and are also roughly speaking 59% male. It seems to me that must have some impact on del.icio.us as a property.

Tomi Poutanen: I would expect in the long term that the demographics will become more mainstream. And certainly, it is a reflection that the current users are information workers, and that’s what del.icio.us helps people with: it helps you organize your online world. Anybody who has such a need is typically an information worker, and that group tends to have a higher income.

So that is a reflection of the current audience, early adopters, etc., but in the long term we would expect that to become more mainstream.