Adventures in Technology and Data Science

February 17, 2008

Dave McClure moderated an event on Search & The Social Graph
at the Yahoo! campus this week, organized by the Search SIG of the
Software Development Forum. With the meteoric rise of Facebook and the
heightened interest in leveraging the social graph - both Google and
Yahoo! have launched new APIs and OpenSocial is gaining momentum - this
discussion was timely and attendance was strong.

The panelists represented some of the most interesting players in this space:

Kevin Marks from Google

Aditya Agarwal from Facebook

Kent Brewster of Yahoo!

Eve Phillips, CEO of Chirp

It turned out to be an interesting event, with lots of good discussion
about the implications of portability, privacy, utility and
monetization of social data. No stranger to the social data space,
moderator McClure did an outstanding job of keeping things focused and
the discussion lively; he was clearly knowledgeable and well-prepared,
launching into a series of leading questions that moved the
conversation forward.

Key Observations

By grouping together related comments, I've distilled the discussion at this event into the following topics:

1. Relevance of Search Results

- With the explosion of self-publishing and user-generated content on
the web, the type of data getting created on the web is changing, and
the classic search algorithms are becoming less effective.
- Users are increasingly interested in what their friends and peers are doing online.
- By using a social graph to filter out results during a specific search, you can boost the relevance of search results.
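
To make the last point concrete, here is a minimal sketch of social re-ranking. It is not any search engine's actual method: the `Result` type, the `base_score` field (standing in for a conventional relevance score) and the `boost` value are all assumptions made for illustration. It boosts friend-authored results rather than filtering the others out, which is a gentler variant of the same idea.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    author: str
    base_score: float  # hypothetical relevance score from a conventional ranker


def rerank(results, friends, boost=0.5):
    """Sort results by base score plus a fixed boost for friend-authored items."""
    def score(result):
        return result.base_score + (boost if result.author in friends else 0.0)
    return sorted(results, key=score, reverse=True)


results = [
    Result("a.example/post", "stranger", 0.9),
    Result("b.example/post", "alice", 0.6),
    Result("c.example/post", "bob", 0.3),
]
friends = {"alice"}  # the searcher's social graph, reduced to a set of names
ranked = rerank(results, friends)
print([r.url for r in ranked])
```

With these made-up numbers, the post written by a friend moves ahead of the higher-scoring post from a stranger.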

2. Monetization

- It is no longer uncommon for a person to become a media source, using
tools such as twitter, blogs and RSS feeds; but this is hard to
monetize. A referral model works better in this case than advertising.
- Brand advertising is still big, even for social search, but it works differently than it does for targeted search.
- Online brand advertising will move toward more interactive experiences in the future.
- The key question is: Does membership in a social group signal an intention that can be targeted by advertisers? The panelists felt that, on balance, it does not.
- For a more concrete example: Google's directed search is very
monetizable; Facebook has a lot of social data, but its users' behavior
is not very monetizable.

3. Privacy

- There is a clear difference between a publicly-proclaimed graph, such
as the friends on Facebook, and a private list, such as Email contacts;
application developers will ignore this distinction at their peril
- Yahoo!'s Brewster said it best: "There should never be a privacy surprise for the user!"
- Applications should make it clear to users if they are making data
public or private; e.g. Flickr is three-valued in this regard

4. Interaction Levels

- From a monetization perspective, not all "friends" are created equal; some connections in the social graph are stronger than others
- The smallest inner set of friends is the most valuable; the first 25 people have 80% of the value
- The viral rate of promotion in Facebook is incredible
- If users can annotate connections, they can more fully express their network graph
- You can infer relationships from user behavior, such as sites visited and click-throughs
- The most important part of social data is the connections, followed by the profile; eventually, it gives you the ability to answer the question: "Who should you go to, to answer this question?"
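
As a rough illustration of inferring connection strength from behavior, the sketch below counts interactions per contact and measures how much of the total the strongest few ties account for. The interaction log, the function names and the idea of using raw counts as "strength" are my own assumptions, not something the panelists described.

```python
from collections import Counter


def tie_strengths(interactions):
    """Count interactions per contact; a higher count means a stronger inferred tie."""
    return Counter(interactions)


def top_share(strengths, k):
    """Fraction of all interactions accounted for by the k strongest ties."""
    total = sum(strengths.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in strengths.most_common(k))
    return top / total


# Hypothetical log: one entry per click-through, message or visit involving a contact.
log = ["alice", "bob", "alice", "carol", "alice", "bob", "dave", "alice"]
strengths = tie_strengths(log)
share = top_share(strengths, 2)
print(strengths.most_common(2), share)
```

Run against real data with k set to 25, the same measure would be one way to check the "first 25 people have 80% of the value" claim for a given user.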

5. OpenSocial

- OpenSocial allows application developers to write one application,
and then take it to where the users are on diverse other social networks
- The vision: take some of the good parts of Facebook and bring those to a lot of people
- This allows any application to spread through the social graph

6. Social Email

- Email networks have a lot of connection data, which has social data buried in it
- These connections can either be one-way or two-way; the difference signals intent on the part of the user
- Google's Marks made an interesting point: a person's email address and personal URL
are opposites - with the former, you can communicate with that
person; with the latter, the person communicates with you
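
The one-way versus two-way distinction is easy to extract from message headers. The following is a small sketch under the assumption that the mail data has already been reduced to (sender, recipient) pairs; the names and the sample data are hypothetical.

```python
def classify_connections(messages):
    """Split (sender, recipient) pairs into one-way and two-way connections.

    Returns (one_way, two_way): one_way holds directed pairs that were never
    reciprocated; two_way holds unordered pairs that wrote to each other.
    """
    directed = {(s, r) for s, r in messages if s != r}  # ignore notes to self
    one_way, two_way = set(), set()
    for sender, recipient in directed:
        if (recipient, sender) in directed:
            two_way.add(frozenset((sender, recipient)))
        else:
            one_way.add((sender, recipient))
    return one_way, two_way


messages = [
    ("ann", "ben"),
    ("ben", "ann"),         # reciprocated: ann and ben have a two-way tie
    ("ann", "newsletter"),  # never answered: one-way
    ("cat", "ann"),         # ann never replied: one-way
]
one_way, two_way = classify_connections(messages)
print(one_way, two_way)
```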

Facebook

Facebook's Agarwal
did a great job of articulating the company's approach to some of these
issues. His contributions to the discussion were somewhat
Facebook-centric; but given the strong community interest in Facebook
lately, this only added to the value of the panel.

In discussing the value of social data for search, Agarwal compared the
issues of selecting for relevance among a large number of results for a
targeted search, with those of producing Facebook's news feed, which
must also present a large amount of data to the user in a format that's
easy to consume.

In terms of privacy, Facebook wants to allow users to annotate the
social graph, so that they can fully express their network. This will
allow users to separate their strong connections from casual friends.
The size of a user's graph is another dimension to be considered.

On data portability, Facebook currently has no plans to build features
that support it. Agarwal clarified that although the company supports
data portability initiatives philosophically, it has not judged them
to be the best use of resources at this time.

Finally, although Agarwal did not acknowledge this directly, the
panelists agreed that the Facebook-type social network data and
searches are far less monetizable than directly targeted activities
that display clear intent, such as a Google search.

Chirp

This was the first time I had seen Chirp in action. Eve Phillips, Chirp's CEO, gave a demo of chirpscreen,
an interactive screen saver that displays content from your social
network, such as pictures from Flickr and status messages from
Facebook. On the whole, the audience loved it - a series of photos of
her friends kept popping up on the screen - but there were some
concerns about being able to control what gets shown. According to
Phillips, Chirp is planning to introduce new features soon that will
allow users to set preferences of what content is displayed, from which
sources, and so on.

Open Questions

McClure asked the panelists some incisive questions, which deserve
to be listed in their own right; I hope these lead to a wider
discussion about social data and related topics:

Is Social Search revolutionary or evolutionary?

Which benefits more from social data: targeted search or discovery?

How well does social search monetize?

How should we use the social data that's automatically present in Email?

If Facebook and other networks encourage lightweight friendships, does it obscure the real social graph?

January 07, 2008

Jeremiah Owyang wrote an interesting post yesterday: The Five Members of the Techmeme Family - in which he lists the different types of bloggers that end up on Techmeme. I think he's right on the money; as an avid follower of the site, I've seen the same dynamics at play.

In his post, Owyang also looks at how posts are rated on Techmeme.
What's interesting is that the person who breaks the story
does not necessarily get the lead; a more mainstream news source or
blogger often becomes the "top node", even if all he or she does is
repeat the story without any additional content or unique insight.
This is a reasonable approach from an automated content discovery perspective, but it sometimes gives funny results.

As Owyang says:

...

The Breaker: This can be mainstream news source or
a mainstream blogger that discovers the story from the Original News
Source and blogs it, as a result, they often become the top node, even
if they aren’t the original source. It seems as if some websites are
naturally geared to be an “H1” even if they are resonators.

The Resonator: Also referred to as those who echo
or copy, they repeat what was already said, adding little or no
additional content, news or opinion.

At that time, the big news of the moment was about an executive
defection, er, employment change - Steve Souders, Chief Performance
Yahoo, left his post at Yahoo! to join Google.

What is interesting to note is the ordering of the various stories on the Techmeme web site.

The lead story on this topic is the Silicon Alley Insider post by Henry
Blodget - an A-list blogger. Now, Mr. Blodget is a fine writer and SAI
is a great blog, but this particular lead story
is written mostly as a breaking-news flash, with minimal
opinion and no particularly startling insights. (Where is the story behind the story?)

However, the story had already been broken by techno.blog on the previous day
(according to the respective blog post time stamps), so it wasn't
really breaking news by the time it appeared on Silicon Alley Insider.
And others - for example, Donna Bogatin and Ashkan Karbasfrooshan - provide a lot more additional content and, arguably, much more insight. So how did the big-T pick Blodget's post as the lead?

My belief is that the Techmeme algorithms choose their lead based on
the prominence of the source and on the number of links to a given post
(two factors that are generally highly correlated, in any case).

This is fine and generally works well. Are there other options, other
algorithms that can be used to choose the lead for a developing story,
that could highlight the more meaty posts? A few possibilities come to
mind:

Reader Votes: Within the set
of posts for a developing story, allow readers to vote for the ones
they like best, so that the most popular ones rise to the top.

Link Count: Examine the
cross-linking between posts to leverage the implicit knowledge therein,
similar to Google's PageRank algorithm. I believe Techmeme already
incorporates this to some extent.

Bookmark Count: Examine the incidence of social bookmarks for different posts on popular bookmarking services like del.icio.us.

Human Editors: Use human editors to select the top leads. Of course, this may prove too expensive and/or cumbersome.

Author Markup: Enable
authors to include metadata in some standard format for their posts. By
using markup or tags such as "news", "opinion", "analysis",
"multi-idea" and so on, authors could indicate the type of their post
to the selection engine. Admittedly, this approach is susceptible to
gaming, although it could be combined with voting to improve quality.
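
To show how several of these signals might be combined, here is a minimal sketch of a weighted lead score. To be clear, this is not how Techmeme works (I don't know its algorithm); the `Post` fields, the weights and the sample numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    prominence: float    # hypothetical rating of the source's prominence
    inbound_links: int   # links from other posts on the same story
    reader_votes: int
    bookmarks: int       # e.g. saves on a social bookmarking service


# Assumed weights; a real system would have to tune these.
WEIGHTS = {"prominence": 1.0, "inbound_links": 0.5, "reader_votes": 0.3, "bookmarks": 0.2}
PROMINENCE_ONLY = {"prominence": 1.0, "inbound_links": 0.0, "reader_votes": 0.0, "bookmarks": 0.0}


def lead_score(post, weights=WEIGHTS):
    """Weighted sum of the signals for one post."""
    return sum(weight * getattr(post, name) for name, weight in weights.items())


def pick_lead(posts, weights=WEIGHTS):
    """Return the post with the highest weighted score."""
    return max(posts, key=lambda post: lead_score(post, weights))


posts = [
    Post("A-list news flash", prominence=10, inbound_links=4, reader_votes=2, bookmarks=1),
    Post("In-depth analysis", prominence=3, inbound_links=6, reader_votes=30, bookmarks=15),
]
print(pick_lead(posts, PROMINENCE_ONLY).title)  # ranking by prominence alone
print(pick_lead(posts).title)                   # ranking with reader signals included
```

In this toy data, prominence alone picks the news flash, while adding votes and bookmarks lets the meatier post take the lead.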

Over time, the significance of "prominence" as a measure of content
quality is eroding - especially for blog posts. As the
web evolves, Techmeme and other sites are sure to experiment with these
and other alternative approaches; it will be interesting to see which
ones emerge as the winners.

August 26, 2007

Thank you to everyone who participated in the last Software Abstractions survey! We asked: which features do you see as the most important ones for Web Search in the future? The results were interesting.

Out of a total of 33 votes, the top votes were closely split among five answers:

Personalization [6 votes]

Social Input [5 votes]

Semantic Query [5 votes]

Semantic Index [6 votes]

Trusted Sources [6 votes]

For search engines with advanced linguistic parsing
capabilities, it's reasonable to assume that semantic processing will
be applied to both the query and to the indexed content as a whole. If
you combine those two answers, then Semantic Processing is the clear winner with 33% of the votes!
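
For anyone who wants to check the arithmetic, here is a quick tally using the vote counts listed above. The five answers shown account for 28 of the 33 votes; the snippet assumes the remaining votes went to answers not listed here.

```python
votes = {
    "Personalization": 6,
    "Social Input": 5,
    "Semantic Query": 5,
    "Semantic Index": 6,
    "Trusted Sources": 6,
}
TOTAL_VOTES = 33  # total reported for the survey, including answers not listed

semantic = votes["Semantic Query"] + votes["Semantic Index"]
semantic_share = semantic / TOTAL_VOTES
listed = sum(votes.values())

print(f"Semantic Processing: {semantic} votes ({semantic_share:.0%})")
print(f"Listed answers: {listed} of {TOTAL_VOTES} votes")
```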

The high number of votes for the "Trusted Sources" answer was a surprise - clearly, users care a great deal about the quality of future search results (and about their being spam-free).

January 31, 2007

I've been a regular reader of Mark Seremet's Repliqa blog for a long time. It's a great blog - he regularly manages to come up with some fascinating posts, such as this post on Entrepreneurship outside Silicon Valley and this one on Donald Trump.
Mark is a serial entrepreneur, and is working on a discovery engine
called Repliqa (currently in stealth mode) and a startup called
Wallhogs. In his latest post, he gives a nod to the Software Abstractions blog. Thanks, Mark - I appreciate the compliment! Keep those great posts coming ...