On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising

Note: This post is about a new product we're testing at my company, Kosmix.

Search
engines are great at finding the needle in a haystack. And that's perfect
when you are looking for a needle. Often, though, the main objective is
not so much to find a specific needle as to explore the entire haystack.

When
we're looking for a single fact, a single definitive web page, or the
answer to a specific question, then the needle-in-haystack search
engine model works really well. Where it breaks down is when the
objective is to learn about, explore, or understand a broad topic. For
example:

Hiking the Continental Divide Trail.

A loved one recently diagnosed with arthritis.

You read The Da Vinci Code and have an irresistible urge to learn more about the Priory of Sion.

Saddened by George Carlin's death, you want to reminisce over his career.

The
web contains a trove of information on all these topics. Moreover, the
information of interest is not just facts (e.g., Wikipedia), but also
opinion, community, multimedia, and products. What's missing is a
service that organizes all the information on a topic so that you can
explore it easily. The Kosmix team has been working for the past year
on building just such a service, and we put out an alpha yesterday. You
enter a topic, and our algorithms assemble a "topic page" for that
topic. Check out the pages for Continental Divide Trail, arthritis,
Priory of Sion, and George Carlin.

The problem we're solving is fundamentally different from search, and
we've taken a fundamentally different approach. As I've written before,
the web has evolved from a collection of documents that neatly fit in a
search engine index, to a collection of rich interactive applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by
querying these applications and assembling the results on-the-fly into
a 2-dimensional grid. We have partnered with many of the services that
appear in the results pages, and use publicly available APIs in other
cases.
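The post doesn't go into implementation details, but to make the query-and-assemble idea concrete, here is a minimal sketch under my own assumptions; the service list, the query callables, and the module structure are illustrative, not Kosmix's actual code:

```python
import concurrent.futures

# Hypothetical sketch of fanning a topic out to several services and
# collecting the responses as page modules. Everything here is illustrative.

def fetch_module(service, topic):
    """Query one external service and wrap its results as a page module."""
    results = service["query"](topic)          # e.g., a call to a public API
    return {"source": service["name"], "results": results}

def build_topic_page(topic, services):
    """Query all relevant services in parallel and collect their modules."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch_module, s, topic) for s in services]
        return [f.result() for f in concurrent.futures.as_completed(futures)]

# Example with stubbed-out services:
services = [
    {"name": "Wikipedia", "query": lambda t: [f"{t}: encyclopedia article"]},
    {"name": "YouTube",   "query": lambda t: [f"{t}: related videos"]},
]
print(build_topic_page("arthritis", services))
```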

Here are some of the challenging problems that we had to tackle in building this product:

Figuring out which applications are relevant to a topic. For
example, Boorah, Yelp, and Google Maps are relevant to the topic
"restaurants 94041". WebMD, Mayo Clinic, and RightHealth are relevant
to "arthritis". If we called each application for every query, the page
would look very confusing, and our partners would get unhappy very
quickly! I'll write more on how we do this in a separate post, but it's
very, very cool indeed.

Figuring out the placement and space allocation for each element in
the 2-dimensional grid. Going from one dimension (a linear list) to two
dimensions (a grid) turns out to be quite a challenge, both from an
algorithmic and from a UI design point of view; a toy sketch of the
layout problem follows below.
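Neither the actual layout algorithm nor the module scoring is described in this post, so the following only illustrates one way to pose the problem: greedily place modules, in decreasing order of an assumed relevance score, into whichever column is currently shortest. The scores and size estimates are made up for the example.

```python
def layout_grid(modules, columns=3):
    """Toy greedy layout: always add the next (most relevant) module to the
    column that is currently shortest. 'height' is an assumed size estimate."""
    heights = [0] * columns
    placement = [[] for _ in range(columns)]
    for module in sorted(modules, key=lambda m: m["relevance"], reverse=True):
        col = heights.index(min(heights))      # shortest column so far
        placement[col].append(module["title"])
        heights[col] += module["height"]
    return placement

# Example: more relevant modules end up nearer the top of the page.
modules = [
    {"title": "WebMD",       "relevance": 0.9, "height": 2},
    {"title": "Mayo Clinic", "relevance": 0.8, "height": 3},
    {"title": "YouTube",     "relevance": 0.5, "height": 2},
    {"title": "Shopping",    "relevance": 0.3, "height": 1},
]
print(layout_grid(modules))
```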

In this alpha, we've taken a first stab at tackling these challenges. We
are still several months from having a product that we feel is ready to
launch, but we decided to put this public alpha out there to gather
user feedback and tune our service. Many aspects of the product will
evolve between now and then: Do we have the right user interaction
model for topic exploration? Do we put too much information on the
topic page? Should we present it very differently? How do we combine
human experts with our algorithms?

Most importantly, the Kosmix approach does not work for every query! Our goal is to
organize information around topics, not answer arbitrary search
queries. How do we make the distinction clear in the product itself? Can we carve out a separate niche from search engines?

We hope to gain insight into all
these questions and more from this alpha. Please use it and provide
your feedback!

Microblogging is a nice-to-have in developed economies,
like the US. It's a must-have in developing economies like India,
China, and Egypt.

In essence,
microblogging is semi-synchronous publish-subscribe messaging. It’s
publish-subscribe because it decouples senders and their reader(s), who
can choose which senders to follow at any point in time. It is
semi-synchronous because readers can choose either to follow the stream
synchronously (via various desktop tools or their mobile phones) or to read it
later. In the Western world, the penetration of PCs is almost
universal, so we have other PC-dependent messaging options such as
blogging (asynchronous publish-subscribe); email (asynchronous
point-to-point); instant messaging (synchronous point-to-point). Yes,
none of them offers quite what Twitter does, but the majority of people
in the majority of situations can make do with the conventional
options.
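To make the "semi-synchronous publish-subscribe" framing concrete, here is a minimal sketch of my own (not how Twitter or GupShup is actually built): messages are pushed immediately to followers who are currently connected, and queued for everyone else to read later.

```python
from collections import defaultdict

class MicroblogHub:
    """Toy semi-synchronous publish-subscribe: delivery is immediate for
    online followers and deferred (queued) for everyone else."""

    def __init__(self):
        self.followers = defaultdict(set)   # sender -> set of readers
        self.online = set()                 # readers currently connected
        self.inbox = defaultdict(list)      # reader -> messages to read later

    def follow(self, reader, sender):
        self.followers[sender].add(reader)

    def publish(self, sender, message):
        for reader in self.followers[sender]:
            if reader in self.online:
                print(f"push to {reader}: {message}")   # synchronous delivery
            else:
                self.inbox[reader].append(message)      # read later

    def read_later(self, reader):
        messages, self.inbox[reader] = self.inbox[reader], []
        return messages

hub = MicroblogHub()
hub.follow("asha", "beerud")
hub.follow("ravi", "beerud")
hub.online.add("asha")
hub.publish("beerud", "cricket score update")   # pushed to asha, queued for ravi
print(hub.read_later("ravi"))
```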

Contrast this with the situation in third-world nations: PC penetration is incredibly low, while mobile penetration is incredibly high.
For example, India has about 40 million PCs but 10 times as many cell
phones. This makes short text messages sent via SMS the main written
communication mechanism. Blogging, email, and IM are just not options,
so microblogging becomes the main form of publishing, communication,
and self-expression.

I recently started using Twitter and have become a big fan of the
service. I've been appalled by the downtime the service has endured,
but sympathetic because I assumed the growth in usage was so fast that
much might be excused. Then I read this TechCrunch post on the Twitter
usage numbers, and sympathy turned to bafflement -- because I'm
intimately familiar with SMS GupShup, a startup in India that boasts
usage numbers much, much higher than Twitter's, but has
scaled without a glitch.

I'll let the numbers speak for themselves:

Users: Twitter (1+ million); SMS GupShup (7 million)

Messages per day: Twitter (3 million); SMS GupShup (10+ million)

Actually, these numbers don't even tell the whole story. India is a
land of few PCs and many mobile phones. Thus, almost all GupShup
messages are posted via mobile phones using SMS. And almost every
GupShup message is posted simultaneously to the website and to the
mobile phones of followers via SMS. That's why SMS is in the
name of the service. Contrast this with Twitter, where the majority of the
posting and reading is done through the web. Twitter has said in the
past that sending messages via the SMS gateway is one of their most
expensive operations, so the fact that only a small fraction of their
users use the SMS option makes their task a lot easier than GupShup's.

So I sat down with Beerud Sheth, co-founder of Webaroo, the company
behind GupShup (the other founder Rakesh Mathur is my co-founder from a
prior company, Junglee). I wanted to understand why GupShup scaled
without a hitch while Twitter is having fits. Beerud tells me that
GupShup runs on commodity Linux hardware and uses MySQL, the same as
Twitter. But the big difference is in the architecture: right from day
1, they started with a three-tier architecture, with JBoss app servers
sitting between the webservers and the database.

GupShup also uses an object architecture (called the "objectpool")
which allows each task to be componentized and run separately -- this
helps immensely with reliability (it can automatically handle machine
failure) and scalability (it can scale dynamically to handle increased
load). The objectpool model allows each module to run as multiple
parallel instances, each doing a part of the work. They can
run on different machines and can be started or stopped independently,
without affecting each other. So the "receiver", the "sender", and the
"ad server" all run as multiple instances. As traffic scales, they can
just add more hardware -- no re-architecting. If one machine fails, the
instance is restarted on a different machine.
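The post describes the objectpool only at a high level, so the sketch below is just my rough illustration of the general pattern it suggests: one module, many interchangeable worker instances pulling from a shared queue, so capacity grows by starting more instances and a failed instance can be replaced without touching the rest.

```python
import queue
import threading

# Rough illustration (not GupShup's actual code) of the "objectpool" idea.

def run_instances(handler, work_queue, num_instances):
    """Start num_instances parallel workers for one module (e.g., the sender)."""
    def worker():
        while True:
            item = work_queue.get()
            if item is None:            # shutdown signal
                break
            handler(item)               # this instance does its share of the work
            work_queue.task_done()
    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_instances)]
    for t in threads:
        t.start()
    return threads

# e.g., three parallel "sender" instances draining the outgoing-SMS queue
outgoing = queue.Queue()
run_instances(lambda msg: print("send", msg), outgoing, num_instances=3)
outgoing.put("hello from GupShup")
outgoing.join()
```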

In read/write applications, the database is often the bottleneck. To
avoid this problem, the GupShup database is sharded.
That is, the tables are broken into parts: for example, users A-F in one
shard, G-K in another, and so on. The shards are periodically rebalanced as
the database grows. The JBoss middle-tier contains the logic that hides
this detail from the webserver tier.
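Here is a minimal sketch of that kind of range-based shard routing; the shard boundaries and the placement of this logic in the middle tier are my own assumptions for illustration.

```python
# Minimal illustration of range-based sharding by user name.
SHARDS = [
    ("A", "F", "db-shard-1"),
    ("G", "K", "db-shard-2"),
    ("L", "R", "db-shard-3"),
    ("S", "Z", "db-shard-4"),
]

def shard_for_user(username):
    """Middle-tier routing: map a user to the database shard holding their rows."""
    first = username[0].upper()
    for low, high, shard in SHARDS:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard configured for {username!r}")

# The webserver tier just calls the middle tier; it never sees SHARDS.
print(shard_for_user("beerud"))   # -> db-shard-1
```

Rebalancing then amounts to changing the boundary table and migrating the affected rows, which is why keeping this logic out of the webserver tier matters.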

I'm not familiar with the details of Twitter's architecture, beyond
knowing they use Ruby on Rails with MySQL. It appears that the biggest
difference between Twitter and GupShup is 3-tier versus 2-tier. RoR is
fantastic for turning out applications quickly, but the way Rails
works, the out-of-the-box approach leads to a two-tier architecture
(webserver talking directly to database). We all learned back in the
90's that this is an unscalable model, yet it is the model for most
Rails applications. No amount of caching can help a 2-tier read/write
application scale. The middle-tier enables the database to be sharded,
and that's what gets you the scalability. I believe Twitter has
recently started using message queues as a middle-tier to accomplish
the same thing, but they haven't partitioned the database yet -- which
is the key step here.

I don't intend this as a knock on RoR, rather on the way it is used by
default. At my company Kosmix we use an RoR frontend for a website that
serves millions of page views every day; we use a 3-tier model where
the bulk of the application logic resides in a middle-tier coded in
C++. Three-tier is the way to go to build scalable web applications,
regardless of the programming language(s) you use.

Update: VentureBeat has a follow-up guest post by me, with some more details on SMS GupShup. Also my theory on why SMS GupShup is growing faster than Twitter: Microblogging is a nice-to-have in places with high PC penetration, like the US, but a must-have in places with very low PC penetration, like India.

Disclosure: My fund Cambrian Ventures is an investor in Webaroo, the
company behind SMS GupShup. But these are my opinions as a database
geek, not as an investor.

Often, we have a really cool algorithm (say, support-vector machines or singular value methods) that works only on main-memory datasets. In such cases, the only possibility is to reduce the data set to a manageable size through sampling. Mayank's post illustrates the dangers of such sampling: in businesses such as advertising, sampling can make the pattern you're trying to extract so weak that even the more powerful algorithm cannot pick it up.

For example, say 0.1% of users exhibit a certain kind of behavior. If you start with 100 million users, and then take a 1% sample, you might think you are OK because you still have 1 million users in your sample. But now just 1000 users in the sample exhibit the desired behavior, which may be too small for any algorithm to pick apart from the noise. In fact, it's likely to be below the support thresholds of most algorithms. The problem is, these 0.1% of users might represent a big unknown revenue opportunity.
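Here is the same back-of-the-envelope arithmetic written out; the numbers come straight from the example above.

```python
users         = 100_000_000      # start with 100 million users
behavior_rate = 0.001            # 0.1% exhibit the behavior of interest
sample_rate   = 0.01             # take a 1% sample

in_full_data = int(users * behavior_rate)                  # 100,000 users
sample_size  = int(users * sample_rate)                    # 1,000,000 users
in_sample    = int(users * sample_rate * behavior_rate)    # ~1,000 users (expected)

print(in_full_data, sample_size, in_sample)   # 100000 1000000 1000
```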

Moral of the story: use the entire data set, even if it is many terabytes. If your algorithm cannot handle a dataset that large, then change the algorithm, not the dataset.

This post continues my prior post, Are Machine-Learned Models Prone to Catastrophic Errors? You can think of these as a two-post series based on my conversation with Peter Norvig. As that post describes, Google has not cut over to the machine-learned model for ranking search results, preferring a hand-tuned formula. Many of you wrote insightful comments on this topic; here I'll give my take, based on some other insights I gleaned during our conversation.

The heart of the matter is this: how do you measure the quality of search results? One of the essential requirements to train any machine learning model is a set of observations (in this case, queries and results) that are tagged with "scores" that measure the goodness of the results. (Technically this requirement applies only to so-called "supervised learning" approaches, but those are the ones we are discussing here.) Where to get this data?
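To make the data requirement concrete, here is a toy sketch of what such tagged observations might look like, together with a trivially simple linear model fit to them. The queries, features, scores, and fitting procedure are my own illustrative assumptions, not Google's pipeline.

```python
# Illustrative shape of supervised training data for ranking:
# each observation pairs a (query, result)'s features with a human-assigned score.
training_data = [
    {"query": "diabetes pregnancy",  "features": {"pagerank": 0.70, "title_match": 1.0}, "score": 4},
    {"query": "diabetes pregnancy",  "features": {"pagerank": 0.90, "title_match": 0.2}, "score": 2},
    {"query": "stanford university", "features": {"pagerank": 0.95, "title_match": 1.0}, "score": 5},
]

def fit_weights(data, features=("pagerank", "title_match"), lr=0.01, epochs=500):
    """Fit a tiny linear scoring function to the rated examples by
    per-example gradient descent on squared error (illustration only)."""
    weights = {f: 0.0 for f in features}
    for _ in range(epochs):
        for row in data:
            pred = sum(weights[f] * row["features"][f] for f in features)
            err = pred - row["score"]
            for f in features:
                weights[f] -= lr * err * row["features"][f]
    return weights

print(fit_weights(training_data))
```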

Given Google's massive usage, the simplest way to get this data is from real users. Try different ranking models on small percentages of searches, and collect data on how users interact with the results. For example, how does a new ranking model affect the fraction of users who click on the first result? The second? How many users click through to page 2 of results? Once a user clicks through to a result page, how long before they click the back button to come back to the search results page?
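As an illustration of the kind of metrics described above, here is a toy computation over an assumed click-log format; the fields and values are made up for the example.

```python
# Toy computation of usage metrics from a per-search click log (assumed format).
click_log = [
    {"user": "u1", "clicked_rank": 1, "went_to_page2": False, "dwell_seconds": 95},
    {"user": "u2", "clicked_rank": 3, "went_to_page2": False, "dwell_seconds": 12},
    {"user": "u3", "clicked_rank": 1, "went_to_page2": True,  "dwell_seconds": 40},
]

n = len(click_log)
frac_first_click = sum(r["clicked_rank"] == 1 for r in click_log) / n   # clicks on result #1
frac_page2       = sum(r["went_to_page2"] for r in click_log) / n       # continued to page 2
mean_dwell       = sum(r["dwell_seconds"] for r in click_log) / n       # time before back button

print(frac_first_click, frac_page2, mean_dwell)
```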

Peter confirmed that Google does collect such data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact, Google does not use such real usage data to tune their search ranking algorithm. What they really use is a blast from the past. They employ armies of "raters" who rate search results for randomly selected "panels" of queries using different ranking algorithms. These manual ratings form the gold standard against which ranking algorithms are measured -- and eventually released into service.
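Here is a rough sketch of how hand-labeled ratings can serve as a gold standard for comparing ranking algorithms. The documents, ratings, and simplified position-discounted metric are assumptions for illustration, not Google's actual evaluation.

```python
# Score a ranking function against human rater judgments (illustrative only).
ratings = {
    ("arthritis", "doc_a"): 5,   # rater says doc_a is an excellent result
    ("arthritis", "doc_b"): 2,
    ("arthritis", "doc_c"): 4,
}

def discounted_gain(ranker, queries):
    """Position-discounted sum of rater scores: rewards putting highly
    rated documents near the top (a simplified DCG-style metric)."""
    total = 0.0
    for q in queries:
        for position, doc in enumerate(ranker(q), start=1):
            total += ratings.get((q, doc), 0) / position
    return total

ranker_old = lambda q: ["doc_b", "doc_a", "doc_c"]
ranker_new = lambda q: ["doc_a", "doc_c", "doc_b"]
print(discounted_gain(ranker_old, ["arthritis"]))   # 2/1 + 5/2 + 4/3 ≈ 5.83
print(discounted_gain(ranker_new, ["arthritis"]))   # 5/1 + 4/2 + 2/3 ≈ 7.67
```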

It came as a great surprise to me that Google relies on a small panel of raters rather than harness their massive usage data. But in retrospect, perhaps it is not so surprising. Two forces appear to be at work. The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.

Let me try to explain the latter point. There are two broad classes of queries search engines deal with:

Navigational queries, where the user is looking for a specific uber-authoritative website, e.g., "stanford university". In such cases, the user can very quickly tell the best result from the others -- and it's usually the first result on major search engines.

Informational queries, where the user has a broader topic, e.g., "diabetes pregnancy". In this case, there is no single right answer. Suppose there's a really fantastic result on page 4 that provides better information than any of the results on the first three pages. Most users will not even know this result exists! Therefore, their usage behavior does not actually provide the best feedback on the rankings.

Such queries are one reason why Google has to employ in-house raters, who have been instructed to look at a wider window than the first 10 results. But even such raters can only look at a restricted window of results. And using such raters also makes the training set much, much smaller than could be gathered from real usage data. This fact might explain Google's reluctance to fully trust a machine-learned model. Even tens of thousands of professionally rated queries might not be sufficient training data to capture the full range of queries that are thrown at a search engine in real usage. So there are probably outliers (i.e., black swans) that might throw a machine-learned model way off.

I'll close with an interesting vignette. A couple of years ago, Yahoo was making great strides in search relevance, while Google apparently was not improving as fast. You may recall that, at the time, Yahoo trumpeted data showing their results were better than Google's. Well, the Google team was quite amazed, because their data showed just the opposite: their results were better than Yahoo's. They couldn't both be right -- or could they? It turns out that Yahoo's benchmark contained queries drawn from Yahoo search logs, and Google's benchmark likewise contained queries drawn from Google search logs. The Yahoo ranking algorithm performed better on the Yahoo benchmark and the Google algorithm performed better on the Google benchmark.

Two lessons from this story: one, the results depend quite strongly on the test set, which again speaks against machine-learned models. And two, Yahoo and Google users differ quite significantly in the kinds of searches they do. Of course, this was a couple of years ago, and both companies have evolved their ranking algorithms since then.

Note: I wrote this piece a couple of weeks back, inspired by Greg Linden's blog post (see below). Inc then picked up the piece and asked me not to publish it until it appeared on the Inc website. The article appears on the Inc website today with some minor edits.

Greg Linden was one of the key developers behind Amazon's famous
recommendations system -- the system that recommends books, movies, and
other products to Amazon customers based on their purchase history. He
subsequently went to Stanford and picked up an MBA. In January 2004, he
launched a startup named Findory to provide everyone with a
personalized online newspaper. You cannot imagine anyone who could be
more qualified to make a startup like this a success. Yet Findory shut
down in November 2007. In a brilliant post-mortem,
Greg says his big mistake was to bootstrap his company while trying to
raise funding from venture capital firms; he just couldn't convince
them to invest. He should have raised his funding from angel investors
instead.

This is an important decision
every startup founder has to make -- where to raise their funding. The
three viable sources at the very early stages of a company are:

Friends and family, or yourself, if you can afford it.

Angel
investors. Usually wealthy individuals, but includes outfits such as Y
Combinator. (My firm Cambrian Ventures is also in this category, although we
are currently not actively seeking investments; we're too busy running
our own company Kosmix.)

Venture Capital (VC).

To
understand which option is best for your startup, you need to
understand how investors evaluate companies. While investors evaluate
companies across a range of criteria, three that stay consistent are:
Team, Technology, and Market. Angels and VCs evaluate them in different
ways. Here's how.

How Venture Capitalists Evaluate Startups

Market. Venture
Capitalists want to invest in companies that produce meaningful returns
in the context of their fund size, which typically is in the hundreds
of millions of dollars. To interest a VC firm, a company needs to be
attacking a large market opportunity. If you cannot make a credible
case that your startup idea will lead to a company with at least $100
million in revenue within 4-5 years, then a VC is not the right fit for
you. It's often OK to use consumer traction as a substitute for market
opportunity -- many VCs will accept a large and rapidly growing user
base as sufficient proof that there is a potentially large market
opportunity.

Team. Venture
Capitalists use simple pattern matching to classify teams into two
buckets. A founding team is deemed "backable" if it includes one or
more seasoned executives from successful or fashionable companies (such
as Google) or entrepreneurs whose track record includes at least one
past hit. Otherwise the team is considered "non-backable."

Technology.
Venture Capitalists are not always great at evaluating technology. To
them, technology is either a risk (the team claims their technology can
do X; is that really true?) or an entry barrier (is the technology hard
enough to develop to prevent too many competitors from entering the
market?). If your startup is developing a nontrivial technology, it
helps to have someone on the team who is a recognized expert in the
technology area -- either as a founder or as an outside advisor.

Here's the rule of thumb: to qualify for VC financing, you need to pass the Market Opportunity test and at least one of the other two tests. Either you have a backable team, or you have nontrivial technology that can act as an entry barrier.

How Angels Evaluate Startups

There
are many kinds of angels, but I recommend picking only one kind:
someone who has been a successful entrepreneur and has a deep interest
in the market you are attacking or the technology you are developing.
Other kinds of angels are usually not very high value. Here's how
angels evaluate the three investment criteria:

Market. It's
all right if the market is unproven, but both the team and the angel
have to believe that within a few months, the company can reach a point
where it can either credibly show a large market opportunity (and thus
attract VC funding), or develop technology valuable enough to be
acquired by an established company.

Team. The team needs to include someone the angel knows and respects from a prior life.

Technology. The technology is something the angel has prior expertise in and is comfortable evaluating without all the dots connected.

Here's the angel rule of thumb: you need to pass any 2 out of the 3 tests (team/technology, technology/market, or team/market).
I have funded all 3 of these combinations, resulting in either
subsequent VC financing (e.g., Aster Data,
Efficient Frontier, TheFind), or quick acquisitions (Transformic, Kaltix
-- both acquired by Google).

I've
written about the stories behind the Aster Data investment and the Transformic investment previously on my blog. In both cases, notice how my personal
relationship with the founders, as well as my passionate belief in the
technology, played big roles in the investment decisions.

Friends and Family or Bootstrap

This
is the only option if you cannot satisfy the criteria for either VC or
angel. But beware of remaining too long in this "bootstrap mode." An
outside investor provides a valuable sounding board and prevents the
company from becoming an echo chamber for the founder's ideas. An angel
or VC can look at things with the perspective that comes from distance.
Sometimes an outside investor can force something that's actually good
for the founder's career: shut the company down and go do something
else. That decision is very hard to make without an outside investor.
My advice is to bootstrap until you can clear either the angel or the
VC bar, but no longer.

Back now to Greg
Linden and Findory. By my reckoning, Findory passes the team and
technology tests from an angel's point of view -- if you pick an angel
investor who has some passion for personalization technology. The
company doesn't pass any of the VC tests. Given this, Greg should
definitely have raised angel funding. My guess is that this route would
likely have led to a sale of the company to one of many potential
suitors: Google, Yahoo, or Microsoft, among many others. Of course,
hindsight is always 20/20! I have deep respect for Greg's intellect and
passion and wish him better luck in his future endeavors.