Category Archives: weotta

As we prepare to launch Weotta, we’ve struggled with how to describe what we’ve built. Is our technology big data? Yes. Do we use machine learning and natural language processing? Yes. Could you call us a search engine? Absolutely. But we think the sum is more than those parts.

We finally decided that the term that best describes what we do is deep search — a concise description of a complex search system that goes far beyond basic text search. Needless to say, we aren’t the only ones in this area by any means; Google, and plenty of other companies do various aspects of deep search. But no one has created a deep search system quite like ours — a search technology built to handle the kinds of everyday queries that don’t make sense to a normal text search engine.

Basic Search

Text search engines such as sphinx or lucene/solr, use faceted filtering: collections of documents which each have a set of fields, often specified in an XML format and each indexed so that they can be efficiently retrieved given a search query and optional facet parameters. (I recommend reading Introduction to Information Retrieval to learn specific implementation details).

In text indexing you can usually specify different fields, each with its own weight. For instance, if you choose a heavily weighted title field and a lower weighted text field, documents with your search term in their title will get a higher score than documents that just have it in the text field.

To retrieve indexed documents you use search query strings that are analogous to SQL queries, in that there’s usually special syntax to control how the search engine decides which documents to retrieve. For example, you might be able to specify which words are required and which are optional, or how far apart the words can be.

Now, all of this is fine for programmers and technical users, but it’s hardly ideal for typical consumers, who don’t even want to know that special syntax exists, let alone how to use it. Thankfully, a deep search query engine’s superior query parsing and understanding of natural language makes that special syntax unnecessary.

Facets provide ways of filtering search results. A faceted search could look for a specific value in the field, or something more complex like a date/time range or distance from a given point. Facets don’t usually affect a document’s score, but simply reduce the set of documents that get returned. Faceted search can also be called Faceted Navigation, because facets often enable a combination search/browse interface. A basic search system, if it offers facets at all, will generally do so via checkboxes, dropdowns or similar controls. ebay is a perfect example of this, offering many facets to drill down & filter your search. A deep search system, by contrast, moves facets to the background.

These text search engines are powerful, but in the context of deep search, they’re really just another kind of database. Both text search and deep search use indexes to optimize retrieval, but instead of using SQL to retrieve data, text search uses specially formatted query strings and facet specifications. A SQL database is optimized for row-based or column-based data, while a text search engine is optimized for plain text data, using inverted indexes. Either way, from the developer standpoint, both are low-level data stores, best suited for different use cases.

Advanced Search

An advanced search system uses a text search engine at its lowest levels, but integrates additional ranking signals. An obvious example of this is Google’s Page Rank, which combines text search with keyword relevance, website authority, and many other signals in order to sort results. Where basic text search only knows about individual documents, and statistics about collections of documents, an advanced search system also considers external signals like trustworthiness, popularity and link strength. Amazon, for instance, lets users sort results by average rating or popularity. But this still isn’t deep search, because there’s no deeper understanding of the data or the query, just more powerful controls for sorting results.

Multi-category search. If you’re only searching for one thing, your search system can be relatively simple. But as soon as a search contains multiple variables with no explicit facets given by a user, you need NLU to know precisely what’s being searched for, and how to search for it. You also need to effectively and automatically integrate multiple data sets into one system.

Feature engineering for deep data understanding. Contrary to popular belief, big data isn’t enough. Simply having access to tons of data doesn’t automatically mean you know how to get meaningful insights out of it. A good metaphor is that of an iceberg: users can only see the tip, while most of the berg lies hidden below the water. In this metaphor, data is the ice, and feature engineering is how you shape the ice below the water, in order to surface the best results where users will see them.

Contextual understanding. The more you know about the user, the more knowledge you have with which to tailor search results. This could mean knowing the user’s location, their past search history, and/or explicit preferences. Context is king!

Many of today’s search systems don’t meet any of these requirements. Some implement one or two, but very few meet them all. Siri has device context and does NLU to understand queries, but instead of actually doing the search, it routes it to another application or search engine. Google and Weotta meet all the requirements, but have very different implementation, approaches, and use cases.

How does one build a deep search system? As with simple text search, there are two major stages: indexing and querying. Here’s an overview of both, from a deep search perspective.

Deep Indexing

Deep search requires a deep understanding of your data: what it is, what it looks like, what it’s good for, and how to transform it into a format that machines can understand. Here are a few examples:

places have addresses and geographic points

products have a weight and size

movies have actors and directors

Once you’ve got your low-level data structure, you transform it into a document structure suitable for text and facet indexing. But deep search also requires higher level knowledge and understanding, which is where feature engineering comes into play. You have to think deeply about what kinds of searches your customers may do, and what level of quality they expect in the results. Then you have to figure out how to translate that into indexable document features.

Here are two examples of this thinking.

A restaurant serves chicken wings. Okay, but are they any good. How much do people like or dislike them? Are they the best in the city? Questions like this could be answered through a twist on menu-based sentiment analysis.

A specific concert may be a one-time event, but the bands have probably played other shows before. How did people like those previous gigs? What are their fan’s general demographics? What’s the venue like? Answering these questions may require combining multiple datasets in order to cross-correlate performers with concerts and venues.

Deep indexing is all about answering these kinds of questions, and converting the answers into values that are usable for ranking and/or filtering search results. This may involve applied data science, linear regression or sentiment analysis. There’s no specific methodology, because the questions and answers depend on the nature of your data and what kind of results you need. But with the proper methods, you can achieve insights that weren’t possible before. For example, with latent semantic analysis you can discover features that aren’t explicit in the data, which allows queries that would be impossible with basic text indexing. Unsurprisingly, you can expect to spend most of your time deep in the data trenches. To quote Pedro Domingos, from his paper A Few Useful Things to Know about Machine Learning:

“First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.”

“70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.”

A major part of feature engineering is getting more data and better data. You need large, diverse datasets to get the necessary context. In Weotta’s case, that includes geographic info, demographic profiles, POI and location databases, and the social graph. But you also need a deep understanding of how to integrate and correlate this data, which machine learning algorithms to apply, and most important, which questions to ask of it and which can be answered. All of this goes into engineering an integrated system that can do so automatically. “We don’t have better algorithms,” says Google Research Director Peter Norvig. “We just have more data.”

At Weotta, we believe that high-quality data is paramount, so we spend a surprising amount of effort filtering out noisy data to extract meaningful signals. A huge part of any significant feature engineering, in fact, is data cleansing. After all, garbage in, garbage out.

You also need an automated process for continuous learning. As data comes in and is integrated, your system should automatically improve. “Machine learning isn’t a one-shot process of building a data set and running a learner,” says Pedro Domingos, “but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.”

And people are an essential part of this process. You must be able to incorporate human knowledge and expertise into your data pipeline at almost every level; it is the right balance and combination of humans and machines that will determine a deep search system’s true capabilities and ability to adapt to change.

Deep Querying

Once you’ve got a deep index powered by deep data, you need to use it effectively. Simple text queries won’t suffice; you need to understand exactly what you’re searching for in order to get the right results. That means query parsing and natural language understanding.

Once you’ve parsed the query and know what to search for, the next step is retrieving results. Since we’re doing multi-category search, that means querying multiple indexes. At this point the NLU query parsing becomes essential, because you need to know what kinds of query parameters each index supports, so the system can slice and dice the query intelligently.

But if you’re retrieving different kinds of information, how do you compose them into one set of results? How do you rank and order different kinds of things? These are fundamentally interface design and user experience questions. Google uses different parts of their results page for different kinds of results, such as maps and knowledge graph.

At Weotta, we’ve decided the card analogy makes a lot of sense. On mobile we have one stack, and on web up to five cards in a row. Why? This presentation visually focuses the user on a few results at a time while letting us show multi- category results. That’s how you can do a search like dinner drinks and a movie and get three different kinds of results, all mixed together.

Remember facets from earlier? With deep search, facets are hidden to the user, but they’re still essential to the query engine. Instead of relying on explicit checkboxes, the query parser uses natural language understanding to decide which facets to use based on the query. This decision can also be driven by the nature of the data and the product. At Weotta, when we know a query is about food, we use a facet to restrict the results to restaurants. Google does things differently; while they may know that a query has food words, because their data are so much larger and more diverse, they are unable or unwilling to make a clear decision about what kinds of results to show, so you often end up with a mix. For example, I just tried a search for sushi and along with a list of web pages, I got a ribbon of local restaurants, a map, and a knowledge graph box. Since Weotta is focused on local search and “what to do,” we know you’re looking for sushi restaurants, and that’s what we’ll produce for you. Better yet, with Weotta Deep Search, a user can be even more specific and get relevant results for restaurants that have hamachi sushi.

Another key to our deep query understanding is context: Who is doing the search? Where are they? What time is it? What’s the weather there right now? What searches have they done in past? Who are their friends or contacts? What are their stated preferences? What are their implicit preferences?

The answers to these questions could have a significant effect on results. If you know someone is in New York, you may not want to show places or events happening elsewhere. If it’s raining outside, you may want to feature indoor events or nearby places. If you know someone dislikes fast food, you don’t want to show them McDonald’s.

People tend to like what their friends like. It may not be a strong signal, but social proof does matter to almost everyone. Plus, people often do things with their friends and family, so if you take all their preferences into account, you may be able to find more relevant results. In fact, if you use Facebook to signup for Weotta, you’ll be able to search for places and events your friends like.

Summary

A deep search system goes beyond basic text search and advanced search with the following requirements:

No explicit facets

Multi-category search

Deep feature engineering

Context

To implement these, you’ll need to make use of natural language understanding, machine learning, and big data. It’s even more work to implement than you’d think, but the benefits are quite clear: you can do natural language queries with a simpler interface and get more relevant, personalized results.

As we build ever more machines to adapt to human needs, I believe deep search technology will become an integral part of our daily lives in countless ways. For now, you can get a taste of its capabilities with Weotta.

Share this:

At the end of February and the beginning of March, I’ll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order…

How Weotta uses MongoDB

Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking aboutHow Weotta uses MongoDB. We’ll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we’ll finish with a preview of Weotta’s upcoming MongoDB powered local search APIs.

Corpus Bootstrapping with NLTK at Strata 2012

As part of the Strata 2012Deep Data program, I’ll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there’s plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn’t exist. So how can you quickly create a quality corpus at minimal cost? I’ll cover specific real-world examples to answer this question.

NLTK Tutorial at PyCon 2012

Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You’ll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you’ll walk out with at least one new NLP superpower that you can put to use immediately.

On the first day, we gave our demo, and I nearly swore on stage when I saw the big red X’s that you get when the Google Static Maps API rate limits your IP address. I had checked before the session started, and everything seemed okay, but I guess you can’t escape Murphy’s Law (especially when hundreds of people are sharing the same IP address). Afterwards, we immediately scrambled to get the site ready to allow people in. So many people were sharing our beta invite link on Facebook that the Weotta Facebook App was temporarily disabled due to unusual behavior. Luckily, our excellent advisor Mike Hart connected us with some great people at the Facebook API team, and they quickly got us back online.

The next day we discovered, and quickly fixed, an inaccurate geocode that was causing certain plans not to generate. Then I found out that anonymized facebook emails are much longer than Django’s 75 character default EmailFieldmax_length. Not wanting to do a database migration while so many people were using the site, I waited until getting back home to fix this issue. But despite these small problems, hundreds of people were able to get in to Weotta, make plans, and discover fun things to do with their friends.

Weotta has been running smoothly ever since, and now that the conference craziness is over, we can start focusing on our #1 feedback: when will Weotta be in my city? We got requests for everywhere from Chicago and Denver to Sydney and Singapore. We hear you, and will be expanding outside of SF and NY as fast we can. While our methods are very algorithmic and we don’t depend on UGC, it still takes human effort to give you focused, localized, highly relevant content so you can easily discover and plan amazing occasions. And if you’d like to help us expand and improve Weotta, get in touch. On the technical side, we’re looking for at least 2 developers: a crawler/content person familiar with Scrapy, and a Django/jQuery web developer. If you’re interested, contact me on github, LinkedIn, or directly at jacob@weotta.com.

We hope that everyone who signed up for the beta has received their invite; if you haven’t (or want one), then you can signup for Weotta here (only a limited number will get in). And if you want to learn more about Weotta, then check out the Weotta press coverage. Weotta currently covers San Francisco and New York, so if you’re interested in a “personal concierge” like service that can provide recommended plans/itineraries of things to do in a city, then signup for weotta here.