Advisor, Board Member, Professor, Former Tech Exec, …

Monthly Archives: March 2012

When you think compression, you probably think saving space. I think speed. In this post, I explain why compression is a performance tool, and how it saves tens of millions of dollars and helps speed up the search process in most modern search engines.

Some background on how search works

Before I can explain how compression delivers speed, I’ll need to explain some of the basics of search data structures. All search engines use an inverted index to support querying by users. An inverted index is a pretty simple idea: it’s kind of like the index at the back of a book. There’s a set of terms you can search for, and a list of places those terms occur.

Let’s suppose you want to learn about the merge sorting algorithm and you pick up your copy of the third volume of Knuth’s Art of Computer Programming book to begin your research. First, you flip to the index at the back, and you do a kind of binary search thing: oops, you went to “q”, that’s too far; flip; oh, “h”, that’s too far the other way; flip; “m”, that’s it; scan, scan, scan; ah ha, “merge sorting”, found it! Now you look and see that the topic is on pages 98, and 158 through 168. You turn to page 98 to get started.

A Simple Inverted Index

An inverted index used in a search engine is similar. Take a look at the picture on the right. What you can see to the left side of it is a structure that contains the searchable terms in the index (the lexicon); in this example I’ve just shown the term cat, and I’ve shown the search structure as a chained hash table. On the right, you can see a list of term occurrences, in this case I’ve just shown the list for the term cat and it shows that cat occurs three times in our collection of documents: in documents numbers 1, 2, and 7.

If a user wants to query our search engine and learn something about a cat, here’s how it works. We look in the search structure to see if we have a matching term. In this case, yes, we find the term cat in the search structure. We then retrieve the list for the term cat, and use the information in the list to compute a ranking of documents that match the query. In this case, we’ve got information about documents 1, 2, and 7. Let’s imagine our ranking function thinks document 7 is the best, document 1 is the next best, and document 2 is the least best. We’d then show information about documents 7, 1, and 2 to the user, and get ready to process our next query. (I’ve simplified things here quite a bit, but that’s not too important in the story I’m going to tell.)
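Here’s a minimal sketch of that lookup in Python. It is an illustration under assumptions, not how any particular engine does it: a plain dict stands in for the chained hash table, the three documents and the `build_index` name are made up, and ranking is left out entirely.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document numbers that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set, so a term repeated within one document is listed once.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

# Hypothetical collection, numbered to match the example above.
docs = {
    1: "the cat sat on the mat",
    2: "a cat and a dog",
    7: "cat videos",
}

index = build_index(docs)

# Querying: look the term up in the lexicon, then fetch its list.
postings = index.get("cat", [])
print(postings)
```

The list that comes back is just a list of integers, which is the part that matters for the rest of this post.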

What you need to take away from this section is that inverted indexes are the backbone of search engines. And inverted indexes consist of terms and lists, and the lists are made up of numbers or, more specifically, integers.

Start with the seminal Managing Gigabytes if you’re interested in learning more about inverted indexes.

Compressing Integers

Inverted indexes have two parts: the terms in our searchable lexicon, and the term occurrences in our lists of integers. It turns out that compression isn’t very interesting for the terms, and I’m going to ignore that topic here. Where compression gets interesting is for our lists of integers.

You might be surprised to learn that integer compression is a big deal. There are scholars best known for their work on compressing integers: Solomon Golomb, Robert F. Rice, and Peter Elias immediately come to mind. I spent a few years between 1997 and 2003 largely focusing on integer compression. Here’s a summary I wrote with Justin Zobel on the topic (you can download a PDF here.)

I could spend a whole blog post talking about the pros, cons, and details of the popular integer compression techniques. But, for now, I’ll just say a few words about variable-byte compression. It’s a very simple idea: use 7 bits in every 8-bit byte to store the integer, and use the remaining 1 bit to indicate whether this is the last byte that stores the integer or whether another byte follows.

Suppose you want to store the decimal number 1,234. In binary, it’d be represented as 10011010010. With the variable byte scheme, we take the first (lowest) seven bits (1010010), and we append a 0 to indicate this isn’t the last byte of the integer, and we get 10100100. We then take the remaining four bits of the binary value (1001), pad it out to be seven bits (wasting a little space), and append a 1 to indicate this is the last byte of the integer (00010011). So, all up, we’ve got 00010011 10100100. When we’re reading this compressed version back, we can recreate the original value by throwing away the “indicator” bits (the final bit of each byte), and concatenating what’s left together: 10011010010. Voila: compression and decompression in a paragraph!

If you store decimal 1,234 on a typical computer architecture, without compression, it’d usually be stored in 4 bytes. Using variable-byte compression, we can store it in two. It isn’t perfect (we wasted 3 bits, and paid no attention to the frequency of different integers in the index), but we saved ourselves two bytes. Overall, variable-byte coding works pretty well.
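The scheme described above can be sketched in a few lines of Python. This is a sketch under assumptions rather than the vbyte.c code: the function names are made up, and the low-order 7-bit chunk is written to the stream first, so the two bytes for 1,234 come out as 10100100 followed by 00010011.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 payload bits per byte.

    The lowest bit of each byte is the indicator: 1 means this is
    the last byte of the integer, 0 means another byte follows.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        chunk = n & 0x7F          # lowest seven bits
        n >>= 7
        last = 1 if n == 0 else 0
        out.append((chunk << 1) | last)
        if last:
            return bytes(out)

def vbyte_decode(data):
    """Decode a byte string holding zero or more encoded integers."""
    values = []
    value, shift = 0, 0
    for byte in data:
        value |= (byte >> 1) << shift   # drop the indicator bit
        shift += 7
        if byte & 1:                    # last byte of this integer
            values.append(value)
            value, shift = 0, 0
    if shift:
        raise ValueError("input ends in the middle of an integer")
    return values

encoded = vbyte_encode(1234)
print([format(b, "08b") for b in encoded])
print(vbyte_decode(encoded))
```

Because each integer carries its own end marker, a whole list can be decoded by concatenating the encodings and calling `vbyte_decode` once.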

Why This All Matters

Index size: compressed versus uncompressed

I’ve gone down the path of explaining inverted indexes and integer compression. Why does this all matter?

Take a look at the graph on the right. It shows the size of the inverted index as a percentage of the collection that’s being searched. The blue bar is when variable byte compression is applied, the black bar is when there’s no compression. All up, the index is about half the size when you use a simple compression scheme. That’s pretty cool, but it’s about to get more interesting.

Take a look at the next graph below. This time, I’m showing you the speed of the search engine, that is, how long it takes to process a query. Without compression, it’s taking about 0.012 seconds to process a query. With compression, it’s taking about 0.008 seconds to process a query. And what’s really amazing here is that this is when the inverted index is entirely in memory: there’s no disk, or SSD, or network involved. Yes, indeed, compression is cutting the time per query by about a third.

That’s the punchline of this post: compression is a major contributor to making a search engine fast.

Search engine querying speed: compressed versus uncompressed

How’s this possible? It’s actually pretty simple. When you don’t use compression, the time to process the inverted lists really consists of two basic components: moving the list from memory into the CPU cache, and processing the list. The problem is there’s lots of data to move, and that takes time. When you add in compression, you have three costs: moving the data from memory into the CPU cache, decompressing the list, and processing the list. But there’s much less data to move when it’s compressed, and so it takes a lot less time. And since the decompression is very, very simple, that takes almost no time. So, overall, you win: compression makes the search engine hum.

It gets more exciting when disk gets involved. You will typically see that compression makes the system about twice as fast. Yes, twice as fast at retrieving and processing inverted indexes. How neat is that? All you’ve got to do is add a few lines of code to read and write simple variable bytes, and you’ll save an amazing amount in processing time (or cost in hardware, or both).

Rest assured that these are realistic experiments with realistic data, queries, and a reasonable search engine. All of the details are here. The only critiques I’d offer are these:

The ranking function is very, very simple, and so the bottleneck in the system is retrieving data and not the processing of it. In my experience, that’s a reasonably accurate reflection of how search engines work — I/O is the bottleneck, not ranking

The experiments are old. In my experience, nothing has changed — if anything, you’ll get even better results now than you did back then

Here is some C code for reading and writing variable-byte integers (scroll to the bottom to where it says vbyte.c). The “writing” code is just 27 fairly simple, not-too-compact lines. The reading code is 17. It’s pretty simple stuff. Feel free to use it.

There’s a wealth of information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

Let’s go back to the example. If the user typed Gucci handbags as the query, we could build a query rewriting system that altered the query silently and ran a query such as Gucci and (handbag or handbags or purse or purses or category:handbags) behind the scenes. The user will see more matches, and hopefully the quality or relevance of the matches will remain as high as without the alteration.
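As a rough sketch of that silent alteration, here’s a tiny rewriter in Python. The `REWRITES` dictionary, the `rewrite_query` name, and the AND/OR output syntax are all assumptions made for illustration; a real system would target whatever query language its engine accepts.

```python
# Hypothetical rewrite dictionary: query term -> equivalent alternatives.
REWRITES = {
    "handbags": ["handbag", "purse", "purses", "category:handbags"],
}

def rewrite_query(query, rewrites=REWRITES):
    """Expand each query term into an OR group of its alternatives."""
    clauses = []
    for term in query.lower().split():
        alternatives = [term] + rewrites.get(term, [])
        if len(alternatives) == 1:
            clauses.append(term)          # nothing known, keep as typed
        else:
            clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)

print(rewrite_query("Gucci handbags"))
```

Terms with no entry in the dictionary pass through untouched, so the rewrite can only widen the match, never narrow it.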

Query rewriting isn’t just about synonyms and plurals. I’ve shared another example where it’s about adding a category constraint in place of a query word. There are many other types of rewrites. Acronyms are another example: expanding NWT to New With Tags, or contracting Stadium Giveaway to SGA. So is dealing with global language variations: organization and organisation are the same. And then there’s conflation: t-shirt, tshirt, and t shirt are equivalent. There are plenty of domain specific examples: US shoe sizes 9.5 and 9 ½ are the same, and if we’re smart we’d know the equivalent UK sizing is 9 and European sizing is 42.5 (but 9 and 9.5 aren’t the same thing when we’re talking about something besides shoes). We can also get clever with brands and products, and common abbreviations: iPad and Apple iPad are the same, and Mac and MacIntosh are the same (when it’s in the context of a computing device or a rain jacket).

Dealing with other languages is a different challenge, but many of the techniques that I’ll discuss later also work in non-English retrieval. Brian Johnson of eBay recently wrote an interesting post about the challenges of German language query rewriting.

Recall versus precision

Query rewriting sounds like a miraculous idea (as long as we have a way of creating the dictionary, which we’ll get to later). However, like everything else in search there is a recall versus precision tradeoff. Recall is the fraction of all relevant documents that are returned by the search engine. Precision is the fraction of the returned results that are relevant.

If our query rewriting system determines that handbag and handbags are the same, it seems likely we’ll get improved recall (more relevant answers) and we won’t hurt precision (the results we see will still be relevant). That’s good. But what about if our system decides handbags and bags are the same: not good news, we’ll certainly get more recall but precision will be poor (since there’ll be many other types of bags). In general, increasing recall trades with decreasing precision – the trick is to figure out the sweet spot where the user is most satisfied. (That’s a whole different topic, one I’ve written about here.)
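To make the two definitions concrete, here’s a small Python sketch. The document numbers are invented: assume documents 1 to 4 are the relevant handbags, and compare a cautious rewrite with an over-eager one that also pulls in other bags.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query's result set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}                    # hypothetical relevant documents

narrow = [1, 2]                            # literal match on "handbags"
broad = [1, 2, 3, 4, 5, 6, 7, 8]           # "handbags" rewritten to "bags"

print(precision_recall(narrow, relevant))  # everything shown is relevant, half is missing
print(precision_recall(broad, relevant))   # nothing is missing, half of what is shown is noise
```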

Query rewriting is like everything else in search: it isn’t perfect, it’s an applied science. But my experience is that query rewriting is a silver bullet: it delivers as much improvement to search relevance as continuous innovation in core ranking.

How do you build query rewriting?

You mine vast quantities of data, and look for patterns.

Suppose we have a way of logging user queries and clicks in our system, and assume we’ve got these to work with in a Hadoop cluster. We could look for patterns that we think are interesting and might indicate equivalence. Here’s one simple idea: let’s look for a pattern where a user types a query, types another query, and then clicks on a result. What we’re looking for here is cases where a user tried something, wasn’t satisfied (they didn’t click on anything), tried something else, and then showed some satisfaction (they clicked on something). Here’s an example. Suppose a user tried apple iapd 2 [sic], immediately corrected their query to apple ipad 2, and then clicked on the first result.

If we do this at a large scale, we’ll get potentially millions of “wrong query” and “right query” pairs. We could then sort the output, and count the frequency of the pairs. We might find that “apple iapd 2” and “apple ipad 2” has a moderate frequency (say it occurs tens of times) in our data. Our top frequency pairs are likely to be things like “iphon” and “iphone”, and “ipdo” and “ipod”. (By the way, it turns out that users very rarely mess up the first letter of a word. Errors are much more common at the end of the words.)
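Here’s a toy, single-machine version of that mining step in Python. It assumes a made-up log format in which each session is a list of `("query", text)` or `("click", result)` events in time order; at real scale this would be a distributed job, but the counting logic is the same idea.

```python
from collections import Counter

def mine_corrections(sessions):
    """Count (first query, second query) pairs followed directly by a click."""
    pairs = Counter()
    for events in sessions:
        # Slide a window of three consecutive events over the session.
        for a, b, c in zip(events, events[1:], events[2:]):
            if (a[0] == "query" and b[0] == "query"
                    and c[0] == "click" and a[1] != b[1]):
                pairs[(a[1], b[1])] += 1
    return pairs

# Hypothetical sessions.
sessions = [
    [("query", "apple iapd 2"), ("query", "apple ipad 2"), ("click", "1")],
    [("query", "iphon"), ("query", "iphone"), ("click", "3")],
    [("query", "iphon"), ("query", "iphone"), ("click", "1")],
    [("query", "ipod"), ("click", "2")],
]

pairs = mine_corrections(sessions)
for (wrong, right), count in pairs.most_common():
    print(count, wrong, "->", right)
```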

Once we’ve got this data, we can use it naively to improve querying. Suppose a future user types apple iapd 2. We would look this query up on our servers in some fast lookup structure that manages the data we produced with our mining. If there’s a match, we might decide to:

Silently switch the user’s query from apple iapd 2 to apple ipad 2

Make a spelling suggestion to the user: “Did you mean apple ipad 2?”

Rewrite the query to be, say, apple AND (iapd OR ipad) AND 2

Let’s suppose in this case that we’re confident the user made a mistake. We’d probably do the first, saving the user a click and a little embarrassment, and deliver great results instead of none (or very few). Pretty neat – our past users have helped a future user be successful. Crowdsourcing at a grand scale!

Getting Clever in Mining

Mining for “query, query, click” patterns is a good start but fairly naïve. Probably a step better than saving the fact that apple iapd 2 and apple ipad 2 are the same is to save the fact that iapd and ipad are the same. When we see that “query, query, click” pattern, we could look for differences between the adjacent queries and save those – rather than saving the query overall. This is a nice improvement: it’ll help us find many more cases where iapd and ipad are corrected in different contexts, and give us a much higher frequency of the pair, and more confidence that our correction is right.
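One simple way to extract the difference, sketched in Python: compare the two queries word by word and keep the pair only when exactly one word changed. That rule, the function names, and the sample query pairs are assumptions for illustration; insertions, deletions, and reorderings would need something smarter.

```python
from collections import Counter

def term_difference(wrong, right):
    """Return the single (old, new) word pair that differs, or None."""
    a, b = wrong.split(), right.split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def count_term_corrections(query_pairs):
    """Aggregate word-level corrections across many query pairs."""
    counts = Counter()
    for wrong, right in query_pairs:
        diff = term_difference(wrong, right)
        if diff is not None:
            counts[diff] += 1
    return counts

# Hypothetical mined query pairs: the same typo in three contexts.
query_pairs = [
    ("apple iapd 2", "apple ipad 2"),
    ("iapd case", "ipad case"),
    ("new iapd", "new ipad"),
    ("gucci handbag", "gucci handbags black"),  # lengths differ, skipped
]

term_counts = count_term_corrections(query_pairs)
print(term_counts.most_common())
```

Three different whole-query pairs, each seen once, collapse into one word-level pair seen three times, which is the extra confidence the paragraph above is after.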

We could also look for other patterns. For example, we could mine for a “query1, query2, query3, click” pattern. We could use this to find what’s different between query1 and query2, and what’s different between query2 and query3. We could even look for what’s different between query1 and query3, and perhaps give that a little less weight or confidence as a correction than for the adjacent queries.

Once you get started down this path, it’s pretty easy to get excited about other possibilities (what about “query, click, query”?). Rather than mine for every conceivable pattern, a nice way to think about this is as a graph: create a graph that connects queries (represented as nodes) to other queries, and store information about how they’re connected. For example, “apple iapd 2” and “apple ipad 2” might be connected and the edge that connects them annotated with the number of times we observed that correction in the data for all users. Maybe it’s also annotated with the average amount of time it took for users to make the correction. And anything else we think is worth storing that might help us look for interesting query rewrites.
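A minimal sketch of such a graph in Python follows. The class and method names are made up, and the only annotations kept on an edge are the two mentioned above: how often the correction was observed and how long it took on average.

```python
class QueryGraph:
    """Directed graph: nodes are queries, edges are observed corrections."""

    def __init__(self):
        # (source query, target query) -> annotations on that edge
        self.edges = {}

    def add_correction(self, source, target, seconds):
        edge = self.edges.setdefault(
            (source, target), {"count": 0, "total_seconds": 0.0}
        )
        edge["count"] += 1
        edge["total_seconds"] += seconds

    def count(self, source, target):
        edge = self.edges.get((source, target))
        return edge["count"] if edge else 0

    def average_seconds(self, source, target):
        edge = self.edges.get((source, target))
        if not edge:
            return None
        return edge["total_seconds"] / edge["count"]

    def neighbours(self, source):
        """Queries that users moved to from `source`, most frequent first."""
        found = [(t, e["count"]) for (s, t), e in self.edges.items() if s == source]
        return sorted(found, key=lambda item: -item[1])

graph = QueryGraph()
graph.add_correction("apple iapd 2", "apple ipad 2", seconds=4.0)
graph.add_correction("apple iapd 2", "apple ipad 2", seconds=6.0)
graph.add_correction("apple iapd 2", "apple ipod", seconds=30.0)

print(graph.neighbours("apple iapd 2"))
print(graph.average_seconds("apple iapd 2", "apple ipad 2"))
```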

Showing a few results from the original query, and then the rest from the query alteration

Showing a few results from the query alteration, and then the rest from the original query

Running a modified query, where extra terms are added to the user’s query (or perhaps some hints are given to the ranking function about the words in the query and the rewriting system’s confidence in them)

Everything in search comes with its problems. There are recall and precision tradeoffs. And there are just plain mistakes that you need to avoid.

Suppose a user attended Central Michigan University. To 99% of users, CMU is Carnegie Mellon University, but not this user and a good number like him. This user comes to our search engine, runs the query CMU, doesn’t get what he wants (not interested in Carnegie Mellon), types Central Michigan University instead, and clicks on the first result. Our mining will discover that Central Michigan University and CMU are equivalent, and maybe add a query rewrite to our system. This isn’t going to end happily – most users don’t want anything from the lesser-known CMU. So, we need to be clever when a query has more than one equivalence.

There are also problems with context. What does HP mean to you? Horsepower? Hewlett-Packard? Homepage? The query rewriting system needs context – we can’t just expand HP to all of them. If the user types HP Printer, it’s Hewlett-Packard. 350 HP engine is horsepower. So, we will need to be a little more sophisticated in how we mine – in many cases, we’ll need to preserve context.
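One way to picture preserving context is a rewrite table keyed on the neighbouring words, sketched here in Python. The table contents, the names, and the “any shared word counts as context” rule are all assumptions for illustration; a mined dictionary would carry frequencies and confidence rather than hand-written sets.

```python
# Hypothetical context-sensitive dictionary:
# term -> list of (context words, expansion to use in that context)
CONTEXT_REWRITES = {
    "hp": [
        ({"printer", "laptop", "ink"}, "hewlett-packard"),
        ({"engine", "motor"}, "horsepower"),
    ],
}

def expand_with_context(query, table=CONTEXT_REWRITES):
    """Expand a term only when another query word supplies the context."""
    terms = query.lower().split()
    rewritten = []
    for term in terms:
        others = set(terms) - {term}
        expansion = None
        for context, candidate in table.get(term, []):
            if context & others:
                expansion = candidate
                break
        if expansion:
            rewritten.append("(" + term + " OR " + expansion + ")")
        else:
            rewritten.append(term)   # no context, so leave it alone
    return " ".join(rewritten)

print(expand_with_context("HP Printer"))
print(expand_with_context("350 HP engine"))
print(expand_with_context("HP"))
```

With no surrounding words to go on, the bare query HP is left as typed rather than expanded to every meaning.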

The good news is we can test and learn. That’s the amazing thing about search when you have lots of traffic – you can try things out, mine how users react, and make adjustments. The query rewrite system is an ecosystem – our first dictionary is our first attempt, and we can watch how users react to it. If we correct HP printer to Horsepower printer, you can bet users won’t click, or they’ll try a different query. We’ll very quickly learn we’ve made a mistake, and we can correct it. If we’ve got the budget, we can even use crowdsourcing to check the accuracy of our corrections before we even try them out on our customers.

Well, there you have it, a whirlwind introduction to query rewriting in search. Looking forward to your comments.


I’ve spent much of the past seven years helping recruit great candidates. I’ve probably interviewed over 1,000 people (wow!). In this post, I thought I’d share some of the experiences and beliefs I’ve built up along the way. Before I start, I should say that Ken Moss and Jim Walsh have shared with me their interviewing philosophies and experiences over the years, and much of what I say below was influenced by them.

Sourcing Candidates

The most successful source of candidates is personal referral. Why? The interview process only estimates whether an engineer is great – we all want the error bars to be small, but there’s only a certain amount of information you can gather in a day of interviews. Having prior knowledge of a candidate is invaluable – if you’ve worked or studied with them, you’re able to decrease the error bars significantly. Moreover, there’s the importance of the human side – if you know someone, and you’ve enjoyed working with them, and the feeling is mutual, you’re already ahead of your competitors and you’ve already given the candidate a reason to come work with you. (Sourcing through a recruiting team is also important, but in my experience the error bars are higher, the success rate is lower, and the engineering team’s knowledge isn’t directly applied to the sourcing problem.)

All up, one of the most important things you can do to help build a great team is make personal referrals. Take an old colleague out to lunch!

Great Interviews

I’m passionate about great interviews. I believe in two things: the candidate must have a great experience, and you must interview for core competencies (and not skills and training). Having a great experience is important: even if a candidate isn’t successful in the interview, they’ll talk to their friends, and spread the word about the interview experience – you want that message to be positive. Hiring for competencies is also important: if you focus only on skills, you may not hire people who can grow and change as your business and technology grows and changes.

What are core competencies? They’re innate traits and abilities, such as integrity, communication skills, and courage and conviction. There’s a company called Lominger that has worked hard on developing a list of the competencies, and ways to describe them and help you understand how to assess your proficiency at each one. The only list in the public domain that I can find that’s similar is here.

I believe that all competencies are important, but that there are four that are critical to being a successful engineer at the companies I’ve worked at:

Intellectual horsepower

Problem solving skills

Drive for Results

Action-oriented

Interviews should focus on uncovering the capabilities of the candidate at those four competencies. (Again, I emphasize that other competencies are important – none of us want to work with folks with low integrity, no sense of humor, or no interest in valuing diversity. But I’m much happier approximating the candidate’s skills at those after the interview, rather than making them the focus.)

Intellectual horsepower is basically being smart, and being able to learn and grow when presented with new knowledge. In interviews, I typically measure this by how fast the candidate understands the questions, the types of questions they ask, and how “fast paced” the conversation is. If I learn something from talking to the candidate, where I’m provoked to have a new thought, I’m usually satisfied that the candidate has intellectual horsepower.

We want problem solvers who can solve complex challenges in code. We don’t need software engineers who can only solve an organizational challenge, figure out a clever physics problem, or solve puzzles from NPR’s Car Talk. It’s therefore essential to present computer science problems, and ask the candidate to solve them with real code. I’m not a fan of pseudo code, and I tend to ignore the output of interviews that don’t have real, hard, problem solving problems with coding solutions. Some of my favorite questions for recent college graduates are: reverse a linked list, and write a program to shuffle a deck of cards. (The former either gets a recursive solution, or uses a stack; if they get it fast, ask them to try the other solution. The latter is best solved with a single pass through an array, where each element is swapped with another random element.)
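For reference, here’s one way those two questions might be answered in Python. Treat it as a sketch, not a model answer: the names are made up, and the shuffle uses the Fisher-Yates variant of the single pass, where the swap partner is drawn only from positions not yet fixed. That detail is a choice made here, because swapping with any random position does not make every ordering equally likely.

```python
import random

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def build_list(values):
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head

def to_values(head):
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values

def reverse_recursive(head, reversed_so_far=None):
    """Move nodes one at a time onto the front of the reversed part."""
    if head is None:
        return reversed_so_far
    rest = head.next
    head.next = reversed_so_far
    return reverse_recursive(rest, head)

def reverse_with_stack(head):
    """Push every node, then relink them in the order they pop off."""
    stack = []
    while head is not None:
        stack.append(head)
        head = head.next
    if not stack:
        return None
    new_head = tail = stack.pop()
    while stack:
        tail.next = stack.pop()
        tail = tail.next
    tail.next = None
    return new_head

def shuffle_deck(deck, rng=random):
    """Single pass; each position swaps with a random not-yet-fixed one."""
    for i in range(len(deck) - 1, 0, -1):
        j = rng.randint(0, i)     # randint includes both endpoints
        deck[i], deck[j] = deck[j], deck[i]
    return deck

print(to_values(reverse_recursive(build_list([1, 2, 3]))))
print(to_values(reverse_with_stack(build_list([1, 2, 3]))))
print(shuffle_deck(list(range(52))))
```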

Drive for results means getting things done for a reason, with zeal, and a strong desire to reach a conclusion. People with this competency are “finishers” and will deliver results for the customers and business, and they realize that getting it done is more important than making it perfect. These kinds of people are scrappy, and make the right tradeoffs, and they’re the ones who work the smartest. How do you figure this out? If I’m interviewing college graduates, I ask about their favorite project while they were at college – do they talk about the customer? The impact it had or could have? Was it actually a summer job that shows their passion for results in the real world? Or was it just a technology for technology’s sake? Is it esoteric or applied? Try asking the question, you’ll get an instant feel for what I mean. If it’s someone more experienced, I generally ask about a project or team they’ve enjoyed being on.

Action-oriented means getting started, and being decisive about starting (and often figuring out only what is needed before beginning). These folks are more action and less talk. They’re the ones you will perceive as hard working. This is a hard competency to explicitly understand in an interview – but you can often observe it in their approach to problem solving questions. Do they jump up to the whiteboard, grab a pen, and start solving? Do they make and state assumptions, just so they can get on with it? Or do they endlessly push back and ask questions? Do they criticize you and your question? Do they try and divert the interview somewhere else? Do you have to ask them to get up and use the whiteboard?

When you’re done with an interview, I believe it’s critical to write down the questions you asked and what you learnt. I typically write down what I learnt about the four competencies, and I always begin my writeup with a definitive statement of whether or not I’d hire the candidate. Great decisions are only made after considered thought – and writing it down makes you think, and makes you stand behind a decision. It’s also very useful – others will learn what you ask and how you interpret it, and it’s also a great record for when a candidate applies again at a later date (this happens more frequently than you’d expect). A good interview writeup is several paragraphs in length in my experience.

Great Experiences

We want to win the hiring race. Great candidates will typically interview with multiple companies, and often have multiple offers. When it comes to decision time, they’ll reflect on more than the position you’re offering and the compensation. They’ll think hard about the people they met, the questions they were asked, and how they were treated during the interview process.

Here are some basic tips:

Don’t ask the same question as someone else – understand what’s been asked so far, and show that you know who they’ve already talked to

Read their resume, show interest in them and their experiences. Often, I look for the unique thing and ask about it – “How long have you been playing guitar?” or “How did you enjoy living in London?”

Leave 5 or 10 minutes to answer questions, sell your experience at the company, and give the candidate a chance to use the restroom or get a drink

Show that you’re smart, great at problem solving, you’re action oriented, and driven for results. Be engaged, animated, excited, and passionate about what you’re doing – great folks want to work with great people

If you’re the manager, make sure nothing slips between the cracks. Stay close to recruiting and the candidate, and make sure everything happens in a timely fashion – even when we don’t want to offer a role, make sure the “regret” experience is timely, in person (not an email!), and professional

The bottom line is it’s like running a retail business. Part of success is having a great customer experience – if you upset someone, trust me that you’ll upset five more through word of mouth. On the flip side, do it well, and you’ll have an enhanced reputation as a great technology company that’s well worth considering as a destination.


The video of my recent keynote “A Tour of eBay’s Extreme Data and Platforms” at the 2012 PHP UK Conference is now online. Here it is:

The keynote is around 60 minutes. The talk has three parts. First, I explain eBay’s scale, and offer some insights into eBay’s usage and the extreme data we generate and store. Second, I give three examples of how we use our data to build customer products and features, and two examples of platforms at eBay: Cassini, our new search engine, and ql.io, a new and innovative gateway for consuming http APIs. Last, I offer an opinion on online commerce and speculate about how it’ll play out over the next two or three years. There’s a short Q&A at the end. Enjoy!


I was recently told that I am an ideas guy. Probably the best compliment I’ve received. It got me thinking, and I thought I’d share a story.

With two remarkable people, Nick Craswell and Julie Farago, I invented infinite scroll in MSN Search’s image search in 2005. Use Bing’s image search and you’ll see that there’s only one page of results – you can scroll and more images are loaded, unlike web search where you have to click on a pagination control to get to page two. Google released infinite scroll a couple of years ago in their image search, and Facebook, Twitter, and others use a similar infinite approach. Image search at Bing was the first to do this.

MSN Search's original image search with infinite scroll

How’d this idea come about? Most good ideas are small, obvious increments based on studying data, not lightning-bolt moments that are abstracted from the current reality. In this case, we began by studying data from image search engines, back when all image search engines had a pagination control and roughly twenty images per page.

Over a pizza lunch, Nick, Julie, and I spent time digging in user sessions from users who’d used web and image search. We learnt a couple of things in a few hours. At least, I recall a couple of things – somewhere along the way we invented a thumbnail slider, a cool hover-over feature, and a few other things. But I don’t think that was over pizza.

Back to the story. The first thing we learnt was that users paginate in image search. A lot. In web search, you’ll typically see that for around 75% of queries, users stay on page one of the results; they don’t like pagination. In image search, it’s the opposite: 43% of queries stay on page 1 and it takes until page 8 to hit the 75% threshold.

Second, we learnt that users inspect large numbers of images before they click on a result. Nick remembers finding a session where a user went to page twenty-something before clicking on an image of a chocolate cake. That was a pretty wow moment – you don’t see that patience in web search (though, as it turns out, we do see it at eBay).

If you were there, having pizza with us, perhaps you would have invented infinite scroll. It’s an obvious step forward when you know that users are suffering through clicking on pagination for the bulk of their queries, and that they want to consume many images before they click. Well, perhaps the simplest invention would have been more than 20 images per page (say, 100 images per page) – but it’s a logical small leap from there to “infinity”. (While we called it “infinite scroll”, the limit was 1,000 images before you hit the bottom.) It was later publicly known as “smart scroll”.

To get from the inspiration to the implementation, we went through many incarnations of scroll bars, and ways to help users understand where they were in the results set (that’s the problem with infinite scroll – infinity is hard to navigate). In the end, the scroll bar was rather unremarkable looking – but watch what it does as you aggressively scroll down. It’s intuitive but it isn’t obvious that’s how the scroll bar should have worked based on the original idea.

This idea is an incremental one. Like many others, it was created through understanding the customer through data, figuring out what problem the customers are trying to solve, and having a simple idea that helps. It’s also about being able to let go of ideas that don’t work or clutter the experience – my advice is don’t hang onto a boat anchor for too long. (We had an idea of a kind of scratchpad where users could save their images. It was later dropped from the product.)