Local Database Accuracy: How Good, How Bad?

There’s an ongoing debate about local database accuracy. Accuracy, freshness and comprehensiveness are the attributes you want in a local database. There are plenty of claims and complaints about the databases, but I haven’t seen anything empirical one way or the other.

Today I met with a startup in the local space that said it licensed the data of one of the big three database vendors and went “door to door” in one San Francisco Bay Area town to verify the accuracy of that information.

What did they find? Shockingly, they discovered there was something wrong (minor to major issues) with 47% of the listings! That’s a remarkably large number. I was truly surprised.

If this is representative of the general accuracy of local databases, then this is a much larger problem than I previously thought.

We will only get more accurate data when local businesses have enough incentive to proactively manage their own information – either directly or through a (paid) partner.

I have long felt there is an opportunity for someone to create a single, central repository where businesses can publish their info once and once only, and all publishers interested in using the data can subscribe to it via a simple API.
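For illustration, here is a minimal sketch (in Python, with entirely hypothetical names) of the publish-once / subscribe model described above: a business writes its listing to one central registry, and every subscribed publisher is notified of the change. A real service would expose this over an HTTP API; this in-process version just shows the data flow.

```python
# Hypothetical sketch of a central business-listing registry.
# All names are illustrative; a real service would sit behind an HTTP API.

class BusinessRegistry:
    def __init__(self):
        self._listings = {}     # business_id -> listing dict
        self._subscribers = []  # publisher callbacks notified on each update

    def publish(self, business_id, name, address, phone):
        """A business (or its paid partner) publishes its info once."""
        listing = {"name": name, "address": address, "phone": phone}
        self._listings[business_id] = listing
        for notify in self._subscribers:
            notify(business_id, listing)

    def subscribe(self, callback):
        """A publisher registers to receive every listing update."""
        self._subscribers.append(callback)

    def get(self, business_id):
        """Basic info lookup (the free tier in the scheme above)."""
        return self._listings.get(business_id)


# Usage: one update propagates to every subscribed publisher.
registry = BusinessRegistry()
seen = []
registry.subscribe(lambda bid, listing: seen.append((bid, listing["phone"])))
registry.publish("cafe-42", "Cafe 42", "10 Main St", "555-0142")
```

The point of the design is that the business edits one record and every downstream publisher stays in sync, rather than each publisher maintaining its own stale copy.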

Someone with clout, eg Google, could provide real value to local businesses by providing such a service, and as a side effect would place themselves in a strong position to up-sell local advertising solutions (website, adwords, etc.).

I envision basic information (name, address, phone) being free to publishers, with enhanced info being available for a fee. Sharing that fee with local businesses might be an additional incentive for them to participate (above and beyond the primary benefit of centralized maintenance).

“We will only get more accurate data when local businesses have enough incentive to proactively manage their own information – either directly or through a (paid) partner.”

This is the mistake that everyone makes– attributing rational behavior to a large pool of small businesses in a local market.

I believe (and am admittedly drinking my own Kool-Aid) that a local newsgathering organization is best positioned to do this. As Ahmed says above, the collection is not as big a hurdle as upkeep. And no matter the incentives, small businesses will not update their own data in a sufficient mass to keep the database clean. It takes a hybrid of user submissions of errors and an on the ground staff that both checks the data by phone and keeps track of venues closing/opening/etc.

The newspapers seem to think this work is beneath their journalists. There is an opportunity there…

Everyone wants more accurate data, to be sure, but we think of things in a slightly different way.

There’s a very important difference between the accuracy of data and the fitness of a model built on top of that data. What we’re doing when we use this data is building a model of the world. And the fitness of that model is determined by a) the interpretive labor set upon the underlying data, and b) the framework in which that model is made useful to the consumer for a particular need.

While we–like everyone else–place a premium on getting better data, we also know that the really hard work is to take what’s available, shore it up with as many other sources of data as we can, and produce useful models–and products–that are still highly “fit” to the needs of the consumer.

The algorithmic/crawling method has the same problem the search engines face right now – who do you trust, and can you *really* trust them?

We have had incorrect results in our database. One person recently contacted me about a number we had wrong (the correct number was 531-xxxx and his phone number was 521-xxxx). The unfortunate person had been inundated with phone calls. It turned out it wasn’t just us – Google listed the incorrect number, as did other sources.

So as I see it, a purely algorithmic solution is close to impossible. Every single database will have flaws. Even taking Mike’s comments – a twice-a-year update may sound like a lot, but is it really? What if restaurant X was ‘verified’ on January 1 and then went out of business 5 days later? For the next ~180 days a restaurant is listed in the database when it does not exist anymore.

An idea I kicked around was an automated phone system – once a month (so that the owner does not get peeved) the system dials xxx-xxx-xxxx, says something like “Hello. I am from XXX and this is an automated check that your place is still in business. If this is YYYYY please press 1. Otherwise, press 0”.

Voila! An ‘algorithmic’ solution to keep local listings up to date. Of course, I myself usually hang up on such phone calls, but I am paranoid about privacy :)
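As a sketch, the call flow described above fits in a few lines. The telephony layer (dialing, text-to-speech, keypress capture) is assumed here, stubbed as a hypothetical callable that returns the digit the answerer pressed, or None on a hang-up:

```python
# Hypothetical sketch of the monthly verification call described above.
# place_call(phone, prompt) stands in for a real telephony/IVR service and
# returns the digit the answerer pressed, or None if they hung up.

def verify_listing(listing, place_call):
    """Classify one listing as 'open', 'closed', or 'unverified'."""
    prompt = ("Hello. I am from XXX and this is an automated check that your "
              "place is still in business. If this is {name}, please press 1. "
              "Otherwise, press 0.").format(name=listing["name"])
    digit = place_call(listing["phone"], prompt)
    if digit == "1":
        return "open"
    if digit == "0":
        return "closed"
    return "unverified"  # hang-up or no answer: flag for manual follow-up


# Usage with stubbed answers: the owner confirms, someone else denies,
# and a privacy-conscious answerer hangs up.
listing = {"name": "Restaurant X", "phone": "555-531-0000"}
confirmed = verify_listing(listing, lambda phone, prompt: "1")
denied = verify_listing(listing, lambda phone, prompt: "0")
hung_up = verify_listing(listing, lambda phone, prompt: None)
```

Note the third outcome: as the commenter admits, many people hang up on automated calls, so any real system needs an “unverified” bucket that gets routed to manual follow-up rather than being treated as a closure.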

“What if restaurant X was ‘verified’ on January 1, and then went out of business 5 days later. For the next ~180 days a restaurant is listed in the database when it does not exist anymore.”

That’s why you need people on the ground, preferably with other things to do (like news gathering). The errors I pointed out on the other listing site weren’t caught by research. They were caught because I live here and am part of an organization that makes it its business to follow such comings and goings.

In other words, the best you can hope for on local is “semi-scalable.”

I like your phone idea, but it doesn’t account for our biggest problem on these calls– non-English speakers. That’s a hard one to beat. (Note that we are also collecting hours of operation.)

I will admit to being somewhat surprised by the 47% error rate Greg referred to. As we all know, though, database accuracy has always been a problem – think back to the errors you have experienced over the years with Directory Assistance and print White Pages. While in the past one could appreciate just how prohibitively expensive it would be to create accurate databases, I believe technology has dramatically reduced that cost in recent years, and at the same time the business opportunity has significantly increased.

I believe we are several years past the point where a major search engine or IYP should have seen the business opportunity and begun making the investment. Instead, the IYPs have been content to use what is essentially the same database, purchased at a very low cost, and to “compete” on bells and whistles instead of focusing on what really matters – the content itself. I always felt that major publishers weren’t willing to make the multi-million-dollar investments needed to begin making a real difference in data quality, because it would likely take years to make a real impact on their business and there would be no way to predict what that impact would be. Personally, I feel that IYP sites have a very high churn rate among their user bases due to poor data quality, but because of tracking difficulties the publishers aren’t able to figure out what this number is, or the considerable benefits to be gained from even a small reduction in that user churn.

Imagine if a publisher were to use automated tools like AhmedF describes, in conjunction with web-crawling and an overseas call center. The combination would create relatively low-cost verification and information gathering capabilities (example: Use the automated call to solicit businesses “Press ‘1’ if you would want a representative from Google to call you to collect information about your business that will be made available in our online directory”).

The publishers, through their database of queries, already know what businesses are most requested and thus can concentrate limited resources on those businesses.

With all the things that Google is doing, you’d think they’d get around to investing $10m/year which would give them dramatically better data than any competing publisher. Arguably, Yahoo would be the best candidate to do this since that company is more savvy about combining technology and people.

It’s not for nothing that all those travel guide books pay college kids to speed through 10 countries in two months to try and fact check, making sure that cool, cheap pension is still at 10 Rue Garcon. Quality local data comes from going out and getting it. Yahoo and Google both make it pretty easy for local mom and pops to update/correct their info and, especially outside the big cities, it seems that almost no one takes advantage of this.

[…] There’s an interesting post on the reliability (or lack thereof) of local data over on Greg Sterling’s blog that is very relevant to ConsensusBest and any other site dealing in any kind of “where” data. Helping people find the best products is just half the value we hope to provide. The other half is helping people find those products in their local stores, since most people still prefer to make most purchases live and in person. […]

I’ve optimized my business site to do well for all sorts of long-tail local phrases. If the aggregators can’t do a decent job of providing good information, local businesses will do best optimizing their own sites, and advertising agencies/local SEO experts will do well for basic search.

It baffles me that this hasn’t been solved yet – and I stress yet. It’s funny that all the proposed solutions were technology-based. That’s the main problem right there.

The solution is so simple if you think about it. The problem is it takes time, value and a reason. I’m tired of these “local” sites that upload a generic listing and have an “Is this your listing?” button. What good is that? So basically they upload information that everyone already has and build a pretty interface that is basically useless.

In my opinion local search, as it’s implemented today, is a fraud, and so are the main companies in this area. Just look at Citysearch, which has to be the most useless website on the planet. But they’re “Citysearch” and the brand lives on. Look at their compete.com profile and it’ll tell the whole story: people will come, and they’ll leave much faster. Surprisingly, although they are doing it “right,” they aren’t taking advantage of it – instead they are devaluing it. And their search is a waste of time. It is implemented for them to make money, not for you to find what you are looking for. Try finding a restaurant in the East Village for brunch that has outdoor seating – a simple search made excruciatingly painful. Sorry to get off topic; I just hate Citysearch and what they represent in the local industry. I find it insulting.

Anyhow, sorry to get long-winded. I know no one will probably see this post since it’s 2 years old, but that’s why I’m writing it.

The solution is this: it needs to be built from SCRATCH, without data collection, the right way, on one system.

Also, much love to Greg – without you I wouldn’t be doing what I’m doing, for the most part. Your blog has been enormously helpful.

Oh hey guys, I have to say I’m a little embarrassed by my comment above. I was in a bad mood yesterday, and talking about these “local” search companies brings out the worst in me. I’ve been researching this market for 2 years now and it seems to be getting worse – which is fine by me, though. I wish I could comment more on Greg’s blog, but I need to keep things to myself, which is driving me nuts and is also quite lonely. I should say this, though: the only reason I know anything is from Greg and people like you, Ahmed, who post replies.

One last thing and I’ll go back underground where I belong. Ahmed, you said it seems like I figured things out so easily. This is true. However, anyone can talk a good game; the hard part will be implementing it. We all know that a change is definitely needed in this market, and change is never easy.

This was fun, thanks, and hopefully we can talk about this one day soon. I’d really love that.
