
January 07, 2012

I've just noticed a new feature on Google Maps. To help with discoverability and understanding of the satellite imagery (as opposed to the basic map), they have introduced an overlay window in the top right-hand corner which exposes the satellite image for that part of the map:

As you pan and scroll the map, the image changes, tracking these movements.

September 25, 2011

After I noticed a problem with the accounting for circles in Google Plus, Natalie suggested I send feedback. So here it is:

1. Your profile tells you how many people are in your circles and how many have added you to their circles.

2. Your circles management page (click the circles icon) tells you this information as well. But the numbers aren't the same:

3. There are 5 more people counted as added to my circles on the circles management page than are presented on my profile (31-26). My guess is that these 5 are people with whom I'm sharing via email only. OK - if that is the case, this needs to be clarified in the interface.

4. My profile indicates that there are 3 more people who have put me in their circles than the number accounted for on my circles management page (177-174). I can find people who have added me to their circles but who are not among the 177 people shown on my profile. For example, if I go to Robert Scoble's profile page I see this:

but, when I bring up the set of people from my profile via the "view all" button I see no Robert Scoble (see below for how this was verified). Robert is in the view of people that I bring up from my circles management page, however.

It is tedious to verify who is present and who isn't due to the way the interface is implemented: scrolling doesn't lock on the rows of cards for people in either view, and the underlying HTML / JSON is too dynamic to allow for the easy capture of all entries. The circles management page allows you to sort alphabetically by name, so I can verify that people are there, but I can't do this on the profile page view as there is no sorting option.

The way I verified that Scoble wasn't in the profile view was as follows. I opened the Chrome browser in debug mode, brought up the 'view all' layer and copied the HTML. This only shows the HTML for the visible part of the set though. So I then scrolled down one step and copied again, repeating until I had scrolled through the entire list (thankfully I have very few followers).

By looking at the HTML, I see no sign of Robert (but I did spot check others whom I see in the view).
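For what it's worth, the tedious manual check can be sketched in a few lines of Python. This assumes the copied fragments were saved to files named fragment_*.html (the naming is mine), and it searches the raw markup for a substring rather than parsing Google's actual, very dynamic HTML:

```python
# Hypothetical sketch of the manual verification step: given the HTML
# fragments copied while scrolling, check whether a given name appears
# anywhere in the combined capture. File naming is illustrative.
import glob

def name_in_capture(name, pattern="fragment_*.html"):
    """Return True if `name` occurs in any captured HTML fragment."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            if name.lower() in f.read().lower():
                return True
    return False
```

A simple substring check is enough here because the goal is only to rule a name in or out, not to enumerate every card in the view.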

What is additionally weird is that Robert is present in the circles management view which claims to have fewer people than the profile page view where he is missing.

In summary, there is either a problem in how this data is being presented to the user (and this user can't understand it), or there is a simple bug that results in different people being visible in the profile page view compared to the circles management page view.

Update: since posting I noticed that the number reported in my circles management page went down by 1 but the same number is displayed on my profile page as before.

August 13, 2011

When I'm signed in to Google and issue a regular web search, this is what I visually see:

What's with the black background, exclamation sign, unreadable text, etc? Are things going pear-shaped as Google focuses on more products? The actual search results are great, but the experience is horrible.

April 28, 2011

Historically, the local search space was defined by aggregates of business listings purchased from companies whose original focus was various types of advertising and direct marketing. As the space has evolved, search engines have looked at ways to gather better data and, importantly, to merge different sources of data into a single unified view of the restaurant or business being indexed and served as the result of a query.

Consequently, when you see a details page - either on Google, Bing or some other search engine with a local search product - you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process - the conflation of data - is where you either succeed or fail.
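As a toy sketch of that two-step process (the field names, the match rule, and the source-quality ranking below are all invented for illustration; real conflation systems are far more sophisticated):

```python
# Illustrative match/merge sketch. "Match" asserts two records describe
# the same real-world entity; "merge" decides which attribute values win.
def normalize(s):
    """Crude name normalization: lowercase, collapse whitespace."""
    return " ".join(s.lower().split())

def is_match(a, b):
    """Toy match rule: same normalized name and same postal code."""
    return (normalize(a["name"]) == normalize(b["name"])
            and a["zip"] == b["zip"])

def merge(a, b, quality):
    """Toy merge rule: per attribute, prefer the higher-quality source."""
    better, worse = (a, b) if quality[a["source"]] >= quality[b["source"]] else (b, a)
    merged = dict(worse)
    merged.update({k: v for k, v in better.items() if v})
    return merged
```

The interesting failures happen exactly where this sketch is naive: a match rule loose enough to tolerate typos in names will sometimes assert that two different businesses are the same entity.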

The entry is for a company called the West Seattle Blog - certainly one of the best hyperlocal blogs I've come across. The listing clearly represents the corporation that publishes the blog, which covers news and events in the West Seattle neighborhood.

As the places page indicates, it is an owner-verified listing, which means that the business owner has confirmed and contributed some of the data.

However, because Google gets data from many sources, it has conflated these additional data sets with this authoritative listing. One of the local entities that it has conflated with the West Seattle Blog is the West Seattle Bowl - a bowling alley in the same neighborhood but at a different address. While it is easy to see how this has happened (the names are identical apart from a few differences between the words 'blog' and 'bowl', which could easily be accounted for by a typo on the part of either data provider), it is unusual in that the addresses of the entities and many of the other features conflict.
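To make the typo hypothesis concrete, a standard Levenshtein edit-distance computation shows just how close the two names are. This is only an illustration of why a naive name-similarity matcher might conflate them, not a claim about Google's actual algorithm:

```python
# Classic Levenshtein distance via dynamic programming, keeping only
# the previous row of the DP table.
def edit_distance(s, t):
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]
```

'West Seattle Blog' and 'West Seattle Bowl' are only three edits apart, which a matcher tuned to forgive data-entry typos could easily treat as the same name.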

By adding the attributes of the two businesses together, we see that the hybrid entity has the categories

Again, due to the mashing together of the data, the images - for a blogging company - are actually those for the bowling alley. In addition, a search on Google for 'west seattle bowl' brings up this listing and no listing for the bowling alley per se.

Because Google's local search, like most local search products, provides reviews for businesses, this search result contains snippets of text reviewing the company. But compounding the problems of mis-associating these two entities, because the blog itself posts reviews about other local businesses, these too have been associated with the Blog / Bowl, informing the user of the great Chinese food, etc.:

Another feature of this result is that there are links (listed under reviews) to articles in the Seattle PI (a local newspaper). I imagine these have come about due to the journalistic nature and linking of the West Seattle Blog. However, the article one immediately reaches via the Seattle PI link on the page is entitled Shooter in West Seattle slayings reportedly had mental problems - not exactly ideal copy for potential customers of a bowling alley in that neighborhood.

I certainly believe that the mining of web content and the inference performed on multiple signals of data are two crucial technologies in the local search space, and I have plenty of respect for what Google has achieved in these areas. However, this example (which I came across via some original local search I don't recall) just seemed like an amazing example of a mutated association graph with interesting consequences worthy of sharing.

December 19, 2010

Since my earlier post on the new trending tool provided by Google Books, I've been thinking more about the service. While I've found plenty of interesting trends (more of which later), I've also been considering the underlying data and interface. Many of these considerations are common to any trending or other data probing interface (such as BlogPulse).

While a lot of reasonably visible copy has been written about the opportunities presented by the data set - the potential to understand trends in our culture and linguistics - this enthusiastic data geekery is somewhat lacking in data diligence. The original article in Science, for example, doesn't describe the data in the most basic terms.

At the very least, the data needs to be described in terms of design, accuracy and bias.

By data design I mean the intentions of the data. These intentions are somewhat exposed in the interface (where one can choose from things like 'American English', 'British English', etc.). I'd love to understand the rationale behind some of the corpora - e.g. English 1 Million - and the reason for missing corpora (we have English Fiction but not English Non-Fiction).

The accuracy of the data, with respect to the design, can at least be considered in terms of the current specifications. How accurate are the years associated with the articles? How accurate is the origin of publication? In addition, as Google points out, the accuracy of the OCR is also of great interest, especially for older texts (Danny Sullivan has an interesting post on this topic).

Finally, given any data designed along a set of dimensions, one can always take another set of dimensions and see how they are distributed and correlated - if at all. For example, what is the mixture of fiction and non-fiction in the English corpus? What is the distribution of topics? Are these representative with respect to historically accurate accounts of linguistic and cultural shifts (e.g. the introduction of the novel, the impact of the Enlightenment on the mixture of fiction and non-fiction)? What is the sampling from different publishing houses, and is that representative of the number of books, or the number of copies sold? This last point is intriguing - does a book with 1 million copies in circulation have more 'culturonomic' impact than a book with only a single copy out there?

While the data sets are clearly labeled as 'American English' and 'British English' the books in those collections are not actually classified as such. Rather they are defined by their country of publication. With this in mind, how do we interpret the color v colour graph from my earlier post? As Natalie pointed out in an email, the trend in 'British English' of the difference between these terms could be described either by an underlying cultural shift towards the American spelling, or by a change in the ratio of American books published in the UK without editorial 'translations'.

Searching for foreign terms in certain languages brings up hits for the foreign language (e.g. 'because' in the Spanish corpus, 'pourquoi' in the English corpora).

Regarding the English Fiction corpus, I was surprised to see mentions of figures and tables in works of fiction.

Drilling down on these in the interface surfaces what are clearly non-fiction publications (but it is not clear if this search is filtered by the various corpora visualized in the ngram interface). It is also important to bear in mind when looking at these anomalies the volume of hits. Here we are seeing very small fractions of the overall corpus containing what look like terms indicating false positives.

Another subtle, but easily missed (I missed it!) aspect of the interface is that it is case sensitive. This allows us to do interesting queries like 'however' versus 'However'.

How do we interpret this? The most obvious interpretation might be that 'however' at the beginning of a sentence is becoming more frequent. We could also conclude that 'however' in general is becoming more frequent (imagine if we could combine the lines). Alternatively, it could mean that sentence length in the corpus is shifting. Given that we don't know the exact cultural mix of the 'British English' corpus, it could be somehow related to the mixture of American and British content. Finally, it could be due to the mix of fiction and non-fiction. Interestingly, the 'American English' corpus has quite a different signal.
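The interface can't combine the two lines, but if we had the raw per-year counts behind them, combining would be trivial. A minimal sketch (the counts in any real use would come from the downloadable ngram data; the function just illustrates the arithmetic):

```python
# Combine counts for a lowercase form and its capitalized form.
# For each year, report the total occurrences and the capitalized
# (roughly: sentence-initial) share of that total.
def combine(lower_counts, capital_counts):
    """Map year -> (total occurrences, capitalized share)."""
    out = {}
    for year in lower_counts:
        lo = lower_counts[year]
        cap = capital_counts.get(year, 0)
        total = lo + cap
        out[year] = (total, cap / total if total else 0.0)
    return out
```

The total addresses the "is 'however' in general becoming more frequent?" question, while the capitalized share speaks to the sentence-initial interpretation; the two can move independently.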

When investigating temporal data, it is always interesting to try to discover things that don't change over time. What words would we expect to be relatively stable? From a simple initial probing, it seems like numbers and days of the week are reasonably stable. In looking at this, I did find that certain colours come and go in a very correlated pattern.

Overall, I find this to be a hugely exciting project. I'm disappointed in the general lack of analysis given to the data set before jumping to conclusions, but perhaps this is more a reflection of the blogosphere and the quality of writing. I'd love to see a more in depth analysis of the corpora provided by the team that wrote the Science article.

December 17, 2010

I love this new feature in Google's book search product which allows you to look at the time series trends for terms according to the publication dates of books. The example below shows the trend for the tokens 'colour' and 'color'.

This type of statistical analysis brings up lots of questions, simultaneously about the occurrence of terms in books in general, and the distribution of books in Google's collection. Does it show a decline in the ratio of British to American publications? or a decline in the British spelling of colour? or a bias in the corpus towards recent American publications and earlier British publications? Hard to say, and interesting to ask if probing only via this tool one could find out.

Update: I'm actually quite serious in being keen on understanding both the distribution of terms in our language and the nature of the collection. While this article on ReadWriteWeb rightly celebrates the insights that this data set can bring, it lacks any questioning of how representative the underlying data set is.

January 24, 2010

As I tweeted today, I noticed something wrong with the Google landing page I was getting: the advanced search link mistakenly links back to www.google.com, not the advanced search page. Thinking it through, I thought this might be due to flighting a new UI.

January 08, 2010

The blogosphere is abuzz with reports of Google's Near Me Now feature which provides mobile searchers with a very quick route to local search results organized by category (coffee shop, ATM, what have you). Going for the bus this morning, I thought I would give it a try, so I enabled the various pieces on my iPhone so that Google could locate me, then asked for near-me-now ATMs and Banks. The results I got back on the SERP looked pretty reasonable. I was impressed to see that they got the nesting of some establishments correct (there is a bank inside the Whole Foods for example). Sweet.

The UI is very nice - it shows up on the main mobile entry page.

But - then I hit the 'map all results' button. What I found was a set of results that were spread not only over the US (hits in multiple states), but also across Europe (Denmark and France) and Australia. Being a forgiving person by nature, I tried again, this time from the bus and for coffee shops, not banks. Again, I found results on the map both inside the US and outside it.

Now, readers of this blog will be familiar with my style of opinion-disguised-as-data-driven-analysis, but on this point I'm confounded. My first 2 interactions with this new feature showed serious problems. In addition, and this is what really gets me, all the link love that other blogs are giving to this feature (at least those exposed by TechMeme) doesn't seem to have cottoned on to this issue.

To be somewhat objective, I just ran the ATM query again from my office and found these results on the map:

December 22, 2009

I was walking down Denny Way in Seattle and turned onto Broadway. Whoosh - thanks to Google's teleport feature I was zapped over to Broadway in Renton! How cool is that!

Actually, what I'm talking about is the difficulty of interpreting geography properly. If you enter Google's street view mode in their mapping product at the crossroads of Broadway and Denny in Seattle you will find that while the imagery is correct (if a little dated), the addresses that are associated with these locations change from Seattle to Renton as you turn the corner.

December 10, 2009

A new-to-me feature in Google is a link which appears when one searches for a local entity (e.g. a business) and asks the searcher to confirm or correct the entity data.

A search for Elliott Bay Books produces this:

Clicking on the ‘Is this accurate?’ link brings up the ability to confirm or ‘cancel’. Cancel here means to drop out of the interaction, so essentially all you can do is confirm the data.

This only appears to happen when the result is at the top of the SERP (i.e. not when the local business results appear interleaved with the organic blue-link results). This interaction appears to surface when the business information is presented in isolation with a map (i.e. not when there are multiple locations positioned on the map).

This post on Search Engine Roundtable suggests that this feature surfaced late October.

Reading the SER post forced me to go back and re-read the interaction, which I now see asks you to confirm 'This address, phone number, map or business info is not accurate.' Confusing. You are asked to confirm that the data is incorrect.