September 26, 2014

Sometimes a title for a blog post suggests itself to me which seems so self-contained that it takes real effort to actually write the post ('Machine Intelligence, not Machine Learning is the Next Big Thing' is another in this line). The idea behind the (or a) Longform Manifesto is as follows. I have become aware of late of the sense of deterioration that is associated with the mobile 'revolution' and the info snacking, casual gaming and interrupt-driven lifestyle that it has entailed. The behaviours are perfectly illustrated in this scene from Portlandia:

With a daughter who has now come of technological age (she has a cell phone) it has become important to me to remind myself what content consumption was like before this mobile mess appeared.

We read books, we watched movies, we listened to music. But, of course, we haven't stopped doing that. Rather, we have started all this other stuff, and the problem is that this is influencing how we approach longform content. I find myself watching bits of movies, or listening to bits of music or reading parts of essays.

The Longform Manifesto, through the definition of longform content and the discipline and commitment needed to consume it as it was meant to be consumed, helps to dilute and remove the behaviour-degrading influence of mobile technology. Someone should write it.

July 12, 2014

While Google has been doing a great job of their front page animations (today's is very nice, illustrating how Brazil and The Netherlands are on their way to Russia for 2018), Bing appears to be far more attentive to actually answering questions about the competition. For example:

Compared to Bing's

Google's answer brings up some interesting news articles, but Bing brings up stats on the teams and even a prediction of who will win (Cortana - which is driving these predictions - has been doing a perfect job of predicting game outcomes).

May 04, 2014

The rationale behind mining business data directly from the business's own website is that the business has a clear economic motivation to ensure that the data is up to date. If you own a restaurant that changes location, and your website still publishes the former address, those potential customers who visit your site will not be enjoying your delicious offerings.

For the web mining proposition to work, it is important to firstly know that you have in your hand a genuine business website and secondly, to have excellent extraction and inference technology to pull the required data from the HTML.

The first requirement can get pretty murky. There are sites that could easily be mistaken for a business site but which are, in fact, other types of legitimate sites (such as a blog with the contact information of the blogger). Unfortunately, there are also sites which are essentially fake storefronts for the actual business in question. The most obvious are those sites which are simply parked domains with some spam links on them. A domain parker might snap up the domain yourrestaurantseattle.com hoping that web surfers looking for yourrestaurant will land there and give them some clicks. An emerging trend is a far more sophisticated site which (through some amount of templating but also specific editorial attention) aims to look like the actual site of the business. The motivation for these sites is the burgeoning third-party restaurant delivery service industry - for which GrubHub might be the poster child.

I can't find the email address ghwebsites at grubhub.com on the GrubHub website, but a site search on Google over the Web Analyzer domain (site:wa-com.com "ghwebsite at grubhub.com") produces 13,200 results. Dipping into these brings up further examples of fake sites made with the same template as that for the 1947 Tavern phasmid.

[In the above, I substitute ' at ' for the '@' to avoid Typepad's automated obfuscation of email addresses in posts.]

I don't believe these sites are particularly malicious - most likely, they bring additional customers to the business even if it is through deception. They do, however, pose a problem for web mining systems. There is less pressure on GrubHub to keep the exact details of the business up to date. In addition, when GrubHub goes belly up, these sites will linger.
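One way a web mining system might flag such sites is to compare the domain of any contact email on the page with the domain of the site itself. The following is only a minimal sketch under assumptions: the `THIRD_PARTY_DOMAINS` list, the function names and the regular expression are all hypothetical, and the pattern deliberately accepts the ' at ' form used in this post as well as '@' (so it will also produce false positives on ordinary prose such as "look at example.com").

```python
import re
from urllib.parse import urlparse

# Hypothetical list of services known to publish storefront sites for businesses.
THIRD_PARTY_DOMAINS = {"grubhub.com"}

# Matches "name@domain" and the "name at domain" form (an assumption for this sketch).
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)\s*(?:@|\bat\b)\s*([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)",
    re.IGNORECASE,
)


def contact_domains(html: str) -> set[str]:
    """Return the lower-cased domains of all email-like strings in the page."""
    return {domain.lower() for _local, domain in EMAIL_RE.findall(html)}


def belongs_to(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def looks_like_third_party_storefront(site_url: str, html: str) -> bool:
    """Flag pages whose contact email belongs to a third party, not the site."""
    site_host = (urlparse(site_url).hostname or "").lower()
    for email_domain in contact_domains(html):
        for third_party in THIRD_PARTY_DOMAINS:
            if belongs_to(email_domain, third_party) and not belongs_to(site_host, third_party):
                return True
    return False


page = "<p>Questions? Email ghwebsites at grubhub.com</p>"
print(looks_like_third_party_storefront("http://yourrestaurantseattle.com/", page))
```

A flag like this says nothing about whether the address or phone number on the page is correct; it only suggests that the business itself may not be the one maintaining the site, which is the assumption the mining approach depends on.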

January 19, 2014

Briefly - Hopper is something new in the travel / local space. In their own words:

What if you could plan an amazing trip based on a vague idea — like “spring surfing in California” or “Mediterranean cruise”? What if logistical information popped up right when you needed it, so you wouldn't have to spend hours on research? This is our vision: to make planning a trip an effortless extension of discovering and exploring new places.

We spent several years experimenting with different tools, technology and algorithms to collect, organize and manage massive amounts of travel data. The result is a new kind of trip planning engine, powered by the world's largest structured database of travel information.

I've not remotely explored the site, but I see it as part of a trend which involves rich exploration experiences including plenty of imagery, the social aspect of local and specifically travel combined with smarts involving itinerary planning and travel booking. There are some similarities with the recently acquired RouteSet demo from PerceptLabs and also with the geo-microblog site Findery.

Visually, the exploration of a place on Hopper looks like this:

Which is to say - visually very rich with images provided (I assume) by the community. This wave of modern location products makes one ask the question - how important is the map for engagement in local search?

Right now, the site has some issues. As a signed in user I'm told to browse others' experiences and 'save' what I find interesting. However, there is no mention of a 'save' action on any of the posts on the site. Consequently, it is a little hard right now to give a write up of how the site works. I do note posts have a reference to a source. Does Hopper crawl these sources, or do the users cross-post?

Update: regarding saving - a search on Google for 'site:hopper.com save -near' surfaces pages which contain the word save, like this one: http://www.hopper.com/list/cities/-378 . However, according to Chrome, the page itself has no instance of the string 'save' on it. Looking at the source for the page shows that there is actually a save button and other mechanisms in place. Not sure what is amiss here. Testing on IE also fails to surface any visible save functionality.

Update: I figured out the save mechanism. There is a star on each entity. Hitting the star *saves* the entity. This is a pretty poor design. Stars are generally used in interfaces associated with the term 'favourite'. Telling users they need to 'save' entities, then using a different metaphor for this action will, if I am any sort of average user, result in a lot of lost opportunity for engaging users.

December 29, 2013

Briefly - Wakako gave me (actually us) a FitBit for Christmas. This is a great product if you are (like me) motivated by data to take action. While I appreciate the device design (small but functional), I really like the thought that has gone into the data presentation in the dashboard. The displays of the key variables are clean and yet subtle enough to reward interaction by revealing additional dimensions.

December 28, 2013

Information is Beautiful is a thought-provoking labour of love by one of the first true data journalists, David McCandless. It is a simply structured collection of graphical interpretations of a variety of interesting statistics, factoids and opinions. It is compelling in its ability to provoke exclamations of surprise at the relationships between facts (e.g. the financial crisis costing us almost four times more than the expected total cost of the west's adventurism in Iraq and Afghanistan) as well as generating respect for the creativity and design that has gone into presenting the information.

That being said, the book also illustrates the very tricky position of a data journalist (or whatever we eventually call those individuals who render 'information' visually). Visualization of data in the form of graphics and the expression of facts, opinions, processes, etc. in the form of information visualizations is, essentially, a new language. As consumers of this new language, we have to place a large amount of trust in the translator.

As is appropriate for a book aimed more at the coffee table than the academic library, Information is Beautiful comes with no explanation of the graphical idioms used. Nor does it come with any summary of conclusions or discussion of the implications drawn from the data or the visualization. It is more like the glossy book of fabulous beaches from around the world which contains little or no indication of where these places are or what is just out of sight, or lurking behind the scene. This is, in my opinion, a grave oversight.

For example, the first piece presents a number of types of spending (e.g. defence budgets, foreign aid payments, etc.) and compares them - via the cleverly engineered positioning of a page turn - with the cost of 'the financial crisis' (which I assume is the most recent such event). Here the intended implication is clear - the financial crisis cost a lot more than all that other stuff that you think is costing us a lot. But what is the scope of all the other stuff? The defence budgets for the US, China, the UK, Saudi Arabia and India are presented - are these the largest budgets? If so, what percentage of all defence budgets do they represent? Are there other events which provide more context (e.g. other financial meltdowns, the 'cost' of a world war, the cost of other wars)? It is clear that the author has selected the variables being compared with the recession, but without knowledge of the selection criteria it is hard to know either the intended spin, or how meaningful the conclusion that the reader is led to might be.

The second graphic - an exploration of the values and opinions of left and right leaning political positions - suffers in a similar way from a lack of context. The graphic, for example, appears to make the statement 'right leaning governments don't interfere with [the] social lives [of their citizens]'. What are we to make of this? Is this the opinion of the author? Or is it somehow a statement derived from one of the sources quoted at the bottom of the image (wikipedia, britannica.com, etc.)? As there are a number of sources, is this a consensus or is it an amalgam of the information in these sources?

McCandless presents some statistics on the structure of rape reports, prosecutions and convictions in England and Wales. I've approximated the visualization below:

Overlapping circles evoke the common concept of the Venn diagram. However, here the semantics would appear to indicate that Prosecutions include rapes that are not reported, and that Convictions are exclusively obtained for non-reported rapes. I can't make sense of that.

McCandless often uses Google's Insights tool to make observations about the relative importance of various concepts. This type of analysis requires some amount of preparation for the reader. These graphs have no vertical axis label or units. Google labels the y-axis as 'interest over time' and provides a reasonable amount of explanation about the graphs and how to interpret them, including:

A downward trending line means that a search term's popularity is decreasing. It doesn't mean that the absolute, or total, number of searches for that term is decreasing.

In other words, a peak doesn't necessarily mean that there is more absolute interest - it could just as easily indicate that there is a reduced amount of interest in some other topic which therefore takes away mass from the denominator. Quite possibly the conclusions that can be drawn from these comparative time series are reasonable where there are differences between the trends (this may not be the case for compared series that show correlated peaks).
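The denominator effect is easy to see with made-up numbers. The sketch below is illustrative only: the counts are invented, the function name is hypothetical, and rescaling so that the peak equals 100 is an assumption about how such an index might be built rather than a description of Google's exact method.

```python
def relative_interest(term_counts, total_counts):
    """Share of all searches per period, rescaled so the largest share is 100."""
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * share / peak) for share in shares]


# Invented data: absolute searches for the term never change...
term_searches = [1000, 1000, 1000]
# ...but the total number of searches across all topics falls each period.
all_searches = [100_000, 50_000, 25_000]

print(relative_interest(term_searches, all_searches))  # [25, 50, 100]
```

The line climbs to a 'peak' even though not one additional person searched for the term, which is exactly why an unlabelled y-axis asks a lot of the reader.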

In terms of colour palette, McCandless is clearly from the Wired circa 2000 school - a school which embraces challenging colour schemes (such as white characters on a yellow background - see 'Lack of Conviction') and where the semantics of colours often trumps the contrast (for example, using a range of similar colours in a legend to a graphic leaving the reader to guess which blob goes with which meaning - see 'Most Successful Rock Bands').

Overall, this is a fascinating book. It has received popular coverage in the media and I'm sure it continues to sell well all over the globe. As a community of readers, we have to become data literate so that we can consume this type of content with the same critical eye as if we were reading the statements in dry text. It is not the information that is beautiful per se, it is the presentation of the information. I feel that this book would be far more useful if it contained a preface of some sort helping the reader to understand this new language, to educate them in the skills required to draw insights, but also to question the translation.

It is fascinating to see how many of these (almost all) are still in business and how rich their online experiences and product suites have become.

Now, there is another site to add to the list of data engines: Quandl. Quandl offers search over 8 million data sets. A search brings up a results page with a list of data sets, related topics and relevant sources. For example, a search for 'french unemployment' brings up the following:

From here, the user can drill down to a specific data set and get the usual interactions with time series graphs, downloading of data sets, etc. The graphing tool allows a number of modifications (e.g. raw data, % change, etc.).

There isn't much on the site about the history of the company, but the Wayback Machine tells me that the root URL was first archived in April 2012. WHOIS tells me the domain was registered in 2012.