Tuesday, December 30, 2008

Danny Sullivan is an insightful writer, long-time watcher of the search industry, and founder of Search Engine Watch, Search Engine Land, and the popular Search Engine Strategies (SES) conference. His thoughts are well worth reading.

Monday, December 29, 2008

Amazon CTO Werner Vogels posted a copy of his recent ACM Queue article, "Eventually Consistent - Revisited". It is a nice overview of the trade-offs in large-scale distributed databases and focuses on availability and consistency.

An extended excerpt:

Database systems of the late '70s ... [tried] to achieve distribution transparency -- that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.

In the mid-'90s, with the rise of larger Internet systems ... people began to consider the idea that availability was perhaps the most important property ... but they were struggling with what it should be traded off against. Eric Brewer ... presented the CAP theorem, which states that of three properties of shared-data systems -- data consistency, system availability, and tolerance to network partition -- only two can be achieved at any given time .... Relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write ... If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write ... There is a range of applications that can handle slightly stale data, and they are served well under this model.

[In] weak consistency ... The system does not guarantee that subsequent accesses will return the updated value. Eventual consistency ... is a specific form of weak consistency [where] the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value ... The most popular system that implements eventual consistency is DNS (Domain Name System).

[In] read-your-writes [eventual] consistency ... [a] process ... after it has updated a data item, always accesses the updated value ... Session [eventual] consistency ... is a practical version of [read-your-writes consistency] ... where ... as long as [a] session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.

As Werner points out, session consistency is good enough for many web applications. When I make a change to the database, I should see it on subsequent reads, but anyone else who looks does not necessarily need to see the latest value right away. And most apps are happy if this promise is violated in rare cases, as long as we acknowledge it explicitly by terminating the session; that way, the app can establish a new session and either wait for eventual consistency of any past writes or take the risk of a consistency violation.

Session consistency also has the advantage of being easy to implement. As long as a client reads and writes from the same replica in the cluster for the duration of the session, you have session consistency. If that node goes down, you terminate the session and force the client to start a new session on a replica that is up.
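As a toy sketch of that scheme (the names and structure here are mine, not from Werner's article): a session object pins itself to one replica, fails when that replica dies so the caller has to open a new session, and a crude anti-entropy step propagates writes lazily.

```python
import random

class Cluster:
    """Toy cluster: each replica keeps its own copy of the data.
    Writes land only on the session's pinned replica and propagate
    lazily via sync()."""
    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.up = [True] * num_replicas

    def sync(self):
        # Crude anti-entropy: merge every replica's data everywhere.
        merged = {}
        for replica in self.replicas:
            merged.update(replica)
        for i in range(len(self.replicas)):
            if self.up[i]:
                self.replicas[i] = dict(merged)

class Session:
    """Session consistency: pin all reads and writes to one replica."""
    def __init__(self, cluster):
        self.cluster = cluster
        live = [i for i, ok in enumerate(cluster.up) if ok]
        self.replica_id = random.choice(live)

    def _replica(self):
        if not self.cluster.up[self.replica_id]:
            # The pinned replica died, so the guarantee is gone: fail
            # the session and make the caller open a fresh one.
            raise ConnectionError("replica down; start a new session")
        return self.cluster.replicas[self.replica_id]

    def write(self, key, value):
        self._replica()[key] = value

    def read(self, key):
        return self._replica().get(key)
```

Within one session, read-your-writes holds trivially because every read hits the replica that took the write; a second session on a different replica may see stale data until sync() runs.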

Werner did not talk about it, but some implementations of session consistency can cause headaches when many clients update the same data and care what the previous values were. The simplest example is a counter: two clients with sessions on different replicas both try to increment a value i and end up with i+1 in the database rather than i+2. However, there are ways to deal with this kind of data. For example, just for the data that needs it, we can use multiversioning while sending writes to all replicas, or force all read-write sessions to the same replica. Moreover, a surprisingly large amount of application data does not have this issue because there is only one writer, there are only inserts and deletes rather than updates, or the updates do not depend on previous values.
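To make the counter example concrete, here is a toy illustration (the variable names are mine), along with one hedged fix beyond the multiversioning and routing options above: keeping a count per replica and summing on read, in the spirit of replicated counters.

```python
# Two clients, pinned to different replicas, both increment i.
replica_a = {"i": 0}
replica_b = {"i": 0}

replica_a["i"] += 1   # client 1 reads 0, writes back 1
replica_b["i"] += 1   # client 2 reads 0, writes back 1

# Naive last-writer-wins reconciliation keeps i+1, not i+2.
merged_lww = max(replica_a["i"], replica_b["i"])   # 1: an update is lost

# One fix, just for the data that needs it: store a count per replica
# and sum on read, so concurrent increments cannot overwrite each other.
per_replica_counts = {"a": 1, "b": 1}
merged_sum = sum(per_replica_counts.values())      # 2: both survive
```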

Please see also Werner's older post, "Amazon's Dynamo". The full version of their SOSP 2007 paper, linked at the bottom of his post, describes the data storage system that apparently is behind Amazon S3 and Amazon's shopping cart.

[An] attacker may cleverly decorate portions of such a third-party UI to make it appear as if they belong to his site instead, and then trick his visitors into interacting with this mashup. If successful, clicks would be directed to the attacked domain, rather than attacker's page -- and may result in undesirable and unintentional actions being taken in the context of victim's account.

[For example,] the attacker may also opt for showing the entire UI of the targeted application in a large <IFRAME>, but then cover portions of this container with opaque <DIV> or <IFRAME> elements placed on top ... [Or] the attacker may simply opt for hiding the target UI underneath his own, and reveal it only milliseconds before the anticipated user click, not giving the victim enough time to notice the switch, or react in any way.

Tuesday, December 09, 2008

Yahoo ... is planning to .... launch in beta relatively soon with half a dozen small applications running in a sidebar inside the Yahoo mail client (Evite is one of the services that is said to be building a nano-app for this new Yahoo Mail-as-a-platform). Users' address books would act as a social graph, essentially turning Yahoo Mail into the basis of a whole new social networking experience.

The only way for Yahoo or Google to challenge the social networking incumbents like Facebook [is] to leverage their email infrastructure ... With relationship buckets pre-defined by the address book, which contains everything from web-based addresses to geo-local data (physical address) to mobile numbers, email clients are already rich with the very data set that Facebook [has].

I liked this idea back when Om talked about it last year and still like it now.

The address book is essentially a social network. Not only does it have friend relationships, but we can also determine the importance of the relationships, the weights of the social connections. Oddly, surprisingly little has been done with that information in e-mail clients.

Perhaps it is fear of breaking something that so many people use and depend on, but e-mail clients have largely stood still over the last several years while social networking applications nibbled away at their market and mind share. What experimentation has occurred seems stuck in startups and research (e.g. Xobni or Inner Circle).

Meanwhile, there seems to be a trend where social networks are creeping toward e-mail clients. For example, Facebook has added limited e-mail functionality within their site as well as Twitter-like microblogging. These features seem intended to make communication dwell within Facebook.com rather than inside e-mail and IM clients.

I admit I am a bit outside of the demographic for heavy social network users, but, from what I can tell, the primary use of social networks is for communication, perhaps with the twist of being focused primarily on dating and entertainment. It makes me wonder if social networks really are a different app from communication apps like e-mail clients or just a different facet of the same idea.

If they are nearly the same, I would expect Yahoo will do much better from implementing social networking features in Yahoo Mail than from attempting to create a new social network such as Yahoo 360. Something similar probably could be said for Orkut and GMail.

Monday, December 01, 2008

Deepak Agarwal along with many others from Yahoo Research have a paper at the upcoming NIPS 2008 conference, "Online Models for Content Optimization", with a fun peek inside a system at Yahoo that automatically tests and optimizes which content to show to their readers.

It is not made entirely clear which pages at Yahoo use the system, but the paper says that it is "deployed on a major Internet portal and selects articles to serve to hundreds of millions of user visits per day."

The system picks which article to show in a slot from a pool of 16 candidates; the pool is picked by editors and changes rapidly. The system seeks to maximize the clickthrough rate in the slot. The problem is made more difficult because the clickthrough rate on a given article changes rapidly as the article ages and as the audience coming to Yahoo shifts over the course of a day, which means the system needs to adapt quickly to new information.

The paper describes a few variations of algorithms that do explore/exploit by showing the article that performed best recently while constantly testing the other articles to see if they might perform better.
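The paper's actual algorithms are more sophisticated than this (and track time-varying clickthrough rates), but as a rough sketch of the explore/exploit idea, an epsilon-greedy picker might look like the following; the function names and the [clicks, views] representation are my own.

```python
import random

def pick_article(stats, epsilon=0.1):
    """Epsilon-greedy explore/exploit: usually show the article with
    the best observed CTR, but keep testing the others a small
    fraction of the time. stats maps article id -> [clicks, views]."""
    if random.random() < epsilon:
        return random.choice(list(stats))                      # explore
    # Exploit: highest observed clickthrough rate so far.
    return max(stats, key=lambda a: stats[a][0] / max(stats[a][1], 1))

def record(stats, article, clicked):
    """Update an article's click and view counts after an impression."""
    stats[article][1] += 1
    stats[article][0] += int(clicked)
```

In a simulation with one clearly better article, the exploit branch quickly locks onto it, while the exploration traffic keeps the estimates for the other candidates fresh as their true rates drift.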

The result was that the CTR increased by 30-60% over editors manually selecting the content that was shown. Curiously, their attempt to show different content to different user segments (a coarse-grained version of personalization) did not generate additional gains, but they say this is almost certainly due to the very small pool of candidate articles (only sixteen articles) from which the algorithm was allowed to pick.

One amusing tidbit in the paper is how they describe the culture clash that occurred between maintaining the control the editors were used to and giving the algorithms freedom to pick the content users really seem to want.

I remember similar issues at Amazon way back when we first started using algorithms to pick content rather than editors. It is hard for editors to give up control even to the collective voice of people voting on what they want. While I always have been sympathetic to the need for editorial voice, if it is forcing content on users that they do not actually want, it is important to understand its full cost.