At its best, the on-line version of a newspaper should learn from the information I'm giving it -- what I've read, who I am and what I like -- to automatically send me stories and photos that will interest me.

Then, Eric described how newspapers could make money using personalized advertising:

Imagine a magazine online that knew everything about you, knew what you had read, allowed you to go deep into a subject and also showed you things... that are serendipit[ous] ... popular ... highly targetable ... [and] highly advertisable. Ultimately, money will be made.

Finally, Eric claimed Google has a moral duty to help newspapers succeed:

Google sees itself as trying to make the world a better place. And our values are that more information is positive -- transparency. And the historic role of the press was to provide transparency, from Watergate on and so forth. So we really do have a moral responsibility to help solve this problem.

Well-funded, targeted professionally managed investigative journalism is a necessary precondition in my view to a functioning democracy ... That's what we worry about ... There [must be] enough revenue that ... the newspaper [can] fulfill its mission.

Eric's words come at a time when, as the New York Times reports, newspapers are cratering, with "revenue down 16.6 percent last year and about 28 percent so far this year."

Update: Five weeks later, Eric Schmidt, in the WSJ, imagines a newspaper that "knows who I am, what I like, and what I have already read" and that makes sure that "like the news I am reading, the ads are tailored just for me" instead of being "static pitches for products I'd never use." He also criticizes newspapers for treating readers "as a stranger ... every time [they] return."

Wednesday, October 21, 2009

Google Fellow Jeff Dean gave a keynote talk at LADIS 2009 on "Designs, Lessons and Advice from Building Large Distributed Systems". Slides (PDF) are available.

Some of this talk is similar to Jeff's past talks but with updated numbers. Let me highlight a few things that stood out:

A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and "the entire [written] works of humankind, from the beginning of recorded history, in all languages" has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world's written information on disk, and still have plenty of room for logs and other data sets. Certainly no shortage of storage at Google.

Jeff says, "Things will crash. Deal with it!" He then notes that Google's datacenter experience is that, in just one year, 1-5% of disks fail, 2-4% of servers fail, and each machine can be expected to crash at least twice. Worse, as Jeff notes briefly in this talk and expanded on in other talks, some of the servers can have slowdowns and other soft failure modes, so you need to track not just up/down states but whether the performance of the server is up to the norm. As he has said before, Jeff suggests adding plenty of monitoring, debugging, and status hooks into your systems so that, "if your system is slow or misbehaving" you can quickly figure out why and recover. From the application side, Jeff suggests apps should always "do something reasonable even if it is not all right" on a failure because it is "better to give users limited functionality than an error page."

Jeff emphasizes the importance of back of the envelope calculations on performance, "the ability to estimate the performance of a system design without actually having to build it." To help with this, on slide 24, Jeff provides "numbers everyone should know" with estimates of times to access data locally from cache, memory, or disk and remotely across the network. On the next slide, he walks through an example of estimating the time to render a page with 30 thumbnail images under several design options. Jeff stresses the importance of having an at least high-level understanding of the operation of the performance of every major system you touch, saying, "If you don't know what's going on, you can't do decent back-of-the-envelope calculations!" and later adding, "Think about how much data you're shuffling around."

Jeff makes an insightful point that, when designing for scale, you should design for expected load, ensure it still works at x10, but don't worry about scaling to x100. The problem here is that x100 scale usually calls for a different and usually more complicated solution than what you would implement for x1; a x100 solution can be unnecessary, wasteful, slower to implement, and have worse performance at a x1 load. I would add that you learn a lot about where the bottlenecks will be at x100 scale when you are running at x10 scale, so it often is better to start simpler, learn, then redesign rather than jumping into a more complicated solution that might be a poor match for the actual load patterns.

The talk covers BigTable, which was discussed in previous talks but now has some statistics updated, and then goes on to talk about a new storage and computation system called Spanner. Spanner apparently automatically moves and replicates data based on usage patterns, optimizes the resources of the entire cluster, uses a hierarchical directory structure, allows fine-grained control of access restrictions and replication on the data, and supports distributed transactions for applications that need it (and can tolerate the performance hit). I have to say, the automatic replication of data based on usage sounds particularly cool; it has long bothered me that most of these data storage systems create three copies for all data rather than automatically creating more than three copies of frequently accessed head data (such as the last week's worth of query logs) and then disposing of the extra replicas when they are no longer in demand. Jeff says they want Spanner to scale to 10M machines and an exabyte (1k petabytes) of data, so it doesn't look like Google plans on cutting their data center growth or hardware spend any time soon.

Data center guru James Hamilton was at the LADIS 2009 talk and posted detailed notes. Both James' notes and Jeff's slides (PDF) are worth reviewing.

Monday, October 19, 2009

I don't know much about analyzing music streams to find similar music, which is part of why I much enjoyed reading "Content-Based Music Information Retrieval" (PDF). It is a great survey of the techniques used, helpfully points to a few available tools, and gives several examples of interesting research projects and commercial applications.

Some extended excerpts:

At present, the most common method of accessing music is through textual metadata .... [such as] artist, album ... track title ... mood ... genre ... [and] style .... but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for.

For example ... Shazam ... can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar ... Nayio ... allows one to sing a query and attempts to identify the work .... [In] Musicream ... icons representing pieces flow one after another ... [and] by dragging a disc in the flow, the user can easily pick out other similar pieces .... MusicRainbow ... [determines] similarity between artists ... computed from the audio-based similarity between music pieces ... [and] the artists are then summarized with word labels extracted from web pages related to the artists .... SoundBite ... uses a structural segmentation [of music tracks] to generate representative thumbnails for [recommendations] and search.

An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of music .... Surprisingly, it is not only difficult to extract melody from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony .... [Instead] low-level audio features and their aggregate representations [often] are used as the first stage ... to obtain a high-level representation of music.

Estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter ... [lets us] find musical pieces having similar tempo without using any metadata .... The basic approach ... is to detect onset times and use them as cues ... [and] maintain multiple hypotheses ... [in] ambiguous situations.

Melody forms the core of Western music and is a strong indicator for the identity of a musical piece ... Estimated melody ... [allows] retrieval based on similar singing voice timbres ... classification based on melodic similarities ... and query by humming .... Melody and bass lines are represented as a continuous temporal-trajectory representation of fundamental frequency (F0, perceived as pitch) or a series of musical notes .... [for] the most predominant harmonic structure ... within an intentionally limited frequency range.

Audio fingerprinting systems ... seek to identify specific recordings in new contexts ... to [for example] normalize large music content databases so that a plethora of versions of the same recording are not included in a user search and to relate user recommendation data to all versions of a source recording including radio edits, instrumental, remixes, and extended mix versions ... [Another example] is apocrypha ... [where] works are falsely attributed to an artist ... [possibly by an adversary after] some degree of signal transformation and distortion ... Audio shingling ... [of] features ... [for] sequences of 1 to 30 seconds duration ... [using] LSH [is often] employed in real-world systems.

The paper goes into much detail on these topics as well as covering other areas such as chord and key recognition, chorus detection, aligning melody and lyrics (for Karaoke), approximate string matching techniques for symbolic music data (such as matching noisy melody scores), and difficulties such as polyphonic music or scaling to massive music databases. There also is a nice pointer to publicly available tools for playing with these techniques if you are so inclined.

By the way, for a look at an alternative to these kinds of automated analyses of music content, don't miss this last Sunday's New York Times Magazine section article, "The Song Decoders", describing Pandora's effort to manually add fine-grained mood, genre, and style categories to songs and articles and then use it for finding similar music.

Friday, October 16, 2009

When Google started penalising a site's search engine rankings for having ... link farms ... ­[then] people engaged in sabotage: they built link farms and left blog comment spam to their competitors' sites.

The same sort of thing is happening on Yahoo Answers. Initially, companies would leave answers pushing their products, but Yahoo started policing this. So people have written bots to report abuse on all their competitors. There are Facebook bots doing the same sort of thing.

Last month, Google introduced Sidewiki, a browser feature that lets you read and post comments on virtually any webpage ... I'm sure Google has sophisticated systems ready to detect commercial interests that try to take advantage of the system, but are they ready to deal with commercial interests that try to frame their competitors?

This is the arms race. Build a detection system, and the bad guys try to frame someone else. Build a detection system to detect framing, and the bad guys try to frame someone else framing someone else. Build a detection system to detect framing of framing, and well, there's no end, really.

Commercial speech is on the internet to stay; we can only hope that they don't pollute the social systems we use so badly that they're no longer useful.

An example that Bruce did not mention is shill reviews on Amazon and elsewhere, something that appears to have become quite a problem nowadays. The most egregious example of this is paying people using Amazon MTurk to write reviews, as CMU professor Luis voh Ahn detailed a few months ago.

Some of the spam can be detected using algorithms, looking for atypical behaviors in text or actions, and using community feedback, but even community feedback can be manipulated. It is common, for example, to see negative reviews get a lot of "not helpful" votes on Amazon.com, which, at least in some cases, appears to be the work of people who might gain from suppressing those reviews. An arms race indeed.

An alternative to detection is to go after the incentive to spam, trying to reduce the reward from spamming. The winner-takes-all effect of search engine optimization -- where being the top result for a query has enormous value because everyone sees it -- could be countered, for example, by showing different results to different people. For more on that, please see my old July 2006 post, "Combating web spam with personalization".

Monday, October 12, 2009

Nick Carr writes of our communication hell, starting with "the ring of your phone would butt into whatever you happened to be doing at that moment" and left you "no choice but to respond immediately", going through false saviors such as asynchronous voice mail and e-mail, and leading to "the approaching [Google] Wave [which] promises us the best of both worlds: the realtime immediacy of the phone call with the easy broadcasting capacity of email. Which is also, as we'll no doubt come to discover, the worst of both worlds."

In all of these communications, the problem is not so much the difference of synchronous or asynchronous but the lack of priority. Phone calls, voice mails, e-mails, and text messages, all of these appear to us sorted by date. Reverse chronological order works well as a sort order when either the list is short or when we only care about the tip of the stream. Otherwise, it rapidly becomes overwhelming.

When your communication becomes overwhelming, when there is just too much to look at it all, you need a way to prioritize. You need a relevance rank for communication.

The closest I have seen to this still is an ancient project out of Microsoft Research called Priorities. This project and work that followed ([1][2]) tried to automate the process we all currently do manually of prioritizing incoming communication. The idea was to look at who we talk to, what we talk to them about, and add in information such as overall social capital to rank order the chatter by usefulness and importance.

Going one step further, not only do we need a relevance rank for communication, but also for all the information streaming at us in our daily lives. We need personalized rankers for news, communications, events, and shopping. Information streams need to be transformed from a flood of noise to a trickle of relevance. Information overload must be tamed.