I've chosen a power-of-2 closed table - collisions cause re-probing into the table ... I do not need any memory fencing or any locking [even on a resize]. I do need CAS [(Compare-and-swap)] when changing the table.

Early in the talk, I was wondering how Cliff handles deletes. As it turns out, he does not. Old deleted keys are only cleaned up on a resize, which works and nicely dodges delete nastiness.
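For a rough feel of the probe-and-CAS insert Cliff describes, here is a simplified Python sketch. Python has no hardware compare-and-swap on list slots, so it is simulated with a lock purely for illustration (the real implementation is Java on atomic arrays), and resizing and deletion are omitted, as are the memory-ordering subtleties that make the real thing hard:

```python
import threading

class LockFreeMapSketch:
    """Closed (open-addressing) power-of-2 hash table where inserts race
    via compare-and-swap instead of locks. A single-process illustration
    of the probe loop only; not Cliff Click's actual implementation."""

    EMPTY = object()

    def __init__(self, log2_size=4):
        self.mask = (1 << log2_size) - 1           # power-of-2 table size
        self.keys = [self.EMPTY] * (self.mask + 1)
        self.vals = [None] * (self.mask + 1)
        self._sim = threading.Lock()               # stands in for hardware CAS

    def _cas_key(self, i, expected, new):
        # Simulated compare-and-swap: install `new` in keys[i] only if the
        # slot still holds `expected`. Real code uses atomic intrinsics.
        with self._sim:
            if self.keys[i] is expected:
                self.keys[i] = new
                return True
            return False

    def put(self, key, value):
        i = hash(key) & self.mask
        for _ in range(2 * (self.mask + 1)):       # bounded; sketch omits resize
            k = self.keys[i]
            if k is self.EMPTY:
                if self._cas_key(i, self.EMPTY, key):
                    self.vals[i] = value
                    return
                continue                           # lost the race; re-read slot
            if k == key:
                self.vals[i] = value               # key already claimed; update
                return
            i = (i + 1) & self.mask                # collision: linear reprobe
        raise RuntimeError("table full (sketch omits resize)")

    def get(self, key):
        i = hash(key) & self.mask
        for _ in range(self.mask + 1):             # bounded probe
            k = self.keys[i]
            if k is self.EMPTY:
                return None
            if k == key:
                return self.vals[i]
            i = (i + 1) & self.mask
        return None
```

Note how a losing CAS simply re-reads the slot and retries, which is the whole trick: no thread ever blocks another.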

I seem to be bumping into a lot of interesting work on large scale multiprocessor systems lately. For example, "Evaluating MapReduce for Multi-core and Multiprocessor Systems" (PDF) shows some promising results for using MapReduce to simplify some parallel programming on a single box with 8+ CPUs (rather than on a cluster of boxes), at least for tasks that easily can be translated to a MapReduce problem.

There is also a Google engEdu talk on that MapReduce paper, but the video is hard to hear in parts. However, there were some interesting tidbits in the Q&A, including people questioning the small data sizes used for the tests (almost all under 1G) and whether the extraordinary speedup (~50x) reported on the ReverseIndex task was an artifact of the coding of the serial version of the algorithm rather than a true speedup.
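To give a feel for the programming model the paper evaluates, here is a minimal word count in the MapReduce style on a single shared-memory box, using Python threads. This is only a sketch of the idea, not the Phoenix runtime from the paper:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Map: emit (word, count) pairs for one shard of the input.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Reduce: merge two partial counts into one.
    a.update(b)
    return a

def mapreduce_wordcount(text, workers=4):
    # Split the input into roughly equal shards, one per worker,
    # run the map phase in parallel, then fold the partials together.
    words = text.split()
    n = max(1, len(words) // workers)
    shards = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    with ThreadPoolExecutor(workers) as ex:
        partials = list(ex.map(map_phase, shards))
    return reduce(reduce_phase, partials, Counter())
```

The appeal is that the programmer writes only the two small pure functions; the runtime owns the partitioning, scheduling, and merging.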

Finally, that Java memory model talk serves as good motivation for another talk, "Software Transactional Memory", which argues for considering the software transactional memory model -- allowing a series of reads/writes to be marked as one logical action and then using abort and retries to handle most contention -- over error-prone traditional locking. If you have never heard about software transactional memory before (I had not), skimming the Wikipedia page or watching the video might be worthwhile.
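To make the abort-and-retry idea concrete, here is a minimal, hypothetical STM sketch in Python: reads log the version they saw, commit validates those versions under a single commit point, and the whole action is re-run on conflict. A real STM is far more sophisticated; this only shows the shape of the model:

```python
import threading

class TVar:
    """A transactional variable: a value plus a version stamp."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # single global commit point, sketch only

def atomically(action):
    """Run action(read, write) as one logical step, retrying on conflict."""
    while True:
        read_log = {}    # TVar -> version observed
        write_log = {}   # TVar -> pending new value

        def read(tv):
            if tv in write_log:          # see our own pending writes
                return write_log[tv]
            read_log[tv] = tv.version
            return tv.value

        def write(tv, value):
            write_log[tv] = value

        result = action(read, write)
        with _commit_lock:
            # Validate: if nothing we read has changed, publish all writes.
            if all(tv.version == v for tv, v in read_log.items()):
                for tv, value in write_log.items():
                    tv.value = value
                    tv.version += 1
                return result
        # Conflict detected: abort and re-run the action.
```

The classic selling point is composability: a transfer between two accounts is just two reads and two writes inside one `atomically`, with no lock-ordering discipline to get wrong.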

An interesting story on TechDirt about how Amazon-owned Alexa took many of the good UI ideas developed by a site called Alexaholic (that was layered on top of and depended on Alexa data), then worked to shut Alexaholic down.

Amazon's Alexa unit, which tries (poorly, some might argue) to track web traffic, has been embroiled in a spat with the site Statsaholic, which until recently was called Alexaholic.

Statsaholic's strategy was to take Alexa's data and present it in a manner that's far more usable than the way Alexa presents it.

Amazon seemed to tolerate, or even encourage, Alexaholic, until it built all of Alexaholic's functionality into its own site, at which point it went on the attack. First it went after the company's domain name, Alexaholic.com, which was arguably infringing on Alexa's trademark. Then Amazon blocked off access to its graphs and data, effectively disabling the renamed Statsaholic.

Companies offer web services to get free ideas, exploit free R&D, and discover promising talent. That's why the APIs are crippled with restrictions like no more than N hits a day, no commercial use, and no uptime or quality guarantees.

They offer the APIs so people can build clever toys, the best of which the company will grab -- thank you very much -- and develop further on their own.

Monday, March 26, 2007

Sixteen months ago, there was an interesting rumor that Google was building data centers that fit inside of shipping containers.

Google hired a pair of very bright industrial designers to figure out how to cram the greatest number of CPUs, the most storage, memory and power support into a 20- or 40-foot [shipping container]. We're talking about 5000 Opteron processors and 3.5 petabytes of disk storage that can be dropped-off overnight by a tractor-trailer rig.

Now, it appears that Microsoft may be getting into the act. Windows Live Core architect James Hamilton wrote a paper, "Architecture for Modular Data Centers" (.doc), that shows considerable thought about how you might squeeze a data center into a shipping container.

Extended excerpts from the paper:

[We propose] to no longer build and ship single systems or even racks of systems. Instead, we ship macro-modules consisting of a thousand or more systems.

Each module is built in a 20-foot standard shipping container, configured, and burned in, and is delivered as a fully operational module with full power and networking in a ready to run no-service-required package. All that needs to be done upon delivery is provide power, networking, and chilled water.

Components are never serviced and the entire module just slowly degrades over time as more and more systems suffer non-recoverable hardware errors ... Software applications implement enough redundancy so that individual node failures don't negatively impact overall service availability ... At the end of its service life, the container is returned to the supplier for recycling.

This model brings several advantages: 1) on delivery systems don't need to be unpacked and racked, 2) during operation systems aren't serviced, and 3) at end-of-service-life, the entire unit is shipped back to the manufacturer for rebuild and recycling without requiring unracking & repackaging to ship.

A shipping container is a weatherproof housing for both computation and storage. A "data center," therefore, no longer needs to have the large rack rooms with raised floors that have been their defining feature for years ... The only requirement is a secured, fenced, paved area to place the containers around the central facilities building.

On-site hardware service can be expensive ... [we avoid] these costs ... Even more important ... are the errors avoided by not having service personnel in the data center ... human administrative error causes 20% to 50% of system outages.

The macro-module containers employ direct liquid cooling ... No space is required for human service or for high volume airflow. As a result, the system density can be much higher than is possible with conventional air-cooled racks .... High efficiency rack-level AC to DC rectifier/transformers [also may yield] significant power savings.

This architecture transforms data centers from static and costly behemoths into inexpensive and portable lightweights.

There also is a PowerPoint presentation (.ppt) that covers much of the same material as the paper.

Update: One year later, Microsoft announces that in their new Chicago data center the "entire first floor is full of containers; each container houses 1,000 to 2,000 systems per container; 150 - 220 containers on the first floor." A total of 200k - 400k servers. Wow, quite a build-out. [Found via James Hamilton and Nick Carr]

Grow a spine people! You have a giant growing market with just one dominant competitor, not even any real #2 ... Get a stick and try to knock G's crown off.

Here are my tips to get started:

A conventional attack against Google's search product will fail ... A copy of their product with your brand has no pull.

Forget interface innovation ... Interface features only get in the way.

Forget about asking users to do anything besides typing two words into a box.

Users do not click on clusters, or tags, or categories, or directory tabs, or pulldowns. Ever. Extra work from users is going the wrong way. You want to figure out how the user can do even less work.

Go read the whole thing. It's a good read.

Personalized search, by the way, requires no extra work from the user, works from just a couple words in a box, adds no interface goo, and could provide a substantially different experience than using Google.

Saturday, March 24, 2007

Thinking more about my last post, "Google and the deep web", Google appears to me to be rejecting federated search, instead preferring a local copy of all the world's data on the Google cluster.

Federated search (or metasearch) is when a search query is sent out to many other search engines, then the results merged and reranked.

In more complicated forms, the federated search engine may build a virtual schema that merges all the underlying databases, map the original query to the different query languages of the individual source databases, only query the databases that have a high likelihood of returning good answers, resolve inconsistencies between the databases, and combine multiple results from multiple sources to produce the final answers.
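To make the simple form of federated search concrete, here is a hypothetical Python sketch: fan the query out, normalize scores per engine so they are comparable, dedupe by URL, and rerank. The normalization scheme and all names here are illustrative assumptions, not any real engine's method:

```python
def federated_search(query, engines, top_k=10):
    """Fan `query` out to each engine, then merge and rerank.
    `engines` maps an engine name to a callable returning (url, score)
    pairs. Duplicate URLs keep their best normalized score."""
    merged = {}
    for name, search in engines.items():
        results = search(query)
        if not results:
            continue
        best = max(score for _, score in results) or 1.0
        for url, score in results:
            norm = score / best                  # naive per-engine normalization
            if norm > merged.get(url, 0.0):
                merged[url] = norm               # dedupe, keeping best score
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Even this toy exposes the hard parts the Googlers point at: the merge depends on every engine answering quickly, and the scores from different engines are not really comparable.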

The Googlers on the "Structured Data Meets the Web: A Few Observations" (PS) Dec 2006 IEEE paper make several arguments against this approach succeeding at large scale:

The typical solution promoted by work on web-data integration is based on creating a virtual schema for a particular domain and mappings from the fields of the forms in that domain to the attributes of the virtual schema. At query time, a user fills out a form in the domain of interest and the query is reformulated as queries over all (or a subset of) the forms in that domain.

For general web search, however, the approach has several limitations that render it inapplicable in our context. The first limitation is that the number of domains on the web is large, and even precisely defining the boundaries of a domain is often tricky ... Hence, it is infeasible to design virtual schemata to provide broad web search on such content.

The second limitation is the amount of information carried in the source descriptions. Although creating the mappings from web-form fields to the virtual schema attributes can be done at scale, source descriptions need to be much more detailed in order to be of use here. Especially, with the numbers of queries on a major search engine, it is absolutely critical that we send only relevant queries to the deep web sites; otherwise, the high volume of traffic can potentially crash the sites. For example, for a car site, it is important to know the geographical locations of the cars it is advertising, and the distribution of car makes in its database. Even with this additional knowledge, the engine may impose excessive loads on certain web sites.

The third limitation is our reliance on structured queries. Since queries on the web are typically sets of keywords, the first step in the reformulation will be to identify the relevant domain(s) of a query and then map the keywords in the query to the fields of the virtual schema for that domain. This is a hard problem that we refer to as query routing.

Finally, the virtual approach makes the search engine reliant on the performance of the deep web sources, which typically do not satisfy the latency requirements of a web-search engine.

Google instead prefers a "surfacing" approach which, put simply, is making a local copy of the deep web on Google's cluster.

Not only does this provide Google the performance and scalability necessary to use the data in their web search, but it also allows them to easily compare the data with other data sources and transform the data (e.g. to eliminate inconsistencies and duplicates, determine the reliability of a data source, simplify the schema or remap the data to an alternative schema, reindex the data to support faster queries for their application, etc.).

Google's move away from federated search is particularly intriguing given that Udi Manber, former CEO of A9, is now at Google and leading Google's search team. A9, started and built by Udi with substantial funding from Amazon.com, was a federated web search engine. It supported queries out to multiple search engines using the OpenSearch API format they invented and promoted. A9 had not yet solved the hard problems with federated search -- they made no effort to route queries to the most relevant data sources or do any sophisticated merging of results -- but A9 was a real attempt to do large scale federated web search.

If Google is abandoning federated search, it may also have implications for APIs and mashups in general. After all, many of the reasons given by the Google authors for preferring copying the data over accessing it in real-time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, are serious problems for any large scale effort to build a real application on top of APIs.

Lastly, as law professor Eric Goldman commented, the surfacing approach to the deep web may be the better technical solution, but it does have the potential of running into legal issues. Copying entire databases may be pushing the envelope on what is allowed under current copyright law. While Google is known for pushing the envelope, yet another legal challenge may not be what they need right now.

Friday, March 23, 2007

A few new papers out of Google cover some of their work on indexing the deep web.

A Dec 2006 article, "Structured Data Meets the Web: A Few Observations" (PS), appears to provide the best overview.

The paper starts by saying that "Google is conducting multiple efforts ... to leverage structured data on the web for better search." It goes on to talk about the scope of the problem as "providing access to data about anything" since "data on the Web is about everything."

The authors discuss three types of structured data, the deep web, accessible structured data such as Google Base, and annotation schemes such as tags.

Of most interest to me was the deep web, described in the paper as follows:

These are pages that are dynamically created in response to HTML form submissions, using structured data that lies in backend databases. This content is considered invisible because search-engine crawlers rely on hyperlinks to discover new content. There are very few links that point to deep web pages and crawlers do not have the ability to fill out arbitrary HTML forms.

The deep web represents a major gap in the coverage of search engines: the content on the deep web is believed to be [vast] ... [and] of very high quality.

While the deep Web often has well structured data in the underlying databases, the Google authors argue that their application -- web search -- makes it undesirable to expose the deep Web structure. From the paper:

The reality of web search characteristics dictates ... that ... querying structured data and presenting answers based on structured data must be seamlessly integrated into traditional web search.

This principle translates to the following constraints:

Queries will be posed ... as keywords. Users will not pose complex queries of any form. At best, users will pick refinements ... that might be presented along with the answers to a keyword query.

Answers from structured data sources ... should not be distinguished from the other results. While the research community might care about the distinction between structured and unstructured data, the vast majority of search users do not.

The authors discuss two major approaches to exposing the deep Web, virtual schemas and surfacing.

Virtual schemas reformulate queries "as queries over all (or a subset of) the forms" of the actual schemas. The virtual schema must be manually created and maintained. As the paper discusses, this makes it impractical for anything but narrow vertical search engines, since there are a massive number of potential domains on the Web ("data on the Web encompasses much of human knowledge"), the domains are poorly delineated, a large scale virtual schema would be "brittle and hard to maintain", and the performance and scalability of the underlying data sources is insufficient to support the flood of real-time queries.

Therefore, the authors favor a "surfacing" approach. In this approach:

Deep web content is surfaced by simulating form submissions, retrieving answer pages, and putting them into the web index.

The main advantage of the surfacing approach is the ability to re-use existing indexing technology; no additional indexing structures are necessary. Further, a search is not dependent on the run-time characteristics of the underlying sources because the form submissions can be simulated off-line and fetched by a crawler over time. A deep web source is accessed only when a user selects a web page that can be crawled from that source.

Surfacing has its disadvantages, the most significant one being that we lose the semantics associated with the pages we are surfacing by ultimately putting HTML pages into the web index. [In addition], not all deep web sources can be surfaced.

Given that Google is approaching the deep web as adding data to their current web crawl, it is not surprising that they are tending toward the more straightforward approach of just surfacing the deep web data to their crawl.
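As a toy illustration of surfacing, assume we already know plausible values for each form field (in practice, discovering those values is much of the hard work). Then surfacing is just enumerating combinations offline, simulating submissions, and putting the result pages into an ordinary index. Everything here is a hypothetical sketch:

```python
from itertools import product

def surface_deep_web(form_fields, submit, index):
    """Offline 'surfacing' sketch. `form_fields` maps a field name to
    candidate values; `submit` stands in for fetching the result page of
    one simulated form submission; `index` is a plain dict playing the
    role of the web index, keyed by a synthetic URL."""
    names = sorted(form_fields)
    for combo in product(*(form_fields[n] for n in names)):
        params = dict(zip(names, combo))
        page = submit(params)                 # simulated form submission
        if page:                              # skip empty result pages
            url = "form?" + "&".join(
                f"{k}={v}" for k, v in sorted(params.items()))
            index[url] = page
    return index
```

The combinatorics make the trade-off obvious: the crawler pays the enumeration cost once, offline, and the live search never touches the slow backend source.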

The paper also discusses structured data from Google Base and tags from annotation schemes. Of particular note there is that they discuss how the structured data in Google Base can be useful for query refinement.

At the end, the authors talk about the long-term goal of "a database of everything" where more of the structure of the structured deep web might be preserved. Such a database could "lead to better ranking and refinement of search results". It would have "to handle uncertainty at its core" because of inconsistency of data, noise introduced when mapping queries to the data sources, and imperfect information about the data source schemas. To manage the multitude of schemas and uncertainty in the information about them, they propose only using a loose coupling of the data sources based on analysis of the similarity of their schemas and data.

This specific paper is only one of several on this topic out of Google recently. In particular, "Web-scale Data Integration: You can only afford to Pay As You Go" (PDF) goes further into this idea of a loose coupling between disparate structured database sources. Of most interest to me was the idea implied by the title, that PayGo attempts to "incrementally evolve its understanding of the data it encompasses as it runs ... [understanding] the underlying data's structure, semantics, and relationships between sources", including learning from implicit and explicit feedback from users of the system.

Another two papers, "From Databases to Dataspaces" (PDF) and "Principles of Dataspace Systems" (PDF) also further discuss the idea of "a data co-existence approach" where the system is not in full control of its data, returns "best-effort" answers, but does try to "provide base functionality over all data sources, regardless of how integrated they are."

Apollo is the code name for a cross-operating system runtime being developed by Adobe that allows developers to leverage their existing web development skills (Flash, Flex, HTML, JavaScript, Ajax) to build and deploy Rich Internet Applications (RIAs) to the desktop.

Apollo is a cross-operating system runtime that runs outside of the browser .... We ... are confident that we will be able to quickly get significant distribution of the Apollo runtime.

Apollo is targeted at developers who are currently leveraging web technologies, such as Flash, Flex, HTML, JavaScript and Ajax techniques to build and deploy Rich Internet Applications.

Apollo will provide a set of APIs to make it easy to develop connected applications that work while offline.

Tuesday, March 20, 2007

An interesting excerpt on the competitive advantage Google has from its skill in building out its massive cluster:

In an internet market where you deliver your services by computers with spinning disks, we have a competitive advantage... because we have the cheapest and most scalable such architecture.

We hope that in the course of innovation we will be able to build products which are almost impossible for our competitors to replicate, because we simply learn how to implement them at scale.

The internet is a scaled business, and running large internet scale businesses is very, very difficult. All the challenges – staying up, dealing with spammers, dealing with other access problems – and we believe we do it best in the world.

See also my post, "Making the impossible possible", that quotes Google Earth CTO Michael Jones as saying, "Your perception of a thing that is a viable problem to think about is shaped by the tool you can use."

See also some of my posts ([1][2]) about Microsoft's belated attempts to build their own cluster.

Monday, March 19, 2007

Deepak Thomas and Nisan Gabbay post an excellent and detailed "YouTube Case Study" that analyzes the reasons for the success of the startup.

To summarize, YouTube eliminated the hassles with sharing and watching videos by using the just-released Adobe Flash support for video, then drove rapid adoption through offering embedded widgets and distribution of copyrighted content.

Deepak and Nisan also suggest at the end that there might be an opportunity for video startups that offer higher resolution video than YouTube.

I suspect a bigger opportunity lies with helping people find and discover interesting videos on demand, effectively helping people create their own TV stations customized to their own interests. Lack of item authority and poor recommendation quality make this very hard to do on YouTube right now.

Microsoft: stop the talk. Ship a better search, a better advertising system than Google, a better hosting service than Amazon, a better cross-platform Web development ecosystem than Adobe, and get some services out there that are innovative (where's the video RSS reader? Blog search? Something like Yahoo's Pipes? A real blog service? A way to look up people?) That's how you win.

Saturday, March 17, 2007

How do engines [disambiguate] intent without us giving it any more information to work with at the point of query?

This is where personalization comes in. In the current online reality, there are really only a few places that the search engine can look to help define intent without depending on further information from the user:

They can look at your past history and learn more about you by what you have already done

They can look at the context of the task you're currently engaged in, hoping that it will give some clues to what you're looking for

And finally, if they know something about you and your social, geographic and demographic cohort, the engine can hope that there is a similarity of thinking within that cohort, at least when it comes to common interests and intent

Each of these factors is being explored as a potential avenue to help with disambiguating intent.

Right now, Google is putting its eggs in the past online history basket, feeling that where you have been will provide the best signal to predict where you might want to go.

When people only enter a few keywords, any additional information might help. Looking back into what they have done might allow us to better determine intent and interest and lead to more useful search results.

Different people have different interpretations of what is relevant. At some point, the only way to further improve the quality of search results will be to show different people different search results. Changing search results using long-term search history, as Google Personalized Search does, is one way of satisfying these differing views of relevance.

Personalization also seems likely to be useful when a searcher is iterating and not finding what they want. That searcher clearly is having difficulty translating intent into results. Current search engines ignore these iterations -- each search is treated as independent -- but there is valuable information in those struggles.
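As a toy illustration of the idea, here is a hypothetical rerank that boosts results overlapping a long-term interest profile. The weighting scheme and all the names are made up for illustration; this is not any engine's actual algorithm:

```python
def personalized_rerank(results, profile, weight=0.5):
    """Rerank (url, base_score, page_terms) results by boosting pages
    that overlap the user's long-term interest profile, a mapping of
    term -> affinity learned from past searches and clicks."""
    rescored = []
    for url, score, terms in results:
        affinity = sum(profile.get(t, 0.0) for t in set(terms))
        rescored.append((url, score + weight * affinity))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in rescored]
```

Note that the user does nothing extra: the same two words in a box produce a different ordering because the profile, not the query, carries the disambiguating signal.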

Monday, March 12, 2007

Todd Bishop at the Seattle PI has a good article today, "What next for Microsoft in Web search?", that summarizes some opinions on "what the company might try next in its struggle against Google" to gain "traction in an area where Microsoft has lost market share the last two years."

I, Danny Sullivan, John Battelle, and several others are briefly quoted.

Update: Todd Bishop also posted on his weblog and included longer versions of some of the opinions there.

Saturday, March 10, 2007

Miguel Helft at the NYT writes about the Google commuter buses. Some excerpts:

[Google] now ferries about 1,200 employees to and from Google daily — nearly one-fourth of its local work force — aboard 32 shuttle buses equipped with comfortable leather seats and wireless Internet access.

Its aim is to make commuting painless for its pampered workers — and keep attracting new recruits in a notoriously competitive market for top engineering talent.

And Google can get a couple of extra hours of work out of employees who would otherwise be behind the wheel of a car.

Google will not discuss the cost of the program.

The cost of the program, I suspect, is trivial compared to the benefits.

Let's do a quick back-of-the-envelope on this one.

Let's assume the 32 buses require fewer than 100 employees to operate (bus drivers running in two shifts + maintenance + coordination + admin). Assume a Google employee using the bus is able to work for at least one extra hour that would otherwise be wasted in the commute.

That is already a 3:2 ratio in time saved, but it gets even better. The average bus employee almost certainly makes less than 1/5th of what the average Googler makes. After adjusting for the salary differential, the ratio becomes at least 15:2.

So, even if only the extra work time provided for Googlers is considered, the program almost certainly pays for itself, by nearly an order of magnitude.
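The back-of-the-envelope above in a few lines of Python, where every figure is this post's assumption, not a Google number:

```python
# Assumptions from the post: 1,200 riders each gain one working hour per
# day; the buses take under 100 staff working ~8-hour days; bus staff
# make at most 1/5th of what the average Googler makes.
riders, hours_gained = 1200, 1
bus_staff, staff_hours = 100, 8
salary_multiple = 5

googler_hours = riders * hours_gained          # 1,200 hours gained per day
staff_cost_hours = bus_staff * staff_hours     # 800 staff-hours spent per day
time_ratio = googler_hours / staff_cost_hours  # 1,200:800, i.e. 3:2
value_ratio = time_ratio * salary_multiple     # 15:2 after salary adjustment
```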

Beyond this, as the Helft article states, there appear to be substantial benefits for recruiting and retention. That also has clear value.

Given this, I have long wondered why other companies do not imitate Google's strategy on perks. In particular, I am amazed that Microsoft does not run dedicated commuter buses from Seattle to Redmond given the length of that commute (> 1 hour) and the number of people who leave Microsoft or refuse to work there because of that commute.

See also the back-of-the-envelope calculations in my August 2005 post, "Free food at Google". In that post, I also cited business research on perks and said, "Perks can be seen as a gift exchange, having an impact on morale and motivation disproportionate to their cost."

See also my July 2004 post, "Microsoft cuts benefits", especially the update at the end where, fourteen months later, BusinessWeek blamed low morale and loss of key people at Microsoft on the benefit cuts.

Thursday, March 08, 2007

Microsoft says it still believes that it will eventually turn the tables [on Google] by improving the quality of its search results and by changing the way computer users search.

Search in the future will look nothing like today's simple search engine interfaces, [Susan Dumais] said, adding, "If in 10 years we are still using a rectangular box and a list of results, I should be fired."

John briefly mentioned efforts by Susan Dumais and others on personalized search in his article.

Mary Jo Foley also covered the event. An excerpt on personalized search:

I also enjoyed reading Don Dodge's thoughts on TechFest. An excerpt on personalization:

By building an index of documents, emails, and previous searches it is possible to create a personal profile that will help filter and rank search results for better relevance.

This is an artificial intelligence system that learns your interests and preferences, and constantly updates its algorithm based on your choices.

In this way it is not necessary for the user to change their behavior or search style in order to improve results.

It was not entirely clear to me from the coverage, but the personalized search based on desktop files sounds a lot like the 2005 work by Jaime Teevan, Susan Dumais, and Eric Horvitz (discussed in this old August 2005 post). I wonder what was different about the work that was demoed this year.

See also my April 2006 post, "Using the desktop to improve search", which discusses how several projects at Microsoft Research may be able to be "combined, refined, finished, and moved into the Windows desktop" to create "an experience impossible to reproduce in a web browser, a jump beyond the 1994, one-box search interface we still live with today."

Richard MacManus reports that a major revision of My Yahoo is about to start a private beta test.

Richard has already seen the beta and says that the features include a "pre-built personalized page for each user, based on data Yahoo has already gleaned from their usage of Yahoo properties."

In other words, My Yahoo will learn from your usage of Yahoo and build you your own personalized page.

New implicit personalization features may not be limited to My Yahoo. Verne Kopytoff at SFGate reports that "Jerry Yang, the Sunnyvale company's co-founder, spoke [recently] about better tailoring Yahoo's iconic Web portal to individual users, with the help of technology that predicts what they want."

It appears that both My Yahoo and the Yahoo home page may soon adapt to your behavior and try to learn what you might want.

See also my May 2006 post, "Yahoo home page cries out for personalization", where I said, "To help me get where I want to go, the [Yahoo home] page should feature things I use at Yahoo. To help me discover new stuff, the site should recommend things based on what I already use."

Wednesday, March 07, 2007

Todd Bishop reports in the Seattle PI that Chris Payne, "the Microsoft Corp. executive who led the company's challenge to Google is leaving the company."

Four years ago, back in Feb 2003, Chris Payne pitched Bill Gates on building a Google-killer. That became Project Underdog and led to a long, expensive, and largely unsuccessful effort to compete with Google.

See also Todd's weblog post which adds additional commentary and pointers to other coverage.

Update: Danny Sullivan posts an excellent, detailed article on the history of Microsoft and search, writing that Microsoft now is entering its "third era" of search "because there are no longer any excuses to justify further losses."

I would recommend jumping down to Danny's subheading "2003: The 'Build Our Own' Decision & Christopher Payne" and reading down in the article from there.

Update: Danny Sullivan posts more details about how Microsoft created the opportunity for the rise of Google back in 1998-2002 using comments directly from "former Microsoft search chief" Bill Bliss. An obviously frustrated Bill writes, "I was always told 'Search is not core to our business ... AOL is the competitor to beat.'"

Update: Three weeks later, BusinessWeek writes "Where is Microsoft Search?", saying that "In February, 2005, Microsoft's MSN Search accounted for nearly 14% of all Web searches ... Just two years later, Microsoft's rebranded Windows Live Search has a 9.6% share" and that "Microsoft has already squandered much of the time it spent developing the search business .... [while] Google has performed near flawlessly."

Update: Eight months later, Microsoft Search GM Ken Moss -- Chris' right hand man on the search effort for the last five years -- is out too.

"Apparently, Google is planning to build distribution relationships with multiple carriers by allowing them to minimize subscription and marketing costs," Simeonov writes. "In other words, Google will market the phone online and carriers will fulfill. How fast can you say dumb pipe?"

What if I had a phone that works over WiFi? ... What if there was city-wide WiFi coverage? Or WiFi coverage equivalent to cell phone networks (covering cities and major highways)? [Then] my WiFi phone would work everywhere.

Google just launched Google Talk, a VoIP application. Google is rumored to be thinking about a nationwide free WiFi network. Combine these two, add a WiFi phone to the mix, and am I about to get free mobile calling nationwide?

Personalization is one of those things where if you look down the road a few years, having a search engine that is willing to give you better results because it can know a little bit more about what your interests are, that's a clear win for users.

Already, you don't do a search for football and get the same results in the U.K. as you do in the U.S. ... You can get different results, instead of just the standard American results ... [That has] a huge benefit.

If you're in the United Kingdom and you type the query newspapers, you don't want to get, necessarily, the L.A. Times or a local paper in Seattle, the Post-Intelligencer ... [Localization] started down that trend, and, over time, personalization will help a lot of people realize that it's not just a generic set of results, or a vanilla set of results.

The idea of a monolithic set of search results for a generic term will probably start to fade away ... You already see people expect that if I do a search and somebody else does the search, they can get slightly different answers ... Over time people will expect that more and more.

See also my previous post, "Marissa Mayer interview on personalized search", that has selected excerpts from Gord's interview with Google VP Marissa Mayer and references to other information on Google's efforts on personalized search.

For example, I named a new tab "geek" and got a page with articles from Slashdot, Wired, PCWorld, PC Magazine, Engadget, MajorGeeks, and HowStuffWorks. A tab named "search" had searches for YouTube, eBay, WhitePages, YellowPages, Wikipedia, and Dictionary.com.

Compound queries do not seem to work as well. "Computer science" yields a page with matches for "computer" like PCWorld and PC Magazine and matches for "science" like ScienceDaily, Scientific American, and New Scientist. None of that was the hard-core computer science stuff I was expecting.

Obscure stuff does not work at all. For example, requests for autogenerated pages for "cryptography", "mysql", or "findory" returned empty pages.

Nathan Weinberg at Inside Google, who talked with Google personalization lead Sep Kamvar before writing his excellent review, reports, "The system relies on many factors, primarily other people who have named their tabs the same way you've named yours, sees what Gadgets are popular in that set, and gives you a page of them."
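Nathan's description amounts to a simple form of collaborative filtering keyed on the tab name. As a rough sketch of that idea (the data and function below are entirely made up for illustration; this is not Google's actual algorithm or anything close to its scale):

```python
# Toy sketch of tab-name-based recommendation: rank the gadgets most
# popular among other users who gave a tab the same name.
from collections import Counter

# Hypothetical data: user -> {tab name -> gadgets placed on that tab}
users = {
    "alice": {"geek": ["Slashdot", "Wired", "Engadget"]},
    "bob":   {"geek": ["Slashdot", "Engadget", "PCWorld"]},
    "carol": {"geek": ["Wired", "Slashdot"]},
    "dave":  {"news": ["BBC", "CNN"]},
}

def suggest_gadgets(tab_name, k=3):
    """Count how many users with an identically named tab use each gadget,
    then return the k most popular."""
    counts = Counter()
    for tabs in users.values():
        for gadget in tabs.get(tab_name, []):
            counts[gadget] += 1
    return [gadget for gadget, _ in counts.most_common(k)]

print(suggest_gadgets("geek"))  # Slashdot is on all three "geek" tabs, so it ranks first
```

This also makes the failure mode above easy to see: a tab named "cryptography" or "findory" returns nothing because too few users have created a tab with that exact name.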

The next step, I would assume, would be to try to generate a tab or, more boldly, the initial version of the personalized home page using the information in your search history. That would be very cool, like My Yahoo or Netvibes but with no effort to set up and configure.

Annalee Newitz at Wired writes "I Bought Votes on Digg", an article about how she bought votes to get "a pointless blog full of poorly written, incoherent commentary ... to the front page on Digg."

Matt Marshall has a nice summary of the Wired article, ending with the comment, "This is particularly harmful for Digg, because its management has said gaming can't happen."

Security guru Ed Felten posts some good thoughts on the Wired article and, more generally, on manipulation of community reputation systems. Some excerpts:

There's a myth floating around that such [reputation] systems distill an uncannily accurate folk judgment from the votes submitted by millions of ordinary citizens. The wisdom of crowds, and all that.

In fact, reputation systems are fraught with problems, and the most important systems survive because companies expend great effort to supplement the algorithms by investigating abuse and trying to compensate for it.

The incentive problem is especially challenging for recommendation services like Digg. Digg assumes that users will cast votes for the sites they like. If I vote for sites that I really do like, this will mostly benefit strangers.

But if I sell my votes or cast them for sites run by my friends and me, I will benefit more directly. In short, my incentive is to cheat. These sorts of problems seem likely to get worse as a service grows, because the stakes will grow and the sense of community may weaken.

See also my earlier post, "Spam is ruining Digg", and its many references to other posts about spam, Digg, and reputation systems.

Serious Yahoo engineers -- quit before further damage is done to your resumes.

I will hire you. Google will hire you. Someone will hire you. You will be happier in an org that values tech and scale and algorithms.

Get out of there!

Heh, heh.

It is a little funny that the singing news hoser is what pushes some over the edge on Yahoo. I thought it was much worse when Yahoo CFO Susan Decker publicly gave up on search by saying, "It's not our goal to be No. 1 in Internet search. We would be very happy to maintain our market share."