Archive for the ‘Web/Tech’ Category

While the dustup over Bing’s possible appropriation of Google’s long-tail search results is presently occupying the attention of the world of search, I thought I’d take a step back and offer a longer-term historical perspective on an aspect of search that fascinates me: namely, the evolution of search algorithms to incorporate ever greater amounts of human-generated input into their calculation of relevancy.

Last September, Facebook began including heavily “Liked” items in their search results, and Bing followed suit in December. While this news itself is now a few months old, it got me thinking about how the methods used to determine relevance have changed since the era of web search began. The inclusion of “likes” as a relevancy signal is simply the latest chapter in the evolution of the techniques employed to rank search results.

The arc of relevancy’s story can be traced along one dimension by observing the amount of human input that is incorporated into the algorithm that determines the relevance ranking of search results.

Early search engines relied primarily on the words in each page (and some fancy math) to determine a page’s relevance to a query. In this case, there is one human (the author of that particular web page) “involved” in determining the page’s relevance to a search.

When we launched the Excite.com web search engine in October 1995, we had an index that contained a whopping 1.5 million web pages, a number that seemed staggering at the time, though the number of pages Google now indexes is at least five orders of magnitude larger.

Excite’s method for determining which search results were relevant was based entirely upon the words in each web page. We used some fairly sophisticated mathematics to determine how to stack rank each document’s relevancy to a particular search. This method worked fairly well for a time, when searching just a few tens of millions of pages, but as the size of our index grew, the quality of our search results began to suffer. The other first-gen search engines like Lycos, Infoseek and AltaVista suffered similar problems. Too much chaff, not enough wheat.
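For the curious, the word-based scoring of that era can be illustrated with a toy TF-IDF-style ranker. This is a sketch of the general technique, not Excite’s actual algorithm, and the documents and query below are invented for illustration:

```python
import math
from collections import Counter

# Three made-up "web pages" standing in for an index.
docs = {
    "page1": "cheap flights to paris and hotel deals",
    "page2": "paris travel guide museums and food",
    "page3": "python programming guide and tutorials",
}

tokenized = {name: text.split() for name, text in docs.items()}
N = len(docs)

def idf(term):
    # Inverse document frequency: terms that appear in fewer pages carry more weight.
    df = sum(1 for words in tokenized.values() if term in words)
    return math.log((N + 1) / (df + 1)) + 1  # smoothed to avoid division by zero

def score(query, doc_words):
    # Sum of (term frequency * idf) over the query terms.
    counts = Counter(doc_words)
    return sum((counts[t] / len(doc_words)) * idf(t) for t in query.split())

query = "paris guide"
ranked = sorted(tokenized, key=lambda d: score(query, tokenized[d]), reverse=True)
print(ranked[0])  # page2: it matches both "paris" and "guide"
```

At web scale, scoring like this works until spammers stuff pages with keywords, which is precisely the failure mode described above.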

Enter Google. Google’s key insight was that the words in a web page weren’t sufficient for determining the relevance of search results. Google’s PageRank algorithm tracked the links in each web page, recognized that each of those links represented a vote for another web page, and found that measuring these votes could determine the relevance of search results dramatically better than complex calculations based on the words in the document alone.

Simply put, Google allowed the author of any web page to “like” any other web page simply by linking to it. So instead of a single page’s author being the sole human involved in determining relevancy, all of a sudden everyone authoring web pages got to vote. Overlaying a human filter on top of the basic inverted-index search algorithm created a sea change in delivering relevant information to the users seeking it. And this insight (coupled with the adoption of pay-per-click advertising) turned Google into the juggernaut it became.
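The voting idea can be made concrete with a toy power-iteration sketch. The damping factor of 0.85 comes from the original PageRank paper, but the four-page link graph here is invented for illustration, and the real algorithm involves far more:

```python
# Toy PageRank: each page's outbound links act as "votes" for the pages it
# points to, and a page's rank is split evenly among the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
d = 0.85  # damping factor from the original PageRank paper
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks settle
    new = {}
    for p in pages:
        # Each page q linking to p contributes rank(q) / outdegree(q).
        votes = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * votes
    rank = new

best = max(rank, key=rank.get)
print(best)  # "c" collects the most inbound votes, so it ranks highest
```

Note that page “d” links out but receives no links, so it ends up with the minimum rank: votes you cast don’t help your own page, only the pages you point to.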

While Google’s algorithm expanded the universe of humans contributing to the relevancy calculation dramatically from a single author of a single web page to all the authors of web pages, it hadn’t fully democratized the web. Only content publishers (who had the technical resources and know-how) had the means to vote. The 90%+ of users online who were not creating content still had no say in relevancy.

Fast forward several years to the meteoric rise of Facebook. Arguably, Facebook’s rise is largely attributable to the launch of the newsfeed feature as well as the Facebook API, which opened the floodgates for third-party developers and brought a rich ecosystem of applications and new functionality to Facebook. After reaching well over half a billion users, Facebook unleashed a powerful new feature that may ultimately challenge Google in its ability to deliver relevant data to users: the “Like” button.

With over two million sites having installed the Like button as of September 2010, billions of items on and off Facebook have been Liked. In the early Google era, only those people with the ability to author a web page (a relatively small club in the late ‘90s) had the ability to “like” other pages (by linking to them).

Facebook’s Like button today enfranchises over half a billion people to vote for pages simply by clicking. This reduces the voting/liking barrier rather dramatically and brings the wisdom of the crowd to bear on an unprecedented scale. And beyond simple volume, it enables the “right” people to vote. Having your friends’ votes count juices relevancy to a whole new level.

A related behavior to clicking the Like button is content sharing, which is prevalent on both Facebook and Twitter. Social network “content referral” traffic in the form of URLs in shares and tweets in users’ newsfeeds now exceeds Google search as a traffic source for many major sites. Newsfeeds are now on equal footing with SERPs in terms of their importance as a traffic source.

Not only are destination sites seeing link shares become a first-class source of traffic, but clearly users themselves are spending much more time in their newsfeeds on Facebook and Twitter than they do in the search box and search-results pages. Social networks’ sharing and liking gestures have resulted in an unexpected emergent property — users’ newsfeeds have become highly personalized content filters that are in some sense crowdsourced, but are perhaps more accurately described as “clansourced” or “cohortsourced” since the crowd doing the sourcing for each user is hand-picked.

Beyond liking and sharing in the spectrum of human involvement is perhaps a move to a more labor intensive gesture: curation. Human-curated search results (aided, of course, by algorithms) are the premise behind Blekko, a new search engine focused on enhancing search results through curation. Making a dent in Google’s search hegemony is a tall order indeed, but my guess is that if anyone succeeds, it will be through a fundamentally new approach to search, and likely one that involves a more people-centric approach. And Google certainly faces a challenge as content farms and the like fill up the index with spam that is hard to root out algorithmically. For a cogent description of this problem, just ask Paul Kedrosky about dishwashers and the ouroboros.

One thing seems clear: the web’s ability to deliver relevant content to users relies on ever more sophisticated algorithms that not only leverage raw computational power but also increasingly weave in feedback from a growing share of the humans participating in the creation and consumption of digital media online.

I’m happy to announce that Sling Media has shipped their first product, the Slingbox Personal Broadcaster, and that Walt Mossberg’s review of the Slingbox appears in Thursday’s Wall Street Journal. I’ve posted about Sling briefly in the past, here and here, and have been actively involved with the company as a board member since Mobius VC invested in them last October.

Since that time, the team managed to garner several awards at January’s CES and build some buzz, all while keeping their heads down in product development mode to meet an aggressive goal of shipping the product in the first half of 2005, which they did with a day or two to spare. You can buy a Slingbox online at CompUSA right now, or you can walk over to your local CompUSA store and pick one up off the shelves on Friday, with more retailers to be announced shortly. Congrats to everyone over at Sling Media for a job very well done!

I may be biased, but I’ve been a beta tester for the past couple months and wouldn’t want to part with my Slingbox, which I’ve got hooked up to my DirecTiVo. Just yesterday, while at the office (hey, I’m a multi-tasker, what can I say), I managed to watch the last 15 minutes of Six Feet Under and also reactivate my Sopranos season pass using my laptop. And a month or so ago while in Pittsburgh, I watched the most recent episode of Entourage (which was sitting on my Tivo) from the comfort of my broadband-enabled hotel room. Pretty damn cool.

Dave Sifry of Technorati has done a great three-part (1, 2, 3) series of posts on the size and shape of the blogosphere where he provides some fun charts and graphs on the development of the world of weblogs over the past several months since he last posted on this topic after his presentation at last October’s Web 2.0 conference. Since October, the number of blogs indexed and links tracked by Technorati has doubled to nearly eight million blogs and nearly one billion links. I’m happy to say that I am one of the new bloggers contributing to the growth, though sometimes that just makes me feel like a statistic.

My kudos go to the team at Technorati for scaling up to meet the torrid growth of the blogosphere thus far, all the while adding many new features along the way. And I wish them the best of luck as they rise to meet the challenge of another doubling of their universe over the next five months and contemplate a quadrupling of it by year’s end.

Yahoo is celebrating its tenth anniversary, which led me to reflect on the fact that a decade is a very long time in Silicon Valley, particularly when viewed through the lens of Moore’s Law. As Ray Kurzweil and others have observed, when humans contemplate exponential progress, we tend to overestimate what can be accomplished in the short term (where the curve is relatively flat), but we tend to underestimate progress in the long term (when the curve gets very steep, goes up and to the right, does a hockey stick, etc.). People tend to think more easily in powers of ten (as opposed to the powers of two prevalent in the technology industry), so a decade is a good duration to look back and consider what Mr. Moore has done for us lately, after we’ve had six or seven doublings of memory density, computing speed and bandwidth. The datacenter is a great place to look to see these trends converge.

Given my background, it is perhaps not surprising that I’m a big fan of consumer-oriented web services such as Google and Yahoo, as well as recent Mobius VC investments Technorati and NewsGator. I’m equally enamored with enterprise-focused software-as-a-service businesses such as RightNow, Salesforce.com and Mobius VC portfolio companies Postini, Quova and Rally Software Development. Another cool software-as-a-service startup offers a hosted application wiki and is called JotSpot, which was founded by Joe Kraus and Graham Spencer, two of the guys with whom I co-founded Excite back in 1993. Though these enterprise and consumer oriented companies have different revenue models, their delivery model and back-end architectures for serving their customers are fundamentally similar.

Each of the companies I have mentioned above has benefited greatly from the drastic increase in the amount of storage, computing power, bandwidth and datacenter rack space that a dollar buys in 2005 versus what it bought in 1995, back when Yahoo and Excite launched their sites. I spent some time poking around the web trying to find 1995 prices for CPUs, RAM, storage, bandwidth and colo space, but it turns out that this kind of pricing archaeology is difficult to practice online, as all searches for these things turned up advertising and commerce sites focused on selling me these commodities today, not ten years ago.

I related my problem to my colleague Jocelyn Ding, who is the SVP Business and Technical Operations at Postini, and she was able to track down some old price lists from a variety of vendors from 1995, 1998 and 2000 for bandwidth, cage rental, one and four CPU servers and storage systems. Thanks go to Jocelyn for helping me put some real numbers behind my somewhat obvious assertion that a dollar goes much further in today’s datacenter than it did a decade ago. I also dug up a good whitepaper detailing costs of enterprise storage since 1992 which includes projections to 2010, which can be found here. For some items, the prices didn’t go back to 1995, so in those cases, I have extrapolated the price trend to estimate 1995 costs based on 1998 or 2000 costs relative to today’s costs.

In addition to the price reductions, we also have to look at the compute performance of a web server class machine in 1995 vs. today. Given five or six performance doublings since 1995 courtesy of improvements in clock speed, bus speed, architecture changes from 32 to 64 bit, additional cache memory and faster RAM, a conservative estimate would be that today’s single CPU 1-U “pizza box” web server is roughly fifty times faster than last decade’s model. Couple that with the 25x price difference for this pizza box, and your 2005 dollar buys you more than one-thousand times as much compute power as it did in 1995. Bandwidth is at least ten times cheaper than it was in 1995, floor space in the data center is seven times cheaper and enterprise-class storage is at least four hundred times cheaper than it was only a decade ago. With some smart software and network engineering, the cost per gigabyte of storage can be brought down an order of magnitude further still using a distributed filesystem based on low-end IDE drives. Finally, with the rise of Linux, Apache, MySQL and open source in general, software license costs can also vanish from the equation when running a large-scale web service.
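The back-of-the-envelope arithmetic works out like this (the multipliers are the rough estimates above, not precise figures):

```python
# Rough 1995 -> 2005 cost/performance multipliers, per the estimates above.
price_drop = 25        # a 1-U "pizza box" server costs ~1/25th of its 1995 price
speed_gain = 50        # ~5-6 performance doublings make it ~50x faster
compute_per_dollar = price_drop * speed_gain
print(compute_per_dollar)  # 1250: "more than one-thousand times" the compute per dollar

bandwidth_drop = 10    # bandwidth at least 10x cheaper
floorspace_drop = 7    # datacenter floor space ~7x cheaper
storage_drop = 400     # enterprise-class storage at least 400x cheaper
```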

What does this mean for a web-services company? The cost to deliver an application to an end-user has dropped dramatically for these companies, and the cost to operate their data centers therefore has much less of an impact on their costs of operations and capex budget than it used to, which means their gross margins for delivering their product have improved significantly since 1995. For companies like Yahoo, Google and, more recently, Technorati, this means the cost to deliver a page view or search results page has gone down dramatically, while the average size of a search-results page is perhaps only marginally larger than it was in 1995. Even though the size of a search index has grown nearly a thousand-fold (Google’s 8B pages today vs. Excite’s 10M in 1995), the costs of computing power and storage have accommodated this expansion, while bandwidth costs and rack space have fallen nearly tenfold.

For enterprise-focused companies like Salesforce.com, Postini, Quova and Rally, the story is similar. Add in a subscription-based recurring revenue stream and you have a business model that has all the benefits of a dependable revenue stream and profit margins that can approach those of a traditional software company. Thanks to the low cost and high performance of today’s hardware coupled with an elegant service architecture, Postini is able to process several hundred million email messages per day for its customers with an extraordinarily light hardware footprint and does so quite profitably as a result.

The fun thing about doing a retrospective like this is to realize that when I write about this again in 2015, the increases in CPU speed, memory density and bandwidth will make today’s costs and capabilities look as quaint as 1995’s do today. Thus the environment will continue to become more hospitable to the software-as-a-service model, more entrepreneurs will create meaningful businesses based on this model, VCs will continue to invest in these ideas (myself included), and we’ll all be able to enjoy some mind-blowing applications a decade from now that are simply not possible today.

Today Technorati launched a very cool new feature, Technorati Tags. Inspired by the communal categorization features of Flickr and del.icio.us, Technorati now indexes tags embedded in blog posts (you get this for free when you choose to categorize your blog post on most popular publishing tools), and now they are first-class entities within Technorati that you can search and explore, which should help drive more folksonomy in the blogosphere. Technorati’s tag-search results include photos from the fine folks at Flickr and links from del.icio.us, which makes tag exploration on Technorati a compelling multimedia experience. Some of my favorites: Mac, Iraq, iPod. Kudos to the team for another great release!

This was my first CES, and it was quite an experience. It also seemed like it was Las Vegas’ first CES given the mass confusion that existed at the airport, the monorail, and all around the conference center. I thought Vegas was supposed to know how to handle conventions! Ah well, I’ll give them the benefit of the doubt for now and assume that CES is just so big that all infrastructure is stretched so far beyond capacity that the whole event can’t help but be a clown show.

Given the consumer-electronics industry’s affinity for acronyms, I’ll create a few of my own to describe what I saw on the hectares of floor space at the convention center: the floor was replete with HEFTs, YADMPs, HORPTs, YACPs and TFBVCs (sorry, but I couldn’t find a vowel for that one). Now I’ll decode the acronyms: Huge Enormous Flatscreen Televisions, Yet Another Digital Music Player, Hundreds Of Rear Projection Televisions, Yet Another Cellular Phone and Tiny Flash-Based Video Cameras.

Samsung’s 102 inch plasma TV was quite impressive, though it was actually four 50 inch panels fused together — I could not see the seams between each panel. The picture was beautiful and this TV was the talk of the show. However, given that it must weigh at least 400 pounds, an eight and a half foot piece of glass raises the question of whether one should opt for a projector and a screen instead. I’ve got a 50 inch plasma at home and that thing is a space heater when it is running, so this thing could transform your living room into that sauna you’ve always wanted.

Overall, the scale of the show is overwhelming and one becomes numb after seeing hundreds of MP3 players, cellphones, video cameras, USB flash drives, televisions, projectors, and so on. I didn’t see much that I would describe as disruptive technology. For the most part it was incremental: more storage capacity, bigger screens, smaller form-factors, etc. Once a new device category is created and becomes established, one can witness the Cambrian explosion in action at CES, with established companies (and dozens you’ve never heard of) creating hundreds of variations on the same basic concept. It has been going on for years with MP3 players, and if I had to choose a recent example of a newish category, I suppose it would be personal media players, though there’s not enough data to say these are a success yet.

I ended my Friday with the keynote from Texas Instruments’ CEO Rich Templeton, which featured Howie Long, Jeffrey Katzenberg, several movie trailers (including the new Star Wars Episode III) and a live demo of Sling Media’s SlingBox Personal Broadcaster from Sling’s CEO/co-founder Blake Krikorian. Blake introduced the audience to the concept of place-shifting: using a SlingBox attached to his cable TV at his home in San Mateo, he was able to watch his own television over the internet on his laptop and on a new EVDO cellphone. Pretty cool. Of course, being a demo, it was not without glitches, and no matter how hard he tried, he wasn’t able to show the SlingBox connected to his TiVo. Blame the demo demons for that one. Nonetheless, it was a great show for Sling, and the company had heavy foot traffic at their booth and got some nice mentions in the press, including a nice mention in Thursday’s WSJ.

Finally, the last of the nine photos I posted in this entry made me laugh and illustrates nicely why CES is no place for babies (to say nothing of the AVN Awards show that goes on at the same time as CES).

I’m a couple days late in posting this, but I’d be remiss if I didn’t congratulate the team over at Technorati for two recent milestones. First, Technorati is going international, in partnership with Digital Garage, who will be launching a fully localized version in Japan. For more info on this, check out David Sifry’s and Joi Ito’s blog entries.

Second, last weekend Technorati moved to a new data center, a process that strikes fear into the hearts of the hardiest engineers and network operations people. They managed to do it in well under their 48-hour goal. My hat is off to the entire team for pulling this off. I’m happy to say the new data center will give us plenty of room to grow, and we’ll all sleep better at night knowing the chance of fire-related downtime has been drastically reduced.

This is a few days old and many folks have already made reference to it, but Google Suggest is a very cool feature that bears mentioning.

But, even cooler, is Google’s announcement of their ambitious plan to digitize the collections of Stanford, Harvard, Oxford, the University of Michigan and the New York Public Library. Among these libraries are as many as 50 million books, though it isn’t clear how much duplication there is among them. In any case, assuming an average count of 200 pages per book, which is probably low, you could wind up with 10 billion pages in the index once the task is complete (though it is not actually possible to complete the task given the number of new books published every year). Compared to the 8 billion web pages they have indexed today, that’s pretty impressive, though by the time they complete the digitization over many years, the web index will no doubt have grown well beyond its current girth. In any case, one can see it is possible that in time this library index will rival the size of the web index.
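The sizing math is simple enough to lay out explicitly, using the rough figures above:

```python
# Rough sizing of the digitized library index, per the estimates above.
books = 50_000_000          # up to 50 million books across the five collections
pages_per_book = 200        # an average that is probably low
library_pages = books * pages_per_book  # 10 billion pages
web_pages = 8_000_000_000   # Google's web index at the time

print(library_pages > web_pages)  # True: the library corpus could exceed today's web index
```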

One interesting feature of an indexed corpus of print media is that it lacks the hyperlinks among the pages that enable Google to deliver their highly relevant search results. The quality of Google’s search technology stems from the basic insight of the PageRank algorithm: that the link structure of the web itself is a useful determinant of the quality and relevancy of pages that match particular keywords. Of course, counting the “votes” from links among pages is not the only technique they employ, but it is a big part of their secret sauce.

It will be fascinating to see how the results from these library queries change over time. Presumably, once these books are made available digitally, web pages will increasingly link into the books hosted on Google’s servers, which is why Google’s decision to do this is doubly brilliant: not only will they see increased search traffic as these collections become freely available, but they will also see huge numbers of page views resulting from all the links that accumulate on the web pointing into the library collections.

This is a great example of enlightened capitalism that follows naturally from Google’s simple yet audacious mission statement, which is “to organize the world’s information and make it universally accessible and useful.” Kudos to them.

Hi, I'm Ryan McIntyre, a VC with Foundry Group, living in Boulder, CO. Over the past 20 years, I've been a software engineer, entrepreneur, angel investor and venture capitalist. I'm also a husband, father, avid musician, foodie, gadget nerd and occasional San Franciscan.