David Hardtke's Blog

A few weeks ago I was at the Twitter Developer Conference. On the hack day there were many impressive presentations about the tools that Twitter has developed to manage all of the data going in and out of Twitter. Twitter is moving their back-end data store over to Cassandra. They threw out some impressive numbers -- 50 million tweets per day, 600 million searches per day. After the hack day, I had dinner with a friend from Twitter (@jeanpaul) and we discussed the raw data volumes that they have to deal with.

My benchmark for "big data" is the STAR Experiment at RHIC. I worked on STAR from 1997-2003, and at that point I believe it was the largest volume data producer in existence. The raw data rates were enormous (a Gigabyte or so per second) but it was fairly easy to compress that to 100 MB/s using electronics. At the end of the day, we had to put everything on tape, and the limit at the time was about 20 MB/s to tape. Using the technologies available at the time, 20 MB/s was the maximum you could record.

Today, of course, nobody uses tape for these sorts of problems. Tape is the same price as it was 10 years ago but disk is about 1000 times cheaper. One would assume, then, that people are recording data at much higher rates than the physicists were 10 years ago. It turns out that, for human-generated data, the data rates are not as high as one might think. I compiled the following numbers from various places. This is data that needs to be archived -- when Ashton Kutcher sends a 4 kB tweet it causes 20 GB of bandwidth to be used, but only the 4 kB tweet needs to be saved.

Source                    Rate        Data to Storage
Twitter                   700/s       2 MB/s
Facebook Status Updates   600/s       2 MB/s
Facebook Photos           400/s       40 MB/s
Google Search Queries     34,000/s    30 MB/s

All of this content is humans typing at a keyboard (except for the Facebook photos). We see something interesting -- human generated unique content, integrated over all humanity, is not a very difficult data problem. Everything we generate is of order 100 MB/s, or perhaps 1 GB/s if we include emails and SMS.
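The aggregate figure is easy to check with back-of-the-envelope arithmetic. The per-item sizes below are my rough assumptions (not measured figures), chosen to reproduce the storage rates in the table:

```python
# Back-of-the-envelope check of the table above. The bytes-per-item
# figures are rough assumptions, not measurements.
sources = {
    # name: (items per second, assumed average bytes per item)
    "Twitter":            (700,      3_000),   # ~2 MB/s
    "Facebook statuses":  (600,      3_300),   # ~2 MB/s
    "Facebook photos":    (400,    100_000),   # ~40 MB/s
    "Google queries":     (34_000,     900),   # ~30 MB/s
}

total_mb_per_s = sum(rate * size for rate, size in sources.values()) / 1e6

print(f"Aggregate archival rate: ~{total_mb_per_s:.0f} MB/s")  # ~75 MB/s
```

About 75 MB/s in total, which is why I say everything humanity types is "of order 100 MB/s".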

This week Microsoft approved two applications that integrate Stinky Teddy's Gossip Powered search directly into your browser and posted them in the Internet Explorer Add-ons Gallery. These tools were built to take advantage of some great features that Microsoft added to Internet Explorer 8. Browsers are becoming like smart phones where the actual phone is not as important as the apps that are available (in the browser world, "apps" are known as "add-ons"). Mozilla's Firefox is the king of the add-on business. Firefox was built as a lightweight shell that could be customized by the user. There are more than 10,000 add-ons in the Mozilla add-on gallery. Google's Chrome has recently enabled third-party add-ons and many Mozilla developers have ported their applications.

Add-ons and toolbars have long existed for Internet Explorer, but there has been a fundamental barrier to their widespread adoption -- the tools used to build add-ons and toolbars for Internet Explorer are also used by hackers to steal your information and infect your computer. Installing add-ons required that you install system software on your computer, and once you hit that button you were at the mercy of the software developer. Often they enticed you to hit the button by offering something useful like smiley face emoticons or access to games. Mozilla's Firefox built a sandboxing mechanism that keeps the add-ons separate from the operating system. Mozilla also has a good system of community policing that keeps the Mozilla community safe from malicious hackers.

Internet Explorer is the default browser for most users, so there has always been a desire to bring add-on features to Internet Explorer without requiring the user to install potentially malicious software on their computer. Enter Internet Explorer 8, with the concept of the Accelerator. Accelerators allow developers to interact with web pages that are rendered in your browser. The applications are completely sandboxed in the browser, and are only activated when you explicitly call for them. Hence, they are safe to install and use.

The Stinky Teddy Abracadabra Search Accelerator allows you to launch a search directly from a web page, either by highlighting terms on the page or by simply right clicking and selecting our accelerator. A little search preview box will pop up, so in many cases you can navigate directly to the page you are looking for. What I've described is pretty standard, but we've added a special ingredient. The Stinky Teddy Abracadabra Search Accelerator uses the page you are currently visiting as context for your search. The concepts on the page are used as a frame of reference that guides us when we decide which search results to show you. The word "base" means different things if you are on a page about baseball or a page about furniture. Where you are helps us to know where you want to go. Although this idea is obvious, no other search engine uses this information. To be clear, we aren't tracking you -- all we use is your current screen to provide context. We don't save any information about you.
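To make the "base" example concrete, here is a toy illustration of context-biased ranking (this is not Stinky Teddy's actual algorithm, just the general idea): results whose text overlaps the terms on the current page score higher.

```python
# Toy sketch: rank candidate results by word overlap with the page
# the user is currently reading. Not the real ranking algorithm.
def context_score(result_text, page_terms):
    words = set(result_text.lower().split())
    return len(words & page_terms)

# Terms pulled from a hypothetical baseball page the user is viewing.
page_terms = set("baseball pitcher runner stole second base".split())

results = [
    "How to refinish a table base",
    "Stealing second base in baseball",
]

ranked = sorted(results, key=lambda r: context_score(r, page_terms),
                reverse=True)
print(ranked[0])  # the baseball sense of "base" wins on a baseball page
```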

A second cool feature added to Internet Explorer 8 is Visual Search Suggestions. Firefox allows for search suggestions in a limited fashion (one line of text). After installing our Search Box Plugin we show you a preview of the search page as you type in your search query. Most search providers show query suggestions -- we show the search page. The search page preview we show has most of our usual content types (web, video, real-time, twitter, news), and the "buzzing" content is shown first. We wonder why other search engines don't show you search results as you type, and we suspect the answer is that this is a case where the business of search gets in the way of the user experience. The business of search is to show sponsored links above the search results. Search engines want you to go to their page, even when that step is unnecessary. Direct navigation from the search box makes more sense to us.

Check out our Internet Explorer 8 goodies and let us know what you think.

Visual Search Bar Plugin:

Accelerator:

I'm currently training for the upcoming Oakland Marathon. Last weekend I needed to go for a long run in spite of a steady rain. I decided to do the French Trail -- if there is any trail that will make a run good on a miserable day it is the French Trail. Here's the GPS. I was surprised to see several other runners out that day, all covered in mud, and all enjoying the rainy day in the forest.

For those of you not familiar with the French Trail, it's located in Redwood Regional Park. The park has 4 major north-south trails that are excellent for running (East Ridge Trail, Stream Trail, West Ridge Trail, and French Trail). The French Trail is by far the most difficult, but also the best. I think it is the best running trail in the East Bay. The French Trail hugs the eastern side of the mountain that forms the Oakland skyline. Because of this geography, it is able to support the few remaining redwood groves in the area (these are second generation trees -- all of the old growth was cut down to rebuild San Francisco in 1906). The trail has several micro-climates with bits of chaparral interspersed with redwood rain forest.

The trail is quite difficult to reach. You need to hike about a mile to the trail head (park at the Skyline Gate, and take the West Ridge Trail). Reaching the best part of the trail (between the Tres Sendas and Chown Trails) requires a several mile hike or run from the parking lot. The best way to access this part of the trail is to park at the Redwood Bowl parking lot, and take the West Ridge Trail to either Tres Sendas or Chown.

Like most alternative search engines, Stinky Teddy doesn't get much traffic. On an average day we get a few hundred searches on our site (Google handles about 1 billion searches per day worldwide). It doesn't help that our advertising, marketing, and public relations budget is $0. This is not strictly true - we once spent $40 on a Facebook advertising campaign, but that experience warrants a separate blog post.

We do, however, get an occasional surge of traffic. Somebody writes an article about us on their blog and we get a bunch of people checking out the site. Our first surge of traffic came in October, when Frederic Lardinois wrote a short piece on ReadWriteWeb entitled Stinky Teddy: A Cool Real-Time Search Engine with a Rather Odd Name. We didn't know the article was coming, and only noticed that it had been posted when our site crashed (we had a memory leak, since fixed). Before this ReadWriteWeb article we got no traffic whatsoever as we had not yet released the product.

Being a scientist, I couldn't help but use this ReadWriteWeb post as a chance to do an interesting study. The Internet Entrepreneur's dream scenario is the following:

1. Build a great product in secrecy.

2. Using PR, generate massive news coverage on the day of launch.

3. Go viral, with peer-to-peer messaging on social media leading to massive adoption.

This scenario never works for new search engines. Nonetheless, there are a bunch of people out there willing to try a new search engine, and positive news coverage is the way to get on their radar screen.

When it comes to planning for this glorious launch, however, there is one question that the Internet Entrepreneur wants answered that nobody will tell them: how much traffic will I get, and how long will it last? The study I performed using the ReadWriteWeb post addresses the "how long" question.

The basic premise behind the study was that this single ReadWriteWeb post was responsible for all traffic on Stinky Teddy for the next month. Our traffic before was nil, and we did absolutely no marketing or PR during this period. Therefore, any visitor or user on our site during that period was directly or indirectly related to the ReadWriteWeb post. For the first time, we were able to measure the "Impulse Response Function" of the web-blog-social media ecosystem. The "Impulse Response" is how a system responds to a sharp input signal (for a detailed discussion, read the paper). In this study, we measured the hourly/daily traffic on our site. That is the data we need to determine the impulse response. Here's the traffic in the 100 hours after publication:

This shows an interesting two-peak structure to the traffic. The first peak is obviously direct traffic from the ReadWriteWeb blog. We suspect the second peak is due to social media (e.g. Twitter sharing) and news readers (Google Reader, Netvibes, etc.). The second peak corresponds to 9 AM on the East Coast of the United States, so these are people checking yesterday's news when they arrive at work the next morning. We also looked at traffic on Stinky Teddy for the next 25 days:

Here we see something very interesting. In the web-blog-social media ecosystem, stories "ring" for a long time. Half the traffic attributable to the ReadWriteWeb article came more than 4 days after the article. Only 10% came during that initial 5 hour burst from the ReadWriteWeb page.

This is a one-time only experiment. We've had several other momentary spikes in traffic, but only for this period in October through November could we definitively attribute all of the traffic back to a single source. It would be interesting if others repeated this study to see if what we observe is universal. Our main findings are:

Only 10% of traffic eventually generated by the blog post came via early direct clickthroughs from the ReadWriteWeb home page.

There is a two-peak structure in the traffic during the first 24 hours, with the second peak likely associated with "first thing in the morning" readers of yesterday's news through social media sharing or readers.

Half of the traffic (from both direct and indirect sources) came four or more days after the article was posted.
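The half-life calculation behind these findings is simple to reproduce. The hourly counts below are made up purely to illustrate the shape (a sharp burst plus a long tail); our real numbers are in the plots above:

```python
# Sketch of the "half the traffic" calculation on hypothetical hourly
# traffic counts. These numbers are invented to show the shape of the
# curve, not our actual measurements.
def hours_to_half(traffic):
    """Return the hour index at which half the total traffic has arrived."""
    total = sum(traffic)
    running = 0
    for hour, visits in enumerate(traffic):
        running += visits
        if running >= total / 2:
            return hour
    return len(traffic)

# A sharp initial burst followed by a long, slowly decaying tail.
burst = [500, 300, 200, 100, 80]
tail = [40] * 600          # ~25 days of low but persistent traffic
traffic = burst + tail

print(hours_to_half(traffic))  # 290 hours: the midpoint lands deep in the tail
```

With these toy numbers the halfway point arrives around hour 290 (about 12 days in), well past the initial burst -- the same qualitative behavior we measured.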

To my great delight and surprise, my grant proposal to the Knight Foundation was selected for the next level of review. The proposal aims to create a performance royalty system for online journalism. Today I submitted the Full Proposal. Please have a look and post comments on their web site.

After the proposal passed the preliminary round of review, I started to seriously investigate the legal issues involved. I contacted several lawyers, both to get some insight and also to line up future collaborators. My proposal, at its essence, involves charging search engines to index and cache web content. The performance royalty idea is just a fair way of determining the proper distribution of payments to the various news providers and journalists.

The legal issue involved is "fair use". From my non-lawyer understanding, fair use means that I am allowed to reproduce a small passage of a copyrighted work under certain conditions. To determine if an action is legal under fair use, one must use a balancing test. Factors include the purpose of the use, the nature of the work, whether the use impacts the value of the copyrighted work, and the amount of the excerpt compared to the whole.

Somewhere in my academic training, I learned that fair use with regards to text can be mostly captured by the "three sentence rule". You are allowed to quote three sentences and be safe. Search Engines generally follow the three sentence rule (snippets shown on search results pages are never more than three sentences long). Based on this simple rule, search engines have argued that they should never need to pay to link. I agree for the most part. The Internet is all about linking, and charging to link and quote others would be disastrous for the Internet.

It's a bit more complicated, however, in the case of search engines. In order to generate a snippet, a search engine must cache the entire content of the document. The document might not be cached in its original form, but the entire document is cached in a derivative form. The snippet is generated in response to a user query -- that's why the cache is necessary.
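To see why the full cache is needed, here is an illustrative sketch (not any real engine's implementation) of query-time snippet generation that stays within the informal "three sentence rule". The snippet depends on the query, so it cannot be precomputed without keeping the whole document around:

```python
import re

# Illustrative only: build a query-time snippet from a cached document,
# limited to three sentences per the informal "three sentence rule".
def snippet(cached_doc, query, max_sentences=3):
    # Split the cached text into sentences at terminal punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', cached_doc)
    terms = set(query.lower().split())
    # Keep sentences that mention a query term, up to the limit.
    hits = [s for s in sentences if terms & set(s.lower().split())]
    return " ".join(hits[:max_sentences])

doc = ("The French Trail hugs the eastern ridge. It passes redwood groves. "
       "The trail has several micro-climates. Parking is at Skyline Gate.")
print(snippet(doc, "redwood trail"))
```

Different queries against the same cached document yield different snippets, which is exactly why the engine must store the full text in some form.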

The interesting question is whether I, as a web site publisher, automatically authorize the automated caching of my copyrighted content once I stick it on the web without password protection. Does allowing people to read my web page also give a search engine crawler the right to read my web page and store its findings? As far as I can tell, the answer to this question is unclear given the current state of the law. During my recent research, I was pointed to an excellent editorial by Bruce Brown and Bruce Sanford on this very subject of Fair Use and Search Engines.