SEO Black Hat has a great post and tool up today that extrapolates clickshare percentages by ranking position from the leaked AOL search data. Clickshare is the percentage of total search volume that clicks on a particular position in the SERPs. According to his data, 41.1% of all searchers click on the first result.

I went ahead and graphed the distribution curve of these numbers, along with a modified cumulative percentage curve, and it looks like our much-beloved Pareto curve. I say "modified" cumulative curve because the first 10 positions only get a little over 80% of the total traffic, so the cumulative curve is based only on those searchers who click on a result in the top 10.
(the top line is the cumulative curve, and the bottom line the distribution)
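For concreteness, here's a minimal PHP sketch of how that modified cumulative curve is computed. Only the 41.1% first-position figure comes from the data above; the rest of the array is made-up placeholder values to illustrate the math, not real AOL numbers.

```php
<?php
// Clickshare per SERP position (percent of all searches).
// Only the 41.1% first-position figure comes from the post; the rest are
// placeholder values for illustration, not the actual AOL data.
$clickshare = array(41.1, 12.0, 8.5, 6.1, 4.9, 4.1, 3.4, 3.0, 2.8, 2.6);

$top10Total = array_sum($clickshare); // "a little over 80%" in the real data

$cumulative   = array();
$runningTotal = 0;
foreach ($clickshare as $index => $share) {
    $runningTotal += $share;
    // "Modified" cumulative share: percent of top-10 clicks, not of all searches.
    $cumulative[$index + 1] = round(($runningTotal / $top10Total) * 100, 1);
}

print_r($cumulative);
```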

The disproportionate amount of traffic the first and second positions get is a little surprising, and certainly makes it clear how important it is to not only be in the top 10, but in the top 2.

I mean, yeah, it's Microsoft and it could be better, but there are tons of mathematical fun to be had with Excel's formulas and graphs.

Over the past week or so I've been working on a table full of simple calculations designed to help me understand the economic patterns in search marketing. Here's what I came up with.
First, put yourself in the shoes of a merchant: you sell hats, and on average you make a $10 profit per sale through your website. You convert about 1% of your visitors into buyers, so each "average" visitor is essentially worth 10 cents to you. You want more traffic, so you've decided to create some new content for your site to build natural SEO traffic. You take the time to write and code the new pages, or you outsource the job; either way, let's say each new page of content costs you $100 to create and rank, on average (including copywriting, HTML, and linking). Each page is targeted at one keyword, and your site ranks very well, so you generally manage to get about a 1% "clickshare" of the total overall keyword traffic for the keyword you are optimizing (based on something like Wordtracker data). You would like to make your investment back in increased sales within a month, using the new traffic the pages will generate.

Using these givens, there is a formula: CostPerPage / KeywordsPerPage / ProfitPerVisitor / DaysToBreakEven / Clickshare = TrafficThreshold
So for our hypothetical situation, that is 100 / 1 / 0.10 / 30 / 0.01, or about 3,333.3.
This means that to break even in one month with these cost and profit numbers, each keyword you optimize for needs at least 3,334 total searches per day. This is your traffic threshold.
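Here's a minimal PHP sketch of that calculation, using the hypothetical hat-merchant numbers above:

```php
<?php
// Back-of-the-envelope traffic threshold, using the example numbers above.
$costPerPage      = 100;   // dollars to write, code, and link one page
$keywordsPerPage  = 1;     // keywords each page is optimized for
$profitPerVisitor = 0.10;  // dollars ($10 profit per sale * 1% conversion)
$daysToBreakEven  = 30;    // how long you're willing to wait
$clickshare       = 0.01;  // share of total keyword searches your page captures

// TrafficThreshold = CostPerPage / KeywordsPerPage / ProfitPerVisitor
//                    / DaysToBreakEven / Clickshare
$trafficThreshold = $costPerPage / $keywordsPerPage / $profitPerVisitor
                  / $daysToBreakEven / $clickshare;

echo round($trafficThreshold, 1); // 3333.3 searches per day, per keyword
```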

There are lots of ways to improve this number: optimizing each page for more than one keyword, increasing your profit per visitor (by raising your average profit per sale or your conversion rate), being willing to wait a bit longer to make your money back, or reducing your cost to create new pages. The numbers will change, but the formula remains the same.

But let's look at how each individual factor controls your threshold.
First, cost per page: starting at $1,000 per page and going down to $10, your profitable traffic threshold per keyword looks like this:
As you increase the number of days you're willing to wait to recoup your initial investment, the graph looks like this:
By optimizing for more keywords per page, you can bring your threshold down along this curve:
And finally, by increasing your profit per visitor, you do this to the shape:
One initial observation is that the cheaper you can make your pages, the less traffic you need per keyword to justify the expense; it's a purely Zipf-like curve, with no point of diminishing returns. On the other hand, with profit per visitor, keywords per page, and days to recoup, there is clearly a point of diminishing returns past which improvements to your numbers are no longer low-hanging fruit. For each of these three metrics there is a sweet spot where you can minimize your traffic threshold with the least amount of effort.
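To see the diminishing returns for yourself, here's a small illustrative PHP sweep over profit per visitor, with the other factors held at the hypothetical numbers above; the range and step size are arbitrary placeholders, and the same function can be swept over any of the other factors.

```php
<?php
// Sweep one factor at a time to find the sweet spot.
// The fixed numbers are the hypothetical hat-merchant figures from above.
function trafficThreshold($costPerPage, $keywordsPerPage, $profitPerVisitor,
                          $daysToBreakEven, $clickshare)
{
    return $costPerPage / $keywordsPerPage / $profitPerVisitor
         / $daysToBreakEven / $clickshare;
}

// Profit per visitor from 5 cents up to 50 cents, in 5-cent steps.
foreach (range(0.05, 0.50, 0.05) as $profit) {
    printf("profit/visitor \$%.2f => threshold %.0f searches/day\n",
           $profit, trafficThreshold(100, 1, $profit, 30, 0.01));
}
```

Each additional nickel of profit per visitor shaves less off the threshold than the one before it, which is the flattening you see in the graph.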

The point of trying to minimize your traffic threshold is to make your business model nimble enough to squeeze way down into the tail of your niche and take advantage of as much of the available keyword traffic as possible.

Seth hits the nail right on the head with this. When I’m deciding what links to post here, I’m essentially curating ideas, collecting them to “send” to you (and to myself, in a way). And unconsciously, these seven points factor into my decision on what to post here.

Kottke essentially agrees with Godin in his list of factors. We should all be thinking about these points when creating any content we’d like people to link to.

A while back I linked to a demo of a script I wrote implementing the Unix command tail (for watching data being appended to a file) so I could tail a log file. I finally got around to posting the source code.
You'll need saja (the secure AJAX for PHP framework), my saja.functions.php file, and the actual output page, tail.php.
As with most of my code, it's icky and hackish, but it works. For me, at least.
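If you just want the general idea without the saja plumbing, here's a rough, self-contained PHP sketch of the approach (an illustration, not the posted source): keep a byte offset and, on each poll, return whatever has been appended to the file since the last check. The log path is just a placeholder.

```php
<?php
// Rough sketch of the tail idea: remember how far into the file we've read,
// and on each poll return only the bytes appended since then.
function tailOnce($path, &$offset)
{
    clearstatcache(); // filesize() results are cached, so reset the cache
    $size = filesize($path);
    if ($size <= $offset) {
        return ''; // nothing new (or the file was truncated/rotated)
    }
    $fh = fopen($path, 'r');
    fseek($fh, $offset);
    $newData = fread($fh, $size - $offset);
    fclose($fh);
    $offset = $size;
    return $newData;
}

// Poll a log file once a second and print anything new.
// The path is a placeholder; point it at whatever log you're watching.
$offset = 0;
while (true) {
    echo tailOnce('/var/log/apache2/access.log', $offset);
    sleep(1);
}
```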

For a corpus to use in an as-yet-unnamed project I'm working on, I've been struggling with the unwieldy Wikipedia XML dump.
1.4GB of pure XML wiki content. It's a huge pain to import, however, since SQL dumps are no longer directly released. I had to install MediaWiki's (the software that Wikipedia runs on) database structure (in the source code it's in maintenance/tables.sql), then run a Java program called mwdumper to create an enormous SQL file. All of that didn't take very long; what's taking a while now is actually importing that SQL file.

In a previous post I talked about the short head and long tail of keyword traffic. The 80/20 rule doesn't quite apply, but some more general 20-40/80 rule does. Even though multiple, more targeted keyword phrases don't cost the searcher anything, keyword traffic still roughly follows a Pareto curve.
In his book, Chris Anderson asserts that the 80/20 rule is enforced by the forces of economics. In the music industry, the risks and costs to music companies define what becomes a "hit." In keywords, the only forces driving toward a Pareto distribution are similarities in the way people express themselves. It's only a small slice of expression, a few words used to describe something we want, but the head-and-tail distribution suggests we all communicate in strikingly similar ways. Or at least a lot of us do.
Building on my new PHP POS tagger and my PHP ngram tokenizer, I plan to study Wikipedia as a corpus to derive some data about the distribution curves of the most popular ngrams and how they can relate to keyword selection.
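As a rough illustration of the kind of frequency data a corpus study like this produces (a toy example, not the actual tokenizer linked above), here's a tiny PHP word-level ngram counter:

```php
<?php
// Toy word-level ngram counter: split text into words, slide a window of
// size $n across it, and tally how often each ngram appears.
function countNgrams($text, $n = 2)
{
    $words  = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    for ($i = 0; $i <= count($words) - $n; $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        $counts[$gram] = isset($counts[$gram]) ? $counts[$gram] + 1 : 1;
    }
    arsort($counts); // most frequent ngrams first
    return $counts;
}

print_r(countNgrams('the quick brown fox jumps over the quick brown dog', 2));
```

Run over a corpus the size of Wikipedia, the sorted counts are what would trace out the head-and-tail distribution curve.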
