Public Speaker, New Media Expert, Thanet's leading independent journalist, SEO, Web Developer and business to Internet interface expert. Also a father, a geek, a blogger and friendly face. An Agent of Change.

When I write that I feel it is easy to beat Google at the search game you might think I am crazy, stupid or a genius. You would be unlikely to think I was talking about you doing it. That, though, is exactly what I am talking about.

Do you remember that a while ago I showed how some basic maths could be used to give a reasonable approximation of the median number of back links needed for any given Google Page Rank? The approximation process was actually very crude but it was something any 16 year old should have been capable of. So I was given to wondering why some 16 year old (or "other" year old for that matter) had not come up with a similar process before. The only answer I could find was that people are not given to thinking about these things as much as I am.

Maybe the general population has given it no thought or perhaps my understanding of Statistics is different to that of others.

Today's post is going to lay the groundwork for search engine style analysis of the web in order to stimulate better search algorithm development in the Open Source (and general geek) community. I see no reason why an open search algorithm could not be more efficient and relevant even with all the SEOs looking at it. All that is required is to assume that everyone is putting on their best face anyway.

Now let us look at some statistical basics. Don't worry, this is not going to get complicated. It is going to get a bit technical, but if you have read this far that should not bother you.

I'd like you to meet my good friends Arithmetic Mean and Standard Deviation. Now I would like to introduce the Normal Distribution, also known as the bell curve. It assumes that most of the population we are examining sits near the mean (the peak of the bell), with a steep but ongoing fall-off as we move away from the mean in either direction. The curve may or may not be skewed, depending on whether the population actually lumps up towards one end or the other.

For the purpose of this we shall assume a normal distribution. What we are going to do is create an outline for a multi-sourced analysis of blogs to create an accurate and relevant "top one hundred" that actually gets more accurate, and harder to game, the more blogs and (independent) metrics are considered (the larger the population). This works because larger samples produce better estimates of the mean and standard deviation.

It is beyond the scope of this first article to come up with a fully complete set of metrics, so let us consider a set of common ones: Alexa Rank, Technorati Rank, Technorati Authority and Google Page Rank. Most of these are gathered and developed as separate entities, which for our purposes makes them more or less independent. There is a deliberate problem with this choice, and overcoming it will form part of the learning process.

So let us turn that into a preparation for code.

Let us consider Alexa Rank and Google Page Rank. Page Rank runs from 0 to 10 and Alexa Rank from infinity (actually just very, very big) down to 1. We need a way to fit these values to the same scale so that they can be used in a meaningful calculation.

This is where average and deviation come in. With a mean and a standard deviation calculated, we have a way of making Page Rank, Alexa Rank and anything else fit on a scale from around -4 to +4. This is the number of standard deviations from the mean, with a sign to indicate whether the value falls before or after the mean (below average or above). A directional measure of standard deviations from the mean.
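That directional measure can be sketched in a few lines of Python. This is a minimal illustration, not production code; the function name and the sample populations are my own invention for the example.

```python
import statistics

def directional_z_score(value, population, direction=1):
    """Convert a raw metric value into signed standard deviations
    from the population mean, multiplied by the metric's direction
    (+1 if bigger is better, -1 if smaller is better, e.g. Alexa Rank)."""
    mean = statistics.mean(population)
    stdev = statistics.pstdev(population)  # population standard deviation
    return direction * (value - mean) / stdev

# A Page Rank of 5 in a population clustered around 3 scores well:
page_ranks = [3, 3, 2, 4, 3, 5, 3, 2]
print(round(directional_z_score(5, page_ranks), 3))  # → 2.023
```

The same function handles Alexa Rank by passing `direction=-1`, which flips the sign so that a low (good) Alexa value produces a positive score.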

All that remains is to multiply by -1 or +1 to get the scales all running in the same direction. For our purposes we shall say that up is +1 and down is -1. The result is that a number times +1 is unchanged, but a number times -1 changes sign, so -2 times -1 is +2 and +4 times -1 is -4. This gets our numbers running in the same direction.

Now most values will be something like 0.941, 1.021, 1.325 and so on. Regardless of the original scale of the metric, we have reduced it to a measure of how the individual compares with the population sample. In a population where all blogs are Page Rank 3, a blog ranked 5 will score very well, but in a population where all blogs are ranked 9 it will have a very low score.

The same goes for Alexa Rank - the original range is ignored and a better measure is created of how far from the average an item is. From this we can compare metrics against each other on a like-for-like basis. This can be used to locate anomalies within a niche. These unusual differences can indicate someone tricking the system, the rise of a new leader, Google messing with the PR, or anything else. It simply indicates that something is different. That sort of thing I leave as an exercise for the advanced SEO; I might pick up on it later.

We will need to store our data, so I am going to assume we shall use a database. The first table we shall address is the blog. To avoid worrying about normalisation and natural versus artificial keys, we shall assume the BlogURL is unique enough to form the key.

So we have a table called "blogs" with a field called "BlogURL". You can add as many other fields as you wish, but for the purposes of our theory this is all the data we need. This is a primary key (a unique, indexed value), which means the database can select on this column very quickly and efficiently. It also means it can be used to indicate a relationship between this table and others. A sort of anchor.

Now we are going to want to store our calculations. Normally you should not store calculated data but recalculate it as needed; however, this is a nod to efficiency and acts as a historic cache of changes. Let us think of each row of a table giving the top 100 as a record in our database.

Table called "rank" that stores "BlogURL", "rank", "oldrank" and "calc-flag" from this we can output the blogs ordered by rank and show the URL and any other data from the blog table (via an innerjoin) along with change and previous rank. The calc-flag field is used to filter our results to only this specific calculation of our top 100.

To set the calc-flag value we need a calc-session table. This, like the blogs table, serves only to give a foundation. "calc-session" has "calc-flag" and "timestamp". This background table keeps track of the code we are using to indicate the table we are building. The timestamp allows us to put the week and year of calculation on our chart archives (should we make them).
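The three tables described so far can be sketched as a schema. This is one possible rendering using Python's built-in `sqlite3` (the article does not prescribe an engine); note that the hyphenated names like "calc-flag" become underscored, since unquoted SQL identifiers cannot contain hyphens.

```python
import sqlite3

# In-memory database purely for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE blogs (
    BlogURL TEXT PRIMARY KEY          -- unique enough to act as the key
);
CREATE TABLE rank (
    BlogURL   TEXT REFERENCES blogs(BlogURL),
    rank      INTEGER,
    oldrank   INTEGER,
    calc_flag INTEGER                 -- ties a row to one calculation run
);
CREATE TABLE calc_session (
    calc_flag INTEGER PRIMARY KEY,
    timestamp TEXT                    -- week/year for chart archives
);
""")
db.execute("INSERT INTO blogs VALUES ('http://example.com/blog')")
print(db.execute("SELECT COUNT(*) FROM blogs").fetchone()[0])  # → 1
```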

I've saved this until last because this is where things get interesting. For each metric we will want a class or code object that will process the data. Each of these has a lot in common, so we can create a template for these objects before we create them. In languages such as C++, PHP or Ruby this template forms a base class that can be extended by the specialist sections.
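In Python the template might look like the following. The class names, the `fetch` method and the returned values are all hypothetical placeholders; a real implementation would query each service.

```python
class Metric:
    """Template shared by every metric processor. Subclasses supply
    the metric's name, its direction, and how to fetch a raw value."""
    name = "base"
    direction = +1  # -1 for metrics where smaller is better

    def fetch(self, blog_url):
        raise NotImplementedError

class PageRankMetric(Metric):
    name = "pagerank"
    direction = +1  # higher Page Rank is better

    def fetch(self, blog_url):
        return 3  # hypothetical: a real version would query Google

class AlexaMetric(Metric):
    name = "alexa"
    direction = -1  # lower Alexa Rank is better

    def fetch(self, blog_url):
        return 250000  # hypothetical: a real version would query Alexa
```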

For each metric the calculation is carried out in three passes. The first pass takes a list of blogs from the blogs table and creates an empty calculations record for each blog (for added safety one could first drop any records for the metric that already exist but lack values (orphans)). This gives us a working canvas to fill with values. As we fill the values we set "calc-flag-sum" with the current calc-flag value from the calc-session table for this session.

We can now iterate through the records where calc-flag-sum is not yet set. We do this so that we can stop gathering data and come back to it. There are many reasons for this, including multiple servers doing the job, limited processing time for the script or limited server resources. But trust me, when you start to chunk through heavy data you need a way to leave a bookmark and come back.

So if interrupted we continue where we left off until the raw values have been filled. Once filled we can calculate the mean and the standard deviation. These values we remember, and then we load a list of all records for the metric where "calc-flag-sum" is set to the current calc-flag but "calc-flag-process" is not set at all.

Now we iterate through the values in the final (third) pass. We set the "calc-flag-process" to the current calc-flag and the calc value to our calculated value. This value is always directional standard deviations from the mean times the metric's direction value (-1 or +1).

We use the flag system so that we can stop if we need to or have multiple servers doing the work (or whatever). Once done the metric is finished with for this cycle and can be powered down (or ignored) until next time.
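The three passes and the flag bookkeeping can be sketched in Python. Here the records live in a plain dictionary rather than a database table, and the `metric` object is assumed to have the `fetch` method and `direction` attribute described above; both are simplifications for illustration.

```python
import statistics

def run_metric(metric, blog_urls, calc_flag):
    # Pass 1: an empty record per blog (the working canvas).
    records = {url: {"raw": None, "calc": None,
                     "calc_flag_sum": None, "calc_flag_process": None}
               for url in blog_urls}

    # Pass 2: fill raw values. Checking the flag first lets an
    # interrupted run resume from wherever it stopped.
    for url, rec in records.items():
        if rec["calc_flag_sum"] == calc_flag:
            continue  # already done in an earlier chunk
        rec["raw"] = metric.fetch(url)
        rec["calc_flag_sum"] = calc_flag

    # Between passes: mean and standard deviation of the raw values.
    raws = [rec["raw"] for rec in records.values()]
    mean, stdev = statistics.mean(raws), statistics.pstdev(raws)

    # Pass 3: directional standard deviations from the mean.
    for rec in records.values():
        if rec["calc_flag_process"] == calc_flag:
            continue
        rec["calc"] = metric.direction * (rec["raw"] - mean) / stdev
        rec["calc_flag_process"] = calc_flag
    return records
```

In a real system each pass would read from and write back to the database between chunks, which is what makes the stop-and-resume behaviour possible.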

After each metric has pulled its values into the table and applied the -1/+1 adjustment to the standard-deviations measure, the task is done. Then the table of results can be built (including the calc-flag). In theory it can be done in one pass. If not, you will want to use the flag pattern to allow your system to rest.

The simplest way forward is to return the sum of all calc values for each blog where "calc-flag-sum" and "calc-flag-process" both equal the current calc-flag. This can then be stored in the rank table. The two last values to calculate are rank and old rank. The presence of these values acts as a flag, and that is all that is needed.
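The summation and ranking step might look like this in Python. The function name and the in-memory record layout are my own; the point is simply summing each blog's calc values across metrics and ordering the totals.

```python
def build_top_100(records_by_metric, calc_flag):
    """Sum each blog's calc values across metrics (only rows stamped
    with the current calc_flag), then rank blogs by that total."""
    totals = {}
    for records in records_by_metric.values():
        for url, rec in records.items():
            if (rec["calc_flag_sum"] == calc_flag
                    and rec["calc_flag_process"] == calc_flag):
                totals[url] = totals.get(url, 0.0) + rec["calc"]
    # Highest total gets rank 1; keep only the top one hundred.
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {url: position + 1 for position, url in enumerate(ordered[:100])}
```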

The new rank ("rank") value should be obvious, and the old rank value should be available via an inner join to the same table with a different WHERE value (the last calc-flag). In MySQL you will need to use aliases to achieve that, but it can be done in a few swift SQL commands. And then it is done: your new rank system has been calculated, and the power of the tried and tested statistical formulae means that as the population of blogs tested and metrics used increases, the results become more accurate.
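The alias trick looks much the same in any SQL dialect; here it is against SQLite via Python, with made-up blog URLs and calc-flags 1 (previous run) and 2 (current run):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE rank (BlogURL TEXT, rank INTEGER, calc_flag INTEGER)")
db.executemany("INSERT INTO rank VALUES (?, ?, ?)", [
    ("a.example", 1, 1), ("b.example", 2, 1),   # previous run (flag 1)
    ("a.example", 2, 2), ("b.example", 1, 2),   # current run (flag 2)
])
# Self-join with aliases: each current row alongside its previous rank.
rows = db.execute("""
    SELECT new.BlogURL, new.rank, old.rank AS oldrank
    FROM rank AS new
    LEFT JOIN rank AS old
        ON old.BlogURL = new.BlogURL AND old.calc_flag = 1
    WHERE new.calc_flag = 2
    ORDER BY new.rank
""").fetchall()
print(rows)  # → [('b.example', 1, 2), ('a.example', 2, 1)]
```

A LEFT JOIN rather than an inner join means a blog new to this week's chart still appears, just with no old rank.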

On a side note, a correlation coefficient (Spearman's Rank might work well) could be used to search for metrics that do not agree with the others. For a deeper test the Product Moment Correlation Coefficient could be used to measure rankings over a period of time. The result would be that you could identify metrics that are poor measures (or the only good measure).
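Spearman's rank correlation is simple enough to compute by hand. Here is a small implementation of the classic formula, assuming no tied values (ties need the averaged-rank variant):

```python
def spearman_rank_correlation(xs, ys):
    """Spearman's rho via rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    where d is the difference between each item's two ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for position, i in enumerate(order):
            r[i] = position + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Two metrics that order five blogs identically agree perfectly:
print(spearman_rank_correlation([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # → 1.0
```

A metric whose rho against the others sits near zero (or goes negative) is the sort of outlier the paragraph above is hunting for.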

I hope I have not overwhelmed you with what is a basic introduction to a methodology that could be used to make it hard to game a system even when completely open. I hope too that I have given you some food for thought and that some geeks like me might even produce working models of the system I have described. The future of search lies in the history of analysis.

Oh yes, P.S.: The deliberate problem is that the Alexa Rank value is always high for blogspot blogs. Not all metrics are created equal.