Public Statistics

With the release of Spinn3r 3.0, we have decided to share key statistics on our
crawler with the public, including language breakdown data, content management
distribution, and feed performance.

These stats are updated live around the clock and recomputed on the fly.
Additional metrics are available to our paying customers through our admin
console.

Language Breakdown

Breakdown of language across all content in the blogosphere. This is measured
by detecting the language of each post from its content (rather than relying on
the configured language for the weblog, which might be incorrect).

Posts with an 'unknown' language are almost always short posts of fewer than
about 200 characters.
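
As a rough illustration of content-based detection, here is a minimal sketch.
It is not Spinn3r's actual detector: the stopword lists, the function names, and
the hard 200-character cutoff are all assumptions made for the example. It shows
only the general idea of scoring a post's text and reporting 'unknown' when
there is too little to go on.

```python
import re
from collections import Counter

# Hypothetical stopword lists; a real detector would use far richer models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}

# Assumed cutoff, borrowed from the "about 200 characters" remark above.
MIN_CHARS = 200


def detect_language(text):
    """Guess a post's language from its content, or return 'unknown'."""
    if len(text) < MIN_CHARS:
        return "unknown"
    # Split into lowercase runs of letters (Unicode-aware, no digits).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in stop)
        for lang, stop in STOPWORDS.items()
    }
    best, hits = max(scores.items(), key=lambda pair: pair[1])
    return best if hits > 0 else "unknown"


def language_breakdown(posts):
    """Count posts per detected language."""
    return Counter(detect_language(post) for post in posts)
```

Because the configured weblog language is never consulted, a Spanish post on a
weblog configured as English would still be counted under Spanish.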

Weblog Hosting Provider Breakdown

Breakdown of posts across major weblog hosting providers. We separate these
from other hosting providers that, while they might host weblogs, might not
have users actively participating in the core of the blogosphere.

Right now this is implemented with link pattern matching. For example, a blog
is identified as WordPress if its URL contains 'wordpress.com'. Of course,
this method is prone to error: it does not correctly identify Movable Type,
WordPress, or any other standalone blog hosted on its own URL.

Further, TypePad.com is probably undercounted as well, as most users there
use domain masking.

We expect to release a fix for this soon that produces more accurate values
for these blog hosts.
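
The link pattern matching described above can be sketched roughly as follows.
This is an illustration under assumptions, not the production code: the
function name and the pattern table are hypothetical, only 'wordpress.com' is
taken from the text, and the sketch matches on the host name suffix rather than
anywhere in the URL.

```python
from urllib.parse import urlparse

# Illustrative pattern table; only 'wordpress.com' comes from the text above.
PROVIDER_PATTERNS = {
    "WordPress": "wordpress.com",
    "TypePad": "typepad.com",
    "Blogger": "blogspot.com",
}


def classify_provider(url):
    """Map a post URL to a hosting provider by its host name, else 'other'."""
    # urlparse only yields a hostname when the URL includes a scheme.
    host = (urlparse(url).hostname or "").lower()
    for provider, pattern in PROVIDER_PATTERNS.items():
        if host == pattern or host.endswith("." + pattern):
            return provider
    return "other"
```

A self-hosted WordPress blog at a URL such as http://blog.example.com/ falls
through to 'other' here, which is exactly the kind of miss noted above. The same
happens to a hosted blog that is served under a masked domain.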

ALL Hosting Provider Breakdown

All weblog hosts, including MSN Live Spaces, MySpace, and LiveJournal.
Sites on these services might not qualify as traditional blogs, so we decided
to break them out into a dedicated metric.

Feed Performance

Raw number of RSS and Atom feeds being indexed by Spinn3r. This is directly
correlated with the number of posts Spinn3r sees in both the permalink API and
the feed API. This number might fluctuate with the raw ping rate at any given
time, as well as with the weekly update cycle for pinged feeds.

Feed Content Performance

Number of feed items per hour indexed and available through the feed API. You
might see less content from the raw API, because the API leaves out posts that
have been registered by one of our customers but not yet approved as non-spam
content.
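
To make the gap between indexed items and API-visible items concrete, here is a
small sketch of an hourly count. The item structure, the key names 'indexed_at'
and 'approved', and the function name are hypothetical, chosen only for this
example.

```python
from collections import Counter
from datetime import datetime, timezone


def items_per_hour(items, approved_only=False):
    """Count feed items per UTC hour.

    Each item is a dict with two hypothetical keys:
      'indexed_at': a timezone-aware datetime
      'approved':   False while the post awaits non-spam approval
    """
    counts = Counter()
    for item in items:
        if approved_only and not item["approved"]:
            continue  # withheld from the API until approved
        # Truncate the timestamp to the start of its UTC hour.
        hour = item["indexed_at"].astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        counts[hour] += 1
    return counts
```

Calling the function with approved_only=True on the same items gives counts
that are never higher than the default, mirroring why the raw API can show less
content than the number indexed.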