Blekko’s new tool lets you find deep information from the Internet dataset

I find myself asking odd questions about the Internet all the time. Things like “do websites that use popover ads perform better than those that use popunders?” shouldn’t enter most people’s daily conversations, but they certainly have a place in mine. Unfortunately, finding deep data like that is often difficult at best.

Blekko is looking to change how we think of the Internet, and what we can learn from it, with a new tool called WebGrepper that lets you find not only high-level data but also the minutiae that lie beneath the macro scale. What sort of information? According to a blog post from Blekko, just about anything:

One of the advantages of having your own search engine is that you have access to all sorts of data that no one else does. Really, really cool data. And when you tell your friends you have all this data, you get lots of people asking you for stuff. Interesting questions like: “Can you give me a list of every site that uses Facebook Connect? In rank order?” Or: “Can you send me a list of sites that have the Google +1 button on them?”

For those unfamiliar, grep is a command-line search tool developed for Unix systems. Its advantage is that it can be contextual. While most searches could find a word in a document, they would find that word in every document. A grep, paired with other Unix tools, could find that word in a specific document, modified at a certain time. When you’re talking about the huge dataset that is the Internet, being able to whittle down these results is incredibly important.
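To make the idea concrete, here is a minimal sketch of a grep-style search in Python. It is not Blekko’s code: the `grep` function, the sample page, and the `connect\.facebook\.net` pattern are all assumptions chosen for illustration. It shows the core of what grep does, which is to test every line of a text against a pattern and report the lines that match.

```python
import re


def grep(pattern, text):
    """Return (line_number, line) pairs for every line of text matching pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Hypothetical page source, invented for this example.
page = """<html>
<head><script src="https://connect.facebook.net/en_US/all.js"></script></head>
<body>Hello</body>
</html>"""

# Illustrative pattern: look for a reference to a Facebook script host.
matches = grep(r"connect\.facebook\.net", page)
for number, line in matches:
    print(f"{number}: {line}")
```

Swapping in a different pattern, such as one for a +1 button snippet, changes the question without changing the search itself.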

This falls in line with how Blekko works on the whole. The search site is set up with what it calls slashtags, which are essentially operators that allow you to trim down your search results into more usable information. WebGrepper’s grep function is possible because Blekko runs a twice-daily “mapping job” across its 4 billion indexed pages, for results that are popular.
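Blekko has not published how its mapping job works, so the following is only a rough sketch of the general shape such a job might take. Everything in it is an assumption: the `Page` record, its `rank` score, the `web_grep` function, and the sample data are all hypothetical. It scans every indexed page for a pattern, keeps the hosts that match, and returns them in rank order, capped at a limit.

```python
import re
from typing import NamedTuple


class Page(NamedTuple):
    host: str
    rank: float  # hypothetical popularity score; higher means more popular
    html: str


def web_grep(pages, pattern, limit=500):
    """Return up to `limit` hosts whose pages match pattern, highest rank first."""
    regex = re.compile(pattern)
    best_rank = {}
    for page in pages:
        if regex.search(page.html):
            # Keep the best score seen for each host.
            previous = best_rank.get(page.host, float("-inf"))
            best_rank[page.host] = max(page.rank, previous)
    ranked = sorted(best_rank.items(), key=lambda item: item[1], reverse=True)
    return [host for host, _ in ranked[:limit]]


# A tiny stand-in for an index; a real one would hold billions of pages.
pages = [
    Page("news.example", 0.9, '<script src="https://connect.facebook.net/en_US/all.js"></script>'),
    Page("blog.example", 0.4, "<p>no widgets here</p>"),
    Page("shop.example", 0.7, '<script src="https://connect.facebook.net/en_US/all.js"></script>'),
]

top_sites = web_grep(pages, r"connect\.facebook\.net")
print(top_sites)
```

Even in this toy form, every query touches every page, which hints at why a job like this is expensive to run across a full index.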

The inherent problem here is that the process doesn’t scale well enough for everyone to use it to find any information they want. As Blekko explains in the post:

Got a grep you want to run? Submit it here. If enough people agree with you that this grep is interesting (by voting it up), we’ll run it. And we’ll post the results here. We make the top 500 results for every grep available for free to anyone who wants it.

It’s a bit limited in its use at this point. Not that anyone can (or should) blame Blekko for that limitation. Running queries like this has to put a heavy strain on its search system, so using the voting method to pick interesting queries makes perfect sense.

In short, it’s a great tool, but limited in its use for now. Results that we saw for popular blogging platforms don’t seem to make sense based on the publicly stated numbers that we know. There might very well be an explanation for this, but you’ll need to take all results as a “your mileage may vary” situation for now.