Before we discuss how to abuse it, it might be a good idea to define it within its context, ok?

Playground

First of all, Google announced these meta tags on the official Google News blog for a reason. So when you plan to abuse it with your countless MFA proxies of Yahoo Answers, you most probably jumped on the wrong band wagon. Google supports the meta elements below in Google News only.

syndication-source

The first new indexer hint is syndication-source. It’s meant to tell Google the permalink of a particular news story, hence the author and all the folks spreading the word are asked to use it to point to the one –and only one– URI considered the source:

The meta element above is for instances of the story served from
http://outerspace.com/breaking/page1.html
http://outerspace.com/yyyy-mm-dd/page2.html
http://outerspace.com/news/aliens-appreciate-google-hotpot.html
http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html
http://newspaper.com/main/breaking.html
http://tabloid.tv/rehashed/from/rss/hot:alien-pot-in-your-bong.html
…

Don’t confuse it with the cross-domain rel-canonical link element. It’s not about canning duplicate content, it marks a particular story, regardless whether it’s somewhat rewritten or just reprinted with a different headline. It tells Google News to use the original URI when the story can be crawled from different URIs on the author’s server, and when syndicated stories on other servers are so similar to the initial piece that Google News prefers to use the original (the latter is my educated guess).

original-source

The second new indexer hint is original-source. It’s meant to tell Google the origin of the news itself, so the author/enterprise digging it out of the mud, as well as all the folks using it later on, are asked to declare who broke the story:

Say we’ve got two or more related news, like “Google fell from Mars” by cnn.com and “Google landed in Mountain View” by sfgate.com, it makes sense for latimes.com to publish a piece like “Google fell from Mars and landed in Mountain View”. Because latimes.com is a serious newspaper, they credit their sources not only with a mention or even embedded links, they do it machine-readable, too:

It’s a matter of course that both cnn.com and sfgate.com provide such an original-source meta element on their pages, in addition to the syndication-source meta element, both pointing to their very own coverage.

If a journalist grabbed his breaking news from a secondary source telling “CNN reported five minutes ago that Google’s mothership started from Venus, and the LA Times spotted it crashing on Jupiter”, he can’t be bothered with looking at the markup and locating those meta elements in the head section, he has a deadline for his piece “Why Web search left Planet Earth”. It’s just fine with Google News when he puts

Fine-prints

At this time, Google News will not make any changes to article ranking based on this tags.

If we detect that a site is using these metatags inaccurately (e.g., only to promote their own content), we’ll reduce the importance we assign to their metatags. And, as always, we reserve the right to remove a site from Google News if, for example, we determine it to be spammy.

As with any other publisher-supplied metadata, we will be taking steps to ensure the integrity and reliability of this information.

It’s a field test

We think it is a promising method for detecting originality among a diverse set of news articles, but we won’t know for sure until we’ve seen a lot of data. By releasing this tag, we’re asking publishers to participate in an experiment that we hope will improve Google News and, ultimately, online journalism. […] Eventually, if we believe they prove useful, these tags will be incorporated among the many other signals that go into ranking and grouping articles in Google News. For now, syndication-source will only be used to distinguish among groups of duplicate identical articles, while original-source is only being studied and will not factor into ranking. [emphasis mine]

Spam potential

Well, we do know that Google Web search has a spam problem, IOW even a few so-1999-webspam-tactics still work to some extent. So we tend to classify a vague threat like “If we find sites abusing these tags, we may […] remove [those] from Google News entirely” as FUD, and spam away. Common sense and experience tells us that a smart marketer will make money from everything spammable.

But: we’re not talking about Web search. Google News is a clearly laid out environment. There are only so many sites covered by Google News. Even if Google wouldn’t be able to develop algos analyzing all source attribution attributes out there, they do have the resources to identify abuse using manpower alone. Most probably they will do both.

They clearly told us that they will compare those meta data to other signals. And that’s not only very weak indicators like “timestamp first crawled” or “first heard of via pubsubhubbub”. It’s not that hard to isolate particular news, gather each occurrence as well as source mentions within, and arrange those on a time line with clickable links for QC folks who most certainly will identify the actual source. Even a few spot tests daily will soon reveal the sites whose source attribution meta tags are questionable, or even spammy.

If you’re still not convinced, fair enough. Go spam away. Once you’ve lost your entry on the whitelist, your free traffic from Google News, as well as from news-one-box results on conventional SERPs, is toast.

Last but not least, a fair warning

Now, if you still want to use source attribution meta elements on your non-newsworthy MFA sites to claim owership of your scraped content, feel free to do so. Most probably Matt’s team will appreciate just another “I’m spamming Google” signal.

Not that reprinting scraped content is considered shady any more: even a former president does it shamelessly. It’s just the almighty Google in all of its evilness that penalizes you for considering all on-line content public domain.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I’d be the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

Login to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.

On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.

Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.

Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_sourceutm_mediumutm_campaignutm_contentutm_term) the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo didn’t index GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cuttssaid “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Lets forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless high sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, lets look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last questionmark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparision can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools feeded by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, respecively now acting counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds –I really mean that, literally!–, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion of webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.

Perhaps this procedure would be even more accurate, because link drops can no longer mimick feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.

Jump station

Not that the recent –and ongoing– paradigm shift shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.

Background

For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion of sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion of computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift puts your site at risk?

As a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s bad news of yesterday. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.

Bad news is, your medieval palisade won’t protect you from a 21th century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidently’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, and be it only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines –in good company with legions of Web services– will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that rise all sorts of red flags). Also, it’s not desirable to make social media services –that you rely on for marketing purposes– too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).

So what can you do to prevent your feeds from unwanted content?

Before I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might haven’t considered yet:

No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can donate it a nice –maybe even meaningful– URI.

No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.

Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section. Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.

Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless whether you yell “noindex” at them or not.

For all burned (disapproved) URIs that were in use ensure that your server returns a 410-Gone HTTP status code, respectively perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.

Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk throughs, for example by asking for something unusual when a step gets completed in a too short period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.

There’s more. With the above said, I’ve just begun to scrape the surface of a savvy spammer’s technical portfolio. There’s next to nothing a clever programmed bot can’t mimick. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.

Having said that, lets proceed to feed protection tactics. Actually, there’s just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period, whereas it will not be included in your site’s feeds. Regardless who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, respectively ideas which could inspire you:

Stuck content submissions from new users who didn’t participate in other ways in quarantaine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.

On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.

Maintain white lists, too. That’s a great way to reduce the amount of inavoidable false positives.

If you have editorial staff or moderators, they should get a Release to Feed button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10″.

Categorize your user’s reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.

Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.

With all automatted releases, for example based on votings, oops, especially based on votings, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votings, and so on.

Disclaimer: That’s just some food for thoughts. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential visitors customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There’re still way too much playgrounds for deceptive tactics involving shortened URIs …

How should a search engine handle a shortened URI?

Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve seen even a few of them making use of meta refreshs and client sided redirects. Search engines can detect all those procedures.

Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).

Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.

URI length: the length of an URI string, if less or equal 20 characters, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless the method of redirection. All the link juice belongs to the destination page.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Oh what a mess. The candidate from Redmond fails totally on understanding the HTTP protocol. Their search index is flooded with a bazillion of URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [site:tinyurl.com], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener, that could get improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailled information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactis I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google –under particular circumstances– indexes shortened URIs that game Googlebot, having sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us getting rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t getabused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the livelihood of these utterly useless scumbags.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put and end to shortened URIs. Now!

Today I’ve removed all instances of the thunderbird icon from my computers, and from my memory as well. I’m finally done with email. I’ve forwarded1) all my email accounts to paid-links@google.com, and here’s why:

Sebastian’s Pamphlets

Dear Sebastian,

I visited your web site earlier today and it seems you are also a seo company like us. As an SEO company we are in this field since 1998 in India(CHD). We have developed and maintained high quality websites.

We understand link building better than other because of our 11 year experience in linking industry and we follows the right manual link building approach in seeking, obtaining and attracting topic specific trusted inbound links. We have different themes related sites, directories and blogs and i would like to make a request to enter a mutual understanding by EXCHANGING LINKS with your website in order to get targeted visitors, higher ranking and link popularity.

We look forward to linking our site with yours, as exchanging links would Benefit both of us.

You\’ve received this email simply because you have been found while searching for related sites in Google, MSN and Yahoo If you do not wish to receive future emails, simply reply with this email and let us know.

Waiting for your positive and quick response.

PLEASE NOTE THAT THIS IS NOT A SPAM OR AUTOMATED EMAIL, IT\’S ONLY A REQUEST FOR A LINK EXCHANGE. YOUR EMAIL ADDRESS HAS NOT BEEN ADDED TO ANY LISTS, AND YOU WILL NOT BE CONTACTED AGAIN.

Direct message from Spamdiggalot

Spamdiggalot: hi!I think you should like my article “12 addons to get the most out of safer-sex”, here: digg.com/x010101 please RT!

Reply on the web at http://twitter.com/direct_messages/create/Spamdiggalot

Send me a direct message from your phone: D SPAMDIGGALOT

our company proposal

Dear Sebastian Pamphlets,

My name is Vincentas and I am member of board in multi-location hosting company - Host1Plus (http://www.host1plus.com). Our servers are in U.S., U.K., the Netherlands, Germany, Lithuania and Singapore.

I just visited your website which I found interested and it provides excellent complementary content.
We would like to offer you free hosting for your site in Host1Plus hosting service the only thing we would ask you is to place our visitors counter to your website here is the link http://www.count1plus.com or it could be any other feature.

So let me know if you are interested for my offer and I hope that offer is interested to you. Hope to hear you soon.

Kind Regards,
Vincentas Grinius

Host1Plus.com Team
part of Digital Energy Technologies Ltd.
26 York Street
London

1. Our site is regularly crawled by google, so there are better chances googlebot visiting your website regularly.
2. We ask you to link back to only those pages where your url is present, indirectly you are increasing your own link value.
3. By linking to our articles and technology blog you can provide useful content to your visitors.

This is an advertisement and a promotional mail strictly on the guidelines of CAN-SPAM act of 2003 . We have clearly mentioned the source mail-id of this mail, also clearly mentioned the subject lines and they are in no way misleading in any form. We have found your mail address through our own efforts on the web search and not through any illegal way. If you find this mail unsolicited, please reply with \”Unsubscribe\” in the subject line and we will take care that you do not receive any further promotional mail.

Trust me, quitting email is a time-saver. And yes, I’ve an idea how to waste the additional spare time: Tomorrow I’ll have paid me a beer for a link to myself. And I can think of way more link monkey business that doesn’t involve email.

1) Actually, “forwarding” comes with a slighly shady downside:
If you continue to send me your (unsolicited) emails, you’ll find all your awkward secrets on literally tons of automatically generated Web pages –nicely plastered with very targeted ads and usually x-rated or otherwise NSFW banners–, hosted on throw-away domains.
I’m such a devil.

Then when I leave a polite note asking the thief Veronica Domb from EmeryVille to remove my stuff asap, see my comment marked as “in moderation”, but neither my content gets removed nor my comment is published within 24 hours, I stay annoyed.

When I’m annoyed, I write blog posts like this one. I’m sure it will rank high enough for [Veronica Domb] when the assclown’s banker or taxman searches for her name. I’m sure it’ll be visible on any SERP that any other (potential) business partner submits at a major search engine.

Hey, outing content thieves is way more fun than filing boring DMCA complaints, and way more effective. Plagiarists do ego searches too, and from now on Veronica Domb from EmeryVille will find the footsteps of her criminal activities on the Web with each and every ego search. Isn’t that nice?

Not. Of course Veronica Domb is a pseudonym of Slade Kitchens, Jamil Akhtar, … However, some plagiarists and scam artists aren’t smart enough to hide their identity, so watch out.

Maybe I’ve done some companies a little favor, because they certainly don’t need to sent out money sneakily “earned” with Web spam and criminal activities that violate the TOS of most affiliate programs.

AdBrite will love to cancel the account for these affiliate links:
http://ads.adbrite.com/mb/text_group.php?sid=448245&br=1 &dk=736d616c6c20627573696e6573735f355f315f776562
http://www.adbrite.com/mb/commerce/purchase_form.php?opid=448245&afsid=1

Google’s webspam team as well as other search engines will most likely delist oylinki.com that comes with 100% stolen text and links and fakedwhois info as well.

Disclaimer: I’ve outed plagiarists in the past, because it works. Whether you do that on ego-SERPs or not depends on your ethics. Some folks think that’s even worse than theft and spamming. I say that publishing plagiarisms in the first place deserves bad publicity.

It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.

Since August 2007 MSN runs a bogus bot faking a human visitor coming from a search results page, that follows their crawler. This spambot downloads everything from a page, that is images and other objects, external CSS/JS files, and ad blocks rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranks for [oral sex] and why MSN sends visitors searching for [MILF pix] to a teenager’s diary. Webmasters assumed that MSN is after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.

And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.

They say that they no longer spam sites that don’t cloak, but reverse this statement telling Donna

we need to be able to identify the legitimate and illegitimate content

and Vanessa

sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web — both those that cloak and those that don’t — but those not found to be cloaking won’t continue to see traffic.

Here is an excerpt from yesterdays referrer log of a site that does not cloak, and never did:
http://search.live.com/results.aspx?q=webmaster&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=smart&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=search&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=progress&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=domain&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=database&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=content&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=business&mrt=en-us&FORM=LIVSOP
Why can’t the MSN dudes tell the truth, not even when they apologize?

Another lie is “we obey robots.txt”. Of course the spambot doesn’t request it to bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler “msnbot”:

Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.

In reality, it doesn’t help to block CSS/JS files or images in robots.txt, because MSN’s spambot will download them anyway. The long winded statement above translates to “We promise to obey robots.txt, but if it fits our needs we’ll ignore it”.

Well, MSN is not the only search engine running stealthy bots to detect cloaking, but they aren’t clever enough to do it in a less abusive and detectable way.

Their insane spambot led all cloaking specialists out there to their not that obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are MSN spam detection immune now.

Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spam bot?

Can cosmetic changes with regard to their useless spam activities restore MSN’s lost reputation? I doubt it. They’ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they’ll spam away for the foreseeable future. How silly is that? I thought Microsoft is somewhat profit orientated, why do they burn their and our money with such amateurish projects?

Besides all this crap MSN has good news too. Microsoft Live Search told Search Engine Roundtable that they’ll spam our sites with keywords related to our content from now on, at least they’ll try it. And they have a forum and a contact form to gather complaints. Crap on, so much bureaucratic efforts to administer their ridiculous spam fighting funeral. They’d better build a search engine that actually sends human traffic.

If only this headline would be linkbait … of course it’s not sarcastic.

Rumors are out that Microsoft will launch a porn affiliate programm soon. The top secret code name for this project is “pornbucks”, but analysts say that it will be launched as “M$ SMUT CASH” next year or so.

Since Microsoft just can’t ship anything in time, and the usual delays aren’t communicated internally, their search dept. began to promote it to Webmasters this summer.

Surprisingly, Webmasters across the globe weren’t that excited to find promotinal messages from Live Search in their log files, so a somewhat confused MSN dude posted a lame excuse to a large Webmaster forum.

Meanwhile we found out that Microsoft Live Search does not only target the adult entertainment industry, they’re testing the waters with other money terms like travel or pharmaceutic products too.

Anytime soon the Live Search menu bar will be updated to something like this:

Here is the sad –but true– story of a search engine’s downfall.

A few months ago Microsoft Live Search discovered that x-rated referrer spam is a must-have technique in a sneaky smut peddlar’s marketing toolbox.

Since August 2007 a bogus Web robot follows Microsoft’s search engine crawler “MSNbot” to spam the referrer logs of all Web sites out there with URLs pointing to MSN search result pages featuring porn.

Read your referrer logs and you’ll find spam from Microsoft too, but perhaps they peeve you with viagra spam, offer you unwanted but cheap payday loans, or try to enlarge your penis. Of course they know every trick in the book on spam, so check for harmless catchwords too. Here is an example URL: http://search.live.com/results.aspx?q= spammy-keyword &mrt=en-us&FORM=LIVSOP

The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your conerns, we would request that you do not actively block the IP addreses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.

That’s not traffic, that’s bot activity: These hits come within seconds of being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it’s genuine) and within a few seconds, the very same page is requested with a live.com search result URL as referer by the MSN spam bot faking a human visitor.

If that’s really a quality check to detect cloaking, that’s more than just lame. The IP addresses don’t change, the bogus bot uses a static user agent name, and there are other footprints which allow every cloaking script out there to serve this sneaky bot the exact same spider fodder that MSNbot got seconds before. This flawed technique might catch poor man’s cloaking every once in a while, but it can’t fool savvy search marketers.

The FUD “could prevent your site from being included in the Live Search index” is laughable, because in most niches MSN search traffic is not existent.

All major search engines, including MSN, promise that they obey the robots exclusion standard. Obeying robots.txt is the holy grail of search engine crawling. A search engine that ignores robots.txt and other normed crawler directives cannot be trusted. The crappy MSN bot not even bothers to read robots.txt, so there’s no chance to block it with standardized methods. Only IP blockingcan keep it out, but then it still seems to download ads from Google’s AdSense servers by executing the JavaScript code that the MSN crawler gathered before (not obeying Google’s AdSense robots.txt as well).

Since this method cannot detect (most) cloaking, and the so called “search quality control bot” doesn’t stop visiting sites which obviously do not cloak, it is a sneaky marketing tool. Whether or not Microsoft Live Search tries to promote cyberspace porn and on-line viagra shops plays no role. Even spamming with safe-at-work keywords is evil. Do these assclowns really believe that such unethical activities will increase the usage of their tiny and pretty unpopular search engine? Of course they do, otherwise they would have shutted down the spam bot months ago.

Dear reader, please tell me: what do you think of a search engine that steals (bandwidth and AdSense revenue), lies, spams away, and is not clever enough to stop their criminal activities when they’re caught?

OMFG, yet another post on Sphinn? Yup. I tell you why gaming Sphinn is counter productive, because I just don’t want to read another whiny rant in the lines of “why do you ignore my stuff whilst A listers [whatever this undefined term means] get their crap sphunn hot in no time”. Also, discussions assuming that success equals bad behavior like this or this one aren’t exactly funny nor useful. As for the whiners: Grow the fuck up and produce outstanding content, then network politely but not obtrusive to promote it. As for the gamers: Think before you ruin your reputation!

What motivates a wannabe Internet marketer to game Sphinn?

Traffic of course, but that’s a myth. Sphinn sends very targeted traffic but also very few visitors (see my stats below).

Free uncondomized links. Ok, that works, one can gain enough link love to get a page indexed by the search engines, but for this purpose it’s not necessary to push the submission to the home page.

Attention is up next. Yep, Sphinn is an eldorado for attention whores, but not everybody is an experienced high-class call girl. Most are amateurs giving it a (first) try, or wrecked hookers pushing too hard to attract positive attention.

The keyword is positive attention. Sphinners are smart, they know every trick in the book. Many of them make a living with gaming creative use of social media. Cheating professional gamblers is a waste of time, and will not produce positive attention. Even worse, the shit sticks at the handle of the unsuccessful cheater (and in many cases the real name). So if you want to burn your reputation, go found a voting club to feed your crap.

Fortunately, getting caught for artificial voting at Sphinn comes with devalued links too. The submitted stories are taken off the list, that means no single link at Sphinn (besides profile pages) feeds them any more, hence search engines forget them. Instead of a good link from an unpopular submission you get zilch when you try to cheat your way to the popular links pages.

Although Sphinn doesn’t send shitloads of traffic, this traffic is extremely valuable. Many spinners operate or control blogs and tend to link to outstanding articles they found at Sphinn. Many sphinners have accounts on other SM sites too, and bookmark/cross-submit good content. It’s not unusual that 10 visits from Sphinn result in hundreds or even thousands of hits from StumbleUpon & Co. — but spinners don’t bookmark/blog/cross-submit/stumble crap.

So either write great content and play by the rules, or get nowhere with your crappy submission. The first “10 reasons why 10 tricks posts about 10 great tips to write 10 numbered lists” submission was fun. The 10,000 plagiarisms following were just boring noise. Nobody except your buddies or vote bots sphinn crap like that, so don’t bother to provide the community with footprints of your lousy gaming.

If you’re playing number games, here is why ruining a reputation by gaming Sphinn is not worth it. Look at my visitor stats from July to today. I got 3.6k referrers in 4 months from Sphinn because a few of my posts went hot. When a post sticks with 1-5 votes, you won’t attract much more click throughs than from those 1-5 folks who sphunn it (that would give 100-200 hits or so with the same amount of submissions). When you cheat, the story gets buried and you get nothing but flames. Think about that. Thanks.