Google spam suite primer

Google provides a full suite of services for the entry-level blog spammer. There are plenty of legitimate uses for all of these Google services, but Google’s market-leading position in search creates a spam ecosystem that inflates corporate revenues, index size, and user data. Google’s blog hosting service, Blog*Spot, received a lot of attention this week as blogosphere neighbors threw up their arms in protest against the host, which is like the seedy motel at the edge of town that rents by the hour. It’s cheap and inviting to those who know no better, but those in the know don’t want anything to do with it.

I will describe the Google elements that contribute to a spam farm in an attempt to create more understanding about how your content ends up where you may not want it.

The host

Blogger’s Blog*Spot hosting is a quick and easy way to create new blogs. It’s free and you can post via e-mail. Many people also think a Blog*Spot blog is the quickest way into Google’s search index: the blog hosting servers might be only a few rows away from the Google crawler, and of course Google knows how to find all of the content inside its own system.

The image above is a completely automated public Turing test to tell computers and humans apart, commonly referred to by its acronym: CAPTCHA. A CAPTCHA is supposed to be easy for a human to decipher, but difficult for computers using image recognition software.

Blogger requires users to solve the above CAPTCHA before creating a new blog. Yet the system is bypassed daily and thousands of new blogs are created.

A simple CAPTCHA can be broken using optical character recognition, the same technology that scans a printed page and converts the words to plain text.

A common way to bypass a CAPTCHA system is to offer humans a reward for successfully entering the scrambled word. Some sites trade free porn for CAPTCHA solutions; others hire people in low-income areas of the world to sit in front of a computer and solve CAPTCHAs all day.

The content

Google provides a lot of free content for someone to repurpose on their newly created Blog*Spot blog. Search Google’s web, news, or blog results for the keyword of your choice and you will receive a list of content sources Google has determined are most relevant to the query. Copying from the top of these results is an easy way for spammers to obtain content already deemed relevant by Google for inclusion in their own pages.

You will often see spam blogs composed of a group of results, each including a title, link, and excerpt for targeted keywords. These pages are meant to attract search referrals for advertising or to create more pages linking to a site the spammer would like to promote.

Google blog search is the newest Google search service with relevant content available for scraping. Many of the cries from bloggers over the past week were most likely a result of a spammer using a script to retrieve the top search results on Google’s blog search ranked by relevance for inclusion on a newly created Blog*Spot blog.

The payout

Google places AdWords text advertisements across the web, through its AdSense program, matched to the textual content of a page. Every time someone clicks on a Google text ad for “refinance” it costs the advertiser over $35 and makes the site owner some money. “Vioxx” pays about $16.50 a click, “poker” pays about $2.50 a click, and “camcorder” pays about $2.60 a click on Google’s advertising network. The newly created blog can make money from these advertisements based on how many people are searching for its targeted keyword, the likelihood that a visitor clicks on an ad, and the payout for that keyword.
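The three factors in that last sentence multiply together. Here is a minimal sketch of the arithmetic in Python. Only the per-click prices come from the figures above; the function name, the traffic and click-through numbers, and the share of each click that reaches the site owner are assumptions made up for illustration, since the actual split is not stated here.

```python
# Per-click prices quoted above, in US dollars.
# "refinance" is quoted as "over $35", so 35.00 is a lower bound.
COST_PER_CLICK = {
    "refinance": 35.00,
    "Vioxx": 16.50,
    "poker": 2.50,
    "camcorder": 2.60,
}


def estimated_daily_revenue(visits_per_day, click_through_rate,
                            cost_per_click, publisher_share):
    """Rough estimate of what one page earns per day from text ads.

    visits_per_day      -- search referrals reaching the page (assumed)
    click_through_rate  -- fraction of visitors who click an ad (assumed)
    cost_per_click      -- what the advertiser pays for one click
    publisher_share     -- fraction of that price paid to the site owner
                           (assumed; the real split is not given above)
    """
    clicks = visits_per_day * click_through_rate
    return clicks * cost_per_click * publisher_share


if __name__ == "__main__":
    # Hypothetical inputs: 100 visits a day, 2% of visitors click,
    # half of the click price goes to the site owner.
    for keyword, price in COST_PER_CLICK.items():
        revenue = estimated_daily_revenue(100, 0.02, price, 0.5)
        print(f"{keyword}: ${revenue:.2f} per day")
```

A single page earns little under these assumptions, which is why the number of pages and the price of the targeted keyword matter so much to the person running them.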

Automation

The above process becomes even easier through the use of automated tools for blog creation, content retrieval, and advertising placement. More expensive tools include the use of pre-configured Blog*Spot blogs for a quick start.

Conclusion

Free web hosts have hidden costs. You don’t have friendly neighbors and it’s possible that search engines will not want to help others discover your area of the web.

Google has taken more steps to protect its e-mail service, Gmail, from spammers than it has taken to keep them away from Blog*Spot. There is a lot more that Google can do to reduce spam, reduce click fraud, and improve its Blogger service, but it might involve losing some advertising revenue in the short term. I think no company in the business of content generation, indexing, or payment can afford to ignore the problem.

Niall, I think that Ray isn’t asking whether you think that the porn or low-wage-labor solutions are real, but for some evidence about either actually being used. I’ve yet to see any actual evidence; every single news article points to a 2004 Boing Boing post as “proof”, but that post just says that Cory was told by someone else that it’s happening.

Coming from the biomedical field, I’m a bit dubious whenever every single claim to some reality points to the same exact source for describing that reality as a possibility. I’d love to see proof of it.

After creating a splog, is there anything to keep a splogger from pointing a crawler at their network of sites to automatically click on every advertising link on their blogs and jump start their “revenue-generation engine?”

Niall, I get what you’re saying, but I’m not too sure I buy the analogy. As a physician, I have to consider both the moral obligation not to disclose specific patients’ illnesses and the very real legal obligations of the HIPAA laws. Of course, this doesn’t mean that I can’t publish what I have discovered, along with the (very real, very verifiable) information that goes with it (e.g., blood counts, CT scans, pathology specimen images), so that others both know about the findings and have reason to believe that they’re real; that’s the only way that medicine progresses.

What’s the similar restriction in the web world that prevents you from providing proof of the exploits you’ve described? Similarly, what has prevented anyone else — literally, anyone at all — from pointing to sites that use free porn or low-cost labor to contravene CAPTCHAs? The entirety of the glaring lack of evidence makes the whole story a little harder to take at face value. Again, I’m not saying that it doesn’t happen, just that I’ve yet to see a single person demonstrate it to happen, an important distinction.

(To flip your own analogy: if an oncologist tells a patient that he’s seen times when widely metastatic colon cancer just goes away without any treatment, that patient might believe the doctor, but more realistically would probably want some verifiable proof of the statement before accepting it.)

PWB,
Two possible reasons a spam blog might be created are to increase the amount of targeted advertising inventory available or to create more links promoting a site.
Yes, Google and others can attack the problem at their checkbooks if they can successfully identify the bad actors.

Phil,
If you would like to continue using Blogger’s Blog*Spot hosting, I recommend you demand more of your host. Send in some feedback and let the team at Blogger know your concern that your blog may be cut off or deeply discounted in search results because of the rising problem of spam blogs on the system.

Create a neighborhood association to address the problem or you could move to another free blog host such as a TypePad partner or WordPress.com. You might also have free hosting you are not using through a membership organization or an Internet service provider you could consider.

One data point. I’m starting to see WordPress based spam blogs using all the same techniques as above. I suspect that the barrier to entry has dropped to nothing and a WordPress farm is probably not that much harder to set up than a blogspot farm.

I also see a conflict of interest here for Google. If the intention is to receive AdSense publisher revenue, it’s also inflating Google’s revenues. Or is that too cynical for you all? In their new post-IPO worldview are they balancing the effort to stop pollution of their search index against increased income from their advertising business? Nah, they wouldn’t do that. After all, they do no evil, right?

I don’t understand how these fake blogs can make money from AdSense. Surely no one reads these blogs long enough to click on any ad links. I have stumbled across them occasionally in search results, and I always end up scratching my head. The blogs are full of nonsense and are useless, so I go away. I don’t stick around and click on ads. How exactly does this part of the fraud work?

Pardon my ignorance, but could you explain what the purpose of blog spamming is? Is it to lift PageRank? Make money off AdSense?

If the latter, could the problem be attacked on the back-end by making sure Google only pays legitimate enterprises?

AdSense spam does not hurt all parties the same.

Google has been doing algorithmic search longer than MSN or Yahoo!, and thus has more sophisticated link quality scrubbing technology.
Google places many new sites through a probationary period which prevents many new spam sites from ranking well in Google.
The net effect of no enforced quality standards is that Google is paying entrepreneurs to stuff competing search networks with spam.

Surely no one reads these blogs long enough to click on any ad links.

If the content looks ugly enough the ads are the obvious thing to click on.

Wrong-doers are already breaking CAPTCHAs on a daily basis. And not through clever algorithmic means but via the old-fashioned human-powered way. We’ve actually been able to observe when human-powered CAPTCHA solvers come on-line by analyzing our logs. You can even use the timestamps to determine from whence this CAPTCHA-solving originates.
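The timestamp reasoning can be sketched roughly as follows. This is not the commenter's actual method, only an illustration under stated assumptions: the log entries are hypothetical, the function names are invented, and the guess that solvers work something like a 9-to-5 local day is an assumption. The idea is to count solves per UTC hour, take the busiest hour, and list the UTC offsets that would put that hour inside a local workday.

```python
from collections import Counter
from datetime import datetime, timezone


def solves_per_utc_hour(timestamps):
    """Count CAPTCHA solves in each UTC hour of the day (0-23)."""
    counts = Counter()
    for ts in timestamps:
        counts[ts.astimezone(timezone.utc).hour] += 1
    return counts


def candidate_utc_offsets(peak_utc_hour, workday_start=9, workday_end=17):
    """Whole-hour UTC offsets that place the peak hour in a local workday.

    The 9:00-17:00 workday is an assumption, not a measured fact.
    """
    return [
        offset
        for offset in range(-12, 15)
        if workday_start <= (peak_utc_hour + offset) % 24 < workday_end
    ]


# Hypothetical, timezone-aware timestamps as they might be parsed
# from a server log of successful CAPTCHA submissions.
sample_log = [
    datetime(2005, 10, 20, 3, 15, tzinfo=timezone.utc),
    datetime(2005, 10, 20, 3, 40, tzinfo=timezone.utc),
    datetime(2005, 10, 20, 3, 55, tzinfo=timezone.utc),
    datetime(2005, 10, 20, 14, 5, tzinfo=timezone.utc),
]

hourly_counts = solves_per_utc_hour(sample_log)
peak_hour, peak_count = hourly_counts.most_common(1)[0]
plausible_offsets = candidate_utc_offsets(peak_hour)
```

A result like this only narrows the origin to a band of time zones. It cannot separate night-shift workers from daytime ones, so it is a hint about where the solving happens rather than proof.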

I believe the claim that there are offshore human captcha-solvers, working for very low wages. Like you I have independent evidence of that claim that I do not want to share.

But I do not believe the claim that there are systems that reward captcha-solvers with porn rather than with money. I was incautious enough to repeat this idea at a workshop on CAPTCHAs (that’s me in the middle of the photo), and was immediately challenged — no one’s been able to substantiate any such report.

I would be happy to give any of the following to the first person that can persuade me (even while swearing me to secrecy) that a captchas-for-porn scheme has actually been implemented:

$10
A beer of your choice at any bar within a 10-mile radius of Mountain View, CA, hand-ordered by me
A couple of unused Technorati stickers that Tantek gave me at the last Spam Summit

I still don’t get it — why are people hesitant to share the means by which they know that there are low-wage-cost CAPTCHA workarounds at work out there? It feels like me saying that I’m aware of great DHTML drop-down menus being used on web pages, but I don’t want to tell people where I saw them… pointless, and likely to raise my suspicion rather than put it to rest.

Jason L,
I would like to assure you that human-powered CAPTCHA solving exists. I am SURE even though I have not seen it with my own eyes. As an inhabitant of a destroyed and corrupt third-world country, I am not only sure it exists; I am also sure that the people who earn money solving CAPTCHAs all day are envied by their neighbours. How much do you think I have to earn here to be considered “making good money”? I can tell you. It’s $3 an hour. Come on, even $2 is still considered very good.
So you don’t believe in human-powered CAPTCHA solving? I could create one especially for you, right away ;-) I personally know some guys who would be absolutely happy to sit all day in front of a computer and type characters for a payment of 8 dollars a day. Or maybe even less. This is real. I live here, believe me.

Now do you want to talk about spam blogs? Comment spamming? Click fraud?

Niall Kennedy is a software engineer in San Francisco, California in the United States.