This post was promoted from YouMoz. The author’s views are entirely his or her own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

Know someone who thinks they’re smart? Tell them to build a machine learning tool. If they majored in, say, History in college, within 30 minutes they’ll be curled up in a ball, rocking back and forth while humming the opening bars of “Oklahoma.”

Sometimes, though, the alternative is rooting through 250,000 web pages by hand, checking them for compliance with Google’s TOS. Doing that will skip you right past the rocking-and-humming stage and launch you straight into the writing-with-crayons-between-your-toes phase.

Those were my two choices six months ago. Several companies came to Portent asking for help with Penguin/manual penalties. They all, for one reason or another, had dirty link profiles.

Link analysis, the hard way

Back when I was a kid... I did the first link profile review by hand, like this:

Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.

Remove obviously bad links by analyzing URLs. Face it: if a linking page is on a domain like “FreeLinksDirectory.com” or “ArticleSuccess.com,” it’s gotta go.

Analyze the domain and page trustrank and trustflow. Throw out anything with a zero, unless it’s on a list of ‘whitelisted’ domains.

Grab thumbnails of each remaining linking page, using Python, Selenium, and PhantomJS (see the sketch after this list). You don’t have to do this step, but it helps if you’re going to get help from other folks.

Get a faithful Portent team member (read: some poor bugger) to review the thumbnails, quickly checking off whether they’re forums, blatant link spam, or something else.
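For what it’s worth, the thumbnail step really is only a handful of lines. Here’s a minimal sketch, assuming PhantomJS is on the PATH and a Selenium release that still ships the PhantomJS driver; the file and folder names are made up.

```python
# Minimal thumbnail grabber: one screenshot per linking URL.
# Assumes PhantomJS on the PATH and a Selenium version that still supports
# webdriver.PhantomJS(); "linking_pages.txt" is a made-up filename.
import os
from selenium import webdriver

def capture_thumbnails(url_file, out_dir="thumbs"):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    driver = webdriver.PhantomJS()
    driver.set_window_size(1024, 768)        # consistent viewport for every page
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            driver.get(url)
            driver.save_screenshot(os.path.join(out_dir, "page_%05d.png" % i))
        except Exception:
            pass                              # dead pages are common in dirty profiles
    driver.quit()

capture_thumbnails("linking_pages.txt")
```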

After all of that prep work, my final review still took 10+ hours of eye-rotting agony.

There had to be a better way. I knew just enough about machine learning to realize it had possibilities, so I dove in. After all, how hard can it be?

Machine learning: the basic concept

The concept of machine learning isn’t that hard to grasp:

Take a large dataset you need to classify. It could be book titles, people’s names, Facebook posts, or, for me, linking web pages.

Define the categories. In this case, I’m looking for ‘spam’ and ‘good.’

Get a collection of those items and classify them by hand. Or, if you’re really lucky, you find a collection that someone else classified for you. The Natural Language Toolkit, for example, has a movie reviews corpus you can use for sentiment analysis. This is your training set.

Feed in your training set, with the features — the item attributes used for classification — pre-selected. The tool will find patterns, if it can (giggle).

Use the tool to compare each item in your dataset to the training set.

The tool returns a classification of each item, plus its confidence in the classification and, if it’s really cool, the features that were most critical in that classification.
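If it helps to see those steps as code rather than prose, here’s a toy sketch with scikit-learn. Everything in it is invented for illustration: two numeric features per page (links per word, reading grade level) and a four-page hand-labeled training set.

```python
# Toy illustration of the train / classify / confidence loop above.
# Feature values and labels are invented, not real data.
from sklearn.tree import DecisionTreeClassifier

train_features = [[0.31, 4.2], [0.02, 9.1], [0.45, 3.8], [0.04, 11.0]]
train_labels = ["spam", "good", "spam", "good"]

model = DecisionTreeClassifier()
model.fit(train_features, train_labels)       # "feed in your training set"

new_pages = [[0.38, 4.0], [0.05, 8.7]]
print(model.predict(new_pages))               # classification for each item
print(model.predict_proba(new_pages))         # the model's confidence per class
```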

If you ignore the hysterical laughter, the process seems pretty simple. Alas, the laughter is a dead giveaway: these six steps are easy the same way “Fly to moon, land on moon, fly home” is three easy steps.

Note: At this point, you could go ahead and use a pre-built toolset like BigML, Datameer, or Google’s Prediction API. Or, you could decide to build it all by hand. Which is what I did. You know, because I have so much spare time. If you’re unsure, keep reading. If this story doesn’t make you run, screaming, to the pre-built tools, start coding. You have my blessings.

The ingredients: Python, NLTK, scikit-learn

I sketched out the process for IIS (Is It Spam, not Internet Information Server) like this:

Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.

Use a little Python script to scrape the content of those pages.

Get the SEOmoz and MajesticSEO metrics for each linking page.

Build any additional features I wanted to use (see the sketch after this list). I needed to calculate the reading grade level and links per word, for example. I also needed to pull out all the meaningful words, and a count of those words.

Finally, compare each result to my training set.
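Here’s a rough sketch of the scraping and feature-building steps. The helper name and the use of requests, BeautifulSoup, and textstat are my assumptions; the real tool leaned on NLTK and may have computed the grade level differently.

```python
# Rough sketch of scraping a linking page and building features from it.
# Assumes requests, BeautifulSoup, textstat, and NLTK with the stopwords corpus
# downloaded (nltk.download("stopwords")); names are illustrative only.
from collections import Counter

import requests
import textstat
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def build_features(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)

    words = [w.lower() for w in text.split() if w.isalpha()]
    meaningful = [w for w in words if w not in STOP_WORDS]

    return {
        "links_per_word": len(soup.find_all("a")) / float(len(words) or 1),
        "grade_level": textstat.flesch_kincaid_grade(text),   # reading grade level
        "word_counts": Counter(meaningful),                   # meaningful words + counts
    }
```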

To do all of this, I needed a programming language, some kind of natural language processing (to figure out meaningful words, clean up HTML, etc.) and a machine learning algorithm that I could connect to the programming language.

I’m already a bit of a Python hacker (not a programmer – my code makes programmers cry), so Python was the obvious choice of programming language.

I’d dabbled a little with the Natural Language Toolkit (NLTK). It’s built for Python, and would easily filter out stop words, clean up HTML, and do all the other stuff I needed.

I smushed it all together using some really-not-pretty Python code, and connected it to a MongoDB database for storage.
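The storage part is the least exotic piece. A pymongo sketch of roughly what that looks like (the database, collection, and field names here are invented, not the production schema):

```python
# Storing one scraped page and its features. Assumes a local MongoDB instance;
# the database, collection, and field names are invented for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["is_it_spam"]["linking_pages"]

pages.insert_one({
    "url": "http://example.com/some-linking-page",
    "features": {"links_per_word": 0.12, "grade_level": 6.4},
    "label": None,   # filled in later by the classifier or a human reviewer
})
```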

A word about the training set

The training set makes or breaks the model. A good training set means your bouncing baby machine learning program has a good teacher. A bad training set means it’s got Edna Krabappel.

And accuracy alone isn’t enough. A training set also has to cover the full range of possible classification scenarios. One ‘good’ and one ‘spam’ page aren’t enough. You need hundreds or thousands to provide a nice range of possibilities. Otherwise, the machine learning program staggers around, unable to classify items outside the narrow training set.

Luckily, our initial hand-review reinclusion method gave us a set of carefully-selected spam and good pages. That was our initial training set. Later on, we dug deeper and grew the training set by running Is It Spam and hand-verifying good and bad page results.

That worked great on Is It Spam 2.0. It didn’t work so well on 1.0.

First attempt: fail

For my first version of the tool, I used a Bayesian Filter as my machine learning tool. I figured, hey, it works for e-mail spam, why not SEO spam?
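For the curious, a Bayesian text classifier in NLTK really is only a few lines, which is part of the appeal. This is a simplified sketch of the idea, not Is It Spam 1.0 itself; the word-presence features and the two training pages are placeholders.

```python
# Simplified Bayesian (naive Bayes) text classifier in NLTK. The features are
# plain word presence and the training set is a placeholder, not the real one.
import nltk

def word_features(text):
    return {word.lower(): True for word in text.split()}

train = [
    (word_features("cheap wedding favors buy link directory"), "spam"),
    (word_features("our approach to bridge construction safety"), "good"),
    # ...hundreds more hand-labeled pages in practice...
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("discount wedding directory links")))
classifier.show_most_informative_features(5)   # the words driving the decision
```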

Apparently, I was already delirious at that point. Bayesian filtering works for e-mail spam about as well as fishing with a baseball bat. It does occasionally catch spam. It also misses a lot of it, dumps legitimate e-mail into spam folders, and generally amuses serious spammers the world over.

But, in my madness, I forgot all about these little problems. Is It Spam 1.0 seemed pretty great at first. Initial tests showed 75% accuracy. That may not sound great, but with accurate confidence data, it could really streamline link profile reviews. I was the proud papa of a baby machine learning tool.

But Bayesian filters can be ‘poisoned.’ If you feed the filter a training set where 90% of the spam pages talk about weddings, it’s possible the tool will begin seeing all wedding-related content as spam. That’s exactly what happened in my case: I fed in 10,000 or so pages of spammy wedding links (we do a lot of work in the wedding industry). On the next test run, Is It Spam decided that anything matrimonial was spam. Accuracy fell to 50%.

Since we tend to use the tool to evaluate sites in specific verticals, this would never work. Every test would likely poison the filter. We could build the training set to millions of pages, but my pointy little head couldn’t contemplate the infrastructure required to handle that.

The real problem with a pure Bayesian approach is that there’s really only one feature: The content of the page. It ignores things like links, page trust and authority.

Oops. Back to the drawing board. I sent my little AI in for counseling, and a new brain.

Note: I wouldn’t have figured this out without help from SEOmoz’s Dr. Pete and Matt Peters. A ‘hat tip’ doesn’t seem like enough, but for now, it’ll have to do.

Second attempt: a qualified success

My second test used logistic regression. This machine learning model uses numeric data, not text, so I could feed it more features. After the first exercise, this actually wasn’t too horrific. A few hours of work got me a tool that evaluates a richer set of numeric features: things like the reading grade level, links per word, and the SEOmoz and MajesticSEO metrics described above.
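In scikit-learn terms, the switch looks roughly like this. It’s a sketch with invented feature values and column choices, not the production code; the key difference from the Bayesian version is that predict_proba() gives a confidence score over whatever numeric features you feed it.

```python
# Sketch of the logistic-regression version: all-numeric features, with
# predict_proba() supplying a confidence score. Values and columns are invented.
from sklearn.linear_model import LogisticRegression

# columns: links_per_word, reading_grade_level, domain_trust, page_authority
X_train = [
    [0.31, 4.2, 1, 3],
    [0.02, 9.1, 42, 38],
    [0.45, 3.8, 0, 2],
    [0.05, 11.0, 55, 47],
]
y_train = [1, 0, 1, 0]   # 1 = spam, 0 = good

model = LogisticRegression()
model.fit(X_train, y_train)

unknown_pages = [[0.28, 4.5, 3, 6]]
print(model.predict(unknown_pages))         # the spam / good call
print(model.predict_proba(unknown_pages))   # confidence in that call
```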

False positives remain a big problem if we try to build a training set outside a single vertical.

Disappointing. But the tool chugs along happily within verticals, so we continue using it for that. We build a custom training set for each client, train the model on it, and then run it against the remaining links. The result is a relatively clear report.

Results and next steps

I tried to launch a public version of Is It Spam, but folks started using it to do real link profile evaluations, without checking their results. That scared the crap out of me, so I took the tool down until we cure the false positives problem.

I think we can address the false positives issue by adding a few features to the classification set:

Bayesian filtering: Instead of depending on a Bayesian classification as 100% of the formula, we’ll use the Bayesian score as one more feature (see the sketch after this list).

Anchor text matters a lot. The next generation of the tool needs to score the relevant link based on the anchor text. Is it a name (like in a byline)? Or is it a phrase (like in a keyword-stuffed link)?

Link position may matter, too. This is another great feature that could help with spam detection. It might lead to more false positives, though. If Is It Spam sees a large number of spammy links in press release body copy, it may start rating other links located in body copy as spam, too. We’ll test to see if the other features are enough to help with this.
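Here’s the sketch promised above for the first item: the Bayesian score stops being the verdict and becomes just another column in the feature vector. The helper reuses word_features() from the earlier NLTK sketch, and every field name here is hypothetical.

```python
# Hypothetical sketch: demote the Bayesian score from verdict to feature.
# Reuses word_features() from the NLTK sketch above; field names are invented.
def feature_vector(page, bayes_classifier):
    prob_spam = bayes_classifier.prob_classify(word_features(page["text"])).prob("spam")
    return [
        page["links_per_word"],
        page["grade_level"],
        page["domain_trust"],
        prob_spam,   # the Bayesian filter's opinion, as one more numeric column
    ]
```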

If I'm lucky, one or more of these changes may yield a tool that can evaluate pages across different verticals. If I'm lucky.

Insights

This is by far the most challenging development project I've ever tried. I probably wore another 10 years' enamel off my teeth in just six weeks. But it's been productive:

When you start digging into automated page analysis and machine learning, you learn a lot about how computers evaluate language. That's awfully relevant if you're a 21st Century marketer.

I uncovered an interesting pattern in Google's Penguin implementation. This is based on my fumbling about with machine learning, so take it with a grain of salt, but have a look here.

We learned that there is no such thing as a spammy page. There are only spammy links. One link from a particular page may be totally fine: For example, a brand link from a press release page. Another link from that same page may be spam: For example, a keyword-stuffed link from the same press release.

We've reduced the time required for an initial link profile evaluation by a factor of ten.

It's also been a great humility-building exercise.

About wrttnwrd —
Ian Lurie is CEO at Portent, an internet marketing agency he started in 1995 on the honest belief that great marketing can save the world. At Portent, he leads and trains a team that covers SEO, PPC, social media and marketing strategy. Ian writes on the Portent Blog and speaks at various conferences, including MozCon, SMX, SES, Ad::Tech and Pubcon. He recently co-authored the 2nd Edition of the Web Marketing All-In-One Desk Reference for Dummies.

Linear algebra is really the bread and butter of machine learning. Most ML problems are really some sort of optimization problem, and linear algebra is an exceptional tool for solving optimization problems either directly or iteratively (gradient descent/ascent, for example). I suspect that you could get really good performance on this data set with a multi-layer neural network, and it might solve some of your problems where your model isn't as generally applicable as you'd like if you use some of the newer techniques like dropout when you are training the model. Andrew Ng will talk a little bit about neural networks in that class, and once you get the hang of it, the actual code for neural network training isn't all that difficult in comparison to other things out there. The beauty of machine learning is that from simple principles you can build something amazing, as long as you have a good volume of high-quality labeled data.
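A minimal sketch of that suggestion, purely for illustration: a small multi-layer network with dropout, written here with Keras (my assumption; nothing in the post uses it). X_train and y_train would be numeric feature rows and 0/1 labels like those in the logistic regression sketch above.

```python
# Small multi-layer network with dropout, sketching the commenter's suggestion.
# Keras/TensorFlow is an assumption; the post's tool used NLTK and scikit-learn.
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

model = Sequential([
    Dense(16, activation="relu", input_shape=(4,)),   # 4 numeric page features
    Dropout(0.5),                                     # the "dropout" regularisation
    Dense(16, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),                   # probability the page is spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```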

So good. Thanks for sharing this, Ian. I gave this a try a few years back when I built a link classification tool and couldn't get it anywhere near as accurate as what you've done.

I was working from this whitepaper: http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf. Suffice it to say I couldn't implement it properly, so I ended up with a combination score based on a variety of metrics within their algorithm that just yielded terrible results. I also wasn't up on Python or NLTK at the time, so I was doing things like n-grams and page classification by leveraging APIs like those of Repustate, AlchemyAPI, and Textwise. ProTip: PHP is definitely not the language for stuff like this =]

Anyway, this is a very spirited effort and I hope you keep going with it as ultimately it's something that's incredibly valuable to our industry.

Great post, Ian! Very informative and a unique point of discussion. A marketer should also have a decent knowledge of coding, at least enough to understand the code on their own.

Could you point me in the right direction for figuring something like this out?

"A construction website with a link profile consistingof a whopping 80 percent suspicious links lostmuch of its search rankings in April 2012 whenPenguin was first rolled out. "

So, first you'd have to scrape the keywords of that site somehow. Then, you'd have to scrape the search results to check rankings. I understand that, but how do you go back and see historical ranking positions from April?

If you have the "secret sauce" for penguin recovery you could very easily identify penalized sites and then reach out to the webmaster. I've used SEMrush before too. I can not believe I didn't realize what I was missing HAH.

Ian, you can very likely improve and generalise your classifier. See Andrew Ng's advice on this problem (but for email spam): http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf [pdf]. In brief, work out whether the problem is variance or bias. If your training error is much lower than your test error, you've got high variance and need to increase the size of the training set or reduce the number of features. If the error in both training and testing is way too high, you need to address bias, which can be approached with a larger or different set of features. Also, you should try removing individual features, as sometimes one that seems intuitively good will turn out to poison the classifier. Finally, I often turn to SVM for text classification problems.

Yeah, this is what I'm trying to dig into now. Thanks for the article link - that'll be a huge help. The problem IS variance - I'm almost totally sure of that. That means I either need hundreds of thousands of sample pages, or I need to reduce the number of features, as you say. But fewer features may mean missing a lot of stuff - I'm more inclined to ADD 2 new features. On to learn some 'big data' (shudder) stuff...
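For reference, the SVM route suggested above usually starts from something like this generic TF-IDF plus linear SVM sketch (placeholder texts, not anything from Is It Spam):

```python
# Generic SVM text-classification sketch: TF-IDF features + a linear SVM.
# The two training texts are placeholders, just to show the pipeline shape.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["cheap wedding links directory exchange", "bridge construction safety report"]
labels = ["spam", "good"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(texts, labels)
print(svm.predict(["discount wedding link exchange"]))
```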

I tried a few different things. PyWebShot was the easiest, but in the end I used Selenium with PhantomJS as the browser, out of pure laziness - the server already had Selenium installed, and getting the dependencies for PyWebShot set up turned into a less than positive experience.

Nicely done, Ian. I think the SEO industry would benefit immensely if all SEOs could become as technical as you are. Don't get me wrong - I don't think you need to be a programmer to understand SEO, but in order to actually solve SEO problems, the value that a programmatic approach can add is often immeasurable. Nice post, and I'm looking forward to trying something like this myself!

This is the stuff that excites me about SEO and marketing. If it were possible to sort out the bad from the good this way ... yeah. I understand why you would want to do that. :) No amount of Coursera and Codecademy is likely to make me understand this stuff - so good luck!

Nice idea on simplification and scalability. I have found that having your own personal database of "whitelisted" domains can help, but I found it was a numbers game: the more data points you got, the less likely you were to get false positives. I found the metrics were not always a guarantee, but I did start to see common on-page/off-page elements in spam links as more sites were reviewed; this also depended on the vertical and the agency that used to do their SEO. My question is: to get your accuracy beyond 85%, does it just require a larger data set?

1.) What do you mean by "75% accurate," etc.? I assume you mean the "false negative" rate, i.e. it successfully identified 75% of the pages that hand-reviewers determined to be spam but missed 25%.

If so, what is the false positive rate? For instance, I can identify Spam with 100% accuracy if I simply apply the rule: "everything is spam". ;-) Of course, a fairly high false positive rate might be acceptable for a link profile cleanup project, some wheat getting thrown out with the chaff is to be expected.

2.) Did you set aside any of the "training" data to *not* be used in training, but to be used as test data, after the training was done? It's typical to reserve 10%-20% of training data to be used to validate a model; this helps guard against "overfitting" (with regression and other "curve-fitting" approaches, if you have enough variables, you can get a model that performs great on the data it was trained with, but fails miserably on everything else). Testing on other datasets usually comes after testing against a portion of the original data the training data was pulled from.

Thanks, it's encouraging to see someone doing real yeoman's work on this!

Hi Ted - no problem!!!! By 75% accurate, I actually mean 'false positives.' The tool tends to be far too strict, and marks quality sites as spam about 25% of the time. Of course, this is all subjective. But when it's marking cnn.com as a swirling black hole of spam, I know something's not right. The tool very rarely (less than 2% of the time) missed a site that should have been tagged as spam. Indeed, I set aside 1/2 of the training data for testing. That's how I got the 75% accuracy rating. Thanks! Ian
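The hold-out idea from point 2 above, and the 50/50 split Ian describes, look roughly like this in scikit-learn (feature rows and labels are invented placeholders):

```python
# Reserve half of the labeled data, never train on it, and score on it to
# catch overfitting. Feature rows and labels here are invented placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[0.31, 4.2], [0.02, 9.1], [0.45, 3.8], [0.05, 11.0], [0.29, 5.0],
     [0.03, 8.5], [0.40, 4.6], [0.06, 10.2], [0.33, 3.9], [0.04, 9.8]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = good

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```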

I went through your PDF report about Penguin and found it very fascinating - thanks for putting it together. You mention:

"random subjects ranging from STDs to outdoor patio furniture"

"If a blog has huge numbers of poorly written articles covering an absurd range of topics"

Seems like common sense. Moving away from Penguin and into Panda, would you say sites structured this way (multiple different topic niches) are prone to future Panda updates? I'm just confused with the "broad range of topics" argument because I see Wikipedia, About.com, ehow.com, etc. in SERPs all of the time. I think eHow was one to get hit by Panda, but I still see them all the time. I guess what I'm asking is, if you have a wiki site, or a site like About.com where all of the articles are written by professionals in the industry, will the high quality of the content trump the "broad range of topics"?

A site like Wikipedia is organized around those topics, so the site structure makes sense. Sites like the spam sites I found are generally blogs, with the broad range of topics presented chronologically and no organization whatsoever. Plus, the writing is typically terrible, AND they have ridiculous, useless, keyword-stuffed offsite links. Put that all together and it gets easier to detect the problem sites.

Wow. This was a fantastic read. The journey through your mind as you developed this AI was very insightful. As more clients roll in the door with poor-quality backlink portfolios, many of which have received penalties from Penguin updates, auditing the negative links has become a necessity over the past year. I have been doing this by hand for the past 6+ months, using the exact same method you outlined "when you were a kid." Well, as I've been heading up our scraping practices, "mastering" Python has become more of a necessity. What I'm learning is that there are a lot of clever ways to interact with the APIs out there, and it looks like machine learning is the next path of near-insanity I will be traveling down. Thank you for the excellent write-up!

I wish I had even a little bit of coding/programming knowledge so I could help! This would make life so much easier when doing an SEO audit. Perhaps there is a way to re-configure and re-run all the positives to filter out more of the false positives each time, at least enough so you could manually review them and not be bleeding from your eyeballs by the time you are done.

I suspect the real issue here is that, if you're going to use a supervised learning algorithm like I do, you have to stick to verticals. Otherwise the characteristics of a spammy page vary too much. I'm going to test out some unsupervised or 'clustering' algorithms next.
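For anyone wondering what the clustering route looks like in practice, here's a bare-bones sketch with scikit-learn's KMeans: no labels at all, just grouping pages by their numeric features and then eyeballing which clusters turn out to be spam-heavy. Feature values are invented.

```python
# Bare-bones unsupervised sketch: cluster pages by numeric features (no labels),
# then inspect which clusters turn out to be spam-heavy. Values are invented.
from sklearn.cluster import KMeans

X = [[0.31, 4.2], [0.02, 9.1], [0.45, 3.8], [0.05, 11.0]]   # links/word, grade level

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))   # cluster id for each page
```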

"But down these mean streets a man must go who is not mean, neither tarnished nor afraid." Brave work for someone with less than a PhD in computer science. Well done. Well presented. Well-deserved success in putting some pointy-heads with mucho research $$ to shame.

Eesh. I'm a little afraid to guess. This is one of those things where the knowledge is the expensive part. I know hiring a person with a really deep understanding of machine learning would likely set me back in the low six figures, though. Which might be cheaper than the therapy required.

And it was just last night that I was trying to talk a grant-winning doctoral graduate of this same field, a friend of mine, into getting interested in the missing tools and algorithms that could pull this kind of stuff off with an acceptable degree of accuracy.

Very nice post. My personal thought is that computer programs will always have difficulty measuring link spam if they are made to measure all websites equally. Each site is built for a different reason and therefore may have a different set of rules regarding why the creator created particular links. My site is a directory, and as such I have many links on a single page on several of my pages. This is not done as spam, it is not done as link building; it is done because I thought it appropriate for my particular site.

In regard to how "other" websites link to mine, I dislike the fact that engines take that into account. I cannot control other websites, and I do not think it should affect my site's ranking. I know there are many reasons why it should; I am just saying I personally don't like it much. Again, your post is very informative (especially for the people that have a higher and more educated level of thinking than I do...).

TL;DR? No, just joking! Fantastic post. Love the humour, but most of all your dedication to pushing yourself into difficult and dangerous waters in search of ways to take the tediousness out of fixing what is becoming a very common problem.

I'm so happy someone finally started building this kind of tool. I used to manage the project of a link tool that would check all new links to a site and notify you if something bad had happened to old ones (the site got hit by Penguin, links were removed, content was changed, more external links appeared, etc.). However, the project was frozen :(

Hope you will succeed with your tool, and soon you're gonna blow people's minds with it. Cuz manual checking is OK, but when it comes to checking 100K+ links, I wanna kill myself. So good luck with your project; the SEO world awaits it.

Thank you for the Penguin PDF; that alone was a great read for training classes. The rest of the article was intense, nothing that I would consider doing alone by any means, but I had to finish reading. Good luck in the future!

This is the geekiest SEO article I have ever read in my life! I mean, it's great and interesting; still, I wish I had 8 more hours a day to finish all the SEO duties I need to place my website in Google's top 3. Well, I'd rather wait till someone builds software that does the same job you did.

Great article. I may just tackle this this week and see what we get. Lately I feel like my work is links, links, links. I would love to have some nice on-page optimization to do, or even an audit, but for now it's all chasing down and qualifying links.

A joy to read. I came across this site ages ago, but I've only just decided to stop back and have a read of your articles. After reading your article I have decided to reconstruct my algorithm. I have confidence that your article will make it easy. So thanks, Ian Lurie.