
Concerns:
(1) Is it relevant? How does it compare to Google? This talk directly compares against sites banned or penalized by Google. The final form of the metric is not settled yet, but we will compare it with Google's.
(2) You should be focusing on other challenges instead. We are focusing on a larger, fresher, more consistent index first. This is just a research project now that needs some engineering work to scale; once we have the fresher, more consistent index, we will address the scaling problems here.
(3) What about outing millions of sites? Two things: (a) the delta score, and (b) search engines have this information already; we believe in transparency and in making it available to all.

** Feature engineering is interesting from an intuitive perspective.
** The black box isn't black at all; it's transparent, since I coded it myself. But for the purposes of this talk it is a black box.
** In practice this is iterative.


14.
On-page features
Ntoulas et al.: Detecting Spam Web Pages through Content Analysis, WWW '06
Number of words in title

15.
On-page features
Ntoulas et al.: Detecting Spam Web Pages through Content Analysis, WWW '06
Number of words in title
Histogram (probability density) of all pages

16.
On-page features
Ntoulas et al.: Detecting Spam Web Pages through Content Analysis, WWW '06
Number of words in title
Histogram (probability density) of all pages
Percent of spam for each title length
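The per-slide buildup above can be sketched in a few lines. This is a toy illustration of the title-length feature, not the authors' pipeline: the tiny `pages` list is invented stand-in data, and the real corpus, thresholds, and binning are not given in the talk.

```python
from collections import Counter

# Hypothetical toy data: (title, is_spam) pairs stand in for a labeled crawl.
pages = [
    ("cheap pills buy now best price online deal", True),
    ("my vacation photos", False),
    ("how to bake bread", False),
    ("free free free casino bonus win money fast today", True),
    ("local library opening hours", False),
]

total = Counter()
spam = Counter()
for title, is_spam in pages:
    n = len(title.split())          # feature: number of words in the title
    total[n] += 1
    if is_spam:
        spam[n] += 1

# Histogram (probability density) over all pages, and percent spam per length.
n_pages = sum(total.values())
for length in sorted(total):
    density = total[length] / n_pages
    pct_spam = 100.0 * spam[length] / total[length]
    print(f"{length:2d} words: density={density:.2f}, spam={pct_spam:.0f}%")
```

With real data, the second column is exactly the "percent of spam for each title length" overlay from the slide.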

22.
Are these still relevant today?
Banned: manual penalty and removed from index.
Kurtis Bohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger

23.
Google penalized sites
Penalized sites: algorithmic penalty, demoted off the first page.
I will group both banned and penalized sites together and call them simply "penalized."
Kurtis Bohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger

35.
Anchor text
Simple heuristic for branded/organic anchor text:
(1) Strip all sub-domains, TLD extensions, and paths from URLs. Remove white space.
(2) Check for an exact or partial match between the result and the target domain.
(3) Use a specified list for "organic" anchors (click, here, …).
(4) Compute the percentage of unbranded anchor text.
There are some more technical details (certain symbols are removed, another heuristic for acronyms), but this is the main idea.
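The four steps can be sketched roughly as follows. This is a guess at the idea, not the actual implementation: `strip_domain`, `is_branded`, `pct_unbranded`, and the `ORGANIC` stop list are all invented here, and the extra details mentioned above (symbol removal, the acronym heuristic) are omitted.

```python
import re

# Assumed stop list for "organic" anchors; the real list is not given in the talk.
ORGANIC = {"click", "here", "click here", "website", "read more", "link"}

def strip_domain(url: str) -> str:
    """Step 1: drop the scheme, sub-domains, TLD extension, path, and
    white space, keeping a bare brand token. Crude: for multi-part TLDs
    like .co.uk this picks the wrong label."""
    host = re.sub(r"^[a-z]+://", "", url.lower()).split("/")[0]
    parts = host.split(".")
    brand = parts[-2] if len(parts) >= 2 else parts[0]
    return re.sub(r"\s+", "", brand)

def is_branded(anchor: str, target_url: str) -> bool:
    """Steps 2-3: organic-list check, then exact or partial match between
    the normalized anchor and the target domain."""
    if anchor.lower().strip() in ORGANIC:
        return True                      # organic anchors count as non-suspicious
    text = re.sub(r"\s+", "", anchor.lower())
    brand = strip_domain(target_url)
    return brand in text or text in brand

def pct_unbranded(anchors: list[str], target_url: str) -> float:
    """Step 4: percentage of anchors that are neither branded nor organic."""
    unbranded = sum(1 for a in anchors if not is_branded(a, target_url))
    return 100.0 * unbranded / len(anchors)
```

For example, `pct_unbranded(["SEOmoz", "click here", "best seo tools"], "http://www.seomoz.org/blog")` treats only the keyword-rich third anchor as unbranded.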

36.
Anchor text
A large percentage of unbranded anchor text is a spam signal.

37.
Anchor text
A large percentage of unbranded anchor text is a spam signal.
A mix of branded and unbranded anchor text is best.

55.
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain 86%¹ accuracy and 0.82 AUC using just 32 features (11 in-link features and 21 on-page features).

56.
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain 86%¹ accuracy and 0.82 AUC using just 32 features (11 in-link features and 21 on-page features).
¹ Well, we can get 83% accuracy by always choosing not-spam, so accuracy isn't the best measure. The 0.82 AUC is quite good for such a simple model.
Overfitting was controlled with L2 regularization and k-fold cross-validation.
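The setup described here (L2-regularized logistic regression, 32 features, k-fold cross-validation) can be sketched with scikit-learn. The data below is synthetic stand-in data with a roughly 83% not-spam prior; the real features, corpus, and regularization strength are not given in the talk, so the numbers it prints will not match the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 11 "in-link" + 21 "on-page" features, ~17% spam.
n, d = 2000, 32
y = (rng.random(n) < 0.17).astype(int)
X = rng.normal(size=(n, d)) + 0.25 * y[:, None]   # spam pages shifted slightly

# L2-regularized logistic regression (C controls regularization strength),
# evaluated with stratified k-fold cross-validation on AUC and accuracy.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
print(f"AUC={auc:.2f}, accuracy={acc:.2f}")
```

As the footnote warns, accuracy is inflated by the class prior here too, which is why the cross-validated AUC is the more honest number.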

57.
More sophisticated modeling
[Diagram: in-link features and on-page features each feed their own logistic model; a mixture of the two produces the SPAM / NOT-SPAM prediction.]
Can use a mixture of logistic models, one for in-link features and one for on-page features. Use EM to set the parameters.
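A minimal NumPy sketch of that idea: two logistic "experts", one per feature group, fit with a simple EM loop. The function names (`em_mixture`, `fit_weighted_logistic`) and all the training details (initialization, learning rate, regularization, convergence checks) are invented here; the talk only states the model family.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_weighted_logistic(X, y, r, iters=100, lr=0.1, l2=1e-2):
    """A few steps of responsibility-weighted gradient ascent
    on the logistic log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (r * (y - p)) / len(y) - l2 * w
        w += lr * grad
    return w

def em_mixture(X_in, X_on, y, n_iter=20):
    """EM for a two-expert mixture: one logistic model on in-link
    features, one on on-page features."""
    pi = np.array([0.5, 0.5])
    rng = np.random.default_rng(0)
    r = rng.random(len(y))                     # random initial responsibilities
    w_in = fit_weighted_logistic(X_in, y, r)
    w_on = fit_weighted_logistic(X_on, y, 1 - r)
    for _ in range(n_iter):
        # E-step: each expert's responsibility for each example.
        p_in, p_on = sigmoid(X_in @ w_in), sigmoid(X_on @ w_on)
        lik_in = pi[0] * np.where(y == 1, p_in, 1 - p_in)
        lik_on = pi[1] * np.where(y == 1, p_on, 1 - p_on)
        r = lik_in / (lik_in + lik_on)
        # M-step: refit each expert on responsibility-weighted data.
        w_in = fit_weighted_logistic(X_in, y, r)
        w_on = fit_weighted_logistic(X_on, y, 1 - r)
        pi = np.array([r.mean(), 1 - r.mean()])
    return w_in, w_on, pi
```

Prediction mixes the experts, `pi[0] * sigmoid(X_in @ w_in) + pi[1] * sigmoid(X_on @ w_on)`, while the per-example responsibilities `r` give exactly the "85% in-link / 15% on-page" style attribution shown on the next slide.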

59.
More sophisticated modeling
[Example: 65% penalized likelihood, with responsibility 85% in-link, 15% on-page.]
A mixture of logistic models attributes "responsibility" to the in-link and on-page features, as well as predicting the likelihood of a penalty.

61.
“Unnatural” sites or link profiles
With lots of data, "unnatural" sites or link profiles are moderately easy to detect algorithmically.
You are at risk of being penalized if you build obvious low-quality links.

62.
MozTrust Rules!
mozTrust is a good predictor of spam. Be careful if you are building links from low-mozTrust sites.
mozTrust: an engineering feat of awesomeness.

63.
SEOmoz Tools Future
We hope to have a spam score of some sort available in Mozscape in the future.
In the nearer term, we plan to repurpose some of this work to improve Freshscape.