Covers the theoretical and technical details of URL categorization


In our previous article we discussed how SVM works; now it’s time to move from theory to practice.

Processing the document

This post will ignore the metadata inside a document (for example, the title of an HTML page); metadata will be covered in later posts. First we need to normalize the document.

Stop words

The first step is to remove stop words. Stop words are words that will not help our classification attempt; for example, here is a short list: a, the, on, of, in. Many sites provide stop word lists in various languages.
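As a minimal sketch (assuming plain, whitespace-separated English text), stop word removal can look like this:

```python
# The short stop word list from the text; real lists are much longer.
STOP_WORDS = {"a", "the", "on", "of", "in"}

def remove_stop_words(text):
    """Drop stop words, keeping the remaining words in order (lowercased)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]
```

Note that this also lowercases the text, which is part of the normalization step anyway.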

Zipf’s law

Zipf’s law establishes a relationship between word frequency and rank in natural languages. The intuition behind it is that both speaker and listener try to minimize effort: the speaker prefers the easiest way to phrase things, which means many sentences can be ambiguous and force the listener to work “harder” to understand, while the listener would prefer the speaker to work “harder” and be detailed and unambiguous.

Zipf’s law explains why stop words exist and why removing them has minimal effect on the classification process.

Stemming

Stemming is the process of reducing words to their root stem. For example, the words running, ran, and run will all be converted to the base stem run. Stemming also converts plural forms to singular: berries and berry will both be converted to berry (the singular form). Some sites provide stemming databases in various languages.
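A toy illustration of the idea, where the lookup table and suffix rules are invented for this example (a production system would use a full algorithm such as the Porter stemmer, or a stemming database as mentioned above):

```python
# A few irregular forms handled by lookup; everything else by crude suffix rules.
IRREGULAR = {"ran": "run", "berries": "berry"}

def stem(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ning"):      # running -> run (rough heuristic)
        return word[:-4]
    if word.endswith("ies"):       # cherries -> cherry
        return word[:-3] + "y"
    if word.endswith("s"):         # cars -> car
        return word[:-1]
    return word
```

This is deliberately naive; the point is only that every inflected form collapses to one stem before indexing.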

Converting the document to feature space

Once the document is normalized we need to convert it to feature space. In our previous post we showed a 2D feature space; with documents the feature space can have a very large number of dimensions, one per keyword.

To convert the document we compile an index of keywords, for example:

Father – 1
Mother – 2
Car – 3
Truck – 4
Likes – 5

The keywords are taken from the data set of the current category only, and the index is relevant only to that category, so for the category ‘porn’ we will have a different index than for the category ‘news’. Once we have compiled the index from the keywords, we have a feature space with N dimensions (N being the number of keywords).

Practical example

To convert a document we create a vector in our N dimensions feature space and the value of each dimension is the count of the word.

The sentence: “Father likes car” will be represented as (1,0,1,0,1)
The sentence: “Father likes car father likes mother” will be represented as (2,1,1,0,2)
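A sketch of the conversion, using the five-keyword index above (dimensions are 0-based here instead of the 1-based numbering in the list):

```python
# Keyword index for this category: keyword -> dimension.
INDEX = {"father": 0, "mother": 1, "car": 2, "truck": 3, "likes": 4}

def to_vector(sentence, index):
    """Count each indexed word; words missing from the index are ignored."""
    vector = [0] * len(index)
    for word in sentence.lower().split():
        if word in index:
            vector[index[word]] += 1
    return vector
```

The `if word in index` check also handles the missing-words case discussed below: words that were not in the training set simply do not contribute to the vector.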

Phrases

Another approach, which can be used in conjunction with single keywords, is to use two- or three-word phrases in the classification index. Phrases are more specific than single keywords; for example, based on the examples above, we could add “father likes” and “likes car” to the index.
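Generating the two- and three-word phrases for the index is a matter of sliding a window over the word list; a minimal sketch:

```python
def ngrams(words, n):
    """All runs of n consecutive words, joined into a single phrase."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Each phrase then gets its own dimension in the index, exactly like a single keyword.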

Missing words

Since the index is built from the training set, all the words in the training set will be indexed, but when classifying a new document there may be words that are not in the index because the original training set does not contain them. You should not add them to the feature space unless you decide to add the document to the training set and retrain the algorithm.

How it works

SVM is a method used to determine the type of an object, and an object can be anything: web pages, text, images, handwriting.

The way it works (without getting into the math; if you do want to go deep into the math, you can look at this: SVM guide) is that you give the classifier N training samples (objects of the type you are training the classifier to detect), then you give the classifier another N objects of the same type and tweak the classifier to be more accurate.
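To make the training step concrete, here is a tiny linear SVM trained by sub-gradient descent on the hinge loss (a Pegasos-style sketch in pure Python; real systems use a library, and the learning rate, regularization, and epoch count here are arbitrary choices for the example):

```python
def train_linear_svm(samples, labels, dim, epochs=200, lr=0.01, lam=0.01):
    """Tiny linear SVM: sub-gradient descent on the hinge loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):          # y is +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                         # inside the margin: push out
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:                                  # correct side: only shrink w
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def predict(w, b, x):
    """+1 or -1, depending on which side of the separating line x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

The "tweaking" described above corresponds to adjusting parameters like `lam` (the margin/regularization trade-off) against the second set of samples.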

What you would do is train a classifier for each category and then run a multi-category classification using the SVMs to detect the document’s category.

How it works visual explanation

SVM looks at the objects in space (this is called feature space, and it can have from 2 to n dimensions); for our example we will look at 2D space:

In the image you can see white and black circles; the algorithm needs to determine what is a white circle and what is a black circle based on position in the feature space. Training the algorithm is needed so it can determine where the boundary between the white and black circles lies (the solid black line in the image).

In the right image the algorithm used is linear, and the margin between the dotted lines is determined when tweaking the classifier on the second run.

In the left image the algorithm is a kernel machine and the boundary is curved; again, the margin between the dotted lines is determined by tweaking the algorithm.

Challenges

The first paragraph is overly simplistic; in reality SVMs are much more complex than magically training the classifier.

Challenge 1 – Algorithm

SVM can use a number of decision algorithms, linear and curved (kernel machines), and each has a number of variants; each category will benefit from a different algorithm. One approach is to classify each category with a number of algorithms, and when trying to classify an object, take a vote between the classifiers of the different algorithms.
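The voting idea can be sketched as a majority vote over classifiers trained with different algorithms (the classifiers here are stand-in functions that return a category name):

```python
from collections import Counter

def vote(classifiers, document):
    """Return the category most of the classifiers agree on."""
    ballots = Counter(classifier(document) for classifier in classifiers)
    category, _ = ballots.most_common(1)[0]
    return category
```

Ties are broken arbitrarily here; a real system might fall back to the classifier with the best validation accuracy.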

Challenge 2 – Training set

The training set and tweaking set must be accurate: if, for example, you put a news site into the ‘porn’ training set by mistake, it will contaminate the sample and cause the classifier to fail.

Another problem is the number of sites you need to provide. Say you have 100 categories and need 100 sites for the first run and another 100 sites for the tweak: you need to provide a total of 20,000 sites just for one language.

Challenge 3 – Training set coverage

Because there are so many types of sites in a single category, you need to make sure the training set is as broad as possible. Take the category ‘porn’: if you provided 100 sites with the same look and feel (for example, regular porn sites) and then tried to classify a site with a different look and feel (a forum with porn links), the classifier may not be able to classify the site correctly.

Challenge 4 – Representing a document

The example in the second paragraph with the 2D circles is pretty straightforward, but with URL classification we use documents, which can’t be represented in 2D. There are a number of ways to convert a document; this will be covered in the next post.

Overview

Using the weighted keyword approach allows finer tuning of a web site’s classification than the non-weighted approach.

The way the weighted keyword method works is that you assign a value to each keyword: a keyword that is a strong indicator of a category gets a high value, while a more common word that may only indicate the category when repeated often gets a lower value.

Theoretical example

For example, for the category ‘porn’ the keyword ‘blow job’ would get a higher value than the keyword ‘sensual’; the more exact the keyword, the higher the value. Another example of a high-value keyword would be ‘Arizona escorts’, which is very precise.

Practical example

Let’s take a few keywords under ‘porn’:

Sex – 50
Porn – 40
Adult – 10

And under news:

News – 20
Reporter – 20
Breaking news – 40

If we analyze this sentence: ‘our reporter just has breaking news about a sex ring that was arrested, further in the news’

We can see that the score for the category news would be 80 (reporter 20 + breaking news 40 + news 20) and for porn it would be 50 (sex 50), so we can say this is a news document.
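A sketch of the scoring, matching longer phrases before single words so that ‘breaking news’ is not also counted as ‘news’ (this uses naive substring matching for brevity; a real implementation would tokenize first):

```python
def score(text, keywords):
    """Sum keyword weights; longer phrases are matched (and removed) first."""
    text = text.lower()
    total = 0
    for phrase, weight in sorted(keywords.items(),
                                 key=lambda kv: -len(kv[0].split())):
        total += text.count(phrase) * weight
        text = text.replace(phrase, " ")   # don't let shorter keywords re-match
    return total

NEWS = {"news": 20, "reporter": 20, "breaking news": 40}
PORN = {"sex": 50, "porn": 40, "adult": 10}
```

For the sentence above this gives 80 for NEWS and 50 for PORN, matching the totals in the text.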

Categories relationships

Once you have analyzed the document, you get a list of categories and scores. (For web sites, not all keywords are weighted the same; for example, you might give the title extra weight compared to the body. This will be covered in a different post.) The highest-scoring category is usually the category of the document, and the second may indicate a secondary category. You will need to decide, based on your weights, when to allow the second category, for example when it scores over 50% of the main category.

If you look at the previous example, we could say the main category is news and the secondary is porn.

Cross-over threshold

You may want to define a cross-over threshold, meaning that if a category’s score passes that number, it is considered the main category. This is usually done with porn/adult: even if the category is not first, the document will still be considered porn/adult once the score crosses that boundary.
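One way to sketch the decision rule (the category names and threshold values are illustrative):

```python
def pick_category(scores, crossover=None):
    """Pick the main category; a category that passes its cross-over
    threshold wins even if it is not the top scorer."""
    if crossover:
        for category, bound in crossover.items():
            if scores.get(category, 0) >= bound:
                return category
    return max(scores, key=lambda c: scores[c])
```

With the earlier example, news (80) wins normally, but a porn threshold of 45 would flip the result.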

Advanced weighted options

Another use for the category scores is to decide which categories are ‘bad’, for example all the non-family categories. If the sum of all the ‘bad’ category scores is greater than the sum of the ‘good’ ones, you deem the document ‘bad’ and choose the highest-scoring ‘bad’ category even if it is not first overall.
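A sketch of the good/bad split (the set of ‘bad’ categories here is an assumption for the example):

```python
BAD = {"porn", "gambling", "violence"}   # assumed non-family categories

def classify_with_bad_bias(scores):
    """If 'bad' categories outweigh 'good' ones overall, return the
    highest-scoring 'bad' category even when it is not first."""
    bad_total = sum(s for c, s in scores.items() if c in BAD)
    good_total = sum(s for c, s in scores.items() if c not in BAD)
    if bad_total > good_total:
        return max((c for c in scores if c in BAD), key=lambda c: scores[c])
    return max(scores, key=lambda c: scores[c])
```

So a page scoring news 40, porn 30, gambling 20 would be deemed ‘porn’ even though news is the single highest category.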

Playing with the categories

Once you run your engine on a number of sites, patterns will start to emerge, and you will see that having two main categories in a certain ratio often indicates a third category. For example, a site with porn/adult and dating as the two main categories usually indicates an adult dating site (dating with sex), while entertainment and adult can indicate a gossip site.

URL categorization or data categorization (if you want to classify a web page), which can also be called URL classification or data classification, can look magical, and in a way it is: even if you use one of the known ways to classify a web page, you still need to be creative in order to get the best results, meaning low false positives and negatives.

These are the common ways to perform classification; this site will discuss them in greater depth in other posts and go into the pros and cons of each approach:

Non weighted keyword based

You have a list of keywords for each category, and if you find a specific keyword you give the document that keyword’s category. For example, you can say that every document containing the word ‘fuck’ should be blocked, even though some documents may contain that word but otherwise be family friendly. You can read more about: Non weighted keyword based URL classification.
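In its simplest form this is just a block-list lookup (the two-word list here is purely illustrative):

```python
BLOCKED = {"fuck", "porn"}   # illustrative block list

def is_blocked(document):
    """True if any word of the document is on the block list."""
    return any(word in BLOCKED for word in document.lower().split())
```

Note that this flags the family-friendly counterexamples from the text just as readily, which is exactly the weakness of the non-weighted approach.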

Weighted keyword based

Same approach as the previous section, but each keyword has a different score, some low and some high, and a document is placed in a category only if a certain threshold is passed. This allows greater accuracy than the previous method. You can read more about: Weighted keyword based URL classification.

SVM

Support Vector Machine (SVM) was invented about 20 years ago and allows software to learn what a certain category looks like based on a training set. First you give it N documents of a certain type (you can also provide N documents of a different type for further training), then you tweak the algorithm by doing a special calibration with another set of N documents of the same type. From that point on, the algorithm can try to determine the classification of a document.

Manual classification

Every page is reviewed by a human and the correct category is set. A strict set of rules should be defined, because each person thinks differently (and can be affected by culture and religion), and it is not uncommon for two people to disagree over the category of a specific document. For example, a nude renaissance portrait: is it art, or is it nudity?

Manual classification with crowdsourcing

Same approach as manual classification, but instead of having a number of trained people do the classification, you leverage the power of the crowd with services like Mechanical Turk or Microworkers to classify a document.

Link based

This method can be used as a secondary helper to the previous methods. For example, we can assume that a web page will link out to similar pages, so if we have a list of popular gambling sites, we can safely assume that a web page with outgoing links to those sites is related to gambling (this approach will not work with portals and statistics sites).
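A sketch of the link heuristic (the domain list and the two-link cut-off are invented for the example):

```python
KNOWN_GAMBLING = {"example-casino.com", "example-poker.net"}  # hypothetical list

def looks_like_gambling(outgoing_domains, known=KNOWN_GAMBLING, min_hits=2):
    """Flag a page whose outgoing links hit enough known gambling domains."""
    hits = sum(1 for domain in outgoing_domains if domain in known)
    return hits >= min_hits
```

Requiring more than one hit is one way to avoid flagging portals that link to everything, though as the text notes this heuristic remains a secondary signal.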

Computer vision

At the time of writing there isn’t any credible service that uses this approach, but it can become a legitimate option as computer vision matures. The way it works is to try to determine the classification of the page by detecting the type of images on the page.

Keyword based URL classification (non weighted) can be good for environments where zero tolerance is needed.

A quick recap: keyword based classification means that when a word is encountered, the document or web page is classified based on that keyword. For example, if the word ‘sex’ appears, we can assume an adult document (you can read the entire: URL Classification summary).

The problem with this approach is that some words can be either good or bad. Take the keyword ‘sucks’: it can be an indication of adult content, but it can also appear in legitimate phrases such as “man, that test sucks”, or in a phrase that may or may not be adult: “that woman was sucking a lollipop”.

The non weighted keyword approach can work well in two scenarios:

You are trying to block a search phrase and don’t have enough information to know whether the search is of an adult nature or not (if you don’t take the search results into account).

You don’t care about false positives and prefer to be over-cautious.

We can see examples of such blocks in Google: if safe search is enabled, some keywords will not return any results. For example, the keyword ‘porn’ would be blocked, although it has some legitimate uses, as in ‘porn blocker’.