I want to teach myself enough machine learning that I can, to begin with, put available open source ML frameworks to use for things like:

1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.)

2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand), based on the page content derived from 1

3. ... similar classification tasks on text and pages.

As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.

As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?

I understand that SVMs are well suited to ( binary ) classification tasks like mine, and open source frameworks like libSVM are fairly mature?

Taking the question forward, and assuming SVMs are a better answer to classification tasks like mine, how do I come to understand things like:

What $R^n$ stands for

or hyperplanes

or how to transform a space so that
it can be broken into two parts, etc.?

In that case, what subjects and topics
does a computer science graduate need
to learn right now, so that the above
requirements can be met by putting
these frameworks to use? Especially those related to SVMs.

I am willing to learn and put in as much effort as I possibly can.

I fully expect recommendations to learn specific portions of statistics and probability theory, so do say so if that is required!

I will modify this question if needed, depending on all your suggestions and feedback.

I have edited my answer to address your rephrased question (or at least the relevant parts). I would still say your question needs further rewriting as it is still mostly not pertinent to TCS -- dealing with available code, HTML, libSVM, etc.
– Lev Reyzin♦, Sep 22 '10 at 0:31

You may want to try MetaOptimize for more practical ML questions.
– Kaveh, Nov 22 '10 at 2:29

As a supplement to theory, I find it is always incredibly useful (if not essential) to build and run lots of simple examples. To that end, ORANGE (ailab.si/orange) is a really nice, flexible ML framework which supports a wide variety of ML techniques and robust Python scripting. It also includes a lot of real-world examples from the UC Irvine public data sets.
– s8soj3o289, Nov 22 '10 at 2:46

3 Answers

From the theory end, you should understand how SVMs work (and why they do). For example, I like the scribe notes from this class for a quick explanation and something like this tutorial for a more in-depth one. You might also want to consider other machine learning techniques that are suited to binary classification on large data, e.g. boosting or random forests.

As for open source code issues and parsing HTML, Stack Overflow might be a better place to seek answers.

$R^n$ "stands for" the fact that the (data) points to be classified lie in an $n$-dimensional Euclidean space. For example, points in our familiar $2$ dimensions (with, say, latitude and longitude coordinates) lie in $R^2$; an $n$-dimensional space generalizes this notion to points described by $n$ real coordinates.
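To make this concrete for text, here is a minimal sketch of one common way to turn a document into a point in $R^n$: count how often each word of a fixed vocabulary occurs. The function name `to_vector`, the four-word vocabulary, and the sample sentences are all made up for illustration; real systems use far larger vocabularies and usually some weighting scheme.

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Map a text to a point in R^n, where n = len(vocabulary).

    Coordinate i is the number of times vocabulary[i] occurs in the text.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

# Hypothetical 4-word vocabulary, so every text becomes a point in R^4.
vocabulary = ["buy", "cheap", "kernel", "margin"]

ad_vec = to_vector("Buy cheap cheap watches", vocabulary)
article_vec = to_vector("The kernel controls the margin", vocabulary)

print(ad_vec)
print(article_vec)
```

Once every page is a vector of the same length, "classifying pages" becomes a question about where points lie in that space.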

A hyperplane (in $R^n$) is a flat subspace of dimension $n-1$. It is what an SVM uses to classify the data (after they have been mapped to a higher-dimensional space by the chosen kernel): points on one side get one label, points on the other side get the other. For example, if your space is $R^2$ (a plane) then a hyperplane is a line, in $R^3$ it is an ordinary plane, and so on.
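As a rough illustration, a hyperplane can be written as the set of points $x$ with $w \cdot x + b = 0$, and classifying a point amounts to checking the sign of $w \cdot x + b$. In the sketch below, `side_of_hyperplane` is an illustrative name, and `w` and `b` are picked by hand; an SVM would learn them from training data.

```python
def side_of_hyperplane(w, b, x):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hand-picked (not learned) hyperplane in R^2: the line x1 - x2 = 0.
w, b = [1.0, -1.0], 0.0

label_a = side_of_hyperplane(w, b, [3.0, 1.0])  # x1 > x2
label_b = side_of_hyperplane(w, b, [1.0, 3.0])  # x1 < x2

print(label_a, label_b)
```

The same function works unchanged in any dimension, since it only relies on the dot product.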

How to transform the space? This is done via the kernel function. A big part of SVM design is choosing a proper kernel for the problem at hand.
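A toy sketch of why transforming the space helps, under made-up data: four points on a line, where the two outer points have one label and the two inner points the other, cannot be split by a single threshold. Mapping each $x$ to $(x, x^2)$ makes them separable by a line in $R^2$. The map `phi` is written out explicitly here only for illustration; a kernel SVM would instead work with inner products in the mapped space without constructing it, and `w` and `b` are again hand-picked rather than learned.

```python
def phi(x):
    """Explicit feature map from R^1 to R^2 (for illustration only)."""
    return (x, x * x)

# Made-up 1-D data: outer points are +1, inner points are -1.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [1, -1, -1, 1]

# Hand-picked hyperplane in the mapped space: the line x2 = 2.5.
w, b = (0.0, 1.0), -2.5

def predict(x):
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1

predictions = [predict(x) for x in points]
print(predictions == labels)
```

No single cut on the original line separates these labels, but after the mapping one straight line does.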

To understand these three points in more detail, you can read the references above.

First of all, you should study the fundamentals. I found this Stanford University course very helpful. In particular, pay attention to "Lecture 11". It covers best practices for applying machine learning techniques and debugging an application, no matter what learning model you use.

Additionally, among the open source ML frameworks, there are GATE and RapidMiner that look suitable for the task.

The only other thing I would add to what has been said is: if you haven't already, take a course in linear algebra.

It will help you understand what $R^n$ means and related concepts. Courses on a topic will sometimes give you a brief background in the linear algebra concepts specific to the material at hand, but they never go very deep, and I always find that it is a lot easier to comprehend and absorb an idea if you have a good understanding of the underlying mechanics.