get keyword and category of a page

I am going to write a Demo to extract keyword and category of a web page, does anyone know any ...

You asked for multi-language stuff, which is what zacs was giving you, I would think.

As for reading the webpage, I would guess you would have to read it. If you're looking for the meta keywords, then you can just look for that tag when you read the page; I'm not sure what you mean by category.
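If the meta keywords tag is all that is needed, a minimal sketch could look like the following. It uses Python's standard `html.parser` module and assumes the page has already been downloaded into a string; the names `MetaKeywordParser`, `meta_keywords`, and `sample_page` are made up for illustration.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords" ...> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip()
            )


def meta_keywords(html):
    """Return the list of keywords declared in the page's meta tag."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical page; a real one would be the HTML you downloaded.
sample_page = """<html><head>
<meta name="keywords" content="stocks, debt, layoffs">
<title>Money</title></head><body>...</body></html>"""

print(meta_keywords(sample_page))  # ['stocks', 'debt', 'layoffs']
```

Many pages have no such tag, or fill it with junk, so this only covers the easy case.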

Originally Posted by zacs7

You also just asked for references to start you off, are you writing the demo or looking for a program to do what you describe?

Any reference is fine: code, papers, or tutorials. My purpose is just to extract keywords from a web page. Another job is to categorize the web page -- for example, automatically identify it as a financial web page or a sports web page.

The way forums work, in general, and this one is no exception, is that if you ask DETAILED questions, you get good answers. If you ask "open ended" questions that would need dozens of pages to be answered even nearly in full, then you are going to get very short answers with links to basic functionality that does roughly what you want - if you are lucky.

Apparently you already know how to use libcurl.

So what exactly is it you want to have help with?

When you say "keywords" do you mean the content itself? And if so, what constitutes a keyword? Do you have a list of keywords, or is the application supposed to determine what is a keyword by itself?

Reading any foreign language is pretty difficult if you don't know at least some of the language yourself, and many non-European languages will probably require some help from someone who knows the language, as the buildup of words is much more of "one symbol -> one word".

I mean the semantic meaning of a web page. For example, when you browse MSN Money pages, you get keywords like financial, stock, debts, layoff, auto industry, something like this.

It is just like when we made a site to contract with Google: we needed to provide keywords for each page to bid on. Now I just want some automatic way to generate keywords for each page I have.

Not sure whether my point is clear this time.

Well, your point is fairly clear - however, the solution is non-trivial, I'm pretty sure. When you have the content of the MSN Money page, how do you determine that THOSE words are keywords? Yes, we can easily remove "the", "it", "is", "are" and other common words. But the content is still fairly complex.
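The "remove the common words" step is the easy part, and a rough sketch of it might look like this. The stop-word list and the sample sentence are invented for illustration, and counting raw frequency is only an assumption about what makes a word a candidate keyword.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "it", "is", "are", "a", "an", "and",
              "of", "to", "in", "as", "on"}


def candidate_keywords(text, top_n=5):
    """Return the top_n most frequent words that are not stop words."""
    # Lowercase the text and split it into runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample text standing in for the page content.
sample = "The stock market is down. Stock prices are falling as the debt grows."
print(candidate_keywords(sample, top_n=1))  # [('stock', 2)]
```

On a real page, the words that survive this filter are still mostly noise, which is where the hard part starts.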

And how do you deal with "sport and finance intermingled":

The Honda Formula1 team leaves the sport, claiming credit crunch and return on investment as reasons

(By the way, I just made that particular text up - but something similar has been published recently).

So, what I'm saying is that you either have to make a list of words that YOU think will categorize a page, and then use that list for your categorization.
Or you need some really clever algorithms to figure out which words are "important" and which aren't.
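The first option, a hand-made word list per category, can be sketched roughly as below. The categories, the word lists, and the function names are all hypothetical; the scoring is simply a count of matching words, which is an assumption and not a recommendation.

```python
import re

# Hypothetical hand-made word lists, one per category.
CATEGORY_WORDS = {
    "finance": {"credit", "investment", "stock", "debt", "layoff"},
    "sports": {"team", "sport", "race", "league", "formula1"},
}


def category_scores(text):
    """Count how many words of the text appear in each category's list."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        name: sum(1 for w in words if w in vocab)
        for name, vocab in CATEGORY_WORDS.items()
    }


def categorize(text):
    """Return the highest-scoring category, or None if nothing matched."""
    scores = category_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


headline = ("The Honda Formula1 team leaves the sport, claiming credit "
            "crunch and return on investment as reasons")
print(category_scores(headline))  # {'finance': 2, 'sports': 3}
print(categorize(headline))       # sports
```

The made-up headline scores in both categories, which is exactly the intermingling problem: picking the single highest score hides the fact that the text is about finance as well.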

Thanks. I understand the solution is not trivial. I just want to know whether there is any existing solution, such as an open source one, that could be used to extract keywords from a specific page.

I have no idea about this area, just ping people here for experienced advice. :-)
