Average non-meaningful word ratio

I need to know the ratio of meaningful words to non-meaningful words in a sentence. I don't know if I explained that correctly, so here's an example:

"hello. this is a test which you'll all find very interesting and will study for many hours when you get home."

These are the "meaningful" words in that sentence:
hello, test, you'll, find, very, interesting, study, hours, you, home

All the rest are "non-meaningful" (i.e. they have no impact on the sentence other than to structure it), which gives 10 meaningful words out of 21 in total, a ratio of 10:21. Is there an average ratio like this that holds for all English documents (on average, of course)?
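A minimal sketch of how that count could be made, assuming a hand-picked stop-word list. The list below is an assumption: it contains only the words the example above treats as structural, and the names `STOP_WORDS`, `tokenize` and `meaningful_ratio` are made up for illustration.

```python
import re

# Assumed stop-word list: only the words the example sentence treats as structural.
STOP_WORDS = {"this", "is", "a", "which", "all", "and",
              "will", "for", "many", "when", "get"}

def tokenize(text):
    """Lower-case the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def meaningful_ratio(text, stop_words=STOP_WORDS):
    """Return (meaningful word count, total word count) for the text."""
    words = tokenize(text)
    meaningful = [w for w in words if w not in stop_words]
    return len(meaningful), len(words)

sentence = ("hello. this is a test which you'll all find very interesting "
            "and will study for many hours when you get home.")
print(meaningful_ratio(sentence))  # (10, 21)
```

The result depends entirely on which words go in the list, so any "average" ratio would only be meaningful relative to one fixed list.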

If that doesn't exist, is there a maximum "keyword" average?

For example, if I'm talking about my pen and my desk:

"This is my pen. I normally keep my pen on my desk at all times. I work on my desk, and use my pen for writing with."

Here, the keywords (pen, desk) account for 5 of the 27 words, a ratio of 5:27. Is there a maximum ratio like this, where a keyword should not be used more than X times in a sentence?
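As a rough sketch, the keyword count in that example could be computed like this. The function name `keyword_density` is hypothetical, and no particular maximum is assumed; the code only produces the numbers a limit would be compared against.

```python
import re
from collections import Counter

def keyword_density(text, keywords):
    """Return (keyword occurrences, total word count) for the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter returns 0 for keywords that never appear.
    hits = sum(counts[k.lower()] for k in keywords)
    return hits, len(words)

text = ("This is my pen. I normally keep my pen on my desk at all times. "
        "I work on my desk, and use my pen for writing with.")
print(keyword_density(text, ["pen", "desk"]))  # (5, 27)
```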

Re: Average non-meaningful word ratio

Hi

Thanks for that, but it's not exactly what I was looking for...

I'm actually trying to build some artificial intelligence for a search engine, and it needs to be able to distinguish between keywords and trivial (non-meaningful) words. I do that by counting word frequency: the more times a word appears, the more likely it is to be trivial. However, if a word appears frequently because the article focuses heavily on it, then obviously I don't want it to be considered trivial at all.
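One standard way to separate those two cases is TF-IDF, which is a swapped-in technique here rather than anything described in this thread: a word that is frequent in one article but rare across the rest of the collection scores high, while a word that is frequent everywhere scores near zero. A minimal sketch, assuming a small in-memory list of documents (the function names are hypothetical):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return one {word: score} dict per document.

    score = (share of the document's words that are this word)
            * log(number of documents / documents containing the word)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        term_freq = Counter(words)
        total = len(words)
        scores.append({
            w: (count / total) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        })
    return scores

docs = [
    "the pen is on the desk",
    "the cat sat on the mat",
    "the dog is in the garden",
]
scores = tf_idf(docs)
print(scores[0]["the"])                      # 0.0, "the" is in every document
print(scores[0]["pen"] > scores[0]["the"])   # True
```

With only three toy documents the numbers mean little; the point is only that a word appearing in every document gets a score of zero no matter how often it is repeated.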

My solution is:

If I can find out the average ratio of non-meaningful words to keywords, I'll be able to guess whether a word is non-meaningful or very meaningful. It'll also use loads of other tests at different levels, etc.

Re: Average non-meaningful word ratio

That's a brilliant idea, although I see a major problem in it. To be able to determine a ratio like the one you're describing you would have to first teach a computer program to analyze what a proper sentence should look like under ALL circumstances. Basically you'd have to teach a heuristic algorithm to go beyond itself and analyze all the things that cannot be measured. Namely intent, mood and tone.

As we've all seen with so-called "grammar checks" and "translation software", technology has a loooooooooooooong way to go before this is a reality.

Re: Average non-meaningful word ratio

Hehe, yeah. That would be the ultimate goal, but I could never be bothered to do that :P

That's only one "layer" of the analysing... The other "layers" look at where the text is. For example, text which is in bold is considered to be important, but I need a way of making sure any trivial bold text ("the", etc.) doesn't also get considered important.

Once finished, it'll probably try to learn from these layers:
(+) - makes word more important
(-) - makes word less important
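A rough sketch of how such layers might be combined, assuming each layer simply returns a positive or negative adjustment and the adjustments are added up. The layer names, the weights and the tiny stop-word list are all hypothetical, chosen only to show a trivial bold word ending up with a negative score.

```python
# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "is", "and", "of"}

def bold_layer(word, context):
    """(+) Words that appeared in bold get a boost (weight is an assumption)."""
    return 1.0 if word in context.get("bold_words", set()) else 0.0

def stop_word_layer(word, context):
    """(-) Known trivial words get a larger penalty, so bold alone can't rescue them."""
    return -2.0 if word in STOP_WORDS else 0.0

LAYERS = [bold_layer, stop_word_layer]

def importance(word, context, layers=LAYERS):
    """Sum the adjustments from every layer for one word."""
    word = word.lower()
    return sum(layer(word, context) for layer in layers)

context = {"bold_words": {"the", "encryption"}}
print(importance("the", context))         # -1.0
print(importance("encryption", context))  # 1.0
```

Keeping each layer as a separate function means new tests can be added to the list without touching the others.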

Also, I'd like it to learn from previous searches. *Most* people don't include words such as "the" in their search queries (but some do, so I'd need a way of checking for that...), so a word that had once been in a search term would have its importance increased.
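A minimal sketch of that idea, assuming past queries are available as a plain list of strings. Using the share of queries that contain a word, rather than a yes/no flag, is one possible way to stop the occasional "the" in a query from counting for much; the function names are hypothetical.

```python
import re
from collections import Counter

def query_term_counts(queries):
    """Count, for each word, how many past queries contained it."""
    counts = Counter()
    for query in queries:
        # set() so a word repeated inside one query is only counted once.
        counts.update(set(re.findall(r"[a-z']+", query.lower())))
    return counts

def query_boost(word, counts, total_queries):
    """Share of past queries containing the word, between 0.0 and 1.0."""
    if total_queries == 0:
        return 0.0
    return counts[word.lower()] / total_queries

queries = ["cheap pen", "pen for writing", "the best desk"]
counts = query_term_counts(queries)
print(query_boost("pen", counts, len(queries)) >
      query_boost("the", counts, len(queries)))  # True
```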

Loads of other ideas in the back of my head, but can't quite put my finger on them yet...

It should be quite cool once finished though (I hope, otherwise I've wasted several weeks' work!).

Re: Average non-meaningful word ratio

I'm actually trying to build some artificial intelligence for a search engine, and it needs to be able to distinguish between keywords and trivial (non-meaningful) words.

Hi Colin.

If you're talking about search engines and their algos, I would recommend you visit SearchEngineWatch.com and ask in their Search Technology & Relevancy forum. Their moderator, Orion, has an encyclopedic knowledge of all things algorithmic regarding search engines.

There's one condition, that you come back and let us know how you get on!

Re: Average non-meaningful word ratio

Originally Posted by colinhorne

If I can find out the average ratio of non-meaningful words to keywords, I'll be able to guess whether a word is non-meaningful or very meaningful. It'll also use loads of other tests at different levels, etc.

Thanks

This is actually an area of investigation quite close to my heart. I will be sure to follow your discussion with interest, here or at the SEW forum. It sounds to me as if you need an expanded stop-word list to define the non-meaningful words.

Anyway, I wish you luck with your research.

Edited to add:

I'd be interested in starting a new forum area here specifically to do with analysing language. Would anyone (other than me) be interested?

Re: Average non-meaningful word ratio

Hi

Thanks for the advice, I'll head over to searchenginewatch.com shortly...

Once (if ever!) I get this finished, I'd be happy to give you the source code (or document the workings, etc.), if you're interested. That is, if it works of course. I have to admit, it wasn't desperately sensible of me to dive into this project though: my main strengths are encryption and database-driven apps, not evaluating the English language (and I've got the satisfaction of failing my English exams when I went to school to prove that fact).