If this is your first visit, be sure to
check out the FAQ by clicking the
link above. You may have to register
before you can post: click the register link above to proceed. To start viewing messages,
select the forum that you want to visit from the selection below.

Unanswered: FULLTEXT search problem

Hi all,

I use MySQL 3.23.58 on Fedora Core 2. I have two varchar fields in a table, which I have indexed properly and observe the following problems with FULLTEXT:

If the query finds one result it works ok. If the query finds more than one it returns none. For example I have 3 entries named 'hello', 'world' and 'world' respectively. FULLTEXT search for 'hello' returns entry 1, search for 'world' returns nothing!

In addition, string matching works only for the exact string, ie search for 'hello' works, search for 'hell' does not work.

In your table, 2/3 of all rows contains "world", and therefore MySQL treats it like a stopword. It does so if more than 50% of the rows matches, and instead of (possibly) a lot of results gives you no result at all. You can make a fulltext search IN BOOLEAN MODE to work around this, but often you don't want a result set that contains more than 50% of the rows anyway.

MySQL Reference Manual, section 13.6:

For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word “MySQL” is present in every row of the articles table, a search for the word produces no results:

The search result is empty because the word “MySQL” is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large datasets, this is the most desirable behavior—a natural language query should not return every second row from a 1GB table. For small datasets, it may be less desirable. A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular dataset in which they occur. A given word may exceed the 50% threshold in one dataset but not another. The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more.