str_word_count

Description

Counts the number of words inside string.
If the optional format is not specified, the return value is an
integer representing the number of words found. If format is 1 or 2,
the return value is an array whose content depends on the
format. The possible values of
format and the resulting output are listed below.

For the purpose of this function, 'word' is defined as a locale-dependent
string of alphabetic characters, which may also contain, but not start
with, the "'" and "-" characters.

Parameters

string

The string

format

Specifies the return value of this function. The currently supported values
are:

0 - returns the number of words found

1 - returns an array containing all the words found inside the
string

2 - returns an associative array, where the key is the numeric
position of the word inside the string and
the value is the actual word itself
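
The three formats can be compared side by side. The following is a minimal sketch; the sample sentence is made up for illustration, and the offsets in format 2 are byte positions.

```php
<?php
$text = "Hello fri3nd, you're looking good today!";

// Format 0 (the default): the number of words as an integer.
$count = str_word_count($text);

// Format 1: a plain list of the words found.
// The digit in "fri3nd" is not alphabetic, so it splits the word in two.
$words = str_word_count($text, 1);

// Format 2: an associative array of byte offset => word.
$positions = str_word_count($text, 2);

var_dump($count);
print_r($words);
print_r($positions);
```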

User Contributed Notes

/***
 * This simple UTF-8 word count function (it only counts)
 * is a bit faster than the one with preg_match_all,
 * but about 10x slower than the built-in str_word_count.
 *
 * If you need the hyphen or other code points as word characters,
 * just put them into the [brackets], like [^\p{L}\p{N}\'\-]
 * If the pattern contains UTF-8, utf8_encode() the pattern,
 * as it is expected to be valid UTF-8 (using the u modifier).
 **/
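
The function this comment belongs to is not shown. A minimal sketch of a count-only approach along those lines, assuming preg_split() with the bracket expression quoted in the comment, might look like this (the name utf8_word_count is hypothetical):

```php
<?php
/**
 * Hypothetical helper: counts words in a UTF-8 string.
 * Not the note author's original code, only a sketch of the same idea.
 */
function utf8_word_count(string $text): int
{
    // Split on every run of characters that is not a letter (\p{L}),
    // a number (\p{N}), an apostrophe or a hyphen. The u modifier makes
    // PCRE treat both pattern and subject as UTF-8.
    $parts = preg_split('/[^\p{L}\p{N}\'\-]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? 0 : count($parts);
}
```

Because the apostrophe and hyphen are treated as word characters here, a hyphen standing alone between spaces would also be counted as a word.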

Here is a word-count function which supports UTF-8 and Hebrew. I tried other functions but they don't work. Note that in Hebrew, '"' and '\'' can be used inside words, so they are not separators. This function is not perfect: I would prefer the function we use in JavaScript, which treats every character except [a-zA-Zא-ת0-9_\'\"] as a separator, but I don't know how to do that in PHP.

I removed some of the separators which don't work well with Hebrew ("\x20", "\xA0", "\x0A", "\x0D", "\x09", "\x0B", "\x2E"). I also removed the underscore.
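
The JavaScript-style approach mentioned above (everything outside the character class is a separator) can probably be expressed with preg_split() and the u modifier. This is only a sketch under that assumption; the function name is made up, and the Hebrew range is written as U+05D0 to U+05EA (א to ת).

```php
<?php
/**
 * Hypothetical helper: counts words, treating every character except
 * Latin letters, Hebrew letters, digits, underscore, ' and " as a separator.
 */
function count_words_hebrew(string $text): int
{
    // \x{05D0}-\x{05EA} is the Hebrew alphabet; it needs the u modifier.
    $pattern = '/[^a-zA-Z\x{05D0}-\x{05EA}0-9_\'"]+/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // false means PCRE rejected the input (for example, invalid UTF-8).
    return $parts === false ? 0 : count($parts);
}
```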

This is a fix to my previous post on this page - I found out that my function returned an incorrect result for an empty string. I corrected it and I'm also attaching another function - my_strlen.

It strips the pipe ("|") characters, which antiword uses to format tables in its plain-text output, removes runs of two or more dashes (also used in tables), and then counts the words.
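
The original code for this note is not shown. A minimal sketch of the steps described, assuming the antiword output is already available as a string, could be (count_antiword_words is a hypothetical name):

```php
<?php
/**
 * Hypothetical helper: counts words in antiword plain-text output.
 */
function count_antiword_words(string $plainText): int
{
    // Pipes are table cell borders; replace them with spaces so that
    // neighbouring cells do not run together.
    $clean = str_replace('|', ' ', $plainText);

    // Two or more dashes in a row are table rules, not hyphenated words.
    // A single dash is kept so that words like "well-known" stay whole.
    $clean = preg_replace('/-{2,}/', ' ', $clean) ?? '';

    return str_word_count($clean);
}
```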

Counting words using explode() and then count() is not a good idea for huge texts, because it uses a lot of memory to store the text a second time as an array. This is why I'm using while() { .. } to walk the string.
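
The idea of walking the string instead of building an array can be sketched as follows. This is not the note author's code; it assumes that words are separated by whitespace only and that the ctype extension is available.

```php
<?php
/**
 * Hypothetical helper: counts whitespace-separated words without
 * creating an intermediate array of the words.
 */
function count_words_walking(string $text): int
{
    $count  = 0;
    $inWord = false;
    $length = strlen($text);
    $i      = 0;

    while ($i < $length) {
        if (ctype_space($text[$i])) {
            // Whitespace ends the current word, if any.
            $inWord = false;
        } elseif (!$inWord) {
            // First non-whitespace byte after a gap starts a new word.
            $inWord = true;
            $count++;
        }
        $i++;
    }

    return $count;
}
```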

Hi, this is the first time I have posted on the PHP manual. I hope some of you will like this little function I wrote.

It returns a string within a certain character limit while still retaining whole words. It breaks out of the foreach loop once it has found a string short enough to display, and the character list can be edited.
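
The function itself is not shown. One possible way to get the same effect is sketched below; it walks back from the limit to the nearest break character rather than using the foreach loop the note describes, and both the function name and the default break characters are assumptions.

```php
<?php
/**
 * Hypothetical helper: cuts $text to at most $limit bytes without
 * splitting a word. $breakChars lists the characters a cut may fall on.
 */
function truncate_on_word(string $text, int $limit, string $breakChars = " \t\n\r"): string
{
    if ($limit <= 0) {
        return '';
    }
    if (strlen($text) <= $limit) {
        return $text;
    }

    // Walk back from the limit until a break character is found.
    for ($i = $limit; $i > 0; $i--) {
        if (strpos($breakChars, $text[$i]) !== false) {
            return rtrim(substr($text, 0, $i));
        }
    }

    // No break character at all: fall back to a hard cut.
    return substr($text, 0, $limit);
}
```

The sketch works on bytes, so for multibyte text the mb_* equivalents would be needed.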

/**
 * Returns the number of words in a string.
 * As far as I have tested, it is very accurate.
 * The string can have HTML in it,
 * but you should do something like this first:
 *
 *    $search = array(
 *      '@<script[^>]*?>.*?</script>@si',
 *      '@<style[^>]*?>.*?</style>@siU',
 *      '@<![\s\S]*?--[ \t\n\r]*>@'
 *    );
 *    $html = preg_replace($search, '', $html);
 *
 */
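
The counting function that followed this comment is not shown. A minimal sketch of the whole pipeline, assuming the built-in str_word_count() does the final count, might be (count_words_in_html is a hypothetical name):

```php
<?php
/**
 * Hypothetical helper: strips scripts, styles, comments and tags from
 * HTML, then counts the remaining words.
 */
function count_words_in_html(string $html): int
{
    $search = [
        '@<script[^>]*?>.*?</script>@si', // script blocks with their content
        // The U modifier from the note is dropped here: combined with
        // .*? it would make the match greedy.
        '@<style[^>]*?>.*?</style>@si',   // style blocks with their content
        '@<![\s\S]*?--[ \t\n\r]*>@',      // comments
    ];
    $html = preg_replace($search, '', $html) ?? '';

    // Replace the remaining tags with spaces so that text from adjacent
    // elements does not merge into a single word.
    $text = preg_replace('/<[^>]*>/', ' ', $html) ?? '';

    // Decode entities so that "&amp;" is not counted as the word "amp".
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    return str_word_count($text);
}
```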

I needed a function which would extract the first hundred words out of a given input while retaining all markup such as line breaks, double spaces and the like. Most of the regexp based functions posted above were accurate in that they counted out a hundred words, but recombined the paragraph by imploding an array down to a string. This did away with any such hopes of line breaks, and thus I devised a crude but very accurate function which does all that I ask it to:

The idea behind it? Go through the keys of the arrays returned by str_word_count and associate the number of each word with its character position in the phrase. Then use substr to return everything up until the nth character. I have tested this function on rather large entries and it seems to be efficient enough that it does not bog down at all.
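
The function is not shown, but the idea described above can be sketched like this. The name first_words and the choice to cut right after the last wanted word are assumptions.

```php
<?php
/**
 * Hypothetical helper: returns the first $limit words of $text,
 * keeping all whitespace and markup between them untouched.
 */
function first_words(string $text, int $limit): string
{
    if ($limit <= 0) {
        return '';
    }

    // Format 2 maps the byte offset of each word to the word itself.
    $words = str_word_count($text, 2);
    if (count($words) <= $limit) {
        return $text;
    }

    // Find where the last wanted word ends and cut the original
    // string there, instead of imploding the word list.
    $offsets    = array_keys($words);
    $lastOffset = $offsets[$limit - 1];
    $end        = $lastOffset + strlen($words[$lastOffset]);

    return substr($text, 0, $end);
}
```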

Never use this function to count or separate alphanumeric words: it splits them apart at the digits. Use preg_split() instead when splitting alphanumeric words. It works with Chinese characters as well.
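
A short sketch of the difference, assuming a preg_split() pattern that treats Unicode letters and numbers as word characters (the pattern and the function name are illustrative, not taken from the note):

```php
<?php
/**
 * Hypothetical helper: splits text into alphanumeric tokens.
 */
function split_alnum_words(string $text): array
{
    // Anything that is not a letter (\p{L}) or number (\p{N}) separates words.
    $parts = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

// str_word_count() breaks "abc123" at the digits and drops them.
print_r(str_word_count('abc123 x9y', 1));

// The preg_split() version keeps the alphanumeric tokens whole.
print_r(split_alnum_words('abc123 x9y'));
```

Chinese text written without spaces comes back as one token per run of characters, since nothing in it matches the separator pattern.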

Some ask why not just split on ' '. Well, simply exploding on ' ' isn't fully accurate. Words can be separated by tabs, newlines, double spaces, etc. This is why people tend to separate on all whitespace with regular expressions.
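
A small comparison makes the point concrete. The sample string is made up for illustration.

```php
<?php
$text = "one  two\tthree\nfour";

// explode() only knows about the single space: the double space yields
// an empty element, and the tab and newline are not treated as separators.
$bySpace = explode(' ', $text);

// Splitting on runs of any whitespace gives the four words.
$byWhitespace = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($bySpace), count($byWhitespace));
```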

Personally, I don't like using this function because the characters it omits are sometimes necessary. For instance, MS Word counts ">" or "<" alone as a single word, whereas this function doesn't. I like using this instead; it counts EVERYTHING:

Here is code for a str_word_count() function compatible with UTF-8. I'm sorry that the comments are in French, because I am not very good at English. Anyway, these comments only try to explain things that are in the PCRE or Unicode documentation.
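
The original code is not shown. A minimal sketch of a UTF-8-aware variant that supports the same three formats, assuming preg_match_all() with PREG_OFFSET_CAPTURE, could look like this (the function name and the exact word pattern are assumptions):

```php
<?php
/**
 * Hypothetical helper: a UTF-8-aware take on str_word_count().
 * Format 0 returns the count, 1 a list of words, 2 byte offset => word.
 *
 * @return int|array
 */
function utf8_str_word_count(string $text, int $format = 0)
{
    // A word starts with a letter and may continue with letters,
    // apostrophes or hyphens. The u modifier enables UTF-8 matching.
    $pattern = '/\p{L}[\p{L}\'\-]*/u';
    $found = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($found === false) {
        // PCRE error, for example invalid UTF-8 in $text.
        return $format === 0 ? 0 : [];
    }
    if ($format === 0) {
        return $found;
    }

    $result = [];
    foreach ($matches[0] as [$word, $offset]) {
        if ($format === 2) {
            $result[$offset] = $word; // offsets are in bytes, not characters
        } else {
            $result[] = $word;
        }
    }

    return $result;
}
```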