Multibyte String Functions

References

Multibyte character encoding schemes and their related issues are
fairly complicated, and are beyond the scope of this documentation.
Please refer to the following URLs and other resources for further
information regarding these topics.

User Contributed Notes 31 notes

Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php
$string = '漢字はユニコード';
$needle = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード
?>

The usual problem is that the string is evaluated as a binary string, meaning PHP is not aware of encodings at all. Problems arise if you are getting a value "from outside" somewhere (database, POST request) and the encoding of the needle and the haystack is not the same. That typically means the source code is not saved in the same encoding as the data you are receiving "from outside". Therefore the binary representations don't match and nothing happens.
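
A minimal sketch of the mismatch described above, assuming the haystack arrives as ISO-8859-1 while the needle is UTF-8. The strings are written as explicit byte escapes so the example does not depend on how the source file is saved; the variable names are only illustrative.

```php
<?php
// Haystack arrives "from outside" as ISO-8859-1; the needle is UTF-8.
$haystack = "caf\xE9";   // 'café' in ISO-8859-1
$needle   = "\xC3\xA9";  // 'é' in UTF-8

// The byte sequences differ, so nothing is replaced.
$unchanged = str_replace($needle, 'e', $haystack);

// Converting the haystack to the needle's encoding first makes them match.
$utf8  = mb_convert_encoding($haystack, 'UTF-8', 'ISO-8859-1');
$fixed = str_replace($needle, 'e', $utf8); // 'cafe'
```

Once both sides share one encoding, plain str_replace() is enough because it only compares bytes.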

PHP can input and output Unicode, but a little differently from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put a BOM in front of the UTF-16LE string.
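
The fix described above could look roughly like this. The function name is hypothetical (it is not part of mbstring), and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to UTF-16LE and prepend the
// little-endian BOM (FF FE).
function utf8_to_utf16le_with_bom($utf8)
{
    $bom = chr(255) . chr(254);
    return $bom . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$out = utf8_to_utf16le_with_bom('A');
// bin2hex($out) is 'fffe4100': the BOM followed by 'A' as 41 00.
```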

For those who are looking for mb_str_replace, here's a simple function:

<?php
// Note: mb_split() treats $needle as a regular expression pattern.
function mb_str_replace($needle, $replacement, $haystack)
{
    return implode($replacement, mb_split($needle, $haystack));
}
?>

I haven't found a simpler way to proceed :-)

Here's a cheap and cheeky function to remove leading and trailing *punctuation* (or more specifically "non-word characters") from a UTF-8 string in whatever language. (At least it works well enough for Japanese and English.)

/**
 * Trim singlebyte and multibyte punctuation from the start and end of a string
 *
 * @author Daniel Rhodes
 * @note we want the first non-word grabbing to be greedy but then
 * @note we want the dot-star grabbing (before the last non-word grabbing)
 * @note to be ungreedy
 *
 * @param string $string input string in UTF-8
 * @return string as $string but with leading and trailing punctuation removed
 */
function mb_punctuation_trim($string)
{
    preg_match('/^[^\w]{0,}(.*?)[^\w]{0,}$/iu', $string, $matches); //case-'i'nsensitive and 'u'ngreedy

    // Return the captured middle part, or the input unchanged if nothing matched.
    return isset($matches[1]) ? $matches[1] : $string;
}

Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
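
To illustrate why random access is not possible, here is a small sketch contrasting byte offsets with character offsets. The string is assumed to be UTF-8 and is written with byte escapes; the variable names are illustrative only.

```php
<?php
$s = "a\xE3\x81\xAFb"; // 'aはb' in UTF-8: 3 characters, 5 bytes

// Byte indexing jumps straight to an offset but splits the multibyte character.
$byte = $s[1];                              // "\xE3", the first byte of 'は'

// Character indexing has to walk the string from the start to count characters.
$char = mb_substr($s, 1, 1, 'UTF-8');       // 'は'

// Offsets differ too: strpos() counts bytes, mb_strpos() counts characters.
$bytePos = strpos($s, 'b');                 // 4
$charPos = mb_strpos($s, 'b', 0, 'UTF-8');  // 2
```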

Please note that when migrating code to handle UTF-8 encoding, not only the functions mentioned here are useful, but also the function htmlentities() has to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there could be some more functions that need adjustments like this.
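
A short sketch of the difference the third argument makes, assuming the input is UTF-8 (written here with byte escapes):

```php
<?php
$var = "caf\xC3\xA9 <b>"; // 'café <b>' in UTF-8

// With the correct charset, the two-byte 'é' becomes a single entity.
$right = htmlentities($var, ENT_COMPAT, 'UTF-8');      // 'caf&eacute; &lt;b&gt;'

// With a single-byte charset, each byte of 'é' is converted on its own.
$wrong = htmlentities($var, ENT_COMPAT, 'ISO-8859-1'); // 'caf&Atilde;&copy; &lt;b&gt;'
```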

PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.

<?php
/**
* Trim characters from either (or both) ends of a string in a way that is
* multibyte-friendly.
*
* Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
* the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
* course, the added bonus that you can put unicode characters in the charlist.
*
* We are using a PCRE character-class to do the trimming in a unicode-aware
* way, so we must escape ^, \, - and ] which have special meanings here.
* As you would expect, a single \ in the charlist is interpreted as
* "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
* you can ignore this detail.
*
* As a bonus, however, we also allow PCRE special character-classes (such as '\s')
* because they can be extremely useful when dealing with UCS. '\pZ', for example,
* matches every 'separator' character defined in Unicode, including non-breaking
* and zero-width spaces.
*
* It doesn't make sense to have two or more of the same character in a character
* class, therefore we interpret a double \ in the character list to mean a
* single \ in the regex, allowing you to safely mix normal characters with PCRE
* special classes.
*
* *Be careful* when using this bonus feature, as PHP also interprets backslashes
* as escape characters before they are even seen by the regex. Therefore, to
* specify '\\s' in the regex (which will be converted to the special character
* class '\s' for trimming), you will usually have to put *4* backslashes in the
* PHP code - as you can see from the default value of $charlist.
*
* @param string
* @param charlist list of characters to remove from the ends of this string.
* @param boolean trim the left?
* @param boolean trim the right?
* @return String
*/
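function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
{
    $both_ends = $ltrim && $rtrim;

    // The rest of the body is reconstructed from the description above.
    // strtr() prefers the longest key, so a doubled backslash is handled
    // before a single one.
    $char_class_inner = strtr($charlist, array(
        '\\\\' => '\\',    // double \ in charlist: single \ in the regex
        '\\'   => '\\\\',  // single \ in charlist: a literal backslash
        '^'    => '\\^',
        '-'    => '\\-',
        ']'    => '\\]',
        '/'    => '\\/',   // the pattern delimiter used below
    ));

    $work_horse = '[' . $char_class_inner . ']+';
    $left_pattern = '^' . $work_horse;
    $right_pattern = $work_horse . '$';

    if ($both_ends) {
        $pattern_middle = $left_pattern . '|' . $right_pattern;
    } elseif ($ltrim) {
        $pattern_middle = $left_pattern;
    } elseif ($rtrim) {
        $pattern_middle = $right_pattern;
    } else {
        return $string; // nothing to trim
    }

    return preg_replace('/' . $pattern_middle . '/usD', '', $string);
}
?>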

A brief note on Daniel Rhodes' mb_punctuation_trim(). The regular expression modifier u does not mean ungreedy, rather it means the pattern is in UTF-8 encoding. Instead the U modifier should be used to get ungreedy behavior. (I have not otherwise tested his code.) See http://php.net/manual/en/reference.pcre.pattern.modifiers.php

The opposite of what Eugene Murai wrote in a previous comment is true when importing/uploading a file. For instance, if you export an Excel spreadsheet using the Save As Unicode Text option, you can use the following to convert it to UTF-8 after uploading:
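
A possible conversion step, sketched under the assumption that the uploaded file is little-endian UTF-16 with a leading BOM. The helper name is hypothetical, and reading the uploaded file (for example with file_get_contents()) is left out.

```php
<?php
// Hypothetical helper: convert the contents of a "Unicode Text" export
// (assumed to be UTF-16LE, optionally starting with the FF FE BOM) to UTF-8.
function unicode_text_to_utf8($contents)
{
    // Strip the little-endian BOM if it is present.
    if (substr($contents, 0, 2) === "\xFF\xFE") {
        $contents = substr($contents, 2);
    }
    return mb_convert_encoding($contents, 'UTF-8', 'UTF-16LE');
}

// "A<TAB>B" encoded as UTF-16LE with BOM.
$utf8 = unicode_text_to_utf8("\xFF\xFE" . "A\x00\x09\x00B\x00");
```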

However, Excel on Mac OS X then doesn't identify columns properly and it puts each whole row in its own cell. In order to fix that, use the TAB ("\t") character as the CSV delimiter rather than comma or colon.

You may also want to use an HTTP encoding header, such as
header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
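
Put together, a tab-delimited export might be built like this. It is only a sketch: the helper name and sample rows are made up, the fields are assumed to be UTF-8 and free of TAB characters, and the header() call is left commented out because it must run before any output in a real web request.

```php
<?php
// Hypothetical helper: join UTF-8 fields with TAB and encode the row as UTF-16LE.
function excel_tab_row(array $fields)
{
    return mb_convert_encoding(implode("\t", $fields) . "\r\n", 'UTF-16LE', 'UTF-8');
}

$rows = array(array('id', 'name'), array('1', "caf\xC3\xA9"));

$body = "\xFF\xFE"; // BOM once, at the very start of the file
foreach ($rows as $row) {
    $body .= excel_tab_row($row);
}

// In a web request:
// header('Content-type: application/vnd.ms-excel; charset=UTF-16LE');
// echo $body;
```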

In reply to Peter Albertsson: once you have the size of your data in octets, you can access each octet with something like $string[0] ... $string[$size-1], since the [] operator is not aware of multibyte strings.

The function trim() has not failed me so far in my multibyte applications, but in case one needs a truly multibyte function, here it is. The nice thing is that the character to remove can be whitespace or any other specified character, even a multibyte character.

The problem occurs when a file is filled with a string using fwrite in the following manner:

$len = strlen($data);
fwrite($fp, $data, $len);

fwrite() takes the number of bytes as its third parameter, but if strlen() is overloaded by mb_strlen() (mbstring.func_overload), it returns the number of characters in the string. Since multibyte characters can each be more than one byte long, the last characters of $data never get written to the file.
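
The truncation can be reproduced without overloading by passing a character count to fwrite() directly. This sketch assumes UTF-8 data and uses an in-memory stream in place of a real file; mb_strlen() with the '8bit' encoding is one way to get a byte count regardless of overloading.

```php
<?php
$data = "\xE6\xBC\xA2\xE5\xAD\x97"; // '漢字' in UTF-8: 2 characters, 6 bytes

$chars = mb_strlen($data, 'UTF-8'); // 2
$bytes = mb_strlen($data, '8bit');  // 6, a byte count even if strlen() is overloaded

$fp = fopen('php://memory', 'w+');
fwrite($fp, $data, $chars);         // character count used as a byte count: truncated
fwrite($fp, "\n");
fwrite($fp, $data, $bytes);         // byte count: the whole string is written
rewind($fp);
$written = stream_get_contents($fp);
fclose($fp);
```

The first write stores only the first two bytes of the first character; the second write stores all six bytes.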

After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found.

I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:

Since not all hosted services currently support the multi-byte function set, it may still be necessary to process Unicode strings using standard single byte functions. The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html - shows by example how to do this. While this only covers UTF-8, the standard PHP function "iconv" allows conversion into and out of UTF-8 if strings need to be input or output in other encodings.

A friend has pointed out that the entry "mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page appears to be wrong: above Example 4 it says that "There is no way to control HTTP input character conversion from PHP script. To disable HTTP input character conversion, it has to be done in php.ini". Also the table shows the old-PHP-version defaults:

;; Disable HTTP Input conversion
mbstring.http_input = pass

*BUT* (for PHP 4.3.0 or higher)

;; Disable HTTP Input conversion
mbstring.encoding_translation = Off