Siehe auch

User Contributed Notes 32 notes

For my last project I needed to convert several CSV files from Windows-1250 to UTF-8, and after several days of searching around I found a function that is partially solved my problem, but it still has not transformed all the characters. So I made ​​this:

many people below talk about using
<?php
mb_convert_encode($s,'HTML-ENTITIES','UTF-8');
?>
to convert non-ascii code into html-readable stuff. Due to my webserver being out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would convert it to nasty latin1 automatically and not leave it in it's beautiful UTF-8 glory.

So [insert korean characters here] turned into ?????.

I found myself needing to pass by reference (which of course is deprecated/nonexistent in recent versions of PHP)
so instead of
<?php
mb_convert_encode(&$s,'HTML-ENTITIES','UTF-8');
?>
which worked perfectly until I upgraded, so I had to use
<?php
call_user_func_array('mb_convert_encoding', array(&$s,'HTML-ENTITIES','UTF-8'));
?>

rodrigo at bb2 dot co dot jp wrote that inconv works better than mb_convert_encoding, I find that when converting from uft8 to shift_jis $conv_str = mb_convert_encoding($str,$toCS,$fromCS); works while$conv_str = iconv($fromCS,$toCS.'//IGNORE',$str); removes tildes from $str.

To add to the Flash conversion comment below, here's how I convert back from what I've stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:

When converting Japanese strings to ISO-2022-JP or JIS on PHP >= 5.2.1, you can use "ISO-2022-JP-MS" instead of them.Kishu-Izon (platform dependent) characters are converted correctly with the encoding, as same as with eucJP-win or with SJIS-win.

When using the Windows Notepad text editor, it is important to note that when you select 'Save As' there is an Encoding selection dropdown. The default encoding is set to ANSI, with the other two options being Unicode and UTF-8. Since most text on the web is in UTF-8 format it could prove vital to save the .txt file with this encoding, since this function does not work on ANSI-encoded text.

mb_substr and probably several other functions works faster in ucs-2 than in utf-8. and utf-16 works slower than utf-8. here is test, ucs-2 is near 50 times faster than utf-8, and utf-16 is near 6 times slower than utf-8 here:<?phpheader('Content-Type: text/html; charset=utf-8');mb_internal_encoding('utf-8');

If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.

even if you explicitly specify the character encoding of a page as iso-8859-1(via headers and strict xml defs), windows 2000 will ignore that and interpret it as whatever character set it has natively installed.

for example, i wrote char #128 into a page, with char encoding iso-8859-1, and it displayed in internet explorer (& mozilla) as a euro symbol.

it should have displayed a box, denoting that char #128 is undefined in iso-8859-1. The problem was it was displaying in "Windows: western europe" (my native character set).

this led to confusion when i tried to convert this euro to UTF-8 via mb_convert_encoding()

IE displays UTF-8 correctly- and because PHP correctly converted #128 into a box in UTF-8, IE would show a box.

so all i saw was mb_convert_encoding() converting a euro symbol into a box. It took me a long time to figure out what was going on.

==
[EDIT BY danbrown AT php DOT net: Author provided the following update on 27-FEB-2012.]
==

An addendum to my &quot;to7bit&quot; function referenced below in the notes.
The function is supposed to solve the problem that some languages require a different 7bit rendering of special (umlauted) characters for sorting or other applications. For example, the German &szlig; ligature is usually written &quot;ss&quot; in 7bit context. Dutch &yuml; is typically rendered &quot;ij&quot; (not &quot;y&quot;).

If mb_convert_encoding doesn't work for you, and iconv gives you a headache, you might be interested in this free class I found. It can convert almost any charset to almost any other charset. I think it's wonderful and I wish I had found it earlier. It would have saved me tons of headache.

If you don't specify a source encoding, then it assumes the internal (default) encoding. ñ is a multi-byte character whose bytes in your configuration default (often iso-8859-1) would actually mean Ã±. mb_convert_encoding() is upgrading those characters to their multi-byte equivalents within UTF-8.

Try this instead:<?phpprint mb_convert_encoding( "ñ", "UTF-8", "UTF-8" );?>Of course this function does no work (for the most part - it can actually be used to strip characters which are not valid for UTF-8).