For some reason I don't see the original email, so I'm going to guess based on Marco's response below.

The code below is nearly correct, assuming that the starting point was that each UTF-8 byte was converted into a single char in the String. That is, if the String contained the sequence U+00E8 U+00AA U+009E..., the code would be:

byte[] byt = myString.getBytes("ISO8859_1"); // get the original UTF-8 bytes back
String ucs2 = new String(byt, "UTF-8"); // turn them into a real UCS-2 string
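As a self-contained illustration of the two lines above, here is a minimal sketch. The class and method names (Utf8Recovery, recover) are made up for the example, and it uses the StandardCharsets constants, which assumes Java 7 or later; on an older JVM the string names "ISO8859_1" and "UTF-8" do the same job.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original poster's code.
final class Utf8Recovery {
    // ISO-8859-1 maps chars U+0000..U+00FF to bytes 00..FF one to one,
    // so getBytes() hands back the original UTF-8 bytes unchanged.
    // Decoding those bytes as UTF-8 then yields the intended text.
    static String recover(String misdecoded) {
        byte[] utf8 = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One char per UTF-8 byte: E8 AA 9E E5 90 B9
        String damaged = "\u00E8\u00AA\u009E\u00E5\u0090\u00B9";
        String fixed = recover(damaged);
        for (char c : fixed.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+8A9E, then U+5439
        }
    }
}
```

This only works if no byte was lost on the way in. Once a byte such as 90 has already been replaced by 3F, the original character cannot be recovered this way.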

It is very important to name the encoding in the String constructor; otherwise the constructor assumes the JVM's file.encoding (most of the time).

There is an annoying bug/feature in some JVMs on real Asian Windows (including 2K and XP) in which the file.encoding is ignored in favor of the actual System Active code page (SYS_ACP), and setting -Dfile.encoding="someEncoding" doesn't change the String constructor's default behavior. You have to be careful to always name the encoding, not just rely on the system to provide it.
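To make the difference concrete, here is a rough sketch that decodes the same three bytes once with the platform default and once with a named encoding. The class name ExplicitCharset and the method decodeUtf8 are invented for the example, and Charset.defaultCharset() assumes Java 5 or later. What the first two lines print depends entirely on the machine, which is exactly the point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the original thread.
final class ExplicitCharset {
    // Always names the encoding, so the result is the same on every platform.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE8, (byte) 0xAA, (byte) 0x9E};

        // Whatever the JVM decided to use as its default on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        // No encoding named: the result varies with the platform default.
        String viaDefault = new String(bytes);
        String viaUtf8 = decodeUtf8(bytes);

        System.out.println("default decode matches UTF-8 decode: "
                + viaDefault.equals(viaUtf8));
        System.out.println("explicit decode length: " + viaUtf8.length());
    }
}
```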

If your original byte[] is in a real CJK encoding, then you need to name that encoding instead of UTF-8 above (you can get it from the file.encoding system property if you are running on the same platform).
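A minimal sketch of reading that property and passing it on explicitly might look like the following. The names PlatformEncoding and decodeWithPlatformEncoding are hypothetical, and the fallback to the default charset's name is my own assumption for the case where the property is not set. Note that on recent JVMs file.encoding may default to UTF-8 regardless of the platform's code page, so treat this as an illustration rather than a guaranteed way to find the native CJK encoding.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not from the original thread.
final class PlatformEncoding {
    // Reads the file.encoding system property and names it explicitly
    // when decoding, instead of relying on the no-argument default.
    static String decodeWithPlatformEncoding(byte[] bytes) {
        // Assumption: fall back to the default charset's name if unset.
        String enc = System.getProperty("file.encoding",
                Charset.defaultCharset().name());
        return new String(bytes, Charset.forName(enc));
    }

    public static void main(String[] args) {
        System.out.println("file.encoding = "
                + System.getProperty("file.encoding"));
        byte[] ascii = {0x61, 0x62, 0x63};
        System.out.println(decodeWithPlatformEncoding(ascii));
    }
}
```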

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Marco Cimarosti
> Sent: Thursday, September 12, 2002 4:51 AM
> To: 'pr1@club-internet.fr'; unicode@unicode.org
> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
>
> Philippe de Rochambeau wrote:
> > On the other hand, if I store the previous "go" character
> > plus an unusual
> > CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
> > in the DB and retrieve the data, JRun 3.1 will only display the first
> > character in my form's textarea, plus a few invisible
> > characters, and the
> > database will contain the following hex values:
> >
> > E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> >
> > As you can see, "go" is still there, but the following
> > character (E5 3F B9)
> > is not \u5439 (E5 90 B9). I cannot figure out how to fix this problem.
> >
> > Any help with this problem would be much appreciated.
>
> I see what the problem is. As usual, it's all the fault of Bill Gate$. :-)
>
> If you interpret <E5, 90, B9> according to Windows-1252, you see that E5 is
> "å", B9 is "¹", but 90 is an unassigned slot! Unassigned characters are
> normally turned into question marks, and "?"'s code is (guess what) 3F...
>
> <E8, AA, 9E> works only by chance, because all three bytes are valid
> Windows-1252 characters: "è", "ª", and "ž", respectively.
>
> I guess that the problem starts when you try to fool the system into
> thinking that the text is ISO 8859-1:
>
> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
> String tempUtf16 = new String( byt );
>
> But, sorry. I can't help with a fix, because I don't know Java API's well
> enough.
>
> Can't you do something like <.getBytes("UTF-8")>? Or, even better, doesn't
> (newQfLibelleArray[i]) have a method to return a <String> object directly?
>
> _ Marco