A few doubts about encoding

Hi,

I have some understanding of encoding, and hence a few doubts about it. I did browse previous messages, but I would still appreciate your help in clarifying things (from the basics):

a) I assume every character has a Unicode value associated with it. Say the value for some character is \u0123; then that Unicode value is unique for the character, i.e. a character can have only one.

b) Though it is unique for the character, this Unicode value \u0123 can stand for another character as well, and to find out which character it stands for we use the encoding. E.g. UTF-8 says \u0123 is 'x', but Shift_JIS may use it to represent some Japanese character.

c) If my application receives, say, a kanji string, I assume that my Windows OS won't be able to understand it, so I convert it into bytes using my default encoding, say new String(request.getParameter(string).getBytes("ISO-8859-1"),"UTF8"); I assume this gets the bytes as per the Windows encoding and then creates my UTF-8 string again. But how can my OS store this now? Does it mean the OS can store kanji or anything else, but only in "a particular way", and we used new String(...) to achieve that?

e) I tried this using getBytes() without any parameter and it worked fine (as I read, the bytes are extracted using the default encoding). I found my default encoding using System.getProperty("file.encoding") and it says Cp1252, but when I used getBytes("Cp1252"), the kanji was stored as ??? (question marks). Where did it fail?

f) Lastly, what is meant by the JVM's default encoding? Is it the same as that of the OS?

Sorry if some or all of my queries are absurd, but I will be grateful if you can correct me.

Thanks
-Varun

P.S. Can anyone suggest a good site on this? I browsed a lot but got confused about Unicode and UTF. I think Unicode is a character representation that can be decoded using an encoding like UTF-8, but some call it an encoding. Further, is UTF "Unicode Transformation Format" or "Universal Character Set Transformation Format"?

a) I assume every character has a Unicode value associated with it. Say the value for some character is \u0123; then that Unicode value is unique for the character, i.e. a character can have only one.

TRUE

b) Though it is unique for the character, this Unicode value \u0123 can stand for another character as well, and to find out which character it stands for we use the encoding. E.g. UTF-8 says \u0123 is 'x', but Shift_JIS may use it to represent some Japanese character.

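Not quite. A Unicode value such as \u0123 always denotes the same character, whichever encoding is used. What differs between encodings is the byte sequence that represents the character: the same bytes can decode to one character under UTF-8 and to a different one under Shift_JIS.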
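One way to see the difference between a Unicode value and its encoded bytes is the minimal sketch below. It assumes a reasonably recent JDK (java.nio.charset.StandardCharsets); the class name CodePointVsBytes and the hex helper are made up for illustration. It encodes the single character \u0123 with two different charsets, then deliberately reads the UTF-8 bytes with the wrong charset.

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsBytes {
    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u0123"; // a single character, code point U+0123

        // The same character becomes different bytes under different charsets.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // C4 A3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 01 23

        // Reading the UTF-8 bytes as ISO-8859-1 yields two unrelated characters.
        String misread = new String(s.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(misread.length()); // 2
    }
}
```

The character itself never changes; only the bytes do, and the charset chosen when decoding determines which characters come back out.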

c) If my application receives, say, a kanji string, I assume that my Windows OS won't be able to understand it, so I convert it into bytes using my default encoding, say new String(request.getParameter(string).getBytes("ISO-8859-1"),"UTF8");

No, you can't use ISO-8859-1 to store Kanji characters. I assume the users of your application will be Japanese and have Japanese versions of Windows. In that case the Windows charset will be correctly selected, and they should be able to key in and read files with Kanji characters.

e) I tried this using getBytes() without any parameter and it worked fine (as I read, the bytes are extracted using the default encoding). I found my default encoding using System.getProperty("file.encoding") and it says Cp1252, but when I used getBytes("Cp1252"), the kanji was stored as ??? (question marks). Where did it fail?

The Cp1252 charset (also named windows-1252) is very similar to ISO-8859-1 and thus cannot encode Kanji characters, so they are all transformed into question marks. However, calling getBytes() without any parameter shouldn't work any better, as it picks up the default encoding of your VM (Cp1252), unless you have changed the file.encoding system property.
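The question-mark effect can be reproduced in a few lines. This is only a sketch: it uses ISO-8859-1 as a stand-in for Cp1252, because StandardCharsets guarantees it is present on every JVM and neither charset has a mapping for Kanji. The class name UnmappableDemo and the helper roundTripLatin1 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class UnmappableDemo {
    // Encodes the text as ISO-8859-1 and decodes the bytes again.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // two Kanji characters

        // The encoder reports that it has no mapping for these characters.
        System.out.println(
            StandardCharsets.ISO_8859_1.newEncoder().canEncode(kanji)); // false

        // getBytes silently substitutes '?' for every unmappable character.
        System.out.println(roundTripLatin1(kanji)); // ??

        // Characters inside the charset's repertoire survive unchanged.
        System.out.println(roundTripLatin1("caf\u00E9").equals("caf\u00E9")); // true
    }
}
```

Once the substitution has happened the original characters are gone; the bytes really are question marks.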

f) Lastly, what is meant by the JVM's default encoding? Is it the same as that of the OS?

I have seen Cp1252 on Windows and ISO-8859-1 on Linux, but I'm not sure it has anything to do with the way the OS is set up. It might be hardcoded in the JVM implementation for each OS.
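To check what a given JVM actually uses, a small sketch like the following may help. It prints the default charset together with the file.encoding property mentioned above; the class name DefaultCharsetDemo and the method describeDefault are hypothetical. As far as I know, common JDKs have historically derived this default from the OS locale at startup, and JDK 18 and later default to UTF-8, so the output depends on both the platform and the Java version.

```java
import java.nio.charset.Charset;

final class DefaultCharsetDemo {
    // Describes the charset that parameterless calls such as getBytes() will use.
    static String describeDefault() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // The output varies by OS, locale settings, and JDK version.
        System.out.println(describeDefault());
    }
}
```

Passing an explicit charset to getBytes and to the String constructor avoids depending on this default at all.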

I think Unicode is a character representation that can be decoded using an encoding like UTF-8, but some call it an encoding. Further, is UTF "Unicode Transformation Format" or "Universal Character Set Transformation Format"?

The term Unicode is sometimes used to refer to the family of charsets that encode the Unicode character set, and because Java internally uses the UTF-16 charset to store characters, many people say Unicode when they mean UTF-16. With UTF-16, most characters are encoded using 16 bits; characters outside the Basic Multilingual Plane take two 16-bit units. Hope this clarifies a bit.
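As a rough illustration of how UTF-16 storage shows up in Java strings, the sketch below compares the number of 16-bit units (String.length()) with the number of Unicode characters (String.codePointCount, available since Java 5). The class name Utf16Units and its two helpers are made up for this example.

```java
final class Utf16Units {
    // Number of 16-bit UTF-16 units the string occupies.
    static int units(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String kanji = "\u6F22";       // U+6F22, inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

        System.out.println(units(kanji) + " " + codePoints(kanji)); // 1 1
        System.out.println(units(emoji) + " " + codePoints(emoji)); // 2 1
    }
}
```

Kanji in everyday use fit in a single 16-bit unit, which is why the one-unit-per-character view works for the strings discussed in this thread.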

Varun Khanna
Ranch Hand

Joined: May 30, 2002
Posts: 1400

posted Dec 20, 2002 21:03:00

Thanks Benoît.

My code is deployed on a Unix machine. The end users currently aren't Japanese; they are Windows users (with non-kanji keyboards), and the application displays its pages to them in UTF-8 (using response.setContentType). So to test a request, I just copy and paste kanji characters into the input boxes (acting as the end user), and in the servlet new String(request.getParameter("SomeName").getBytes("8859_1"), "UTF8"); works perfectly.

But this line is still a grey area for me. I assume it is used to "undo" the improper default conversion that occurs in getParameter(): calling getBytes with 8859_1 supposedly converts the Unicode back into bytes, which are then re-interpreted correctly, in this example as UTF-8. But I read somewhere that if the incoming request data really is UTF-8, there may be octets whose values are in the invalid range for 8859-1 or Cp1252, and there may be loss.

Now every page of mine is sent in UTF-8 (I did that intentionally in all my servlets, which run on the Unix machine), and if I paste kanji, it gets stored and saved properly. Why and how?

a) Doesn't that mean that my request is UTF-8 encoded? Then why does the failure not occur?
b) If I change my browser encoding to something else, my data doesn't get stored properly, which means there is some UTF-8 dependency. BUT
c) when I call request.getCharacterEncoding() in the servlet, it always prints "null", i.e. it is not UTF-8; maybe that's why it is working fine.

These three points a, b and c contradict each other. Can anyone clear this up for me?

Thanks
-Varun.

[ December 20, 2002: Message edited by: varun Khanna ]
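A minimal sketch of why the 8859_1 round trip loses nothing, under the assumption that the container decoded the UTF-8 request bytes as ISO-8859-1. That is the servlet default when the request names no charset, which is also the case in which getCharacterEncoding() returns null. ISO-8859-1 in Java maps each of the 256 byte values to exactly one character, so getBytes("8859_1") gives back the original bytes unchanged. Cp1252 leaves a few byte values undefined, which is where the loss mentioned above could come from. The class name ParameterRepair and the helpers misdecode and repair are hypothetical; no servlet API is needed to run it.

```java
import java.nio.charset.StandardCharsets;

final class ParameterRepair {
    // Simulates a container that reads UTF-8 request bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // The idiom from the post: recover the raw bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] wire = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // two Kanji, three UTF-8 bytes each

        String garbled = misdecode(kanji);
        System.out.println(garbled.length());              // 6
        System.out.println(repair(garbled).equals(kanji)); // true
    }
}
```

If the container supports Servlet 2.3 or later, calling request.setCharacterEncoding("UTF-8") before the first getParameter() call is likely the cleaner option, since the parameters are then decoded correctly in the first place.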