Problem about MD5

Dear friends! I have a problem: I implemented the MD5 algorithm, and the program works correctly with the Latin alphabet, but when I give it text containing Cyrillic symbols, the result is incorrect. I checked it against an online version of MD5 and against Java's standard method (MessageDigest), and the results differ whenever there are Cyrillic symbols. How can I change Java's default character table? Can you advise me, please? :( Thanks a lot for every answer!!! :)

Well, look:
- getBytes() really does correctly return the bytes for the string in a given character set
- MessageDigest, if correctly used, really does correctly calculate the MD5 checksum
So possibilities include:
- the online checker is messing up the character encoding in some way (I'm guessing you don't have source code, so how will we ever know...?)
- you're not using MessageDigest correctly (you haven't posted any code)
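For reference, a minimal sketch of what "correctly used" could look like, assuming UTF-8 is the encoding you want to hash. The class and method names (`Md5Reference`, `md5Hex`) are made up for illustration, and the Cyrillic text is written with `\u` escapes so the result doesn't depend on the encoding of the source file:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Reference {
    /** Returns the MD5 digest of the text, encoded as UTF-8, as lowercase hex. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // The charset is named explicitly; getBytes() with no argument
            // would use the platform default, which varies between machines.
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        // RFC 1321 test vector: prints 900150983cd24fb0d6963f7d28e17f72
        System.out.println(md5Hex("abc"));
        // "Привет" written with Unicode escapes
        System.out.println(md5Hex("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

If your own implementation is fed the same byte array and still disagrees, the bug is in the MD5 code; if it agrees, the bug is in how the string was turned into bytes.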

Here it is

Sure, I won't ask you to check all of the code, but can you show me where my problem is? It works correctly with the Latin alphabet, but with Cyrillic the standard MD5 method and mine give different results. To tell the truth, I used an open-source MD5 implementation, but I studied how it works. First I implemented MD5 from the pseudocode on Wikipedia, and I thought that it works only on binary data. Then I found an open-source MD5 and studied it, but this problem is really killing me... :( HELP PLEASE!!!

neilcoffey, I can use the MessageDigest class, and I already have. First of all, to answer your question: my academic supervisor wants me to understand the MD5 algorithm. That's why I want to implement MD5 myself, and anyway, it is a very interesting algorithm. Thank you very much for the advice. If I have any more problems, can I ask you?

Even if you have to implement MD5 yourself as an exercise, I'd recommend you use the MessageDigest class to confirm the digest of the string as Java converts it (with either UTF-8 or UTF-16; the essential point is you need an encoding that doesn't "chop off" the top byte of characters that need more than one byte). Checking against the MessageDigest class eliminates the possibility that your on-line tool is simply converting the characters wrongly.
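To make the "chop off" point concrete, here is a small sketch. It assumes, purely as a guess since the poster's code isn't shown, that a hand-written MD5 might convert characters with a cast like `(byte) s.charAt(i)`. The names `TopByteDemo`, `truncate` and `hex` are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class TopByteDemo {
    /** Naive conversion: keeps only the low 8 bits of each UTF-16 char. */
    static byte[] truncate(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i); // the high byte is silently lost here
        }
        return out;
    }

    /** Formats bytes as space-separated uppercase hex, e.g. "D0 9F". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pe = "\u041F"; // Cyrillic capital letter Pe, code point 1055 (0x041F)
        System.out.println(hex(truncate(pe)));                              // 1F
        System.out.println(hex(pe.getBytes(StandardCharsets.UTF_8)));       // D0 9F
        System.out.println(hex(pe.getBytes(StandardCharsets.UTF_16BE)));    // 04 1F
        // ISO-8859-1 has no Cyrillic, so getBytes substitutes '?' (0x3F)
        System.out.println(hex(pe.getBytes(StandardCharsets.ISO_8859_1)));  // 3F
        // For plain Latin letters the cast and UTF-8 agree, which would
        // explain why Latin input appears to work.
        System.out.println(hex(truncate("A")) + " / "
                + hex("A".getBytes(StandardCharsets.UTF_8)));               // 41 / 41
    }
}
```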

Problem again

Sorry, it's me again... am I so dumb :( ? Neil Coffey, in Pascal the indices of the Cyrillic symbols begin at 128, and I thought that Pascal works with ASCII, but after looking at an ASCII table on the internet I wonder: what encoding does Pascal work with??? Can Java transliterate Cyrillic symbols? Or is it translit:

Pascal probably does work with ASCII. The problem is that there's no standard "ASCII" way of representing Cyrillic, or indeed any character at all beyond very basic Latin characters and a few punctuation symbols. Strictly speaking, ASCII defines only 128 characters (codes 0-127), and even a full byte allows only 256 values, so it's simply impossible to accommodate all the common alphabets used around the world within that number of characters! So what used to happen is that different countries/systems just used the upper half of the byte range (values 128-255), which ASCII leaves undefined, in some arbitrary way to accommodate the characters they needed, and various "standards" have grown up over the years. The one you mention, KOI8-R, is quite common for representing Cyrillic in Russia, I believe. But, for example, in Bulgaria, I understand that another encoding is more common. I'm not aware of a standard encoding that encodes Cyrillic characters starting at position 128, but evidently some systems use such a scheme.
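As an illustration of how much these single-byte schemes disagree, the sketch below asks Java for the byte that a few Cyrillic code pages assign to 'П'. The choice of charsets is mine: KOI8-R is the one named in the thread, while windows-1251 and IBM866 (the DOS code page, which places 'А' at 128) are added for comparison. Whether IBM866 is what the poster's Pascal uses is only a guess. `CyrillicCodePages` and `codeOf` are hypothetical names, and the code checks that each charset is installed, since only a handful of charsets are guaranteed on every JVM:

```java
import java.nio.charset.Charset;

final class CyrillicCodePages {
    /**
     * Returns the unsigned value (0-255) of the first byte that the named
     * charset produces for ch, or -1 if the charset is not installed.
     * A character the charset cannot represent comes back as '?' (63).
     */
    static int codeOf(char ch, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] bytes = String.valueOf(ch).getBytes(Charset.forName(charsetName));
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        char pe = '\u041F'; // Cyrillic capital letter Pe
        // Where these charsets are installed, the values are
        // KOI8-R: 240, windows-1251: 207, IBM866: 143.
        for (String name : new String[] {"KOI8-R", "windows-1251", "IBM866"}) {
            System.out.println(name + " -> " + codeOf(pe, name));
        }
    }
}
```

Three encodings, three different bytes for the same letter, so three different MD5 digests for the same text.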

I imagine that this encoding isn't tied to the Pascal language per se: Pascal knows nothing about character encoding, so it's just using whatever the computer it is running on happens to choose.

If you use Unicode, now pretty much an international standard, then you get round much of this confusion-- each character of each common alphabet is just given a standard code (though that still doesn't answer questions about how you use those characters, of course-- characters such as apostrophe vs quotes or full stops vs ellipses continue to be confused). But to accommodate the tens of thousands of possible characters used around the world, characters have to be represented by more than one byte.

MD5 doesn't care or know anything about character encoding. It just works on a series of bytes. Which character encoding you use to turn your characters into bytes is a completely arbitrary decision that you (and anybody else writing a similar program) must make. If for some reason you want to use the same coding that Pascal used on some computer (though I can't really think why you'd want to do this!), then you're at liberty to copy that-- though in this case, you may have to do a bit of the coding, as I don't think it's a standard character set.
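A short sketch of that point, using MessageDigest and two encodings picked only for illustration (`EncodingChangesDigest`, `md5` and `sameDigest` are made-up names). The same string hashed under two encodings gives the same digest only when the encodings happen to produce identical bytes, which is the case for plain Latin text but not for Cyrillic:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class EncodingChangesDigest {
    /** MD5 of the bytes that the given charset produces for the text. */
    static byte[] md5(String text, Charset charset) {
        try {
            return MessageDigest.getInstance("MD5").digest(text.getBytes(charset));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** True if the text hashes to the same digest under both charsets. */
    static boolean sameDigest(String text, Charset a, Charset b) {
        return Arrays.equals(md5(text, a), md5(text, b));
    }

    public static void main(String[] args) {
        String latin = "hello";
        String cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442"; // "привет"

        // Latin letters have identical bytes in UTF-8 and ISO-8859-1: prints true
        System.out.println(sameDigest(latin,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Cyrillic letters have different bytes in UTF-8 and UTF-16BE: prints false
        System.out.println(sameDigest(cyrillic,
                StandardCharsets.UTF_8, StandardCharsets.UTF_16BE));
    }
}
```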

If you have no particular reason to use anything else, I would strongly recommend Unicode encoded with UTF-8 (i.e. the character set you pass in to getBytes() is "UTF-8"). It's a very common international standard.

The code gives the same ASCII table that Pascal gives. You are absolutely right, "ASCII" is different on different platforms.
Now I have a question: how can I get the index of 'П' in s1? I wrote some code, but it works incorrectly, because the index of 'П' is 1055 in Unicode.
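One possible reading of this question, sketched under a loud assumption: the poster's `s1` isn't shown, so the code below uses a hypothetical stand-in that holds the 32 capital letters 'А' to 'Я' in Unicode order, and assumes the table starts at 128 as described for Pascal earlier in the thread. It separates two numbers that are easy to mix up: the position of a character inside a string (`indexOf`) and the character's Unicode code point (the cast to `int`):

```java
final class IndexVersusCode {
    /** Hypothetical stand-in for s1: the letters U+0410 ('А') to U+042F ('Я'). */
    static String buildAlphabet() {
        StringBuilder sb = new StringBuilder();
        for (char c = '\u0410'; c <= '\u042F'; c++) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Table code of ch, assuming the first letter of the table sits at 128; -1 if absent. */
    static int tableCode(String table, char ch) {
        int position = table.indexOf(ch);
        return position < 0 ? -1 : 128 + position;
    }

    public static void main(String[] args) {
        String s1 = buildAlphabet();
        char pe = '\u041F'; // 'П'
        System.out.println((int) pe);          // 1055, the Unicode code point
        System.out.println(s1.indexOf(pe));    // 15, the position inside s1
        System.out.println(tableCode(s1, pe)); // 143, under the starts-at-128 assumption
    }
}
```

If `s1` is laid out differently, the position will differ too; the point is only that 1055 is a property of the character, not of the string it sits in.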