That's not so easy. You'd have to know how String.hashCode() calculates its return value exactly. You can look this up, the source code of class String is available in src.zip which is in your JDK installation directory.

Note, however, that it is made in such a way that different strings have different hash codes as much as possible - so it will still not be easy to find two strings that result in the same hash code.

(In Java, the class is named String, not string - the capital S is important).

The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version. I suppose the obvious question is, why do you want to do this? There's probably a better way of doing what you want to do.

priyanaka jaiswal wrote: How to make two different string to have same hashcode?

Can you shed some light on your reason for wanting to do that? I'm always curious to learn why people want to go against the grain and set themselves up for potential problems. Kind of like, "Yes, you can read that book upside down but why would you want to do that?"

As it is implemented, it might be impossible for two String objects which are unequal according to equals() to have the same value for hashCode but the contract for hashCode() states that this is NOT a requirement. That is, the values returned by hashCode() do not have to be different if two objects are unequal according to equals but it helps improve performance of hashtables if they are.

So, a matching hash is given by making trivial changes. For example, increase str[2] by one, and decrease str[3] by 31

hash("abyz") = 2987778
hash("abz]") = 2987778

If the hash multiplier, 31, was larger than 65535, then there would be no values that generate the same hash.

In general, since the ascii codes range from 33 to 175, or 142 values, you could shift the next to last character by up to 3 places up or down, and adjust the last character to generate the same hash. If you use non ascii codes, or the full 65535 values, you could adjust the last four characters and obtain different strings that generate the same hash code.

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 20, 2013 07:58:22

0

Below is an example code to generate strings with matching hash codes.

priyanaka jaiswal wrote:How to make two different string to have same hashcode?

I guess the only thing left to resolve now is whether the OP meant to generate unequal strings that have the same hashCode or change the hashCodes of two unequal strings so that they become equal (without changing the strings).

That is, by "make" did the OP mean "to create" as in "I made a copy of the file" or "to cause" as in "The magician made the car disappear"?

Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 21, 2013 05:37:33

0

Since the hash value of a string is given by
h(i+1) = h(i)*31 + (int)str[i];

One approach to generating a second string with the same hash value is to modify the characters so that the same value comes up. If we limit the range of characters to the ASCII set, then this means that the second to last character can be changed by up to three values (i) and the last character changed by -i*31.

A second way that two different strings can generate the same hash is to use the range of int which is from -2,147,483,648 and 2,147,483,647 inclusive.. Since 31^7 = 27,512,614,111, this suggests that the hash value of a string with more than 6 characters would exceed the range of an int, causing folding, and providing a second methodology for matching the hash values of two different strings. If we add one to the maximum integer, we obtain the minimum integer. This suggests that the hash value is given by

So, that suggests that if we start with a string with more than six characters and shift the seventh character from the end by 4, and the sixth character from the end by 26 and the fourth character from the end by 19 and the third character from the end by 29 and the second character from the end by 24 and the last character by 3, we should obtain a different string with the same hash value. Actually, the last character has to be shifted by 4.

The code below generates both types of string modifications to obtain new strings with the same hash code. The integer folding could be repeated to create additional different strings. The integer folding approach was not checked for generating only ASCII characters, so the printed characters may be misleading.

Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.

Good point. I was lazy, and didn't check the API docs to see if the algorithm was specified: I just guessed (wrongly) that it probably wouldn't be.

(Realistically, I suppose it's unlikely anyone would bother changing the implementation even if it wasn't specified, since it seems to be "good enough" and has been around a long time).

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 21, 2013 05:55:23

0

It may be worth pointing out that since the hash code is an int, and there are 4294967295 values for int, all possible hash codes can be generated with a seven character string.

Matthew Brown wrote:The only reliable way is to make them both the same String. Even if you hacked it based on the hashCode source code, that could change in a future version.

Well, they did actually commit to the formula as part of the API (in JavaDoc); thus they can't really change the calculated hash code result at this point without violating the API. It's pretty reliable at this point. I agree with your other points though.

Good point. I was lazy, and didn't check the API docs to see if the algorithm was specified: I just guessed (wrongly) that it probably wouldn't be.

(Realistically, I suppose it's unlikely anyone would bother changing the implementation even if it wasn't specified, since it seems to be "good enough" and has been around a long time).

I seem to remember, back around Java 1.1 or 1.2, String's hashCode() only looked at the first 8 characters. I assume this was because there was some truth to the "Java is slow" mantra back then. This led to some really awful HashMap performance.

But that doesnt mean that each of those have a unique hash code - only the max amount that can be handled as primary hash values.

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 23, 2013 04:44:06

0

drac yang wrote:

John Vorwald wrote:It may be worth pointing out that since the hash code is an int, and there are 4294967295 values for int, all possible hash codes can be generated with a seven character string.

if it's only for alphabets, seven character string could represent totally 26^7 = 8031810176 which is bigger than all the int number 2^32 = 4294967296

it might be sufficient though, it's possible to find two different strings with the same hashcode.

The question refers to String hash codes, which are given by sum((int)(char[i]) * 31^(nChar-i)) with mod applied when the value exceeds Integer max value.

If we use the printable ascii char, the values range from 32 to 127

We see that 127 * 31^(6-1) = 3607273026, which is less than 4294967296. There needs to be a 7th char to get the required amount of hash values.

The number of hash codes doesn't depend on 26^nchar, but rather (int(char)*31^(nchar-1)). The 26^nchar is the number of possible arrangements of single case (lower or upper) words, but not the number unique hash values.

Indeed we have seen that some Strings return the same hash code, so you might require more than 7 letters to exhaust all the possible hash codes.
Sounds like some complicated maths to solve a non‑problem.

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3019

10

posted Apr 23, 2013 11:16:21

0

Campbell Ritchie wrote:Indeed we have seen that some Strings return the same hash code, so you might require more than 7 letters to exhaust all the possible hash codes.

No, you don't. The range covered is continuous across integers, with no gaps. Except as noted below, for a 6-character string.

Using chars in the range 32-127, here are the ranges of possible hash values:

Only the last two lines are affected by overflow. After overflow, the line for length 6 splits into ranges -2147483648 to -27554432 and 1073741824 to 2147483647. The line for length 7 covers all possible ints.

Campbell Ritchie wrote:Sounds like some complicated maths to solve a non-problem.

The math is simple enough - in each step we effectively just multiply the min and max by 32. If the problem isn't interesting to someone, they can always skip it. The original question was answered some time ago.

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 23, 2013 17:26:40

0

Mike,
That's an interesting response. I'm going to see if I can take it a little further, but include some code due to the details... It looks like 7 characters are needed to cover the full range.

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 41065

29

posted Apr 24, 2013 03:13:34

0

How do you know there are 2¹⁶ different Strings with 1 character? There are not 2¹⁶ different Unicode characters between 0 and \uffff. Unicode has gaps in, and some of the characters which exist are supplemental characters. So the possible range of hashcodes is all full of gaps. Eventually the ranges will overlap and fill those gaps, however.

Campbell Ritchie wrote:How do you know there are 2¹⁶ different Strings with 1 character? There are not 2¹⁶ different Unicode characters between 0 and \uffff. Unicode has gaps in,

Does that stop us from constructing a String out of nonexistent characters? I'm not being snarky. I honestly don't know, and can't be arsed to find out for myself.

A char can contain any of 2^16 possibilities. It's true that some of those don't have meanings in Unicode, but nevertheless they can be created in Java.

And a String can be made up of any list of chars. It's true that some of those may not even be valid in Unicode because they misuse surrogate pairs, but nevertheless they can be created in Java.

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 41065

29

posted Apr 24, 2013 09:43:34

0

I think Paul C is correct. You can create a String out of non‑Unicode characters, in which case there are 2¹⁶ possibilities per char, even if you cannot display such a String. I shall have to try it.

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3019

10

posted Apr 24, 2013 10:01:21

0

Yes, it's part of the Java definitions of char and String. There's no requirement that a char have any defined meaning in Unicode. Here's an example:

From the Unicode chart here, you can see that FF00 has no defined value, while FF01 is a full-width exclamation mark (!). When I try to print these, I get a "＀！" indicating the system doesn't know what to put for the first value. But it's still a valid String, with no exceptions thrown.

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3019

10

posted Apr 24, 2013 10:11:50

0

Paul Clapham wrote:Or to put it more simply:

[...]

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.

I think we all agree that there are multiple strings per hashcode - John and I are discussing whether there are any hashcodes that can not be achieved by a String of a given length, with a "printable ascii" restriction. I guess this actually excludes 127, since that's a delete char. Anyway we seem to agree that by the time you get to 7 chars, all possibilities are covered. Obviously, if more chars values are allowed, that goal can be achieved with fewer chars.

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 41065

29

posted Apr 24, 2013 14:14:53

0

Try it and see

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3019

10

posted Apr 24, 2013 14:38:51

0

Looks like another way to form a legal Java String out of characters that don't have meaning in Unicode (not when put together that way, at least). Which continues to support Paul's point about each char being one of 2^16 possibilities, regardless of their Unicode validity, no?

Mike Simmons wrote:John and I are discussing whether there are any hashcodes that can not be achieved by a String of a given length, with a "printable ascii" restriction.

Ah yes, that's a much less trivial problem. And therefore more interesting. I don't see any reason for restricting the contents to printable ASCII characters, though, although I suppose that makes the coding a bit simpler.

John Vorwald
Ranch Hand

Joined: Sep 26, 2010
Posts: 139

posted Apr 24, 2013 16:59:10

0

The original question was about different strings that had the same hash code. I sat in a facial recognition meeting last week, and part of the meeting dealt with patterns that were used to create binary scores to quickly locate the face and/or facial features in a digital photo. I wonder if that approach could be applied to the problem of finding different strings that have the same hash code. My simple understanding of the approach was to try a pattern on a number of test cases, and if the pattern had better than 51% chance of success (or failure in which the criteria was reversed), then the pattern could be used. Just a thought, the program isn't obvious to me, so maybe another day....

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 41065

29

posted Apr 25, 2013 02:41:21

0

Mike Simmons wrote:Looks like another way to form a legal Java String out of characters that don't have meaning in Unicode (not when put together that way, at least). . . .

Yes, I tried that in an attempt to confirm Paul C’s point and to see whether I was mistaken there.

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted Apr 25, 2013 07:49:08

0

Paul Clapham wrote:Or to put it more simply:

There is 1 string with 0 chars.

There are 2^16 strings with 1 char.

There are 2^32 strings with 2 chars.

There are 2^48 strings with 3 chars.

And so on... but since there are only 2^32 possible values of hashCode, that means that each possible hashCode belongs to (on average) 2^16 different strings of length 3.

since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the hashcodes in like the long type instead of int?

drac yang wrote: since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the hashcodes in like the long type instead of int?

It still wouldn't even be remotely sufficient for relatively short strings, by many orders of magnitude. And if you have an object with, say, two Strings and a Date (e.g. private String firstName ; private String lastName; private Date birthDate;), then what?

And hashCode is so commonly used and so firmly entrenched they're not going to break backward compatibility at this point.

But lets say we start fresh with a new Java. There's still little or no benefit to making hashCode a long. HashCode determines which bucket an object goes into. That means we can have up to 2^32 buckets. Currently, hash based data structures have to re-hash the 4-byte hashCode to a smaller number, because clearly we're not going to create an array of 4 billion entries for our buckets. When multi-terabyte RAM computers are common, and we're storing hundreds of billions of objects in HashMaps, then maybe an 8-byte hashCode would make sense.

I suppose there could even be specialized situations--"big data" type stuff--where it could be useful now. But those situations call for a different tool than Java, or at least writing your own hashing facility, not for breaking backward compatibility in today's Java.

drac yang wrote:since the total number of hashcode in int doesn't seem to be sufficient even for strings of 3 chars, should they consider to hold the
hashcodes in like the long type instead of int?

No. You're missing the point. Hashcodes are probablistic - that is, a good one is supposed to provide a good chance of being different if the underlying object is different.

An int has 4 billion possible values. Given that the chances of you ever dealing with more than a few thousand objects at a time is small, an int is perfectly adequate for most situations.

Remember: the idea of a hashcode is not to never return the same value for different objects; just to make it unlikely. As far as Java hashed collections are concerned, the rest of the algorithm is dealt with by equals() (and hence the tie-in between the two methods).

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted Apr 25, 2013 23:12:35

0

Winston Gutkowski wrote: No. You're missing the point. Hashcodes are probablistic - that is, a good one is supposed to provide a good chance of being different if the underlying object is different.

An int has 4 billion possible values. Given that the chances of you ever dealing with more than a few thousand objects at a time is small, an int is perfectly adequate for most situations.

Remember: the idea of a hashcode is not to never return the same value for different objects; just to make it unlikely. As far as Java hashed collections are concerned, the rest of the algorithm is dealt with by equals() (and hence the tie-in between the two methods).

Winston

yeah, i agree especially on for most situations, problem set scale will be far less than 4 billion.

only if when the problem set grows, say, close to or more than 4 billion, at that time,
for the sake of optimized loadfactor, people might begin to consider changing to long hashcode(if without the impact of backward compatibility like jeff said).

good reply, thanks.

drac yang
Ranch Hand

Joined: Apr 19, 2013
Posts: 61

posted Apr 25, 2013 23:23:13

0

Jeff Verdegan wrote: It still wouldn't even be remotely sufficient for relatively short strings, by many orders of magnitude. And if you have an object with, say, two Strings and a Date (e.g. private String firstName ; private String lastName; private Date birthDate;), then what?

And hashCode is so commonly used and so firmly entrenched they're not going to break backward compatibility at this point.

But lets say we start fresh with a new Java. There's still little or no benefit to making hashCode a long. HashCode determines which bucket an object goes into. That means we can have up to 2^32 buckets. Currently, hash based data structures have to re-hash the 4-byte hashCode to a smaller number, because clearly we're not going to create an array of 4 billion entries for our buckets. When multi-terabyte RAM computers are common, and we're storing hundreds of billions of objects in HashMaps, then maybe an 8-byte hashCode would make sense.

I suppose there could even be specialized situations--"big data" type stuff--where it could be useful now. But those situations call for a different tool than Java, or at least writing your own hashing facility, not for breaking backward compatibility in today's Java.

yeah, besides the backward compatibility issue(but why can't those legacy code use old jdk for maintenance?), another issue would be space waste, right？ for most situations problem set is way under 4 billion.
good reply, thanks.

You're missing the point. I might get Java 8 as soon as it's available, and build my app with it from day 1. If Java 9 changes hashCode() to a long, then I cannot use it without changing all my code--and all the code in every library I use. It's not just old stuff that breaks. It's anything from before the change.

another issue would be space waste, right？

Not really. It's only 4 more bytes, which is at least an order or magnitude less than a lot of the objects it will be generated from anyway, and memory is cheap, and it's not like we typically keep large numbers of hashcode values around anyway. There are several compelling reasons not to change to an 8-byte hashCode(). Saving 4 bytes is not one of them.