Why does the equals method in String not use hash? - java

The code of the equals method in class String is
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}
I have a question: why does this method not use hashCode()?
As far as I know, hashCode() can compare two strings rapidly.
UPDATE: I know that two unequal strings can have the same hash. But two equal strings always have equal hashes. So, by using hashCode(), we can immediately see that two strings with different hashes are unequal.
I'm simply thinking that hashCode() could be a good first filter in equals.
UPDATE 2: Here is some code to show what we are talking about.
It is an example of how the String equals method could look:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
if (hashCode() == anotherString.hashCode()){
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}else{
return false;
}
}
return false;
}

Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.
String hashcodes are lazily calculated, although they do use a "guard" value (a cached hash of 0 means "not computed yet"). If you're comparing strings with long lifetimes (i.e., they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or skipping the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be skipping the check more often than you'll be using it.
In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.

This question has actually been considered by the developers of the JDK. I could not find in the various messages why the change was not included. The enhancement is also listed in the bug database.
Namely, one of the proposed changes is:
public boolean equals(Object anObject) {
if (this == anObject) // 1st check identity
return true;
if (anObject instanceof String) { // 2nd check type
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) { // 3rd check lengths
if (n != 0) { // 4th avoid loading registers from members if length == 0
int h1 = hash, h2 = anotherString.hash;
if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
return false;
There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).

1) Calculating hashCode may not be faster than comparing the Strings directly.
2) if the hashCode is equal, the Strings may still not be equal
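Point 2 is easy to reproduce. As a small sketch (the class name HashCollisionDemo and the helper sameHash are made up for this illustration), the strings "Aa" and "BB" are different but share a hash code under String.hashCode():

```java
// Two different strings that share a hash code under String.hashCode().
final class HashCollisionDemo {
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = "Aa"; // 65 * 31 + 97 = 2112
        String b = "BB"; // 66 * 31 + 66 = 2112
        System.out.println(a.hashCode());    // 2112
        System.out.println(b.hashCode());    // 2112
        System.out.println(sameHash(a, b));  // true
        System.out.println(a.equals(b));     // false
    }
}
```

So an equal hash can never replace the character comparison; at best a different hash lets equals return early.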

This can be a good idea for many use cases.
However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.
I'm gonna guess that the majority of String.equals() calls are made inside a HashMap, after the hash codes are already known to be equal, so testing the hash codes again is pointless.
If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and that the char-by-char comparison already fails on the first char. So it would be a waste to check the hashes.

AFAIK, the following check could be added to String. It says that if both hash codes are already set and they are different, then the Strings cannot be equal.
if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
return false;
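To see that kind of check in a complete, compilable form, here is a minimal sketch on a hypothetical wrapper class (CachedHashText is not part of the JDK; it only mimics a lazily cached hash field where 0 means "not computed yet"). The hash is used as a fast reject only when both values are already cached:

```java
import java.util.Arrays;

// Hypothetical wrapper class, not the JDK's String: it mimics a lazily cached hash field.
final class CachedHashText {
    private final char[] value;
    private int hash; // 0 means "not computed yet"

    CachedHashText(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CachedHashText)) {
            return false;
        }
        CachedHashText that = (CachedHashText) other;
        if (value.length != that.value.length) {
            return false;
        }
        // Fast reject: only when both hashes are already cached and they differ.
        if (hash != 0 && that.hash != 0 && hash != that.hash) {
            return false;
        }
        // Hashes equal or unknown: fall back to the character comparison.
        return Arrays.equals(value, that.value);
    }
}
```

Note that the check never triggers a hash computation by itself, so it costs a few field reads when the hashes are missing and saves the loop only when both are present and different.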

The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.
While String appears immutable to the usual programmer, it does have a private field to store its hashcode once it is computed. However, this field is only filled in when the hashcode is first requested, as you can see from the String source code here:
private int hash;
public int hashCode() {
int h = hash;
if (h == 0) {
...
hash = h;
}
return h;
}
Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe the same instances of really long strings are compared with each other a great many times), it still may: profile.

As I think, hashCode() can make comparison of two strings quicker.
Arguments?
Arguments against this proposal:
More Operations
hashCode() in String has to access every character in the string and do two calculations (a multiplication and an addition) for every character.
So for a string with n characters we need about 5*n operations (load, multiplication, lookup/load, addition, store). Twice, because we compare two strings. (OK, one store and one load do not really count in a reasonable implementation.)
Hashing has no early exit, so for two strings of lengths m and n this always costs about 5*(m+n) operations, which is 10*x when x=m=n.
The current equals implementation needs 3 operations in the best case: 2 loads and 1 compare. The worst case is 3*x operations for two strings with x=m=n. The average is somewhere in between.
Even if we cache the hash, it is not clear that we save something
We have to analyze usage patterns. It could be that most of the time we only call equals once for a given string, not multiple times. Even if we call it multiple times, that might not be often enough for the caching to save time.
Not directly about speed, but still good counterarguments:
Counterintuitive
We do not expect a hash code in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowing about hashing) will wonder what is happening there.
Leads to bad examples/unexpected behavior
I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equals() against 'b'?" :)
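The difference in work can be made visible with a rough sketch that only counts how many characters each approach has to look at (ComparisonCount and both method names are made up for this illustration; it counts characters touched, not real CPU operations):

```java
final class ComparisonCount {
    // Number of character pairs examined before an early-exit comparison can decide.
    static int charsExamined(String a, String b) {
        if (a.length() != b.length()) {
            return 0; // the length check alone settles it
        }
        int examined = 0;
        for (int i = 0; i < a.length(); i++) {
            examined++;
            if (a.charAt(i) != b.charAt(i)) {
                break; // first difference found, stop here
            }
        }
        return examined;
    }

    // Characters an uncached hash of both strings has to read: always all of them.
    static int charsHashed(String a, String b) {
        return a.length() + b.length();
    }
}
```

For "abcdef" against "Xbcdef" the comparison looks at one character pair, while hashing both strings has to read all twelve characters.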

If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.
String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.
Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).

Related

Counting The Number of Valleys [duplicate]

I'm doing a problem on hackerRank and the problem is:
Problem Statement
Here we have to count the number of valleys that person XYZ visits.
A valley is a sequence of consecutive steps below sea level, starting with a step down from sea level and ending with a step up to sea level.
One step up is U, and one step down is D. We are given the number of steps that person XYZ traveled plus the ups and downs in the form of a string, e.g.
UUDDUDUDDUU
Sample Input
8
UDDDUDUU
Sample Output
1
Explanation
If we represent _ as sea level, a step up as /, and a step down as \, Gary's hike can be drawn as:
_/\      _
   \    /
    \/\/
He enters and leaves one valley.
The code I wrote doesn't work
static int countingValleys(int n, String s) {
int count = 0;
int level = 0;
String[] arr = s.split("");
for(int i = 0; i<n;i++){
if(arr[i] == "U"){
level++;
} else{
level--;
}
if(level==0 && arr[i]=="U"){
count++;
}
}
return count;
}
But another solution I found does, however no matter how I look at it the logic is the same as mine:
static int countingValleys(int n, String s) {
int v = 0; // # of valleys
int lvl = 0; // current level
for(char c : s.toCharArray()){
if(c == 'U') ++lvl;
if(c == 'D') --lvl;
// if we just came UP to sea level
if(lvl == 0 && c == 'U')
++v;
}
return v;
}
So what's the difference I'm missing here that causes mine to not work?
Thanks.
In Java, you need to do this to compare String values:
if("U".equals(arr[i])) {
And not this:
if(arr[i] == "U") {
The former compares the value "U" to the contents of arr[i].
The latter checks whether the strings reference the same content or more precisely the same instance of an object. You could think of this as do they refer to the same block of memory? The answer, in this case, is they do not.
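A short sketch of the difference (the class name ReferenceVsValue is made up; new String("U") is used here only to force a second object with the same content):

```java
final class ReferenceVsValue {
    public static void main(String[] args) {
        String literal = "U";
        String built = new String("U"); // forces a distinct String object

        System.out.println(built == literal);          // false: two different objects
        System.out.println(built.equals(literal));     // true: same characters
        System.out.println(built.intern() == literal); // true: intern() returns the pooled literal
    }
}
```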
To address the other aspect of your question.
Why this works:
for(char c : s.toCharArray()){
if(c == 'U') ++lvl;
if(c == 'D') --lvl;
when this does not:
String[] arr = s.split("");
for(int i = 0; i<n;i++){
if(arr[i] == "U"){
You state that the logic is the same. Hmmmm, maybe, but the data types are not.
In the first version, the string s is converted into an array of char values. These are primitive values (i.e. an array of values of a primitive data type), just like numbers are (ignoring autoboxing for a moment). Since char is a primitive type, the == operator compares the value held in c. Thus c == 'U' (or "is the primitive character value in c equal to the literal value 'U'?") results in true if c happens to contain the letter 'U'.
In the second version, the string s is split into an array of strings. This is an array of instances (or more precisely, an array of references to instances) of String objects. In this case the == operator compares the reference values (you might think of these as pointers to the two strings). The value of arr[i] (i.e. the reference to the string) is compared to the reference to the string literal "U" (or "D"). Thus arr[i] == "U" (or "is the reference value in arr[i] equal to the reference to the String instance for the literal "U"?") is false because these two strings are in different locations in memory.
As mentioned above, since they are different instances of String objects the == test is false (the fact that they just happen to contain the same value is irrelevant in Java because the == operator doesn't look at the content). Hence the need for the various equals, equalsIgnoreCase and some other methods associated with the String class that define exactly how you wish to "compare" the two string values. At risk of confusing you further, you could consider a "reference" or "pointer" to be a primitive data type, and thus, the behaviour of == is entirely consistent.
If this doesn't make sense, then think about it in terms of other object types. For example, consider a Person class which maybe has name, date of birth and zip/postcode attributes. If two instances of Person happen to have the same name, DOB and zip/postcode, does that mean that they are the same Person? Maybe, but it could also mean that they are two different people that just happen to have the same name, same date of birth and just happen to live in the same suburb. While unlikely, it definitely does happen.
FWIW, the behaviour of == in Java is the same behaviour as == in 'C'. For better or worse, right or wrong, this is the behaviour that the Java designers chose for == in Java.
It is worth noting that other languages, e.g. Scala, define the == operator for Strings (again rightly or wrongly, for better or worse) to compare the values of the strings. So, in theory, if you addressed other syntactic issues, your arr[i] == "U" test would work in Scala. It all boils down to understanding the rules that the various operators and methods implement.
Going back to the Person example, assume Person was defined as a case class in Scala. If we created two instances of Person with the same name, DOB and zip/postcode (e.g. p1 and p2), then p1 == p2 would be true (in Scala). To perform a reference comparison (i.e. are p1 and p2 instances of the same object), we would need to use p1.eq(p2) (which would result in false).
Hopefully the Scala reference does not create additional confusion. If it does, then simply remember that the function of an operator (or method) is defined by the designers of the language / library that you are using, and you need to understand what their rules are.
At the time Java was designed, C was prevalent, so it can be argued that replicating the C-like behaviour of == in Java was a sensible choice at that time. As time has moved on, more people have come to think that == should be a value comparison, and thus some languages have implemented it that way.
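Applied to the code from the question, the fix could look like the following sketch. The wrapper class Valleys is made up so that the method compiles on its own, and the code assumes Java 8 or later, where s.split("") yields one array element per character:

```java
final class Valleys {
    // Same loop as in the question; only the string comparisons changed from == to equals().
    static int countingValleys(int n, String s) {
        int count = 0;
        int level = 0;
        String[] arr = s.split(""); // assumes Java 8+: one element per character
        for (int i = 0; i < n; i++) {
            boolean up = "U".equals(arr[i]);
            level += up ? 1 : -1;
            if (level == 0 && up) {
                count++; // stepped up to sea level: one valley finished
            }
        }
        return count;
    }
}
```

With the sample input, countingValleys(8, "UDDDUDUU") returns 1.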

Java string comparison using bitwise xor

I came across the below code snippet in a product's code. It is using bitwise XOR for string comparison. Is this better than the String.equals(Object o) method? What is the author trying to achieve here?
private static boolean compareSecure(String a, String b)
{
if ((a == null) || (b == null)) {
return (a == null) && (b == null);
}
int len = a.length();
if (len != b.length()) {
return false;
}
if (len == 0) {
return true;
}
int bits = 0;
for (int i = 0; i < len; i++) {
bits |= a.charAt(i) ^ b.charAt(i);
}
return bits == 0;
}
For context, the strings being equated are authentication tokens.
This is a common implementation of a string comparison function that is designed to resist timing attacks.
In short, the idea is to compare all the characters every time you compare strings, even after you have found a pair that is not equal. A "standard" implementation just stops at the first difference and returns false.
This is not secure because it leaks information about the compared strings. Specifically, if the left-side string is a secret you want to keep (e.g. a password), and the right-side string is something provided by the user, an unsafe method allows an attacker to uncover your password with relative ease, by repeatedly trying out different strings and measuring the response time.
The more leading characters the two strings have in common, the longer the 'unsecure' function takes to compare them.
For instance, comparing "1234567890" and "0987654321" using a standard method would result in doing just a single comparison of the first character and returning false, since 1!=0. On the other hand comparing "1234567890" to "1098765432", would result in executing 2 comparison operations, because the first ones are equal, you have to compare the second ones to find they are different. This would take a bit more time and it is measurable, even when we are talking about remote calls.
If you do N attacks with N different strings, each starting with a different character, you should see one of the results taking a fraction of a millisecond more than the rest. This means its first character matches the secret, so the function has to take more time to compare the second one. Rinse and repeat for each position in the string and you can crack the secret orders of magnitude faster than brute force.
Preventing such an attack is the point of this implementation.
Edit: As diligently pointed out in a comment by Mark Rotteveel, this implementation is still vulnerable to a timing attack that is aimed at revealing the length of the string. Still, this is not a problem in many cases (either you don't care about the attacker knowing the length, or you deal with data that is standard and anyone can know the length anyway, for instance some kind of known-length hash).
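As an alternative to a hand-written loop, the JDK has MessageDigest.isEqual(byte[], byte[]), which is intended for comparing digests without stopping at the first mismatch. A minimal sketch, assuming the tokens are non-null and can be compared as their UTF-8 bytes (TokenCompare and sameToken are made-up names; check the Javadoc of your JDK version before relying on its timing behaviour):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class TokenCompare {
    // Assumption: tokens are non-null and can be compared as their UTF-8 bytes.
    static boolean sameToken(String expected, String provided) {
        byte[] a = expected.getBytes(StandardCharsets.UTF_8);
        byte[] b = provided.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(a, b);
    }
}
```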

Equals validation vs indexOf validation?

I need to check whether a String contains the char $ before replacing it.
I wrote two implementations for this purpose.
The first implementation always executes replace(char oldChar, char newChar) and uses equals(Object anObject) as the validation.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
String importStaticLine = importLine.replace('$', '.');
if (importLine.equals(importStaticLine)) {
return String.format("import %s;", importLine);
}
return String.format("import static %s;", importStaticLine);
}
This implementation parses the string two times with:
importLine.replace('$', '.')
importLine.equals(importStaticLine)
The second implementation uses indexOf(int ch) as validation and replace(char oldChar, char newChar) in the worst case.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
if (importLine.indexOf('$') == -1) {
return String.format("import %s;", importLine);
}
importLine = importLine.replace('$', '.');
return String.format("import static %s;", importLine);
}
The second implementation, in the worst case, parses the string two times with:
importLine.indexOf('$') == -1
importLine.replace('$', '.')
Is there some difference in terms of performance between the use of equals vs indexOf as validation?
What you are asking about is the difference in execution time between String.indexOf and String.equals. In Big-O terms these are the same, since both (worst case) will iterate through the entire String before returning.
In practice, it really depends on the input.
For instance:
equals will return pretty much immediately if the two strings compared have different lengths
equals will return sooner if the difference in the strings occurs early ("abcdef".equals("aXcdef") is faster than "abcdef".equals("abcdeX"))
indexOf('$') will be faster if $ occurs early in the string ("a$cdef".indexOf('$') is faster than "abcde$".indexOf('$'))
indexOf can be slower if the character searched for is a supplementary code point (above U+FFFF), since it then has to match a surrogate pair
On modern computers this should not matter, since they are so fast that any difference will be unnoticeable, unless the method is called hundreds of thousands of times (or with really large input strings). When optimizing code one should focus on saving seconds, not nanoseconds. With your current problem you should be a lot more worried about making your code readable and understandable to others than about which version uses the most CPU cycles.

String.equals() to change with bitwise AND on binary numbers

I am working in Java with binary strings (e.g. "00010010"; zeroes are added at the beginning when creating these binary strings for the purpose of my program). I have the function
private static boolean isJinSuperSets(String J, List<String> superSets) {
for (String superJ : superSets)
if (superJ.equals(J)) return true;
return false;
}
that checks if a binary string J is contained in the list of binary strings superSets.
I use equals() on the String object but I would like to speed up this code by converting binary strings to binary numbers and doing bitwise operation AND to see if they are equal.
Could you please give me a few tricks on how to accomplish that?
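Here for int:
int j = Integer.parseInt(J, 2); // parse J once, outside the loop
for (String superJ : superSets)
if (Integer.parseInt(superJ, 2) == j) return true; // compare primitive ints, not Integer references
return false;
}
You have to benchmark this to know whether it is actually faster (take care: the first runs are always slower, before the JIT has warmed up).
The best optimization if J is used more than once: parse J into an int once, keep it somewhere, and test against that.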
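Taking that idea one step further, the list itself can be parsed once as well. This is only a sketch under assumptions: the class name BinarySets and the method parseAll are made up, each binary string is assumed to have at most 31 significant bits, and leading zeros are treated as insignificant (so "00010010" and "10010" count as equal):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BinarySets {
    // Parse every binary string once; assumes each has at most 31 significant bits.
    static Set<Integer> parseAll(List<String> superSets) {
        Set<Integer> parsed = new HashSet<>();
        for (String s : superSets) {
            parsed.add(Integer.parseInt(s, 2));
        }
        return parsed;
    }

    // Membership test against the pre-parsed set instead of a loop over strings.
    static boolean isJinSuperSets(String j, Set<Integer> parsed) {
        return parsed.contains(Integer.parseInt(j, 2));
    }
}
```

Whether this beats the plain equals() loop depends on how often the same list is queried, so it is worth measuring before switching.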

Using an IF statement to control the return statement?

public static int seqSearch(int numRecords, String[] stuName,
double[] stuGpa, String nameKey, double gpaKey)
for(int i = 0; i < numRecords; i++)
if(stuName[i] == nameKey && stuGpa[i] == gpaKey)
return i;
return -1;
So, how would I use an if statement to control this? I'm doing a sequential search to find whether the name and the gpa are in the arrays, and if so it should return the position where they were found (i). But all it does is return -1 and print out that none were found.
You have two separate problems here:
You should be comparing strings using the equals() method (or one of its kin) - otherwise you are comparing whether two strings are the same reference (instance) rather than equivalent sequences of characters.
You should avoid comparing doubles using == as equality for doubles is more nuanced. Check out this paper for more information about why.
See this question about why using == for floating point comparison is a bad idea in Java.
Aside from that, I would also mention that your implementation makes the assumption that both stuName and stuGpa are arrays of the same length. This could easily not be the case ... and is probably something worth asserting before you begin iterating over the arrays.
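Putting these points together, a sketch could look like this. The class name StudentSearch and the tolerance EPSILON are assumptions made for the illustration, and numRecords is dropped in favour of the array length:

```java
final class StudentSearch {
    private static final double EPSILON = 1e-9; // hypothetical tolerance, tune for your data

    static int seqSearch(String[] stuName, double[] stuGpa, String nameKey, double gpaKey) {
        if (stuName.length != stuGpa.length) {
            throw new IllegalArgumentException("stuName and stuGpa must have the same length");
        }
        for (int i = 0; i < stuName.length; i++) {
            boolean sameName = nameKey.equals(stuName[i]);            // content, not reference
            boolean sameGpa = Math.abs(stuGpa[i] - gpaKey) < EPSILON; // tolerant double comparison
            if (sameName && sameGpa) {
                return i;
            }
        }
        return -1;
    }
}
```

If the GPA values are only ever copied around and never computed, a plain == on the doubles would also work; the tolerance matters once arithmetic is involved.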
Strings must be compared with .equals in Java, not ==.
if(stuName[i].equals (nameKey) && stuGpa[i] == gpaKey)
You probably want
if (stuName[i].equals(nameKey) && Double.compare(stuGpa[i], gpaKey) == 0)
if (stuName[i] == nameKey) is unlikely to be right: you are comparing object identities, not string content. Try if (stuName[i].equals(nameKey)).
You are comparing two Strings.
Strings are immutable.
Please use "equalsIgnoreCase()" or "equals()" to compare Strings
See examples here
http://www.java-samples.com/showtutorial.php?tutorialid=224
An essential problem is that
stuName[i] == nameKey
Is only comparing whether the objects are the same String Object in memory.
You actually want to use nameKey.equals(stuName[i]), to compare the actual string values.
And you might want to use .equalsIgnoreCase for case insensitivity.
The following is correct for the if statement. stuName[i] is a string so compare with .equals. stuGpa[i] is a double so use ==.
if(stuName[i].equals(nameKey) && stuGpa[i] == gpaKey)
Your problem is not the conditional if statement, but the comparison operator ==. == compares the pointer value of the object, whereas the .equals method returns something computed by the object.
Like everyone has said before, switch your == to .equals in this next line:
public static int seqSearch(int numRecords, String[] stuName,
double[] stuGpa, String nameKey, double gpaKey) {
for(int i = 0; i < numRecords; i++)
if(stuName[i].equals(nameKey) && stuGpa[i] == gpaKey)
return i;
return -1;
}
To actually answer the question about the control of the if statement...
I believe what you're doing is fine with the multiple return statements, BUT...
I personally prefer one entry point and only one exit point for my methods. It always feels dirty to me having multiple exit points.
So, I would consider the following code instead:
public static int seqSearch(int numRecords, String[] stuName, double[] stuGpa, String nameKey, double gpaKey) {
int value = -1;
for(int i = 0; i < numRecords; i++) { // Don't forget your braces, they aren't required, but wait until you add a newline and forget to add them...
if(stuName[i].equals(nameKey) && stuGpa[i] == gpaKey) {
value = i;
break;
}
}
return value;
}
Best of Luck.
