Java string comparison using bitwise xor


I came across the below code snippet in a product's code. It is using bitwise XOR for string comparison. Is this better than the String.equals(Object o) method? What is the author trying to achieve here?
private static boolean compareSecure(String a, String b)
{
if ((a == null) || (b == null)) {
return (a == null) && (b == null);
}
int len = a.length();
if (len != b.length()) {
return false;
}
if (len == 0) {
return true;
}
int bits = 0;
for (int i = 0; i < len; i++) {
bits |= a.charAt(i) ^ b.charAt(i);
}
return bits == 0;
}
For context, the strings being equated are authentication tokens.

This is a common implementation of a string comparison function that resists timing attacks.
In short, the idea is to compare all the characters every time you compare strings, even after a mismatch has been found. A "standard" implementation breaks on the first difference and returns false.
That is not secure because it leaks information about the compared strings. Specifically, if the left-side string is a secret you want to keep (e.g. a password) and the right-side string is supplied by the user, an unsafe method lets an attacker recover the secret with relative ease by repeatedly trying different strings and measuring the response time.
The more leading characters the two strings share, the longer the insecure function takes to compare them.
For instance, comparing "1234567890" and "0987654321" using a standard method results in a single comparison of the first characters before returning false, since 1 != 0. On the other hand, comparing "1234567890" to "1098765432" results in 2 comparison operations: because the first characters are equal, the second ones have to be compared to find that they differ. This takes a bit more time, and the difference is measurable, even when we are talking about remote calls.
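To make the early-exit behaviour concrete, here is a minimal sketch that counts how many character comparisons a naive loop performs before it gives up. The class and method names are made up for illustration, and the real String.equals may be implemented differently, so treat the counts as a model of the idea rather than a measurement.

```java
// Hypothetical helper, not part of the original code: models an early-exit comparison.
final class EarlyExitDemo {

    // Counts the character comparisons an early-exit loop performs
    // on two strings of equal length before it stops.
    static int comparisonsBeforeExit(String a, String b) {
        int comparisons = 0;
        for (int i = 0; i < a.length(); i++) {
            comparisons++;
            if (a.charAt(i) != b.charAt(i)) {
                break; // first mismatch: a naive comparison stops here
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println(comparisonsBeforeExit("1234567890", "0987654321")); // 1
        System.out.println(comparisonsBeforeExit("1234567890", "1098765432")); // 2
    }
}
```

The amount of work grows with the length of the shared prefix, and that is exactly the signal a timing attack tries to measure.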
If you make N attempts with N different strings, each starting with a different character, one of them should take a fraction of a millisecond longer than the rest. That means its first character matches, so the function went on to compare the second one. Rinse and repeat for each position in the string and you can crack the secret orders of magnitude faster than brute force.
Preventing this attack is the point of the implementation.
Edit: As Mark Rotteveel diligently pointed out in a comment, this implementation is still vulnerable to a timing attack aimed at revealing the length of the string. This is not a problem in many cases: either you don't care whether the attacker knows the length, or the data has a standard, publicly known length anyway (for instance some kind of fixed-length hash).
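If you would rather not maintain such a loop yourself, the JDK has a byte-array comparison, java.security.MessageDigest.isEqual, which recent JDK documentation describes as taking time that does not depend on the contents of the arrays. The sketch below shows one possible way to use it for tokens. The class and method names are assumptions for illustration, and the caveat about leaking the length still applies.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical wrapper, shown only as a sketch.
final class TokenCompare {

    // Compares two tokens via MessageDigest.isEqual on their UTF-8 bytes.
    static boolean tokensMatch(String expected, String provided) {
        if (expected == null || provided == null) {
            return expected == null && provided == null;
        }
        byte[] x = expected.getBytes(StandardCharsets.UTF_8);
        byte[] y = provided.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(x, y);
    }

    public static void main(String[] args) {
        System.out.println(tokensMatch("s3cr3t-token", "s3cr3t-token")); // true
        System.out.println(tokensMatch("s3cr3t-token", "s3cr3t-tokeX")); // false
    }
}
```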

Related

Why use bit shifting instead of a for loop?

I created the following code to find parity of a binary number (i.e output 1 if the number of 1's in the binary word is odd, output 0 if the number of 1's is even).
public class CalculateParity {
String binaryword;
int totalones = 0;
public CalculateParity(String binaryword) {
this.binaryword = binaryword;
getTotal();
}
public int getTotal() {
for(int i=0; i<binaryword.length(); i++) {
if (binaryword.charAt(i) == '1'){
totalones += 1;
}
}
return totalones;
}
public int calcParity() {
if (totalones % 2 == 1) {
return 1;
}
else {
return 0;
}
}
public static void main(String[] args) {
CalculateParity bin = new CalculateParity("1011101");
System.out.println(bin.calcParity());
}
}
However, all of the solutions I find online almost always deal with using bit shift operators, XORs, unsigned shift operations, etc., like this solution I found in a data structure book:
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
Why is this the case? What makes bitwise operators more of a valid/standard solution than the solution I came up with, which is simply iterating through a binary word of type String? Is a bitwise solution more efficient? I appreciate any help!

The code that you have quoted uses a loop as well (i.e., while):
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
Note that you are working with a string that you know beforehand contains only digits, conveniently already in binary representation. Given those constraints, there is no need for bitwise operations; one simply parses the string char by char and does the desired computation.
On the other hand, if you receive a long as the parameter, as in the method you quoted, then bitwise operations are a handy way to walk through the number one bit at a time and perform the desired computation.
One could also convert the long into a string and apply the same logic that you have applied, but one would first have to convert that long into binary. That approach would add unnecessary steps and more code, and it would perform worse. Probably the same applies the other way round if you have a String with your constraints. Nevertheless, a String is not a number, even if it is composed only of digits, which makes a type that represents a number (e.g., long) an even more desirable choice.
Another thing you are missing is that you already did some of the heavy lifting by converting the number to binary and encoding it in a String: new CalculateParity("1011101"). So you skipped a step there. Now try your approach again, but this time pass "93" and find the parity.
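As a small sketch of that last point: 93 is the decimal value of the binary word 1011101, so a numeric input needs no hand-made binary string at all. The class name below is hypothetical. The loop mirrors the XOR-and-shift approach from the book, and Long.bitCount from the standard library is shown as an alternative.

```java
// Hypothetical demo class; parity(long) follows the XOR-and-shift idea quoted above.
final class ParityDemo {

    // Returns 1 if the number of set bits in x is odd, otherwise 0.
    static int parity(long x) {
        int result = 0;
        while (x != 0) {
            result ^= (int) (x & 1); // fold the lowest bit into the running parity
            x >>>= 1;                // unsigned shift, so negative values terminate too
        }
        return result;
    }

    public static void main(String[] args) {
        long fromBinaryText = Long.parseLong("1011101", 2);
        long fromDecimalText = Long.parseLong("93");
        System.out.println(fromBinaryText == fromDecimalText);   // true
        System.out.println(parity(fromDecimalText));             // 1 (five bits are set)
        System.out.println(Long.bitCount(fromDecimalText) & 1);  // 1
    }
}
```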

If you want to know whether a String has even parity, I think the method below is better.
If you convert the String to a long and the String is longer than 64 characters, an error will occur.
Both of the methods you mention have O(n) performance, so there will not be a big difference, but the shift method is more precise and uses slightly fewer CPU cycles.
private static boolean isEven(String s){
char[] chars = s.toCharArray();
int i = 0;
for(char c : chars){
i ^= c;
}
// '0' is 0x30 and '1' is 0x31, so bit 0 of the XOR is the parity of the number of 1s
return (i & 1) == 0;
}

You use a string based method for a string input. Good choice.
The code you quote uses an integer-based method for an integer input. An equally good choice.

Counting The Number of Valleys [duplicate]

This question already has answers here:
How do I compare strings in Java?
I'm doing a problem on hackerRank and the problem is:
Problem Statement
Here we have to count the number of valleys that a person XYZ visits.
A valley is a sequence of consecutive steps below sea level, starting with a step down from sea level and ending with a step up to sea level.
Each step up is denoted by U and each step down by D. We are given the number of steps the person took, plus the sequence of ups and downs in the form of a string, e.g.
UUDDUDUDDUU
Sample Input
8
UDDDUDUU
Sample Output
1
Explanation
If we represent _ as sea level, a step up as /, and a step down as \, Gary's hike can be drawn as:
_/\      _
   \    /
    \/\/
He enters and leaves one valley.
The code I wrote doesn't work
static int countingValleys(int n, String s) {
int count = 0;
int level = 0;
String[] arr = s.split("");
for(int i = 0; i<n;i++){
if(arr[i] == "U"){
level++;
} else{
level--;
}
if(level==0 && arr[i]=="U"){
count++;
}
}
return count;
}
But another solution I found does, however no matter how I look at it the logic is the same as mine:
static int countingValleys(int n, String s) {
int v = 0; // # of valleys
int lvl = 0; // current level
for(char c : s.toCharArray()){
if(c == 'U') ++lvl;
if(c == 'D') --lvl;
// if we just came UP to sea level
if(lvl == 0 && c == 'U')
++v;
}
return v;
}
So what's the difference I'm missing here that causes mine to not work?
Thanks.

In Java, you need to do this to compare String values:
if("U".equals (arr[i])) {
And not this:
if(arr[i] == "U") {
The former compares the value "U" to the contents of arr[i].
The latter checks whether the two references point to the same object instance, not whether the contents match. You could think of this as: do they refer to the same block of memory? The answer, in this case, is that they do not.
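A minimal sketch of the difference, using new String("U") to guarantee a separate object with the same content. The class and helper names are made up for illustration.

```java
// Hypothetical demo class contrasting reference comparison with content comparison.
final class StringEqualityDemo {

    // True only if both references point to the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // True if both strings hold the same sequence of characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "U";
        String runtime = new String("U"); // distinct object, same characters
        System.out.println(sameObject(literal, runtime));  // false
        System.out.println(sameContent(literal, runtime)); // true
    }
}
```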
To address the other aspect of your question.
Why this works:
for(char c : s.toCharArray()){
if(c == 'U') ++lvl;
if(c == 'D') --lvl;
when this does not:
String[] arr = s.split("");
for(int i = 0; i<n;i++){
if(arr[i] == "U"){
You state that the logic is the same. Hmmmm, maybe, but the data types are not.
In the first version, the string s is converted into an array of character values. These are primitive values (i.e. an array of values of a primitive data type), just like numbers are (ignoring autoboxing for a moment). Since char is a primitive type, the == operator compares the values themselves. Thus c == 'U' (or "is the primitive character value in c equal to the literal value 'U'") results in true if c happens to contain the letter 'U'.
In the second version, the string s is split into an array of strings. This is an array of instances (or more precisely, an array of references to instances) of String objects. In this case the == operator compares the reference values (you might think of these as pointers to the two strings). That is, the value of arr[i] (i.e. the reference to the string) is compared to the reference to the string literal "U" (or "D"). Thus arr[i] == "U" (or "is the reference in arr[i] equal to the reference to the String instance that holds the literal "U"") is false because these two strings are in different locations in memory.
As mentioned above, since they are different instances of String objects the == test is false (the fact that they just happen to contain the same value is irrelevant in Java because the == operator doesn't look at the content). Hence the need for the various equals, equalsIgnoreCase and some other methods associated with the String class that define exactly how you wish to "compare" the two string values. At risk of confusing you further, you could consider a "reference" or "pointer" to be a primitive data type, and thus, the behaviour of == is entirely consistent.
If this doesn't make sense, then think about it in terms of other object types. For example, consider a Person class which maybe has name, date of birth and zip/postcode attributes. If two instances of Person happen to have the same name, DOB and zip/postcode, does that mean that they are the same Person? Maybe, but it could also mean that they are two different people that just happen to have the same name, same date of birth and just happen to live in the same suburb. While unlikely, it definitely does happen.
FWIW, the behaviour of == in Java is the same behaviour as == in 'C'. For better or worse, right or wrong, this is the behaviour that the Java designers chose for == in Java.
It is worth noting that other languages, e.g. Scala, define the == operator for Strings (again rightly or wrongly, for better or worse) to compare the values of the strings. So, in theory, if you addressed the other syntactic issues, your arr[i] == "U" test would work in Scala. It all boils down to understanding the rules that the various operators and methods implement.
Going back to the Person example, assume Person was defined as a case class in Scala. If we created two instances of Person with the same name, DOB and zip/postcode (e.g. p1 and p2), then p1 == p2 would be true (in Scala). To perform a reference comparison (i.e. do p1 and p2 refer to the same object), we would need to use p1.eq(p2) (which would result in false).
Hopefully the Scala reference does not create additional confusion. If it does, then simply think of it this way: what an operator (or method) does is defined by the designers of the language or library you are using, and you need to understand what their rules are.
At the time Java was designed, C was prevalent, so it can be argued that replicating the C-like behaviour of == in Java was a sensible choice at that time. As time has moved on, more people have come to think that == should be a value comparison, and thus some languages have implemented it that way.

Equals validation vs indexOf validation?

I need to check whether a String contains the char $ before replacing it.
I wrote two implementations for this purpose.
The first implementation always executes replace(char oldChar, char newChar) and uses equals(Object anObject) as the validation.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
String importStaticLine = importLine.replace('$', '.');
if (importLine.equals(importStaticLine)) {
return String.format("import %s;", importLine);
}
return String.format("import static %s;", importStaticLine);
}
This implementation parses the string two times with:
importLine.replace('$', '.')
importLine.equals(importStaticLine)
The second implementation uses indexOf(int ch) as validation and replace(char oldChar, char newChar) in the worst case.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
if (importLine.indexOf('$') == -1) {
return String.format("import %s;", importLine);
}
importLine = importLine.replace('$', '.');
return String.format("import static %s;", importLine);
}
The second implementation, in the worst case, parses the string two times with:
importLine.indexOf('$') == -1
importLine.replace('$', '.')
Is there some difference in terms of performance between the use of equals vs indexOf as validation?

What you are asking about is the difference in execution time between String.indexOf and String.equals. In Big-O notation these are the same, since both will (in the worst case) iterate through the entire String before returning.
In practice, it really depends on the input.
For instance:
equals will return pretty much immediately if the two strings compared have different lengths
equals will return sooner if the difference between the strings occurs early ("abcdef".equals("aXcdef") is faster than "abcdef".equals("abcdeX"))
indexOf('$') will be faster if $ occurs early in the string ("a$cdef".indexOf('$') is faster than "abcde$".indexOf('$'))
indexOf will be slower if the input is a supplementary character (a code point above U+FFFF), which takes a slower code path
On modern computers this should not matter, since they are so fast that any difference will be unnoticeable unless the method is called hundreds of thousands of times (or with really large input strings). When optimizing code one should focus on saving seconds, not nanoseconds. With your current problem you should be a lot more worried about making your code readable and understandable to others than about which version uses the most CPU cycles.
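One detail that may be relevant to the first implementation: the Javadoc for String.replace(char, char) states that a reference to the same String object is returned when oldChar does not occur in the string. Assuming that contract, a sketch like the following can detect "nothing was replaced" with a reference check instead of a second pass through equals. The class name, the method name and the plain String parameter are assumptions for illustration, and toSanitizedClassName from the question is left out.

```java
// Hypothetical variant of getImportLine that takes the class name directly.
final class ImportLineDemo {

    static String importLineFor(String className) {
        String replaced = className.replace('$', '.');
        // Same reference means replace() found no '$' (documented for replace(char, char)).
        if (replaced == className) {
            return String.format("import %s;", className);
        }
        return String.format("import static %s;", replaced);
    }

    public static void main(String[] args) {
        System.out.println(importLineFor("java.util.Map"));
        System.out.println(importLineFor("java.util.Map$Entry"));
    }
}
```

Whether this is more readable than the indexOf version is a matter of taste; the reference check relies on a documented but easily overlooked guarantee.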

Why does the equals method in String not use hash?

The code of the equals method in class String is
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}
I have a question - why does this method not use hashCode() ?
As far as I know, hashCode() can compare two strings rapidly.
UPDATE: I know, that two unequal strings, can have same hashes. But two equal strings have equal hashes. So, by using hashCode(), we can immediately see that two strings are unequal.
I'm simply thinking that using hashCode() can be a good filter in equals.
UPDATE 2: Here is some code to show what we are talking about.
It is an example of how the String equals method could look:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
if (hashCode() == anotherString.hashCode()){
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}else{
return false;
}
}
return false;
}

Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.
String hashcodes are lazily calculated, although they do use a "guard" value. If you're comparing strings with long lifetimes (ie, they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or ignoring the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be ignoring the check more often than you'll be using it.
In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.

This question has actually been considered by the developers of the JDK. I could not find in the various messages why it has not been included. The enhancement is also listed in the bug database.
Namely, one of the proposed changes is:
public boolean equals(Object anObject) {
if (this == anObject) // 1st check identity
return true;
if (anObject instanceof String) { // 2nd check type
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) { // 3rd check lengths
if (n != 0) { // 4th avoid loading registers from members if length == 0
int h1 = hash, h2 = anotherString.hash;
if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
return false;
There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).

1) Calculating hashCode may not be faster than comparing the Strings directly.
2) if the hashCode is equal, the Strings may still not be equal
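The second point is easy to demonstrate: "Aa" and "BB" are different strings with the same hash code. The sketch below wraps the idea from the question in a hypothetical helper (hashFilteredEquals is not a JDK method) to show that the character comparison is still needed when the hashes match.

```java
// Hypothetical demo of a hash pre-check in front of equals.
final class HashFilterDemo {

    // Unequal hashes prove inequality; equal hashes prove nothing.
    static boolean hashFilteredEquals(String a, String b) {
        if (a.hashCode() != b.hashCode()) {
            return false;
        }
        return a.equals(b); // still required because of collisions
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());                // 2112
        System.out.println("BB".hashCode());                // 2112
        System.out.println(hashFilteredEquals("Aa", "BB")); // false
        System.out.println(hashFilteredEquals("Aa", "Aa")); // true
    }
}
```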

This can be a good idea for many use cases.
However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.
I'm gonna guess that the majority of String.equals() calls happen inside a HashMap, after the hash codes are already known to be equal, so testing the hash codes again is pointless.
If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and char-by-char comparison fails on 1st char. So it'll be a waste to check hashes.

AFAIK, the following check could be added to String. It checks that if both hash codes are set and they are different, then the Strings cannot be equal.
if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
return false;

The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.
While String appears as immutable for the usual programmer, it does have the private field to store its hashcode once it is computed. However this field is only computed when hashcode is first required. As you can see from the String source code here:
private int hash;
public int hashCode() {
int h = hash;
if (h == 0) {
...
hash = h;
}
return h;
}
Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe the same instances of really long strings are compared with each other a great many times), it still may be worthwhile: profile.
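For reference, the String.hashCode Javadoc specifies the value as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic. The sketch below recomputes it by hand (the class and method names are hypothetical) to show that an uncached hash has to touch every character, with no early exit.

```java
// Hypothetical re-implementation of the documented String hash formula.
final class ManualHashDemo {

    // Visits every character once; there is no way to stop early.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String sample = "hello";
        System.out.println(manualHash(sample) == sample.hashCode()); // true
    }
}
```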

As I think, hashCode() can make comparison of two strings quicker.
Arguments?
Arguments against this proposal:
More Operations
hashcode() from String has to access every character in the String and has to do 2 calculations for every character.
So for a string with n characters we need 5*n operations (load, multiplication, lookup/load, multiplication, store). Two times, because we compare two Strings. (Ok, one store and one load does not really count in a reasonable implementation.)
Since hashing has no early exit, this makes a total of 5*(m+n) operations for two strings with length m and n, i.e. 10*x when x=m=n, in the best case as well as in the worst case.
The current equals implementation needs 3 operations in the best case: 2 loads, 1 compare. The worst case is 3*x operations for two strings with length m and n and x=m=n. The average is somewhere in between, perhaps 3*x/2.
Even if we cache the hash, it is not clear that we save something
We have to analyze usage patterns. It could be that most of the time we only call equals once for a given string, not multiple times. Even if we call it multiple times, that might not be enough for the caching to save time.
Not directly about speed, but still good counterarguments:
Counter intuitive
We do not expect a hashcode in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowing about hashing) will wonder what is happening there.
Leads to bad examples/unexpected behavior
I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equals() against 'b'?" :)

If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.
String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.
Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).

Compare first three characters of two strings

Strings s1 and s2 will always be of length 1 or higher.
How can I speed this up?
int l1 = s1.length();
if (l1 > 3) { l1 = 3; }
if (s2.startsWith(s1.substring(0,l1)))
{
// do something..
}
Regex maybe?

Rewrite to avoid object creation
Your instincts were correct. The creation of new objects (substring()) is not very fast and it means that each one created must incur g/c overhead as well.
This might be a lot faster:
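static boolean fastCmp(String s1, String s2) {
// clamp the length so that an s1 shorter than 3 characters is still compared in full
return s1.regionMatches(0, s2, 0, Math.min(3, s1.length()));
}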

This seems pretty reasonable. Is this really too slow for you? You sure it's not premature optimization?

if (s2.startsWith(s1.substring(0, Math.min(3, s1.length())))) {..};
Btw, there is nothing slow in it. startsWith has complexity O(n)
Another option is to compare the char values, which might be more efficient:
boolean match = true;
for (int i = 0; i < Math.min(Math.min(s1.length(), 3), s2.length()); i++) {
if (s1.charAt(i) != s2.charAt(i)) {
match = false;
break;
}
}

My java isn't that good, so I'll give you an answer in C#:
int len = Math.Min(s1.Length, Math.Min(s2.Length, 3));
for(int i=0; i< len; ++i)
{
if (s1[i] != s2[i])
return false;
}
return true;
Note that unlike yours and Bozho's, this does not create a new string, which would be the slowest part of your algorithm.

Perhaps you could do this
if (s1.length() > 3 && s2.length() > 3 && s1.indexOf (s2.substring (0, 3)) == 0)
{
// do something..
}

There is context missing here:
What are you trying to scan for? What type of application? How often is it expected to run?
These things are important because different scenarios call for different solutions:
If this is a one-time scan then this is probably unneeded optimization. Even for a 20MB text file, it wouldn't take more than a couple of minutes in the worst case.
If you have a set of inputs and for each of them you're scanning all the words in a 20MB file, it might be better to sort/index the 20MB file to make it easy to look up matches and skip the 99% of unnecessary comparisons. Also, if inputs tend to repeat themselves it might make sense to employ caching.
Other solutions might also be relevant, depending on the actual problem.
But if you boil it down only to comparing the first 3 characters of two strings, I believe the code snippets given here are as good as you're going to get - they're all O(1)*, so there's no drastic optimization you can do.
*The only place where this might not hold true is if getting the length of the string is O(n) rather than O(1) (which is the case for the strlen function in C++), which is not the case for Java and C# string objects.
