String.equals() to change with bitwise AND on binary numbers - java

In Java, I am working with binary strings such as "00010010" (my program pads them with leading zeroes when it creates them). I have the function
private static boolean isJinSuperSets(String J, List<String> superSets) {
for (String superJ : superSets)
if (superJ.equals(J)) return true;
return false;
}
that checks if a binary string J is contained in the list of binary strings superSets.
I use equals() on the String objects, but I would like to speed up this code by converting the binary strings to binary numbers and using a bitwise AND to see if they are equal.
Could you please give me a few tricks on how to accomplish that?

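Here is a version for int:
private static boolean isJinSuperSets(String J, List<String> superSets) {
int j = Integer.parseInt(J, 2); // parse J once, outside the loop
for (String superJ : superSets)
if (Integer.parseInt(superJ, 2) == j) return true; // compare primitive ints, not Integer objects
return false;
}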
You have to benchmark this to know whether it is actually faster (take care: the first runs are always slower, before the JIT has warmed up).
The best optimization, if J is used more than once, is to keep its parsed int value somewhere and test against that.
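The advice about parsing each string only once can be sketched as follows. This is a minimal sketch, assuming every string holds at most 31 binary digits so that Integer.parseInt(s, 2) cannot overflow. The class name BinarySetLookup and the helper toIntSet are made up for illustration and are not from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BinarySetLookup {
    // Parse every binary string once (assumption: at most 31 digits per string).
    static Set<Integer> toIntSet(List<String> binaryStrings) {
        Set<Integer> values = new HashSet<>();
        for (String s : binaryStrings) {
            values.add(Integer.parseInt(s, 2));
        }
        return values;
    }

    // Membership test on the parsed values instead of on the strings.
    static boolean isJinSuperSets(String j, Set<Integer> superSets) {
        return superSets.contains(Integer.parseInt(j, 2));
    }

    public static void main(String[] args) {
        Set<Integer> superSets = toIntSet(Arrays.asList("00010010", "00000111"));
        System.out.println(isJinSuperSets("00010010", superSets)); // true
        System.out.println(isJinSuperSets("00010011", superSets)); // false
    }
}
```

Note that comparing parsed values ignores leading zeroes, so "10010" and "00010010" count as equal here, which String.equals() would not do. Whether this is faster than the original loop still has to be measured.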

Related

Comparing two char arrays in Java

I am trying to compare two char arrays lexicographically, using loops and arrays only. I solved the task; however, I think my code is bulky and unnecessarily long, and I would like advice on how to optimize it. I am a beginner. See the code below:
//Compare Character Arrays Lexicographically
//Write a program that compares two char arrays lexicographically (letter by letter).
// Research how to convert string to char array.
Scanner scanner = new Scanner(System.in);
String word1 = scanner.nextLine();
String word2 = scanner.nextLine();
char[] firstArray = word1.toCharArray();
char[] secondArray = word2.toCharArray();
for (char element : firstArray) {
System.out.print(element + " ");
}
System.out.println();
for (char element : secondArray) {
System.out.print(element + " ");
}
System.out.println();
String s = String.valueOf(firstArray);
String b = String.valueOf(secondArray);
int result = s.compareTo(b);
if (result < 0) {
System.out.println("First");
} else if (result > 0) {
System.out.println("Second");
} else {
System.out.println("Equal");
}
}
}
I think it's pretty normal. You've done it right. There's not much code to reduce here; the best you can do is drop the two for loops that print the char arrays. Or, if you do want to print the two arrays, use System.out.println(Arrays.toString(array_name)); instead of two dedicated for-each loops. It does the same thing in the background but makes your code look a little cleaner, which is what you are looking for.
As commented by tgdavies, your schoolwork assignment was likely intended for you to compare characters in your own code rather than call String#compareTo.
In real life, sorting words alphabetically is quite complex because of various cultural norms across various languages and dialects. For real work, we rely on collation tools rather than write our own sorting code. For example, an e with the diacritical ’ may sort before or after an e without, depending on the cultural context.
But for a schoolwork assignment, the goal of the exercise is likely to have you compare each letter of each word by examining its code point number, the number assigned to identify each character defined in Unicode. These code point numbers are assigned by Unicode in roughly alphabetical order. This code point number ordering is not sufficient to do sorting in real work, but is presumably good enough for your assignment, especially for text using only basic American English using letters a-z/A-Z.
So, if the numbers are the same, move to the next character in each word. When you reach the first position where the letters differ, then in overly simplistic terms, you know which word comes after the other alphabetically. If all the numbers are the same and the words have the same length, the words are the same; if one word runs out first, the shorter word sorts first.
Another real world problem is the char type has been legacy since Java 5, essentially broken since Java 2. As a 16-bit value, char is physically incapable of representing most characters.
So instead of char arrays, use int arrays to hold code point integer numbers.
int[] firstWordCodePoints = firstWord.codePoints().toArray() ;
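The letter-by-letter comparison described above, applied to code point arrays, might look like the sketch below. The class name CodePointCompare is hypothetical, and the result follows the sign convention of String#compareTo. As noted above, this is plain code point ordering, not culturally aware collation.

```java
final class CodePointCompare {
    // Returns a negative number, zero, or a positive number,
    // comparing whole code points instead of 16-bit char values.
    static int compare(String first, String second) {
        int[] a = first.codePoints().toArray();
        int[] b = second.codePoints().toArray();
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]); // first differing position decides
            }
        }
        // One word is a prefix of the other (or they are equal): the shorter sorts first.
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        int result = compare("apple", "apricot");
        System.out.println(result < 0 ? "First" : result > 0 ? "Second" : "Equal"); // First
    }
}
```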

Why use bit shifting instead of a for loop?

I created the following code to find parity of a binary number (i.e output 1 if the number of 1's in the binary word is odd, output 0 if the number of 1's is even).
public class CalculateParity {
String binaryword;
int totalones = 0;
public CalculateParity(String binaryword) {
this.binaryword = binaryword;
getTotal();
}
public int getTotal() {
for(int i=0; i<binaryword.length(); i++) {
if (binaryword.charAt(i) == '1'){
totalones += 1;
}
}
return totalones;
}
public int calcParity() {
if (totalones % 2 == 1) {
return 1;
}
else {
return 0;
}
}
public static void main(String[] args) {
CalculateParity bin = new CalculateParity("1011101");
System.out.println(bin.calcParity());
}
}
However, all of the solutions I find online almost always deal with using bit shift operators, XORs, unsigned shift operations, etc., like this solution I found in a data structure book:
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
Why is this the case? What makes bitwise operators more of a valid/standard solution than the solution I came up with, which is simply iterating through a binary word of type String? Is a bitwise solution more efficient? I appreciate any help!
The code that you have quoted uses a loop as well (i.e., while):
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
You need to acknowledge that you are using a string that you know beforehand will be composed of only digits, conveniently in a binary representation. Naturally, given those constraints, one does not need bitwise operations; one just parses the string char by char and does the desired computation.
On the other hand, if you receive as a parameter a long, as the method that you have quoted, then it comes in handy to use bitwise operations to go through each bit (at a time) in a number and perform the desired computation.
One could also convert the long into a string and apply the same logic code-wise that you have applied, but first, one would have to convert that long into binary. However, that approach would add extra unnecessary steps, more code, and would be performance-wise worse. Probably, the same applies vice versa if you have a String with your constraints. Nevertheless, a String is not a number, even if it is composed only of digits, which makes using a type that represents a number (e.g., long) an even more desirable approach.
Another thing that you are missing is that you did some of the heavy lifting already by converting a number to binary and encoding it into a String: new CalculateParity("1011101");. So you kind of skipped a step there. Now try to use your approach, but this time using "93", and find the parity.
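To make the "93" exercise concrete, here is a minimal sketch that parses the decimal text into a long and then applies the shift-and-XOR loop from the book. The class name ParityFromDecimal is made up for illustration. Long.bitCount is a standard JDK method, shown only as a cross-check.

```java
final class ParityFromDecimal {
    // XOR the lowest bit into the result, then shift right, until no set bits remain.
    static int parity(long x) {
        int result = 0;
        while (x != 0) {
            result ^= (int) (x & 1);
            x >>>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        long value = Long.parseLong("93");               // decimal text, not binary digits
        System.out.println(Long.toBinaryString(value));  // 1011101
        System.out.println(parity(value));               // 1
        System.out.println(Long.bitCount(value) & 1);    // 1, same answer from the JDK helper
    }
}
```

With the string-based approach, "93" would first have to be converted to "1011101", which is exactly the step the number-based version gets for free.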
If you want to know whether the number of 1s in a String is even, I think the method below is better. If you convert a String that is longer than 64 characters to a long, an error will occur. Both of the methods you mention have O(n) performance, so they will not perform very differently, but the shift method is more precise and uses slightly fewer CPU cycles.
private static boolean isEven(String s){
char[] chars = s.toCharArray();
int i = 0;
for(char c : chars){
i ^= c & 1; // keep only the low bit ('0' -> 0, '1' -> 1) so the string length does not affect the result
}
return i == 0;
}
You use a string based method for a string input. Good choice.
The code you quote uses an integer-based method for an integer input. An equally good choice.

Why java Set.contains() is faster than String.contains()?

For a problem to find common characters between 2 strings, at first I used the straight forward String.contains() method:
static String twoStrings(String s1, String s2) {
boolean subStringFound = false;
for(int i = 0; i < s2.length(); i++){
if(s1.contains(Character.toString(s2.charAt(i)))) {
subStringFound = true;
break;
}
}
return subStringFound?"YES":"NO";
}
However, it passed only 5 of the 7 test cases and timed out on the 2 cases with really long strings.
Then I tried with Set.contains():
static String twoStrings(String s1, String s2) {
boolean subStringFound = false;
HashSet<Character> set = new HashSet<>();
for(int i = 0; i < s1.length(); i++){
set.add(s1.charAt(i));
}
for(int i = 0; i < s2.length(); i++){
if(set.contains(s2.charAt(i))) {
subStringFound = true;
break;
}
}
return subStringFound?"YES":"NO";
}
And even though I'm running an additional loop to create a Set, it passed all the tests.
What is the main reason for this significant difference in runtime?
Because they are different data structures, and the contains method is implemented differently on them.
A string is a sequence of characters, so to test whether it contains a given character, you have to look at each character in the sequence and compare it. This algorithm is called linear search, and it takes O(n) time where n is the number of characters, meaning it takes proportionally more time when there are more characters.
A HashSet is a kind of hash table data structure. Basically, to test whether it contains a given character, you take the hash of that character, use the hash as an index in an array, and either the character is there (or very near to there), or it isn't. So you don't have to search the whole set; it takes O(1) time on average, meaning the time is roughly the same however many characters there are.
You'd have to look at the implementation in the JDK being used, but most likely String.contains is a linear search but HashSet.contains is not. From the HashSet documentation:
This class implements the Set interface, backed by a hash table (actually a HashMap instance)...
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
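The same constant-time lookup idea can be sketched without boxing characters. The version below swaps the HashSet<Character> from the question for a java.util.BitSet indexed by char value. That substitution is my own choice for illustration, not something the answers above propose, and the class name CommonCharCheck is hypothetical.

```java
import java.util.BitSet;

final class CommonCharCheck {
    // Marks each char value of s1 in a BitSet, then probes it for each char of s2.
    // Each probe is a direct index lookup, so the total work is O(s1.length() + s2.length()).
    static String twoStrings(String s1, String s2) {
        BitSet seen = new BitSet();
        for (int i = 0; i < s1.length(); i++) {
            seen.set(s1.charAt(i));
        }
        for (int i = 0; i < s2.length(); i++) {
            if (seen.get(s2.charAt(i))) {
                return "YES";
            }
        }
        return "NO";
    }

    public static void main(String[] args) {
        System.out.println(twoStrings("hello", "world")); // YES
        System.out.println(twoStrings("hi", "world"));    // NO
    }
}
```

By contrast, the String.contains() version rescans s1 for every character of s2, so its worst case grows with the product of the two lengths.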

Java string comparison using bitwise xor

I came across the below code snippet in a product's code. It is using bitwise XOR for string comparison. Is this better than the String.equals(Object o) method? What is the author trying to achieve here?
private static boolean compareSecure(String a, String b)
{
if ((a == null) || (b == null)) {
return (a == null) && (b == null);
}
int len = a.length();
if (len != b.length()) {
return false;
}
if (len == 0) {
return true;
}
int bits = 0;
for (int i = 0; i < len; i++) {
bits |= a.charAt(i) ^ b.charAt(i);
}
return bits == 0;
}
For context, the strings being equated are authentication tokens.
This is a common implementation of string comparison function that is invulnerable to timing attacks.
In short, the idea is to compare all the characters every time you compare strings, even if you find that some of them are not equal. A "standard" implementation just breaks on the first difference and returns false.
This is not secure because it gives away information about the compared strings. Specifically, if the left-side string is a secret you want to keep (e.g. a password), and the right-side string is something provided by the user, an unsafe method allows the hacker to uncover your password with relative ease, by repeatedly trying out different strings and measuring the response time.
The more leading characters the two strings have in common, the longer the insecure function takes to compare them.
For instance, comparing "1234567890" and "0987654321" using a standard method would result in doing just a single comparison of the first character and returning false, since 1!=0. On the other hand comparing "1234567890" to "1098765432", would result in executing 2 comparison operations, because the first ones are equal, you have to compare the second ones to find they are different. This would take a bit more time and it is measurable, even when we are talking about remote calls.
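The comparison counts in this example can be reproduced with a small sketch. It counts iterations of an early-exit loop rather than measuring real time, and it assumes both strings have the same length. The class and method names are made up for illustration.

```java
final class EarlyExitCount {
    // Counts how many character comparisons an early-exit loop performs
    // before it can decide (assumption: a and b have the same length).
    static int comparisonsUntilDecision(String a, String b) {
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            count++;
            if (a.charAt(i) != b.charAt(i)) {
                break; // the "standard" behaviour: stop at the first difference
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(comparisonsUntilDecision("1234567890", "0987654321")); // 1
        System.out.println(comparisonsUntilDecision("1234567890", "1098765432")); // 2
        System.out.println(comparisonsUntilDecision("1234567890", "1234567890")); // 10
    }
}
```

The count grows with the length of the matching prefix, and that growth is the signal a timing attack tries to observe.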
If you do N attacks with N different strings, each starting with a different character, you should see one of the results taking a fraction of a millisecond longer than the rest. This means the first character is correct, so the function has to take more time to compare the second one. Rinse and repeat for each position in the string and you can crack the secret orders of magnitude faster than brute force.
Preventing such an attack is the point of this implementation.
Edit: As diligently pointed out in a comment by Mark Rotteveel, this implementation is still vulnerable to a timing attack that is aimed at revealing the length of the string. Still, this is not a problem in many cases (either you don't care about the attacker knowing the length, or you deal with data that is standard and anyone can know the length anyway, for instance some kind of known-length hash).
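As an alternative to a hand-written loop, the JDK ships java.security.MessageDigest.isEqual(byte[], byte[]), which can be used for the same purpose. The sketch below is an illustration under assumptions rather than a vetted security component: the class name TokenCompare is hypothetical, the tokens are assumed to be encoded as UTF-8, and you should check the Javadoc of the JDK you actually run for the timing behaviour it documents.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class TokenCompare {
    // Same null handling as the compareSecure method in the question,
    // but the byte-wise comparison is delegated to the JDK.
    static boolean compareSecure(String expected, String provided) {
        if (expected == null || provided == null) {
            return expected == null && provided == null;
        }
        byte[] a = expected.getBytes(StandardCharsets.UTF_8);
        byte[] b = provided.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(a, b);
    }

    public static void main(String[] args) {
        System.out.println(compareSecure("token-123", "token-123")); // true
        System.out.println(compareSecure("token-123", "token-124")); // false
    }
}
```

The caveat about revealing the length still applies to this sketch.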

Why does the equals method in String not use hash?

The code of the equals method in class String is
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}
I have a question - why does this method not use hashCode() ?
As far as I know, hashCode() can compare two strings rapidly.
UPDATE: I know, that two unequal strings, can have same hashes. But two equal strings have equal hashes. So, by using hashCode(), we can immediately see that two strings are unequal.
I'm simply thinking that using hashCode() can be a good filter in equals.
UPDATE 2: Here is some code to show what we are talking about.
It is an example of how the String equals method could look:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
if (hashCode() == anotherString.hashCode()){
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}else{
return false;
}
}
return false;
}
Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.
String hashcodes are lazily calculated, although they do use a "guard" value. If you're comparing strings with long lifetimes (ie, they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or ignoring the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be ignoring the check more often than you'll be using it.
In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.
This question has actually been considered by the developers of the JDK. I could not find in the various messages why it has not been included. The enhancement is also listed in the bug database.
Namely, one of the proposed changes is:
public boolean equals(Object anObject) {
if (this == anObject) // 1st check identity
return true;
if (anObject instanceof String) { // 2nd check type
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) { // 3rd check lengths
if (n != 0) { // 4th avoid loading registers from members if length == 0
int h1 = hash, h2 = anotherString.hash;
if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
return false;
There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).
1) Calculating hashCode may not be faster than comparing the Strings directly.
2) if the hashCode is equal, the Strings may still not be equal
This can be a good idea for many use cases.
However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.
I'm gonna guess that the majority of String.equals() calls are made from a HashMap, after the hash codes are already known to be equal, so testing the hash codes again is pointless.
If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and char-by-char comparison fails on 1st char. So it'll be a waste to check hashes.
AFAIK, the following check could be added to String. It checks that if the hash codes are set and they are different, then the Strings cannot be equal.
if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
return false;
The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.
While String appears as immutable for the usual programmer, it does have the private field to store its hashcode once it is computed. However this field is only computed when hashcode is first required. As you can see from the String source code here:
private int hash;
public int hashCode() {
int h = hash;
if (h == 0) {
...
hash = h;
}
return h;
}
Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe the same instances of really long strings are compared with each other a great many times), it still may be worth it: profile.
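If profiling does show that the same long strings are compared many times, the hash pre-check can be tried outside of String itself. Below is a minimal sketch of a wrapper that computes the hash once in its constructor and uses it as a filter in equals. The class HashedString is hypothetical and assumes a non-null value.

```java
final class HashedString {
    private final String value;
    private final int hash; // computed eagerly here, unlike the lazy field in String

    HashedString(String value) {
        this.value = value; // assumption: never null
        this.hash = value.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof HashedString)) return false;
        HashedString that = (HashedString) other;
        // Different hashes prove inequality; equal hashes prove nothing,
        // so the full comparison is still needed in that case.
        return hash == that.hash && value.equals(that.value);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" share the hash code 2112 but are not equal.
        System.out.println(new HashedString("Aa").equals(new HashedString("BB"))); // false
        System.out.println(new HashedString("Aa").equals(new HashedString("Aa"))); // true
    }
}
```

The constructor pays the full O(n) hashing cost up front, so this only helps when each instance takes part in many comparisons.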
As I think, hashCode() can make comparison of two strings quicker.
Arguments?
Arguments against this proposal:
More Operations
hashcode() from String has to access every character in the String and has to do 2 calculations for every character.
So we need for a string with n characters 5*n operations (load, multiplication, lookup/load, multiplication, store). Two times, because we compare two Strings. (Ok, one store and one load does not really count in a reasonable implementation.)
For the best case, this makes a total of 10*x operations for two strings with length m and n and x=min(m,n). Worst case is 10*x with x=m=n. Average somewhere between, perhaps (m*n)/2.
The current equals implementation needs in the best case 3 operations. 2 loads, 1 compare. Worst is 3*x operations for two strings with length m and n and x=m=n. Average is somewhere between, perhaps 3*(m*n)/2.
Even if we cache the hash, it is not clear that we save something
We have to analyze usage patterns. It could be that most of the time, we will only ask one time for equals, not multiple times. Even if we ask multiple times, it could not be enough to have time savings from the caching.
Not directly about speed, but still good counterarguments:
Counter intuitive
We do not expect a hashcode in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowing about hashing) will wonder what is happening there.
Leads to bad examples/unexpected behavior
I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equals() against 'b'?" :)
If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.
String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.
Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).
