Equals validation vs indexOf validation?

Equals validation vs indexOf validation? - java

I need to validate if one String contains the char $ before replace this one.
I did two implementations for this propose.
The first implementation always execute replace(char oldChar, char newChar) and equals(Object anObject) as validation.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
String importStaticLine = importLine.replace('$', '.');
if (importLine.equals(importStaticLine)) {
return String.format("import %s;", importLine);
}
return String.format("import static %s;", importStaticLine);
}
This implementation parses the string two times with:
importLine.replace('$', '.')
importLine.equals(importStaticLine)
The second implementation uses indexOf(int ch) as validation and replace(char oldChar, char newChar) in the worst case.
String getImportLine(Class<?> clazz) {
String importLine = toSanitizedClassName(clazz.getName());
if (importLine.indexOf('$') == -1) {
return String.format("import %s;", importLine);
}
importLine = importLine.replace('$', '.');
return String.format("import static %s;", importLine);
}
The second implementation, in the worst case, parse the string two times with:
importLine.indexOf('$') == -1
importLine.replace('$', '.')
Is there some difference in terms of performance between the use of equals vs indexOf as validation?

What you are asking are the difference in execution time between String.indexOf and String.equals. With Big-O notation these are the same, since both (worst case) will iterate through the entire String before returning.
In practice, it really depends on the input.
For instance:
equals will return pretty much immediatly if the two strings compared are a different length
equals will return sooner if the difference in the strings occur early ("abcdef".equals("aXcdef") is faster than "abcdef".equals("abcdeX"))
indexOf('$') will be faster if $ occurs early in the string ("a$cdef".indexOf('$') is faster than "abcde$".indexOf('$'))
indexOf will be slower if the input char is a special character
On modern computers this should not matter, since they are so fast that any difference will be unnoticable, unless the method is called hundreds of thousands of times (or with really large input strings). When optimizing code one should focus on saving seconds, not nanoseconds. With your current problem you should be a lot more worried about making your code readable and understandable to others than you should be worried about which uses the most CPU cycles..

Related

Count the Characters in a String Recursively & treat "eu" as a Single Character

I am new to Java, and I'm trying to figure out how to count Characters in the given string and threat a combination of two characters "eu" as a single character, and still count all other characters as one character.
And I want to do that using recursion.
Consider the following example.
Input:
"geugeu"
Desired output:
4 // g + eu + g + eu = 4
Current output:
2
I've been trying a lot and still can't seem to figure out how to implement it correctly.
My code:
public static int recursionCount(String str) {
if (str.length() == 1) {
return 0;
}
else {
String ch = str.substring(0, 2);
if (ch.equals("eu") {
return 1 + recursionCount(str.substring(1));
}
else {
return recursionCount(str.substring(1));
}
}
}

OP wants to count all characters in a string but adjacent characters "ae", "oe", "ue", and "eu" should be considered a single character and counted only once.
Below code does that:
public static int recursionCount(String str) {
int n;
n = str.length();
if(n <= 1) {
return n; // return 1 if one character left or 0 if empty string.
}
else {
String ch = str.substring(0, 2);
if(ch.equals("ae") || ch.equals("oe") || ch.equals("ue") || ch.equals("eu")) {
// consider as one character and skip next character
return 1 + recursionCount(str.substring(2));
}
else {
// don't skip next character
return 1 + recursionCount(str.substring(1));
}
}
}

Recursion explained
In order to address a particular task using Recursion, you need a firm understanding of how recursion works.
And the first thing you need to keep in mind is that every recursive solution should (either explicitly or implicitly) contain two parts: Base case and Recursive case.
Let's have a look at them closely:
Base case - a part that represents a simple edge-case (or a set of edge-cases), i.e. a situation in which recursion should terminate. The outcome for these edge-cases is known in advance. For this task, base case is when the given string is empty, and since there's nothing to count the return value should be 0. That is sufficient for the algorithm to work, outcomes for other inputs should be derived from the recursive case.
Recursive case - is the part of the method where recursive calls are made and where the main logic resides. Every recursive call eventually hits the base case and stars building its return value.
In the recursive case, we need to check whether the given string starts from a particular string like "eu". And for that we don't need to generate a substring (keep in mind that object creation is costful). instead we can use method String.startsWith() which checks if the bytes of the provided prefix string match the bytes at the beginning of this string which is chipper (reminder: starting from Java 9 String is backed by an array of bytes, and each character is represented either with one or two bytes depending on the character encoding) and we also don't bother about the length of the string because if the string is shorter than the prefix startsWith() will return false.
Implementation
That said, here's how an implementation might look:
public static int recursionCount(String str) {
if(str.isEmpty()) {
return 0;
}
return str.startsWith("eu") ?
1 + recursionCount(str.substring(2)) : 1 + recursionCount(str.substring(1));
}
Note: that besides from being able to implement a solution, you also need to evaluate it's Time and Space complexity.
In this case because we are creating a new string with every call time complexity is quadratic O(n^2) (reminder: creation of the new string requires allocating the memory to coping bytes of the original string). And worse case space complexity also would be O(n^2).
There's a way of solving this problem recursively in a linear time O(n) without generating a new string at every call. For that we need to introduce the second argument - current index, and each recursive call should advance this index either by 1 or by 2 (I'm not going to implement this solution and living it for OP/reader as an exercise).
In addition
In addition, here's a concise and simple non-recursive solution using String.replace():
public static int count(String str) {
return str.replace("eu", "_").length();
}
If you would need handle multiple combination of character (which were listed in the first version of the question) you can make use of the regular expressions with String.replaceAll():
public static int count(String str) {
return str.replaceAll("ue|au|oe|eu", "_").length();
}

Java string comparison using bitwise xor

I came across the below code snippet in a product's code. It is using bitwise XOR for string comparison. Is this better than the String.equals(Object o) method? What is the author trying to achieve here?
private static boolean compareSecure(String a, String b)
{
if ((a == null) || (b == null)) {
return (a == null) && (b == null);
}
int len = a.length();
if (len != b.length()) {
return false;
}
if (len == 0) {
return true;
}
int bits = 0;
for (int i = 0; i < len; i++) {
bits |= a.charAt(i) ^ b.charAt(i);
}
return bits == 0;
}
For context, the strings being equated are authentication tokens.

This is a common implementation of string comparison function that is invulnerable to timing attacks.
In short, the idea is to compare all the characters every time you compare strings, even if you find any of them are not equal. In "standard" implementation you just break on the first difference and return false.
This is not secure because it gives away the information about the compared strings. Specifically if the left-side string is a secret you want to keep (e.g. password), and the right-side string is something provided by the user, an unsafe method allows the hacker to uncover your password with a relative ease, by repeatedly trying out different strings and measuring the response time.
The more characters in the two strings are identical, the more the 'unsecure' function would take to compare them.
For instance, comparing "1234567890" and "0987654321" using a standard method would result in doing just a single comparison of the first character and returning false, since 1!=0. On the other hand comparing "1234567890" to "1098765432", would result in executing 2 comparison operations, because the first ones are equal, you have to compare the second ones to find they are different. This would take a bit more time and it is measurable, even when we are talking about remote calls.
If you do N attacks with N different strings, each starting with a different character, you should see one of the of the results taking a fraction of a milisecond more then the rest. This means the first character is the same, so the function has to take more time to compare the second one. Rinse and repeat for each position in the string and you can crack the secret orders of magnitude faster then brute force.
Preventing such attack is the point of such implementation.
Edit: As diligently pointed out in comment by Mark Rotteveel, this implementation is still vulnerable to timing attack that is aimed at revealing the length of the string. Still this is not a problem in many cases (either you don't care about attacker knowing the length or you deal with data that is standard and anyone can know the length anyway, for instance some kind of known-length hash)

Space complexity of a recursive algorithm

I have a recursive algorithm to find a palindrome in Java. It should return true if the given string is palindrome. False otherwise. This method uses substring method, which is little bit trickier to find the complexity.
Here's my algorithm:
static boolean isPalindrome (String str) {
if (str.length() > 1) {
if (str.charAt(0) == (str.charAt(str.length() - 1))) {
if (str.length() == 2) return true;
return isPalindrome(str.substring(1, str.length() - 1));
}
return false;
}
else {
return true;
}
}
What is the space complexity of this algorithm ?
I mean, when I call the method substring(), does it create a new string all the time ? What actually substring method do in Java ?

In older versions of Java (mainly in Java 6 and before)*, substring returned a new instance that shared the internal char array of the longer string (that is nicely illustrated here). Then substring had time and a space complexity of O(1).
Newer versions use a different representation of String, which does not rely on a shared array. Instead, substring allocates a new array of just the required size, and copies the contents from the longer string. Then substring has a time and a space complexity of O(n).
*Actually the change was introduced in update 6 of Java 7.

String concatenation complexity in C++ and Java [duplicate]

This question already has answers here:
C++ equivalent of StringBuffer/StringBuilder?
(10 answers)
Closed 9 years ago.
Consider this piece of code:
public String joinWords(String[] words) {
String sentence = "";
for(String w : words) {
sentence = sentence + w;
}
return sentence;
}
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2). Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
While in C++, std::string::append() has complexity of O(n), and I'm not clear about the complexity of stringstream.
In C++, are there methods like those in StringBuffer with the same complexity?

C++ strings are mutable, and pretty much as dynamically sizable as a StringBuffer. Unlike its equivalent in Java, this code wouldn't create a new string each time; it just appends to the current one.
std::string joinWords(std::vector<std::string> const &words) {
std::string result;
for (auto &word : words) {
result += word;
}
return result;
}
This runs in linear time if you reserve the size you'll need beforehand. The question is whether looping over the vector to get sizes would be slower than letting the string auto-resize. That, i couldn't tell you. Time it. :)
If you don't want to use std::string itself for some reason (and you should consider it; it's a perfectly respectable class), C++ also has string streams.
#include <sstream>
...
std::string joinWords(std::vector<std::string> const &words) {
std::ostringstream oss;
for (auto &word : words) {
oss << word;
}
return oss.str();
}
It's probably not any more efficient than using std::string, but it's a bit more flexible in other cases -- you can stringify just about any primitive type with it, as well as any type that has specified an operator <<(ostream&, its_type&) override.

This is somewhat tangential to your Question, but relevant nonetheless. (And too big for a comment!!)
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2).
In Java, the complexity of s1.concat(s2) or s1 + s2 is O(M1 + M2) where M1 and M2 are the respective String lengths. Turning that into the complexity of a sequence of concatenations is difficult in general. However, if you assume N concatenations of Strings of length M, then the complexity is indeed O(M * N * N) which matches what you said in the Question.
Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
In the StringBuilder case, the amortized complexity of N calls to sb.append(s) for strings of size M is O(M*N). The key word here is amortized. When you append characters to a StringBuilder, the implementation may need to expand its internal array. But the expansion strategy is to double the array's size. And if you do the math, you will see that each character in the buffer is going to be copied on average one extra time during the entire sequence of append calls. So the complexity of the entire sequence of appends still works out as O(M*N) ... and, as it happens M*N is the final string length.
So your end result is correct, but your statement about the complexity of a single call to append is not correct. (I understand what you mean, but the way you say it is facially incorrect.)
Finally, I'd note that in Java you should use StringBuilder rather than StringBuffer unless you need the buffer to be thread-safe.

As an example of a really simple structure that has O(n) complexity in C++11:
template<typename TChar>
struct StringAppender {
std::vector<std::basic_string<TChar>> buff;
StringAppender& operator+=( std::basic_string<TChar> v ) {
buff.push_back(std::move(v));
return *this;
}
explicit operator std::basic_string<TChar>() {
std::basic_string<TChar> retval;
std::size_t total = 0;
for( auto&& s:buff )
total+=s.size();
retval.reserve(total+1);
for( auto&& s:buff )
retval += std::move(s);
return retval;
}
};
use:
StringAppender<char> append;
append += s1;
append += s2;
std::string s3 = append;
This takes O(n), where n is the number of characters.
Finally, if you know how long all of the strings are, just doing a reserve with enough room makes append or += take a total of O(n) time. But I agree that is awkward.
Use of std::move with the above StringAppender (ie, sa += std::move(s1)) will significantly increase performance for non-short strings (or using it with xvalues etc)
I do not know the complexity of std::ostringstream, but ostringstream is for pretty printing formatted output, or cases where high performance is not important. I mean, they aren't bad, and they may even out perform scripted/interpreted/bytecode languages, but if you are in a rush, you need something else.
As usual, you need to profile, because constant factors are important.
A rvalue-reference-to-this operator+ might also be a good one, but few compilers implement rvalue references to this.

Why does the equals method in String not use hash?

The code of the equals method in class String is
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}
I have a question - why does this method not use hashCode() ?
As far as I know, hashCode() can compare two strings rapidly.
UPDATE: I know, that two unequal strings, can have same hashes. But two equal strings have equal hashes. So, by using hashCode(), we can immediately see that two strings are unequal.
I'm simply thinking that using hashCode() can be a good filter in equals.
UPDATE 2: Here some code, about we are talking here.
It is an example how String method equals can look like
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
if (hashCode() == anotherString.hashCode()){
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}else{
return false;
}
}
return false;
}

Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.
String hashcodes are lazily calculated, although they do use a "guard" value. If you're comparing strings with long lifetimes (ie, they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or ignoring the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be ignoring the check more often than you'll be using it.
In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.

This question has actually been considered by the developers of the JDK. I could not find in the various messages why it has not been included. The enhancement is also listed in the bug database.
Namely, one of the proposed change is:
public boolean equals(Object anObject) {
if (this == anObject) // 1st check identitiy
return true;
if (anObject instanceof String) { // 2nd check type
String anotherString = (String)anObject;
int n = count;
if (n == anotherString.count) { // 3rd check lengths
if (n != 0) { // 4th avoid loading registers from members if length == 0
int h1 = hash, h2 = anotherString.hash;
if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
return false;
There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).

1) Calculating hashCode may not be faster than comparing the Strings directly.
2) if the hashCode is equal, the Strings may still not be equal

This can be a good idea for many use cases.
However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.
I'm gonna guess that the majority of String.equals() are invoked in a Hashmap, after the hash codes has been known to be equal, so testing hash codes again is pointless.
If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and char-by-char comparison fails on 1st char. So it'll be a waste to check hashes.

AFAIK, The following check could be added to String. This check that if the hash codes are set and they are different, then the Strings cannot be equal.
if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
return false;

The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.
While String appears as immutable for the usual programmer, it does have the private field to store its hashcode once it is computed. However this field is only computed when hashcode is first required. As you can see from the String source code here:
private int hash;
public int hashCode() {
int h = hash;
if (h == 0) {
...
hash = h;
}
return h;
}
Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe same instances of really long strings are compared with each other a really many of times), it still may be: profile.

As I think, hashCode() can make comparison of two strings quicker.
Arguments?
Arguments against this proposal:
More Operations
hashcode() from String has to access every character in the String and has to do 2 calculations for every character.
So we need for a string with n characters 5*n operations (load, multiplication, lookup/load, multiplication, store). Two times, because we compare two Strings. (Ok, one store and one load does not really count in a reasonable implementation.)
For the best case, this makes a total of 10*x operations for two strings with length m and n and x=min(m,n). Worst case is 10*x with x=m=n. Average somewhere between, perhaps (m*n)/2.
The current equals implementation needs in the best case 3 operations. 2 loads, 1 compare. Worst is 3*x operations for two strings with length m and n and x=m=n. Average is somewhere between, perhaps 3*(m*n)/2.
Even if we cache the hash, it is not clear that we save something
We have to analyze usage patterns. It could be that most of the time, we will only ask one time for equals, not multiple times. Even if we ask multiple times, it could not be enough to have time savings from the caching.
Not direct against the speed, but still good counterarguments:
Counter intuitive
We do not expect a hashcode in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowledge about hashing) will wonder what is happening there.
Leads to bad examples/unexpected behavior
I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equal() against 'b'?" :)

If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.
String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.
Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.