Understanding logic in CaseInsensitiveComparator

Can anyone explain the following code from String.java, specifically why there are three if statements (which I've marked //1, //2 and //3)?
private static class CaseInsensitiveComparator
implements Comparator<String>, java.io.Serializable {
// use serialVersionUID from JDK 1.2.2 for interoperability
private static final long serialVersionUID = 8575799808933029326L;
public int compare(String s1, String s2) {
int n1=s1.length(), n2=s2.length();
for (int i1=0, i2=0; i1<n1 && i2<n2; i1++, i2++) {
char c1 = s1.charAt(i1);
char c2 = s2.charAt(i2);
if (c1 != c2) {/////////////////////////1
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {/////////////////////////2
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {/////////////////////////3
return c1 - c2;
}
}
}
}
return n1 - n2;
}
}

From Unicode Technical Standard:
In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase
So it is not enough to compare only the uppercase forms of two characters: they may have different uppercase forms but the same lowercase form.
A simple brute-force check turns up several such pairs. Take, for example, code points 73 and 304:
char ch1 = (char) 73; //LATIN CAPITAL LETTER I
char ch2 = (char) 304; //LATIN CAPITAL LETTER I WITH DOT ABOVE
System.out.println(ch1==ch2);
System.out.println(Character.toUpperCase(ch1)==Character.toUpperCase(ch2));
System.out.println(Character.toLowerCase(ch1)==Character.toLowerCase(ch2));
Output:
false
false
true
So "İ" and "I" are not equal to each other. Both characters are uppercase. But they share the same lower case letter: "i" and that gives a reason to treat them as same values in case insensitive comparison.

Normally, we would expect to convert the case once, compare, and be done with it. However, the code converts the case twice, and the reason is stated in a comment in a different method, public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):
Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.
Appendix
The code of regionMatches differs in a few details from the code in CaseInsensitiveComparator, but it essentially does the same thing. The full code of the method is quoted below for cross-checking:
public boolean regionMatches(boolean ignoreCase, int toffset,
String other, int ooffset, int len) {
char ta[] = value;
int to = offset + toffset;
char pa[] = other.value;
int po = other.offset + ooffset;
// Note: toffset, ooffset, or len might be near -1>>>1.
if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
(ooffset > (long)other.count - len)) {
return false;
}
while (len-- > 0) {
char c1 = ta[to++];
char c2 = pa[po++];
if (c1 == c2) {
continue;
}
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
return false;
}
return true;
}

In another answer, default locale already gave an example for why comparing only uppercase is not enough, namely the ASCII letter "I" and the capital I with dot "İ".
Now you might wonder why they do not simply compare lowercase only, instead of both upper and lower case, if lowercase catches more cases than uppercase. The answer is that it does not catch more cases; it merely catches different ones.
Take the letter "ı" ((char)305, small dotless i) and the ASCII "i". They are different, their lowercase forms are different, but they share the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercase forms ("İ" vs. "I") nor their lowercase forms ("i" vs. "ı") match, but the lowercase of their uppercase forms is the same ("i"). I found another instance of this phenomenon in the Greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case-insensitive comparison cannot just check the uppercase and lowercase forms of the original characters; it must check the lowercase forms of the uppercase forms.
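As a rough illustration of the checks discussed above, the following sketch reports which stage first makes two characters compare equal. The class and method names (CaseStages, matchStage) are made up for this example and are not part of the JDK; the stages mirror the comparator quoted in the question.

```java
public class CaseStages {
    // Hypothetical helper: returns 0 if the chars are identical, 1 if they
    // match after toUpperCase, 2 if they match only after lowercasing the
    // uppercased values, and -1 if they never match.
    static int matchStage(char a, char b) {
        if (a == b) {
            return 0;
        }
        char u1 = Character.toUpperCase(a);
        char u2 = Character.toUpperCase(b);
        if (u1 == u2) {
            return 1;
        }
        if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
            return 2;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(matchStage('f', 'F'));           // ordinary case difference
        System.out.println(matchStage('i', '\u0131'));      // i vs. dotless i
        System.out.println(matchStage('\u0130', '\u0131')); // dotted capital I vs. dotless i
        System.out.println(matchStage('\u03F4', '\u03D1')); // the two Greek theta symbols
        System.out.println(matchStage('c', 'G'));           // unrelated letters
    }
}
```

Pairs that only match at stage 2 are exactly the ones that a single uppercase or a single lowercase comparison would miss.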

Consider the characters f and F. The first if statement finds that they differ (c1 != c2 is true). However, if you capitalise both characters, you get F and F, which match. The same would not be true of, say, c and G.
The code is efficient. There is no need to capitalise both characters if they already match (hence the first if statement). However, if they don't match, we need to check if they differ only in case (hence the second if statement).
The final if statement is used for certain alphabets (such as Georgian) where capitalisation is a complicated affair. To be honest, I don't know a great deal about how that works (just trust that Oracle does!).

For the case-insensitive comparison above, assume s1="Apple" and s2="apple".
Here 'A' != 'a' because the two characters have different numeric values, so the code converts both to uppercase and compares again. They now match, the loop continues through the remaining characters, and the method finally returns n1 - n2 = 0, meaning the strings compare as equal. If two characters still differ after both conversions, the innermost check
if (c1 != c2) {
return c1 - c2;
}
returns the difference between the numeric values of the two converted characters.
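The behaviour described above can be observed through the public API. This is a small sketch, not the JDK source; CompareDemo and its sign helper are names invented here, and only the sign of the result is inspected, since that is what the comparator contract guarantees.

```java
public class CompareDemo {
    // Hypothetical helper: reduces the comparison result to -1, 0 or 1.
    static int sign(String a, String b) {
        return Integer.signum(a.compareToIgnoreCase(b));
    }

    public static void main(String[] args) {
        // Same letters in different case: every position matches after conversion.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("Apple", "apple"));
        // The first real difference is 'p' vs. 'r', so the result is negative.
        System.out.println(sign("apple", "APRICOT"));
        // One string is a prefix of the other: the length difference decides.
        System.out.println(sign("Apple", "app"));
    }
}
```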

Related

How to compare char with number in Java

I got a problem and I think it is in comparing a char with a number.
String FindCountry = "BB";
Map<String, String> Cont = new HashMap <> ();
Cont.put("BA-BE", "Angola");
Cont.put("9X-92", "Trinidad & Tobago");
for ( String key : Cont.keySet()) {
if (key.charAt(0) == FindCountry.charAt(0) && FindCountry.charAt(1) >= key.charAt(1) && FindCountry.charAt(1) <= key.charAt(4)) {
System.out.println("Country: "+ Cont.get(key));
}
}
In this case the code prints "Angola", but if
String FindCountry = "9Z"
it doesn't print anything. I think the problem is that '2' does not compare as greater than 'Z'. In this example I have only two Cont.put() calls, but my file has many more, and a lot of the keys are not letters only. Those are the ones that cause problems.
What is the smartest way to compare a char with a digit? If I could set a rule such as "1" is greater than "Z", that would be fine, because I need this ordering: A-Z-9-0.
Thanks!

You can use a lookup "table", I used a String:
private static final String LOOKUP = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
And then compare the chars with indexOf(). It seems messy and could probably be done more simply, but I can't come up with anything easier at the moment:
String FindCountry = "9Z";
Map<String, String> Cont = new HashMap<>();
Cont.put("BA-BE", "Angola");
Cont.put("9X-92", "Trinidad & Tobago");
for (String key : Cont.keySet()) {
if (LOOKUP.indexOf(key.charAt(0)) == LOOKUP.indexOf(FindCountry.charAt(0)) &&
LOOKUP.indexOf(FindCountry.charAt(1)) >= LOOKUP.indexOf(key.charAt(1)) &&
LOOKUP.indexOf(FindCountry.charAt(1)) <= LOOKUP.indexOf(key.charAt(4))) {
System.out.println("Country: " + Cont.get(key));
}
}

If you only use the characters A-Z and 0-9, you could add a conversion method in between which will increase the values of the 0-9 characters so they'll be after A-Z:
int applyCharOrder(char c){
// If the character is a digit:
if(c < 58){
// Add 43 to put it after the 'Z' in terms of decimal unicode value:
return c + 43;
}
// If it's an uppercase letter instead: simply return it as is
return c;
}
Which can be used like this:
if(applyCharOrder(key.charAt(0)) == applyCharOrder(findCountry.charAt(0))
&& applyCharOrder(findCountry.charAt(1)) >= applyCharOrder(key.charAt(1))
&& applyCharOrder(findCountry.charAt(1)) <= applyCharOrder(key.charAt(4))){
System.out.println("Country: "+ cont.get(key));
}
Note: Here is a table with the decimal unicode values. Characters '0'-'9' will have the values 48-57 and 'A'-'Z' will have the values 65-90. So the < 58 is used to check if it's a digit-character, and the + 43 will increase the 48-57 to 91-100, putting their values above the 'A'-'Z' so your <= and >= checks will work as you'd want them to.
Alternatively, you could create a look-up String and use its index for the order:
int applyCharOrder(char c){
return "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".indexOf(c);
}
PS: As mentioned in the first comment by @Stultuske, variable names are usually written in camelCase, so they don't start with an uppercase letter.
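Putting the pieces together, a self-contained version might look like the sketch below. It assumes the ordering used in the answers above (letters first, then digits) and the two sample ranges from the question; CountryLookup, rank and findCountry are illustrative names, not an existing API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CountryLookup {
    // Assumed custom ordering: 'A'..'Z' rank below '0'..'9'.
    private static final String ORDER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    private static final Map<String, String> RANGES = new LinkedHashMap<>();
    static {
        RANGES.put("BA-BE", "Angola");
        RANGES.put("9X-92", "Trinidad & Tobago");
    }

    // Position of the character in the custom ordering, or -1 if unknown.
    static int rank(char c) {
        return ORDER.indexOf(c);
    }

    // Returns the country whose range contains the two-character code, or null.
    static String findCountry(String code) {
        int second = rank(code.charAt(1));
        for (Map.Entry<String, String> entry : RANGES.entrySet()) {
            String key = entry.getKey();
            if (key.charAt(0) == code.charAt(0)
                    && second >= rank(key.charAt(1))
                    && second <= rank(key.charAt(4))) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findCountry("BB"));
        System.out.println(findCountry("9Z"));
    }
}
```

Computing the rank once per lookup keeps the range test readable; a character outside the table gets rank -1 and therefore never falls inside a range.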

As the others stated in the comments, such mathematical comparison operations on characters are based on the actual ASCII values of each char. So I'd suggest you refactor your logic using the ASCII table as reference.

Java: Why does String.compareToIgnoreCase() use both Character.toUpperCase() and Character.toLowerCase()?

The compareToIgnoreCase method of the String class is implemented using the method in the snippet below (jdk1.8.0_45).
i. Why are both Character.toUpperCase(char) and Character.toLowerCase(char) used for the comparison? Wouldn't either of them suffice?
ii. Why was s1.toLowerCase().compareTo(s2.toLowerCase()) not used to implement compareToIgnoreCase? I understand the same logic can be implemented in different ways, but I would still like to know whether there are specific reasons to choose one over the other.
public int compare(String s1, String s2) {
int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
char c1 = s1.charAt(i);
char c2 = s2.charAt(i);
if (c1 != c2) {
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {
// No overflow because of numeric promotion
return c1 - c2;
}
}
}
}
return n1 - n2;
}

Here's an example using Turkish i's:
System.out.println(Character.toUpperCase('i') == Character.toUpperCase('İ'));
System.out.println(Character.toLowerCase('i') == Character.toLowerCase('İ'));
The first line prints false; the second prints true.

Some languages have special characters that convert to a different upper- or lowercase character, or even to a sequence of characters.
So relying on a single case conversion can cause problems for such characters.
For example, the German eszett ß is converted to SS in upper case. From Wikipedia:
The name eszett comes from the two letters S and Z as they are pronounced in German. Its Unicode encoding is U+00DF.
So a word like groß compared with gross will fail to match if only a lowercase comparison is used.
@chrylis, here is a working example:
System.out.println("ß".toUpperCase().equals("SS")); // true
System.out.println("ß".toLowerCase().equals("ss")); // false
Thanks to @chrylis's comment I ran some additional tests and found a possible error in the String class. The two lines above show that "ß" uppercases to "SS", but
System.out.println("ß".equalsIgnoreCase("SS")); // false
System.out.println("ß".equalsIgnoreCase("ss")); // false
So there is at least one case where two strings are equal if both are manually converted to uppercase, but are not equal when compared ignoring case.

From the Character class documentation, in the toUpperCase and the toLowerCase methods it states:
Note that Character.isUpperCase(Character.toUpperCase(ch)) does not
always return true for some ranges of characters, particularly those
that are symbols or ideographs. (Similar for toLowerCase)
Since there is the potential for anomalies in the comparison, they check for both cases to give the most accurate response possible.

equalsIgnoreCase doesn't conform to javadoc?

The javadoc for String.equalsIgnoreCase says:
Two strings are considered equal ignoring case if they are of the same length and corresponding characters in the two strings are equal ignoring case.
Two characters c1 and c2 are considered the same ignoring case if at least one of the following is true:
The two characters are the same (as compared by the == operator)
Applying the method Character.toUpperCase(char) to each character produces the same result
Applying the method Character.toLowerCase(char) to each character produces the same result
So can anyone explain this?
public class Test
{
private static void testChars(char ch1, char ch2) {
boolean b1 = (ch1 == ch2 ||
Character.toLowerCase(ch1) == Character.toLowerCase(ch2) ||
Character.toUpperCase(ch1) == Character.toUpperCase(ch2));
System.out.println("Characters match: " + b1);
String s1 = Character.toString(ch1);
String s2 = Character.toString(ch2);
boolean b2 = s1.equalsIgnoreCase(s2);
System.out.println("equalsIgnoreCase returns: " + b2);
}
public static void main(String args[]) {
testChars((char)0x0130, (char)0x0131);
testChars((char)0x03d1, (char)0x03f4);
}
}
Output:
Characters match: false
equalsIgnoreCase returns: true
Characters match: false
equalsIgnoreCase returns: true

Those characters' definition of upper and lower case is probably locale-specific. From the JavaDoc for Character.toLowerCase():
In general, String.toLowerCase() should be used to map characters to
lowercase. String case mapping methods have several benefits over
Character case mapping methods. String case mapping methods can
perform locale-sensitive mappings, context-sensitive mappings, and 1:M
character mappings, whereas the Character case mapping methods cannot.
If you look at the String.toLowerCase() method you will find it is overloaded to accept a Locale object as well. That overload performs a locale-specific case conversion.
Edit: I would like to be clear that yes, the JavaDoc for String.equalsIgnoreCase() says what it says, but it is wrong. It cannot be correct in all cases, certainly not for characters that have surrogates, for example, but also for characters where the locale defines upper/lower case.
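To make the locale point concrete, here is a minimal sketch using the Locale overloads of String.toLowerCase and String.toUpperCase. The class and helper names are invented for the example; the Turkish language tag "tr" is used because its dotted and dotless i are the characters discussed in this question.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Hypothetical helpers that apply the case mapping of a given language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // In Turkish, capital I lowercases to dotless i (U+0131) ...
        System.out.println(lower("I", "tr").equals("\u0131"));
        // ... and small i uppercases to dotted capital I (U+0130).
        System.out.println(upper("i", "tr").equals("\u0130"));
        // In English the familiar mapping applies.
        System.out.println(lower("I", "en").equals("i"));
    }
}
```

Character.toLowerCase(char) has no locale parameter, which is why a char-by-char comparison cannot follow these language-specific rules.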

I found this in String.java (this snippet is also in a document that peter.petrov linked to):
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
It's used by equalsIgnoreCase. What's interesting is that if it followed what the javadoc said, the line toward the bottom should be
if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
using c1 and c2 instead of u1 and u2. This affects the result for these two cases. We can all agree that the javadoc is "wrong" in that it doesn't really reflect how case-folding should work; but the above logic doesn't deal with correct case-folding any better, and it doesn't conform to the documentation besides.

How do I check if a char is a vowel?

This Java code is giving me trouble:
String word = <Uses an input>
int y = 3;
char z;
do {
z = word.charAt(y);
if (z!='a' || z!='e' || z!='i' || z!='o' || z!='u')) {
for (int i = 0; i==y; i++) {
wordT = wordT + word.charAt(i);
} break;
}
} while(true);
I want to check if the third letter of word is a non-vowel, and if it is I want it to return the non-vowel and any characters preceding it. If it is a vowel, it checks the next letter in the string, if it's also a vowel then it checks the next one until it finds a non-vowel.
Example:
word = Jaemeas then wordT must = Jaem
Example 2:
word=Jaeoimus then wordT must =Jaeoim
The problem is with my if statement, I can't figure out how to make it check all the vowels in that one line.

Clean method to check for vowels:
public static boolean isVowel(char c) {
return "AEIOUaeiou".indexOf(c) != -1;
}

Your condition is flawed. Think about the simpler version
z != 'a' || z != 'e'
If z is 'a' then the second half will be true since z is not 'e' (i.e. the whole condition is true), and if z is 'e' then the first half will be true since z is not 'a' (again, whole condition true). Of course, if z is neither 'a' nor 'e' then both parts will be true. In other words, your condition will never be false!
You likely want &&s there instead:
z != 'a' && z != 'e' && ...
Or perhaps:
"aeiou".indexOf(z) < 0

How about an approach using regular expressions? If you use the proper pattern you can get the results from the Matcher object using groups. In the code sample below the call to m.group(1) should return you the string you're looking for as long as there's a pattern match.
String wordT = null;
Pattern patternOne = Pattern.compile("^([\\w]{2}[AEIOUaeiou]*[^AEIOUaeiou]{1}).*");
Matcher m = patternOne.matcher("Jaemeas");
if (m.matches()) {
wordT = m.group(1);
}
Just a little different approach that accomplishes the same goal.

There are more efficient ways to check this, but since you asked what is wrong with your code: you have to replace those OR operators with AND operators. As written, your if condition is always true.
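The difference between the two operators can be checked directly. The sketch below is only an illustration with made-up names (VowelCondition, brokenCheck, isNotVowel); it puts the original OR condition next to the AND version and the indexOf variant suggested in another answer.

```java
public class VowelCondition {
    // The original shape: true for every char, because no char can be
    // equal to 'a' and 'e' at the same time.
    static boolean brokenCheck(char z) {
        return z != 'a' || z != 'e' || z != 'i' || z != 'o' || z != 'u';
    }

    // With && the test is true only when z differs from all five vowels.
    static boolean isNotVowel(char z) {
        return z != 'a' && z != 'e' && z != 'i' && z != 'o' && z != 'u';
    }

    // Equivalent lookup-based form.
    static boolean isNotVowelLookup(char z) {
        return "aeiou".indexOf(z) < 0;
    }

    public static void main(String[] args) {
        for (char z : new char[] {'a', 'm'}) {
            System.out.println(z + ": broken=" + brokenCheck(z)
                    + ", fixed=" + isNotVowel(z)
                    + ", lookup=" + isNotVowelLookup(z));
        }
    }
}
```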

In case anyone comes across this and wants an easy comparison that can be used in many scenarios:
It doesn't matter whether the letter is UPPERCASE or lowercase; it works for A-Z and a-z.
bool vowel = ((1 << letter) & 2130466) != 0;
This is the simplest way I could think of. I tested it in C++ on a 64-bit PC, so results may differ elsewhere: strictly speaking, C++ leaves a shift of a 32-bit integer by 32 or more bits undefined. On my machine only the low five bits of the shift count are used, so the 64 and 32 bits of the letter's code are dropped and "<< letter" effectively shifts by a value from 1 to 26.
If you don't know how bits work, I'm not going to go into depth here, but 1 << N is the same as 2^N, that is, it creates a power of two.
So (1 << N) & X checks whether X contains the power of two that corresponds to our letter, and 2130466 is the value with exactly the vowel bits set. If the result is not 0, the letter is a vowel.
This technique applies to anything you use bits for. Index values larger than 31 also work here as long as they reduce to the range 0 to 31: the letters are 65-90 or 97-122, but dropping 64 or 96 leaves a remainder from 1 to 26. That isn't literally how it works, but it gives you an idea of the process.
Keep in mind that if you have no guarantee about the incoming characters, you should first check whether the character is below 'A' or above 'u', since the result should always be false for those.
For example, the following returns a false positive: the exclamation point "!" has the value 33 and yields the same bit as 'A' or 'a' would.
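For readers working in Java, the same idea might be sketched as follows. This is a translation under assumptions, not the original C++: the names VowelMask, maskOnly and isVowel are invented, and the guard for ASCII letters is added here to avoid the false positive described above. In Java, a shift of an int uses only the low five bits of the shift count, so the wrap-around is defined behaviour.

```java
public class VowelMask {
    // Bits 1, 5, 9, 15 and 21 stand for a, e, i, o, u (a = 1 ... z = 26).
    static final int VOWEL_MASK =
            (1 << 1) | (1 << 5) | (1 << 9) | (1 << 15) | (1 << 21);

    // Unguarded form: relies on the shift count being reduced modulo 32.
    static boolean maskOnly(char c) {
        return ((1 << c) & VOWEL_MASK) != 0;
    }

    // Guarded form: only ASCII letters are ever reported as vowels.
    static boolean isVowel(char c) {
        boolean asciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        return asciiLetter && maskOnly(c);
    }

    public static void main(String[] args) {
        System.out.println(VOWEL_MASK);    // same constant as in the answer
        System.out.println(isVowel('E'));
        System.out.println(isVowel('x'));
        System.out.println(maskOnly('!')); // false positive without the guard
        System.out.println(isVowel('!'));
    }
}
```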

For starters, you are checking if the letter is "not a" OR "not e" OR "not i" etc.
Let's say that the letter is i. Then the letter is not a, so that comparison returns true, and the entire condition is true because i != a. I think what you want is to AND the comparisons together, not OR them.
Once you do this, you need to look at how to increment y and check again. If the first character you check is a vowel, you want to see whether the next character is a vowel too. As written, the code only checks the character at location y=3.

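String word = "Jaemeas";
String wordT = "";
// Start at index 3 and advance until the first non-vowel is found.
// The bounds check runs before charAt, so the loop cannot run past the end.
for (int y = 3; y < word.length(); y++) {
    char z = word.charAt(y);
    if (z != 'a' && z != 'e' && z != 'i' && z != 'o' && z != 'u') {
        // Keep everything up to and including the non-vowel.
        wordT = word.substring(0, y + 1);
        break;
    }
}
Here is my answer.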

I have declared a char[] constant for the VOWELS, then implemented a method that checks whether a char is a vowel or not (returning a boolean value). In my main method, I am declaring a string and converting it to an array of chars, so that I can pass the index of the char array as the parameter of my isVowel method:
public class FindVowelsInString {
static final char[] VOWELS = {'a', 'e', 'i', 'o', 'u'};
public static void main(String[] args) {
String str = "hello";
char[] array = str.toCharArray();
//Check with a consonant
boolean vowelChecker = FindVowelsInString.isVowel(array[0]);
System.out.println("Is this character a vowel? " + vowelChecker);
//Check with a vowel
boolean vowelChecker2 = FindVowelsInString.isVowel(array[1]);
System.out.println("Is this character a vowel? " + vowelChecker2);
}
private static boolean isVowel(char vowel) {
boolean isVowel = false;
for (int i = 0; i < FindVowelsInString.getVowel().length; i++) {
if (FindVowelsInString.getVowel()[i] == vowel) {
isVowel = true;
}
}
return isVowel;
}
public static char[] getVowel() {
return FindVowelsInString.VOWELS;
}
}

Comparing String Integers Issue

I have a scanner that reads a 7-character alphanumeric code entered by the user. The String variable is called "code".
The last character of the code (7th character, index 6) MUST BE NUMERIC, while the rest may be either numeric or alphabetical.
So I set out to write a check that stops the rest of the method from executing if the last character of the code is anything but a digit (0-9).
However, my code does not work as expected: even if the code ends in a digit between 0 and 9, the if condition is met and the program prints "last character in code is non-numerical".
Example code: 45m4av7
characterAtEnd prints as the string "7", as it should.
However, my program still tells me my code ends non-numerically.
I'm aware that my number values are strings, but that shouldn't matter, should it?
Also, I apparently cannot compare actual integer values with "|", which is mainly why I'm using String.valueOf and taking the string forms of 0-9.
String characterAtEnd = String.valueOf(code.charAt(code.length()-1));
System.out.println(characterAtEnd);
if(!characterAtEnd.equals(String.valueOf(0|1|2|3|4|5|6|7|8|9))){
System.out.println("INVALID CRC CODE: last character in code in non-numerical.");
System.exit(0);
I cannot for the life of me figure out why my program is telling me that my code (which has a 7 at the end) ends non-numerically. It should skip the if statement and continue on, right?

The String contains method will work here:
String digits = "0123456789";
digits.contains(characterAtEnd); // true if ends with digit, false otherwise
String.valueOf(0|1|2|3|4|5|6|7|8|9) is actually "15", which of course can never be equal to the last character. This should make sense, because 0|1|2|3|4|5|6|7|8|9 evaluates to 15 using integer math, which then gets converted to a String.
Alternatively, try this:
String code = "45m4av7";
char characterAtEnd = code.charAt(code.length() - 1);
System.out.println(characterAtEnd);
if(characterAtEnd < '0' || characterAtEnd > '9'){
System.out.println("INVALID CRC CODE: last character in code in non-numerical.");
System.exit(0);
}
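Both points from this answer can be verified with a short sketch. LastCharCheck and endsWithDigit are hypothetical names used only here; the range check is the same one shown above, wrapped in a method with an extra guard for the empty string.

```java
public class LastCharCheck {
    // Returns true if the last character of the code is one of '0'..'9'.
    static boolean endsWithDigit(String code) {
        if (code.isEmpty()) {
            return false;
        }
        char last = code.charAt(code.length() - 1);
        return last >= '0' && last <= '9';
    }

    public static void main(String[] args) {
        // The bitwise OR of 0 through 9 is a single int, not a set of alternatives.
        System.out.println(String.valueOf(0|1|2|3|4|5|6|7|8|9));
        System.out.println(endsWithDigit("45m4av7"));
        System.out.println(endsWithDigit("45m4avx"));
    }
}
```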

You are doing bitwise operations here: if(!characterAtEnd.equals(String.valueOf(0|1|2|3|4|5|6|7|8|9)))
Check out the difference between | and ||
This bit of code should accomplish your task using regular expressions:
String code = "45m4av7";
if (!code.matches("^.+?\\d$")){
System.out.println("INVALID CRC CODE");
}
Also, for reference, this method sometimes comes in handy in similar situations:
/* returns true if someString actually ends with the specified suffix */
someString.endsWith(suffix);
As endsWith(suffix) does not take regular expressions, if you wanted to go through all possible lower-case alphabet values, you'd need to do something like this:
/* ASCII approach */
String s = "hello";
boolean endsInLetter = false;
for (int i = 97; i <= 122; i++) {
if (s.endsWith(String.valueOf(Character.toChars(i)))) {
endsInLetter = true;
}
}
System.out.println(endsInLetter);
/* String approach */
String alphabet = "abcdefghijklmnopqrstuvwxyz";
boolean endsInLetter2 = false;
for (int i = 0; i < alphabet.length(); i++) {
if (s.endsWith(String.valueOf(alphabet.charAt(i)))) {
endsInLetter2 = true;
}
}
System.out.println(endsInLetter2);
Note that neither of the aforementioned approaches is a good idea: they are clunky and rather inefficient.
Going off of the ASCII approach, you could even do something like this:
ASCII reference : http://www.asciitable.com/
int i = (int)code.charAt(code.length() - 1);
/* Corresponding ASCII values to digits */
if(i <= 57 && i >= 48){
System.out.println("Last char is a digit!");
}
If you want a one-liner, stick to regular expressions, for example:
System.out.println((!code.matches("^.+?\\d$")? "Invalid CRC Code" : "Valid CRC Code"));
I hope this helps!
