equalsIgnoreCase doesn't conform to javadoc? - java

The javadoc for String.equalsIgnoreCase says:
Two strings are considered equal ignoring case if they are of the same length and corresponding characters in the two strings are equal ignoring case.
Two characters c1 and c2 are considered the same ignoring case if at least one of the following is true:
The two characters are the same (as compared by the == operator)
Applying the method Character.toUpperCase(char) to each character produces the same result
Applying the method Character.toLowerCase(char) to each character produces the same result
So can anyone explain this?
public class Test
{
private static void testChars(char ch1, char ch2) {
boolean b1 = (ch1 == ch2 ||
Character.toLowerCase(ch1) == Character.toLowerCase(ch2) ||
Character.toUpperCase(ch1) == Character.toUpperCase(ch2));
System.out.println("Characters match: " + b1);
String s1 = Character.toString(ch1);
String s2 = Character.toString(ch2);
boolean b2 = s1.equalsIgnoreCase(s2);
System.out.println("equalsIgnoreCase returns: " + b2);
}
public static void main(String args[]) {
testChars((char)0x0130, (char)0x0131);
testChars((char)0x03d1, (char)0x03f4);
}
}
Output:
Characters match: false
equalsIgnoreCase returns: true
Characters match: false
equalsIgnoreCase returns: true

Those characters' definition of upper and lower case is probably locale-specific. From the JavaDoc for Character.toLowerCase():
In general, String.toLowerCase() should be used to map characters to
lowercase. String case mapping methods have several benefits over
Character case mapping methods. String case mapping methods can
perform locale-sensitive mappings, context-sensitive mappings, and 1:M
character mappings, whereas the Character case mapping methods cannot.
If you look at the String.toLowerCase() method you will find it is overloaded to accept a Locale object as well. That overload performs a locale-specific case conversion.
Edit: I would like to be clear that yes, the JavaDoc for String.equalsIgnoreCase() says what it says, but it is wrong. It cannot be correct in all cases, certainly not for characters that have surrogates, for example, but also for characters where the locale defines upper/lower case.
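To illustrate the locale-sensitive mappings mentioned in the quoted JavaDoc, here is a minimal sketch. The class and method names (LocaleCaseDemo, lowerI) are made up for this example; only String.toLowerCase(Locale), Locale.forLanguageTag and Character.toLowerCase are real API.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Hypothetical helper: lowercases "I" using the rules of the given locale.
    static String lowerI(Locale locale) {
        return "I".toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-neutral mapping: plain "i"
        System.out.println(lowerI(Locale.ROOT));
        // Turkish mapping: dotless i (U+0131)
        System.out.println(lowerI(turkish));
        // Character.toLowerCase takes no Locale, so it cannot do the Turkish mapping
        System.out.println(Character.toLowerCase('I'));
    }
}
```

The char-based methods that equalsIgnoreCase relies on never see a Locale, which is why the two approaches can disagree.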

I found this in String.java (this snippet is also in a document that peter.petrov linked to):
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
It's used by equalsIgnoreCase. What's interesting is that if it followed what the javadoc said, the line toward the bottom should be
if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
using c1 and c2 instead of u1 and u2. This affects the result for these two cases. We can all agree that the javadoc is "wrong" in that it doesn't really reflect how case-folding should work; but the above logic doesn't deal with correct case-folding any better, and it doesn't conform to the documentation besides.
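The difference between the documented rule and the quoted snippet can be checked directly. This is only a sketch: javadocRule and snippetRule are hypothetical names, and snippetRule paraphrases the quoted logic rather than calling into String.java.

```java
public class CaseFoldCheck {
    // The rule as the javadoc states it: compare the original characters.
    static boolean javadocRule(char c1, char c2) {
        return c1 == c2
            || Character.toUpperCase(c1) == Character.toUpperCase(c2)
            || Character.toLowerCase(c1) == Character.toLowerCase(c2);
    }

    // The rule as the snippet implements it: lowercase the uppercased characters.
    static boolean snippetRule(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        return u1 == u2 || Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    public static void main(String[] args) {
        // The two pairs from the question
        char[][] pairs = { {'\u0130', '\u0131'}, {'\u03d1', '\u03f4'} };
        for (char[] p : pairs) {
            System.out.println("javadoc rule: " + javadocRule(p[0], p[1])
                + ", snippet rule: " + snippetRule(p[0], p[1]));
        }
    }
}
```

For both pairs the javadoc rule says the characters differ while the snippet rule says they match, which is consistent with the output shown in the question.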

Related

Java if statement contains letters

I was wondering how I should write this if statement in my assignment.
The question is:
Print option 2 if one of the strings begins with the letter w or has 5 characters.
I would use the .contains to find the "w".
if (two.contains("w")) {
System.out.println(two);
}
but the one with the characters I am unsure how to find the method.
If you have a List or Set of strings, you will need to loop over it and test each element. For the comparison itself, try this:
if (two.startsWith("w") || two.length() == 5) {
System.out.println(two);
}
The first condition checks whether the given String starts with "w"; the second takes the number of characters in the String and compares it with the desired length, which is 5.
For more of the useful methods a String object has, see the documentation: String (Java Platform SE 7)
String aString = ".....";
if (aString.startsWith("w") || aString.length() == 5) {
System.out.println("option 2");
}
The method .contains returns true if it finds the substring (in this case "w") anywhere in the string, not just at the start. You should use the .startsWith method instead.
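A small sketch of the difference between the two methods. The class name, the helper qualifies and the sample words are chosen only for illustration.

```java
public class StartsWithDemo {
    // Hypothetical helper: true if s begins with "w" or has exactly 5 characters.
    static boolean qualifies(String s) {
        return s.startsWith("w") || s.length() == 5;
    }

    public static void main(String[] args) {
        String[] samples = { "window", "apple", "shower" };
        for (String s : samples) {
            // "shower" contains a w but neither starts with one nor has 5 characters
            System.out.println(s + ": contains=" + s.contains("w")
                + ", qualifies=" + qualifies(s));
        }
    }
}
```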

Java - using ".contains" in the opposite manner

I want to be able to print a string that doesn't contain the words "Java", "Code" or "String", though I am unsure on how to achieve this as I thought this would be achieved by using '!' (NOT). However, this is not the case as the string is still printed despite the inclusion of the words I want to forbid.
Any advice on how to achieve this would be greatly appreciated, thanks in advance.
System.out.println("Type in an input, plez?");
String userInput6 = inputScanner.nextLine();
if (!userInput6.toLowerCase().contains("Java") || !userInput6.toLowerCase().contains("Code") || !userInput6.toLowerCase().contains("String")) {
System.out.println("I see your does not string contain 'Java', 'Code' or 'String, here is your string:- " + userInput6);
} else {
System.out.println("Your string contains 'Java, 'Code' or 'String'.");
}
I thought this would be achieved by using '!' (NOT)
It is. You just haven't applied it correctly to your situation:
You start with this statement:
userInput6.toLowerCase().contains("java") ||
userInput6.toLowerCase().contains("code") ||
userInput6.toLowerCase().contains("string")
which checks if the input contains any of these, and you wish to negate this statement.
You can either wrap the entire statement in parentheses (()) and negate that:
!(userInput6.toLowerCase().contains("java") ||
userInput6.toLowerCase().contains("code") ||
userInput6.toLowerCase().contains("string"))
or apply De Morgan's law for the negation of a disjunction, which states that the negation of a || b is !a && !b.
So, as Carcigenicate stated in the comments, you would need
!userInput6.toLowerCase().contains("java") &&
!userInput6.toLowerCase().contains("code") &&
!userInput6.toLowerCase().contains("string")
instead.
Your statement is simply checking if the string doesn't contain at least one of these substrings. This means the check would only fail if the string contained all of these strings. With ||, if any operand is true, the entire statement is true.
Additionally, mkobit makes the point that your strings you are checking for should be entirely lowercase. Otherwise, you are checking if a .toLowerCased string contains an uppercase character - which is always false.
An easier way to think of it may be to invert your if statement:
if (userInput6.toLowerCase().contains("Java") ||
userInput6.toLowerCase().contains("Code") ||
userInput6.toLowerCase().contains("String")) {
System.out.println("Your string contains 'Java, 'Code' or 'String'.");
} else {
System.out.println("I see your does not string contain 'Java', 'Code' or 'String, here is your string:- " + userInput6);
}
Since you're using logical OR, as soon as one of your negated contains checks is true, the entire condition is true. You want all of the negated checks to be true, so you need to use logical AND (&&) instead.
As @mk points out, you have another problem. Look at:
userInput6.toLowerCase().contains("Java")
You lower case the string, then check it against a string that contains an uppercase. You just removed all uppercase though, so that check will always fail.
Also, you can use regexp :)
boolean notContains(String in) {
return !Pattern.compile(".*((java)|(code)|(string)).*")
.matcher(in.toLowerCase())
.matches();
}
Or just inline it:
System.out.println("Type in an input, plez?");
String userInput6 = inputScanner.nextLine();
if (!Pattern.compile(".*((java)|(code)|(string)).*")
.matcher(userInput6.toLowerCase())
.matches()) {
System.out.println("I see your does not string contain 'Java', 'Code' or 'String, here is your string:- " + userInput6);
} else {
System.out.println("Your string contains 'Java, 'Code' or 'String'.");
}
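Another way to express "contains none of these words", sketched here as an alternative to the regexp. The class, field and method names (ForbiddenWords, FORBIDDEN, containsNone) are invented for the example, and Locale.ROOT is an assumption made to keep the lowercasing independent of the default locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ForbiddenWords {
    // The forbidden words, already in lowercase
    static final List<String> FORBIDDEN = Arrays.asList("java", "code", "string");

    // True only if the lowercased input contains none of the forbidden words.
    static boolean containsNone(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        return FORBIDDEN.stream().noneMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(containsNone("Hello world"));
        System.out.println(containsNone("I like JAVA"));
    }
}
```

noneMatch plays the role of !a && !b && !c here, so adding another forbidden word only means extending the list.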

Java: Why String.compareToIgnoreCase() uses both Character.toUpperCase() and Character.toLowerCase()? [duplicate]

The compareToIgnoreCase method of String Class is implemented using the method in the snippet below(jdk1.8.0_45).
i. Why are both Character.toUpperCase(char) and Character.toLowerCase(char) used for the comparison? Wouldn't either of them suffice?
ii. Why was s1.toLowerCase().compareTo(s2.toLowerCase()) not used to implement compareToIgnoreCase? I understand the same logic can be implemented in different ways, but I would still like to know if there are specific reasons to choose one over the other.
public int compare(String s1, String s2) {
int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
char c1 = s1.charAt(i);
char c2 = s2.charAt(i);
if (c1 != c2) {
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {
// No overflow because of numeric promotion
return c1 - c2;
}
}
}
}
return n1 - n2;
}
Here's an example using Turkish i's:
System.out.println(Character.toUpperCase('i') == Character.toUpperCase('İ'));
System.out.println(Character.toLowerCase('i') == Character.toLowerCase('İ'));
The first line prints false; the second true. Ideone demo.
Some languages have special characters whose upper- or lowercase form is a different character (or a sequence of characters).
So comparing in only one case can cause problems for these special characters.
As an example, the German character Eszett (ß) is converted to SS in upper case. From Wikipedia:
The name eszett comes from the two letters S and Z as they are pronounced in German. Its Unicode encoding is U+00DF.
So comparing a word like groß with gross will fail if only the lowercase forms are compared.
@chrylis, here is a working example:
System.out.println("ß".toUpperCase().equals("SS")); // True
System.out.println("ß".toLowerCase().equals("ss")); // false
Thanks to the comment from @chrylis, I ran some additional tests and found a possible error in the String class:
System.out.println("ß".toUpperCase().equals("SS")); // True
System.out.println("ß".toLowerCase().equals("ss")); // false
but
System.out.println("ß".equalsIgnoreCase("SS")); // False
System.out.println("ß".equalsIgnoreCase("ss")); // False
So there is at least one case where two strings are equal if both are manually converted to uppercase, but are not equal when compared ignoring case.
From the Character class documentation, in the toUpperCase and the toLowerCase methods it states:
Note that Character.isUpperCase(Character.toUpperCase(ch)) does not
always return true for some ranges of characters, particularly those
that are symbols or ideographs. (Similar for toLowerCase)
Since there is the potential for anomalies in the comparison, they check for both cases to give the most accurate response possible.
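The effect of checking both cases can be observed through compareToIgnoreCase itself. This is a sketch only: the class name and the helper sameIgnoringCase are made up, and the non-ASCII characters are written as Unicode escapes to avoid source-encoding problems.

```java
public class CompareIgnoreCaseDemo {
    // Hypothetical helper: true if compareToIgnoreCase treats the strings as equal.
    static boolean sameIgnoringCase(String a, String b) {
        return a.compareToIgnoreCase(b) == 0;
    }

    public static void main(String[] args) {
        // i and dotless i (U+0131) share the uppercase form I
        System.out.println(sameIgnoringCase("i", "\u0131"));
        // I and dotted capital I (U+0130) share the lowercase form i
        System.out.println(sameIgnoringCase("I", "\u0130"));
        // Eszett (U+00DF) is compared char by char, so it never matches "ss"
        System.out.println(sameIgnoringCase("gro\u00df", "gross"));
    }
}
```

The first pair is caught by the uppercase check and the second only by the lowercase check, so dropping either check would change the result for one of them.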

Address with zip code. Java

The constructor will throw an IllegalArgumentException with the message "Invalid Address Argument" if any parameter is null, or if the zip code has characters other than digits.
The method Character.isDigit can help during the implementation of this method. See the Java API (Character class) for additional information.
I've had the illegal argument exception down. But, not the zip code. Help?
Program.
if(street==null||city==null||state==null){
throw new IllegalArgumentException("Invalid Address Argument");
}
if(zip == Character.isDigit(ch)){
//To do???
}
Try StringUtils from Apache Commons Lang:
public static boolean isNumeric(CharSequence cs)
Checks if the CharSequence contains only Unicode digits. A decimal point is not a Unicode digit and returns false.
null will return false. An empty CharSequence (length()=0) will return false.
StringUtils.isNumeric(null) = false
StringUtils.isNumeric("") = false
StringUtils.isNumeric(" ") = false
StringUtils.isNumeric("123") = true
StringUtils.isNumeric("12 3") = false
StringUtils.isNumeric("ab2c") = false
StringUtils.isNumeric("12-3") = false
StringUtils.isNumeric("12.3") = false
Parameters:
cs - the CharSequence to check, may be null
Returns:
true if only contains digits, and is non-null
Since:
3.0 Changed signature from isNumeric(String) to isNumeric(CharSequence), 3.0 Changed "" to return false and not true
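int zipcode = 0;
try {
    // parse the zip String (assumed here to be the constructor's zip parameter)
    zipcode = Integer.parseInt(zip);
} catch (NumberFormatException e) {
    // not a number: zipcode stays 0 and is rejected below
}
if (zipcode <= 0)
{
    throw new Exception(..);
}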
And less than 1,000,000 if you want to be precise. You are using Char which makes no sense as you will have a String.
This sounds like homework to me, so I think the first thing you need to do here is learn how to read the documentation. Let's start by taking your instructor's hint, and looking up the documentation for Character.isDigit(char ch)
public static boolean isDigit(char ch)
Handwaving away some of the terms there, the critical things are that the method is static (which means we call it like Character.isDigit(myVariable)), that it returns a boolean (a true or false value), and that it accepts a parameter of type char.
So, to call this method, we need a char (single character). I'm assuming that your zip variable is a String. We know that a string is made up of multiple characters. So what we need is a way to get those characters, one at a time, from the String. You can find the documentation for the String class here.
There's a couple of ways to go about it. We could get the characters in an array using toCharArray(), or get a specific character out of the string using charAt(int index)
However you want to tackle it, you need to do this (in pseudocode)
for each char ch in zip
if ch is not a digit
throw new IllegalArgumentException("Invalid Address Argument")
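The pseudocode above could be turned into Java roughly as follows. This is a sketch rather than the assignment's required solution: the class and method names (ZipCheck, isAllDigits, requireDigits) are invented here, and zip is assumed to be a String.

```java
public class ZipCheck {
    // True if every character of zip is a digit (an empty string also passes).
    static boolean isAllDigits(String zip) {
        for (int i = 0; i < zip.length(); i++) {
            if (!Character.isDigit(zip.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Throws the exception from the assignment text if zip is null or has a non-digit.
    static void requireDigits(String zip) {
        if (zip == null || !isAllDigits(zip)) {
            throw new IllegalArgumentException("Invalid Address Argument");
        }
    }

    public static void main(String[] args) {
        requireDigits("20742"); // passes silently
        try {
            requireDigits("20a42");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether an empty zip code should be rejected is not stated in the assignment, so that case is left as it falls out of the loop.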

Understanding logic in CaseInsensitiveComparator

Can anyone explain the following code from String.java, specifically why there are three if statements (which I've marked //1, //2 and //3)?
private static class CaseInsensitiveComparator
implements Comparator<String>, java.io.Serializable {
// use serialVersionUID from JDK 1.2.2 for interoperability
private static final long serialVersionUID = 8575799808933029326L;
public int compare(String s1, String s2) {
int n1=s1.length(), n2=s2.length();
for (int i1=0, i2=0; i1<n1 && i2<n2; i1++, i2++) {
char c1 = s1.charAt(i1);
char c2 = s2.charAt(i2);
if (c1 != c2) {/////////////////////////1
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {/////////////////////////2
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {/////////////////////////3
return c1 - c2;
}
}
}
}
return n1 - n2;
}
}
From Unicode Technical Standard:
In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase
So, it's not enough to compare only uppercase of two characters, because they may have different uppercase and same lowercase
Simple brute force check gives some results. Check for example code points 73 and 304:
char ch1 = (char) 73; //LATIN CAPITAL LETTER I
char ch2 = (char) 304; //LATIN CAPITAL LETTER I WITH DOT ABOVE
System.out.println(ch1==ch2);
System.out.println(Character.toUpperCase(ch1)==Character.toUpperCase(ch2));
System.out.println(Character.toLowerCase(ch1)==Character.toLowerCase(ch2));
Output:
false
false
true
So "İ" and "I" are not equal to each other. Both characters are uppercase. But they share the same lower case letter: "i" and that gives a reason to treat them as same values in case insensitive comparison.
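The brute-force check mentioned above might look something like this. The names (LowercaseOnlyMatches, matchOnlyInLowercase, partnersOf) are hypothetical, and the search is limited to single char values, so supplementary code points are not covered.

```java
import java.util.ArrayList;
import java.util.List;

public class LowercaseOnlyMatches {
    // True if a and b differ, have different uppercase forms, but share a lowercase form.
    static boolean matchOnlyInLowercase(char a, char b) {
        return a != b
            && Character.toUpperCase(a) != Character.toUpperCase(b)
            && Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    // Scans every char value and collects those that match ch only in lowercase.
    static List<Integer> partnersOf(char ch) {
        List<Integer> result = new ArrayList<>();
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            if (matchOnlyInLowercase(ch, (char) i)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Code point 304 is expected to be among the partners of 'I' (code point 73)
        System.out.println(partnersOf('I'));
    }
}
```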
Normally, we would expect to convert the case once, compare, and be done with it. However, the code converts the case twice, and the reason is stated in a comment in a different method, public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):
Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.
Appendix
The code of regionMatches has a few differences from the code in the CaseInsensitiveComparator, but essentially does the same thing. The full code of the method is quoted below for the purpose of cross-checking:
public boolean regionMatches(boolean ignoreCase, int toffset,
String other, int ooffset, int len) {
char ta[] = value;
int to = offset + toffset;
char pa[] = other.value;
int po = other.offset + ooffset;
// Note: toffset, ooffset, or len might be near -1>>>1.
if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
(ooffset > (long)other.count - len)) {
return false;
}
while (len-- > 0) {
char c1 = ta[to++];
char c2 = pa[po++];
if (c1 == c2) {
continue;
}
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
return false;
}
return true;
}
In another answer, default locale already gave an example for why comparing only uppercase is not enough, namely the ASCII letter "I" and the capital I with dot "İ".
Now you might wonder, why do they not just compare only lowercase instead of both upper and lower case, if it catches more cases than uppercase? The answer is, it does not catch more cases, it merely finds different cases.
Take the letter "ı" ((char)305, small dotless i) and the ascii "i". They are different, their lowercase is different, but they share the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercases is the same ("i"). I found another case of this phenomenon in the Greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case-insensitive comparison cannot rely on the uppercases and lowercases of the original characters alone; it must also check the lowercases of the uppercases.
Consider the following characters: f and F. The first if statement finds that they don't match. However, if you capitalise both characters, you get F and F, which do match. The same would not be true of, say, c and G.
The code is efficient. There is no need to capitalise both characters if they already match (hence the first if statement). However, if they don't match, we need to check if they differ only in case (hence the second if statement).
The final if statement is used for certain alphabets (such as Georgian) where capitalisation is a complicated affair. To be honest, I don't know a great deal about how that works (just trust that Oracle does!).
For the case-insensitive comparison above, assume s1="Apple" and s2="apple".
In this case 'A' != 'a', because the two characters have different ASCII values, so both characters are converted to upper case and compared again. They now match, the loop continues through the remaining characters, and the final result is n1-n2=0, so the strings are treated as equal. If the characters still differ after both conversions, the innermost check
if (c1 != c2) {
return c1 - c2;
}
returns the difference between the values of the two case-converted characters.
