Clarification on Java Language Specification - java

I read the following phrase in the Java language specification.
It is a compile-time error for the character following the SingleCharacter or
EscapeSequence to be other than a '.
I don't understand what this line means. Could someone please explain it with an example?

What it says is basically: a compile-time error is generated whenever the character that comes after the "character" itself is anything other than a '. Here the "character" is the content of the literal, either a single character (like a, 0, \u0093) or an escape sequence (like \\, \b, \n).
So, this will be wrong:
'aa', because the second a is not a single quote (').
'\\a', because the second character (the a) is not a single quote.
'a, because the character which comes after the "content" is not a quote (but probably a newline or a space).
Side note: This won't work either: char c = '\u0027';. Because that is the code point for a single quote, so it gets translated into: char c = ''';.
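As a rough sketch of the rule (the class name CharLiteralDemo is made up for illustration), the following collects a few literals that do compile. The invalid forms appear only as comments, since they would stop compilation. It also shows the ordinary escape sequence that can stand in for the problematic Unicode escape from the side note.

```java
class CharLiteralDemo {
    // Returns the numeric values of a few valid character literals.
    static int[] validLiterals() {
        char letter = 'a';        // a SingleCharacter
        char backslash = '\\';    // an EscapeSequence for a backslash
        char newline = '\n';      // an EscapeSequence for a line feed
        char quote = '\'';        // the usual way to write a single quote

        // These would be compile-time errors, so they stay commented out:
        // char tooLong = 'aa';   // the second a is not the closing quote
        // char unclosed = 'a;    // no closing quote after the content

        return new int[] { letter, backslash, newline, quote };
    }

    public static void main(String[] args) {
        for (int value : validLiterals()) {
            System.out.println(value);
        }
    }
}
```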

I guess this is about character literals. Another way to say this: character literals must be enclosed by apostrophes, and it is an error if you forget the second apostrophe.
Hence:
'a' // correct
'\007' // correct
'ab // wrong

In Java, you can define a char value either as a single character or as an escape sequence. Either one must be surrounded by single quotes.
char ch = 'a';
// Unicode for uppercase Greek omega character
char uniChar = '\u03A9';
More information and examples can be found in the Java tutorial on Characters.

Related

Matching The Arabic punctuation marks in Java

I want to edit REGEX_PATTERN2 in this code so that it works with the matches() method for Arabic punctuation marks.
String REGEX_PATTERN = "[\\.|,|:|;|!|_|\\?]+";
String s1 = "My life :is happy, stable";
String[] result = s1.split(REGEX_PATTERN);
for (String myString : result) {
System.out.println(myString);
}
String REGEX_PATTERN2 = "[\\.|,|:|;|!|_|،|؛|؟\\?]+";
String s2 = " حياتي ؛ سعيدة، مستقر";
String[] result2 = s2.split(REGEX_PATTERN2);
for (String myString : result2) {
System.out.println(myString);
}
The output I wanted
My life
is happy
stable
حياتي
سعيدة
مستقر
How can I edit this code to use matches() instead of split() and get the same output with Arabic punctuation marks?
There are a few problems here. First this example:
if (word.matches("[\\.|,|:|;|!|\\?]+"))
That is mildly incorrect [1] for the following reasons:
A . does not need to be escaped in a character class.
A | does not mean alternation in a character class.
A ? does not need to be escaped in a character class.
(For more details, read the javadoc or a tutorial on Java regexes.)
So you can rewrite the above as:
if (word.matches("[.,:;!?]+"))
... assuming that you don't want to classify the pipe character as punctuation.
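A small sketch of the difference, assuming a hypothetical class name CharClassPipeDemo: inside the brackets the pipe is just another member of the class, so the original pattern accepts a lone pipe while the cleaned-up one does not.

```java
class CharClassPipeDemo {
    // Pattern from the question: every | is a literal member of the class.
    static final String WITH_PIPES = "[\\.|,|:|;|!|\\?]+";
    // Cleaned-up pattern: no pipes and no unnecessary escapes.
    static final String CLEAN = "[.,:;!?]+";

    public static void main(String[] args) {
        System.out.println("|".matches(WITH_PIPES));  // true
        System.out.println("|".matches(CLEAN));       // false
        System.out.println("?!.".matches(CLEAN));     // true
    }
}
```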
Now this:
if (word.matches("[\.|,|:|;|!|،|؛|..|...|؟|\?]+"))
You have the same problems as above. In addition, you seem to have used two and three full-stop / period characters instead of (presumably) some Unicode character. I suspect they might be \ufbb7 or \u061e or \u06db, but I'm no linguist. (Certainly 2 or 3 full-stops is incorrect.)
So what are the punctuation characters in Arabic?
To be honest, I think that the answer depends on what source you look at, but Wikipedia states:
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing and the comma is often substituted for the Latin script comma (,).
1 - By mildly incorrect, I mean that the mistakes in this example are mostly harmless. However, your inclusion of (multiple instances of) the | character in the class does mean that you will incorrectly classify a "pipe" as punctuation.
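Putting these points together, here is one possible sketch of a cleaned-up character class that also contains the Arabic comma, semicolon and question mark. The class and method names are invented for illustration, and the three Arabic marks are written as Unicode escapes only to keep the source plain ASCII. Which marks belong in the class is an assumption taken from the question.

```java
import java.util.Arrays;

class ArabicPunctuationDemo {
    // U+060C Arabic comma, U+061B Arabic semicolon, U+061F Arabic question mark.
    static final String PUNCTUATION = "[.,:;!?\u060C\u061B\u061F]+";

    // True when the whole word consists of punctuation from the class.
    static boolean isPunctuation(String word) {
        return word.matches(PUNCTUATION);
    }

    // Splits the text wherever a run of punctuation occurs.
    static String[] splitOnPunctuation(String text) {
        return text.split(PUNCTUATION);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("\u061F\u060C"));
        System.out.println(Arrays.toString(splitOnPunctuation("My life :is happy, stable")));
    }
}
```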
[] denotes a regex character class, which means it only matches single characters. ... is 3 characters, so it cannot be used in a character class.
In a character class, you don't separate characters with |, and you don't need to escape . and ?.
You probably meant this, which is a list of alternate character sequences:
"(?:\\.|,|:|;|!|\\?|،|؛|؟|\\.\\.|\\.\\.\\.)+"
You might get better performance if you do use a character class where you can:
"(?:\\.{1,3}|[,:;!?،؛؟])+"
Of course, with the + at the end, matching 1-3 periods in each iteration is rather redundant, so this will do:
"[.,:;!?،؛؟]+"
Here's a different approach, that uses Unicode properties instead of specific characters (In case you care about more Arabic marks than just the question mark and comma mentioned in another answer):
"(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$"
It matches an entire string of characters that have a punctuation category, that also are either in the Arabic block or are one of the other punctuation characters you listed in your efforts.
It'll match strings like "؟،" or "؟،:", but not "؟،ؠ" or "؟،a".
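The four cases above can be checked with a short sketch like this one. The class name is hypothetical, and the Arabic characters are written as Unicode escapes (U+061F question mark, U+060C comma, U+0620 a letter from the Arabic block) so that the source stays ASCII.

```java
class UnicodePropertyDemo {
    // Lookahead: only Arabic-block characters or the listed ASCII marks.
    // Main part: every character must have the Punctuation property.
    static final String ARABIC_OR_LISTED_PUNCTUATION =
            "(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$";

    static boolean matches(String s) {
        return s.matches(ARABIC_OR_LISTED_PUNCTUATION);
    }

    public static void main(String[] args) {
        System.out.println(matches("\u061F\u060C"));        // two Arabic marks
        System.out.println(matches("\u061F\u060C:"));       // plus an ASCII colon
        System.out.println(matches("\u061F\u060C\u0620"));  // ends with an Arabic letter
        System.out.println(matches("\u061F\u060Ca"));       // ends with a Latin letter
    }
}
```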

Messed up with Java Declaration

Why do Java character constants behave so strangely (Unicode escape versus normal representation)? See the example below.
Note: all code is in Java.
char a = '\u0061'; //This is correct
char 'a' = 'a'; //This gives compile time error
char \u0061 = 'a'; //this is correct no error
ch\u0061r a = 'a'; //This too works
ch'a'r a = 'a'; // This really is confusing compile time error
Why does the last declaration not work, whereas ch\u0061r a='a'; works?
You cannot put literals ('a') in the middle of identifiers.
The line
char 'a' = 'a';
Does not compile because there is no identifier, and you cannot assign one literal to another.
Unicode is permitted, however. It is just hard to read :-)
You cannot put character literals such as 'a' in identifiers. You can use Unicode escapes such as \u0061, though.
This isn't confusing at all. You're randomly scattering single quotes around and expecting them to be irrelevant. In the first case, you're assigning the value of the single character \u0061 to a char variable. Then you're trying to use a character literal as a variable name, which doesn't work. Then you're using a Unicode-formatted character (not quoted) as a variable name, which is okay. Perhaps you're confusing Java's quote rules with shell?
You can find the reason in the specification of identifiers:
Unicode composite characters are different from the decomposed characters.
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
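To see the grammar in action, here is a small sketch with a made-up class name. It assumes the standard Character.isJavaIdentifierStart check reflects the "Java letter" rule quoted above. It declares a variable whose keyword and name both contain the escape \u0061, then asks the library whether a letter and an apostrophe may start an identifier.

```java
class UnicodeIdentifierDemo {
    static char declareWithEscapes() {
        // After escape translation this line reads: char a = 'a';
        ch\u0061r \u0061 = 'a';
        return a;
    }

    public static void main(String[] args) {
        System.out.println(declareWithEscapes());
        // A letter may start an identifier, an apostrophe may not.
        System.out.println(Character.isJavaIdentifierStart('a'));
        System.out.println(Character.isJavaIdentifierStart('\''));
    }
}
```

The apostrophe is rejected as an identifier character, which is why ch'a'r cannot be read as the keyword char.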

Why can some ASCII characters not be expressed in the form '\uXXXX' in Java source code?

I stumbled over this (again) today:
class Test {
char ok = '\n';
char okAsWell = '\u000B';
char error = '\u000A';
}
It does not compile:
Invalid character constant in line 4.
The compiler seems to insist that I write '\n' instead. I see no reason for this, yet it's very annoying.
Is there a logical explanation why characters that have a special notation (like \t, \n, \r) must be expressed in that form in Java source?
Unicode escapes are replaced by the characters they represent, so the compiler turns your line into:
char error = '
';
which is not a valid Java statement.
This is dictated by the Language Specification:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
This can lead to surprising stuff, for example, this is a valid Java program (it contains hidden unicode characters) - courtesy of Peter Lawrey:
public static void main(String[] args) {
for (char c‮h = 0; c‮h < Character.MAX_VALUE; c‮h++) {
if (Character.isJavaIdentifierPart(c‮h) && !Character.isJavaIdentifierStart(c‮h)) {
System.out.printf("%04x <%s>%n", (int) c‮h, "" + c‮h);
}
}
}
Unicode escape sequences like \u000a are replaced by the actual characters they represent before the Java compiler does anything else with the source code. And so, your program eventually ends up at
char ch = '
';
So the \u000a in your source code is replaced internally by a linefeed character. Note that this happens before the compiler actually reads and interprets your source code.
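The early translation can be observed without breaking compilation by using a harmless escape. This sketch (class and field names are illustrative only) compares a string holding a Unicode escape with one where the backslash itself is escaped, so no Unicode escape is recognized.

```java
class UnicodeEscapeTranslationDemo {
    // Translated to the single letter A before the string literal is tokenized.
    static final String TRANSLATED = "\u0041";
    // The doubled backslash is an ordinary escape sequence, so the string keeps
    // six characters: backslash, u, 0, 0, 4, 1.
    static final String NOT_TRANSLATED = "\\u0041";

    public static void main(String[] args) {
        System.out.println(TRANSLATED + " has length " + TRANSLATED.length());
        System.out.println(NOT_TRANSLATED + " has length " + NOT_TRANSLATED.length());
    }
}
```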
Referring to the Java Language Specification:
It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.
And as we all know by heart, \n is a line terminator. Quoting:
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
Other symbols that could cause problems are \, ' and " for example.
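For those characters, the ordinary escape sequences (or a numeric cast) are safe alternatives, because they are interpreted after the Unicode translation step. A brief sketch with hypothetical constant names:

```java
class LineTerminatorLiterals {
    static final char LINE_FEED = '\n';             // U+000A via escape sequence
    static final char LINE_FEED_OCTAL = '\012';     // the same value as an octal escape
    static final char LINE_FEED_CAST = (char) 0x0A; // the same value via a cast
    static final char CARRIAGE_RETURN = '\r';       // U+000D
    static final char BACKSLASH = '\\';
    static final char SINGLE_QUOTE = '\'';
    static final char DOUBLE_QUOTE = '"';

    public static void main(String[] args) {
        System.out.println((int) LINE_FEED);
        System.out.println((int) CARRIAGE_RETURN);
        System.out.println((int) BACKSLASH);
        System.out.println((int) SINGLE_QUOTE);
        System.out.println((int) DOUBLE_QUOTE);
    }
}
```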
I think the reason is that \uXXXX sequences are expanded when the code is being parsed, see JLS §3.2. Lexical Translations.
It is described in 3.3. Unicode Escapes http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html. Javac first finds the \uxxxx sequences in the .java file and replaces them with the real characters, and only then compiles. In the case of
char error = '\u000A';
\u000A will be replaced with the newline character (code 10) and the actual text will be
char error = '
';
Because the compiler treats them the same as unescaped text.
This is valid code:
class \u00C9 {}

How to replace a special character with single slash

I have a question about strings in Java. Let's say, I have a string like so:
String str = "The . startup trace ?state is info?";
As the string contains special characters like "?", I need each of them to be replaced with "\?" as per my requirement. How do I replace special characters with "\"? I tried the following:
str.replace("?","\?");
But it gives a compilation error. Then I tried the following:
str.replace("?","\\?");
When I do this, it replaces the special characters with "\\". But when I print the string, it prints with a single slash. I thought it was storing a single slash only, but when I debugged I found that the variable holds "\\".
Can anyone suggest how to replace the special characters with single slash ("\")?
On escape sequences
A declaration like:
String s = "\\";
defines a string containing a single backslash. That is, s.length() == 1.
This is because \ is a Java escape character for String and char literals. Here are some other examples:
"\n" is a String of length 1 containing the newline character
"\t" is a String of length 1 containing the tab character
"\"" is a String of length 1 containing the double quote character
"\/" contains an invalid escape sequence, and therefore is not a valid String literal
(it causes a compilation error)
Naturally you can combine escape sequences with normal unescaped characters in a String literal:
System.out.println("\"Hey\\\nHow\tare you?");
The above prints (tab spacing may vary):
"Hey\
How are you?
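The length claims above are easy to verify. This is only an illustrative sketch with an invented class name:

```java
class StringEscapeLengths {
    // Lengths of a few string literals built from escape sequences.
    static int[] lengths() {
        return new int[] {
            "\\".length(),   // one backslash
            "\n".length(),   // one newline
            "\t".length(),   // one tab
            "\"".length(),   // one double quote
            "\\?".length()   // a backslash followed by a question mark
        };
    }

    public static void main(String[] args) {
        for (int length : lengths()) {
            System.out.println(length);
        }
    }
}
```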
References
JLS 3.10.6 Escape Sequences for Character and String Literals
See also
Is the char literal '\"' the same as '"' ?(backslash-doublequote vs only-doublequote)
Back to the problem
Your problem definition is very vague, but the following snippet works as it should:
System.out.println("How are you? Really??? Awesome!".replace("?", "\\?"));
The above snippet replaces ? with \?, and thus prints:
How are you\? Really\?\?\? Awesome!
If instead you want to replace a char with another char, then there's also an overload for that:
System.out.println("How are you? Really??? Awesome!".replace('?', '\\'));
The above snippet replaces ? with \, and thus prints:
How are you\ Really\\\ Awesome!
String API links
replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
replace(char oldChar, char newChar)
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
On how regex complicates things
If you're using replaceAll or any other regex-based methods, then things become somewhat more complicated. It can be greatly simplified if you understand some basic rules.
Regex patterns in Java are given as String values
Metacharacters (such as ? and .) have special meanings, and may need to be escaped by preceding with a backslash to be matched literally
The backslash is also a special character in replacement String values
The above factors can lead to the need for numerous backslashes in patterns and replacement strings in a Java source code.
It doesn't look like you need regex for this problem, but here's a simple example to show what it can do:
System.out.println(
"Who you gonna call? GHOSTBUSTERS!!!"
.replaceAll("[?!]+", "<$0>")
);
The above prints:
Who you gonna call<?> GHOSTBUSTERS<!!!>
The pattern [?!]+ matches one-or-more (+) of any characters in the character class [...] definition (which contains a ? and ! in this case). The replacement string <$0> essentially puts the entire match $0 within angled brackets.
Related questions
Having trouble with Splitting text. - discusses common mistakes like split(".") and split("|")
Regular expressions references
regular-expressions.info
Character class and Repetition with Star and Plus
java.util.regex.Pattern and Matcher
In case you want to replace ? with \?, there are 2 possibilities: replace and replaceAll (for regular expressions):
str.replace("?", "\\?")
str.replaceAll("\\?","\\\\?");
The result is "The . startup trace \?state is info\?"
If you want to replace ? with \, just remove the ? character from the second argument.
But when I print the string, it prints with single slash.
Good. That's exactly what you want, isn't it?
There are two simple rules:
A backslash inside a String literal has to be specified as two to satisfy the compiler, i.e. "\\". Otherwise it is taken as a special-character escape.
A backslash in a regular expression has to be specified as two to satisfy regex, otherwise it is taken as a regex escape. Because of (1) this means you have to write 2x2=4 of them: "\\\\" (and because of the forum software I actually had to write 8!).
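If counting backslashes by hand feels error-prone, the standard helpers Pattern.quote and Matcher.quoteReplacement can do the regex-level escaping, leaving only the compiler-level doubling. The sketch below (class and method names are hypothetical) shows both routes producing the same result as the plain replace call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotingHelpersDemo {
    // Regex-level escaping is delegated to the library helpers.
    static String escapeWithHelpers(String input) {
        return input.replaceAll(Pattern.quote("?"), Matcher.quoteReplacement("\\?"));
    }

    // Both levels of escaping written by hand: two for the pattern, four for the replacement.
    static String escapeByHand(String input) {
        return input.replaceAll("\\?", "\\\\?");
    }

    public static void main(String[] args) {
        String str = "The . startup trace ?state is info?";
        System.out.println(escapeWithHelpers(str));
        System.out.println(escapeByHand(str));
        System.out.println(str.replace("?", "\\?"));
    }
}
```

All three lines print the same text, with a single backslash in front of each question mark.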
String str="\\";
str=str.replace(str,"\\\\");
System.out.println("New String="+str);
Output: New String=\\
In Java, the literal "\\" is treated as "\". So the above code replaces a single slash "\" with "\\".

Is there a difference between single and double quotes in Java?

Is there a difference between single and double quotes in Java?
Use single quotes for literal chars, double quotes for literal Strings, like so:
char c = 'a';
String s = "hello";
They cannot be used interchangeably (as they can in Python, for example).
A char is a single UTF-16 code unit, such as a letter, a digit, a punctuation mark, a tab, a space or something similar.
A char literal is either a single character enclosed in single quote marks, like this
char myCharacter = 'g';
or an escape sequence, or even a unicode escape sequence:
char a = '\t'; // Escape sequence: tab
char b = '\177'; // Escape sequence, octal.
char c = '\u03a9'; // Unicode escape sequence.
It is worth noting that Unicode escape sequences are processed very early during compilation, and hence using '\u000A' will lead to a compiler error. For special symbols it is better to use escape sequences instead, i.e. '\n' instead of '\u000A'.
Double quotes being for String, you have to use a "double quote escape sequence" (\") inside strings where it would otherwise terminate the string.
For instance:
System.out.println("And then Jim said, \"Who's at the door?\"");
It isn't necessary to escape the double quote inside single quotes.
The following line is legal in Java:
char doublequote = '"';
Let's consider these lines of code (Java):
System.out.println("H"+"A"); //HA
System.out.println('H'+'a'); //169
The first line concatenates the strings H and A, which results in HA (a String).
In the second line we are adding the values of two chars. According to the ASCII table, H=72 and a=97, so 'H'+'a' adds 72+97 and gives 169.
Let's consider another case where we would have:
System.out.println("A"+'N');//AN
In this case we are dealing with concatenation of String A and char N that will result in AN.
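The three cases can be put side by side in a short sketch (the class and method names are made up). It also shows one common way to get concatenation from two chars, by starting the expression with an empty string.

```java
class CharArithmeticDemo {
    // Two chars are promoted to int and added: 72 + 97.
    static int sum() {
        return 'H' + 'a';
    }

    // Starting with a String turns every + into concatenation.
    static String concat() {
        return "" + 'H' + 'a';
    }

    // A String plus a char is concatenation as well.
    static String mixed() {
        return "A" + 'N';
    }

    public static void main(String[] args) {
        System.out.println(sum());     // 169
        System.out.println(concat());  // Ha
        System.out.println(mixed());   // AN
    }
}
```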
Single quotes indicate a character and double quotes indicate a string.
char c='c';
'c'-----> c is a character
String s="stackoverflow";
"stackoverflow"------> stackoverflow is a string (i.e. a collection of characters)
