Are all names identifiers? - java

Section 6.2 of the Java Language Specification gives the following code example:
class Test {
public static void main(String[] args) {
Class c = System.out.getClass();
System.out.println(c.toString().length() +
args[0].length() + args.length);
}
}
And it states:
the identifiers Test, main, and the first occurrences of args and c are not names. Rather, they are used in declarations to specify the names of the declared entities. The names String, Class, System.out.getClass, System.out.println, c.toString, args, and args.length appear in the example.
But are the names like Class and String also identifiers? What is an identifier exactly?

An identifier is a kind of token. From the specification of the lexical structure of Java:
3.8. Identifiers
An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
A "Java letter" is a character for which the method
Character.isJavaIdentifierStart(int) returns true.
A "Java letter-or-digit" is a character for which the method
Character.isJavaIdentifierPart(int) returns true.
The "Java letters" include uppercase and lowercase ASCII Latin letters
A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical
reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or
\u0024). The $ character should be used only in mechanically generated
source code or, rarely, to access pre-existing names on legacy
systems.
The "Java digits" include the ASCII digits 0-9 (\u0030-\u0039).
Letters and digits may be drawn from the entire Unicode character set,
which supports most writing scripts in use in the world today,
including the large sets for Chinese, Japanese, and Korean. This
allows programmers to use identifiers in their programs that are
written in their native languages.
An identifier cannot have the same spelling (Unicode character
sequence) as a keyword (§3.9), boolean literal (§3.10.3), or the null
literal (§3.10.7), or a compile-time error occurs.
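The grammar quoted above can be checked mechanically. The following is a minimal sketch, assuming a helper class called IdentifierCheck (a hypothetical name, not something from the specification). It applies Character.isJavaIdentifierStart and Character.isJavaIdentifierPart per code point, then uses javax.lang.model.SourceVersion.isKeyword to reject keywords and the boolean and null literals.

```java
import javax.lang.model.SourceVersion;

public class IdentifierCheck {
    // Hypothetical helper: true if s fits the Identifier production of JLS 3.8.
    public static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        // isKeyword also reports the literals true, false and null.
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        String[] samples = {"Class", "String", "c", "args", "class", "null", "2fast", "System.out"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

Class and String pass the check, so those names are also identifiers. System.out does not, because a qualified name is a sequence of identifiers separated by "." tokens rather than a single identifier.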

An identifier is a user defined symbol.
It allows the compiler to differentiate between bindings to objects of the same type in the symbol table.

This might answer your 2nd question:
http://www.cafeaulait.org/course/week2/08.html
Identifiers are the names of variables, methods, classes, packages and
interfaces. Unlike literals they are not the things themselves, just
ways of referring to them.


How to compile Java using Unicode characters in identifiers [closed]

First, I understand this goes against all convention and advice, but I want to do it anyway.
How can I compile Java code that uses Unicode characters in identifiers (method names, variable names, etc.), or is that even possible?
I want to be able to do something like the following:
public class 😋 extends 😃 {
public void сделайЧтонибудь() { ... }
}
Completely ridiculous example, but you get the point.
No, you can't.
An identifier has to start with a so-called Java letter that is
[...] a character for which the method Character.isJavaIdentifierStart(int) returns true.
Which in turn means
A character [ch] may start a Java identifier if and only if one of the following conditions is true:
isLetter(ch) returns true
getType(ch) returns LETTER_NUMBER
ch is a currency symbol (such as '$')
ch is a connecting punctuation character (such as '_').
The (optional) subsequent characters must be a Java letter-or-digit, that is
[...] a character for which the method Character.isJavaIdentifierPart(int) returns true.
Which in turn means
A character may be part of a Java identifier if any of the following conditions are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character
None of the above is true for either 😋 or 😃, but it is for сделайЧтонибудь which is, in fact, a valid identifier.
What you could do (why bother, tho) is write a pre-processor that translates those emojis into sequences of Java letters, with its output being a java program with valid identifiers which you can finally feed to the compiler.
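The two rules quoted above can be tried directly on the characters from the question. This is a rough sketch, assuming a demo class called IdentifierStartDemo (a hypothetical name). It uses the code points U+1F60B (the first emoji) and U+0441 (the first Cyrillic letter of the method name) as numeric constants, so the result does not depend on the source file encoding.

```java
public class IdentifierStartDemo {
    // Describes one code point using the tests the quoted rules rely on.
    public static String describe(int codePoint) {
        return String.format("U+%04X start=%b part=%b otherSymbol=%b",
                codePoint,
                Character.isJavaIdentifierStart(codePoint),
                Character.isJavaIdentifierPart(codePoint),
                Character.getType(codePoint) == Character.OTHER_SYMBOL);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x1F60B)); // emoji used as the class name
        System.out.println(describe(0x0441));  // Cyrillic small letter es
    }
}
```

The emoji is neither a valid start nor a valid part of an identifier, while the Cyrillic letter is both.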
This is not valid Java, so you can't "make" it compile. Choose a valid identifier name as defined by the specification:
https://docs.oracle.com/javase/specs/jls/se18/html/jls-3.html#jls-3.8
Identifiers may contain "Java letters" or "Java digits", which are unicode, but do not allow arbitrary unicode symbols:
The "Java letters" include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII dollar sign ($, or \u0024) and underscore (_, or \u005f). The dollar sign should be used only in mechanically generated source code or, rarely, to access pre-existing names on legacy systems. The underscore may be used in identifiers formed of two or more characters, but it cannot be used as a one-character identifier due to being a keyword.
The "Java digits" include the ASCII digits 0-9 (\u0030-\u0039).

Different Java Regex matching behavior when using UNICODE_CHARACTER_CLASS flag

I was testing the behavior of the Pattern.UNICODE_CHARACTER_CLASS flag for different punctuation characters and noticed that the matches for grave accent character (U+0060) ` occur differently depending on whether Pattern.UNICODE_CHARACTER_CLASS is used.
For example, see the below code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraceAccentTest {
public static void main(String args[]) {
Pattern p = Pattern.compile("\\p{Punct}");
Matcher m = p.matcher("`");
System.out.println(m.matches()); // prints true
Pattern p1 = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m1 = p1.matcher("`");
System.out.println(m1.matches()); // prints false
}
}
When I don't use Pattern.UNICODE_CHARACTER_CLASS flag grave accent character matches with \p{Punct} character class but when I use the flag it doesn't match. Can someone explain the reasoning for this ?
When you use Pattern p = Pattern.compile("\\p{Punct}");, then \p{Punct} refers to the following 32 characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Reference: the Pattern class.
These 32 characters correspond to the ASCII character set characters 0x21 through 0x7e, excluding letters and digits. They also happen to represent all the non-letter and non-digit symbols on my standard U.S. keyboard (your keyboard may be different, of course).
The grave accent (also known as a backtick) is in that list and on my keyboard.
That is a simple example of a "predefined character class" - and explains why your m.matches() returns true.
When you add the Pattern.UNICODE_CHARACTER_CLASS flag things get more complicated.
As the documentation for this flag explains, it:
Enables the Unicode version of Predefined character classes and POSIX character classes.
and:
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expressions Annex C: Compatibility Properties.
Looking at the Annex C referred to above, we find a table showing the "recommended assignments for compatibility property names".
For our property name (punct), the standard recommendation is to use characters defined by this:
\p{gc=Punctuation}
Here, "gc" stands for "general category". Unicode characters are assigned a "general category" value. In this case, that is Punctuation - also abbreviated to P and further broken down into various sub-categories such as Pc for connectors, Pd for dashes, and so on. There is also a catch-all Po for "other punctuation characters".
The grave character is assigned to the Symbol general category in Unicode - and to the Modifier subcategory. You can see that assignment to Sk here.
Contrast that with a character such as the ASCII exclamation mark (also part of our original \p{Punct} list, shown above). For that we can see that the general category assignment is Po.
That explains why the grave is no longer matched when we add the Pattern.UNICODE_CHARACTER_CLASS flag to our original pattern.
It is assigned to a different general category from the punctuation category we are using in our regex.
The obvious next question is why did the grave character not get included in the Unicode Po general category? Why is it in Sk instead?
I do not have a good answer for that - I'm sure there are "historical reasons". It's worth noting, however, that the Sk category includes characters such as the acute accent, the cedilla, the diaeresis, and so on - and (as already noted) our grave accent.
All these are diacritics - typically used in combination with a base letter to alter the pronunciation. So maybe that is the underlying reason.
The grave is a bit of an oddity, perhaps, given it has a historical usage outside of being used as a diacritic.
It may be more relevant to ask how the grave ended up as part of the original ASCII character set, in the first place. Some background about this is provided in the Wikipedia page for the backtick.
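The category assignments described above can be confirmed from Java itself. Here is a small sketch, assuming a demo class called GraveCategoryDemo (a hypothetical name), that prints the general category reported by Character.getType next to the result of \p{Punct} with and without the flag.

```java
import java.util.regex.Pattern;

public class GraveCategoryDemo {
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");
    static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    // Reports the general category and both match results for one character.
    public static String report(char ch) {
        String s = String.valueOf(ch);
        String category;
        switch (Character.getType(ch)) {
            case Character.MODIFIER_SYMBOL:   category = "Sk"; break;
            case Character.OTHER_PUNCTUATION: category = "Po"; break;
            default:                          category = "other"; break;
        }
        return ch + " category=" + category
                + " ascii=" + ASCII_PUNCT.matcher(s).matches()
                + " unicode=" + UNICODE_PUNCT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(report('`'));
        System.out.println(report('!'));
    }
}
```

If the goal is to keep matching the grave accent while the flag is set, one option is to name its category explicitly, for example with a class such as [\p{Punct}\p{Sk}].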
Reading the documentation for UNICODE_CHARACTER_CLASS:
When this flag is specified then the (US-ASCII only) Predefined
character classes and POSIX character classes are in conformance with
Unicode Technical Standard #18: Unicode Regular Expression Annex C:
Compatibility Properties.
So without the flag these classes are US-ASCII only, and with the flag they follow the Unicode definitions instead. If you check the tables of Unicode punctuation characters, you will see that some characters from the ASCII list are missing.
Tables :
https://www.fileformat.info/info/unicode/category/Po/list.htm
https://www.gaijin.at/en/infos/unicode-character-table-punctuation

What is the equivalent in .Net of the :print: character class from PHP or Java? [duplicate]

Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string only contains a character that can be printed--i.e. does not contain ASCII control characters like \b (bell), or null, etc. Anything on the keyboard is fine, and so are UTF chars.
If there isn't a special statement, how can I specify this in a regex?
Very late to the party, but this regexp works: /[ -~]/.
How? It matches all characters in the range from space (ASCII DEC 32) to tilde (ASCII DEC 126), which is the range of all printable characters.
If you want to strip non-ASCII characters, you could use something like:
$someString.replace(/[^ -~]/g, '');
NOTE: this is not valid .net code, but an example of regexp usage for those who stumble upon this via search engines later.
If your regex flavor supports Unicode properties, this is probably the best way:
\P{Cc}
That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).
The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.
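The difference between the three classes in Java can be seen with a few sample code points. This is a minimal sketch, assuming a demo class called PrintableClassesDemo (a hypothetical name). It tests the letter A, the BEL control character U+0007, the Latin-1 letter U+00E9, and the C1 control character U+0085.

```java
import java.util.regex.Pattern;

public class PrintableClassesDemo {
    // True if the single code point matches the given regex as a whole.
    public static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int[] samples = {'A', 0x07, 0xE9, 0x85};
        String[] classes = {"\\p{Print}", "\\P{Cntrl}", "\\P{Cc}"};
        for (int cp : samples) {
            StringBuilder line = new StringBuilder(String.format("U+%04X", cp));
            for (String cls : classes) {
                line.append("  ").append(cls).append('=').append(matches(cls, cp));
            }
            System.out.println(line);
        }
    }
}
```

Only \P{Cc} both accepts the non-ASCII letter and rejects the C1 control character.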
TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Working Demo
In this demo, I use this regex to search the string "Hello, World!_". I'm going to add a weird character at the end, (char)4, which is the END OF TRANSMISSION character.
using System;
using System.Text.RegularExpressions;
public class Test {
public static void Main() {
// your code goes here
var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
var matches = regex.Matches("Hello, World!" + (char)4);
Console.WriteLine("Results: " + matches.Count);
foreach (Match match in matches) {
Console.WriteLine("Result: " + match);
}
}
}
Full Working Demo at IDEOne.com
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match UTF-8-invalid characters.
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
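One way to combine the categories listed above is a single negated character class. The sketch below is in Java rather than .NET, and it assumes the intent is "every character belongs to none of Cc, Cn or Cs". The class name PrintableOnly and the method isPrintable are hypothetical. Note that writing several \P{...} items side by side, outside a character class, asks for that many consecutive characters rather than one character that satisfies all conditions.

```java
import java.util.regex.Pattern;

public class PrintableOnly {
    // One negated class: a code point passes only if it is in none of
    // the categories Cc (control), Cn (unassigned) and Cs (surrogate).
    private static final Pattern PRINTABLE =
            Pattern.compile("[^\\p{Cc}\\p{Cn}\\p{Cs}]*");

    // Hypothetical helper: true if the whole string passes the class above.
    public static boolean isPrintable(String s) {
        return PRINTABLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPrintable("Hello, World!"));
        System.out.println(isPrintable("Hello, World!" + (char) 4));
    }
}
```

The same negated class syntax is likely to carry over to other flavors that support Unicode categories, but that should be verified against the documentation of the flavor in use.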
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. Note that these match codes throughout the ASCII table, so they might not be suitable for matching other encodings.
Failing that, the expression [\x00-\x1f] will match through the ASCII control characters, although again, these could be printable in other encodings.
In Java, the \p{Print} option specifies the printable character class.
It depends wildly on what regex package you are using. This is one of these situations about which some wag said that the great thing about standards is there are so many to choose from.
If you happen to be using C, the isprint(3) function/macro is your friend.
Adding on to @Alan-Moore, \P{Cc} is actually an example of a Negative Unicode Category or Unicode Block (ref: Character Classes in Regular Expressions). \P{name} matches any character that does not belong to a Unicode general category or named block. See the referenced link for more examples of named blocks supported in .NET.

How to include backslash in String variable name (in java)

I want to include a backslash in a String variable name. How can I do that?
Ex:
String Cd_St_SSLC/PUC;
A / (forward slash) cannot appear in a variable name because it is a reserved character. A / causes a compile-time error unless it is the division operator, part of a comment (//, /** */, or /* */), inside a string ("//"), or a character literal ('/'). Operators cannot be part of a variable's name.
The Java™ Tutorials
Variables
Naming
Every programming language has its own set of rules and conventions for the kinds of names that you're allowed to use, and the Java programming language is no different. The rules and conventions for naming your variables can be summarized as follows:
Variable names are case-sensitive. A variable's name can be any legal identifier: an unlimited-length sequence of Unicode letters and digits, beginning with a letter, the dollar sign "$", or the underscore character "_". The convention, however, is to always begin your variable names with a letter, not "$" or "_". Additionally, the dollar sign character, by convention, is never used at all. You may find some situations where auto-generated names will contain the dollar sign, but your variable names should always avoid using it. A similar convention exists for the underscore character; while it's technically legal to begin your variable's name with "_", this practice is discouraged. White space is not permitted.
Subsequent characters may be letters, digits, dollar signs, or underscore characters. Conventions (and common sense) apply to this rule as well. When choosing a name for your variables, use full words instead of cryptic abbreviations. Doing so will make your code easier to read and understand. In many cases it will also make your code self-documenting; fields named cadence, speed, and gear, for example, are much more intuitive than abbreviated versions, such as s, c, and g. Also keep in mind that the name you choose must not be a keyword or reserved word.
If the name you choose consists of only one word, spell that word in all lowercase letters. If it consists of more than one word, capitalize the first letter of each subsequent word. The names gearRatio and currentGear are prime examples of this convention. If your variable stores a constant value, such as static final int NUM_GEARS = 6, the convention changes slightly, capitalizing every letter and separating subsequent words with the underscore character. By convention, the underscore character is never used elsewhere.
See also 1/2
The Java language specification for identifiers.
The Java® Language Specification: Java SE 7 Edition
Chapter 3. Lexical Structure
3.8. Identifiers
An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
3.12. Operators
37 tokens are the operators, formed from ASCII characters.
Operator: one of
= > < ! ~ ? :
== <= >= != && || ++ --
+ - * / & | ^ % << >> >>>
+= -= *= /= &= |= ^= %= <<= >>= >>>=
See also 2/2
The following method Character.isUnicodeIdentifierPart can determine "if the character may be part of a Unicode identifier".
Method: Java.lang.Character.isUnicodeIdentifierPart()
Description
The java.lang.Character.isUnicodeIdentifierPart(char ch) [method] determines if the specified character may be part of a Unicode identifier as other than the first character.
A character may be part of a Unicode identifier if and only if one of the following statements is true:
it is a letter
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for this character.
That's a forward slash, and not legal in a Java variable name because it is the division operator.
int a = b/c;
I suggest you take the Java naming conventions into consideration. You can read more about this in "Thinking in java", from http://java.about.com/od/javasyntax/a/nameconventions.htm... It is good practice to avoid characters like '/'; maybe you can replace it with '_'.
Roxana
I advise you not to use String Cd_St_SSLC/PUC, because '/' is not legal in a Java variable name. If you want a meaningful name, use an underscore instead: String Cd_St_SSLC_PUC.
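If such names come from external data (column headers, for example), the replacement suggested above can be automated. This is only a sketch, assuming a helper called toIdentifier in a class called IdentifierSanitizer (both names are hypothetical). It swaps every character that is not allowed at its position for an underscore and does not check for keywords.

```java
public class IdentifierSanitizer {
    // Hypothetical helper: replaces each code point that is not legal at its
    // position in a Java identifier with an underscore. A leading digit is
    // also replaced, because an identifier cannot start with a digit.
    public static String toIdentifier(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            boolean ok = (sb.length() == 0)
                    ? Character.isJavaIdentifierStart(cp)
                    : Character.isJavaIdentifierPart(cp);
            sb.appendCodePoint(ok ? cp : '_');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("Cd_St_SSLC/PUC"));
    }
}
```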

Messed up with Java Declaration

Why do Java declarations behave strangely when Unicode escapes are mixed with the normal character representation? See the example below.
Note : All code is in java language.
char a = '\u0061'; //This is correct
char 'a' = 'a'; //This gives compile time error
char \u0061 = 'a'; //this is correct no error
ch\u0061r a = 'a'; //This too works
ch'a'r a = 'a'; // This really is confusing compile time error
Why does the last declaration not work, whereas ch\u0061r a='a'; works?
You cannot put literals ('a') in the middle of identifiers.
The line
char 'a' = 'a';
Does not compile because there is no identifier, and you cannot assign one literal to another.
Unicode is permitted, however. It is just hard to read :-)
You can not put literal characters, 'a', in identifiers. You can use unicode, \u0061, though.
This isn't confusing at all. You're randomly scattering single quotes around and expecting them to be irrelevant. In the first case, you're assigning the value of the single character \u0061 to a char variable. Then you're trying to use a character literal as a variable name, which doesn't work. Then you're using a Unicode-formatted character (not quoted) as a variable name, which is okay. Perhaps you're confusing Java's quote rules with shell?
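The reason the unquoted escapes work is that the compiler translates Unicode escapes before it splits the source into tokens, so an escape can stand in for a letter anywhere, including inside a keyword or an identifier. A minimal sketch, assuming a demo class called UnicodeEscapeDemo (a hypothetical name):

```java
public class UnicodeEscapeDemo {
    public static char declareWithEscapes() {
        // Both escapes on the next line are translated to the letter a before
        // tokenization, so the compiler reads it as: char a = 'x';
        ch\u0061r \u0061 = 'x';
        return a;
    }

    public static void main(String[] args) {
        System.out.println(declareWithEscapes());
    }
}
```

A quoted 'a', by contrast, is already a complete character literal token, so it cannot merge with the letters around it.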
You can find the reason in the specification of literals.
Unicode composite characters are different from the decomposed characters.
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
