java valid identifier from java language specification - java

Many places on SO lead to the JLS section on Identifiers, but I have a question on what's written there.
The "Java letters" include uppercase and lowercase ASCII Latin letters
A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical
reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or
\u0024). The $ character should be used only in mechanically generated
source code or, rarely, to access pre-existing names on legacy
systems. The "Java digits" include the ASCII digits 0-9
(\u0030-\u0039).
But it goes on to say:
Letters and digits may be drawn from the entire Unicode character set,
which supports most writing scripts in use in the world today,
including the large sets for Chinese, Japanese, and Korean. This
allows programmers to use identifiers in their programs that are
written in their native languages.
I don't understand how these can both be true. The first section seems to dictate exactly which characters are allowed whereas the second section seems to say that the allowance is much more flexible.
I agree that usage of "includes" instead of "includes but is not limited to" shows that it doesn't exactly contradict. But it also first refers specifically to "Java letters"/"Java digits" and then relaxes this to just "letters"/"digits". My main point is lack of clarity and I wanted confirmation on what I assumed it meant.

As per the question Legal identifiers in Java you can see that there are many legal identifiers.
[For languages using the roman alphabet] only alphanumeric characters and occasionally underscores are used when naming identifiers by convention. However, a vast array of characters can be used.
The first paragraph refers to the code style, or convention, among Java programmers of using a reasonably consistent and readable naming scheme. The second paragraph you've quoted explains that there is a vast array of other characters that Java will accept, although your fellow programmers may disapprove.

The first section is a special case of the second, and the characters mentioned in both sections have to satisfy the criteria in JLS 3.8, which the question leaves out:
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.
A "Java letter-or-digit" is a character for which the method
Character.isJavaIdentifierPart(int) returns true.
The above methods accept/verify the code points that correspond to the characters in the entire Unicode character set (Section 2) which includes the Basic-Latin character set (Section 1).
Usually, you will never see anybody going beyond the Basic-Latin character set in their Java source files.
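To see how the two Character methods quoted above decide what counts as a "Java letter" or "Java letter-or-digit", here is a minimal sketch. The class name IdentifierCheck and the helper hasIdentifierShape are made up for illustration, and the check deliberately ignores keywords and literals such as null.

```java
class IdentifierCheck {
    /**
     * Returns true if the first code point is a "Java letter" and every later
     * code point is a "Java letter-or-digit". Keywords are NOT rejected here.
     */
    static boolean hasIdentifierShape(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point, not by char, so supplementary characters are handled.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // ASCII names, a Greek-prefixed name, a two-ideograph name, and two invalid names.
        String[] samples = { "count", "_tmp", "$gen", "\u03b1Rate", "\u5909\u6570", "2count", "high-temp" };
        for (String s : samples) {
            System.out.println(s + " -> " + hasIdentifierShape(s));
        }
    }
}
```

The ASCII letters, underscore and dollar sign from the first JLS paragraph pass this check, and so do the non-Latin letters from the second paragraph, which is why the two paragraphs do not contradict each other.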

Related

Undocumented Java regex character class: \p{C}

I found an interesting regex in a Java project: "[\\p{C}&&\\S]"
I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?
The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match on control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO:
public class StrangePattern {
public static void main(String[] argv) {
// As far as I can tell, this is the simplest way to create a String
// with code points above U+FFFF.
String poo = new String(Character.toChars(0x1F4A9));
System.out.println(poo); // prints `💩`
System.out.println(poo.replaceAll("\\p{C}", "?")); // prints `??`
System.out.println(poo.replaceAll("\\p{Cntrl}", "?")); // prints `💩`
}
}
The only mention I've found anywhere is here:
\p{C} or \p{Other}: invisible control characters and unused code points.
However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.
My Java version info:
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.
Buried down in the Pattern docs under Unicode Support, we find the following:
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.
...
Categories may be specified with the optional prefix Is: Both \p{L}
and \p{IsL} denote the category of Unicode letters. Same as scripts
and blocks, categories can also be specified by using the keyword
general_category (or its short form gc) as in general_category=Lu or
gc=Lu.
The supported categories are those of The Unicode Standard in the
version specified by the Character class. The category names are those
defined in the Standard, both normative and informative.
From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
It probably should support \p{Other}, but apparently it doesn't.
Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:
To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
According to https://regex101.com/, \p{C} matches
Invisible control characters and unused code points
(The backslash has to be escaped inside a Java string literal, so the string "\\p{C}" is the regex \p{C}.)
I'm guessing this is a "hacked string" check, since a \p{C} character should probably never appear inside a valid string of ordinary characters. The author should have left a comment, though, because what a pattern checks and what its author wanted to check are often two different things.
Anything other than a valid two-letter Unicode category code or a single letter that begins a Unicode category code is illegal since Java supports only single letter and two-letter abbreviations for Unicode categories. That's why \p{Other} doesn't work here.
\p{C} matches twice on Unicode characters above U+FFFF, such as PILE
OF POO.
Right. Java uses UTF-16 internally, and 💩 is encoded as two 16-bit code units (0xD83D 0xDCA9), known as a surrogate pair (a high surrogate followed by a low surrogate). Since \p{C} matches each half separately
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16
encoding.
you see two matches in result set.
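The surrogate-pair explanation can be checked without any regex at all. The sketch below only uses String and Character methods; the class name SurrogateDemo and its helpers are hypothetical, and it says nothing about what the regex engine does with the string.

```java
class SurrogateDemo {
    /** True if the string is exactly one code point stored as two UTF-16 code units. */
    static boolean isOneSupplementaryCodePoint(String s) {
        return s.length() == 2 && s.codePointCount(0, s.length()) == 1;
    }

    /** General category of a single UTF-16 code unit, looked at in isolation. */
    static int unitCategory(String s, int index) {
        return Character.getType(s.charAt(index));
    }

    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));

        System.out.println(isOneSupplementaryCodePoint(poo));               // true
        System.out.println(Character.isHighSurrogate(poo.charAt(0)));       // true
        System.out.println(Character.isLowSurrogate(poo.charAt(1)));        // true

        // Each half on its own is in category Cs (Surrogate), which is part of C (Other).
        System.out.println(unitCategory(poo, 0) == Character.SURROGATE);    // true
        System.out.println(unitCategory(poo, 1) == Character.SURROGATE);    // true

        // The whole code point is in category So (Other Symbol), which is not part of C.
        System.out.println(Character.getType(poo.codePointAt(0)) == Character.OTHER_SYMBOL); // true
    }
}
```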
What is the likely intent of the original pattern, [\\p{C}&&\\S]?
I don't see a very good reason for it, but it seems the developer was worried about characters in the category Other (for example, to keep spammy emoji out of an email's subject) and simply tried to block them.
As for the bonus question: the expression [\\p{C}&&\\S] finds control characters, excluding whitespace characters like tabs or line feeds, in Java. These characters have no value in regular mails, so it is a good idea to filter them out (or, as in this case, to declare the email content faulty). Be aware that the double backslashes (\\) are only necessary to escape the expression inside a Java string literal. The regular expression itself is: [\p{C}&&\S]
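A small sketch of how that validation might behave in practice. The class name ControlCharCheck and the method containsSuspiciousChar are invented here; only the pattern itself comes from the question.

```java
import java.util.regex.Pattern;

class ControlCharCheck {
    // Same pattern as in the question: category C ("Other") intersected with non-whitespace.
    private static final Pattern SUSPICIOUS = Pattern.compile("[\\p{C}&&\\S]");

    /** Hypothetical helper: true if the string contains a character the pattern rejects. */
    static boolean containsSuspiciousChar(String s) {
        return SUSPICIOUS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSuspiciousChar("plain text"));         // false
        // Tab and line feed are control characters, but \S excludes them.
        System.out.println(containsSuspiciousChar("tab\tand\nnewline"));  // false
        // U+0007 (BEL) is a control character and not whitespace.
        System.out.println(containsSuspiciousChar("bell\u0007char"));     // true
        // U+200B (ZERO WIDTH SPACE) is a format character (Cf), also part of C.
        System.out.println(containsSuspiciousChar("zero\u200Bwidth"));    // true
    }
}
```

So ordinary text with tabs and newlines passes, while invisible control and format characters trigger the "Invalid string" path.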

Java pound (#) character syntax [duplicate]

What characters are valid in a Java class name? What other rules govern Java class names (for instance, Java class names cannot begin with a number)?
You can have almost any character, including most Unicode characters! The exact definition is in the Java Language Specification under section 3.8: Identifiers.
An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. ...
Letters and digits may be drawn from the entire Unicode character set, ... This allows programmers to use identifiers in their programs that are written in their native languages.
An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), boolean literal (§3.10.3), or the null literal (§3.10.7), or a compile-time error occurs.
However, see this question for whether or not you should do that.
Every programming language has its own set of rules and conventions for the kinds of names that you're allowed to use, and the Java programming language is no different. The rules and conventions for naming your variables can be summarized as follows:
Variable names are case-sensitive. A variable's name can be any legal identifier — an unlimited-length sequence of Unicode letters and digits, beginning with a letter, the dollar sign "$", or the underscore character "_". The convention, however, is to always begin your variable names with a letter, not "$" or "_". Additionally, the dollar sign character, by convention, is never used at all. You may find some situations where auto-generated names will contain the dollar sign, but your variable names should always avoid using it. A similar convention exists for the underscore character; while it's technically legal to begin your variable's name with "_", this practice is discouraged. White space is not permitted.
Subsequent characters may be letters, digits, dollar signs, or underscore characters. Conventions (and common sense) apply to this rule as well. When choosing a name for your variables, use full words instead of cryptic abbreviations. Doing so will make your code easier to read and understand. In many cases it will also make your code self-documenting; fields named cadence, speed, and gear, for example, are much more intuitive than abbreviated versions, such as s, c, and g. Also keep in mind that the name you choose must not be a keyword or reserved word.
If the name you choose consists of only one word, spell that word in all lowercase letters. If it consists of more than one word, capitalize the first letter of each subsequent word. The names gearRatio and currentGear are prime examples of this convention. If your variable stores a constant value, such as static final int NUM_GEARS = 6, the convention changes slightly, capitalizing every letter and separating subsequent words with the underscore character. By convention, the underscore character is never used elsewhere.
From the official Java Tutorial.
Further to the previous answers, it's worth noting that:
Java allows any Unicode currency symbol in symbol names, so the following will all work:
$var1
£var2
€var3
I believe the usage of currency symbols originates in C/C++, where variables added to your code by the compiler conventionally started with '$'. An obvious example in Java is the names of '.class' files for inner classes, which by convention have the format 'Outer$Inner.class'
Many C# and C++ programmers adopt the convention of placing 'I' in front of interfaces (aka pure virtual classes in C++). This is not required, and hence not done, in Java because the implements keyword makes it very clear when something is an interface.
Compare:
class Employee : public IPayable //C++
with
class Employee : IPayable //C#
and
class Employee implements Payable //Java
Many projects use the convention of placing an underscore in front of field names, so that they can readily be distinguished from local variables and parameters e.g.
private double _salary;
A tiny minority place the underscore after the field name e.g.
private double salary_;
As already stated by Jason Cohen, the Java Language Specification defines what a legal identifier is in section 3.8:
"An identifier is an unlimited-length sequence of Java letters and Java digits, the
first of which must be a Java letter. [...] A 'Java letter' is a character for which the method Character.isJavaIdentifierStart(int) returns true. A 'Java letter-or-digit' is a character for which the method Character.isJavaIdentifierPart(int) returns true."
This hopefully answers your second question. Regarding your first question; I've been taught both by teachers and (as far as I can remember) Java compilers that a Java class name should be an identifier that begins with a capital letter A-Z, but I can't find any reliable source on this. When trying it out with OpenJDK there are no warnings when beginning class names with lower-case letters or even a $-sign. When using a $-sign, you do have to escape it if you compile from a bash shell, however.
I'd like to add to bosnic's answer that any valid currency character is legal for an identifier in Java. th€is is a legal identifier, as is €this, and € as well. However, I can't figure out how to edit his or her answer, so I am forced to post this trivial addition.
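The currency-symbol rule is easy to confirm with the Character class. This is only a sketch; the class name CurrencyIdentifierDemo and the helper isCurrencyStart are made up for the example.

```java
class CurrencyIdentifierDemo {
    /** True if the code point is a Unicode currency symbol that may start an identifier. */
    static boolean isCurrencyStart(int codePoint) {
        return Character.getType(codePoint) == Character.CURRENCY_SYMBOL
                && Character.isJavaIdentifierStart(codePoint);
    }

    public static void main(String[] args) {
        // Dollar, pound, euro and yen signs, given as code points.
        int[] symbols = { '$', 0x00A3, 0x20AC, 0x00A5 };
        for (int cp : symbols) {
            System.out.println(Character.getName(cp) + " -> " + isCurrencyStart(cp)); // true for all four
        }
        // The hash sign is punctuation, not a currency symbol, so it cannot appear in a name.
        System.out.println("NUMBER SIGN -> " + Character.isJavaIdentifierPart('#'));   // false
    }
}
```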
What other rules govern Java class names (for instance, Java class names cannot begin with a number)?
Java class names usually begin with a capital letter.
Java class names cannot begin with a number.
If there are multiple words in the class name, each word should begin with a capital letter, e.g. "MyClassName". This naming convention is known as CamelCase.
Class names should be nouns in UpperCamelCase, with the first letter of every word capitalised. Use whole words — avoid acronyms and abbreviations (unless the abbreviation is much more widely used than the long form, such as URL or HTML).
The naming conventions can be read over here:
http://www.oracle.com/technetwork/java/codeconventions-135099.html
Identifiers are used for class names, method names, and variable names. An identifier may be any descriptive sequence of uppercase and lowercase letters, numbers, or the underscore and dollar-sign characters. They must not begin with a number, lest they be confused with a numeric literal. Again, Java is case-sensitive, so VALUE is a different identifier than Value.
Some examples of valid identifiers are:
AvgTemp, count, a4, $test, this_is_ok
Invalid variable names include:
2count, high-temp, Not/ok

Which of these are valid variable names?

This is a question from a Java test I took at University
I. publicProtected
II. $_
III. _identi#ficador
IV. Protected
I'd say I, II, and IV are correct. What is the correct answer for this?
Source of the question, translated from Spanish: "Given the following list of variable identifiers, which one(s) is (are) valid?"
From the java documentation:
Variable names are case-sensitive. A variable's name can be any legal
identifier — an unlimited-length sequence of Unicode letters and
digits, beginning with a letter, the dollar sign "$", or the
underscore character "_". The convention, however, is to always begin
your variable names with a letter, not "$" or "_". Additionally, the
dollar sign character, by convention, is never used at all. You may
find some situations where auto-generated names will contain the
dollar sign, but your variable names should always avoid using it. A
similar convention exists for the underscore character; while it's
technically legal to begin your variable's name with "_", this
practice is discouraged. White space is not permitted. Subsequent
characters may be letters, digits, dollar signs, or underscore
characters. Conventions (and common sense) apply to this rule as well.
When choosing a name for your variables, use full words instead of
cryptic abbreviations. Doing so will make your code easier to read and
understand. In many cases it will also make your code
self-documenting; fields named cadence, speed, and gear, for example,
are much more intuitive than abbreviated versions, such as s, c, and
g. Also keep in mind that the name you choose must not be a keyword or
reserved word.
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/variables.html
In short: yes, you're right. You can use underscores, dollar signs, and letters to start a variable name. After the first character of the variable name, you can also use digits. Note that using dollar signs is generally not good practice.
From your comment, you said that your teacher rejected "II". As your question is worded, II is perfectly fine (try it, it will compile and run). However, if the question on your test asked which are "good" variable names, or which variable names follow common practice, then II would be eliminated as explained in the quotation above. One reason for this is that dollar signs do not make readable variable names; they're included because internally Java makes variables that use the dollar sign.
What is the meaning of $ in a variable name?
As pointed out in the comments, IV is not a good name either, since the lowercase version "protected" is a reserved keyword. With syntax highlighting, you probably wouldn't get the two confused, but using keyword variations as variable names is certainly one way to confuse future readers.
private, protected, and public are reserved keywords in Java. To use those words in a name, combine them with an underscore or other characters, for example:
int public_x;
int protected_x;
String private_s;
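The four candidates from the exam question can also be checked programmatically. A sketch, assuming the standard javax.lang.model.SourceVersion class from the JDK is available; the class name ExamAnswers and the helper isUsableVariableName are hypothetical.

```java
import javax.lang.model.SourceVersion;

class ExamAnswers {
    /** True if the text has the shape of an identifier and is not a keyword or literal. */
    static boolean isUsableVariableName(String name) {
        return SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name);
    }

    public static void main(String[] args) {
        // The four options from the test, plus the lowercase keyword for comparison.
        String[] candidates = { "publicProtected", "$_", "_identi#ficador", "Protected", "protected" };
        for (String c : candidates) {
            System.out.println(c + " -> " + isUsableVariableName(c));
        }
        // Expected: true, true, false (because of '#'), true, false (keyword).
    }
}
```

This only answers "will it compile as a name", not "is it a good name", which is the distinction the answers above draw for option II.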

Checking for specific strings with regex

I have a list of Strings of arbitrary length. I need to ensure each String element in the list is alphanumeric or numeric, contains no spaces, and may contain special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel that it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[\p{Alpha}\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like \p{Alpha} (Java does not support the bracketed [:alpha:] notation) and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
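A short sketch of the difference the flag makes, using the "élève" example from above. The class name UnicodeWordDemo and its three helpers are made up; the accented letters are written as Unicode escapes so the file's own encoding does not matter.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // \w without the flag only knows ASCII letters, digits and underscore.
    private static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");
    // The same pattern compiled with the flag that (?U) switches on.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASSES);
    // The category-based pattern from the answer; no flag needed.
    private static final Pattern LETTERS_AND_NUMBERS = Pattern.compile("^[\\pL\\pN]+$");

    static boolean asciiWord(String s)         { return ASCII_WORD.matcher(s).matches(); }
    static boolean unicodeWord(String s)       { return UNICODE_WORD.matcher(s).matches(); }
    static boolean lettersAndNumbers(String s) { return LETTERS_AND_NUMBERS.matcher(s).matches(); }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve"; // the French word from the answer, with precomposed accents

        System.out.println(asciiWord(eleve));          // false
        System.out.println(unicodeWord(eleve));        // true
        System.out.println(lettersAndNumbers(eleve));  // true

        // The underscore is connector punctuation, so only the \w variants accept it.
        System.out.println(asciiWord("abc_1"));         // true
        System.out.println(lettersAndNumbers("abc_1")); // false
    }
}
```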
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this: /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
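For reference, here is how the second pattern from this answer might be wired up in Java and run against the examples from the question. The class name ContainsNumberCheck and the method accept are invented for the sketch.

```java
import java.util.regex.Pattern;

class ContainsNumberCheck {
    // Letters, numbers and punctuation only, with at least one number somewhere.
    private static final Pattern AT_LEAST_ONE_NUMBER =
            Pattern.compile("[\\pL\\pN\\pP]*[\\pN][\\pL\\pN\\pP]*");

    static boolean accept(String s) {
        // matches() requires the whole string to fit the pattern.
        return AT_LEAST_ONE_NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "J0hn-132ss/sda", "Hdka349040r38yd", "Hd(ersd)3r4y743-2\\d3", "123456789", // accepted
            "Hello", "Joe", "King",                                                    // rejected: no number
            "12 34"                                                                    // rejected: contains a space
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accept(s));
        }
    }
}
```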
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
This can be read as ((1 or more digits) OR (0 or more of the special characters \ / - _)) repeated 0 or more times. It accepts digit-only strings, but note that as written it does not actually require a digit: it also accepts the empty string and strings consisting only of special characters.
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

Is it a good idea to use unicode symbols as Java identifiers?

I have a snippet of code that looks like this:
double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);
Just how bad an idea is it to use unicode characters in Java identifiers? Or is this perfectly acceptable?
It's a bad idea, for various reasons.
Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.
Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.
Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):
Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.
...
Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.
This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!
There is no way to tell which characters are valid. The characters from your code work, but when I use the FUNCTIONAL SYMBOL ALPHA my Java compiler complains about "illegal character: \9082". Even though the functional symbol would be more appropriate in this code. There seems to be no solid rule about which characters are acceptable, except asking Character.isJavaIdentifierPart().
Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
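Two of the points above (look-alike characters, and composed versus decomposed forms) can be reproduced in a few lines. This is only a sketch; the class name LookalikeDemo and its helpers are hypothetical, and all characters are written as escapes so they cannot be confused in the source itself.

```java
import java.text.Normalizer;

class LookalikeDemo {
    static boolean allowedInIdentifier(int codePoint) {
        return Character.isJavaIdentifierPart(codePoint);
    }

    /** True if the two strings become equal once both are normalized to NFC. */
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        int greekAlpha = 0x03B1; // GREEK SMALL LETTER ALPHA, a letter
        int aplAlpha   = 0x237A; // APL FUNCTIONAL SYMBOL ALPHA, a symbol

        System.out.println(allowedInIdentifier(greekAlpha)); // true
        System.out.println(allowedInIdentifier(aplAlpha));   // false

        String composed   = "\u00c1";  // A with acute as one code point
        String decomposed = "A\u0301"; // A followed by a combining acute accent

        // The compiler compares identifiers without normalizing, just like String.equals.
        System.out.println(composed.equals(decomposed));       // false
        System.out.println(sameAfterNfc(composed, decomposed)); // true
    }
}
```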
It looks good, as it uses the correct symbols, but how many of your team will know the keystrokes for those symbols?
I would use an English representation just to make it easier to type. Others might also not have a character set that supports those symbols set up on their PC.
That code is fine to read, but horrible to maintain. I suggest using plain English identifiers like so:
double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-delta....
It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?
Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?
It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to rewrite it the way Crozin suggests.
Why not?
If the people working on that code can type those easily, it's acceptable.
But god help those who can't display unicode, or who can't type them.
In a perfect world, this would be the recommended way.
Unfortunately, you run into character encodings when moving outside of plain 7-bit ASCII characters (UTF-8 is different from ISO-Latin-1, which is different from UTF-16, etc.), meaning that you will eventually run into problems. This has happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately only in strings. We then used the \u encoding for all those.
If you can be absolutely certain that you will never, ever run into such a thing (for instance, if your files contain a proper BOM), then by all means, do this. It will make your code more readable. If there is even the slightest doubt, then don't.
(Please note that "use non-English languages" is a different matter. I'm only talking about using symbols instead of letters.)
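The kind of breakage described here is what happens when bytes written in one encoding are read back in another. A minimal sketch of that effect on a single Scandinavian letter; the class name EncodingMismatchDemo and the helper misread are made up, and the letter is written with the \u escape mentioned above so the example itself is encoding-proof.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    /** Encodes as UTF-8 and decodes as ISO-8859-1, imitating a tool that guesses the wrong charset. */
    static String misread(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String aRing = "\u00e5"; // LATIN SMALL LETTER A WITH RING ABOVE

        // The same character needs a different number of bytes in each encoding.
        System.out.println(aRing.getBytes(StandardCharsets.UTF_8).length);      // 2
        System.out.println(aRing.getBytes(StandardCharsets.ISO_8859_1).length); // 1

        // Read with the wrong charset, one letter turns into two unrelated characters.
        System.out.println(misread(aRing).length());                            // 2
        System.out.println(misread(aRing).equals("\u00c3\u00a5"));              // true
    }
}
```

For source files, javac's -encoding option lets you state the encoding explicitly instead of relying on the platform default, which would be one way to reduce that risk.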
