I cannot think of any explanation other than this: a string of digits would be a valid identifier as well as a valid number. Is there any other explanation besides this one?
Because that would make telling number literals from symbol names a serious PITA.
For example, if a digit were valid as the first character, variable names such as 0xdeadbeef or 0xc00lcafe would be valid. But each of those could equally be interpreted as a hexadecimal number. By requiring the first character of a symbol to be a non-digit, ambiguities of that kind are avoided.
If it could, then this assignment would be possible:
int 33 = 44; // oh oh
then how would the compiler distinguish between a numeric literal and a variable name?
It's to keep the rules simple for the compiler as well as for the programmer.
An identifier could be defined as any alphanumeric sequence that cannot be interpreted as a number, but you would run into situations where the compiler interprets the code differently from what you expect.
Example:
double 1e = 9;
double x = 1e-4;
The result in x would not be 5 but 0.0001, because 1e-4 is a number in scientific notation and is not interpreted as 1e minus 4.
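To see the scientific-notation reading in action, here is a minimal runnable sketch (the class name is made up):

```java
public class SciNotationDemo {
    public static void main(String[] args) {
        // 1e-4 is lexed as a single token: 1 × 10^-4 in scientific notation
        double x = 1e-4;
        System.out.println(x); // prints 1.0E-4
        // "double 1e = 9;" would not even compile:
        // the lexer grabs "1e" as (the start of) a numeric literal
    }
}
```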
This is done in Java and in many other languages so that a parser could classify a terminal symbol uniquely regardless of its surrounding context. Technically, it is entirely possible to allow identifiers that look like numbers or even like keywords: for example, it is possible to write a parser that lifts the restriction on identifiers, allowing you to write something like this:
int 123 = 321; // 123 is an identifier in this imaginary compiler
The compiler knows enough to "understand" that whatever comes after the type name must be a variable name, so 123 is an identifier, and it could treat this as a valid declaration. However, this would create more ambiguities down the road, because 123 would become an invalid number, "shadowed" by your new "identifier".
In the end, the rule works both ways: it helps compiler designers write simpler compilers, and it also helps programmers write readable code.
Note that there were attempts in the past to build compilers that are not particularly picky about names of identifiers - for example
int a real int = 3
would declare an identifier with spaces (i.e. "a real int" is a single identifier). This did not help readability, though, so modern compilers abandoned the trend.
Related
In Java, variable names start with a letter, currency character ($) etc. but not with number, :, or .
Simple question: why is that?
Why doesn't the compiler allow to have variable declarations such as
int 7dfs;
Simply put, it would break facets of the language grammar.
For example, would 7f be a variable name, or a floating point literal with a value of 7?
You can conjure others too: if . was allowed then that would clash with the member selection operator: would foo.bar be an identifier in its own right, or would it be the bar field of an object instance foo?
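A quick sketch of why 7f is already spoken for (class name is mine):

```java
public class LiteralDemo {
    public static void main(String[] args) {
        float f = 7f;  // 7f is a float literal with value 7.0
        long n = 7L;   // likewise, 7L is a long literal
        System.out.println(f + " " + n); // prints 7.0 7
        // "float 7f = 1;" could never compile: 7f is a literal, not a name
    }
}
```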
Because the Java Language specification says so:
IdentifierChars:
JavaLetter {JavaLetterOrDigit}
So, yes: an identifier must start with a "Java letter" (which includes the letters, $ and _); it can't start with a digit.
The main reasons behind that:
it is simply what most people expect
it makes parsing source code (much) easier when you restrict the "layout" of identifiers; for example it reduces the possible ambiguities between literals and variable names.
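A short sketch of what the rule allows and forbids (variable names here are arbitrary):

```java
public class IdentifierDemo {
    public static void main(String[] args) {
        int $price = 1;   // valid: '$' counts as a "Java letter"
        int _count = 2;   // valid: so does '_'
        int x9 = 3;       // valid: digits are fine after the first character
        // int 9x = 4;    // invalid: would not compile, starts with a digit
        System.out.println($price + _count + x9); // prints 6
    }
}
```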
In Java, for Double, we have a value for NaN (Not A Number).
Now, for Character, do we have a similar equivalent for "Not A Character"?
If the answer is no, then I think a safe substitute may be Character.MIN_VALUE (which is of type char and has value \u0000). Do you think this substitute is safe enough? Or do you have another suggestion?
In mathematics, there is a concept of "not a number" - 5 divided by 0 is not a number. Since this concept exists, there is NaN for the double type.
Characters are an abstract concept of mapping numbers to characters. The idea of "not a character" doesn't really exist, since the charset in use can vary (UTF-8, UTF-16, etc.).
Think of it this way. If I ask you, "what is 5 divided by 0?", you would say it's "not a number". But, we do have a defined way to represent the value, even though it's not a number. If I draw a random squiggle and ask you, "what letter is this?", you would say "it's not a letter". But, we don't have a way to actually represent that squiggle outside of what I just drew. There's no real way to communicate the "non-character" I've just drawn, but there is a way to communicate the "non-number" of 5 divided by 0.
\u0000 is the null character, which is still a character. What exactly are you trying to achieve? Depending on your goal \u0000 may suffice.
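For reference, \u0000 and Character.MIN_VALUE are the same value, as this minimal check shows:

```java
public class NullCharDemo {
    public static void main(String[] args) {
        char c = '\u0000';                            // the null character
        System.out.println(c == Character.MIN_VALUE); // prints true
        System.out.println((int) c);                  // prints 0
    }
}
```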
The "not-a-number" concept does not really belong to Java; rather, Java defines double as being IEEE 754 double precision floating-point numbers, which have that concept. (That said, if I recall correctly, Java does specify some details about NaN that IEEE 754 leaves open to implementations.)
The analogous standard for Java char is Unicode: Java defines char as being UTF-16 code units.
Unicode does have various reserved-undefined characters that you could use; for example, U+FFFF ('\uFFFF') will never be a character. Alternatively, you could use U+FFFD ('\uFFFD'), which is a character, but is specifically the "replacement character" suitable for replacing garbage or invalid characters.
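A sketch of using U+FFFF as a sentinel; the helper function and its name are invented for illustration:

```java
public class CharSentinelDemo {
    // U+FFFF is a Unicode noncharacter, so it can serve as a sentinel
    static final char NO_CHAR = '\uFFFF';

    // hypothetical helper: first vowel in a string, or the sentinel
    static char firstVowel(String s) {
        for (char c : s.toCharArray())
            if ("aeiou".indexOf(c) >= 0) return c;
        return NO_CHAR;
    }

    public static void main(String[] args) {
        System.out.println(firstVowel("hello"));             // prints e
        System.out.println(firstVowel("rhythm") == NO_CHAR); // prints true
    }
}
```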
Depends what you're trying to do. If you're trying to represent the lack of a character you could do
Optional<Character> noCharacter = Optional.empty();
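Fleshed out into a runnable sketch (the helper and its name are invented for illustration):

```java
import java.util.Optional;

public class OptionalCharDemo {
    // returns the first digit, or an empty Optional when there is none
    static Optional<Character> firstDigit(String s) {
        for (char c : s.toCharArray())
            if (Character.isDigit(c)) return Optional.of(c);
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstDigit("abc123")); // prints Optional[1]
        System.out.println(firstDigit("abc"));    // prints Optional.empty
    }
}
```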
You could check whether the character's code is between the values of 'a' and 'z', or between the values of 'A' and 'Z'. Anything outside those ranges would qualify as "not a character", if by "not a character" you mean not an alphabet letter. You could extend this to symbols like the question mark, full stop, comma, etc., but if you want to go beyond ASCII territory, I think it gets out of hand.
Another approach would be to check whether something is a digit. If it's not, you could check whether it's a whitespace character; if it's not that either, everything else qualifies as a character, and you get your answer.
It's a long discussion IMO, because answers vary, depending on your view on what's a character.
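The checks described above can lean on the standard Character helpers instead of hand-rolled code ranges; a sketch, with sample inputs of my own choosing:

```java
public class CharClassifyDemo {
    static String classify(char c) {
        if (Character.isDigit(c))      return "digit";
        if (Character.isWhitespace(c)) return "whitespace";
        if (Character.isLetter(c))     return "letter";
        return "other";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', 'Z', '7', ' ', '?'})
            System.out.println(classify(c));
    }
}
```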
What is the difference between these two errors, lexical and semantic?
int d = "orange";
inw d = 4;
Would the first one be a semantic error? Since you can't assign a literal to an int? As for the second one the individual tokens are messed up so it would be lexical? That is my thought process, I could be wrong but I'd like to understand this a little more.
There are really three commonly recognized levels of interpretation: lexical, syntactic and semantic. Lexical analysis turns a string of characters into tokens, syntactic builds the tokens into valid statements in the language and semantic interprets those statements correctly to perform some algorithm.
Your first error is semantic: while all the tokens are legal, it's not legal in Java to assign a string constant to an integer variable.
Your second error could be classified as lexical (as the string "inw" is not a valid keyword) or as syntactic ("inw" could be the name of a variable but it's not legal syntax to have a variable name in that context).
A semantic error can also be something that is legal in the language but does not represent the intended algorithm. For example: "1" + n is perfectly valid code but if it is intending to do an arithmetic addition then it has a semantic error. Some semantic errors can be picked up by modern compilers but ones such as these depend on the intention of the programmer.
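The "1" + n example, made runnable (a minimal sketch; the class name is mine):

```java
public class SemanticDemo {
    public static void main(String[] args) {
        int n = 2;
        // perfectly legal Java, but a semantic error if addition was intended:
        System.out.println("1" + n); // prints 12 (string concatenation)
        System.out.println(1 + n);   // prints 3  (arithmetic addition)
    }
}
```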
See the answers to whats-the-difference-between-syntax-and-semantics for more details.
I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is that the lexer recognizes 2 as an int literal and a as an ID, which is completely logical given the grammar I've written. But I don't want 2a to be recognized this way; instead, I want an error to be displayed, since identifiers cannot begin with anything other than a letter... I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
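One way to apply that hint outside of ANTLR is to lex any digit-led run of letters and digits as a single number-like token first, and only validate it afterwards. A rough Java sketch (the class name, regex details, and error message are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PpNumberSketch {
    // grab maximal digit-led runs of letters/digits/underscores as one token,
    // mirroring C's "preprocessor number" trick described above
    static final Pattern NUMBER_LIKE = Pattern.compile("\\d[\\p{Alnum}_]*");

    public static void main(String[] args) {
        Matcher m = NUMBER_LIKE.matcher("int 2a;");
        while (m.find()) {
            String tok = m.group();     // "2a" comes out as one token
            if (tok.matches("\\d+"))    // then validate it as an int literal
                System.out.println("int literal: " + tok);
            else
                System.out.println("error: invalid number '" + tok + "'");
        }
    }
}
```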
This is a question from a Java test I took at University
I. publicProtected
II. $_
III. _identi#ficador
IV. Protected
I'd say I, II, and IV are correct. What is the correct answer for this?
Source of the question (translated from Spanish): Given the following list of variable identifiers, which one(s) is (are) valid?
From the java documentation:
Variable names are case-sensitive. A variable's name can be any legal
identifier — an unlimited-length sequence of Unicode letters and
digits, beginning with a letter, the dollar sign "$", or the
underscore character "_". The convention, however, is to always begin
your variable names with a letter, not "$" or "_". Additionally, the
dollar sign character, by convention, is never used at all. You may
find some situations where auto-generated names will contain the
dollar sign, but your variable names should always avoid using it. A
similar convention exists for the underscore character; while it's
technically legal to begin your variable's name with "_", this
practice is discouraged. White space is not permitted. Subsequent
characters may be letters, digits, dollar signs, or underscore
characters. Conventions (and common sense) apply to this rule as well.
When choosing a name for your variables, use full words instead of
cryptic abbreviations. Doing so will make your code easier to read and
understand. In many cases it will also make your code
self-documenting; fields named cadence, speed, and gear, for example,
are much more intuitive than abbreviated versions, such as s, c, and
g. Also keep in mind that the name you choose must not be a keyword or
reserved word.
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/variables.html
In short: yes, you're right. You can use underscores, dollar signs, and letters to start a variable name. After the first character, you can also use digits. Note that using dollar signs is generally not good practice.
From your comment, you said that your teacher rejected II. As written in your question, II is perfectly fine (try it, it will run). However, if the question on your test asked which are "good" variable names, or which names follow common practice, then II would be eliminated, as explained in the quotation above. One reason is that dollar signs do not make for readable variable names; they're allowed because Java internally generates variable names that use the dollar sign.
What is the meaning of $ in a variable name?
As pointed out in the comments, IV is not a good name either, since its lower-case version "protected" is a reserved keyword. With syntax highlighting, you probably wouldn't confuse the two, but using keyword variations as variable names is certainly one way to confuse future readers.
private, protected, and public are reserved words in Java. Combine them with an underscore or another word if you want names based on them, for example:
int public_x;
int protected_x;
String private_s;