Distinguishing between lexical error and semantic error - java

What is the difference between these two errors, lexical and semantic?
int d = "orange";
inw d = 4;
Would the first one be a semantic error, since you can't assign a string literal to an int? As for the second one, the individual tokens are messed up, so would it be lexical? That is my thought process. I could be wrong, but I'd like to understand this a little more.

There are really three commonly recognized levels of interpretation: lexical, syntactic and semantic. Lexical analysis turns a string of characters into tokens, syntactic builds the tokens into valid statements in the language and semantic interprets those statements correctly to perform some algorithm.
Your first error is semantic: while all the tokens are legal and the statement is well formed, it's not legal in Java to assign a string literal to an integer variable.
Your second error could be classified as lexical (as the string "inw" is not a valid keyword) or as syntactic ("inw" could be the name of a variable but it's not legal syntax to have a variable name in that context).
A semantic error can also be something that is legal in the language but does not represent the intended algorithm. For example: "1" + n is perfectly valid code, but if it is intended to do an arithmetic addition then it has a semantic error. Some semantic errors can be picked up by modern compilers, but ones such as these depend on the intention of the programmer.
See the answers to "What's the difference between syntax and semantics?" for more details.
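To make the "1" + n case concrete, here is a minimal sketch. The class and method names are made up for illustration. Both methods compile without complaint; only the programmer's intent tells you which one is wrong.

```java
class ConcatVersusAdd {
    // Intended: arithmetic addition. Written: string concatenation.
    static String buggy(int n) {
        return "1" + n;   // legal Java, but concatenates: "12" for n = 2
    }

    // What the programmer presumably meant.
    static int intended(int n) {
        return 1 + n;     // arithmetic addition: 3 for n = 2
    }

    public static void main(String[] args) {
        System.out.println(buggy(2));     // prints 12
        System.out.println(intended(2));  // prints 3
    }
}
```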

Related

Legacy Java Syntax

Reading the Java Code Conventions document from 1997, I saw this in an example on P16 about variable naming conventions:
int i;
char *cp;
float myWidth;
The second declaration is of interest - to me it looks a lot like how you might declare a pointer in C. It gives a syntax error when compiling under Java 8.
Just out of curiosity: was this ever valid syntax? If so, what did it mean?
It's a copy-paste error, I suppose.
From JLS 1 (which is really not that easy to find!), the section on local variable declarations states that such a declaration, in essence, is a type followed by an identifier. Note that there is no special reference made about *, but there is special reference made about [] (for arrays).
char is our type, so the only possibility that remains is that *cp is an identifier. The section on Identifiers states
An identifier is an unlimited-length sequence of Java letters and Java
digits, the first of which must be a Java letter.
...
A Java letter is a character for which the method Character.isJavaLetter (§20.5.17) returns true
And the JavaDoc for that method states:
A character is considered to be a Java letter if and only if it is a
letter (§20.5.15) or is the dollar sign character '$' (\u0024) or the
underscore ("low line") character '_' (\u005F).
so foo, _foo and $foo were fine, but *foo was never valid.
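The same rule can be checked with today's API. Character.isJavaLetter is deprecated in favor of Character.isJavaIdentifierStart, so the sketch below uses the newer methods. The class and method names are hypothetical, and the check is purely character based: it does not reject reserved words such as int.

```java
class IdentifierCheck {
    // True if s is shaped like a Java identifier: one "identifier start"
    // character followed by zero or more "identifier part" characters.
    // Caveat: reserved words (e.g. "int") also pass this check.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"foo", "_foo", "$foo", "*foo", "2a"}) {
            System.out.println(candidate + " -> " + looksLikeIdentifier(candidate));
        }
    }
}
```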
If you want a more up-to-date Java style guide, Google's style guide is arguably the most commonly referenced.
It appears that this is a generic coding style document for C-like languages with some Java-specific additions. See, for example, also the next page:
Do not use the assignment operator in a place where it can be easily confused with the equality operator. Example:
if (c++ = d++) { // AVOID! Java disallows.
…
}
It does not make sense to tell a programmer to avoid something that is a syntax error anyway, so the only conclusion we can draw from this is that the document is not 100% Java-specific.
Another possibility is that it was meant as a coding style for the entire Java system, including the C++ parts of the JRE and JDK.
Note that Sun abandoned the coding style document long before Oracle came into the picture. They restricted themselves to specifying what the language is, not how to use it.
Invalid syntax!
It's just a copy/paste mistake.
The * token in a variable declaration applies only to C and C++, which have explicit pointer types; Java has no pointer declaration syntax.
In Java, the * token is used only as an operator (and as the wildcard in import statements).

How to determine if a string is a valid Java expression in Java?

I want to determine whether a given string is a valid Java expression (according to Java's syntax).
For example:
object.apply()
x == 2
(x != null) && x.alive
Are all valid expressions in Java.
But:
object.apply();
==
for(int i=1; i < n; ++i) i.print();
Are not valid expression in Java (some are valid statements, but this is not what I'm looking for).
Is there a simple solution? (like isJavaIdentifierStart and isJavaIdentifierPart when one wants to determine whether a string is a valid identifier)
You need to parse the expression the same way the Java compiler would parse it, following the Java language standard specification.
Building your own parser from scratch is not a good idea; the Java syntax has gotten complicated in the last decade. You should find an existing Java parser and reuse that so you don't have to reinvent the wheel incorrectly.
JavaCC and ANTLR are both available in Java-form, and have Java grammars defined for them. I suggest you consider them as prime candidates. A complication is that these parsers parse full programs, not expressions. You can fix that by modifying the grammar to make expression a goal rule, and then fixing any grammar conflicts that may produce; I would not expect much.
A more complex issue: just because the syntax is valid, doesn't mean the expression is valid. I'm pretty sure that the syntax of java will accept:
"abc" * 17.2
as valid syntax.
If you want to verify the validity of the expression, you have to type-check it, using the context in which the expression will be evaluated to provide the background type information. Otherwise one will accept this as valid:
s * d // expression that parses correctly, but isn't valid
when the background knowledge is this:
Object s;
char d;
Doing a full type check is much, much harder. As a practical matter, you'll need a full Java compiler front end, which parses and does the type checking.
Parser generators (e.g., ANTLR, JavaCC) provide zero help doing this.
So you either use the Java compiler or search for a Java front end; there are a few. [Full disclosure: my company provides one that can do this].
Nope, there is definitely not a simple way to check whether a String is valid Java code. I can think of only two ways.
1. Export to a file and compile it
You can save the String as a file with the .java suffix and compile it. From the result of the compilation, you can tell whether the String is valid or not.
2. Java parser
You may find a library able to do that. Take a look at JavaCC. Here I cite from their site:
A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar.
The notion of a "valid Java expression" is ... rubbery.
For example:
1 == true
is syntactically valid, but a Java compiler would reject it because == cannot be used with operands that have those types. Then:
x.length() == 42
may or may not be valid, depending on the declared type of x.
If you are simply interested in whether an expression is syntactically valid, then a parser for a subset of the Java language is sufficient.
On the other hand, if you want to check if the expression would be compilable when embedded into a Java program, then the simplest approach is to embed the expression in an equivalent context and compile it with a real Java compiler.
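The "embed and compile" approach can be sketched with the standard javax.tools API. The Probe wrapper class, the compiles helper and its parameters argument are all assumptions made for this illustration. Variables are supplied as method parameters so the compiler does not complain about uninitialized locals. The sketch has known gaps: an expression of type void, such as a call to a void method, is rejected because it cannot be assigned to Object, and a maliciously crafted string could break out of the wrapper.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class ExpressionCheck {
    // A source file held in memory instead of on disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // parameters: declarations giving the expression its type context, e.g. "int x".
    // Returns true if the wrapped expression parses AND type-checks.
    static boolean compiles(String parameters, String expression) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler (running on a JRE?)");
        }
        // Hypothetical wrapper: the expression becomes an initializer inside a method.
        String source = "class Probe { void probe(" + parameters + ") {"
                + " Object result = (" + expression + "); } }";
        try {
            // Class files go to a throwaway directory (not cleaned up in this sketch).
            Path outDir = Files.createTempDirectory("probe");
            DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
            return compiler.getTask(null, null, diagnostics,
                    List.of("-d", outDir.toString()), null,
                    List.of(new StringSource("Probe", source))).call();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("int x", "x == 2"));
        System.out.println(compiles("int x", "1 == true"));
        System.out.println(compiles("Object s, char d", "s * d"));
    }
}
```

The DiagnosticCollector gathers the compiler's error messages, so a caller that wants to report why an expression was rejected could return those instead of a boolean.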
You can create a parser with ANTLR
and you can define your own rules.

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
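The idea can be sketched outside ANTLR as a small hand-written lexer in Java: let a single rule swallow any run of letters and digits that starts with a digit, then check afterwards whether the run is a well-formed integer. All names below are hypothetical, and keywords such as int are simply reported as identifiers to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberLikeLexer {
    // "numlike" consumes the whole run (e.g. 2a) as ONE token, so the lexer
    // never gets the chance to split it into INTLIT(2) followed by ID(a).
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<numlike>[0-9][0-9A-Za-z_]*)"
            + "|(?<id>[A-Za-z][A-Za-z0-9_]*)"
            + "|(?<other>\\S))");

    static List<String> lex(String input) {
        String s = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalStateException("no token at offset " + pos);
            }
            String numLike = m.group("numlike");
            if (numLike != null) {
                // Validate after the fact: only pure digit runs are integers.
                tokens.add(numLike.matches("[0-9]+")
                        ? "INTLIT(" + numLike + ")"
                        : "ERROR(" + numLike + ")");
            } else if (m.group("id") != null) {
                tokens.add("ID(" + m.group("id") + ")");
            } else {
                tokens.add("SYM(" + m.group("other") + ")");
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("int 2a;"));       // [ID(int), ERROR(2a), SYM(;)]
        System.out.println(lex("int a2 = 42;"));  // [ID(int), ID(a2), SYM(=), INTLIT(42), SYM(;)]
    }
}
```

In an ANTLR grammar the equivalent move would be an extra lexer rule for the digit-led run, which wins by longest match and can then be reported as an error.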

Why can't an identifier begin with a digit in java

I cannot think of any reason other than "a string of digits would be a valid identifier as well as a valid number."
Is there any other explanation other than this one?
Because that would make telling number literals from symbol names a serious PITA.
For example, if a digit were valid as the first character, variables named 0xdeadbeef or 0xc001cafe would be valid. But those could be interpreted as hexadecimal numbers as well. By limiting the first character of a symbol to be a non-digit, ambiguities of that kind are avoided.
If it could then this assignment would be possible
int 33 = 44; // oh oh
then how would the compiler distinguish between a numeric literal and a variable?
It's to keep the rules simple for the compiler as well as for the programmer.
An identifier could be defined as any alphanumeric sequence that can not be interpreted as a number, but you would get into situations where the compiler would interpret the code differently from what you expect.
Example:
double 1e = 9;
double x = 1e-4;
The result in x would not be 5 but 0.0001 as 1e-4 is a number in scientific notation and not interpreted as 1e minus 4.
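A small sketch shows how real Java reads the two number-like spellings mentioned in the answers above. The class and method names are made up for illustration. Each spelling is lexed as a single literal token, never as an identifier.

```java
class LiteralAmbiguity {
    static double scientific() {
        return 1e-4;          // one token: floating-point literal, not "1e minus 4"
    }

    static int hex() {
        return 0xdeadbeef;    // one token: hexadecimal int literal
    }

    public static void main(String[] args) {
        System.out.println(scientific());  // prints 1.0E-4
        System.out.println(hex());         // prints -559038737
    }
}
```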
This is done in Java and in many other languages so that a parser could classify a terminal symbol uniquely regardless of its surrounding context. Technically, it is entirely possible to allow identifiers that look like numbers or even like keywords: for example, it is possible to write a parser that lifts the restriction on identifiers, allowing you to write something like this:
int 123 = 321; // 123 is an identifier in this imaginary compiler
The compiler knows enough to "understand" that whatever comes after the type name must be a variable name, so 123 is an identifier, and so it could treat this as a valid declaration. However, this would create more ambiguities down the road, because 123 becomes an invalid number "shadowed" by your new "identifier".
In the end, the rule works both ways: it helps compiler designers write simpler compilers, and it also helps programmers write readable code.
Note that there were attempts in the past to build compilers that are not particularly picky about names of identifiers - for example
int a real int = 3
would declare an identifier with spaces (i.e. "a real int" is a single identifier). This did not help readability, though, so modern compilers abandoned the trend.

how to work with string literals mistakes in a compiler

I am working on a lexical analyzer, which is the first step in building a compiler. Given a .txt file, the code has to identify each of the lexical components. For example, if I have
String c = "abcdefg";
it has to print
String -> type
c -> variable
= -> assignment operator
"abcdefg" -> constant String
; -> Delimit
but if i have something like this:
String c = "abc
d"; System.out.println("*");
the compiler will say: String literal is not properly closed by a double quote. But how should the Java compiler handle the other statement, System.out.println("*");? Should it ignore it, or should it still identify its elements?
The nub of your question is this:
But how should the Java compiler handle the other statement, System.out.println("*");? Should it ignore it, or should it still identify its elements?
First of all, try it out and see what error messages the Java compiler actually gives you in an example like that. (Obviously, you need to tweak your test case to isolate the handling of that particular situation ...)
You will most likely find that the compiler doesn't do a perfect job of recovering. I would expect that the strategy for dealing with strings that are not closed at the end of line would be to assume that the string literal is closed and continue "lexing" in non-quoted mode. But in your example, that is liable to give further errors.
Which brings me to my second point. I would advise you not to try too hard with recovery from lexical errors. Focus on getting the lexer / compiler to work in the cases where the input is valid. You can always come back and improve the error recovery later ... when you have more important things working properly.
(And @EJP's comment is spot on. The "heavy duty" error recovery is typically done at the parser level, not the lexer level.)
Finally: your requirements:
String c = "abcdefg";
it has to print
String -> type
c -> variable
= -> assignment operator
"abcdefg" -> constant String
; -> Delimit
If you are parsing real Java, then a (pure) lexer cannot do that. The problem is that it is not possible to determine in the lexer that String is a type (or should be a type) and c is a variable name (or should be a variable name). Indeed, if you ignore the grammatical context (which is typically not available to a lexer!) then String could be all sorts of things, including a class name, a package name, a method name, a field name, a local variable name and so on.
The normal way to handle that is for the lexical analyser to treat both String and c as "identifier"s ... and leave it for the parser and/or the semantic analyser to sort it out.
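Both points can be sketched together in a small hand-written lexer. It is an illustration under simplifying assumptions, not how javac works, and every name in it is hypothetical. String and c both come out as plain identifiers, and an unterminated string literal is treated as closed at the end of the line so that lexing can continue.

```java
import java.util.ArrayList;
import java.util.List;

class TinyLexer {
    static List<String> lex(String source) {
        List<String> tokens = new ArrayList<>();
        int n = source.length();
        int i = 0;
        while (i < n) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < n && Character.isJavaIdentifierPart(source.charAt(i))) {
                    i++;
                }
                // No attempt to say "type" or "variable": that is the parser's job.
                tokens.add("IDENTIFIER " + source.substring(start, i));
            } else if (c == '"') {
                int start = i++;
                while (i < n && source.charAt(i) != '"' && source.charAt(i) != '\n') {
                    // Skip the character after a backslash so \" does not end the literal.
                    if (source.charAt(i) == '\\' && i + 1 < n && source.charAt(i + 1) != '\n') {
                        i++;
                    }
                    i++;
                }
                if (i < n && source.charAt(i) == '"') {
                    i++;
                    tokens.add("STRING " + source.substring(start, i));
                } else {
                    // Recovery: pretend the literal closed at end of line and carry on.
                    tokens.add("ERROR unterminated string " + source.substring(start, i));
                }
            } else {
                tokens.add("SYMBOL " + c);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("String c = \"abcdefg\";").forEach(System.out::println);
        System.out.println("---");
        lex("String c = \"abc\nd\"; System.out.println(\"*\");").forEach(System.out::println);
    }
}
```

On the broken input, the recovery keeps going but the quotes on the second line are now paired wrongly, so the lexer reports a bogus string literal and a second unterminated string. That is the kind of follow-on error the answer warns about.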
