Could anyone tell me the difference between terminal and non-terminal symbols in the case of Java?
Does terminal mean a keyword and non-terminal any common string literal?
In grammars, a terminal is some form of token (keyword, identifier, symbol, literal, etc.) whilst a non-terminal references rules.
So both a keyword and a literal string would be terminals. A statement would be non-terminal.
(That's probably a really bad description. Read the dragon book.)
EDIT (not by original answerer): I'd never heard of the dragon book, so here's a reference.
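To make the distinction concrete, here is a minimal sketch of a recognizer for a toy grammar. The grammar, the class name ToyGrammar and the method names are made up for illustration and are not taken from the Java Language Specification. Each method stands for a non-terminal (a rule); the strings it consumes are the terminals (tokens).

```java
import java.util.List;

// Toy grammar (an assumption for illustration only):
//   Sum    -> Number ( '+' Number )*
//   Number -> DIGITS
// Non-terminals: Sum, Number. Terminals: DIGITS and '+'.
class ToyGrammar {
    private final List<String> tokens;
    private int pos = 0;

    private ToyGrammar(List<String> tokens) {
        this.tokens = tokens;
    }

    // True if the whole token list can be derived from the rule Sum.
    static boolean accepts(List<String> tokens) {
        ToyGrammar parser = new ToyGrammar(tokens);
        return parser.sum() && parser.pos == tokens.size();
    }

    // Non-terminal Sum: refers to the rule Number and the terminal '+'.
    private boolean sum() {
        if (!number()) return false;
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++; // consume the terminal '+'
            if (!number()) return false;
        }
        return true;
    }

    // Non-terminal Number: consumes exactly one DIGITS terminal.
    private boolean number() {
        if (pos < tokens.size() && tokens.get(pos).matches("[0-9]+")) {
            pos++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts(List.of("1", "+", "20"))); // true
        System.out.println(accepts(List.of("1", "+")));       // false
    }
}
```

In a real Java grammar the same split holds: tokens such as keywords, identifiers and literals are the terminals, and rules such as Statement or Expression are the non-terminals.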
Your question is rather unclear. Are you talking about the formal grammar which describes the Java language? If so, everything you see in a syntactically valid Java file is (part of) a terminal.
A string is 'in' the language described
by some grammar if it can be produced
by applying the production rules of
the grammar until only terminals
remain.
Perhaps you should check out the Java Language Specification.
Reading the Java Code Conventions document from 1997, I saw this in an example on P16 about variable naming conventions:
int i;
char *cp;
float myWidth;
The second declaration is of interest - to me it looks a lot like how you might declare a pointer in C. It gives a syntax error when compiling under Java 8.
Just out of curiosity: was this ever valid syntax? If so, what did it mean?
It's a copy-paste error, I suppose.
From JLS 1 (which is really not that easy to find!), the section on local variable declarations states that such a declaration, in essence, is a type followed by an identifier. Note that there is no special reference made about *, but there is special reference made about [] (for arrays).
char is our type, so the only possibility that remains is that *cp is an identifier. The section on Identifiers states
An identifier is an unlimited-length sequence of Java letters and Java
digits, the first of which must be a Java letter.
...
A Java letter is a character for which the method Character.isJavaLetter (§20.5.17) returns true
And the JavaDoc for that method states:
A character is considered to be a Java letter if and only if it is a
letter (§20.5.15) or is the dollar sign character '$' (\u0024) or the
underscore ("low line") character '_' (\u005F).
so foo, _foo and $foo were fine, but *foo was never valid.
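You can check this yourself with a small sketch. It uses Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, the current replacements for the deprecated isJavaLetter; the class and method names are made up for the example. Note that it only looks at characters and does not reject keywords.

```java
class IdentifierCheck {
    // Character-level check only: keywords such as "int" still pass here.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIdentifier("foo"));  // true
        System.out.println(looksLikeIdentifier("_foo")); // true
        System.out.println(looksLikeIdentifier("$foo")); // true
        System.out.println(looksLikeIdentifier("*cp"));  // false
    }
}
```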
If you want a more up-to-date Java style guide, Google's style guide is arguably the most commonly referenced.
It appears that this is a generic coding style document for C-like languages with some Java-specific additions. See, for example, also the next page:
Do not use the assignment operator in a place where it can be easily confused with the equality operator. Example:
if (c++ = d++) { // AVOID! Java disallows.
…
}
It does not make sense to tell a programmer to avoid something that is a syntax error anyway, so the only conclusion we can draw from this is that the document is not 100% Java-specific.
Another possibility is that it was meant as a coding style for the entire Java system, including the C++ parts of the JRE and JDK.
Note that Sun abandoned the coding style document long before Oracle came into the picture. They restricted themselves to specifying what the language is, not how to use it.
Invalid syntax!
It's just a copy/paste mistake.
The * token in a variable declaration applies only in C, which declares pointers that way; Java has no pointer declarations.
In Java expressions, * is only the multiplication operator.
A comment in this question said that numbers outside of java code are not called literals. Are literals only applicable in java code?
Can't we say the following number in a .properties file is a literal?
min-score=20
A literal is a fixed value written directly within source code. That's the simplest definition: a fixed value within code. A .properties file is not a code file, so numbers in it are not literals.
Can't we say the following number in a .properties file is a literal?
How do you know it's a number? The concatenation of the characters '2' and '0' is interpretable as a representation of the number 20, but it's also interpretable as a representation of the string "20". There are surely other interpretations.
The term "literal" has a specific meaning in the context of Java code, and that particular meaning does not apply to other files. It does not necessarily apply even to source files written in other languages. For example, C has string literals, but integer constants. In more general contexts, the term "literal" is rarely used in this way at all, because the key characteristic differentiating a Java literal from other expressions of the same type either does not arise or is not important.
In one sense, of course you can describe the property value as a "literal". Inasmuch as .properties files are mostly a Java thing, most people to whom you might say that would probably understand you. But many of them would probably look at you oddly. That word usage doesn't really apply to the situation.
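A short sketch of the point about interpretation: java.util.Properties hands the value back as a String, and it only becomes a number when the program decides to parse it. The key name min-score and the class name are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertyValueDemo {
    // Loads properties from text and returns the raw value for the key.
    static String rawValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String raw = rawValue("min-score=20\n", "min-score");
        System.out.println(raw + 1);                   // prints "201": still a String
        System.out.println(Integer.parseInt(raw) + 1); // prints "21": parsed as an int
    }
}
```

Nothing in the file itself says which reading is intended; that choice is made by the Java code that consumes it.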
I want to determine whether a given string is a valid Java expression (according to Java's syntax).
For example:
object.apply()
x == 2
(x != null) && x.alive
Are all valid expressions in Java.
But:
object.apply();
==
for(int i=1; i < n; ++i) i.print();
Are not valid expressions in Java (some are valid statements, but this is not what I'm looking for).
Is there a simple solution? (like isJavaIdentifierStart and isJavaIdentifierPart when one wants to determine whether a string is a valid identifier)
You need to parse the expression the same way the Java compiler would parse it, following the Java language standard specification.
Building your own parser from scratch is not a good idea; the Java syntax has gotten complicated in the last decade. You should find an existing Java parser and reuse that so you don't have to reinvent the wheel incorrectly.
JavaCC and ANTLR are both available in Java-form, and have Java grammars defined for them. I suggest you consider them as prime candidates. A complication is that these parsers parse full programs, not expressions. You can fix that by modifying the grammar to make expression a goal rule, and then fixing any grammar conflicts that may produce; I would not expect much.
A more complex issue: just because the syntax is valid doesn't mean the expression is valid. I'm pretty sure that the syntax of Java will accept:
"abc" * 17.2
as valid syntax.
If you want to verify the validity of the expression, you have to type-check it, using the context in which the expression will be evaluated to provide the background type information. Otherwise one will accept this as valid:
s * d // expression that parses correctly, but isn't valid
when the background knowledge is this:
Object s;
char d;
Doing a full type check is much, much harder. As a practical matter, you'll need a full Java compiler front end, which parses and does the type checking.
Parser generators (e.g., ANTLR, JavaCC) provide zero help doing this.
So you either use the Java compiler or search for a Java front end; there are a few. [Full disclosure: my company provides one that can do this].
Nope, there is definitely not a simple way to check whether a String is valid Java code. I can think of only two ways.
1. Export to a file and compile it
You can save the String as a file with the .java suffix and compile it. From the result of the compilation, you can tell whether the String is valid or not.
2. Java parser
You may find a library able to do that. Take a look at JavaCC. Here I cite from their site:
A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar.
The notion of a "valid Java expression" is ... rubbery.
For example:
1 == true
is syntactically valid, but a Java compiler would reject it because == cannot be used with operands that have those types. Then:
x.length() == 42
may or may not be valid, depending on the declared type of x.
If you are simply interested in whether an expression is syntactically valid, then a parser for a subset of the Java language is sufficient.
On the other hand, if you want to check if the expression would be compilable when embedded into a Java program, then the simplest approach is to embed the expression in an equivalent context and compile it with a real Java compiler.
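For the syntax-only case, one possible sketch is to wrap the expression in a dummy method and let the JDK's own parser look at it, without type checking. This assumes the code runs on a JDK (ToolProvider.getSystemJavaCompiler() returns null on a plain JRE) and uses JavacTask.parse() from the jdk.compiler module; the names ExpressionSyntaxCheck, StringSource and Wrapper are made up for the example. It is not injection-proof: input that closes the parenthesis and method itself can fool the wrapper.

```java
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class ExpressionSyntaxCheck {

    // In-memory source file handed to the compiler.
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String code) {
            super(URI.create("string:///Wrapper.java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Parses only; no symbol resolution or type checking is performed.
    static boolean isSyntacticallyValid(String expr) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler; a JDK is required");
        }
        // Hypothetical wrapper: the expression becomes the operand of a return.
        String source = "class Wrapper { Object f() { return (" + expr + "\n); } }";
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, null, null, List.of(new StringSource(source)));
        try {
            task.parse();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return diagnostics.getDiagnostics().stream()
                .noneMatch(d -> d.getKind() == Diagnostic.Kind.ERROR);
    }

    public static void main(String[] args) {
        System.out.println(isSyntacticallyValid("x == 2"));
        System.out.println(isSyntacticallyValid("object.apply();"));
        System.out.println(isSyntacticallyValid("1 == true"));
    }
}
```

Because only the parser runs, 1 == true is reported as fine here; catching that kind of error needs the full compile with the right declarations in scope.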
You can create a parser with ANTLR
and you can define your own rules.
What is the difference between these two errors, lexical and semantic?
int d = "orange";
inw d = 4;
Would the first one be a semantic error? Since you can't assign a literal to an int? As for the second one the individual tokens are messed up so it would be lexical? That is my thought process, I could be wrong but I'd like to understand this a little more.
There are really three commonly recognized levels of interpretation: lexical, syntactic and semantic. Lexical analysis turns a string of characters into tokens, syntactic builds the tokens into valid statements in the language and semantic interprets those statements correctly to perform some algorithm.
Your first error is semantic: while all the tokens are legal, it's not legal in Java to assign a string constant to an integer variable.
Your second error could be classified as lexical (as the string "inw" is not a valid keyword) or as syntactic ("inw" could be the name of a variable but it's not legal syntax to have a variable name in that context).
A semantic error can also be something that is legal in the language but does not represent the intended algorithm. For example: "1" + n is perfectly valid code but if it is intending to do an arithmetic addition then it has a semantic error. Some semantic errors can be picked up by modern compilers but ones such as these depend on the intention of the programmer.
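A tiny sketch of that last kind of error, with made-up class and method names: both methods compile, but only one of them does arithmetic.

```java
class ConcatVsAdd {
    // Compiles fine, but + here is string concatenation.
    static String concatenated(int n) {
        return "1" + n;
    }

    // The arithmetic the programmer may have intended.
    static int added(int n) {
        return 1 + n;
    }

    public static void main(String[] args) {
        System.out.println(concatenated(5)); // prints "15"
        System.out.println(added(5));        // prints "6"
    }
}
```

The compiler cannot flag the first method, because nothing in the code says which result was meant.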
See the answers to "What's the difference between syntax and semantics?" for more details.
In Java, and it seems in a few other languages, backreferences in the pattern are preceded by a backslash (e.g. \1, \2, \3, etc), but in a replacement string they are preceded by a dollar sign (e.g. $1, $2, $3, and also $0).
Here's a snippet to illustrate:
System.out.println(
"left-right".replaceAll("(.*)-(.*)", "\\2-\\1") // WRONG!!!
); // prints "2-1"
System.out.println(
"left-right".replaceAll("(.*)-(.*)", "$2-$1") // CORRECT!
); // prints "right-left"
System.out.println(
"You want million dollar?!?".replaceAll("(\\w*) dollar", "US\\$ $1")
); // prints "You want US$ million?!?"
System.out.println(
"You want million dollar?!?".replaceAll("(\\w*) dollar", "US$ \\1")
); // throws IllegalArgumentException: Illegal group reference
Questions:
Is the use of $ for backreferences in replacement strings unique to Java? If not, what language started it? What flavors use it and what don't?
Why is this a good idea? Why not stick to the same pattern syntax? Wouldn't that lead to a more cohesive and an easier to learn language?
Wouldn't the syntax be more streamlined if statements 1 and 4 in the above were the "correct" ones instead of 2 and 3?
Is the use of $ for backreferences in replacement strings unique to Java?
No. Perl uses it, and Perl certainly predates Java's Pattern class. Java's regex support is explicitly described in terms of Perl regexes.
For example: http://perldoc.perl.org/perlrequick.html#Search-and-replace
Why is this a good idea?
Well obviously you don't think it is a good idea! But one reason that it is a good idea is to make Java search/replace support (more) compatible with Perl's.
There is another possible reason why $ might have been viewed as a better choice than \. That is that \ has to be written as \\ in a Java String literal.
But all of this is pure speculation. None of us were in the room when the design decisions were made. And ultimately it doesn't really matter why they designed the replacement String syntax that way. The decisions have been made and set in concrete, and any further discussion is purely academic ... unless you just happen to be designing a new language or a new regex library for Java.
After doing some research, I've understood the issues now: Perl had to use a different symbol for pattern backreferences and replacement backreferences, and while java.util.regex.* doesn't have to follow suit, it chooses to, not for a technical but rather traditional reason.
On the Perl side
(Please keep in mind that all I know about Perl at this point comes from reading Wikipedia articles, so feel free to correct any mistakes I may have made)
The reason why it had to be done this way in Perl is the following:
Perl uses $ as a sigil (i.e. a symbol attached to variable name).
Perl string literals are variable interpolated.
Perl regex actually captures groups as variables $1, $2, etc.
Thus, because of the way Perl is interpreted and how its regex engine works, a preceding backslash for backreferences (e.g. \1) in the pattern must be used, because if the sigil $ were used instead (e.g. $1), it would cause unintended variable interpolation into the pattern.
The replacement string, due to how it works in Perl, is evaluated within the context of every match. It is most natural for Perl to use variable interpolation here, so the regex engine captures groups into variables $1, $2, etc, to make this work seamlessly with the rest of the language.
References
Wikipedia/String literal - variable interpolation
Wikipedia/Sigil (computer programming)
On the Java side
Java is a very different language than Perl, but most importantly here, there is no variable interpolation. Moreover, replaceAll is a method call, and as with all method calls in Java, arguments are evaluated once, prior to the method being invoked.
Thus, variable interpolation feature by itself is not enough, since in essence the replacement string must be re-evaluated on every match, and that's just not the semantics of method calls in Java. A variable-interpolated replacement string that is evaluated before the replaceAll is even invoked is practically useless; the interpolation needs to happen during the method, on every match.
Since that is not the semantics of Java language, replaceAll must do this "just-in-time" interpolation manually. As such, there is absolutely no technical reason why $ is the escape symbol for backreferences in replacement strings. It could've very well been the \. Conversely, backreferences in the pattern could also have been escaped with $ instead of \, and it would've still worked just as fine technically.
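To illustrate what "on every match" means, here is a rough sketch that does the per-match work by hand with Matcher.find, appendReplacement and appendTail. The class and method names are made up, and this is only an approximation of the idea, not how replaceAll is actually implemented. Matcher.quoteReplacement is used so that any $ or \ in the captured text is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualReplace {
    // Swaps the two words around each '-', building the replacement per match.
    static String swapAroundDash(String input) {
        Matcher m = Pattern.compile("(\\w+)-(\\w+)").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The groups are only known here, after this particular match.
            String replacement = m.group(2) + "-" + m.group(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(swapAroundDash("left-right"));  // prints "right-left"
        System.out.println(swapAroundDash("a-b and c-d")); // prints "b-a and d-c"
    }
}
```

The $1 and $2 notation in a replacement string is shorthand for this kind of per-match lookup; any other escape character could have been chosen for it.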
The reason Java does regex the way it does is purely traditional: it's simply following the precedent set by Perl.