how to work with string literals mistakes in a compiler

I am working on a lexical analyzer, which is the first step in building a compiler. Given a .txt file, the code has to identify each of the lexical components. For example, if I have
String c = "abcdefg";
it has to print
String -> type
c -> variable
= -> assignment operator
"abcdefg" -> constant String
; -> Delimit
But if I have something like this:
String c = "abc
d"; System.out.println("*");
the compiler will say: String literal is not properly closed by a double quote. But how does the Java compiler have to deal with the other statement, the System.out.println("*");? Does it have to ignore it, or does it have to identify its elements?

The nub of your question is this:
But how does the Java compiler have to deal with the other statement, the System.out.println("*");? Does it have to ignore it, or does it have to identify its elements?
First of all, try it out and see what error messages the Java compiler actually gives you in an example like that. (Obviously, you need to tweak your test case to isolate the handling of that particular situation ...)
You will most likely find that the compiler doesn't do a perfect job of recovering. I would expect that the strategy for dealing with strings that are not closed at the end of line would be to assume that the string literal is closed and continue "lexing" in non-quoted mode. But in your example, that is liable to give further errors.
Which brings me to my second point. I would advise you not to try too hard with recovery from lexical errors. Focus on getting the lexer / compiler to work in the cases where the input is valid. You can always come back and improve on the error recovery later ... when you have got more important things working properly.
(And @EJP's comment is spot on. The "heavy duty" error recovery is typically done at the parser level, not the lexer level.)
Finally: your requirements:
String c = "abcdefg";
it has to print
String -> type
c -> variable
= -> assignment operator
"abcdefg" -> constant String
; -> Delimit
If you are parsing real Java, then a (pure) lexer cannot do that. The problem is that it is not possible to determine in the lexer that String is a type (or should be a type) and c is a variable name (or should be a variable name). Indeed, if you ignore the grammatical context (which is typically not available to a lexer!) then String could be all sorts of things, including a class name, a package name, a method name, a field name, a local variable name and so on.
The normal way to handle that is for the lexical analyser to treat both String and c as "identifier"s ... and leave it for the parser and/or the semantic analyser to sort it out.
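To make the two points above concrete, here is a minimal sketch of such a lexer. It is not how javac works; the class name SimpleLexer, the token kinds and the recovery rule (treat an unclosed string as closed at the end of the line) are all assumptions made for illustration. Note that String and c both come out as plain identifiers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal lexer; names and token kinds are illustrative only. */
class SimpleLexer {

    /** Only the kinds a lexer can decide without grammatical context. */
    enum Kind { IDENTIFIER, NUMBER, STRING_LITERAL, UNTERMINATED_STRING, SYMBOL }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return text + " -> " + kind; }
    }

    private static boolean isLineEnd(char ch) {
        return ch == '\n' || ch == '\r';
    }

    static List<Token> lex(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (Character.isJavaIdentifierStart(ch)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                tokens.add(new Token(Kind.IDENTIFIER, src.substring(start, i)));
            } else if (Character.isDigit(ch)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token(Kind.NUMBER, src.substring(start, i)));
            } else if (ch == '"') {
                int start = i++;
                // Scan to the closing quote, but never past the end of the line.
                while (i < src.length() && src.charAt(i) != '"' && !isLineEnd(src.charAt(i))) {
                    // Skip the character after a backslash so that \" does not end the literal.
                    if (src.charAt(i) == '\\' && i + 1 < src.length() && !isLineEnd(src.charAt(i + 1))) i++;
                    i++;
                }
                if (i < src.length() && src.charAt(i) == '"') {
                    i++;
                    tokens.add(new Token(Kind.STRING_LITERAL, src.substring(start, i)));
                } else {
                    // Assumed recovery: pretend the literal was closed and carry on in non-quoted mode.
                    tokens.add(new Token(Kind.UNTERMINATED_STRING, src.substring(start, i)));
                }
            } else {
                tokens.add(new Token(Kind.SYMBOL, String.valueOf(ch)));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("String c = \"abc\nd\"; System.out.println(\"*\");").forEach(System.out::println);
    }
}
```

Running it on the broken input from the question shows why this kind of recovery is imperfect: the quote after d on the second line is paired with the quote that was meant to open "*", so the println call is swallowed into a bogus string literal and a second unterminated string is reported at the end.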

Related

Distinguishing between lexical error and semantic error

What is the difference between these two errors, lexical and semantic?
int d = "orange";
inw d = 4;
Would the first one be a semantic error? Since you can't assign a literal to an int? As for the second one the individual tokens are messed up so it would be lexical? That is my thought process, I could be wrong but I'd like to understand this a little more.

There are really three commonly recognized levels of interpretation: lexical, syntactic and semantic. Lexical analysis turns a string of characters into tokens, syntactic builds the tokens into valid statements in the language and semantic interprets those statements correctly to perform some algorithm.
Your first error is semantic: while all the tokens are legal, it's not legal in Java to assign a string constant to an integer variable.
Your second error could be classified as lexical (as the string "inw" is not a valid keyword) or as syntactic ("inw" could be the name of a variable but it's not legal syntax to have a variable name in that context).
A semantic error can also be something that is legal in the language but does not represent the intended algorithm. For example: "1" + n is perfectly valid code but if it is intending to do an arithmetic addition then it has a semantic error. Some semantic errors can be picked up by modern compilers but ones such as these depend on the intention of the programmer.
See the answers to "What's the difference between syntax and semantics?" for more details.

Extract variables from code statement using regex

I'm trying to extract variables from code statements and "if" conditions. I have a regex for that, but mymatcher.find() doesn't return any matches.
I don't know what is wrong.
Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
    public static void main(String[] args) {
        String test="x=y+z/n-10+my5th_integer+201";
        Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
        Matcher mymatcher = mypattern.matcher(test);
        while (mymatcher.find()) {
            String find = mymatcher.group(1) ;
            System.out.println("variable:" + find);
        }
    }
}

You need to remove the ^ and $ anchors, which assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1) because you do not have any capturing groups in your regex:
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(0) ;
System.out.println("variable:" + find);
}
See IDEONE demo, the results are:
variable:x
variable:y
variable:z
variable:n
variable:my5th_integer

Usually processing source code with just a regex simply fails.
If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).
But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.
A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like any of a specific list of keywords. (The latter is called a subtractive regex, and isn't found in most regex engines. It looks something like:
[a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )
Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
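The first option (match identifier-shaped substrings, then subtract known keywords) can be sketched in plain Java as follows. The class name IdentifierFilter is made up for this example and the keyword set is deliberately incomplete; a real tool would need the full list from the language specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds identifier-shaped words, then drops known keywords. */
class IdentifierFilter {

    // Deliberately incomplete keyword list, for illustration only.
    private static final Set<String> KEYWORDS =
            Set.of("if", "else", "class", "while", "for", "return", "int", "new");

    private static final Pattern IDENTIFIER_SHAPE =
            Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");

    static List<String> identifiers(String code) {
        List<String> result = new ArrayList<>();
        Matcher m = IDENTIFIER_SHAPE.matcher(code);
        while (m.find()) {
            String word = m.group();
            if (!KEYWORDS.contains(word)) {   // the "subtraction" step
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(identifiers("if (x > limit) y = x1; else y = 0;"));
    }
}
```

This only addresses the keyword problem; it still has every other weakness discussed in this answer.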
This gets more complex if the "keywords" are not always keywords, that is,
they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know
if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).
Next, identifiers in Java allow a variety of Unicode characters (Latin1, Russian, Chinese, ...) A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
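One way to avoid writing that large character class by hand is to let the JDK decide which characters may appear in an identifier. The sketch below uses Character.isJavaIdentifierStart and Character.isJavaIdentifierPart on code points; the class name UnicodeIdentifiers is invented for the example, and it makes no attempt to handle keywords, strings, comments or digit-prefixed words such as 2a (which would yield a).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scanner that relies on the JDK's notion of identifier characters. */
class UnicodeIdentifiers {

    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isJavaIdentifierStart(cp)) {
                int start = i;
                i += Character.charCount(cp);
                // Extend the identifier while the following code points are allowed in it.
                while (i < text.length()) {
                    int next = text.codePointAt(i);
                    if (!Character.isJavaIdentifierPart(next)) break;
                    i += Character.charCount(next);
                }
                found.add(text.substring(start, i));
            } else {
                // Not an identifier start: skip this code point.
                i += Character.charCount(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // \u0446\u0435\u043d\u0430 is a Cyrillic word, written with escapes to keep this file ASCII.
        System.out.println(scan("\u0446\u0435\u043d\u0430 = x1 + 2"));
    }
}
```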
For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:
a = "x=y+z/n-10+my5th_integer+201";
There is only one identifier here. A similar problem occurs with comments
that contain content that look like statements:
/* Tricky:
a = "x=y+z/n-10+my5th_integer+201";
*/
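A common partial defence is to make the regex match comments and string literals as alternatives that are tried before the identifier alternative, and to keep only the matches that came from the identifier branch. The sketch below does that; the class name and the group name id are assumptions for the example, and it still ignores keywords, Unicode escapes and non-ASCII identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner: consumes comments and literals so their contents are not reported. */
class LiteralAwareScanner {

    private static final Pattern TOKEN = Pattern.compile(
            "/\\*.*?\\*/"                              // block comment (non-greedy)
            + "|//[^\\n]*"                             // line comment
            + "|\"(?:\\\\.|[^\"\\\\\\n])*\""           // string literal with escapes
            + "|'(?:\\\\.|[^'\\\\\\n])*'"              // character literal
            + "|(?<id>[a-zA-Z_$][a-zA-Z_$0-9]*)",      // identifier-shaped word
            Pattern.DOTALL);

    static List<String> identifiers(String code) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String id = m.group("id");
            if (id != null) {          // null means a comment or literal matched instead
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(identifiers("a = \"x=y+z/n-10+my5th_integer+201\";"));
        System.out.println(identifiers("/* Tricky:\n a = \"x=y\";\n*/ b = c; // d = e"));
    }
}
```

At this point the "simple regex" has effectively turned into a small hand-written lexer, which is the point this answer is making.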
For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:
\u0061 = \u0062; // means "a=b;"
or nastier:
a\u006bc = 1; // means "akc=1;" not "abc=1;"!
Pushing this, without Unicode character decoding, you might not even
notice a string. The following is a variant of the above:
a = \u0022x=y+z/n-10+my5th_integer+201";
To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
If you don't mind being wrong some of the time, you can try your regex. Usually applying a regex to source-code parsing ends badly, partly because of the above problems (i.e., oversimplification).
You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:
a = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;
This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.
To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
You have one other issue: do you want to extract just variable names?
What if the identifier represents a method, type, class or package?
You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.
So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things
that look like identifiers.
If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.
Summary: because of all the complications, processing source code with just a regex usually fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?

It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
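The idea behind that hint can be sketched outside ANTLR, in plain Java: let one rule greedily consume any run of digits, letters and underscores that starts with a digit, and only afterwards decide whether the run is a valid integer literal. The class name, group names and output labels below are invented for this illustration; the same idea could presumably be expressed in the grammar as an extra lexer rule, but that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokenizer illustrating the "preprocessor number" trick. */
class NumberFirstTokenizer {

    private static final Pattern TOKEN = Pattern.compile(
            "(?<num>[0-9][0-9A-Za-z_]*)"        // anything starting with a digit, taken whole
            + "|(?<id>[A-Za-z][A-Za-z0-9_]*)"   // identifier: must start with a letter
            + "|(?<sym>\\S)");                  // any other non-blank character

    static List<String> classify(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            if (m.group("num") != null) {
                String text = m.group("num");
                // Second step: reinterpret the greedy match as a real integer literal, or reject it.
                out.add(text.matches("[0-9]+") ? "INTLIT " + text : "ERROR " + text);
            } else if (m.group("id") != null) {
                out.add("ID " + m.group("id"));
            } else {
                out.add("SYM " + m.group("sym"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        classify("void main(){\n int 2a;\n}").forEach(System.out::println);
    }
}
```

Because 2a is consumed as a single token, it can be reported as one error instead of being split into an integer literal followed by an identifier.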

Elegant way to do variable substitution in a java string

Pretty simple question and my brain is frozen today so I can't think of an elegant solution where I know one exists.
I have a formula which is passed to me in the form "A+B"
I also have a mapping of the formula variables to their "readable names".
Finally, I have a formula parser which will calculate the value of the formula, but only if it's passed with the readable names for the variables.
For example, as an input I get
String formula = "A+B"
String readableA = "foovar1"
String readableB = "foovar2"
and I want my output to be "foovar1+foovar2"
The problem with a simple find and replace is that it can easily be broken, because we have no guarantees on what the 'readable' names are. Let's say I take my example again with different parameters
String formula = "A+B"
String readableA = "foovarBad1"
String readableB = "foovarAngry2"
If I do a simple find and replace in a loop, I'll end up replacing the capital A's and B's in the readable names I have already replaced.
This looks like an approximate solution but I don't have brackets around my variables
How to replace a set of tokens in a Java String?

That link you provided is an excellent source since matching using patterns is the way to go. The basic idea here is first get the tokens using a matcher. After this you will have Operators and Operands
Then, do the replacement individually on each Operand.
Finally, put them back together using the Operators.

A somewhat tedious solution would be to scan for all occurrences of A and B and note their indexes in the string, and then use the StringBuilder.replace(int start, int end, String str) method. (In naive form this would not be very efficient though, approaching something like quadratic complexity, or more precisely "number of variables" * "number of possible replacements".)
If you know all of your operators, you could split on them (e.g. on "+") and then replace the individual "A" and "B" in an array or ArrayList (you'd have to trim whitespace first, of course).

A simple way to do it is
String formula = "A+B".replaceAll("\\bA\\b", readableA)
.replaceAll("\\bB\\b", readableB);

Your approach will not work well that way.
Formulas (mathematical expressions) should be parsed into an expression structure (e.g. an expression tree),
so that you end up with operand nodes and operator nodes.
This expression is later evaluated by traversing the tree and applying the mathematical precedence rules.
I recommend reading more on Expression parsing.

Matching Only
If you don't have to evaluate the expression after doing the substitution, you might be able to use a regex. Something like (\b\p{Alpha}\p{Alnum}*\b)
or the java string "(\\b\\p{Alpha}\\p{Alnum}*\\b)"
Then use find() over and over to find all the variables and store their locations.
Finally, go through the locations and build up a new string from the old one with the variable bits replaced.
Note that it will not do much checking that the supplied expression is reasonable. For example, it wouldn't mind at all if you gave it )A 2 B( and would just replace the A and B (giving )XXX 2 XXX( ). I don't know if that matters.
This is similar to the link you supplied in your question except you need a different regular expression than they used. You can go to http://www.regexplanet.com/advanced/java/index.html to play with regular expressions and figure out one that will work. I used it with the one I suggested and it finds what it needs in A+B and A + (C* D ) just fine.
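The find-and-rebuild steps described above can be sketched with Matcher.appendReplacement, which copies the text between matches and inserts the substitute in a single pass, so already-inserted readable names are never scanned again. The class name FormulaRenamer and the use of a Map for the name mapping are assumptions for the example; variables with no mapping are left unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces each variable exactly once, in a single pass. */
class FormulaRenamer {

    private static final Pattern VARIABLE =
            Pattern.compile("\\b\\p{Alpha}\\p{Alnum}*\\b");

    static String rename(String formula, Map<String, String> readableNames) {
        Matcher m = VARIABLE.matcher(formula);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Unknown variables are kept as they are.
            String replacement = readableNames.getOrDefault(m.group(), m.group());
            // quoteReplacement stops '$' and '\' in a readable name from being treated specially.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rename("A+B", Map.of("A", "foovarBad1", "B", "foovarAngry2")));
    }
}
```

Since the replacements are never re-examined, this works even when one readable name is itself another variable's short name, for example when A maps to B and B maps to A.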
Parsing
You parse the expression using one of the available parser generators (Antlr or Sable or ...) or find an algebraic expression parser available as open source and use it. (You would have to search the web to find those, I haven't used one but suspect they exist.)
Then you use the parser to generate a parsed form of the expression, replace the variables and reconstitute the string form with the new variables.
This one might work better but the amount of effort depends on whether you can find existing code to use.
It also depends on whether you need to validate the expression is valid according to the normal rules. This method will not accept invalid expressions, most likely.

Is there a semi-automated way to perform string extraction for i18n?

We have a Java project which contains a large number of English-language strings for user prompts, error messages and so forth. We want to extract all the translatable strings into a properties file so that they can be translated later.
For example, we would want to replace:
Foo.java
String msg = "Hello, " + name + "! Today is " + dayOfWeek;
with:
Foo.java
String msg = Language.getString("foo.hello", name, dayOfWeek);
language.properties
foo.hello = Hello, {0}! Today is {1}
I understand that doing this in a completely automated way is pretty much impossible, as not every string should be translated. However, we were wondering if there was a semi-automated way which removes some of the laboriousness.
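For reference, the target form shown above could be backed by something like the following sketch. The question does not show the Language class, so everything here is an assumption: the template is held in an in-memory Properties object rather than loaded from language.properties, and the formatting is delegated to java.text.MessageFormat, whose {0}, {1} placeholders match the ones in the example.

```java
import java.text.MessageFormat;
import java.util.Properties;

/** Hypothetical stand-in for the Language class used in the question. */
class Language {

    private static final Properties MESSAGES = new Properties();

    static {
        // In a real project this would be loaded from language.properties.
        MESSAGES.setProperty("foo.hello", "Hello, {0}! Today is {1}");
    }

    static String getString(String key, Object... args) {
        String template = MESSAGES.getProperty(key);
        if (template == null) {
            return key;   // assumed fallback: show the key when no translation exists
        }
        return MessageFormat.format(template, args);
    }

    public static void main(String[] args) {
        System.out.println(getString("foo.hello", "Ann", "Monday"));
    }
}
```

One thing to keep in mind when extracting strings into this format is that MessageFormat treats the apostrophe as a quoting character, so a literal apostrophe in a template has to be doubled.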

What you want is a tool that replaces every expression involving string concatenations with a library call, with the obvious special case of expressions involving just a single literal string.
A program transformation system in which you can express your desired patterns can do this.
Such a system accepts rules in the form of:
lhs_pattern -> rhs_pattern if condition ;
where patterns are code fragments with syntax-category constraints on the pattern variables. This causes the tool to look for syntax matching the lhs_pattern, and if found, replace it by the rhs_pattern, where the pattern matching is over language structures rather than text. So it works regardless of code formatting, indentation, comments, etc.
Sketching a few rules (and oversimplifying to keep this short)
following the style of your example:
domain Java;
nationalize_literal(s1:literal_string):
" \s1 " -> "Language.getString1(\s1 )";
nationalize_single_concatenation(s1:literal_string,s2:term):
" \s1 + \s2 " -> "Language.getString1(\s1) + \s2";
nationalize_double_concatenation(s1:literal_string,s2:term,s3:literal_string):
" \s1 + \s2 + \s3 " ->
"Language.getString3(\generate_template1\(\s1 + "{1}" +\s3\, s2);"
if IsNotLiteral(s2);
The patterns are themselves enclosed in "..."; these aren't Java string literals, but rather a way of saying to the multi-computer-lingual pattern matching engine
that the stuff inside the "..." is (domain) Java code. Meta-stuff is marked with \,
e.g., metavariables \s1, \s2, \s3 and the embedded pattern call \generate with ( and ) to denote its meta-parameter list :-}
Note the use of the syntax category constraints on the metavariables s1 and s3 to ensure matching only of string literals. What the meta variables match on the left hand side pattern, is substituted on the right hand side.
The sub-pattern generate_template is a procedure that at transformation time (e.g., when the rule fires) evaluates its known-to-be-constant first argument into the template string you suggested and inserts into your library, and returns a library string index.
Note that the 1st argument to the generate pattern in this example is composed entirely of concatenated literal strings.
Obviously, somebody will have to hand-process the templated strings that end up in the library to produce the foreign language equivalents.
You're right in that this may over templatize the code because some strings shouldn't be placed in the nationalized string library. To the extent that you can write programmatic checks for those cases, they can be included as conditions in the rules to prevent them from triggering. (With a little bit of effort, you could place the untransformed text into a comment, making individual transformations easier to undo later).
Realistically, I'd guess you have to code ~100 rules like this to cover the combinatorics and special cases of interest. The payoff is that your code gets automatically enhanced. If done right, you could apply this transformation to your code repeatedly as your code goes through multiple releases; it would leave previously nationalized expressions alone and just revise the new ones inserted by the happy-go-lucky programmers.
A system which can do this is the DMS Software Reengineering Toolkit. DMS can parse/pattern match/transform/prettyprint many languages, including Java and C#.

Eclipse will externalize every individual string and does not automatically build substitution like you are looking for. If you have a very consistent convention of how you build your strings you could write a perl script to do some intelligent replacement on .java files. But this script will get quite complex if you want to handle
String msg = new String("Hello");
String msg2 = "Hello2";
String msg3 = new StringBuffer().append("Hello3").toString();
String msg4 = "Hello" + 4;
etc.
I think there are some paid tools that can help with this. I remember evaluating one, but I don't recall its name. I also don't remember if it could handle variable substitution in external strings. I'll try to find the info and edit this post with the details.
EDIT:
The tool was Globalyzer by Lingoport. The website says it supports string externalization, but not specifically how. Not sure if it supports variable substitution. There is a free trial version so you could try it out and see.

Globalyzer has extensive capabilities to detect, manage and externalize strings, and speeds up the work dramatically compared with looking at strings and externalizing them one by one. You can filter the strings as well as see them in context, and then externalize them either one by one or in batches. It works for a wide variety of programming languages and resource types, of course including Java. Plus, Globalyzer finds much more than embedded strings for your internationalization projects. You can read more at http://lingoport.com/globalyzer and there are links there to sign up for a demo account. Globalyzer was first built for performing big internationalization service projects, and over the years it has grown into a full-scale enterprise tool for making sure development gets and stays internationalized.

As well as Eclipse's string externalizer, which generates properties files, Eclipse has a warning for non-externalized strings, which is helpful for finding files that you haven't internationalized.
String msg = "Hello " + name;
gives the warning "Non-externalized string literal; it should be followed by //$NON-NLS-<n>$". For strings that truly do belong in the code you can add an annotation (@SuppressWarnings("nls")) or you can add a comment:
String msg = "Hello " + name; //$NON-NLS-1$
This is very helpful for converting a project to proper internationalization.

I think eclipse has some option to externalize all strings into a property file.

You can use "Externalize String" method from eclipse.
Open your Java file in the editor and then click on "Externalize String" in the "Source" main menu. It generates a properties file for you with all the strings you checked in the selection.
Hope this helps.

Since everyone is weighing in with an IDE, I guess I'd better stand up for NetBeans :)
Tools-->Internationalisation-->Internationalisation Wizard
very handy..

IntelliJ IDEA is another tool which has this feature.
Here's a link to the demo
