Parsing regular expressions based on a context free grammar

Parsing regular expressions based on a context free grammar - java

Good evening, Stack Overflow.
I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:
Grammar
Basically, the language is constituted by 2 base statements
( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150
Now, I'm pretty sure about what should I do in order to interpret a statement: 1) Lexical analysis to turn a statement into a sequence of tokens 2) Syntax analysis to get a symbol table (HashMap with the variables and their values) and a syntactic tree (to perform the GET statements) to 3) perform an inorder visit of the tree to get the results I want.
I'd like some advice on the parsing method to read the source file. Considering the parser should ignore any whitespace, tabulation or newline, is it possible to use a Java Pattern to get a general statement I want to analyze? Is there a good way to read a statement weirdly formatted (and possibly more complex) like this
(
SET var
25
)
without confusing the parser with the open and closed parenthesises?
For example
Scanner scan; //scanner reading the source file
String pattern = "..." //ideal pattern I've found to represent an expression
while(scan.hasNext(pattern))
Interpreter.computeStatement(scan.next(pattern));
would it be a viable option for this problem?

Solution proposed by Ira Braxter:
Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.
Maybe this will help: stackoverflow.com/a/2336769/120163

In the end, I understood thanks to Ira Baxter that this context free grammar can't be parsed with RegExp and I used the concepts of S-Expressions to build up the interpreter, whose source code you can find here. If you have any question about it (mainly because the comments aren't translated in english, even though I think the code is pretty clear), just message me or comment here.
Basically what I do is:
Parse every character and tokenize it (e.g '(' -> is OPEN_PAR, while "SET" -> STATEMENT_SET or a random letter like 'b' is parsed as a VARIABLE )
Then, I use the token list created to do a syntactic analysis, which checks the patterns occuring inside the token list, according to the grammar
If there's an expression inside the statement, I check recursively for any expression inside an expression, throwing an exception and going to the following correct statement if needed
At the end of analysing every single statement, I compute the statement as necessary as for specifications

Related

Extract variables from code statement using regex

I'm trying to extract variables from code statements and "if" condition. I have a regex to that but mymatcher.find() doesn't return any values matched.
I don't know what is wrong.
here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(1) ;
System.out.println("variable:" + find);
}
}
}

You need to remove ^ and $ anchors that assert positions at start and end of string repectively, and use mymatcher.group(0) instead of mymatcher.group(1) because you do not have any capturing groups in your regex:
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(0) ;
System.out.println("variable:" + find);
}
See IDEONE demo, the results are:
variable:x
variable:y
variable:z
variable:n
variable:my5th_integer

Usually processing source code with just a regex simply fails.
If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).
But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.
A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings, and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but not cannot look like a specific list of keywords. (The latter is called a subtractive regex, and aren't found in most regex engines. It looks something like:
[a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )
Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
This gets more complex if the "keywords" are not always keywords, that is,
they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know
if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).
Next, identifiers in Java allow a variety of Unicode characters (Latin1, Russian, Chinese, ...) A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:
a = "x=y+z/n-10+my5th_integer+201";
There is only one identifier here. A similar problem occurs with comments
that contain content that look like statements:
/* Tricky:
a = "x=y+z/n-10+my5th_integer+201";
*/
For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:
\u0061 = \u0062; // means "a=b;"
or nastier:
a\u006bc = 1; // means "akc=1;" not "abc=1;"!
Pushing this, without Unicode character decoding, you might not even
notice a string. The following is a variant of the above:
a = \u0042x=y+z/n-10+my5th_integer+201";
To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
If you don't care about being right most of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g, oversimplification).
You are lucky in that you are trying to do for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... its turtles all the way down. Consider the C# (version 6) statement:
a = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;
This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.
To extract identifiers from this, you need a something that understands the nesting of string elements. Lexers usually don't do recursion (Our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
You have one other issue: do you want to extract just variable names?
What if the identifier represents a method, type, class or package?
You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.
So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things
that look like identifiers.
If you want to it well (e.g., use this in some production code) the single regex will be total disaster. You'll spend your life explaining to users what they cannot type, and that never works.
Summary: because of all the complications, usually processing source code with just a regex simply fails. People keep re-learning this lesson. It is one of key reasons that lexer generators are widely used in language processing tools.

Would ANTLR 4 allow me to create a parser for a boolean expression?

I have a program where users want to be able to filter out certain String criteria using the format
(someType != 'a' AND someType != 'b') OR (anotherType = 'abc' AND
somethingElse = 'cns')
We are looking into using ANTLR 4 for parsing this out. Each group will always be separated by an OR and each inner group will always be separated by ANDs.
I am a junior developer and I will learn ANTLR4 by reading the book if this is the route we want to go in. I just want to make sure ANTLR4 will take care of this.
We essentially want to know if the expression will evaluate to true or false based on this grammar.

Antlr doesn't evaluate expressions. It parses them.
"Evaluation" of the parsed result is up to you. Generally you attach node-building actions to grammar rules; with that, ANTLR will help you build a tree, and then you walk to the tree to evaluate it.
If you are really sneaky, you can likely do expression evaluation in the semantic actions. Passing values up is somewhat like passing created nodes up. Passing values down takes more effort, and I'm not the guy to describe how to do this with ANTLR.

Elegant way to do variable substitution in a java string

Pretty simple question and my brain is frozen today so I can't think of an elegant solution where I know one exists.
I have a formula which is passed to me in the form "A+B"
I also have a mapping of the formula variables to their "readable names".
Finally, I have a formula parser which will calculate the value of the formula, but only if its passed with the readable names for the variables.
For example, as an input I get
String formula = "A+B"
String readableA = "foovar1"
String readableB = "foovar2"
and I want my output to be "foovar1+foovar2"
The problem with a simple find and replace is that it can be easily be broken because we have no guarantees on what the 'readable' names are. Lets say I take my example again with different parameters
String formula = "A+B"
String readableA = "foovarBad1"
String readableB = "foovarAngry2"
If I do a simple find and replace in a loop, I'll end up replacing the capital A's and B's in the readable names I have already replaced.
This looks like an approximate solution but I don't have brackets around my variables
How to replace a set of tokens in a Java String?

That link you provided is an excellent source since matching using patterns is the way to go. The basic idea here is first get the tokens using a matcher. After this you will have Operators and Operands
Then, do the replacement individually on each Operand.
Finally, put them back together using the Operators.

A somewhat tedious solution would be to scan for all occurences of A and B and note their indexes in the string, and then use StringBuilder.replace(int start, int end, String str) method. (in naive form this would not be very efficient though, approaching smth like square complexity, or more precisely "number of variables" * "number of possible replacements")
If you know all of your operators, you could do split on them (like on "+") and then replace individual "A" and "B" (you'd have to do trimming whitespace chars first of course) in an array or ArrayList.

A simple way to do it is
String foumula = "A+B".replaceAll("\\bA\\b", readableA)
.replaceAll("\\bB\\b", readableB);

Your approach does not work fine that way
Formulas (mathematic Expressions) should be parsed into an expression structure (eg. expression tree).
Such that you have later Operand Nodes and Operator nodes.
Later this expression will be evaluated traversing the tree and considering the mathematical priority rules.
I recommend reading more on Expression parsing.

Matching Only
If you don't have to evaluate the expression after doing the substitution, you might be able to use a regex. Something like (\b\p{Alpha}\p{Alnum}*\b)
or the java string "(\\b\\p{Alpha}\\p{Alnum}*\\b)"
Then use find() over and over to find all the variables and store their locations.
Finally, go through the locations and build up a new string from the old one with the variable bits replaced.
Not that It will not do much checking that the supplied expression is reasonable. For example, it wouldn't mind at all if you gave it )A 2 B( and would just replace the A and B (like )XXX 2 XXX(). I don't know if that matters.
This is similar to the link you supplied in your question except you need a different regular expression than they used. You can go to http://www.regexplanet.com/advanced/java/index.html to play with regular expressions and figure out one that will work. I used it with the one I suggested and it finds what it needs in A+B and A + (C* D ) just fine.
Parsing
You parse the expression using one of the available parser generators (Antlr or Sable or ...) or find an algebraic expression parser available as open source and use it. (You would have to search the web to find those, I haven't used one but suspect they exist.)
Then you use the parser to generate a parsed form of the expression, replace the variables and reconstitute the string form with the new variables.
This one might work better but the amount of effort depends on whether you can find existing code to use.
It also depends on whether you need to validate the expression is valid according to the normal rules. This method will not accept invalid expressions, most likely.

How can I simplify token prediction DFA?

Lexer DFA results in "code too large" error
I'm trying to parse Java Server Pages using ANTLR 3.
Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.
In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.
However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.
What should I do about it?
Are there common mistakes that result in complex DFAs?
Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?
Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.
Background
ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition(). This method contains a switch statement with a case for each state in the DFA. Within each case, there is a series of if statements, one for each transition from the state. The condition of each if statement tests an input character to see if it matches the transition.
These character-testing conditions can be very complex. They normally have the following form:
int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
…
case 13 :
if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
s = 24; /* If the character matches, move to the next state. */
else if …
A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)
I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.

Grammars that are very large (many different tokens) have that problem, unfortunately (SQL grammars suffer from this too).
Sometimes this can be fixed by making certain lexer rules fragments opposed to "full" lexer rules that produce tokens and/or re-arranging the way characters are matched inside the rules, but by looking at the way you already tried yourself, I doubt there can gained much in your case. However, if you're willing to post your lexer grammar here on SO, I, or someone else, might see something that could be changed.
In general, this problem is fixed by splitting the lexer grammar into 2 or more separate lexer grammars and then importing those in one "master" grammar. In ANTLR terms, these are called composite grammars. See this ANTLR Wiki page about them: http://www.antlr.org/wiki/display/ANTLR3/Composite+Grammars
EDIT
As #Gunther rightfully mentioned in the comment beneath the OP, see the Q&A: Why my antlr lexer java class is "code too large"? where a small change (the removal of a certain predicate) caused this "code too large"-error to disappear.

Well, actually it is not always easy to make a composite grammar. In many cases this AntTask helps to fix this problem (it must be run every time after recompiling a grammar, but this process is not so boring).
Unfortunately, even this magic script doesn't help in some complex cases. Compiler can begin to complaining about too large blocks of DFA transitions (static String[] fields).
I found an easy way to solve it, by moving (using IDE refactoring features) such fields to another class with arbitrarily generated name. It always helps when moving just one or more fields in such way.

Split textual script into substrings by pattern

Consider following script (it's total nonsense in pseudo-language):
if (Request.hostMatch("asfasfasf.com") && someString.existsIn(new String[] {"brr", "hrr"})) {
if (Requqest.clientIp("10.0.x.x")) {
somevar = "1";
}
somevar = "2";
}
else {
somevar = "first";
}
string foo = "foo";
// etc. etc.
How would you grab if-block's parameters and contents from it? The if-block has format of:
if<whitespace>(<parameters>)<whitespace>{<contents>}<anything>
I tried using String.split() with regex pattern of ^if\s*\(|\)\s*\{|\}\s* but this fails miserably. Namely, the problem is that ) { is found also in inner if-block and the closing } is found from many places as well. I don't think neither lazy or eager expansion works here.
So... any pointers to what might I need here in order to implement this with regex?
I also need to get the remaining string without the if-block's code (so code starting from else { ...). Using just String.split() seems to make it difficult as there is no information about the length of the parts that were parsed away.
I initially created a loop based solution (using String.substring() heavily) for this, but it's dull. I would like to have something fancier instead. Should I go with regex or create a custom, generic function (there are many other cases than just this) that takes the parseable String and the pattern instead (consider the if<whitespace>(... pattern above)?
Edit: Changed returns to variable assignments as it would have not made sense otherwise.

You'd be far better off using (or writing) a parser than trying to do this with Regex.
Regex is great for somethings, but for complex parsing like this, it sucks. Another example where it sucks that gets asked a lot here is parsing HTML - you can do it to a limited degree, but for anything complex, a DOM parser is a much better solution.
For a [very] simple parser, what you need is a recursive function that searches for a braces { and }, recursing down a level each time it comes across an opening brace, and returning back up a level when it finds a closing brace. It then needs to store the string contents between the two braces at each level.

A regular language won't work because a regular grammar can't match things like "any number of open parenthesis followed by any number of close parenthesis". A context-free grammar would be needed for that.
Unless you use a context-free grammar parser for Java or a regular expression extension that makes regular expressions no longer regular, your loop-based solution is probably the fanciest solution.

As per the above, you'll need a parser. One type that's easy to implement (and fun to write!) is a recursive descent parser with backtracking. There is also a plethora of parser generators out there, though most of those have a learning curve. One Java-friendly parser generator is JavaCC.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.