Edit AST by using visitors in Antlr


I am new to AntLR and I am struggling to do the following:
What I want to do, after I have parsed a source file (for which I have a valid grammar, of course) and have the AST in memory, is to change some things and then print it back out through the visitor API.
e.g.
int foo() {
y = x ? 1 : 2;
}
and turn it into:
int foo() {
if (x) {
y = 1;
} else {
y = 2;
}
}
Up to now I have the appropriate grammar to parse such syntax and I have also made some visitor methods that are getting called when I am on the correct position. What baffles me is that during visiting I can't change the text.
Ideally I would like to have something like this:
public Void visitTernExpr(SimpleCParser.TernExprContext ctx) {
ctx.setText("something");
return null;
}
and in my Main I would like to have this AST edited by different visitors, each specialised in something. Like this:
ANTLRInputStream input = new ANTLRInputStream(new FileInputStream(filename));
SimpleCLexer lexer = new SimpleCLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SimpleCParser parser = new SimpleCParser(tokens);
ProgramContext ctx = parser.program();
MyChecker1 mc1 = new MyChecker1();
mc1.visit(ctx);
MyChecker2 mc2 = new MyChecker2();
mc2.visit(ctx);
ctx.printToFile("myfile");
Is there any way of doing this in ANTLR, or am I heading in the wrong direction?

You can do this with ANTLR by smashing the AST nodes and links. You'll have to create all the replacement subtree nodes and splice them in place. Then you'll have to implement the "spit source text" tree walk; I suggest you investigate "string templates" for this purpose.
But ultimately you have to do a lot of work to achieve this effect. This is because the goal of the ANTLR tool is largely focused around "parsing", which pushes the rest on you.
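To make the amount of work concrete, here is a minimal, library-free sketch of the splice-then-print idea. It does not use the ANTLR API: the Stmt, TernaryAssign, Assign, IfElse and TernaryRewriter types are hypothetical stand-ins for whatever tree nodes you end up with, and expressions are kept as plain strings to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node types; a real ANTLR parse tree has a different API.
interface Stmt {
    String print(String indent);
}

// target = cond ? thenExpr : elseExpr;
final class TernaryAssign implements Stmt {
    final String target, cond, thenExpr, elseExpr;
    TernaryAssign(String target, String cond, String thenExpr, String elseExpr) {
        this.target = target; this.cond = cond;
        this.thenExpr = thenExpr; this.elseExpr = elseExpr;
    }
    public String print(String indent) {
        return indent + target + " = " + cond + " ? " + thenExpr + " : " + elseExpr + ";\n";
    }
}

// target = expr;
final class Assign implements Stmt {
    final String target, expr;
    Assign(String target, String expr) { this.target = target; this.expr = expr; }
    public String print(String indent) {
        return indent + target + " = " + expr + ";\n";
    }
}

// if (cond) { thenStmt } else { elseStmt }
final class IfElse implements Stmt {
    final String cond;
    final Stmt thenStmt, elseStmt;
    IfElse(String cond, Stmt thenStmt, Stmt elseStmt) {
        this.cond = cond; this.thenStmt = thenStmt; this.elseStmt = elseStmt;
    }
    public String print(String indent) {
        return indent + "if (" + cond + ") {\n"
             + thenStmt.print(indent + "    ")
             + indent + "} else {\n"
             + elseStmt.print(indent + "    ")
             + indent + "}\n";
    }
}

final class TernaryRewriter {
    // Splices an equivalent IfElse over every TernaryAssign, in place.
    static void rewrite(List<Stmt> body) {
        for (int i = 0; i < body.size(); i++) {
            if (body.get(i) instanceof TernaryAssign) {
                TernaryAssign t = (TernaryAssign) body.get(i);
                body.set(i, new IfElse(t.cond,
                        new Assign(t.target, t.thenExpr),
                        new Assign(t.target, t.elseExpr)));
            }
        }
    }

    // The "spit source text" walk: every node knows how to print itself.
    static String print(List<Stmt> body) {
        StringBuilder sb = new StringBuilder();
        for (Stmt s : body) sb.append(s.print(""));
        return sb.toString();
    }
}
```

Even in this toy form you need one node class and one print routine per construct, which is the engineering cost described above once the language gets bigger than a single statement kind.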
If what you want to do is to replace one set of syntax by another, what you really want is a program transformation system. These are tools that are designed to have all of the above built-in already so you don't have to reinvent it all. They also usually have source-to-source transformations, which make accomplishing tasks like the one you have shown much, much easier to implement.
To accomplish your example with our DMS program transformation engine, you'd write a transformation rule and then apply it:
rule replace_ternary_assignment_by_ifthenelse
(l: left_hand_side, c: expression, e1: expression, e2: expression):
statement -> statement
"\l = \c ? \e1 : \e2;"
=> " if (\c) \l = \e1; else \l = \e2 ";
DMS parses your code, builds ASTs, finds matches for the rewrites, and constructs/splices all those replacement nodes for you. Finally, DMS has built-in prettyprinters to regenerate the text. The point of all this is to let you get on with your task of modifying your code, rather than creating a whole new engineering job before you can do your task. Read my essay, "Life After Parsing", easily found via my bio or with a Google search, for more on this topic.
[If you go to the DMS Wikipedia page, you will amusingly find the inverse of this transform used as an example.]

I would use a listener, and yes you can modify the AST while you are walking through it.
You can create a new instance of the if/else context and then replace the ternary operator context with it. This is possible because you have a reference to the rule's parent and an extensive API for handling each rule's children.

Related

Create an ANTLR grammar rule that returns function name as a token if it finds a doctype comment above the function declaration

This is the code sample which I want to parse. I want getSaveablePaymentMethodsSmartList() as a token when I override the function in the parserBaseListener.java file created by ANTLR.
/** #suppress */
public any function getSaveablePaymentMethodsSmartList() {
if(!structKeyExists(variables, "saveablePaymentMethodsSmartList")) {
variables.saveablePaymentMethodsSmartList = getService("paymentService").getPaymentMethodSmartList();
variables.saveablePaymentMethodsSmartList.addFilter('activeFlag', 1);
variables.saveablePaymentMethodsSmartList.addFilter('allowSaveFlag', 1);
variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodType', 'creditCard,giftCard,external,termPayment');
if(len(setting('accountEligiblePaymentMethods'))) {
variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodID', setting('accountEligiblePaymentMethods'));
}
}
return variables.saveablePaymentMethodsSmartList;
}
I already have a grammar that parses function declarations, but I need a new rule that can associate doctype comments with a function declaration and give the function name as a separate token if there is a doctype comment associated with it.
Grammar looks like this:
functionDeclaration
: accessType? typeSpec? FUNCTION identifier
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement
;

You want grammar rules that:
return X if something "far away" in the source is an A,
return Y if something far away is a B (or ...).
In general, this is context dependency. It is not handled well by context free grammars, which is something that ANTLR is trying to approximate with its BNF rules. In essence, what you think you want to do is to encode history of what the parser has seen long ago, to influence what is being produced now. Generally that is hard.
The usual solution to something like this is to not address it in the grammar at all. Instead:
have the grammar rules produce an X regardless of what is far away,
build a tree as you parse (ANTLR does this for you); this captures not only X but everything about the parsed entity, including tokens for A that are far away
walk over the tree, interpreting a found X as Y if the tree contains the A (usually far away in the tree).
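The last step of that list can be sketched as a plain post-parse pass. This is only an illustration under assumptions: FunctionNode and DocumentedFunctions are hypothetical names, and the sketch assumes you have already managed to attach the preceding comment text (or null) to each function node while building the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node: what the parse might yield for one function declaration.
final class FunctionNode {
    final String name;
    final String docComment; // null when no comment precedes the declaration

    FunctionNode(String name, String docComment) {
        this.name = name;
        this.docComment = docComment;
    }
}

final class DocumentedFunctions {
    // The grammar treats every function name as a plain identifier (the "X");
    // the documented/undocumented distinction (the "Y") is decided here, from the tree.
    static List<String> namesOf(List<FunctionNode> functions) {
        List<String> names = new ArrayList<>();
        for (FunctionNode f : functions) {
            if (f.docComment != null && f.docComment.startsWith("/**")) {
                names.add(f.name);
            }
        }
        return names;
    }
}
```

The grammar stays context free and the context-dependent decision becomes an ordinary if-statement.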
For your specific case of docstring-influences-function name, you can probably get away with encoding far away history.
You need (IMHO, ugly) grammar rules that look something like this:
functionDeclaration: documented_function | undocumented_function ;
documented_function: docstring accessType? typeSpec? FUNCTION
documented_function_identifier rest_of_function ;
undocumented_function: accessType? typeSpec? FUNCTION
identifier rest_of_function ;
rest_of_function: // avoids duplication, not pretty
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement ;
You have to recognize the docstring as an explicit token that can be "seen" by the parser, which means modifying your lexer to make docstrings from comments (e.g, whitespace) into tokens. [This is the first ugly thing]. Then having seen such a docstring, the lexer has to switch to a lexical mode that will pick up identifier-like text and produce documented_function_identifier, and then switch back to normal mode. [This is the second ugly thing]. What you are doing is implementing literally a context dependency.
The reason you can accomplish this in spite of my remarks about context dependency is that A is not very far away; it is within a few tokens of X.
So, you could do it this way. I would not do this; you are trying to make the parser do too much. Stick to the "usual solution". (You'll have a different problem: your A is a comment/whitespace, and probably isn't stored in the tree by ANTLR. You'll have to solve that; I'm not an ANTLR expert.)

How to check if given line of code is written in java?

What is the right way to check if given line is java code?
Input: LogSupport.java:44 com/sun/activation/registries/LogSupport log (Ljava/lang/String;)V
Expected Output: false.
Input: Scanner in = new Scanner(System.in);
Expected Output: true.
I tried Eclipse JDT ASTParser to check if we can create an AST. Here's the code:
public static boolean isJava(String line) {
boolean isJava = false;
ASTParser parser = ASTParser.newParser(AST.JLS3);
parser.setSource(line.toCharArray());
parser.setResolveBindings(false);
ASTNode node = null;
parser.setKind(ASTParser.K_STATEMENTS);
try {
node = parser.createAST(null);
if (node == null) return false;
isJava = true;
} catch (Exception e) {
return false;
}
return isJava;
}
But this does not work. Any ideas? Thanks!

Try Beanshell
http://www.beanshell.org/intro.html
Java evaluation features:
Evaluate full Java source classes dynamically as well as isolated Java methods, statements, and expressions.
Summary of features
Dynamic execution of the full Java syntax, Java code fragments, as well as loosely typed Java and additional scripting conveniences.
Transparent access to all Java objects and APIs.
Runs in four modes: Command Line, Console, Applet, Remote Session Server.
Can work in security constrained environments without a classloader or bytecode generation for most features.
The interpreter is small ~150K jar file.
Pure Java.
It's Free!!
The link below has some other options you could try:
Syntax Checking in Java

What you apparently want is to decide whether a string you have is a valid substring of the Java language.
Obviously, to do this, you need a full Java parser as a foundation.
Some parsing machinery may let you try parsing the string as a nonterminal in the language; this is relatively easy to do with a recursive descent parser. (It appears the Eclipse parser offers that, based on OP's example.)
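If you would rather not depend on Eclipse, a rough approximation of "parse as a statement" is possible with the compiler API that ships with the JDK: wrap the line in a method body and ask javac to parse it without resolving symbols. This is only a sketch under assumptions: it needs a JDK (not a bare JRE), and the JavaLineCheck and parsesAsStatement names are mine.

```java
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class JavaLineCheck {
    // In-memory source file holding the wrapped snippet.
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String code) {
            super(URI.create("string:///Snippet.java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // True if the line parses as one or more statements inside a method body.
    static boolean parsesAsStatement(String line) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null without a JDK
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available");
        }
        String wrapped = "class Snippet { void m() {\n" + line + "\n} }";
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, null, null,
                Collections.singletonList(new StringSource(wrapped)));
        try {
            task.parse(); // syntax only: no symbol resolution, so unknown types are fine
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) return false;
        }
        return true;
    }
}
```

With the two inputs from the question, the Scanner line is accepted and the LogSupport.java:44 line is rejected. Note the wrapping trick is easy to fool (a line such as "} void n() {" also parses), so treat it as a heuristic rather than a real nonterminal parse.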
But if you want to accept a substring, e.g., treating
57).x=2; foo[15].bar(abc>=
as a valid Java fragment, you need parsing machinery specialized to handle this.
Our DMS Software Reengineering Toolkit with its Java Front End will do this. The parser APIs
provide facilities for "parse a full compilation unit", "parse a nonterminal", and "parse a substring". The first two return trees; the latter returns a sequence of trees. It isn't quite an arbitrary substring; you can't start or end in the middle of a token (e.g., a string literal). Other than that, it will parse arbitrary substrings.

How to parse a file in INI/JSON-like non-standard format?

Suppose I have a text file in the following (non-standard) format:
xxx { a = v1; b = v2 }
yyy { a = v3; c = v4 }
I cannot change it to any standard (INI/XML/YAML, etc.) format.
Now I would like to find the value of property a in section xxx (that is v1). What is the simplest way to do it in Java/Groovy?

With Groovy, you could leverage the ConfigSlurper.
However, you would first need to hack a map of valid values together, so that it doesn't choke trying to work out what v1, v2, v3, etc are:
This seems to work:
def input = '''xxx { a = v1; b = v2 }
|yyy { a = v3; c = v4 }'''.stripMargin()
def slurper = new ConfigSlurper()
// Find all words 'w' and make a map of [ w1:'w1', w2:'w2', ... ]
slurper.binding = ( ( input =~ /\w+/ ) as List ).collectEntries { w -> [ (w):w ] }
def result = slurper.parse( input )
println result
That prints out:
[xxx:[a:v1, b:v2], yyy:[a:v3, c:v4]]
(Groovy 1.8.4)

For a true INI-format file: What is the easiest way to parse an INI file in Java?
What you're showing here looks more like JSON than INI format to me. Perhaps look at JSON parsing libraries. The truth here is that you're not using an established format, so you probably won't be using an established format parser. Your best bet is probably to refactor the file you're dealing with (if possible) into a well-known format to begin with. Don't try to reinvent the wheel unless you absolutely have to.

There's likely not going to be an out-of-the-box solution if you're dealing with a non-standard format. Here are a few approaches you might want to look into:
if the format is simple, write a custom recursive descent parser
write a filter to transform your format into INI, JSON, etc. and use existing libraries
create a groovy DSL that matches your format and execute your file as a groovy script
use a parser generator tool like antlr or parboiled to create a parser from a language specification

Firstly, you've given an example, not specified a format. Before you go any further, you need to get hold of a complete specification for the format. Or if there isn't one, you need to see the code that generates it, and reverse engineer a specification.
(If you try to implement based on a small example, there's a good chance that your parser will encounter real life examples that don't fit the patterns that you have intuited.)
Having done that you can look for an off-the-shelf parser that can cope with your format. If you are lucky, it might be close enough to INI, or JSON or YAML or something else for the corresponding parser to (mostly) work.
But the chances are that it won't, and that you will need to write your own parser. There are various ways you could do this, for instance:
You could split the file into lines and "parse" each line with a regex.
You could parse the file using a Scanner with appropriate delimiters.
You could use a parser generator to implement a lexer and parser.
You could implement a simple lexer and parser by hand.
There are probably Groovy specific solutions.
In reality the correct choice(s) depend on how simple or complex the actual format is. We can't tell that from a single example.
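For what it's worth, if the real files never get more complicated than the two-line sample, the regex option could look roughly like this in plain Java. It is a sketch that assumes no nested braces, no quoting, and no ';' or '}' inside values; SectionParser is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SectionParser {
    // name { ...anything except '}'... }
    private static final Pattern SECTION = Pattern.compile("(\\w+)\\s*\\{([^}]*)\\}");

    // Parses "name { k = v; k = v }" blocks into section -> (key -> value).
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Matcher m = SECTION.matcher(text);
        while (m.find()) {
            Map<String, String> props = new LinkedHashMap<>();
            for (String entry : m.group(2).split(";")) {
                String[] kv = entry.split("=", 2);
                if (kv.length == 2) {              // skip empty or malformed entries
                    props.put(kv[0].trim(), kv[1].trim());
                }
            }
            sections.put(m.group(1), props);
        }
        return sections;
    }
}
```

With the sample input, parse(text).get("xxx").get("a") gives v1. Anything beyond these assumptions (nesting, comments, escapes) is where a real lexer and parser start to pay off.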

Trying to understand parsers

I'm trying to use JavaCC to build a simple command line calculator that can handle a variety of expressions. While there are plenty of tutorials out there on how to write grammars, none that I've seen so far explain what happens afterwards.
What I understand right now is that after a string is passed into the parser, it's split into tokens and turned into a parse tree. What happens next? Do I traverse through the parse tree doing a bunch of if-else string comparisons on the contents of each node and then perform the appropriate function?

I highly suggest you watch Scott Stanchfield's ANTLR 3.x tutorials. Even if you don't end up using ANTLR, which may be overkill for your project but I doubt it, you will learn a lot by watching him go through the thought process.
In general the process is...
Build a lexer to understand your tokens
Build a parser that can validate and understand and organize the input into an abstract syntax tree (AST) which should represent a simplified/easy-to-work-with version of your syntax
Run any calculation based on the AST

You need to actually compile or interpret it, depending on what you need.
For a calculator you just need to visit the tree recursively and evaluate it, while for a more complex language you would have to translate it into an intermediate language that is assembly-like but stays abstracted from the underlying architecture.
Of course, you could develop your own simple VM that executes the set of instructions your language compiles to, but that would be overkill in your case. Just visit the parse tree. Something like:
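enum Operation {
    PLUS, MINUS
}
interface TreeNode {
    float eval();
}
class TreeFloat implements TreeNode {
    float val;
    public float eval() { return val; }
}
class TreeBinaryOp implements TreeNode {
    TreeNode first;
    TreeNode second;
    Operation op;
    public float eval() {
        // Evaluate both subtrees, then combine them according to the operator.
        switch (op) {
            case PLUS:  return first.eval() + second.eval();
            case MINUS: return first.eval() - second.eval();
            default:    throw new IllegalStateException("unknown operation: " + op);
        }
    }
}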
Then you just call the eval function on the root of the tree. Semantic checking may also be needed (along with construction of a symbol table if you plan to have variables or the like).

Do I traverse through the parse tree doing a bunch of if-else string comparisons on the contents of each node and then perform the appropriate function?
No, there's no need to build a parse tree to implement a calculator. In the parts of the code where you would create a new node object, just do the calculations and return a number.
JavaCC allows you to choose any return type for a production, so just have yours return numbers.
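To illustrate the idea outside of JavaCC, here is a small hand-written recursive-descent sketch in plain Java where every "production" is a method that returns a number directly, so no tree is ever built. The Calc class and its tiny grammar (numbers, + - * /, parentheses) are my own assumptions, not JavaCC output.

```java
final class Calc {
    private final String src;
    private int pos;

    private Calc(String src) { this.src = src; }

    // Evaluates the whole input; trailing garbage is an error.
    static double eval(String text) {
        Calc c = new Calc(text);
        double value = c.expr();
        c.skipSpaces();
        if (c.pos != c.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + c.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor := NUMBER | '(' expr ')'
    private double factor() {
        if (accept('(')) {
            double value = expr();
            if (!accept(')')) throw new IllegalArgumentException("expected ')' at " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("expected number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    // Consumes c (after optional spaces) if it is the next character.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }
}
```

Calc.eval("1 + 2 * (3 - 1)") returns 5.0. In a JavaCC grammar the same arithmetic would sit in the actions of the corresponding productions.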

Some parser generators (such as YACC) let you put actions within the grammar so when you apply a certain production you can also apply a defined action during that production.
E.g. in YACC:
E: NUM '+' NUM { $$ = $1.value + $3.value; };
would add the values of the two NUMs and return the result to the E non-terminal.
Not sure what JavaCC lets you do.

I need a fast key substitution algorithm for java

Given a string with replacement keys in it, how can I most efficiently replace these keys with runtime values, using Java? I need to do this often, fast, and on reasonably long strings (say, on average, 1-2kb). The form of the keys is my choice, since I'm providing the templates here too.
Here's an example (please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations). I'd want to replace all #[^#]*?# patterns in this with property values from bean properties, true Property properties, and some other sources. The key here is fast. Any ideas?
<?xml version="1.0" encoding="utf-8"?>
<envelope version="2.3">
<delivery_instructions>
<delivery_channel>
<channel_type>#CHANNEL_TYPE#</channel_type>
</delivery_channel>
<delivery_envelope>
<chan_delivery_envelope>
<queue_name>#ADDRESS#</queue_name>
</chan_delivery_envelope>
</delivery_envelope>
</delivery_instructions>
<composition_instructions>
<mime_part content_type="application/xml">
<content><external_uri>#URI#</external_uri></content>
</mime_part>
</composition_instructions>
</envelope>
The naive implementation is to use String.replaceAll() but I can't help but think that's less than ideal. If I can avoid adding new third-party dependencies, so much the better.

The appendReplacement method in Matcher looks like it might be useful, although I can't vouch for its speed.
Here's the sample code from the Javadoc:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
EDIT: If this is as complicated as it gets, you could probably implement your own state machine fairly easily. You'd pretty much be doing what appendReplacement is already doing, although a specialized implementation might be faster.

It's premature to leap to writing your own. I would start with the naive replace solution, and actually benchmark that. Then I would try a third-party templating solution. THEN I would take a stab at the custom stream version.
Until you get some hard numbers, how can you be sure it's worth the effort to optimize it?

Does Java have a form of regexp replace() where a function gets called?
I'm spoiled by the Javascript String.replace() method. (For that matter you could run Rhino and use Javascript, but somehow I don't think that would be anywhere near as fast as a pure Java call even if the Javascript compiler/interpreter were efficient)
edit: never mind, @mmyers probably has the best answer.
gratuitous point-groveling: (and because I wanted to see if I could do it myself :)
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb,substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
// replace "substitutionTable.lookupKey" with your routine
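For later readers: assuming Java 9 or newer, Matcher.replaceAll has an overload that takes a function, which is about as close to the Javascript callback style as the standard library gets. A minimal sketch follows; FunctionReplace and fill are hypothetical names, and leaving unknown keys untouched is my own choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FunctionReplace {
    private static final Pattern KEY = Pattern.compile("#([^#]*?)#");

    // Replaces each #KEY# by calling a function on the match (Java 9+).
    static String fill(String template, Map<String, String> values) {
        return KEY.matcher(template).replaceAll(match -> {
            // Unknown keys are left as they appear in the template.
            String value = values.getOrDefault(match.group(1), match.group());
            // The returned string is still scanned for '$' and '\', so quote it.
            return Matcher.quoteReplacement(value);
        });
    }
}
```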

You really want to write something custom so you can avoid processing the string more than once. I can't stress this enough - as most of the other solutions I see look like they are ignoring that problem.
Optionally turn the text into a stream. Read it char by char forwarding each char to an output string/stream until you see the # then read to the next # slurping out the key, substituting the key into the output: repeat until end of stream.
I know it's plain old brute force - but it's probably the best.
I'm assuming you have some reasonable guarantee that '#' doesn't just 'show up' in the input independent of your token keys. :)
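A rough sketch of that single pass, working on a String with indexOf rather than a character stream (same idea, less bookkeeping). SinglePassSubstitutor is a made-up name, and keeping unknown keys verbatim is my assumption about what you'd want.

```java
import java.util.Map;

final class SinglePassSubstitutor {
    // Copies the template to the output once, swapping each #KEY# for its value.
    static String substitute(String template, Map<String, String> values) {
        StringBuilder out = new StringBuilder(template.length() + 64);
        int pos = 0;
        while (true) {
            int open = template.indexOf('#', pos);
            int close = open < 0 ? -1 : template.indexOf('#', open + 1);
            if (close < 0) {
                // No complete #...# pair left: copy the rest and finish.
                out.append(template, pos, template.length());
                return out.toString();
            }
            out.append(template, pos, open);               // literal text before the key
            String value = values.get(template.substring(open + 1, close));
            if (value != null) {
                out.append(value);
            } else {
                out.append(template, open, close + 1);     // unknown key: keep verbatim
            }
            pos = close + 1;
        }
    }
}
```

Each character of the template is looked at once and copied once, with no regex machinery involved.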

please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations
Whatever's downstream from your process will get hung up if you don't also process the inserted strings for character escapes. Which isn't to say that you can't do it yourself if you have good cause, but it does mean you have to make sure both that your patterns are all in text nodes and that you correctly escape the replacement text.
What exact advantage does #Foo# have over the standard &Foo; syntax already built into the XML libraries which ship with Java?
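If you do go the do-it-yourself route, the escaping step might look something like the sketch below. XmlText is a hypothetical helper, and it only covers the five predefined XML entities, not invalid control characters or encoding issues.

```java
final class XmlText {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

You would run every runtime value through this before it is substituted into the template.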

Text processing is always going to be bounded if you don't shift your paradigm. I don't know how flexible your domain is, so I'm not sure if this is applicable, but here goes:
Try creating an index of where your text substitutions are. This is especially good if the template doesn't change often, because it becomes part of the "compile" of the template into a binary object that can take in the values required for the substitutions and blit out the entire string as a byte array. This object can be cached/saved, and next time you substitute in the new values and use it again. I.e., you save on parsing the document every time. (Implementation is left as an exercise for the reader =D )
But please use a profiler to check whether this is actually the bottleneck that you say it is before embarking on writing a custom templating engine. The problem may actually be elsewhere.
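Taking up the exercise in its simplest form: the sketch below splits the template once into alternating literal and key pieces and reuses that list on every render. It produces a String rather than a byte array, CompiledTemplate is a made-up name, and rendering missing keys as empty text is my assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CompiledTemplate {
    // Alternating pieces: even indexes are literal text, odd indexes are key names.
    private final List<String> parts = new ArrayList<>();

    // Scans the template once; a '#' without a closing partner stays literal text.
    CompiledTemplate(String template) {
        int pos = 0;
        while (true) {
            int open = template.indexOf('#', pos);
            int close = open < 0 ? -1 : template.indexOf('#', open + 1);
            if (close < 0) {
                parts.add(template.substring(pos));          // trailing literal
                break;
            }
            parts.add(template.substring(pos, open));        // literal before the key
            parts.add(template.substring(open + 1, close));  // key name
            pos = close + 1;
        }
    }

    // Renders without re-scanning the template; missing keys render as empty text.
    String render(Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i % 2 == 0) {
                out.append(parts.get(i));
            } else {
                out.append(values.getOrDefault(parts.get(i), ""));
            }
        }
        return out.toString();
    }
}
```

You would build one CompiledTemplate per template, cache it, and call render with a fresh map for each message.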

As others have said, appendReplacement() and appendTail() are the tools you need, but there's something you have to watch out for. If the replacement string contains any dollar signs, the method will try to interpret them as capture-group references. If there are any backslashes (which are used to escape the dollar sign), it will either eat them or throw an exception.
If your replacement string is dynamically generated, you may not know in advance whether it will contain any dollar signs or backslashes. To prevent problems, you can append the replacement directly to the StringBuffer, like so:
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb, "");
sb.append(substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
You still have to call appendReplacement() each time, because that's what keeps you in sync with the match position. But this trick avoids a lot of pointless processing, which could give you a noticeable performance boost as a bonus.
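A different way to get the same safety, if you prefer to keep a single appendReplacement() call, is Matcher.quoteReplacement(), which escapes dollar signs and backslashes for you. A small sketch, where QuotedReplace and fill are hypothetical names and unknown keys are replaced by empty text as an assumption:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedReplace {
    private static final Pattern KEY = Pattern.compile("#([^#]*?)#");

    static String fill(String template, Map<String, String> values) {
        Matcher m = KEY.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), "");
            // quoteReplacement makes '$' and '\' in the value literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

This does a little more work per match than appending directly to the buffer, so the trick above is still the one to pick if you are chasing speed.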

this is what I use, from the apache commons project
http://commons.apache.org/lang/api/org/apache/commons/lang/text/StrSubstitutor.html

I also have a non-regexp based substitution library, available here. I have not tested its speed, and it doesn't directly support the syntax in your example. But it would be easy to extend to support that syntax; see, for instance, this class.

Take a look at a library that specializes in this, e.g., Apache Velocity. If nothing else, you can bet their implementation for this part of the logic is fast.

I wouldn't be so sure the accepted answer is faster than String.replaceAll(String,String). Here, for comparison, is the implementation of String.replaceAll and the Matcher.replaceAll that is used under the covers. It looks very similar to what the OP is looking for, and I'm guessing it's probably more optimized than this simplistic solution.
public String replaceAll(String s, String s1)
{
return Pattern.compile(s).matcher(this).replaceAll(s1);
}
public String replaceAll(String s)
{
reset();
boolean flag = find();
if(flag)
{
StringBuffer stringbuffer = new StringBuffer();
boolean flag1;
do
{
appendReplacement(stringbuffer, s);
flag1 = find();
} while(flag1);
appendTail(stringbuffer);
return stringbuffer.toString();
} else
{
return text.toString();
}
}

... Chii is right.
If this is a template that has to be run so many times that speed matters, find the index of your substitution tokens so you can get to them directly without having to start at the beginning each time. Abstract the 'compilation' into an object with the nice property that it only needs updating after a change to the template.

Rythm, a Java template engine, has now been released with a new feature called String interpolation mode, which allows you to do something like:
String result = Rythm.render("Hello @who!", "world");
The above case shows that you can pass arguments to the template by position. Rythm also allows you to pass arguments by name:
Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello @title @name", args);
Since your template content is relatively long, you could put it into a file and then call Rythm.render using the same API:
Map<String, Object> args = new HashMap<String, Object>();
// ... prepare the args
String result = Rythm.render("path/to/my/template.xml", args);
Note that Rythm compiles your template into Java bytecode and is fairly fast, about 2 times faster than String.format.
