I'm building a program with ANTLR that asks the user to enter some Java code and spits out the equivalent C# code. Up until now I've been assuming that they will enter something that parses as a valid compilation unit on its own, e.g. something like
package foo;
class A { ... }
class B { ... }
class C { ... }
However, that isn't always the case. They might just enter code from the inside of a class:
public void method1() {
...
}
public void method2() {
...
}
Or the inside of a method:
System.out.print("hello ");
System.out.println("world!");
Or even just an expression:
context.getSystemService(Context.ACTIVITY_SERVICE)
If I try to parse such snippets by calling parser.compilationUnit(), it won't work correctly because most of the code is parsed as error nodes. I need to call the correct method depending on the nature of the code, such as parser.expression() or parser.blockStatements(). However, I don't want to ask the user to explicitly indicate this. What's the best way to infer what kind of code I'm parsing?
Rather than trying to guess a valid grammar rule entry point to parse a language snippet of unknown scope, progressively add scope wrappers to the source text until a valid top-level rule parse is achieved.
That is, with each successive parse failure, progressively add dummy package, class, & method statements as source text wrappers.
Whichever wrapper was added to achieve a successful parse will then be a known quantity. Therefore, the parse tree node representing the original source text can be easily identified.
You will probably want a fail-fast parser; set the parser's error handler to a BailErrorStrategy to obtain this behavior.
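A minimal sketch of that wrapping loop, under some loud assumptions: the actual parse attempt is passed in as a predicate (in a real program it would run the generated parser's compilationUnit() with a BailErrorStrategy installed and return false when the parse is cancelled), and the names SnippetScope, Wrapper and Dummy, as well as the exact wrapper texts, are made up for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class SnippetScope {
    /** Hypothetical wrapper: text placed before and after the user's snippet. */
    static final class Wrapper {
        final String kind, prefix, suffix;
        Wrapper(String kind, String prefix, String suffix) {
            this.kind = kind; this.prefix = prefix; this.suffix = suffix;
        }
        String wrap(String snippet) { return prefix + snippet + suffix; }
    }

    // Ordered from no wrapping to the most wrapping; the dummy names are arbitrary.
    static final List<Wrapper> WRAPPERS = List.of(
        new Wrapper("compilation unit", "", ""),
        new Wrapper("class members", "class Dummy {\n", "\n}"),
        new Wrapper("statements", "class Dummy { void dummy() {\n", "\n} }"),
        new Wrapper("expression", "class Dummy { Object dummy =\n", "\n; }"));

    /**
     * Returns the first wrapper whose wrapped text parses cleanly.
     * The predicate stands in for the real top-level parse call.
     */
    static Optional<Wrapper> detect(String snippet, Predicate<String> parsesAsCompilationUnit) {
        for (Wrapper w : WRAPPERS) {
            if (parsesAsCompilationUnit.test(w.wrap(snippet))) {
                return Optional.of(w);
            }
        }
        return Optional.empty(); // nothing parsed: treat as a genuine syntax error
    }
}
```

Because the prefix length of the winning wrapper is known, the part of the parse tree that covers the original snippet can be located by its character offsets.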
Our algorithm in Swiftify tries to select the most suitable parse rule from a defined rule set. This web service converts Objective-C code fragments to Swift, and you can judge the quality of the conversion for yourself right away.
Algorithm
We use the open-source ObjectiveC grammar. In detail, the steps of the algorithm look like this:
Parse the input Objective-C code fragment with each of the following rules:
translationUnit
implementationDefinitionList
interfaceDeclarationList
expression
compoundStatement
If the parse result of a rule does not contain any errors, return that rule at once.
Otherwise, select the rule whose parse error is nearest to the end of the input.
If two or more rules share the same nearest-to-the-end error location, select the rule with the minimum number of syntax errors.
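The selection steps above could be sketched roughly as follows. This is not Swiftify's actual code: the Attempt class, its fields, and the idea of summarizing each parse attempt by a single error offset plus an error count are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.List;

final class RuleSelector {
    /** Hypothetical summary of parsing the input with one entry rule. */
    static final class Attempt {
        final String rule;
        final int errorCount;  // number of syntax errors reported
        final int errorOffset; // character offset of the error used for ranking
        Attempt(String rule, int errorCount, int errorOffset) {
            this.rule = rule; this.errorCount = errorCount; this.errorOffset = errorOffset;
        }
    }

    static String select(List<Attempt> attempts) {
        // A rule that parses without any error wins immediately.
        for (Attempt a : attempts) {
            if (a.errorCount == 0) return a.rule;
        }
        // Otherwise prefer the error nearest to the end of the input,
        // breaking ties by the smaller number of syntax errors.
        Comparator<Attempt> best = Comparator
            .comparingInt((Attempt a) -> -a.errorOffset)
            .thenComparingInt(a -> a.errorCount);
        return attempts.stream().min(best)
            .map(a -> a.rule)
            .orElseThrow(() -> new IllegalArgumentException("no parse attempts"));
    }
}
```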
Demo
Here are test code samples that are parsed with different parse rules:
translationUnit: http://swiftify.me/clye5z
implementationDefinitionList: http://swiftify.me/fpasza
interfaceDeclarationList: http://swiftify.me/13rv2j
compoundStatement: http://swiftify.me/4cpl9n
Our algorithm is able to detect a suitable parse rule even with incorrect input:
compoundStatement with errors: http://swiftify.me/13rv2j/1
I want to write a program which is able to search in source code files for specific patterns ... in other words: the input is a piece of code for example:
int fib (int i) {
int pred, result, temp;
pred = 1;
result = 0;
while (i > 0) {
temp = pred + result;
result = pred;
pred = temp;
i = i-1;
}
return(result);
}
The output is the set of files that contain this piece of code or similar code.
In the open source world, code is reused in other projects. Libraries especially are often copied into projects. To make bug fixing easier, I need to be able to know in which projects specific libraries or pieces of code are used.
Therefore I want to try Apache Solr. I don't know if it's a good idea (I would be happy about anything that could help me).
My plan is to index my source code files, so I need some tool to tokenize them, i.e. to give me all the names of functions, variables, etc. I can use that output to feed the Solr index. But I am not sure: maybe Apache Solr already has a tokenizer or DataImportHandler that does the trick?
I am not sure if this can be done using solr, since different projects may use different naming conventions.
Have a look at the link below if it helps:
Tools for Code Search
Apache Solr is probably not the best option here. You have more of a tree/graph comparison problem than a string comparison problem. I'd recommend using specialized tools for that.
If you do want to do it by hand, you basically need a parser with tree traversal API or some other way to get the stream/tree of tokens. This would very much depend on the language you are parsing. Something like ANTLR might be one way to go if it has the grammar for your language.
Alternatively, you could extract the information from the compiled code, if it is structured enough. For Java, something like ASM may do the job.
But you would still have to figure out the representation. Answering, for yourself, the question "how do I know these two pieces of code are similar?" should be the right first step.
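As one crude, purely illustrative answer to that question: treat two fragments as similar when they share many short token sequences after identifiers and numbers have been replaced by placeholders. The sketch below is a token-level approximation, not the tree comparison recommended above; the class name, the tiny keyword list and the use of Jaccard overlap are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CodeSimilarity {
    // An identifier, a run of digits, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_]\\w*|\\d+|\\S");
    // Deliberately tiny keyword list, for illustration only.
    private static final Set<String> KEYWORDS = Set.of("int", "while", "return", "if", "else", "for");

    /** Tokenizes code, hiding naming differences behind ID and NUM placeholders. */
    static List<String> normalizedTokens(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            if (Character.isLetter(c) || c == '_') out.add(KEYWORDS.contains(t) ? t : "ID");
            else if (Character.isDigit(c)) out.add("NUM");
            else out.add(t);
        }
        return out;
    }

    /** All runs of n consecutive tokens, joined by spaces. */
    static Set<String> shingles(List<String> tokens, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    /** Jaccard overlap of the two shingle sets, between 0.0 and 1.0. */
    static double similarity(String a, String b, int n) {
        Set<String> sa = shingles(normalizedTokens(a), n);
        Set<String> sb = shingles(normalizedTokens(b), n);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) common.size() / union.size();
    }
}
```

With this representation, "temp = pred + result;" and "t = p + r;" produce identical token sequences, which is the kind of renaming tolerance a plain text index does not give you.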
I am new to ANTLR and I am struggling to do the following:
What I want to do, after I have parsed a source file (for which I have a valid grammar of course) and have the AST in memory, is to change some things and then print it back out through the visitor API.
e.g.
int foo() {
y = x ? 1 : 2;
}
and turn it into:
int foo() {
if (x) {
y = 1;
} else {
y = 2;
}
}
Up to now I have the appropriate grammar to parse such syntax and I have also made some visitor methods that are getting called when I am on the correct position. What baffles me is that during visiting I can't change the text.
Ideally I would like to have something like this:
public Void visitTernExpr(SimpleCParser.TernExprContext ctx) {
ctx.setText("something");
return null;
}
and in my Main I would like to have this AST edited by different visitors that each one of them is specialised in something. Like this:
ANTLRInputStream input = new ANTLRInputStream(new FileInputStream(filename));
SimpleCLexer lexer = new SimpleCLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SimpleCParser parser = new SimpleCParser(tokens);
ProgramContext ctx = parser.program();
MyChecker1 mc1 = new MyChecker1();
mc1.visit(ctx);
MyChecker2 mc2 = new MyChecker2();
mc2.visit(ctx);
ctx.printToFile("myfile");
Is there any way of doing this in ANTLR, or am I heading in a very wrong direction?
You can do this with ANTLR by smashing the AST nodes and links. You'll have to create all the replacement subtree nodes and splice them in place. Then you'll have to implement the "spit source text" tree walk; I suggest you investigate "string templates" for this purpose.
But ultimately you have to do a lot of work to achieve this effect. This is because the goal of the ANTLR tool is largely focused around "parsing", which pushes the rest on you.
If what you want to do is replace one set of syntax with another, what you really want is a program transformation system. These are tools designed to have all of the above built in already, so you don't have to reinvent it all. They also usually have source-to-source transformations, which make tasks like the one you have shown much, much easier to implement.
To accomplish your example with our DMS program transformation engine, you'd write a transformation rule and then apply it:
rule replace_ternary_assignment_by_ifthenelse
(l: left_hand_side, c: expression, e1: expression, e2: expression):
statement -> statement
"\l = \c ? \e1 : \e2;"
=> " if (\c) \l = \e1; else \l = \e2 ";
DMS parses your code, builds ASTs, finds matches for the rewrites, and constructs/splices all those replacement nodes for you. Finally, DMS has built-in prettyprinters to regenerate the text. The point of all this is to let you get on with your task of modifying your code, rather than creating a whole new engineering job before you can do your task. Read my essay, "Life After Parsing", easily found via my bio or with a Google search, for more on this topic.
[If you go to the DMS Wikipedia page, you will amusingly find the inverse of this transform used as an example.]
I would use a listener, and yes you can modify the AST while you are walking through it.
You can create a new instance of the if/else context and then replace the ternary operator context with it. This is possible because you have a reference to the rule's parent and an extensive API for handling every rule's children.
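To show the shape of the work involved (build a replacement subtree, splice it in, then print the tree back out), here is a small sketch on a hand-made statement tree. It deliberately does not use ANTLR's generated context classes: Stmt, TernaryAssign, Assign and IfElse are hypothetical node types, and expressions are kept as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class TernaryRewrite {
    /** A statement node that can print itself as source text. */
    interface Stmt { String print(String indent); }

    /** Models "lhs = cond ? ifTrue : ifFalse;". */
    static final class TernaryAssign implements Stmt {
        final String lhs, cond, ifTrue, ifFalse;
        TernaryAssign(String lhs, String cond, String ifTrue, String ifFalse) {
            this.lhs = lhs; this.cond = cond; this.ifTrue = ifTrue; this.ifFalse = ifFalse;
        }
        public String print(String indent) {
            return indent + lhs + " = " + cond + " ? " + ifTrue + " : " + ifFalse + ";\n";
        }
    }

    /** Models "lhs = rhs;". */
    static final class Assign implements Stmt {
        final String lhs, rhs;
        Assign(String lhs, String rhs) { this.lhs = lhs; this.rhs = rhs; }
        public String print(String indent) { return indent + lhs + " = " + rhs + ";\n"; }
    }

    /** Models "if (cond) { thenStmt } else { elseStmt }". */
    static final class IfElse implements Stmt {
        final String cond;
        final Stmt thenStmt, elseStmt;
        IfElse(String cond, Stmt thenStmt, Stmt elseStmt) {
            this.cond = cond; this.thenStmt = thenStmt; this.elseStmt = elseStmt;
        }
        public String print(String indent) {
            return indent + "if (" + cond + ") {\n" + thenStmt.print(indent + "    ")
                 + indent + "} else {\n" + elseStmt.print(indent + "    ")
                 + indent + "}\n";
        }
    }

    /** Splices an equivalent if/else node in place of every ternary assignment. */
    static void rewrite(List<Stmt> body) {
        for (int i = 0; i < body.size(); i++) {
            if (body.get(i) instanceof TernaryAssign) {
                TernaryAssign t = (TernaryAssign) body.get(i);
                body.set(i, new IfElse(t.cond, new Assign(t.lhs, t.ifTrue), new Assign(t.lhs, t.ifFalse)));
            }
        }
    }

    public static void main(String[] args) {
        List<Stmt> body = new ArrayList<>();
        body.add(new TernaryAssign("y", "x", "1", "2"));
        rewrite(body);
        for (Stmt s : body) System.out.print(s.print("    "));
    }
}
```

Each transformation pass can be a separate rewrite method like this one, applied in sequence before a single printing pass. If only text-level edits are needed, ANTLR 4's runtime also provides a TokenStreamRewriter class that may be worth a look.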
I'm trying to use JavaCC to build a simple command line calculator that can handle a variety of expressions. While there are plenty of tutorials out there on how to write grammars, none that I've seen so far explain what happens afterwards.
What I understand right now is that after a string is passed into the parser, it's split into tokens and turned into a parse tree. What happens next? Do I traverse through the parse tree doing a bunch of if-else string comparisons on the contents of each node and then perform the appropriate function?
I highly suggest you watch Scott Stanchfield's ANTLR 3.x tutorials. Even if you don't end up using ANTLR, which may be overkill for your project but I doubt it, you will learn a lot by watching him go through the thought process.
In general the process is...
Build a lexer to understand your tokens
Build a parser that can validate and understand and organize the input into an abstract syntax tree (AST) which should represent a simplified/easy-to-work-with version of your syntax
Run any calculation based on the AST
You need to actually compile or interpret it, according to what you need.
For a calculator you just need to visit the tree recursively and evaluate it, while with a more complex language you would have to translate it to an intermediate language that is assembly-like but abstracts away the underlying architecture.
Of course you could develop your own simple VM that executes the set of instructions your language compiles to, but that would be overkill in your case. Just visit the parse tree. Something like:
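enum Operation {
    PLUS, MINUS
}
interface TreeNode {
    float eval();
}
// Leaf node: a literal number.
class TreeFloat implements TreeNode {
    float val;
    public float eval() { return val; }
}
// Inner node: evaluates both children, then combines the results.
class TreeBinaryOp implements TreeNode {
    TreeNode first;
    TreeNode second;
    Operation op;
    public float eval() {
        if (op == Operation.PLUS)
            return first.eval() + second.eval();
        return first.eval() - second.eval(); // MINUS
    }
}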
Then you just call the eval function on the root of the tree. Semantic checking could also be needed (along with the construction of a symbol table, if you plan to have variables or the like).
Do I traverse through the parse tree doing a bunch of if-else string comparisons on the contents of each node and then perform the appropriate function?
No, there's no need to build a parse tree to implement a calculator. In the parts of the code where you would create a new node object, just do the calculations and return a number.
JavaCC allows you to choose any return type for a production, so just have yours return numbers.
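To illustrate the idea of productions that return numbers, here is a small hand-written recursive-descent calculator. It is not JavaCC output; it only mirrors the structure, with one method per production that returns the computed value instead of a tree node. The grammar and the class name Calc are assumptions chosen for the example.

```java
final class Calc {
    private final String src;
    private int pos = 0;

    private Calc(String src) { this.src = src; }

    /** Parses and evaluates the whole input in one pass. */
    static double evaluate(String text) {
        Calc c = new Calc(text);
        double value = c.expr();
        c.skipSpaces();
        if (c.pos != c.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + c.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor := NUMBER | '(' expr ')'
    private double factor() {
        if (accept('(')) {
            double value = expr();
            if (!accept(')')) throw new IllegalArgumentException("expected ')' at " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length() && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
        if (start == pos) throw new IllegalArgumentException("expected number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    /** Consumes the given character if it is next (after spaces). */
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }
}
```

For example, Calc.evaluate("1 + 2 * 3") returns 7.0 because term() binds tighter than expr(). In a JavaCC grammar the same arithmetic would sit in the Java action blocks of the corresponding productions.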
Some parser generators (such as YACC) let you put actions within the grammar so when you apply a certain production you can also apply a defined action during that production.
E.g. in YACC:
E: NUM '+' NUM { $$ = $1.value + $3.value; } ;
would add the values of NUM and return the result to the E non-terminal.
Not sure what JavaCC lets you do.
Given a string with replacement keys in it, how can I most efficiently replace these keys with runtime values, using Java? I need to do this often, fast, and on reasonably long strings (say, on average, 1-2kb). The form of the keys is my choice, since I'm providing the templates here too.
Here's an example (please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations). I'd want to replace all #[^#]*?# patterns in this with property values from bean properties, true Property properties, and some other sources. The key here is fast. Any ideas?
<?xml version="1.0" encoding="utf-8"?>
<envelope version="2.3">
<delivery_instructions>
<delivery_channel>
<channel_type>#CHANNEL_TYPE#</channel_type>
</delivery_channel>
<delivery_envelope>
<chan_delivery_envelope>
<queue_name>#ADDRESS#</queue_name>
</chan_delivery_envelope>
</delivery_envelope>
</delivery_instructions>
<composition_instructions>
<mime_part content_type="application/xml">
<content><external_uri>#URI#</external_uri></content>
</mime_part>
</composition_instructions>
</envelope>
The naive implementation is to use String.replaceAll() but I can't help but think that's less than ideal. If I can avoid adding new third-party dependencies, so much the better.
The appendReplacement method in Matcher looks like it might be useful, although I can't vouch for its speed.
Here's the sample code from the Javadoc:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
EDIT: If this is as complicated as it gets, you could probably implement your own state machine fairly easily. You'd pretty much be doing what appendReplacement is already doing, although a specialized implementation might be faster.
It's premature to leap to writing your own. I would start with the naive replace solution, and actually benchmark that. Then I would try a third-party templating solution. THEN I would take a stab at the custom stream version.
Until you get some hard numbers, how can you be sure it's worth the effort to optimize it?
Does Java have a form of regexp replace() where a function gets called?
I'm spoiled by the Javascript String.replace() method. (For that matter you could run Rhino and use Javascript, but somehow I don't think that would be anywhere near as fast as a pure Java call even if the Javascript compiler/interpreter were efficient)
edit: never mind, @mmyers probably has the best answer.
gratuitous point-groveling: (and because I wanted to see if I could do it myself :)
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb,substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
// replace "substitutionTable.lookupKey" with your routine
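On the question of a replace that calls a function: since Java 9, Matcher.replaceAll also accepts a Function<MatchResult, String>. A minimal sketch follows, assuming the values live in a plain Map and that unknown keys become empty strings (both are choices made for this example). The returned string is still interpreted for $ and \ references, so it is passed through Matcher.quoteReplacement first.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CallbackReplace {
    // Compiled once and reused; same pattern as in the loop above.
    private static final Pattern KEY = Pattern.compile("#([^#]*?)#");

    /** Replaces every #KEY# with its value; unknown keys become "" (an assumption). */
    static String substitute(String s, Map<String, String> values) {
        return KEY.matcher(s).replaceAll(
            match -> Matcher.quoteReplacement(values.getOrDefault(match.group(1), "")));
    }

    public static void main(String[] args) {
        System.out.println(substitute("<queue_name>#ADDRESS#</queue_name>",
                                      Map.of("ADDRESS", "orders.in")));
    }
}
```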
You really want to write something custom so you can avoid processing the string more than once. I can't stress this enough - as most of the other solutions I see look like they are ignoring that problem.
Optionally turn the text into a stream. Read it char by char forwarding each char to an output string/stream until you see the # then read to the next # slurping out the key, substituting the key into the output: repeat until end of stream.
I know it's plain old brute force - but it's probably the best.
I'm assuming you have some reasonable assumption around '#' not just 'showing up' independent of your token keys in the input. :)
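A minimal sketch of that single-pass scan. It uses String.indexOf to jump between '#' characters rather than a literal char-by-char loop, which amounts to the same one pass over the input; the class name and the Function-based lookup are assumptions, and the lookup is assumed to return a non-null value for every key.

```java
import java.util.Map;
import java.util.function.Function;

final class HashTemplate {
    /** Copies the template to the output once, replacing each #KEY# on the way. */
    static String substitute(String template, Function<String, String> lookup) {
        StringBuilder out = new StringBuilder(template.length() + 64);
        int i = 0;
        while (i < template.length()) {
            int open = template.indexOf('#', i);
            if (open < 0) break;                 // no more keys
            int close = template.indexOf('#', open + 1);
            if (close < 0) break;                // unmatched '#': keep the rest literally
            out.append(template, i, open);       // literal text before the key
            out.append(lookup.apply(template.substring(open + 1, close)));
            i = close + 1;
        }
        out.append(template, i, template.length()); // trailing literal text
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Map.of("CHANNEL_TYPE", "queue", "ADDRESS", "orders.in");
        System.out.println(substitute(
            "<channel_type>#CHANNEL_TYPE#</channel_type><queue_name>#ADDRESS#</queue_name>",
            values::get));
    }
}
```

Because the replacement text is appended directly, dollar signs and backslashes in the values need no special treatment here.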
please don't get hung up on it being XML; I want to do this, if possible, cheaper than using XSL or DOM operations
Whatever's downstream from your process will get hung up if you don't also process the inserted strings for character escapes. Which isn't to say that you can't do it yourself if you have good cause, but it does mean you have to make sure both that your patterns are all in text nodes and that you correctly escape the replacement text.
What exact advantage does #Foo# have over the standard &Foo; syntax already built into the XML libraries which ship with Java?
Text processing is always going to be bounded if you don't shift your paradigm. I don't know how flexible your domain is, so I'm not sure if this is applicable, but here goes:
Try creating an index of where your text substitutions are. This is especially good if the template doesn't change often, because the index becomes part of the "compile" of the template into a binary object that can take in the values required for the substitutions and blit out the entire string as a byte array. This object can be cached/saved, and next time you substitute in the new values and use it again. I.e., you save on parsing the document every time. (Implementation is left as an exercise to the reader =D )
But please use a profiler to check whether this is actually the bottleneck that you say it is before embarking on writing a custom templating engine. The problem may actually be elsewhere.
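One possible shape for such a precompiled template, as a hedged sketch: the template is scanned once into literal segments and key names, and each render only concatenates. It works on Strings rather than the byte arrays mentioned above, and the class name, the #KEY# syntax handling and the Map-based values are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CompiledTemplate {
    // Invariant: literals.size() == keys.size() + 1, and keys.get(i) sits
    // between literals.get(i) and literals.get(i + 1).
    private final List<String> literals = new ArrayList<>();
    private final List<String> keys = new ArrayList<>();

    /** Scans the template once and records where each #KEY# sits. */
    CompiledTemplate(String template) {
        int i = 0;
        while (true) {
            int open = template.indexOf('#', i);
            int close = open < 0 ? -1 : template.indexOf('#', open + 1);
            if (close < 0) {
                literals.add(template.substring(i)); // trailing literal text
                return;
            }
            literals.add(template.substring(i, open));
            keys.add(template.substring(open + 1, close));
            i = close + 1;
        }
    }

    /** Fills in the values without looking at the template text again. */
    String render(Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < keys.size(); k++) {
            out.append(literals.get(k)).append(values.get(keys.get(k)));
        }
        return out.append(literals.get(keys.size())).toString();
    }
}
```

A CompiledTemplate instance can be cached and shared, since render does not modify it. Whether this beats the regex loop in practice is exactly what the profiler should tell you.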
As others have said, appendReplacement() and appendTail() are the tools you need, but there's something you have to watch out for. If the replacement string contains any dollar signs, the method will try to interpret them as capture-group references. If there are any backslashes (which are used to escape the dollar sign), it will either eat them or throw an exception.
If your replacement string is dynamically generated, you may not know in advance whether it will contain any dollar signs or backslashes. To prevent problems, you can append the replacement directly to the StringBuffer, like so:
Pattern p = Pattern.compile("#([^#]*?)#");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb, "");
sb.append(substitutionTable.lookupKey(m.group(1)));
}
m.appendTail(sb);
You still have to call appendReplacement() each time, because that's what keeps you in sync with the match position. But this trick avoids a lot of pointless processing, which could give you a noticeable performance boost as a bonus.
this is what I use, from the apache commons project
http://commons.apache.org/lang/api/org/apache/commons/lang/text/StrSubstitutor.html
I also have a non-regexp based substitution library, available here. I have not tested its speed, and it doesn't directly support the syntax in your example. But it would be easy to extend to support that syntax; see, for instance, this class.
Take a look at a library that specializes in this, e.g., Apache Velocity. If nothing else, you can bet their implementation for this part of the logic is fast.
I wouldn't be so sure the accepted answer is faster than String.replaceAll(String, String). For comparison, here is the implementation of String.replaceAll and of the Matcher.replaceAll that it uses under the covers. It looks very similar to what the OP is looking for, and I'm guessing it's probably more optimized than this simplistic solution.
public String replaceAll(String s, String s1)
{
return Pattern.compile(s).matcher(this).replaceAll(s1);
}
public String replaceAll(String s)
{
reset();
boolean flag = find();
if(flag)
{
StringBuffer stringbuffer = new StringBuffer();
boolean flag1;
do
{
appendReplacement(stringbuffer, s);
flag1 = find();
} while(flag1);
appendTail(stringbuffer);
return stringbuffer.toString();
} else
{
return text.toString();
}
}
... Chii is right.
If this is a template that has to be run so many times that speed matters, find the indexes of your substitution tokens so that you can get to them directly without having to start at the beginning each time. Abstract the 'compilation' into an object with those properties; it should only need updating after a change to the template.
Rythm, a Java template engine, has now been released with a new feature called String interpolation mode, which allows you to do something like:
String result = Rythm.render("Hello @who!", "world");
The above case shows that you can pass arguments to the template by position. Rythm also allows you to pass arguments by name:
Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello @title @name", args);
Since your template content is relatively long, you could put it into a file and then call Rythm.render using the same API:
Map<String, Object> args = new HashMap<String, Object>();
// ... prepare the args
String result = Rythm.render("path/to/my/template.xml", args);
Note that Rythm compiles your template into Java bytecode, and it's fairly fast: about 2 times faster than String.format.
Links:
Check the full featured demonstration
read a brief introduction to Rythm
download the latest package or
fork it