I am trying to parse Java class files using the Java.g4 grammar and ANTLR 4.
There is a particular parser rule as follows:
classOrInterfaceType
: Identifier typeArguments? ('.' Identifier typeArguments? )*
;
I am parsing it in my visitor class in this way:
public String visitClassOrInterfaceType(JavaParser.ClassOrInterfaceTypeContext ctx) {
StringBuilder clsIntr = new StringBuilder("");
int n = ctx.getChildCount();
for(int i = 0; i < n; i++){
TerminalNode id = ctx.Identifier(i);
if(id!=null){
clsIntr.append(id.getText()).append(" ");
}
TypeArgumentsContext typArgCtx =ctx.typeArguments(i);
if(typArgCtx!=null){
String val = this.visitTypeArguments(typArgCtx);
clsIntr.append(val);
}
}
return clsIntr.toString();
}
Is this correct, or is there some other way to do this?
Your approach looks OK, though that ultimately depends on what you are actually trying to do. My crystal ball tells me you are trying to reconstruct the original input text by walking the parse tree. If so, there is a much simpler way. Each parse context has start and stop members that hold the first and last tokens of the text range the context covers. You can use those to get the original text exactly as it was entered (via the token stream and the tokens' positions).
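To make the start/stop idea concrete, here is a minimal sketch in plain Java. The Tok class and the between helper are hypothetical stand-ins, not ANTLR classes; they only model the character offsets a token carries. In the real ANTLR 4 runtime the same offsets should be available from ctx.getStart().getStartIndex() and ctx.getStop().getStopIndex(), and the slice would be taken from the character stream rather than from a String.

```java
final class OriginalText {
    // Hypothetical stand-in for a lexer token: only the character offsets matter here.
    static final class Tok {
        final int startIndex; // offset of the token's first character in the input
        final int stopIndex;  // offset of the token's last character (inclusive)

        Tok(int startIndex, int stopIndex) {
            this.startIndex = startIndex;
            this.stopIndex = stopIndex;
        }
    }

    // Slice the raw input from the first character of `start`
    // through the last character of `stop`.
    static String between(String input, Tok start, Tok stop) {
        return input.substring(start.startIndex, stop.stopIndex + 1);
    }

    public static void main(String[] args) {
        String input = "Map<String, List<Integer>> m;";
        Tok start = new Tok(0, 2);   // the identifier "Map"
        Tok stop = new Tok(25, 25);  // the final closing '>'
        System.out.println(between(input, start, stop));
    }
}
```

Because the slice is taken from the raw input, whitespace and anything else the lexer skipped inside the range is preserved, which is not the case when the text is rebuilt child by child.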
I'm writing a pretty-printer for legacy code in an older language. The plan is for me to learn parsing and unparsing before I write a translator to output C++. I kind of got thrown into the deep end with Java and ANTLR back in June, so I definitely have some knowledge gaps.
I've gotten to the point where I'm comfortable writing methods for my custom listener, and I want to be able to pretty-print the comments as well. My comments are on a separate hidden channel. Here are the grammar rules for the hidden tokens:
/* Comments and whitespace -- Nested comments are allowed, each is redirected to a specific channel */
COMMENT_1 : '(*' (COMMENT_1|COMMENT_2|.)*? '*)' -> channel(1) ;
COMMENT_2 : '{' (COMMENT_1|COMMENT_2|.)*? '}' -> channel(1) ;
NEWLINES : [\r\n]+ -> channel(2) ;
WHITESPACE : [ \t]+ -> skip ;
I've been playing with the Cymbol CommentShifter example on p. 207 of The Definitive ANTLR 4 Reference and I'm trying to figure out how to adapt it to my listener methods.
public void exitVarDecl(ParserRuleContext ctx) {
Token semi = ctx.getStop();
int i = semi.getTokenIndex();
List<Token> cmtChannel = tokens.getHiddenTokensToRight(i, CymbolLexer.COMMENTS);
if (cmtChannel != null) {
Token cmt = cmtChannel.get(0);
if (cmt != null) {
String txt = cmt.getText().substring(2);
String newCmt = "// " + txt.trim(); // printing comments in original format
rewriter.insertAfter(ctx.stop, newCmt); // at end of line
rewriter.replace(cmt, "\n");
}
}
}
I adapted this example by using exitEveryRule rather than exitVarDecl and it worked for the Cymbol example, but when I adapt it to my own listener I get a NullPointerException whether I use exitEveryRule or exitSpecificThing.
I'm looking at this answer and it seems promising but I think what I really need is an explanation of how the hidden channel data is stored and how to access it. It took me months to really get listener methods and context in the parse tree.
It seems like CommonTokenStream.LT(), CommonTokenStream.LA(), and consume() are what I want to be using, but why is the example in that SO answer using completely different methods from the ANTLR book example? What should I know about the token index or token types?
I'd like to better understand the logic behind this.
Okay, so I can't answer how ANTLR stores its data internally, but I can tell you how to access your hidden tokens. I have tested this on my computer using ANTLR v4.1 for C# on .NET v4.5.2.
I have a rule that looks like this:
LineComment
: '//' ~[\r\n]*
-> channel(1)
;
In my code, I am getting the entire raw token stream like this:
IList<IToken> lTokenList = cmnTokenStream.Get( 0, cmnTokenStream.Size );
To test, I printed the token list using the following loop:
foreach ( IToken iToken in lTokenList )
{
Console.WriteLine( "{0}[{1}] : {2}",
iToken.Channel,
iToken.TokenIndex,
iToken.Text );
}
Running on this code:
void Foo()
{
// comment
i = 5;
}
Yields the following output (for the sake of brevity, please assume I have a complete grammar that is also ignoring whitespace):
0[0] : void
0[1] : Foo
0[2] : (
0[3] : )
0[4] : {
1[5] : // comment
0[6] : i
0[7] : =
0[8] : 5
0[9] : ;
0[10] : }
You can see the channel index is 1 only for the single comment token. So you can use this loop to access only the comment tokens:
int lCommentCount = 0;
foreach ( IToken iToken in lTokenList )
{
if ( iToken.Channel == 1 )
{
Console.WriteLine( "{0} : {1}",
lCommentCount++,
iToken.Text );
}
}
Then you can do whatever you need with those tokens. This also works if you have multiple channels, though I will caution against using more than 65,536 of them. ANTLR gave the following error when I tried to compile a grammar with a token rule redirecting to channel index 65536:
Serialized ATN data element out of range.
So I guess they're only using a 16-bit unsigned integer to index the channels. Weird.
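On the "how is the hidden data stored" part: as the dump above suggests, all tokens sit in one list and the channel is just a number on each token. The sketch below models that in plain Java. Tok and hiddenToRight are hypothetical stand-ins, not the ANTLR API; hiddenToRight imitates what I understand getHiddenTokensToRight to do, namely collect the tokens on the requested channel that sit between the given index and the next default-channel token, and return null when there are none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HiddenTokens {
    static final int DEFAULT_CHANNEL = 0;

    // Hypothetical stand-in for a token: its text plus the channel the lexer put it on.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Collect tokens on `channel` that follow index `i`, stopping at the next
    // default-channel token or the end of the list. Returns null if none are found.
    static List<Tok> hiddenToRight(List<Tok> all, int i, int channel) {
        List<Tok> found = new ArrayList<>();
        for (int j = i + 1; j < all.size(); j++) {
            Tok t = all.get(j);
            if (t.channel == DEFAULT_CHANNEL) {
                break;
            }
            if (t.channel == channel) {
                found.add(t);
            }
        }
        return found.isEmpty() ? null : found;
    }

    // The token list from the dump above, starting at the opening brace.
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("{", 0), new Tok("// comment", 1), new Tok("i", 0),
            new Tok("=", 0), new Tok("5", 0), new Tok(";", 0), new Tok("}", 0));
    }

    public static void main(String[] args) {
        List<Tok> all = sample();
        System.out.println(hiddenToRight(all, 0, 1).get(0).text); // comment after "{"
        System.out.println(hiddenToRight(all, 5, 1));             // nothing after ";"
    }
}
```

The null return in the second call is why the book example checks cmtChannel != null before calling get(0). A listener that runs on every rule will hit many stop tokens with no comment to their right, so any unguarded access there is a likely source of a NullPointerException.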
Using ANTLR 4, I want to generate the parse tree in the form of Java/JavaScript code.
This is what my main Java file looks like:
String sql = "SELECT log AS x FROM t1 \n" +
"GROUP BY x\n" +
"HAVING count(*) >= 4 \n" +
"ORDER BY max(n) + 0";
// Create a lexer and parser for the input.
SQLiteLexer lexer = new SQLiteLexer(new ANTLRInputStream(sql));
SQLiteParser parser = new SQLiteParser(new CommonTokenStream(lexer));
// Invoke the `select_stmt` production.
ParseTree tree = parser.select_stmt();
ParseTreeWalker walker = new ParseTreeWalker();
SQLiteListener listener = new SQLiteBaseListener();
ParseTreeWalker.DEFAULT.walk(listener, tree);
System.out.println(listener.);
What function should I invoke to generate the parse tree in code format?
I'm not sure what you meant by "hierarchical" view. There is a -gui option with the antlr command line tool if that is what you are looking for. Otherwise, you can print how the grammar is evaluated by adding actions and/or adding Sysouts in the enter/exit listener methods created by ANTLR. For example, if you have the following grammar:
grammar Grammar;
@lexer::header{
//package name where Java files will be created
}
@parser::header{
//package name where Java files will be created
}
value : letters | number | string;
letters : LETTERS;
number : NUMBER;
string : STRING;
LETTERS : '/*' {System.out.println("Found Letters!");};
NUMBER : [0-9]+ {System.out.println("Found Number!");};
STRING : [a-zA-Z0-9]+ {System.out.println("Found String!");};
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ANTLR 4 will generate a GrammarListener.java (assuming that your grammar is called Grammar.g4) if you right-click the .g4 file in Eclipse and select "Generate ANTLR Recognizer" with the ANTLR 4 IDE plugin installed. You can also generate the parser and lexer from Java by calling the static org.antlr.v4.Tool.main:
Tool.main(new String[]{grammarFile, "-o", outputDirectory});
The generated interface will contain methods like the following:
public interface GrammarListener extends ParseTreeListener {
void enterValue(GrammarParser.ValueContext ctx);
void exitValue(GrammarParser.ValueContext ctx);
void enterLetters(GrammarParser.LettersContext ctx);
void exitLetters(GrammarParser.LettersContext ctx);
.
.
}
You would need to implement this interface...
public class GrammarListenerImpl implements GrammarListener {
.
.
@Override
public void enterLetters(GrammarParser.LettersContext ctx) {
System.out.println("Enter: Letters");
// do other stuff
.
.
}
and add Sysouts, in this case, or other business logic to handle when this match occurs in the grammar. The Sysouts can generate something like:
Enter value
Enter Letters
Do something...
Exit Letters
Exit value
This would show, in nested form (indent it with tabs or spaces), the call sequence in which the grammar rules are evaluated.
This line:
ParseTree tree = parser.select_stmt();
is your parse tree. Check out the API docs to see what methods it has: http://www.antlr.org/api/JavaTool/org/antlr/v4/runtime/tree/ParseTree.html
You'll probably be interested in its getChild(...) and getParent() methods.
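As a rough sketch of what walking the tree with those accessors looks like, here is a recursive indented dump in plain Java. The Node class is a hypothetical stand-in that only mirrors getText(), getChildCount() and getChild(int); it is not the ANTLR ParseTree type, and its labels are made up for the example. With the real runtime, tree.toStringTree(parser) should also give you a one-line LISP-style rendering without writing any walker at all.

```java
import java.util.ArrayList;
import java.util.List;

final class TreeDump {
    // Hypothetical stand-in for a parse-tree node; it mirrors only the
    // three accessors used by the dump below.
    static final class Node {
        private final String text;
        private final List<Node> children = new ArrayList<>();

        Node(String text, Node... kids) {
            this.text = text;
            for (Node k : kids) {
                children.add(k);
            }
        }

        String getText() { return text; }
        int getChildCount() { return children.size(); }
        Node getChild(int i) { return children.get(i); }
    }

    // Depth-first walk: print this node indented by its depth, then recurse into children.
    static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getText()).append('\n');
        for (int i = 0; i < node.getChildCount(); i++) {
            dump(node.getChild(i), depth + 1, out);
        }
    }

    static String dump(Node root) {
        StringBuilder out = new StringBuilder();
        dump(root, 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Node tree = new Node("select_stmt",
            new Node("SELECT"),
            new Node("result_column", new Node("log")));
        System.out.print(dump(tree));
    }
}
```

One difference to keep in mind: on a real rule node, getText() returns the concatenated text of everything below it rather than the rule name, so a real dump would normally print the rule name for inner nodes and getText() only for leaves.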
Hopefully there are a few experts in the EpochX framework around here...I'm not sure that the user group is still active.
I am attempting to implement simple recursion within their representation of a BNF grammar and have run into the following issue:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -9
at java.lang.String.substring(String.java:1911)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:235)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:254)
at org.epochx.tools.eval.EpoxInterpreter.eval(EpoxInterpreter.java:89)
at org.epochx.ge.model.epox.SAGE.getFitness(SAGE.java:266)
at org.epochx.ge.representation.GECandidateProgram.getFitness(GECandidateProgram.java:304)
at org.epochx.stats.StatField$7.getStatValue(StatField.java:97)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.StatField$8.getStatValue(StatField.java:117)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.Stats.getStats(Stats.java:162)
at org.epochx.stats.Stats.print(Stats.java:194)
at org.epochx.stats.Stats.print(Stats.java:178)
at org.epochx.ge.model.epox.Tester$1.onGenerationEnd(Tester.java:41)
at org.epochx.life.Life.fireGenerationEndEvent(Life.java:634)
at org.epochx.core.InitialisationManager.initialise(InitialisationManager.java:207)
at org.epochx.core.RunManager.run(RunManager.java:166)
at org.epochx.core.Model.run(Model.java:147)
at org.epochx.ge.model.GEModel.run(GEModel.java:82)
at org.epochx.ge.model.epox.Tester.main(Tester.java:55)
Java Result: 1
My simple grammar is structured as follows, where terminals are passed in separately to the evaluation function:
public static final String GRAMMAR_FRAGMENT = "<program> ::= <node>\n"
+ "<node> ::= <s_list>\n"
+ "<s_list> ::= <s> | <s> <s_list>\n"
+ "<s> ::= FUNCTION( <terminal> )\n"
+ "<terminal> ::= ";
Edit: Terminal creation -
// Generate the input sequences.
inputValues = BoolUtils.generateBoolSequences(4);
argNames = new String[4];
argNames[0] = "void";
argNames[1] = "bubbleSort";
argNames[2] = "int*";
argNames[3] = "numbers";
...
// Evaluate all possible inputValues.
for (final boolean[] vars: inputValues) {
// Convert to object array.
final Boolean[] objVars = ArrayUtils.toObject(vars);
Boolean result = null;
try {
interpreter.eval(program.getSourceCode(),
argNames, objVars);
score = (double)program.getParseTreeDepth();
} catch (final MalformedProgramException e) {
// Assign worst possible fitness and stop evaluating.
score = 0;
break;
}
}
The stack trace shows that the problem is actually in the EpoxParser. This means that it's not so much that the grammar is ill-formed, but rather that the programs that get generated cannot be parsed.
Because you're using the EpoxInterpreter, the programs that get generated get parsed as Epox programs. Epox is the name used to refer to the language that the tree representation of EpochX uses (a sort of corrupted form of Lisp which you can add your own literals/functions to). The parsing expects the S-Expression format, and tries to identify each function and terminal and it builds a tree made up of equivalent Node objects (see the org.epochx.epox.* packages). Then the tree can be evaluated to run the program.
But in Epox there's no built-in function called FUNCTION, nor any known literals 'void', 'bubbleSort', 'int*' or 'numbers'. So the parsing fails. So you need to add these constructs to the EpoxParser, so it knows how to parse them into nodes. You can do this with the declareFunction, declareLiteral and declareVariable methods (see the JavaDoc for the EpoxParser http://www.epochx.org/javadoc/1.4/).
I'm trying to write a simple interactive (using System.in as source) language using antlr, and I have a few problems with it. The examples I've found on the web are all using a per line cycle, e.g.:
while(readline)
result = parse(line)
doStuff(result)
But what if I'm writing something like Pascal/SMTP/etc., with a "the first line must look like X" requirement? I know it can be checked in doStuff, but I think logically it is part of the syntax.
Or what if a command is split into multiple lines? I can try
while(readline)
lines.add(line)
try
result = parse(lines)
lines = []
doStuff(result)
catch
nop
But with this I'm also hiding real errors.
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?
Yes, ANTLR can do this. Perhaps not out of the box, but with a bit of custom code, it sure is possible. You also don't need to re-parse the entire token stream for it.
Let's say you want to parse a very simple language line by line, where each line is either a program declaration, or a uses declaration, or a statement.
It should always start with a program declaration, followed by zero or more uses declarations followed by zero or more statements. uses declarations cannot come after statements and there can't be more than one program declaration.
For simplicity, a statement is just a simple assignment: a = 4 or b = a.
An ANTLR grammar for such a language could look like this:
grammar REPL;
parse
: programDeclaration EOF
| usesDeclaration EOF
| statement EOF
;
programDeclaration
: PROGRAM ID
;
usesDeclaration
: USES idList
;
statement
: ID '=' (INT | ID)
;
idList
: ID (',' ID)*
;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
But we'll need to add a couple of checks, of course. Also, by default, a parser takes a token stream in its constructor, but since we're planning to trickle tokens into the parser line by line, we'll need to create a new constructor in our parser. You can add custom members to your lexer or parser classes by putting them in a @parser::members { ... } or @lexer::members { ... } section respectively. We'll also add a couple of boolean flags to keep track of whether the program declaration has happened already and whether uses declarations are allowed. Finally, we'll add a process(String source) method which, for each new line, creates a lexer which gets fed to the parser.
All of that would look like:
@parser::members {
boolean programDeclDone;
boolean usesDeclAllowed;
public REPLParser() {
super(null);
programDeclDone = false;
usesDeclAllowed = true;
}
public void process(String source) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(source);
REPLLexer lexer = new REPLLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
super.setTokenStream(tokens);
this.parse(); // the entry point of our parser
}
}
Now inside our grammar, we're going to check through a couple of gated semantic predicates whether we're parsing declarations in the correct order. And after parsing a certain declaration or statement, we'll want to flip certain boolean flags to allow or disallow declarations from then on. The flipping of these boolean flags is done through each rule's @after { ... } section, which gets executed (not surprisingly) after the tokens from that parser rule are matched.
Your final grammar file now looks like this (including some System.out.println's for debugging purposes):
grammar REPL;
@parser::members {
boolean programDeclDone;
boolean usesDeclAllowed;
public REPLParser() {
super(null);
programDeclDone = false;
usesDeclAllowed = true;
}
public void process(String source) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(source);
REPLLexer lexer = new REPLLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
super.setTokenStream(tokens);
this.parse();
}
}
parse
: programDeclaration EOF
| {programDeclDone}? (usesDeclaration | statement) EOF
;
programDeclaration
@after{
programDeclDone = true;
}
: {!programDeclDone}? PROGRAM ID {System.out.println("\t\t\t program <- " + $ID.text);}
;
usesDeclaration
: {usesDeclAllowed}? USES idList {System.out.println("\t\t\t uses <- " + $idList.text);}
;
statement
@after{
usesDeclAllowed = false;
}
: left=ID '=' right=(INT | ID) {System.out.println("\t\t\t " + $left.text + " <- " + $right.text);}
;
idList
: ID (',' ID)*
;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.Scanner;
public class Main {
public static void main(String[] args) throws Exception {
Scanner keyboard = new Scanner(System.in);
REPLParser parser = new REPLParser();
while(true) {
System.out.print("\n> ");
String input = keyboard.nextLine();
if(input.equals("quit")) {
break;
}
parser.process(input);
}
System.out.println("\nBye!");
}
}
To run this test class, do the following:
# generate a lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool REPL.g
# compile all .java source files:
javac -cp antlr-3.2.jar *.java
# run the main class on Windows:
java -cp .;antlr-3.2.jar Main
# or on Linux/Mac:
java -cp .:antlr-3.2.jar Main
As you can see, you can only declare a program once:
> program A
program <- A
> program B
line 1:0 rule programDeclaration failed predicate: {!programDeclDone}?
uses cannot come after statements:
> program X
program <- X
> uses a,b,c
uses <- a,b,c
> a = 666
a <- 666
> uses d,e
line 1:0 rule usesDeclaration failed predicate: {usesDeclAllowed}?
and you must start with a program declaration:
> uses foo
line 1:0 rule parse failed predicate: {programDeclDone}?
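The ordering logic that the two flags and the predicates enforce can also be sketched outside of ANTLR. The class below is a hypothetical plain-Java model, not generated parser code: it classifies lines crudely by their first word instead of really parsing them, and it only reproduces the accept/reject decisions shown in the sessions above.

```java
final class ReplOrder {
    private boolean programDeclDone = false; // set once a program declaration is accepted
    private boolean usesDeclAllowed = true;  // cleared once the first statement is accepted

    // Returns null if the line is allowed in the current state, otherwise a short reason.
    String accept(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith("program ")) {
            if (programDeclDone) {
                return "program already declared";
            }
            programDeclDone = true;
            return null;
        }
        if (!programDeclDone) {
            return "program declaration expected first";
        }
        if (trimmed.startsWith("uses ")) {
            return usesDeclAllowed ? null : "uses not allowed after a statement";
        }
        // Anything else is treated as a statement in this sketch.
        usesDeclAllowed = false;
        return null;
    }

    public static void main(String[] args) {
        ReplOrder repl = new ReplOrder();
        String[] lines = { "uses foo", "program X", "uses a,b,c", "a = 666", "uses d,e" };
        for (String line : lines) {
            String problem = repl.accept(line);
            System.out.println(line + " -> " + (problem == null ? "ok" : problem));
        }
    }
}
```

The point of keeping the flags on a long-lived object is the same as in the grammar version: each line is parsed on its own, and only these two booleans carry state from one line to the next.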
Here's an example of how to parse input from System.in without first manually parsing it one line at a time and without making major compromises in the grammar. I'm using ANTLR 3.4. ANTLR 4 may have addressed this problem already. I'm still using ANTLR 3, though, and maybe someone else with this problem still is too.
Before getting into the solution, here are the hurdles I ran into that keep this seemingly trivial problem from being easy to solve:
The built-in ANTLR classes that derive from CharStream consume the entire stream of data up-front. Obviously an interactive mode (or any other indeterminate-length stream source) can't provide all the data.
The built-in BufferedTokenStream and derived class(es) will not end on a skipped or off-channel token. In an interactive setting, this means that the current statement can't end (and therefore can't execute) until the first token of the next statement or EOF has been consumed when using one of these classes.
The end of the statement itself may be indeterminate until the next statement begins.
Consider a simple example:
statement: 'verb' 'noun' ('and' 'noun')*
;
WS: //etc...
Interactively parsing a single statement (and only a single statement) isn't possible. Either the next statement has to be started (that is, hitting "verb" in the input), or the grammar has to be modified to mark the end of the statement, e.g. with a ';'.
I haven't found a way to manage a multi-channel lexer with my solution. It doesn't hurt me since I can replace my $channel = HIDDEN with skip(), but it's still a limitation worth mentioning.
A grammar may need a new rule to simplify interactive parsing.
For example, my grammar's normal entry point is this rule:
script
: statement* EOF -> ^(STMTS statement*)
;
My interactive session can't start at the script rule because it won't end until EOF. But it can't start at statement either because STMTS might be used by my tree parser.
So I introduced the following rule specifically for an interactive session:
interactive
: statement -> ^(STMTS statement)
;
In my case, there are no "first line" rules, so I can't say how easy or hard it would be to do something similar for them. It may be a matter of making a rule like so and execute it at the beginning of the interactive session:
interactive_start
: first_line
;
The code behind a grammar (e.g., code that tracks symbols) may have been written under the assumption that the lifespan of the input and the lifespan of the parser object would effectively be the same. For my solution, that assumption doesn't hold. The parser gets replaced after each statement, so the new parser must be able to pick up the symbol tracking (or whatever) where the last one left off. This is a typical separation-of-concerns problem so I don't think there's much else to say about it.
The first problem mentioned, the limitations of the built-in CharStream classes, was my only major hang-up. ANTLRStringStream has all the workings that I need, so I derived my own CharStream class off of it. The base class's data member is assumed to have all the past characters read, so I needed to override all the methods that access it. Then I changed the direct read to a call to (new method) dataAt to manage reading from the stream. That's basically all there is to this. Please note that the code here may have unnoticed problems and does no real error handling.
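import java.io.IOException;
import java.io.InputStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
public class MyInputStream extends ANTLRStringStream {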
private InputStream in;
public MyInputStream(InputStream in) {
super(new char[0], 0);
this.in = in;
}
@Override
// copied almost verbatim from ANTLRStringStream
public void consume() {
if (p < n) {
charPositionInLine++;
if (dataAt(p) == '\n') {
line++;
charPositionInLine = 0;
}
p++;
}
}
@Override
// copied almost verbatim from ANTLRStringStream
public int LA(int i) {
if (i == 0) {
return 0; // undefined
}
if (i < 0) {
i++; // e.g., translate LA(-1) to use offset i=0; then data[p+0-1]
if ((p + i - 1) < 0) {
return CharStream.EOF; // invalid; no char before first char
}
}
// Read ahead
return dataAt(p + i - 1);
}
@Override
public String substring(int start, int stop) {
if (stop >= n) {
//Read ahead.
dataAt(stop);
}
return new String(data, start, stop - start + 1);
}
private int dataAt(int i) {
ensureRead(i);
if (i < n) {
return data[i];
} else {
// Nothing to read at that point.
return CharStream.EOF;
}
}
private void ensureRead(int i) {
if (i < n) {
// The data has been read.
return;
}
int distance = i - n + 1;
ensureCapacity(n + distance);
// Crude way to copy from the byte stream into the char array.
for (int pos = 0; pos < distance; ++pos) {
int read;
try {
read = in.read();
} catch (IOException e) {
// TODO handle this better.
throw new RuntimeException(e);
}
if (read < 0) {
break;
} else {
data[n++] = (char) read;
}
}
}
private void ensureCapacity(int capacity) {
if (capacity > n) {
char[] newData = new char[capacity];
System.arraycopy(data, 0, newData, 0, n);
data = newData;
}
}
}
Launching an interactive session is similar to the boilerplate parsing code, except that UnbufferedTokenStream is used and the parsing takes place in a loop:
MyLexer lex = new MyLexer(new MyInputStream(System.in));
TokenStream tokens = new UnbufferedTokenStream(lex);
//Handle "first line" parser rule(s) here.
while (true) {
MyParser parser = new MyParser(tokens);
//Set up the parser here.
MyParser.interactive_return r = parser.interactive();
//Do something with the return value.
//Break on some meaningful condition.
}
Still with me? Okay, well that's it. :)
If you are using System.in as source, which is an input stream, why not just have ANTLR tokenize the input stream as it is read and then parse the tokens?
You have to put it in doStuff.
For instance, if you're declaring a function, the parse would return a function without a body, right? That's fine, because the body will come later. You'd do what most REPLs do.