I am trying to parse Java class files using the Java.g4 grammar and ANTLR 4.
There is a particular parser rule as follows:
classOrInterfaceType
: Identifier typeArguments? ('.' Identifier typeArguments? )*
;
I am parsing it in my visitor class in this way:
public String visitClassOrInterfaceType(JavaParser.ClassOrInterfaceTypeContext ctx) {
StringBuilder clsIntr = new StringBuilder("");
int n = ctx.getChildCount();
for(int i = 0; i < n; i++){
TerminalNode id = ctx.Identifier(i);
if(id!=null){
clsIntr.append(id.getText()).append(" ");
}
TypeArgumentsContext typArgCtx =ctx.typeArguments(i);
if(typArgCtx!=null){
String val = this.visitTypeArguments(typArgCtx);
clsIntr.append(val);
}
}
return clsIntr.toString();
}
Is this correct, or is there some other way to do this?
Your approach looks OK, though that ultimately depends on what you are actually trying to do. My crystal ball tells me you are trying to reconstruct the original source text by walking the parse tree. However, you can get that much more simply. Each parse context has start and stop members that hold the first and last tokens of the text range this context stands for. You can use those to get the original text exactly as it was entered (via the token stream and the tokens' positions).
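As a rough illustration of the start/stop idea, here is a self-contained sketch that does not use the ANTLR runtime. Tok and Ctx are hypothetical stand-ins for a token and a parse context. In the ANTLR 4 Java runtime the corresponding values would come from ctx.getStart().getStartIndex() and ctx.getStop().getStopIndex(), both of which are inclusive character offsets into the input.

```java
// Stand-in model only: Tok and Ctx are hypothetical, not ANTLR classes.
class OriginalText {
    // A token remembers where its text begins and ends in the input (inclusive offsets).
    static final class Tok {
        final int startIndex;
        final int stopIndex;
        Tok(int startIndex, int stopIndex) {
            this.startIndex = startIndex;
            this.stopIndex = stopIndex;
        }
    }

    // A parse context remembers its first and last token.
    static final class Ctx {
        final Tok start;
        final Tok stop;
        Ctx(Tok start, Tok stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    // Slice the raw input from the first character of the start token
    // through the last character of the stop token.
    static String textOf(String input, Ctx ctx) {
        return input.substring(ctx.start.startIndex, ctx.stop.stopIndex + 1);
    }

    public static void main(String[] args) {
        String input = "List<String> names;";
        // "List" spans offsets 0..3 and the closing ">" sits at offset 11.
        Ctx type = new Ctx(new Tok(0, 3), new Tok(11, 11));
        System.out.println(textOf(input, type)); // prints List<String>
    }
}
```

Because the slice is taken from the raw input, whitespace and anything else between the two tokens comes back exactly as typed, which a child-by-child walk would have to rebuild by hand.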
I'm writing a pretty-printer for legacy code in an older language. The plan is for me to learn parsing and unparsing before I write a translator to output C++. I kind of got thrown into the deep end with Java and ANTLR back in June, so I definitely have some knowledge gaps.
I've gotten to the point where I'm comfortable writing methods for my custom listener, and I want to be able to pretty-print the comments as well. My comments are on a separate hidden channel. Here are the grammar rules for the hidden tokens:
/* Comments and whitespace -- Nested comments are allowed, each is redirected to a specific channel */
COMMENT_1 : '(*' (COMMENT_1|COMMENT_2|.)*? '*)' -> channel(1) ;
COMMENT_2 : '{' (COMMENT_1|COMMENT_2|.)*? '}' -> channel(1) ;
NEWLINES : [\r\n]+ -> channel(2) ;
WHITESPACE : [ \t]+ -> skip ;
I've been playing with the Cymbol CommentShifter example on p. 207 of The Definitive ANTLR 4 Reference and I'm trying to figure out how to adapt it to my listener methods.
public void exitVarDecl(ParserRuleContext ctx) {
Token semi = ctx.getStop();
int i = semi.getTokenIndex();
List<Token> cmtChannel = tokens.getHiddenTokensToRight(i, CymbolLexer.COMMENTS);
if (cmtChannel != null) {
Token cmt = cmtChannel.get(0);
if (cmt != null) {
String txt = cmt.getText().substring(2);
String newCmt = "// " + txt.trim(); // printing comments in original format
rewriter.insertAfter(ctx.stop, newCmt); // at end of line
rewriter.replace(cmt, "\n");
}
}
}
I adapted this example by using exitEveryRule rather than exitVarDecl, and it worked for the Cymbol example, but when I adapt it to my own listener I get a null pointer exception whether I use exitEveryRule or exitSpecificThing.
I'm looking at this answer and it seems promising but I think what I really need is an explanation of how the hidden channel data is stored and how to access it. It took me months to really get listener methods and context in the parse tree.
It seems like CommonTokenStream.LT(), CommonTokenStream.LA(), and consume() are what I want to be using, but why is the example in that SO answer using completely different methods from the ANTLR book example? What should I know about the token index or token types?
I'd like to better understand the logic behind this.
Okay, so I can't answer how ANTLR stores its data internally, but I can tell you how to access your hidden tokens. I have tested this on my computer using ANTLR v4.1 for C# .NET v4.5.2.
I have a rule that looks like this:
LineComment
: '//' ~[\r\n]*
-> channel(1)
;
In my code, I am getting the entire raw token stream like this:
IList<IToken> lTokenList = cmnTokenStream.Get( 0, cmnTokenStream.Size );
To test, I printed the token list using the following loop:
foreach ( IToken iToken in lTokenList )
{
Console.WriteLine( "{0}[{1}] : {2}",
iToken.Channel,
iToken.TokenIndex,
iToken.Text );
}
Running on this code:
void Foo()
{
// comment
i = 5;
}
Yields the following output (for the sake of brevity, please assume I have a complete grammar that is also ignoring whitespace):
0[0] : void
0[1] : Foo
0[2] : (
0[3] : )
0[4] : {
1[5] : // comment
0[6] : i
0[7] : =
0[8] : 5
0[9] : ;
0[10] : }
You can see the channel index is 1 only for the single comment token. So you can use this loop to access only the comment tokens:
int lCommentCount = 0;
foreach ( IToken iToken in lTokenList )
{
if ( iToken.Channel == 1 )
{
Console.WriteLine( "{0} : {1}",
lCommentCount++,
iToken.Text );
}
}
Then you can do whatever you need with those tokens. This also works if you have multiple channels, though I would caution against using more than 65,536 of them. ANTLR gave the following error when I tried to compile a grammar with a token rule redirected to channel index 65536:
Serialized ATN data element out of range.
So I guess they're only using a 16-bit unsigned integer to index the channels. Weird.
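To make the channel-and-index picture concrete, here is a small self-contained Java sketch that models the token list as plain objects. Tok is a hypothetical stand-in, not an ANTLR class, and hiddenToRight is my own approximation of what a "hidden tokens to the right" lookup does: start one past a given token index, collect tokens on the requested channel, and stop at the next default-channel token. It returns null when nothing is found, which is an assumption modelled on the null check in the book example.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model only: Tok is hypothetical and merely carries the three
// values printed in the dump (channel, token index, text).
class HiddenTokens {
    static final int DEFAULT_CHANNEL = 0;

    static final class Tok {
        final int channel;
        final int index;
        final String text;
        Tok(int channel, int index, String text) {
            this.channel = channel;
            this.index = index;
            this.text = text;
        }
    }

    // Collect tokens on the given channel that sit directly to the right of
    // position i, stopping at the next default-channel token. Returns null
    // when nothing is found, so callers need a null check.
    static List<Tok> hiddenToRight(List<Tok> all, int i, int channel) {
        List<Tok> found = new ArrayList<>();
        for (int k = i + 1; k < all.size(); k++) {
            Tok t = all.get(k);
            if (t.channel == DEFAULT_CHANNEL) {
                break;
            }
            if (t.channel == channel) {
                found.add(t);
            }
        }
        return found.isEmpty() ? null : found;
    }

    // Tokens for the sample function Foo(): only the comment is on channel 1.
    static List<Tok> sample() {
        String[] texts = {"void", "Foo", "(", ")", "{", "// comment", "i", "=", "5", ";", "}"};
        List<Tok> all = new ArrayList<>();
        for (int k = 0; k < texts.length; k++) {
            int channel = texts[k].startsWith("//") ? 1 : DEFAULT_CHANNEL;
            all.add(new Tok(channel, k, texts[k]));
        }
        return all;
    }

    public static void main(String[] args) {
        List<Tok> all = sample();
        // Index 4 is "{"; the comment follows it directly.
        System.out.println(hiddenToRight(all, 4, 1).get(0).text); // prints // comment
        // Index 9 is ";"; the next token is "}" on the default channel.
        System.out.println(hiddenToRight(all, 9, 1));             // prints null
    }
}
```

The point of the model is that hidden tokens are not stored separately: they live in the same indexed list as everything else, and the parser simply skips over them. A lookup keyed on a rule's stop token therefore only finds comments that immediately follow that token.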
JSpeech Grammar Format allows user to specify tags for separate strings in curly brackets as follows:
<jump> = jump { primitive jump } [up] |
jump [to the] (left { primitive jump_left } |right { primitive jump_right } );
or
<effects> = nothing happens { NOTHING_HAPPENS } | ( [will] die | dies ) { OBJECT_DESTRUCTION } | (get|gets) new (coin|coins) { COIN_INCREASE };
Using tags is described more thoroughly in section 4.6.1 of the referenced specification.
In Sphinx4 you can catch these tags using the getTags() method in RuleParse. So if the user says "jump to the left", the following tag will be returned: "primitive jump_left".
Now, I would like to do exactly the opposite: given the tag, I would like to match it to the string. So for "NOTHING_HAPPENS" I would like to get "nothing happens", or for "OBJECT_DESTRUCTION" an array with all possible options: "will die", "die", "dies".
Is there any such method that can parse grammar files in such way or do I have to hardcode it?
My solution to this is to generate all possible sentences defined by the JSGF file. This can be done easily with the dumpRandomSentences or getRandomSentence methods provided by the Grammar class in Sphinx. The sentences are then given back to the Recognizer, which will print out the tags.
Sample code from my project:
for (int i = 0; i < 20000; i++) {
String utterance = grammar.getRandomSentence();
String tags;
try {
tags = parser.getTagString(utterance);
System.out.println(tags+" ==> "+utterance);
} catch (GrammarException e) {
error(e.toString());
}
}
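The loop above only prints each pair. To actually look up strings by tag, the pairs can be collected into a map keyed on the tag string. The following is a minimal sketch in plain Java with no Sphinx dependency; TagIndex and its method names are hypothetical, and the pairs are added by hand here where the real code would add the result of each getTagString call. Note that random sampling is not guaranteed to visit every sentence of the grammar, so the index is only as complete as the sampling.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: inverts (tags, utterance) pairs into tags -> utterances.
class TagIndex {
    private final Map<String, Set<String>> byTag = new LinkedHashMap<>();

    // Record one (tags, utterance) pair, e.g. one iteration of the sampling loop.
    void add(String tags, String utterance) {
        byTag.computeIfAbsent(tags, k -> new TreeSet<>()).add(utterance);
    }

    // All utterances seen so far for a tag string; empty if the tag never came up.
    Set<String> utterancesFor(String tags) {
        return byTag.getOrDefault(tags, new TreeSet<>());
    }

    public static void main(String[] args) {
        TagIndex index = new TagIndex();
        index.add("OBJECT_DESTRUCTION", "dies");
        index.add("OBJECT_DESTRUCTION", "will die");
        index.add("OBJECT_DESTRUCTION", "die");
        index.add("OBJECT_DESTRUCTION", "dies"); // repeats from random sampling collapse
        index.add("NOTHING_HAPPENS", "nothing happens");
        System.out.println(index.utterancesFor("OBJECT_DESTRUCTION")); // prints [die, dies, will die]
    }
}
```

Using a set per tag means the 20000 random draws can repeat freely without inflating the result.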
Hopefully there are a few experts in the EpochX framework around here...I'm not sure that the user group is still active.
I am attempting to implement simple recursion within their representation of a BNF grammar and have run into the following issue:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -9
at java.lang.String.substring(String.java:1911)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:235)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:254)
at org.epochx.tools.eval.EpoxInterpreter.eval(EpoxInterpreter.java:89)
at org.epochx.ge.model.epox.SAGE.getFitness(SAGE.java:266)
at org.epochx.ge.representation.GECandidateProgram.getFitness(GECandidateProgram.java:304)
at org.epochx.stats.StatField$7.getStatValue(StatField.java:97)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.StatField$8.getStatValue(StatField.java:117)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.Stats.getStats(Stats.java:162)
at org.epochx.stats.Stats.print(Stats.java:194)
at org.epochx.stats.Stats.print(Stats.java:178)
at org.epochx.ge.model.epox.Tester$1.onGenerationEnd(Tester.java:41)
at org.epochx.life.Life.fireGenerationEndEvent(Life.java:634)
at org.epochx.core.InitialisationManager.initialise(InitialisationManager.java:207)
at org.epochx.core.RunManager.run(RunManager.java:166)
at org.epochx.core.Model.run(Model.java:147)
at org.epochx.ge.model.GEModel.run(GEModel.java:82)
at org.epochx.ge.model.epox.Tester.main(Tester.java:55)
Java Result: 1
My simple grammar is structured as follows, where terminals are passed in separately to the evaluation function:
public static final String GRAMMAR_FRAGMENT = "<program> ::= <node>\n"
+ "<node> ::= <s_list>\n"
+ "<s_list> ::= <s> | <s> <s_list>\n"
+ "<s> ::= FUNCTION( <terminal> )\n"
+ "<terminal> ::= ";
Edit: Terminal creation -
// Generate the input sequences.
inputValues = BoolUtils.generateBoolSequences(4);
argNames = new String[4];
argNames[0] = "void";
argNames[1] = "bubbleSort";
argNames[2] = "int*";
argNames[3] = "numbers";
...
// Evaluate all possible inputValues.
for (final boolean[] vars: inputValues) {
// Convert to object array.
final Boolean[] objVars = ArrayUtils.toObject(vars);
Boolean result = null;
try {
interpreter.eval(program.getSourceCode(),
argNames, objVars);
score = (double)program.getParseTreeDepth();
} catch (final MalformedProgramException e) {
// Assign worst possible fitness and stop evaluating.
score = 0;
break;
}
}
The stack trace shows that the problem is actually in the EpoxParser. This means that it's not so much the grammar that is ill-formed, but rather that the programs that get generated cannot be parsed.
Because you're using the EpoxInterpreter, the programs that get generated get parsed as Epox programs. Epox is the name used to refer to the language that the tree representation of EpochX uses (a sort of corrupted form of Lisp which you can add your own literals/functions to). The parsing expects the S-Expression format, and tries to identify each function and terminal and it builds a tree made up of equivalent Node objects (see the org.epochx.epox.* packages). Then the tree can be evaluated to run the program.
But in Epox there's no built-in function called FUNCTION, nor any known literals 'void', 'bubbleSort', 'int*' or 'numbers'. So the parsing fails. So you need to add these constructs to the EpoxParser, so it knows how to parse them into nodes. You can do this with the declareFunction, declareLiteral and declareVariable methods (see the JavaDoc for the EpoxParser http://www.epochx.org/javadoc/1.4/).
I'm trying to write a simple interactive (using System.in as source) language using antlr, and I have a few problems with it. The examples I've found on the web are all using a per line cycle, e.g.:
while(readline)
result = parse(line)
doStuff(result)
But what if I'm writing something like pascal/smtp/etc., with a "the first line must look like X" requirement? I know it can be checked in doStuff, but I think logically it is part of the syntax.
Or what if a command is split into multiple lines? I can try
while(readline)
lines.add(line)
try
result = parse(lines)
lines = []
doStuff(result)
catch
nop
But with this I'm also hiding real errors.
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?
Dutow wrote:
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?
Yes, ANTLR can do this. Perhaps not out of the box, but with a bit of custom code, it sure is possible. You also don't need to re-parse the entire token stream for it.
Let's say you want to parse a very simple language line by line, where each line is either a program declaration, a uses declaration, or a statement.
It should always start with a program declaration, followed by zero or more uses declarations followed by zero or more statements. uses declarations cannot come after statements and there can't be more than one program declaration.
For simplicity, a statement is just a simple assignment: a = 4 or b = a.
An ANTLR grammar for such a language could look like this:
grammar REPL;
parse
: programDeclaration EOF
| usesDeclaration EOF
| statement EOF
;
programDeclaration
: PROGRAM ID
;
usesDeclaration
: USES idList
;
statement
: ID '=' (INT | ID)
;
idList
: ID (',' ID)*
;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
But we'll need to add a couple of checks, of course. Also, by default, a parser takes a token stream in its constructor, but since we're planning to trickle tokens into the parser line by line, we'll need to create a new constructor in our parser. You can add custom members to your lexer or parser classes by putting them in a @parser::members { ... } or @lexer::members { ... } section respectively. We'll also add a couple of boolean flags to keep track of whether the program declaration has already happened and whether uses declarations are allowed. Finally, we'll add a process(String source) method which, for each new line, creates a lexer that gets fed to the parser.
All of that would look like:
@parser::members {
boolean programDeclDone;
boolean usesDeclAllowed;
public REPLParser() {
super(null);
programDeclDone = false;
usesDeclAllowed = true;
}
public void process(String source) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(source);
REPLLexer lexer = new REPLLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
super.setTokenStream(tokens);
this.parse(); // the entry point of our parser
}
}
Now, inside our grammar, we're going to check through a couple of gated semantic predicates whether we're parsing declarations in the correct order. And after parsing a certain declaration or statement, we'll want to flip certain boolean flags to allow or disallow declarations from then on. The flipping of these boolean flags is done through each rule's @after { ... } section, which gets executed (not surprisingly) after the tokens from that parser rule are matched.
Your final grammar file now looks like this (including some System.out.println's for debugging purposes):
grammar REPL;
@parser::members {
boolean programDeclDone;
boolean usesDeclAllowed;
public REPLParser() {
super(null);
programDeclDone = false;
usesDeclAllowed = true;
}
public void process(String source) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(source);
REPLLexer lexer = new REPLLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
super.setTokenStream(tokens);
this.parse();
}
}
parse
: programDeclaration EOF
| {programDeclDone}? (usesDeclaration | statement) EOF
;
programDeclaration
@after{
programDeclDone = true;
}
: {!programDeclDone}? PROGRAM ID {System.out.println("\t\t\t program <- " + $ID.text);}
;
usesDeclaration
: {usesDeclAllowed}? USES idList {System.out.println("\t\t\t uses <- " + $idList.text);}
;
statement
@after{
usesDeclAllowed = false;
}
: left=ID '=' right=(INT | ID) {System.out.println("\t\t\t " + $left.text + " <- " + $right.text);}
;
idList
: ID (',' ID)*
;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.Scanner;
public class Main {
public static void main(String[] args) throws Exception {
Scanner keyboard = new Scanner(System.in);
REPLParser parser = new REPLParser();
while(true) {
System.out.print("\n> ");
String input = keyboard.nextLine();
if(input.equals("quit")) {
break;
}
parser.process(input);
}
System.out.println("\nBye!");
}
}
To run this test class, do the following:
# generate a lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool REPL.g
# compile all .java source files:
javac -cp antlr-3.2.jar *.java
# run the main class on Windows:
java -cp .;antlr-3.2.jar Main
# or on Linux/Mac:
java -cp .:antlr-3.2.jar Main
As you can see, you can only declare a program once:
> program A
program <- A
> program B
line 1:0 rule programDeclaration failed predicate: {!programDeclDone}?
uses cannot come after statements:
> program X
program <- X
> uses a,b,c
uses <- a,b,c
> a = 666
a <- 666
> uses d,e
line 1:0 rule usesDeclaration failed predicate: {usesDeclAllowed}?
and you must start with a program declaration:
> uses foo
line 1:0 rule parse failed predicate: {programDeclDone}?
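The ordering rules that the predicates enforce can also be read as a tiny state machine. The following plain-Java sketch models just the two flags, independent of ANTLR; ReplState, Kind and accept are hypothetical names introduced for illustration, and the sketch assumes a rejected line leaves the flags untouched.

```java
// Plain-Java model of the two flags; not generated by ANTLR.
class ReplState {
    enum Kind { PROGRAM, USES, STATEMENT }

    private boolean programDeclDone = false;
    private boolean usesDeclAllowed = true;

    // Returns true and updates the flags if this kind of line is allowed now;
    // returns false and leaves the flags untouched otherwise.
    boolean accept(Kind kind) {
        switch (kind) {
            case PROGRAM:
                if (programDeclDone) {
                    return false; // only one program declaration
                }
                programDeclDone = true;
                return true;
            case USES:
                // needs a program declaration first, and no statement so far
                return programDeclDone && usesDeclAllowed;
            case STATEMENT:
                if (!programDeclDone) {
                    return false;
                }
                usesDeclAllowed = false; // uses may not follow a statement
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        ReplState s = new ReplState();
        System.out.println(s.accept(Kind.USES));      // prints false
        System.out.println(s.accept(Kind.PROGRAM));   // prints true
        System.out.println(s.accept(Kind.PROGRAM));   // prints false
        System.out.println(s.accept(Kind.USES));      // prints true
        System.out.println(s.accept(Kind.STATEMENT)); // prints true
        System.out.println(s.accept(Kind.USES));      // prints false
    }
}
```

Keeping this state in the parser object, as the grammar does, is what lets each line be lexed and parsed on its own without re-reading earlier lines.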
Here's an example of how to parse input from System.in without first manually parsing it one line at a time and without making major compromises in the grammar. I'm using ANTLR 3.4. ANTLR 4 may have addressed this problem already. I'm still using ANTLR 3, though, and maybe someone else with this problem still is too.
Before getting into the solution, here are the hurdles I ran into that keep this seemingly trivial problem from being easy to solve:
The built-in ANTLR classes that derive from CharStream consume the entire stream of data up-front. Obviously an interactive mode (or any other indeterminate-length stream source) can't provide all the data.
The built-in BufferedTokenStream and derived class(es) will not end on a skipped or off-channel token. In an interactive setting, this means that the current statement can't end (and therefore can't execute) until the first token of the next statement or EOF has been consumed when using one of these classes.
The end of the statement itself may be indeterminate until the next statement begins.
Consider a simple example:
statement: 'verb' 'noun' ('and' 'noun')*
;
WS: //etc...
Interactively parsing a single statement (and only a single statement) isn't possible. Either the next statement has to be started (that is, hitting "verb" in the input), or the grammar has to be modified to mark the end of the statement, e.g. with a ';'.
I haven't found a way to manage a multi-channel lexer with my solution. It doesn't hurt me since I can replace my $channel = HIDDEN with skip(), but it's still a limitation worth mentioning.
A grammar may need a new rule to simplify interactive parsing.
For example, my grammar's normal entry point is this rule:
script
: statement* EOF -> ^(STMTS statement*)
;
My interactive session can't start at the script rule because it won't end until EOF. But it can't start at statement either because STMTS might be used by my tree parser.
So I introduced the following rule specifically for an interactive session:
interactive
: statement -> ^(STMTS statement)
;
In my case, there are no "first line" rules, so I can't say how easy or hard it would be to do something similar for them. It may be a matter of making a rule like so and execute it at the beginning of the interactive session:
interactive_start
: first_line
;
The code behind a grammar (e.g., code that tracks symbols) may have been written under the assumption that the lifespan of the input and the lifespan of the parser object would effectively be the same. For my solution, that assumption doesn't hold. The parser gets replaced after each statement, so the new parser must be able to pick up the symbol tracking (or whatever) where the last one left off. This is a typical separation-of-concerns problem so I don't think there's much else to say about it.
The first problem mentioned, the limitations of the built-in CharStream classes, was my only major hang-up. ANTLRStringStream has all the workings that I need, so I derived my own CharStream class off of it. The base class's data member is assumed to have all the past characters read, so I needed to override all the methods that access it. Then I changed the direct read to a call to (new method) dataAt to manage reading from the stream. That's basically all there is to this. Please note that the code here may have unnoticed problems and does no real error handling.
public class MyInputStream extends ANTLRStringStream {
private InputStream in;
public MyInputStream(InputStream in) {
super(new char[0], 0);
this.in = in;
}
@Override
// copied almost verbatim from ANTLRStringStream
public void consume() {
if (p < n) {
charPositionInLine++;
if (dataAt(p) == '\n') {
line++;
charPositionInLine = 0;
}
p++;
}
}
@Override
// copied almost verbatim from ANTLRStringStream
public int LA(int i) {
if (i == 0) {
return 0; // undefined
}
if (i < 0) {
i++; // e.g., translate LA(-1) to use offset i=0; then data[p+0-1]
if ((p + i - 1) < 0) {
return CharStream.EOF; // invalid; no char before first char
}
}
// Read ahead
return dataAt(p + i - 1);
}
@Override
public String substring(int start, int stop) {
if (stop >= n) {
//Read ahead.
dataAt(stop);
}
return new String(data, start, stop - start + 1);
}
private int dataAt(int i) {
ensureRead(i);
if (i < n) {
return data[i];
} else {
// Nothing to read at that point.
return CharStream.EOF;
}
}
private void ensureRead(int i) {
if (i < n) {
// The data has been read.
return;
}
int distance = i - n + 1;
ensureCapacity(n + distance);
// Crude way to copy from the byte stream into the char array.
for (int pos = 0; pos < distance; ++pos) {
int read;
try {
read = in.read();
} catch (IOException e) {
// TODO handle this better.
throw new RuntimeException(e);
}
if (read < 0) {
break;
} else {
data[n++] = (char) read;
}
}
}
private void ensureCapacity(int capacity) {
if (capacity > n) {
char[] newData = new char[capacity];
System.arraycopy(data, 0, newData, 0, n);
data = newData;
}
}
}
Launching an interactive session is similar to the boilerplate parsing code, except that UnbufferedTokenStream is used and the parsing takes place in a loop:
MyLexer lex = new MyLexer(new MyInputStream(System.in));
TokenStream tokens = new UnbufferedTokenStream(lex);
//Handle "first line" parser rule(s) here.
while (true) {
MyParser parser = new MyParser(tokens);
//Set up the parser here.
MyParser.interactive_return r = parser.interactive();
//Do something with the return value.
//Break on some meaningful condition.
}
Still with me? Okay, well that's it. :)
If you are using System.in as source, which is an input stream, why not just have ANTLR tokenize the input stream as it is read and then parse the tokens?
You have to put it in doStuff....
For instance, if you're declaring a function, the parse would return a function without a body, right? That's fine, because the body will come later. You'd do what most REPLs do.