ANTLR Decision can match input using multiple alternatives

ANTLR Decision can match input using multiple alternatives - java

I have this simple grammer:
expr: factor;
factor: atom (('*' ^ | '/'^) atom)*;
atom: INT
| ':' expr;
INT: ('0'..'9')+
when I run it it says :
Decision can match input such as '*' using multiple alternatives 1,2
Decision can match input such as '/' using multiple alternatives 1,2
I can't spot the ambiguity. How are the red arrows pointing ?
Any help would be appreciated.

Let's say you want to parse the input:
:3*4*:5*6
The parser generated by your grammar could match this input into the following parse trees:
and:
(I omitted the colons to keep the trees more clear)
Note that what you see is just a warning. By specifically instructing ANTLR that (('*' | '/') atom)* needs to be matched greedily, like this:
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
the parser "knows" which alternative to take, and no warning is emitted.
EDIT
I tested the grammar with ANTLR 3.3 as follows:
grammar T;
options {
output=AST;
}
parse
: expr EOF!
;
expr
: factor
;
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
atom
: INT
| ':'^ expr
;
INT : ('0'..'9')+;
And then from the command line:
java -cp antlr-3.3.jar org.antlr.Tool T.g
which does not produce any warning (or error).

Related

antlr grammar for triple quoted string

I am trying to update an ANTLR grammar that follows the following spec
https://github.com/facebook/graphql/pull/327/files
In logical terms its defined as
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Not the above is logical - not ANTLR syntax)
I tried the follow in ANTRL 4 but it wont recognize more than 1 character inside a triple quoted string
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""' but as soon as I add more characters it fails
eg: '"""abc"""' wont parse and the IntelliJ plugin for ANTLR says
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '\"""' escaping?

Some of your parer rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];

Antlr Eclipse IDE White Space not being skipped

I apologize in advance if this question has already been asked, can't seem to find it.
I'm just beginning with Antlr, using the antlr4IDE for Eclipse to create a parser for a small subset of Java. For some reason, unless I explicitly state the presence of a white space in my regex, the parser will throw an error.
My grammar:
grammar Hello;
r :
(Statement ';')+
;
Statement:
DECL | INIT
;
DECL:
'int' ID
;
INIT:
DECL '=' NUMEXPR
;
NUMEXPR :
Number OP Number | Number
;
OP :
'+'
| '-'
| '/'
| '*'
;
WS :
[ \t\r\n\u000C]+ -> skip
;
Number:
[0-9]+
;
ID :
[a-zA-Z]+
;
When trying to parse
int hello = 76;
I receive the error:
Hello::r:1:0: mismatched input 'int' expecting Statement
Hello::r:1:10: token recognition error at: '='
However, when I manually add the token WS into the rules, I receive no error.
Any ideas where I'm going wrong? I'm new to Antlr, so I'm probably making a stupid mistake. Thanks in advance.
EDIT : Here is my parse tree and error log:
Error Log:

Change syntax like this.
grammar Hello;
r : (statement ';')+ ;
statement : decl | init ;
decl : 'int' ID ;
init : decl '=' numexpr ;
numexpr : Number op Number | Number ;
op : '+' | '-' | '/' | '*' ;
WS : [ \t\r\n\u000C]+ -> skip ;
Number : [0-9]+ ;
ID : [a-zA-Z]+ ;

After looking at the documentation on antlr4, it seems like you have to have a specification for all of the character combinations that you expect to see in your file, from start to finish - not just those that you want to handle.
In that regards, it's expected that you would have to explicitly state the whitespace, with something like:
WS : [ \t\r\n]+ -> skip;
That's why the skip command exists:
A 'skip' command tells the lexer to get another token and throw out the current text.
Though note that sometimes this can cause a little trouble such as in this post.

Modifying a plain ANLTR file in background

I would like to modify a grammar file by adding some Java code programatically in background. What I mean is that consider you have a println statement that you want to add it in a grammar before ANTLR works (i.e. creates lexer and parser files).
I have this trivial code: {System.out.println("print");}
Here is the simple grammar that I want to add the above snippet in the 'prog' rule after 'expr':
Before:
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
After:
grammar Expr;
prog: (expr {System.out.println("print");} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
Again note that I want to do this in runtime so that the grammar does not show any Java code (the 'before' snippet).
Is it possible to make this real before ANLTR generates lexer and parser files? Is there any way to visit (like AST visitor for ANTLR) a simple grammar?

ANTLR 4 generates a listener interface and base class (empty implementation) by default. If you also specify the -visitor flag when generating your parser, it will create a visitor interface and base class. Either of these features may be used to execute code using the parse tree rather than embedding actions directly in the grammar file.

If the code is always in the same place, simply insert a function call that acts as a hook to include real code afterwards.
This way you don't have to modify the source or generate the lexer/parser again.
If you want to insert code at predefined points (like enter rule/leave rule), go with Sam's solution to insert them into a listener. In either case it should not be necessary to modify the grammar file.
grammar Expr;
prog: (expr {Hooks.programHook();} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
In a java file of your choice (I'm no Java programmer, so the real syntax may be different):
public class Hooks
{
public static void programHook()
{
System.out.println("print");
}
}

Own DSL with XText. Problem with unlimited brackets ("(", ")")

I am developing my own DSL in XText.
I want do something like this:
1 AND (2 OR (3 OR 4))
Here my current .xtext file:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(greetings+=CONDITION_LEVEL)
;
terminal NUMBER :
('1'..'9') ('0'..'9')*
;
AND:
' AND '
;
OR:
' OR '
;
OPERATOR :
AND | OR
;
CONDITION_LEVEL:
('('* NUMBER (=>')')* OPERATOR)+ NUMBER ')'*
;
The problem I am having is that the dsl should have the possibility to make unlimited bracket, but show an error when the programmer don't closes all opened bracket.
example:
1 AND (2 OR (3 OR 4)
one bracket is missing --> should make error.
I don't know how I can realize this in XText. Can anybody help?
thx for helping.

Try this:
CONDITION_LEVEL
: ATOM ((AND | OR) ATOM)*
;
ATOM
: NUMBER
| '(' CONDITION_LEVEL ')'
;
Note that I have no experience with XText (so I did not test this), but this does work with ANTLR, on which XText is built (or perhaps it only uses ANTLR...).
Aslo, you probably don't want to surround your operator-tokens with spaces, but put them on a hidden-parser channel:
grammar org.xtext.example.mydsl.MyDsl hidden(SPACE)
...
terminal SPACE : (' '|'\t'|'\r'|'\n')+;
...
Otherwise source like this would fail:
1 AND(2 OR 3)
For details, see Hidden Terminal Symbols from the XText user guide.

You need to make your syntax recursive. The basic idea is that a CONDITION_LEVEL can be, for example, two CONDITION_LEVEL separated by an OPERATOR.
I don't know the specifics of the xtext syntax, but using a BCNF-like syntax you could have:
CONDITION_LEVEL:
NUMBER
'(' CONDITION_LEVEL OPERATOR CONDITION_LEVEL ')'

What is the equivalent for epsilon in ANTLR BNF grammar notation?

During taking advantage of ANTLR 3.3, I'm changing the current grammar to support inputs without parenthesis too. Here's the first version of my grammar :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOLS : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
prog : formula EOF ;
formula : NOT formula
| OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| SYMBOLS ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
Then I changed it this way to support the appropriate features :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOL : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
EM : '' ;
prog : formula EOF ;
formula : OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| ( NOT formula | SYMBOL )( AND formula | OR formula | IMPLIES formula | EM ) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
But I've been faced with following error :
error<100>: syntax error: invalid char literal: ''
error<100>: syntax error: invalid char literal: ''
Does anybody know that how can I overcome this error?

Your EM token:
EM : '' ;
is invalid: you can't match an empty string in lexer rules.
To match epsilon (nothing), you should do:
rule
: A
| B
| /* epsilon */
;
Of course, the comment /* epsilon */ can safely be removed.
Note that when you do it like that in your current grammar, ANTLR will complain that there can be rules matched using multiple alternatives. This is because your grammar is ambiguous.

I'm not an ANTLR expert, but you might try:
formula : term ((AND | OR | IMPLIES ) term )*;
term : OP formula CP | NOT term | SYMBOL ;
If you want traditional precedence of operators this won't do the trick, but that's another issue.
EDIT: OP raised the ante; he wants precedence too. I'll meet him halfway, since it wasn't part
of the orginal question. I've added precedence to the grammar that makes IMPLIES
the lower precedence than other operators, and leave it to OP to figure out how to do the rest.
formula: disjunction ( IMPLIES disjunction )* ;
disjunction: term (( AND | OR ) term )* ;
term: OP formula CP | NOT term | SYMBOL ;
OP additionally asked, "how to convert (!p or q ) into p -> q". I think he should
have asked this as a separate question. However, I'm already here.
What he needs to do is walk the tree, looking for the pattern he doesn't
like, and change the tree into one he does, and then prettyprint the answer.
It is possible to do all this with ANTLR, which is part of the reason
it is popular.
As a practical matter, procedurally walking the tree and checking the node
types, and splicing out old nodes and splicing in new is doable, but a royal PitA.
Especially if you want to do this for lots of transformations.
A more effective way to do this is to use a
program transformation system, which allows surface syntax patterns to be expressed for matching and replacement. Program transformation systems of course include parsing machinery and more powerful ones let you (and indeed insist) that you define
a grammar up front much as you for ANTLR.
Our DMS Software Reengineering Toolkit is such a program transformation tool, and with a suitably defined grammar for propositions,
the following DMS transformation rule would carry out OP's additional request:
domain proplogic; // tell DMS to use OP's definition of logic as a grammar
rule normalize_implies_from_or( p: term, q: term): formula -> formula
" NOT \p OR \q " -> " \p IMPLIES \q ";
The " ... " is "domain notation", e.g, surface syntax from the proplogic domain, the "\" are meta-escapes,
so "\p" and "\q" represent any arbitrary term from the proplogic grammar. Notice the rule has to reach "across" precedence levels when being applied, as "NOT \p OR \q" isn't a formula and "\p IMPLIES \q" is; DMS takes care of all this (the "formula -> formula" notation is how DMS knows what to do). This rule does a tree-to-tree rewrite. The resulting tree can be prettyprinted by DMS.
You can see a complete example of something very similar, e.g., a grammar for conventional algebra and rewrite rule to simplify algebraic equations.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.