How do I make a Simple ANTLR grammar extension?

I'm writing a framework that uses ANTLR to parse Java-style expressions. I had in mind to create a new type of free-form literal. The literal will look similar to a string, so I thought to extend the Java8 grammar I'm using with a new literal identical to StringLiteral, but bounded by '`' characters instead of '"'.
So I created:
ExternalLiteral
: '`' StringCharacters? '`'
;
in the Lexer and modified:
fragment
StringCharacter
: ~["`\\\r\n]
| EscapeSequence
;
and
fragment
EscapeSequence
: '\\' [btnfr"'`\\]
| OctalEscape
| UnicodeEscape // This is not in the spec but prevents having to preprocess the input
;
so '`' would be treated as a special character, identically to '"'. Then in the grammar I modified
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| ExternalLiteral
| NullLiteral
;
That seems like it should work to me, but when I try to parse any such expression, e.g. `0`, I get:
line 1:3 mismatched input '<EOF>' expecting {'boolean', 'byte', 'char', 'double', 'float', 'int', 'long', 'new', 'short', 'super', 'this', 'void', IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, ExternalLiteral, 'null', '(', '!', '~', '++', '--', '+', '-', Identifier, '#'}
line 1:3 missing {IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, ExternalLiteral, 'null'} at '<EOF>'
I've had fights with ANTLR before, and I don't know whether ANTLR or I am more of the problem. Does anyone with more experience see what I might have done wrong?

Related

Antlr - Parsing Multiline #define for C.g4

I am using Antlr4 to parse C code.
I want to parse multiline #defines along with the grammar provided in
C.g4.
But the grammar mentioned in the link above does not support preprocessor directives, so I have added the following new rules to support preprocessing.
Link to my previous question
Whitespace
: [ \t]+
-> channel(HIDDEN)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> channel(HIDDEN)
;
BlockComment
: '/*' .*? '*/'
;
LineComment
: '//' ~[\r\n]*
;
IncludeBlock
: '#' Whitespace? 'include' ~[\r\n]*
;
DefineStart
: '#' Whitespace? 'define'
;
DefineBlock
: DefineStart ~[\r\n]*
;
MultiDefine
: DefineStart MultiDefineBody
;
MultiDefineBody
: [\\] [\r\n]+ MultiDefineBody
| ~[\r\n]
;
preprocessorDeclaration
: includeDeclaration
| defineDeclaration
;
includeDeclaration
: IncludeBlock
;
defineDeclaration
: DefineBlock | MultiDefine
;
comment
: BlockComment
| LineComment
;
declaration
: declarationSpecifiers initDeclaratorList ';'
| declarationSpecifiers ';'
| staticAssertDeclaration
| preprocessorDeclaration
| comment
;
It works only for single-line preprocessor directives, and only if the MultiDefine rule is removed.
For multiline #defines it does not work.
Any help will be appreciated.
By multiline #define I mean:
#define MACRO(num, str) {\
printf("%d", num);\
printf(" is");\
printf(" %s number", str);\
printf("\n");\
}
Basically I need to find a grammar that can parse the above block

I'm shamelessly copying part of my answer from here:
This is because of how ANTLR's lexer chooses between rules. It takes the
rule that matches the longest piece of input; if several rules match the
same amount of input, the one specified first (in the source code) wins
and the other ones are not tried.
In your case the input sequence DefineStart \\\r\n (where DefineStart stands for an input sequence corresponding to the respective rule) will be matched by DefineBlock because the \\ is being consumed by the ~[\r\n]* construct.
You now have two possibilities: either you tweak your current set of rules to circumvent this problem, or (my suggestion) you simply use one rule for matching a define statement (single-line and multiline).
Such a merged rule could look like this:
DefineBlock:
DefineStart ( ~[\\\r\n] | '\\' '\r'? '\n' | '\\' . )*
;
Note that this code is untested, but it should read like this: match DefineStart and afterwards an arbitrarily long character sequence in which each step is either a character other than \, \r or \n, an escaped newline, or a backslash followed by an arbitrary character.
This should allow for the desired newline escaping.
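
To sanity-check the shape of that merged rule outside ANTLR, here is a minimal sketch that expresses the same three alternatives as a java.util.regex pattern. It is only a regex analogue, not the lexer rule itself; the class name DefineBlockSketch and the helper firstDefine are made up for illustration, and DefineStart is assumed to be '#', optional spaces or tabs, then 'define', as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefineBlockSketch {
    // Regex analogue of the merged rule (an assumption, not ANTLR syntax):
    //   [^\\\r\n]   any character other than backslash, CR or LF
    //   \\\r?\n     a backslash followed by a line break (escaped newline)
    //   \\.         a backslash followed by any other character
    static final Pattern DEFINE_BLOCK =
        Pattern.compile("#[ \\t]*define(?:[^\\\\\\r\\n]|\\\\\\r?\\n|\\\\.)*");

    // Returns the first #define (single-line or multiline) found in source, or null.
    static String firstDefine(String source) {
        Matcher m = DEFINE_BLOCK.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String src = "#define MACRO(num) {\\\n  printf(\"%d\", num);\\\n}\nint x;\n";
        // The match should stop at the first line break that is not preceded by a backslash.
        System.out.println(firstDefine(src));
    }
}
```

The match ends at the first unescaped line break, which is the behaviour the merged DefineBlock rule is meant to have.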

antlr grammar for triple quoted string

I am trying to update an ANTLR grammar that follows the following spec
https://github.com/facebook/graphql/pull/327/files
In logical terms its defined as
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Note: the above is logical notation, not ANTLR syntax.)
I tried the following in ANTLR 4, but it won't recognize more than one character inside a triple-quoted string:
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""', but as soon as I add more characters it fails,
e.g. '"""abc"""' won't parse, and the IntelliJ plugin for ANTLR says:
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '\"""' escaping?

Some of your parser rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
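
If it helps to see the "( EscapedTripleQuote | SourceCharacter )*" idea in isolation, below is a rough Java regex analogue of the Triplequotedstring rule. It is a sketch under assumptions: [\s\S] stands in for SourceCharacter (so it is more permissive than the \u0009... range above), and the names TripleQuotedSketch and firstTripleQuoted are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TripleQuotedSketch {
    // Opening """, then a non-greedy run of either an escaped triple quote (\""")
    // or any single character, then the closing """.
    // Trying the escaped form first keeps \""" from ending the string early.
    static final Pattern TRIPLE_QUOTED =
        Pattern.compile("\"\"\"(?:\\\\\"\"\"|[\\s\\S])*?\"\"\"");

    // Returns the first triple-quoted string found in text, or null.
    static String firstTripleQuoted(String text) {
        Matcher m = TRIPLE_QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTripleQuoted("\"\"\"abc\"\"\""));
        System.out.println(firstTripleQuoted("x = \"\"\"a \\\"\"\" b\"\"\" + \"c\""));
    }
}
```

The non-greedy quantifier plays the same role as the *? in the lexer rule: the string ends at the first unescaped """.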

Antlr how to escape quote symbol in quoted string

I want some grammar to represent a string, quoted by " and the " symbol inside string can be quoted like \". Following is my grammar:
fragment
NUM_LETTER : ('a'..'z'|'A'..'Z'|'0'..'9');
STRING_LITERAL : '"' (NUM_LETTER|'_'|('\\"'))* '"';
But it does not work. I try to interpret "\"a" in AntlrWorks1.5 and it gives a MismatchedTokenException in the generated syntax tree for STRING_LITERAL. Which part of my grammar is wrong?

There's nothing wrong with the grammar. You're probably getting this error because you're using the interpreter, which is buggy. Use ANTLRWorks' debugger instead. The debugger will show you the input "\"a" is parsed just fine (press CTRL+D to start debugging).
Also, your string rule would probably be better off looking like this:
STRING_LITERAL : '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
In other words, the contents of your string are zero or more of:
- any char other than a quote, backslash or line break: ~('"' | '\\' | '\r' | '\n')
- an escaped quote or backslash: '\\' ('"' | '\\')
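
For a quick check of that rule without ANTLRWorks, the same two alternatives can be written as a Java regex. This is only an analogue of the lexer rule, with a hypothetical class name StringLiteralSketch and helper isStringLiteral; it is not what ANTLR generates.

```java
import java.util.regex.Pattern;

public class StringLiteralSketch {
    // A quote, then zero or more of:
    //   [^"\\\r\n]   any char other than a quote, backslash or line break
    //   \\["\\]      a backslash followed by a quote or another backslash
    // and finally the closing quote.
    static final Pattern STRING_LITERAL =
        Pattern.compile("\"(?:[^\"\\\\\\r\\n]|\\\\[\"\\\\])*\"");

    // True if the whole input is exactly one string literal.
    static boolean isStringLiteral(String input) {
        return STRING_LITERAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStringLiteral("\"\\\"a\"")); // the input "\"a" from the question
        System.out.println(isStringLiteral("\"abc"));     // missing closing quote
    }
}
```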

Try the following expression:
STRING : '"' (options{greedy=false;}:( ~('\\'|'"') | ('\\' '"')))* '"';

What is the equivalent for epsilon in ANTLR BNF grammar notation?

While using ANTLR 3.3, I'm changing my current grammar to also support inputs without parentheses. Here's the first version of my grammar:
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOLS : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
prog : formula EOF ;
formula : NOT formula
| OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| SYMBOLS ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
Then I changed it this way to support the appropriate features :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOL : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
EM : '' ;
prog : formula EOF ;
formula : OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| ( NOT formula | SYMBOL )( AND formula | OR formula | IMPLIES formula | EM ) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
But I'm now faced with the following error:
error<100>: syntax error: invalid char literal: ''
error<100>: syntax error: invalid char literal: ''
Does anybody know how I can overcome this error?

Your EM token:
EM : '' ;
is invalid: you can't match an empty string in lexer rules.
To match epsilon (nothing), you should do:
rule
: A
| B
| /* epsilon */
;
Of course, the comment /* epsilon */ can safely be removed.
Note that when you do it like that in your current grammar, ANTLR will complain that there can be rules matched using multiple alternatives. This is because your grammar is ambiguous.

I'm not an ANTLR expert, but you might try:
formula : term ((AND | OR | IMPLIES ) term )*;
term : OP formula CP | NOT term | SYMBOL ;
If you want traditional precedence of operators this won't do the trick, but that's another issue.
EDIT: OP raised the ante; he wants precedence too. I'll meet him halfway, since it wasn't part
of the original question. I've added precedence to the grammar that gives IMPLIES
lower precedence than the other operators, and leave it to OP to figure out how to do the rest.
formula: disjunction ( IMPLIES disjunction )* ;
disjunction: term (( AND | OR ) term )* ;
term: OP formula CP | NOT term | SYMBOL ;
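
To make the effect of those three rules concrete, here is a small hand-written recursive-descent sketch in Java with one method per rule. It is not ANTLR output; the class name PropLogicSketch and the fully parenthesized string it returns are assumptions made purely to show how the input groups.

```java
public class PropLogicSketch {
    private final String in;
    private int pos = 0;

    private PropLogicSketch(String input) {
        // Whitespace is dropped up front, much like sending it to the HIDDEN channel.
        this.in = input.replaceAll("\\s+", "");
    }

    // Parses a whole formula and returns it fully parenthesized.
    static String parse(String input) {
        PropLogicSketch p = new PropLogicSketch(input);
        String tree = p.formula();
        if (p.pos != p.in.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return tree;
    }

    // formula : disjunction ( IMPLIES disjunction )* ;
    private String formula() {
        String left = disjunction();
        while (in.startsWith("->", pos)) {
            pos += 2;
            left = "(" + left + " -> " + disjunction() + ")";
        }
        return left;
    }

    // disjunction : term ( ( AND | OR ) term )* ;
    private String disjunction() {
        String left = term();
        while (pos < in.length() && (in.charAt(pos) == '.' || in.charAt(pos) == '+')) {
            char op = in.charAt(pos++);
            left = "(" + left + " " + op + " " + term() + ")";
        }
        return left;
    }

    // term : OP formula CP | NOT term | SYMBOL ;
    private String term() {
        if (pos >= in.length()) throw new IllegalArgumentException("unexpected end of input");
        char c = in.charAt(pos++);
        if (c == '(') {
            String inner = formula();
            if (pos >= in.length() || in.charAt(pos++) != ')') {
                throw new IllegalArgumentException("missing ')'");
            }
            return inner;
        }
        if (c == '!') return "!" + term();
        if ((c >= 'a' && c <= 'z') || c == '~') return String.valueOf(c);
        throw new IllegalArgumentException("unexpected '" + c + "'");
    }

    public static void main(String[] args) {
        System.out.println(parse("!p + q -> r . s")); // prints ((!p + q) -> (r . s))
    }
}
```

Because formula only ever sees whole disjunctions, AND and OR bind tighter than IMPLIES, which is what the layered grammar expresses.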
OP additionally asked, "how to convert (!p or q ) into p -> q". I think he should
have asked this as a separate question. However, I'm already here.
What he needs to do is walk the tree, looking for the pattern he doesn't
like, and change the tree into one he does, and then prettyprint the answer.
It is possible to do all this with ANTLR, which is part of the reason
it is popular.
As a practical matter, procedurally walking the tree and checking the node
types, and splicing out old nodes and splicing in new is doable, but a royal PitA.
Especially if you want to do this for lots of transformations.
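
For a sense of what the procedural version involves, here is a minimal Java sketch of that one rewrite on a hand-rolled tree. The Node class is hypothetical (it is not ANTLR's tree API), and the operator spellings "!", "+" and "->" are taken from OP's grammar.

```java
public class ImpliesRewriteSketch {
    // Hypothetical minimal tree node: a leaf has no children, a NOT node has only
    // a left child, and a binary node has both.
    static final class Node {
        final String op;
        final Node left;
        final Node right;

        Node(String op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        static Node sym(String name) { return new Node(name, null, null); }
        static Node not(Node operand) { return new Node("!", operand, null); }
        static Node bin(String op, Node l, Node r) { return new Node(op, l, r); }

        @Override
        public String toString() {
            if (left == null) return op;         // symbol
            if (right == null) return "!" + left; // negation
            return "(" + left + " " + op + " " + right + ")";
        }
    }

    // Bottom-up walk: wherever the pattern (!p + q) occurs, splice in (p -> q).
    static Node rewrite(Node n) {
        if (n == null || n.left == null) return n; // nothing to do for leaves
        Node l = rewrite(n.left);
        Node r = rewrite(n.right);
        boolean leftIsNot = l.op.equals("!") && l.left != null && l.right == null;
        if (n.op.equals("+") && leftIsNot) {
            return Node.bin("->", l.left, r);
        }
        return new Node(n.op, l, r);
    }

    public static void main(String[] args) {
        Node tree = Node.bin("+", Node.not(Node.sym("p")), Node.sym("q"));
        System.out.println(tree + " becomes " + rewrite(tree));
    }
}
```

Even for a single pattern the matching code has to spell out node shapes by hand, which is the tedium described above.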
A more effective way to do this is to use a
program transformation system, which allows surface syntax patterns to be expressed for matching and replacement. Program transformation systems of course include parsing machinery, and the more powerful ones let you (and indeed insist that you) define
a grammar up front, much as you do for ANTLR.
Our DMS Software Reengineering Toolkit is such a program transformation tool, and with a suitably defined grammar for propositions,
the following DMS transformation rule would carry out OP's additional request:
domain proplogic; // tell DMS to use OP's definition of logic as a grammar
rule normalize_implies_from_or( p: term, q: term): formula -> formula
" NOT \p OR \q " -> " \p IMPLIES \q ";
The " ... " is "domain notation", e.g, surface syntax from the proplogic domain, the "\" are meta-escapes,
so "\p" and "\q" represent any arbitrary term from the proplogic grammar. Notice the rule has to reach "across" precedence levels when being applied, as "NOT \p OR \q" isn't a formula and "\p IMPLIES \q" is; DMS takes care of all this (the "formula -> formula" notation is how DMS knows what to do). This rule does a tree-to-tree rewrite. The resulting tree can be prettyprinted by DMS.
You can see a complete example of something very similar, e.g., a grammar for conventional algebra and rewrite rule to simplify algebraic equations.

Split a string on commas not contained within double-quotes with a twist

I asked this question earlier and it was closed because it was a duplicate, which I accept and actually found the answer in the question Java: splitting a comma-separated string but ignoring commas in quotes, so thanks to whoever posted it.
But I've since run into another issue. Apparently what I need to do is use "," as my delimiter when there are zero or an even number of double-quotes, but also ignore any "," contained in brackets.
So the following:
"Thanks,", "in advance,", "for("the", "help")"
Would tokenize as:
Thanks,
in advance,
for("the", "help")
I'm not sure if there's any way to modify the current regex I'm using to allow for this, but any guidance would be appreciated.
line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

Sometimes it is easier to match what you want instead of what you don't want:
String s = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
String regex = "\"(\\([^)]*\\)|[^\"])*\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(s.substring(m.start(),m.end()));
}
Output:
"Thanks,"
"in advance,"
"for("the", "help")"
If you also need it to ignore closing brackets inside the quotes sections that are inside the brackets, then you need this:
String regex = "\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"";
An example of a string which needs this second, more complex version is:
"foo","bar","baz(":-)",":-o")"
Output:
"foo"
"bar"
"baz(":-)",":-o")"
However, I'd advise you to change your data format if at all possible. This would be a lot easier if you used a standard format like XML to store your tokens.
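
For anyone who wants to run both patterns from this answer side by side, here is a self-contained sketch. The two regexes are the ones given above; the class name QuotedTokens and the helper tokens are only scaffolding added for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokens {
    // First pattern: a quoted section may contain one level of (...) with anything but ')' inside.
    static final Pattern SIMPLE =
        Pattern.compile("\"(\\([^)]*\\)|[^\"])*\"");

    // Second pattern: inside the brackets, quoted sections are matched as units,
    // so a ')' inside those inner quotes no longer ends the bracket.
    static final Pattern NESTED =
        Pattern.compile("\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"");

    // Collects every match of the pattern in the line, in order.
    static List<String> tokens(Pattern pattern, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String first = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
        String second = "\"foo\",\"bar\",\"baz(\":-)\",\":-o\")\"";
        tokens(SIMPLE, first).forEach(System.out::println);
        tokens(NESTED, second).forEach(System.out::println);
    }
}
```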

A home-grown parser is easily written.
For example, this ANTLR grammar takes care of your example input without much trouble:
parse
: line*
;
line
: Quoted ( ',' Quoted )* ( '\r'? '\n' | EOF )
;
Quoted
: '"' ( Atom )* '"'
;
fragment
Atom
: Parentheses
| ~( '"' | '\r' | '\n' | '(' | ')' )
;
fragment
Parentheses
: '(' ~( '(' | ')' | '\r' | '\n' )* ')'
;
Space
: ( ' ' | '\t' ) {skip();}
;
and it would be easy to extend this to take escaped quotes or parentheses into account.
When feeding the following two lines of input to the parser generated by that grammar:
"Thanks,", "in advance,", "for("the", "help")"
"and(,some,more)","data , here"
it gets parsed like this:
If you're considering ANTLR for this, I can post a little how-to for getting a parser from the grammar I posted, if you want.
