ANTLR 4: Parsing grammar - java

I want to parse some data from an AppleSoft BASIC script.
I chose ANTLR and downloaded this grammar: jvmBasic
I'm trying to extract the function name without its parameters:
return parser.prog().line(0).amprstmt(0).statement().getText();
but it returns PRINT"HELLO", i.e. the full expression except the line number.
Here is the string I want to parse:
10 PRINT "Hello!"

This really depends on how your ANTLR program is implemented, but if you are using a tree walker/listener you probably want to target the rules for the specific tokens, not the entire "statement" rule, which encompasses many different kinds of statement:
//each line can have one to many amprstmt's
line
: (linenumber ((amprstmt (COLON amprstmt?)*) | (COMMENT | REM)))
;
amprstmt
: (amperoper? statement) //encounters a statement here
| (COMMENT | REM)
;
//statements can be made of 1 to many sub statements
statement
: (CLS | LOAD | SAVE | TRACE | NOTRACE | FLASH | INVERSE | GR | NORMAL | SHLOAD | CLEAR | RUN | STOP | TEXT | HOME | HGR | HGR2)
| prstmt
| printstmt1 //the print rule
//MANY MANY OTHER RULES HERE TOO LONG TO PASTE........
;
//the example rule that occurs when the token's "print" is encountered
printstmt1
: (PRINT | QUESTION) printlist?
;
printlist
: expression (COMMA | SEMICOLON)? printlist*
;
As you can see from this BNF-style grammar, the statement rule includes the rule for a print statement as well as every other type of statement. In your case it therefore covers PRINT and "Hello!" (everything except the line number, since linenumber is a rule outside of statement), and that is the text getText() returns when the rule is matched.
If you want to handle what happens when these specific rules are encountered, you most likely want to add functionality to the methods ANTLR generates for each rule by extending the generated jvmBasicBaseListener class, as shown here.
Example: jvmBasicCustomListener.java (extending jvmBasicBaseListener):
@Override
public void enterPrintstmt1(jvmBasicParser.Printstmt1Context ctx) {
    System.out.println(ctx.getText());
}
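For completeness, here is a rough sketch of wiring that listener up with a ParseTreeWalker. It assumes the generated jvmBasic classes and the jvmBasicCustomListener above (which would extend the generated jvmBasicBaseListener); adjust class and package names to your setup.
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class ListenerDemo {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog();
        // Walk the tree; enterPrintstmt1 fires whenever a PRINT statement is met.
        ParseTreeWalker.DEFAULT.walk(new jvmBasicCustomListener(), tree);
    }
}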
However, if all of this is set up and you just want to return a string value using a single line like the one you have, then addressing the child nodes of statement at a lower level may work (amprstmt -> statement -> printstmt1 -> value):
return parser.prog().line(0).amprstmt(0).statement().printstmt1().getText();
To narrow my answer slightly, the rules that specifically address your input (10 PRINT "Hello!") are:
linenumber (contains Number), statement -> printstmt1, and statement -> datastmt -> datum (contains STRINGLITERAL)
As shown above, the linenumber rule stands on its own, while the other two rules that match your text are children of statement; that is why getting the text of the statement rule outputs everything except the line number.
Addressing each of these individually with getText(), rather than an encompassing rule such as statement, may give you the result you are looking for.
Update to address your question, since the answer is slightly longer: in my opinion the easiest way to handle specific rules, rather than generating a listener or visitor, is to embed actions in your grammar rules, like this:
printstmt1
: (PRINT | QUESTION) printlist? {System.out.println("Print"); /* your Java code */}
;
This simply allows you to address each rule and perform whatever Java action you wish to carry out. You can then generate and compile your code with something like:
java -jar antlr-4.5.3-complete.jar jvmBasic.g4 -visitor
After this you can run your code however you wish; here is an example:
import JVM1.jvmBasicLexer;
// parser and base visitor assumed to be generated into the same JVM1 package
import JVM1.jvmBasicParser;
import JVM1.jvmBasicBaseVisitor;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Jvm extends jvmBasicBaseVisitor<Object> {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.prog();
    }
}
The output for this example would then be just :
Print
You could also incorporate whatever Java methods you like within the grammar to address each rule encountered, and either develop your own classes and methods to handle it or directly print out a result.
Update
Just to address the latest question now :
parser.line().linenumber().getText() - for the line number, as line is not part of a statement
parser.prog().line(0).amprstmt(0).statement().printstmt1().PRINT().getText() - for PRINT, as it is isolated in printstmt1 (however it does not include CLR in the rule)
parser.prog().line(0).amprstmt(0).statement().printstmt1().printlist().expression().getText() - to get the value "hello", as it is part of an expression contained within the printstmt1 rule
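Putting those together, here is a rough sketch that pulls out each piece in one go. It assumes the generated jvmBasic classes; the context class names (LineContext etc.) follow ANTLR's usual naming for the rules shown above.
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class ExtractDemo {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        // Parse once and keep the first line's context around.
        jvmBasicParser.LineContext line = parser.prog().line(0);

        String lineNumber = line.linenumber().getText();                              // "10"
        String keyword = line.amprstmt(0).statement().printstmt1().PRINT().getText(); // "PRINT"
        String argument = line.amprstmt(0).statement().printstmt1().printlist()
                              .expression().getText();                                // the quoted "Hello!"
        System.out.println(lineNumber + " " + keyword + " " + argument);
    }
}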
:) Good luck

Related

ANTLR4 error recovery issues for class bodies

I've found a strange issue regarding error recovery in ANTLR4. If I take the grammar example from the ANTLR book
grammar simple;
prog: classDef+ ; // match one or more class definitions
classDef
: 'class' ID '{' member+ '}' // a class has one or more members
;
member
: 'int' ID ';' // field definition
| 'int' f=ID '(' ID ')' '{' stat '}' // method definition
;
stat: expr ';'
| ID '=' expr ';'
;
expr: INT
| ID '(' INT ')'
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
and use the input
class T {
    y;
    int x;
}
it will see the first member as an error (as it expects 'int' before 'y').
classDef
| "class"
| ID 'T'
| "{"
|- member
| | ID "y" -> error
| | ";" -> error
|- member
| | "int"
| | ID "x"
| | ";"
In this case ANTLR4 recovers from the error in the first member subrule and parses the second member correctly.
But if the member reference in classDef is changed from the mandatory member+ to the optional member*
classDef
: 'class' ID '{' member* '}' // a class has zero or more members
;
then the parsed tree will look like
classDef
| "class" -> error
| ID "T" -> error
| "{" -> error
| ID "y" -> error
| ";" -> error
| "int" -> error
| ID "x" -> error
| ";" -> error
| "}" -> error
It seems that the error recovery cannot solve the issue inside the member subrule anymore.
Obviously using member+ is the way forward as it provides the correct error recovery result. But how do I allow empty class bodies? Am I missing something in the grammar?
The DefaultErrorStrategy class is quite complex, with token deletions and insertions, and the book explains the theory behind it very well. But what I'm missing is how to implement custom error recovery for specific rules.
In my case I would add something like "if { is already consumed, try to find int or }" to optimize the error recovery for this rule.
Is this possible with ANTLR4 error recovery in a reasonable way at all? Or do I have to implement manual parser by hand to really gain control over error recovery for those use cases?
It is worth noting that the parser never enters the sub rule for the given input. The classDef rule fails before trying to match a member.
Before trying to parse the sub-rule, the sync method on DefaultErrorStrategy is called. This sync recognizes there is a problem and tries to recover by deleting a single token to see if that fixes things up.
In this case it doesn't, so an exception is thrown and then tokens are consumed until a 'class' token is found. This makes sense because that is what can follow a classDef and it is the classDef rule, not the member rule that is failing at this point.
It doesn't look simple to do correctly, but if you install a custom subclass of DefaultErrorStrategy and override the sync() method, you can get any recovery strategy you like.
Something like the following could be a starting point:
@Override
public void sync(Parser recognizer) throws RecognitionException {
    if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
        return;
    }
    super.sync(recognizer);
}
The result being that the sync doesn't fail, and the member rule is executed. Parsing the first member fails, and the default recovery method handles moving on to the next member in the class.
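Installing the custom strategy is just a matter of handing it to the parser before you start parsing. A minimal sketch, assuming the simpleLexer/simpleParser generated from the grammar above (the wrapper class and strategy names here are made up):
import org.antlr.v4.runtime.*;

public class RecoveryDemo {
    static class ClassBodyErrorStrategy extends DefaultErrorStrategy {
        @Override
        public void sync(Parser recognizer) throws RecognitionException {
            // Skip the pre-loop sync inside classDef so the member rule gets a chance to recover.
            if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
                return;
            }
            super.sync(recognizer);
        }
    }

    public static void main(String[] args) {
        simpleLexer lexer = new simpleLexer(new ANTLRInputStream("class T {\n y;\n int x;\n}"));
        simpleParser parser = new simpleParser(new CommonTokenStream(lexer));
        parser.setErrorHandler(new ClassBodyErrorStrategy()); // replace the default recovery
        parser.prog();
    }
}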

Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.
The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add unknowns to the lexer that will match multiples of these...
unknowns : Unknown+ ;
...
Unknown : . ;
There was an edit made to that post to cater for the case where you are only using a lexer and not a parser.
If you are using a parser, you do not need to override the nextToken method, because the error can be handled in the parser in a much cleaner way, i.e. unknowns are just another token type as far as the lexer is concerned. The lexer passes them to the parser, which can then handle the errors. If using a parser I'd normally recognize all tokens as individual tokens and then emit the errors in the parser, i.e. group them or not. The reason for doing this is that all error handling is done in one place, i.e. it's not split between the lexer and the parser. It also makes the lexer simpler to write and test, i.e. it must recognize all text and never fail on any UTF-8 input. Some people would likely do it differently, but this has worked for me with hand-written lexers in C. The parser is in charge of determining what's actually valid and how to report errors for it. One other benefit is that the lexer is fairly generic and can be reused.
For a lexer-only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method, and it goes into some detail on how to do that:
@Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if (next.getType() != Unknown) {
    return next;
}
The code that comes after this handles the case where you can match the bad tokens.
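For reference, a merging lexer along those lines might look like the following sketch. It assumes a generated JSONLexer that defines an Unknown token type (Unknown : . ;), as in the linked answer; the subclass name is made up here.
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.Token;

public class MergingJSONLexer extends JSONLexer {
    private Token lookahead; // token read past the end of a run of Unknown tokens

    public MergingJSONLexer(CharStream input) {
        super(input);
    }

    @Override
    public Token nextToken() {
        if (lookahead != null) {            // hand out the token we read ahead earlier
            Token t = lookahead;
            lookahead = null;
            return t;
        }
        Token next = super.nextToken();
        if (next.getType() != Unknown) {
            return next;
        }
        // Glue consecutive Unknown tokens into one bigger token.
        StringBuilder text = new StringBuilder(next.getText());
        Token last = next;
        Token ahead = super.nextToken();
        while (ahead.getType() == Unknown) {
            text.append(ahead.getText());
            last = ahead;
            ahead = super.nextToken();
        }
        lookahead = ahead;
        CommonToken merged = new CommonToken(next);
        merged.setText(text.toString());
        merged.setStopIndex(last.getStopIndex());
        return merged;
    }
}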
What you could do is use lexer modes. For this you'd have to split the grammar into a parser grammar and a lexer grammar. Let's start with the lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc.) and introduced a NON_VALID_STRING token, which is basically the first character that isn't matched by anything else. Once this token is detected, I switch the lexer to MODE_ERR mode. In this mode I go back to the default mode once : is detected (this can be changed and maybe refined, but it serves the purpose here :) ), or I say that everything else is MY_ERR_TOKEN, to which I assign the NON_VALID_STRING token type. Here is what ANTLRWorks says when I run the interpret lexer option with your input:
So s is of type NON_VALID_STRING, and so is everything else up to :. Same type, but two different tokens. If you want them not to be of the same type, simply omit the type() call in the lexer grammar.
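If you don't have ANTLRWorks handy, a small sketch like this can dump the token stream for the same input. It assumes the JSONLexer generated from the lexer grammar above; getVocabulary() needs ANTLR 4.5 or newer.
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class DumpTokens {
    public static void main(String[] args) {
        String input = "{\n sometext-without-quotes : 42\n}";
        JSONLexer lexer = new JSONLexer(new ANTLRInputStream(input));
        // Print each token's symbolic type name next to its text.
        for (Token t : lexer.getAllTokens()) {
            System.out.println(lexer.getVocabulary().getSymbolicName(t.getType())
                    + " -> '" + t.getText() + "'");
        }
    }
}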
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
and if you run the test rig (I do it with ANTLRWorks) you'll get a single error (see screenshot)
Also, you could accumulate lexer errors by overriding the generated lexer class, but I understood from the question that this is not desired (or I didn't understand that part :) )
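If you do want to collect the lexer errors, an alternative to overriding the lexer is to attach a custom error listener. A sketch, assuming a generated JSONLexer (with the stock JSON grammar this would collect the per-character errors instead of printing them to the console):
import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

public class CollectingErrorListener extends BaseErrorListener {
    public final List<String> errors = new ArrayList<>();

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine, String msg,
                            RecognitionException e) {
        errors.add("line " + line + ":" + charPositionInLine + " " + msg);
    }

    public static void main(String[] args) {
        JSONLexer lexer = new JSONLexer(new ANTLRInputStream("{\n sometext-without-quotes : 42\n}"));
        CollectingErrorListener listener = new CollectingErrorListener();
        lexer.removeErrorListeners();   // drop the default ConsoleErrorListener
        lexer.addErrorListener(listener);
        lexer.getAllTokens();           // drive the lexer over the whole input
        listener.errors.forEach(System.out::println);
    }
}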

How to structure my XText terminals? WORDS/SL_STRING/ML_STRING

In my XText DSL, I want to be able to use three different kinds of text terminals. They are all used for adding comments on top of arrows drawn in a UML diagram:
terminal WORD:
Actor -> Actor: WORD
terminal SL_STRINGS:
Actor -> Actor: A sequence of words on a single line
terminal ML_STRINGS:
Actor -> Actor: A series of words on
multiple
lines
My initial approach was to use the ID terminal from the org.eclipse.xtext.common.Terminals as my WORD terminal, and then just have SL_STRINGS be (WORD)*, and ML_STRINGS be (NEWLINE? WORD)*, but this creates a lot of problems with ambiguity between the rules.
How would I go about structuring this in a good way?
More information about the project (and as this is my first time working with Xtext, please bear with me):
I am trying to implement a DSL to be used together with the Eclipse Plugin for PlantUML http://plantuml.sourceforge.net/sequence.html mainly for Syntax Checking and Colorization. Currently my grammar works as such:
Model:
(diagrams+=Diagram)*;
Diagram:
'#startuml' NEWLINE (instructions+=(Instruction))* '#enduml' NEWLINE*
;
An instruction can be lots of things:
Instruction:
((name1=ID SEQUENCE name2=ID (':' ID)?)
| Definition (Color)?
| AutoNumber
| Title
| Legend
| Newpage
| AltElse
| GroupingMessages
| Note
| Divider
| Reference
| Delay
| Space
| Hidefootbox
| Lifeline
| ParticipantCreation
| Box)? NEWLINE
;
Example of rules that need different kinds of text terminals:
Group:
'group' TEXT
;
Reference:
'ref over' ID (',' ID)* ((':' SL_TEXT)|((ML_TEXT) NEWLINE 'end ref'))
;
For Group, the text can only be on one line, while for Reference, the text can be on two lines if there is no ":" following the rule call.
Currently my terminals look like this:
terminal NEWLINE : ('\r'? '\n');
// Multiline comment begins with /', and ends with '/
terminal ML_COMMENT : '/\'' -> '\'/';
// Singleline comment begins with ', and continues until end of line.
terminal SL_COMMENT : '\'' !('\n'|'\r')* ('\r'? '\n')?;
// INT is a sequence of numbers 0-9.
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal WS : (' '|'\t')+;
terminal ANY_OTHER: .;
And on top of this I want to add three new terminals that take care of the text.
You should implement a data type rule in order to achieve the desired behavior.
Sebastian wrote an excellent blog post on this topic which can be found here: http://zarnekow.blogspot.de/2012/11/xtext-corner-6-data-types-terminals-why.html
Here is a minimal example of a grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
greetings+=Greeting*;
Greeting:
'Example' ':' comment=Comment;
Comment:
(ID ('\r'? '\n')?)+
;
That will allow you to write something like this:
Example: A series of words
Example: A series of words on
multiple lines
You then may want to implement your own value converter in order to fine-tune the conversion to a String.
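A value converter for that Comment rule might look roughly like the sketch below. It assumes the usual Xtext value-converter APIs; the class name MyDslValueConverters is made up, and the whitespace normalisation is just an example of what such fine-tuning could do.
import org.eclipse.xtext.common.services.DefaultTerminalConverters;
import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.nodemodel.INode;

public class MyDslValueConverters extends DefaultTerminalConverters {

    @ValueConverter(rule = "Comment")
    public IValueConverter<String> Comment() {
        return new IValueConverter<String>() {
            @Override
            public String toValue(String string, INode node) throws ValueConverterException {
                // Collapse line breaks and surplus whitespace into single spaces.
                return string == null ? null : string.replaceAll("\\s+", " ").trim();
            }

            @Override
            public String toString(String value) throws ValueConverterException {
                return value;
            }
        };
    }
}
The converter service would then typically be bound in the language's runtime module (by overriding bindIValueConverterService() to return this class).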
Let me know if that helps!

Capturing a regex group nested/enclosed by special character

I am trying to search for words that appear within tilde (~) sign borders.
e.g. ~albert~ is a ~good~ boy.
I know that this is possible by using ~.+?~, and it already works for me. But there are special cases where I need to match a nested tilde sentence.
e.g. ~The ~spectacle~~ was ~broken~
In the example above, I have to capture 'The Spectacle', 'spectacle', and 'broken' separately. These will be translated either word-by-word or with accompanying article (An, The, whatever). The reason is that in my system:
1) 'The spectacle' requires a separate translation in specific cases.
2) 'Spectacle' also needs translation in specific cases.
3) IF a translation exists for 'The spectacle', we will use that, ELSE
we will use
Another example to explain this is:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~
that was given to ~me~.
In the example above, I will have translation for:
1) 'The spectacle' (because the translation case exists for 'The spectacle', otherwise I would've only translated spectacle on its own)
2) 'broken'
3) 'spectacle'
4) me
I am having trouble combining an expression which will make sure that this is captured in my regular expression. The one that I have managed to work with so far is '~.+?~'. But I know that with some form of lookahead or lookbehind, I can get this working. Could anyone help me on this?
The most important aspect in this is regression-proofing, which will ensure that the existing behaviour doesn't break. If I manage to get it right, I will post it.
N.B. If it helps, currently I will only have instances where one level of nesting requires decomposition, so ~The ~spectacle~~ will be the deepest level (until I need more!!!!!)
I wrote something like this a while ago; I haven't tested it much though:
(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))
or
(~(?(?=.*?~[A-Za-z]*?~.*?~).*?~.*?~.*?~|[^~]+?~))
RegEx101
Another alternative
(~(?:.*?~.*?~){0,2}.*?~)
^^ change to max depth
which ever works best
To add more depth, add a few extra sets of .*?~ in the two places where you see a run of them.
The main problem
If we allow unlimited nesting, how would we know where it ends and begins? A clumsy diagram:
~This text could be nested ~ so could this~ and this~ this ~Also this~
| | |_________| | |
| |_______________________________| |
|____________________________________________________________________|
or:
~This text could be nested ~ so could this~ and this~ this ~Also this~
| | | | |_________|
| |______________| |
|___________________________________________________|
The compiler would have no idea which to choose
For your sentence
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
| | ||_____| | | |
| | |_____________| | |
| |____________________________________________________| |
|___________________________________________________________________|
or:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
| |_________|| |______| |_________| |__|
|_______________|
What should I do?
Use distinct opening and closing characters (as @tbraun suggested) so the compiler knows where to start and end:
{This text can be {properly {nested}} without problems} because {the compiler {can {see {the}}} start and end points} easily. Or use a compiler:
Note: I don't do Java much so some code might be incorrect
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.StringUtils; // StringUtils from Apache Commons Lang

String[] chars = myString.split("");
int depth = 0;
int lastIndex = 0;
List<String> results = new ArrayList<String>();
for (int i = 0; i < chars.length; i += 1) {
    if (chars[i].equals("{")) {
        depth += 1;
        if (depth == 1) {
            lastIndex = i;  // remember where the outermost group starts
        }
    }
    if (chars[i].equals("}")) {
        depth -= 1;
        if (depth == 0) {
            // collect the whole outermost {...} group
            results.add(StringUtils.join(Arrays.copyOfRange(chars, lastIndex, i + 1), ""));
        }
        if (depth < 0) {
            // balancing problem: handle an error
        }
    }
}
This uses StringUtils
You'll need something to differentiate the start and finish patterns, i.e. { and }.
Then you can use the pattern \{[^{]*?\} to exclude {:
{The {spectacle}} was {broken}
First iteration
{spectacle}
{broken}
Second iteration
{The spectacle}
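With such delimiters, those two iterations can be driven by a small loop that repeatedly translates the innermost {...} groups until no braces remain. A sketch of that idea; translate() is a hypothetical lookup that would return the translation for a phrase (here it just echoes the phrase back):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTranslator {
    // Innermost group: a { followed by anything that is not another {, up to the next }.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{]*?)\\}");

    static String translate(String phrase) {
        return phrase; // placeholder for the real dictionary lookup
    }

    public static void main(String[] args) {
        String text = "{The {spectacle}} was {broken}";
        Matcher m = INNERMOST.matcher(text);
        while (m.find()) {
            // Replace the innermost group we just found, then rescan the new text.
            text = m.replaceFirst(Matcher.quoteReplacement(translate(m.group(1))));
            m = INNERMOST.matcher(text);
        }
        System.out.println(text); // "The spectacle was broken"
    }
}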

Regexp pattern for list of email

I need a valid regexp for a list of emails separated by " ", where every email ends with @a.com or @b.com.
For example:
valid email string: "email1@a.com email2@b.com email3@a.com"
invalid email string: "email1@a.com email2@b.com email3@c.com"
I don't necessarily think a regexp is an extensible and maintainable solution here. I would rather:
split the list on whitespace (perhaps on whitespace preceded by .com/.org etc.)
extract the domain name after the @
compare this vs. a whitelist (or blacklist)
I like regexps a lot, but I don't always think they're the solution. See here for a discussion on this, and note the below!
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
You can try this expression:
^(( |^)[^ @]+@[ab]\.com)+$
//  ^    ^   ^ ^    ^
//  |    |   | |    +--- The mandatory .com
//  |    |   | +-------- Either a or b
//  |    |   +---------- An @ sign
//  |    +-------------- Anything but space or @, repeated at least once
//  +------------------- Preceded by a space or the beginning of the line
Try this:
^(.+@(a|b)\.com(|\s))+$
Try entering an invalid string like "c.com" and see that it is rejected too.
RegexPal is a nice, easy tool to start working on a regex for whatever problem you are trying to solve!
(email[1-3]@[ab]\.com )*email[1-3]@[ab]\.com ?
(replace [1-3] and [ab] with whatever really suits you).
[A-Za-z0-9_.-]+@[ab]\.com( [A-Za-z0-9_.-]+@[ab]\.com)*
You can change the [A-Za-z0-9_.-]+ part if you want to be more restrictive.
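Whichever pattern you settle on, in Java it would be applied with Pattern/Matcher; a quick usage sketch for the last pattern above:
import java.util.regex.Pattern;

public class EmailRegexDemo {
    private static final Pattern LIST =
            Pattern.compile("[A-Za-z0-9_.-]+@[ab]\\.com( [A-Za-z0-9_.-]+@[ab]\\.com)*");

    public static void main(String[] args) {
        // matches() requires the whole string to match the pattern.
        System.out.println(LIST.matcher("email1@a.com email2@b.com email3@a.com").matches()); // true
        System.out.println(LIST.matcher("email1@a.com email2@b.com email3@c.com").matches()); // false
    }
}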
