JavaCC: Matching with wildcard but not consuming for state switch - java

In JavaCC, for example in state DEFAULT, I want to perform a state switch, if the next token is <A>, I want to switch to state STATE_A, otherwise, I want to switch to state STATE_B.
I tried to use something like the following code with "" as a wildcard:
TOKEN:
{
<A: "aa"> : STATE_A
| <NOT_A: ""> : STATE_B
}
But it doesn't work, when a character that cannot be reduced to A is met, the function returns immediately, and doesn't get switched to STATE_B, therefore "" doesn't seem to be able to do the job.
Do you have any suggestions? Thanks.

Well I find that this actually works.
The empty string will be matched when A cannot be matched, however we need to explicitly refer to NOT_A in the definitions of non-terminals. Therefore expressions like
[ <A> ]
should be rewritten as
( <A> | <NOT_A> )
to enforce a state switch.

Related

How to keep a value taken from a rule in a variable and reuse it in the ANTLR grammar?

I declared String variables in my grammar, vAttributeName and vClassName, and I assign them with the text values of the tokens : className and attributeName. When I use one of the variables that should normally contain a value in an error alternative so in another rule, it returns a null in the message ... Why do not my variables keep the values? How can I fix that?
grammar TestExpression;
#parser::members {
String vAttributeName;
String vClassName;
}
/* SYNTAX RULES */
textInput : classifierContext ;
classifierContext : 'context' c=className attributeContext {vClassName = $c.text;};
attributeContext : '::' a=attributeName ':' dataType initDefinition {vAttributeName = $a.text;};
initDefinition : 'init' ':' initExpression ;
initExpression : boolExpression
| decimalExpression
| dateTimeExpression
| .+? {notifyErrorListeners("Corriger - "l'attribut "+vAttributeName+" de l'entité "+vClassName+" ne correspond pas");}
I'm trying to parse an expression who describe a class with an attribute that has a false value.
And the message i expected was : "Corriger - l'attribut seat de l'entité Car ne correspond pas".
But the actual message was : "Corriger - l'attribut null de l'entité null ne correspond pas".
If you have a rule like a: b c {action}; and an input that matches that rule, events will happen in the following order:
The rule b is applied and its action is executed if it has one.
The rule c is applied and its action is executed if it has one.
The action action is executed and is able to access the results of the b and c rules (including anything set by their actions, which is why it's important that those have run first).
If you have a: b {action} c; instead, then the action will be executed after b, but before c (and will consequently be unable to access c's result).
So for your code that means that the action of initExpression is run before those of classifierContext and attributeContext and that's why the variables aren't set. You can fix the problem by moving the action, so that it happens directly after the needed part (i.e. className/attributeName) has been parsed.
You could also get rid of the variables altogether and instead get the desired information by getting it from your grandparent contexts.

ANTLR 4: Parsing grammar

I want to parse some data from AppleSoft Basic script.
I choose ANTLR and download this grammar: jvmBasic
I'm trying to extract function name without parameters:
return parser.prog().line(0).amprstmt(0).statement().getText();
but it returns PRINT"HELLO" e.g full expression except the line number
Here is string i want to parse:
10 PRINT "Hello!"
I think this question really depends on your ANTLR program implementation but if you are using a treewalker/listener you probably want to be targeting the rule for the specific tokens not the entire "statement" rule which is circular and encompasses many types of statement :
//each line can have one to many amprstmt's
line
: (linenumber ((amprstmt (COLON amprstmt?)*) | (COMMENT | REM)))
;
amprstmt
: (amperoper? statement) //encounters a statement here
| (COMMENT | REM)
;
//statements can be made of 1 to many sub statements
statement
: (CLS | LOAD | SAVE | TRACE | NOTRACE | FLASH | INVERSE | GR | NORMAL | SHLOAD | CLEAR | RUN | STOP | TEXT | HOME | HGR | HGR2)
| prstmt
| printstmt1 //the print rule
//MANY MANY OTHER RULES HERE TOO LONG TO PASTE........
;
//the example rule that occurs when the token's "print" is encountered
printstmt1
: (PRINT | QUESTION) printlist?
;
printlist
: expression (COMMA | SEMICOLON)? printlist*
;
As you can see from the BNF type grammar here the statement rule in this grammar includes the rules for a print statement as well as every other type of statement so it will encompass 10, PRINT and hello and subsequently return the text with the getText() method when any of these are encountered in your case, everything but linenumber which is a rule outside of the statement rule.
If you want to target these specific rules to handle what happens when they are encountered you most likely want to add functionality to each of the methods ANTLR generates for each rule by extending the jvmBasiListener class as shown here
example:
-jvmBasicListener.java
-extended to jvmBasicCustomListener.java
void enterPrintstmt1(jvmBasicParser.Printstmt1Context ctx){
System.out.println(ctx.getText());
}
However if all this is setup and you are just wanting to return a string value etc using the single line you have then trying to access the methods at a lower level by addressing the child nodes of statement may work amprstmt->statement->printstmt1->value :
return parser.prog().line().amprstmt(0).statement().printstmt1().getText();
Just to maybe narrow my answer slightly, the rules specifically that address your input "10 PRINT "HELLO" " would be :
linenumber (contains Number) , statement->printstmt1 and statement->datastmt->datum (contains STRINGLITERAL)
So as shown above the linenumber rule exists on its own and the other 2 rules that defined your text are children of statement, which explains outputting everything except the line number when getting the statement rules text.
Addressing each of these and using getText() rather than an encompassing rule such as statement may give you the result you are looking for.
I will update to address your question since the answer may be slightly longer, the easiest way in my opinion to handle specific rules rather than generating a listener or visitor would be to implement actions within your grammar file rules like this :
printstmt1
: (PRINT | QUESTION) printlist? {System.out.println("Print"); //your java code }
;
This would simply allow you to address each rule and perform whichever java action you would wish to carry out. You can then simply compile your code with something like :
java -jar antlr-4.5.3-complete.jar jvmBasic.g4 -visitor
After this you can simply run your code however you wish, here is an example:
import JVM1.jvmBasicLexer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
public class Jvm extends jvmBasicBaseVisitor<Object> {
public static void main(String[] args) {
jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
ParseTree tree = parser.prog();
}
}
The output for this example would then be just :
Print
You could also incorporate whatever Java methods you like within the grammar to address each rule encountered and either develop your own classes and methods to handle it or directly print it out a result.
Update
Just to address the latest question now :
parser.line().linenumber().getText() - For line Number, as line is not part of a statement
parser.prog().line(0).amprstmt(0).statement().printstmt1().PR‌​INT().getText() - For PRINT as it is isolated in printstmt1, however does not include CLR in the rule
parser.prog().line(0).amprstmt(0).statement().printstmt1().pr‌intlist().expression().getText() - To get the value "hello" as it is part of an expression contained within the printstmt1 rule.
:) Good luck

ANTLR: parse NULL as a function name and a parameter

I would like to be able to use 'NULL' as both a parameter (the value null) and a function name in my grammar. See this reduced example :
grammar test;
expr
: value # valueExpr
| FUNCTION_NAME '(' (expr (',' expr)* )* ')' # functionExpr
;
value
: INT
| 'NULL'
;
FUNCTION_NAME
: [a-zA-Z] [a-zA-Z0-9]*
;
INT: [0-9]+;
Now, trying to parse:
NULL( 1 )
Results in the parse tree failing because it parses NULL as a value, and not a function name.
Ideally, I should even be able to parse NULL(NULL)..
Can you tell me if this is possible, and if yes, how to make this happen?
That 'NULL' string in your grammar defines an implicit token type, it's equivalent to adding something along this:
NULL: 'NULL';
At the start of the lexer rules. When a token matches several lexer rules, the first one is used, so in your grammar the implicit rule get priority, and you get a token of type 'NULL'.
A simple solution would be to introduce a parser rule for function names, something like this:
function_name: FUNCTION_NAME | 'NULL';
and then use that in your expr rule. But that seems brittle, if NULL is not intended to be a keyword in your grammar. There are other solution to this, but I'm not quite sure what to advise since I don't know how you expect your grammar to expand.
But another solution could be to rename FUNCTION_NAME to NAME, get rid of the 'NAME' token type, and rewrite expr like that:
expr
: value # valueExpr
| NAME '(' (expr (',' expr)* )* ')' # functionExpr
| {_input.LT(1).getText().equals("NULL")}? NAME # nullExpr
;
A semantic predicate takes care of the name comparison here.

Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.
The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add unknowns to the lexer that will match multiples of these...
unknowns : Unknown+ ;
...
Unknown : . ;
There was an edit made to this post to cater for the case where you were only using a lexer and not using a parser. If using a parser then you do not need to override the nextToken method because the error can be handled in the parser in a much cleaner way ie unknowns are just another token type as far as the lexer is concerned. The lexer passes these to the parser which can then handle the errors. If using a parser I'd normally recognize all tokens as individual tokens and then in the parser emit the errors ie group them or not. The reason for doing this is all error handling is done in one place ie it's not in the lexer and in the parser. It also makes the lexer simpler to write and test ie it must recognize all text and never fail on any utf8 input. Some people would likely do it differently but this has worked for me with hand written lexers in C. The parser is in charge of determining what's actually valid and how to error on it. One other benefit is that the lexer is fairly generic and can be reused.
For lexer only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method and goes into some detail on how to do that
#Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.
What you could do is use lexer modes. For this you'd had to split grammar to parser and lexer grammar. Let's start with lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc) and introduced NON_VALID_STRING token, which is basically the first character that's nothing that already is (should be) matched. Once this token is detected, I switch the lexer to MODE_ERR mode. In this mode I go back to default mode once : is detected (this can be changed and maybe refined, but server the purpose here :) ) or I say that everything else is MY_ERR_TOKEN to which I assign NON_VALID_STRING token type. Here is what ATNLRWorks says to this when I run interpret lexer option with your input:
So s is NON_VALID_STRING type and so is everything else until :. So, same type but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
and if you run the test rig (I do it with ANTLRworks) you'll get a single error (see screenshot)
Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)

How to express advanced expressions between query parameters in a REST API?

The problem (or missing feature) is the lack of expression possibility between different query parameters. As I see it you can only specify and between parameters, but how do you solve it if you want to have not equal, or or xor?
I would like to be able to express things like:
All users with age 20 or the name Bosse
/users?age=22|name=Bosse
All users except David and Lennart
/users?name!=David&name!=Lennart
My first idea is to use a query parameter called _filter and take a String with my expression like this:
All users with with age 22 or a name that is not Bosse
/users?_filter=age eq 22 or name neq Bosse
What is the best solution for this problem?
I am writing my API with Java and Jersey, so if there is any special solution for Jersey, let me know.
I can see two solutions to achieve that:
Using a special query parameter containing the expression when executing a GET method. It's the way OData does with its $filter parameter (see this link: https://msdn.microsoft.com/fr-fr/library/gg309461.aspx#BKMK_filter). Here is a sample:
/AccountSet?$filter=AccountCategoryCode/Value eq 2 or AccountRatingCode/Value eq 1
Parse.com also uses such approach with its where parameter but the query is described using a JSON structure (see this link: https://parse.com/docs/rest/guide#queries). Here is a sample:
curl -X GET \
-H "X-Parse-Application-Id: ${APPLICATION_ID}" \
-H "X-Parse-REST-API-Key: ${REST_API_KEY}" \
-G \
--data-urlencode 'where={"score":{"$gte":1000,"$lte":3000}}' \
https://api.parse.com/1/classes/GameScore
If it's something too difficult to describe, you could also use a POST method and specify the query in the request payload. ElasticSearch uses such approach for its query support (see this link: https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html). Here is a sample:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search?routing=kimchy' -d '{
"query": {
"bool" : {
"must" : {
"query_string" : {
"query" : "some query string here"
}
},
"filter" : {
"term" : { "user" : "kimchy" }
}
}
}
}
'
Hope it helps you,
Thierry
OK so here it is
You could add + or - to include or exclude , and an inclusive filter keyword for AND and OR
For excluding
GET /users?name=-David,-Lennart
For including
GET /users?name=+Bossee
For OR
GET /users?name=+Bossee&age=22&inclusive=false
For AND
GET /users?name=+Bossee&age=22&inclusive=true
In this way the APIs are very intuitive, very readable also does the work you want it to do.
EDIT - very very difficult question , however I would do it this way
GET /users?name=+Bossee&age=22&place=NewYork&inclusive=false,true
Which means the first relation is not inclusive - or in other words it is OR
second relation is inclusive - or in other words it is AND
The solution is with the consideration that evaluation is from left to right.
Hey it seems impossible if you go for queryparams...
If it is the case to have advanced expressions go for PathParams so you will be having regular expressions to filter.
But to allow only a particular name="Bosse" you need to write a stringent regex.
Instead of taking a REST end point only for condition sake, allow any name value and then you need to write the logic to check manually with in the program.

Categories