antlr4 lexer rule to not match a string - java

How do I write an ANTLR4 lexer rule that does not match a given string? For example, I have the following input:
CREATE TABLE person
( age integer,
id integer,
name character varying(30),
PRIMARY KEY ( id )
);
Here, I need to skip CREATE TABLE statements like the one above, which contain a "PRIMARY KEY" constraint.
Can we use regular expressions directly in lexer rules?

Write two parser rules: one for the statements you need and one for the statements you want to ignore.
goodSQL:
'CREATE' 'TABLE' Id '('
Id typeDef (',' Id typeDef)*
')'
;
badSQL:
'CREATE' 'TABLE' Id '('
Id typeDef (',' Id typeDef)* ','
'PRIMARY' 'KEY' keyDef
')'
;
This is something to start with. You still have to define Id, typeDef and keyDef. Note that -> skip is a lexer command and is only valid on lexer rules, so it cannot be attached to badSQL; let the parser match badSQL and ignore those subtrees in your listener or visitor instead.
Good Luck!
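If the goal is only to keep such statements away from the parser, a plain Java pre-filter is one possible alternative to grammar rules. The sketch below is an assumption, not something ANTLR provides: the class SqlPreFilter and its method are hypothetical names, and splitting on ';' is naive (it would break on semicolons inside string literals).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: drops CREATE TABLE statements that declare a PRIMARY KEY.
final class SqlPreFilter {
    // Case-insensitive; \s+ tolerates any whitespace between the keywords.
    private static final Pattern CREATE_TABLE =
            Pattern.compile("^\\s*CREATE\\s+TABLE\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern PRIMARY_KEY =
            Pattern.compile("\\bPRIMARY\\s+KEY\\b", Pattern.CASE_INSENSITIVE);

    static List<String> keepStatements(String script) {
        List<String> kept = new ArrayList<>();
        // Naive split: assumes ';' never occurs inside a string literal.
        for (String raw : script.split(";")) {
            String stmt = raw.trim();
            if (stmt.isEmpty()) {
                continue;
            }
            boolean bad = CREATE_TABLE.matcher(stmt).find()
                    && PRIMARY_KEY.matcher(stmt).find();
            if (!bad) {
                kept.add(stmt);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE a (id integer);\n"
                + "CREATE TABLE person (id integer, PRIMARY KEY ( id ));";
        System.out.println(keepStatements(script));
    }
}
```

The statements that survive can then be handed to the generated parser one at a time.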

Related

ANTLR: parse NULL as a function name and a parameter

I would like to be able to use 'NULL' as both a parameter (the value null) and a function name in my grammar. See this reduced example :
grammar test;
expr
: value # valueExpr
| FUNCTION_NAME '(' (expr (',' expr)* )* ')' # functionExpr
;
value
: INT
| 'NULL'
;
FUNCTION_NAME
: [a-zA-Z] [a-zA-Z0-9]*
;
INT: [0-9]+;
Now, trying to parse:
NULL( 1 )
Results in the parse tree failing because it parses NULL as a value, and not a function name.
Ideally, I should even be able to parse NULL(NULL).
Can you tell me if this is possible, and if yes, how to make this happen?
The 'NULL' string in your grammar defines an implicit token type. It is equivalent to adding something along these lines:
NULL: 'NULL';
at the start of the lexer rules. When several lexer rules match the same text, the first one wins, so in your grammar the implicit rule gets priority and you get a token of type 'NULL'.
A simple solution would be to introduce a parser rule for function names, something like this:
function_name: FUNCTION_NAME | 'NULL';
and then use that in your expr rule. But that seems brittle if NULL is not intended to be a keyword in your grammar. There are other solutions, but I'm not quite sure what to advise since I don't know how you expect your grammar to expand.
Another solution could be to rename FUNCTION_NAME to NAME, get rid of the 'NULL' token type, and rewrite expr like this:
expr
: value # valueExpr
| NAME '(' (expr (',' expr)* )* ')' # functionExpr
| {_input.LT(1).getText().equals("NULL")}? NAME # nullExpr
;
A semantic predicate takes care of the name comparison here.
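The tie-breaking behind this answer (the longest match wins, and on equal length the rule defined first wins) can be illustrated with a small hand-written simulation. This is a sketch in plain Java, not the ANTLR API; the class RulePriorityDemo and its methods are hypothetical names, and the rule order mirrors the question's grammar with the implicit 'NULL' token placed first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of lexer tie-breaking; this is not the ANTLR API.
final class RulePriorityDemo {
    // Returns the name of the rule that wins at the start of the input:
    // the longest match wins, and on equal length the earlier rule wins.
    static String winningRule(Map<String, Pattern> rulesInOrder, String input) {
        String winner = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly greater: a later rule cannot displace an earlier tie.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    // Rules in definition order, as in the question's grammar.
    static Map<String, Pattern> rules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("NULL", Pattern.compile("NULL")); // implicit token, defined first
        rules.put("FUNCTION_NAME", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        rules.put("INT", Pattern.compile("[0-9]+"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winningRule(rules(), "NULL( 1 )"));
        System.out.println(winningRule(rules(), "NULLIF(1)"));
    }
}
```

For the input NULL( 1 ) both NULL and FUNCTION_NAME match four characters, so the earlier rule is chosen; a longer identifier such as NULLIF is matched by FUNCTION_NAME because the longer match takes precedence.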

ANTLR4: context-sensitive spaces?

In a grammar I would like to implement texts without string delimiting xxx.
The idea is to define things like
a = xxx;
instead of
a ="xxx";
to simplify typewriting. Otherwise there should be variable definitions
and other kind of stuff as well.
As a first approach I experimented with this grammar:
grammar SpaceNoSpace;
prog: stat+;
stat:
'somethingelse' ';'
| typed description* content
;
typed:
'something' '-'
| 'anotherthing' '-'
;
description:
'someSortOfDetails' COLON ID HASH
| 'otherSortOfDetails' COLON ID HASH
;
content:
contenttext ';'
;
contenttext:
(~';')*
;
COLON: ':' ;
HASH: '#';
SEMI: ';';
SPACE: ' ';
ID: [a-zA-Z][a-zA-Z0-9]+;
WS : [ \t\n\r]+ -> channel(HIDDEN);
ANY_CHAR : . ;
This works fine for input files like this:
something-someSortOfDetails: aVariableName#
this is the content of this;
anotherthing-someSortOfDetails: aVariableName#
here spaces are accepted as much as you like;
somethingelse;
But modifying the last line to
somethingelse ;
leads to a syntax error:
line 7:15 extraneous input ' ' expecting ';'
This probably means that the lexer rule
WS : [ \t\n\r]+ -> channel(HIDDEN);
is not applied, and the SPACE rule is used instead.
If I delete the SPACE lexer rule, the space
in "somethingelse ;" is ignored (by the WS lexer rule), so the parser rule
stat: 'somethingelse' ';' matches correctly.
But without the SPACE rule, runs of spaces in the content text are reduced to single spaces,
so "this    here" is reduced to "this here".
This is not a big problem, but it raises an
interesting question:
is it possible to implement context-sensitive WS or SPACE
lexer rules, so that
within the content parser rule every space is preserved, and
in any other rule spaces are ignored?
Is it possible to define such context-sensitive lexer behavior in ANTLR4?
Have you considered lexer modes? The documentation section covering mode(), pushMode() and popMode() is probably of interest to you.
That said, I think lexer modes are more of a problem than a solution. Their purpose is to bring (parser) context into the lexer. If you need that, you are giving up the separation of lexer and parser anyway, and might as well use a PEG parser instead.
Since the SPACE rule comes before the WS rule, and both match a single space with the same length, the lexer returns a SPACE token to the parser. The ' ' is not being placed on the hidden channel.
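To make the idea behind lexer modes concrete, here is a hand-written two-mode tokenizer in plain Java. It is only a sketch of the concept and does not use the ANTLR API: ModeTokenizer is a hypothetical name, and it assumes (as in the example input) that '#' always introduces the content and ';' always ends it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-mode tokenizer; a sketch of the idea, not the ANTLR API.
final class ModeTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean contentMode = false;              // false = default mode
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (contentMode) {
                if (c == ';') {                   // ';' ends the content and leaves the mode
                    tokens.add("CONTENT(" + current.toString().trim() + ")");
                    tokens.add(";");
                    current.setLength(0);
                    contentMode = false;
                } else {
                    current.append(c);            // inner spaces are preserved
                }
            } else if (Character.isWhitespace(c)) {
                flush(tokens, current);           // whitespace only separates tokens
            } else if (c == ':' || c == '#' || c == '-' || c == ';') {
                flush(tokens, current);
                tokens.add(String.valueOf(c));
                if (c == '#') {
                    contentMode = true;           // '#' switches to content mode
                }
            } else {
                current.append(c);
            }
        }
        flush(tokens, current);
        return tokens;
    }

    // Emits the pending word, if any, as one token.
    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "something-someSortOfDetails: aVar#\nthis  is   content;\nsomethingelse ;"));
    }
}
```

In default mode the space in "somethingelse ;" is dropped, while the runs of spaces inside the content token survive; only the leading and trailing whitespace of the content is trimmed.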

Xtext parsing rule uncomplete

I am using following excerpt in the grammar for my DSL:
SelectDml:
'select' columnList+=FieldColumn (',' columns+=FieldColumn)* from=FromClause;
FromClause:
'from' value=ID (alias=ID)?;
FieldColumn hidden():
fieldName=ID ('.' ID)?;
If I parse the following line of my DSL, there is one FieldColumn in the column list, which is fine. But its fieldName is a and not the expected value a.col.
select a.col from a
Is there a problem with my grammar? Something missing?
Per this rule
FieldColumn hidden():
fieldName=ID ('.' ID)?;
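the first ID value is assigned to fieldName. Any further ID is matched but not assigned to any feature, so its text is discarded. To capture the whole dotted name, one common approach is to assign a data type rule (for example a hypothetical QualifiedName: ID ('.' ID)*;) to fieldName instead.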

escaping of double quote with text type column at postgresql and json parsing via jackson

I have the following table defined at postgresql:
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 id              | uuid                   | not null
 entity_snapshot | text                   |
Indexes:
    "pk_id" PRIMARY KEY, btree (id)
I would like to store the following JSON string:
[ "org.test.MyUniqueId", {: "uuid" : "f4b40050-9716-4346-bf84-8325fadd9876": } ]
For some testing, instead of doing this with Jackson, I am trying to write the SQL by hand, and here is my problem: I can't seem to get it right.
My current attempt is:
insert into my_table(id,entity_snapshot) values ('d11d944e-6e78-11e1-aae1-52540015fc3f','[ \"org.test.MyUniqueId\", {: \"uuid\" : \"f4b40050-9716-4346-bf84-8325fadd9876\": } ]');
This produces a record that looks like what I need when I select from the table, but when I try to use Jackson to parse it I get an error:
org.apache.commons.lang.SerializationException: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting double-quote to start field name
Needless to say, if the same record is inserted via my Java code, it parses fine, and to the human eye the two records look the same.
Can you tell me where I am wrong in my SQL insert statement?
You can use dollar-quoted string constants.
Details are in the Postgres documentation.
In your case the query should look like
insert into my_table(id,entity_snapshot) values ('d11d944e-6e78-11e1-aae1-52540015fc3f',$$[ "org.test.MyUniqueId", {: "uuid" : "f4b40050-9716-4346-bf84-8325fadd9876": } ]$$);
Double quotes inside single quotes do not need any escaping.
insert into my_table (id,entity_snapshot)
values
(
'd11d944e-6e78-11e1-aae1-52540015fc3f',
'["org.test.MyUniqueId", {: "uuid" : "f4b40050-9716-4346-bf84-8325fadd9876": }]'
);
will work just fine:
postgres=> create table my_table (id uuid, entity_snapshot text);
CREATE TABLE
Time: 34,936 ms
postgres=> insert into my_table (id,entity_snapshot)
postgres-> values
postgres-> (
postgres(> 'd11d944e-6e78-11e1-aae1-52540015fc3f',
postgres(> '["org.test.MyUniqueId", {: "uuid" : "f4b40050-9716-4346-bf84-8325fadd9876": }]'
postgres(> );
INSERT 0 1
Time: 18,255 ms
postgres=>
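When the statement is built from Java rather than typed by hand, the \" seen in Java source is only Java's own escaping and never reaches the database. The sketch below shows which characters a standard SQL string literal actually needs escaped. It is a minimal illustration under stated assumptions: sqlLiteral is a hypothetical helper, the sample JSON is made up, and it assumes standard_conforming_strings is on (so backslashes are ordinary characters). In real code, binding the value as a PreparedStatement parameter avoids manual quoting altogether.

```java
// Hypothetical helper showing which characters a standard SQL string literal needs escaped.
final class SqlLiteralDemo {
    // Assumes standard_conforming_strings = on, so backslashes are ordinary characters
    // and only the single quote has to be doubled.
    static String sqlLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        // The \" below is Java source escaping only; the string itself holds plain quotes.
        String json = "[ \"org.test.MyUniqueId\", \"it's\" ]";
        System.out.println(sqlLiteral(json));
    }
}
```

The resulting literal contains plain double quotes and no backslashes, which is what the text column should store.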

antlr how to define optional parts in any order

Suppose I need a grammar to parse the following template:
1. REPORT
2. BEGIN
3. QUERY
4. BEGIN
5. AGGREGATION: day
6. DIMENSION: department
7. END
8. END
Lines #5 and #6 are optional, and the order of the two lines doesn't matter. How can I specify this in my grammar file? Below is my solution (see line #12):
1. grammar PRL;
2. report
3. : REPORT
4. BEGIN
5. query
6. END
7. ;
8.
9. query
10. : QUERY
11. BEGIN
12. (aggregation_decl dimension_decl | dimension_decl aggregation_decl)?
13. END
14. ;
It works, but it looks ugly, and with more than 2 parts it is going to become unmanageable very quickly. Any advice?
Something like this? Generally you would enforce that only one of each item exists in a later processing step. Otherwise, as you see, the grammar gets unwieldy.
grammar PRL;
report
: REPORT
BEGIN
query
END
;
query
: QUERY
BEGIN
body_decl*
END
;
body_decl :
aggregation_decl
| dimension_decl;
As already mentioned by Adam, this is generally something done after the parser has created some sort of (abstract) parse tree. You simply collect all types of declarations like this:
grammar PRL;
report
: REPORT BEGIN query END
;
query
: QUERY BEGIN decl* END
;
decl
: NAME ':' NAME
;
REPORT : 'REPORT';
BEGIN : 'BEGIN';
END : 'END';
QUERY : 'QUERY';
NAME : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
and after that, check if there are duplicates in decl* in your AST.
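Such a post-parse check could look like the sketch below. It is plain Java and makes assumptions: DuplicateDeclCheck is a hypothetical name, and the list of keys stands in for the left-hand NAME of each decl node, which you would collect from your own tree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-parse check: the keys would be collected from the decl nodes.
final class DuplicateDeclCheck {
    // Returns every key that occurs more than once, in order of first repetition.
    static Set<String> duplicates(List<String> declKeys) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String key : declKeys) {
            // Set.add returns false when the key is already present.
            if (!seen.add(key)) {
                repeated.add(key);
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("AGGREGATION", "DIMENSION", "AGGREGATION");
        System.out.println(duplicates(keys));
    }
}
```

An empty result means every declaration type appeared at most once; otherwise the returned keys can be reported as errors.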
But if you really want to do this during parsing, you need to grab the left-hand side of each decl and add it to a Set, and when you stumble upon a duplicate, throw a predicate exception:
grammar PRL;
@parser::header {
import java.util.Set;
import java.util.HashSet;
}
report
: REPORT BEGIN query END
;
query
: QUERY BEGIN unique_decls END
;
unique_decls
@init{Set<String> set = new HashSet<String>();}
: (decl {set.add($decl.key)}?)*
;
decl returns[String key]
: k=NAME ':' NAME {$key = $k.text;}
;
REPORT : 'REPORT';
BEGIN : 'BEGIN';
END : 'END';
QUERY : 'QUERY';
NAME : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
The {set.add($decl.key)}? part, called a validating semantic predicate, will throw an exception when the code inside it (set.add($decl.key)) evaluates to false. In this case, it evaluates to false whenever the set already contains the key.
