I am new to Xtext and want to use it to generate code for Drools rules. My problem is that I don't know how to write the grammar so that $order can appear in front of Order(). I would really appreciate it if someone could show me how to handle this example.
This is what I have tried so far:
Model:
declarations+=Declaration*;
Declaration:
Rule;
State:
name=ID
;
Rule:
'rule' ruleDescription=STRING
'#specification'specificationDescription=STRING
'ruleflow-group' ruleflowDescription=STRING
'when' when=[State|QualifiedName]
'then' then=[State|QualifiedName];
QualifiedName: ID ('.' ID)*;
DolarSign: ('$' ID)*;
And here is the code for the rule:
rule "apply 10% discount to all items over US$ 100,00 in an order"
#specification "101"
ruleflow-group "All"
when
$order : Order(appliedBefore == null)
Order($name : /customer/name) from $order
$item : OrderItem( value > 100 ) from $order.items
then
System.out.println("10% applied" + $name);
end
You should be able to simply use:
'when $' when=[State|QualifiedName]
I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.
The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add unknowns to the lexer that will match multiples of these...
unknowns : Unknown+ ;
...
Unknown : . ;
An edit was made to that post to cover the case where you are using only a lexer and no parser. If you are using a parser, you do not need to override the nextToken method, because the error can be handled much more cleanly in the parser: as far as the lexer is concerned, unknowns are just another token type. The lexer passes them to the parser, which can then handle the errors.
When using a parser, I'd normally recognize all tokens individually in the lexer and then emit the errors in the parser, grouped or not. The reason is that all error handling is then done in one place instead of being split between the lexer and the parser. It also makes the lexer simpler to write and test, since it must recognize all text and never fail on any UTF-8 input. Some people would likely do it differently, but this has worked for me with hand-written lexers in C. The parser is in charge of determining what is actually valid and how to report errors. Another benefit is that the lexer is fairly generic and can be reused.
For lexer only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method and goes into some detail on how to do that:
@Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.
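The gluing step itself can be sketched independently of ANTLR. The following Python sketch is not the ANTLR API: the Token tuple, the type names, and the merge_unknown helper are hypothetical and exist only for illustration. It assumes each token carries a type, its text, and a start offset, and it merges runs of directly adjacent UNKNOWN tokens into one bigger token.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    type: str   # e.g. "LCURL" or "UNKNOWN" (hypothetical type names)
    text: str
    start: int  # offset of the first character in the input


def merge_unknown(tokens: Iterable[Token]) -> Iterator[Token]:
    """Yield tokens, gluing directly adjacent UNKNOWN tokens into one."""
    pending = None  # the UNKNOWN run collected so far, if any
    for tok in tokens:
        if tok.type == "UNKNOWN":
            if pending is not None and pending.start + len(pending.text) == tok.start:
                # Directly adjacent to the current run: extend it.
                pending = pending._replace(text=pending.text + tok.text)
                continue
            if pending is not None:
                yield pending
            pending = tok
        else:
            if pending is not None:
                yield pending
                pending = None
            yield tok
    if pending is not None:
        yield pending


# One UNKNOWN token per character, as a single-character catch-all rule would produce.
raw = [Token("LCURL", "{", 0)]
raw += [Token("UNKNOWN", c, 1 + i) for i, c in enumerate("abc")]
raw += [Token("COLON", ":", 4)]
merged = list(merge_unknown(raw))
print([(t.type, t.text) for t in merged])
```

With this approach an error reporter sees one UNKNOWN token per run of bad characters instead of one per character.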
What you could do is use lexer modes. For this you'd have to split the grammar into a parser grammar and a lexer grammar. Let's start with the lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']+ ->type(NON_VALID_STRING);
Basically, I have added some tokens used in the parser part (like LCURL, COL, COMA, etc.) and introduced the NON_VALID_STRING token, which matches the first character that is not matched by any of the rules above it. Once this token is detected, I switch the lexer to MODE_ERR. In this mode I go back to the default mode once : is detected (this can be changed and maybe refined, but it serves the purpose here :) ); everything else becomes MY_ERR_TOKEN, to which I assign the NON_VALID_STRING token type. Here is what ANTLRWorks says when I run the interpret lexer option with your input:
So s is NON_VALID_STRING type and so is everything else until :. So, same type but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
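The mode switching described above can be imitated in a few lines of Python. This is only a simplified sketch, not generated ANTLR code: the RULES table and the tokenize function are made up for illustration, the rule set is cut down to what the sample input needs, and the WS1 rule of the error mode is left out.

```python
import re

# Rules per mode: (token type, regex, action). The names mirror the grammar,
# but the table itself is a hypothetical stand-in for the generated lexer.
RULES = {
    "DEFAULT": [
        ("LCURL", r"\{", None),
        ("RCURL", r"\}", None),
        ("COL", r":", None),
        ("NUMBER", r"-?\d+", None),
        ("WS", r"[ \t\r\n]+", "skip"),
        ("NON_VALID_STRING", r".", "push"),    # catch-all, enters the error mode
    ],
    "MODE_ERR": [
        ("COL1", r":", "pop"),                 # back to the previous mode
        ("NON_VALID_STRING", r"[^:]+", None),  # everything up to the next ':'
    ],
}


def tokenize(text):
    modes = ["DEFAULT"]  # mode stack; the last entry is the active mode
    pos, out = 0, []
    while pos < len(text):
        best = None
        for ttype, pattern, action in RULES[modes[-1]]:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (ttype, m.end(), action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        ttype, end, action = best
        if action != "skip":
            out.append((ttype, text[pos:end]))
        if action == "push":
            modes.append("MODE_ERR")
        elif action == "pop":
            modes.pop()
        pos = end
    return out


tokens = tokenize("{\n  sometext-without-quotes : 42\n}")
print(tokens)
```

In this sketch the unquoted key comes out as two NON_VALID_STRING tokens, the single character that triggered the mode switch and the rest up to the colon, instead of one token per character.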
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
If you run the test rig (I do it with ANTLRWorks), you'll get a single error.
Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)
In a grammar I would like to support text values without string delimiters.
The idea is to define things like
a = xxx;
instead of
a ="xxx";
to simplify typing. Apart from that, there should be variable definitions
and other kinds of constructs as well.
As a first approach I experimented with this grammar:
grammar SpaceNoSpace;
prog: stat+;
stat:
'somethingelse' ';'
| typed description* content
;
typed:
'something' '-'
| 'anotherthing' '-'
;
description:
'someSortOfDetails' COLON ID HASH
| 'otherSortOfDetails' COLON ID HASH
;
content:
contenttext ';'
;
contenttext:
(~';')*
;
COLON: ':' ;
HASH: '#';
SEMI: ';';
SPACE: ' ';
ID: [a-zA-Z][a-zA-Z0-9]+;
WS : [ \t\n\r]+ -> channel(HIDDEN);
ANY_CHAR : . ;
This works fine for input files like this:
something-someSortOfDetails: aVariableName#
this is the content of this;
anotherthing-someSortOfDetails: aVariableName#
here spaces are accepted as much as you like;
somethingelse;
But modifying the last line to
somethingelse ;
leads to a syntax error:
line 7:15 extraneous input ' ' expecting ';'
This probably reveals that the lexer rule
WS : [ \t\n\r]+ -> channel(HIDDEN);
is not applied, (but the SPACE rule???).
Otherwise, if I delete the SPACE lexer-rule, the space
in "somethingelse ;" is ignored (by lexer-rule WS), so that the parser rule
stat : somethingelse as a consequence is detected correctly.
But as a consequence of the deleted SPACE rule, runs of spaces in the content text are reduced to single spaces,
so "this    here" will be reduced to "this here".
This is not a big problem, but nevertheless it is an
interesting question:
is it possible to implement context-sensitive WS or SPACE
lexer rules:
within the content parser-rule any space should be preserved,
in any other rule spaces should be ignored.
Is it possible to define such context-sensitive lexer-rule behavior in ANTLR4?
Have you considered Lexer Modes? The section with mode(), pushMode(), popMode is probably interesting for you.
Yet I think that lexer modes are more a problem than a solution. Their purpose is to use (parser) context in the lexer. Consequently one should discard the paradigm of separating lexer and parser - and use a PEG-Parser instead.
Since the SPACE rule is defined before the WS rule and both match a single space equally well, the lexer returns a SPACE token to the parser. The ' ' is not being placed on the hidden channel.
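The tie-breaking behind this can be shown with a small Python sketch. It is not ANTLR code; first_token and the rule lists are hypothetical and only mimic the two principles at work here: the longest match wins, and on a tie the rule defined first wins.

```python
import re


def first_token(rules, text):
    """Return (rule name, lexeme) using longest match, earliest rule on ties."""
    best_name, best_len = None, 0
    for name, pattern in rules:  # rules are tried in definition order
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best,
        # so on equal length the earlier rule is kept.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[:best_len]


with_space = [("SEMI", r";"), ("SPACE", r" "), ("WS", r"[ \t\n\r]+")]
without_space = [("SEMI", r";"), ("WS", r"[ \t\n\r]+")]

print(first_token(with_space, " ;"))     # one space: SPACE and WS tie, SPACE is first
print(first_token(with_space, "  ;"))    # two spaces: WS matches more and wins
print(first_token(without_space, " ;"))  # no SPACE rule: WS handles the space
```

This also suggests why only a single space before the semicolon triggers the error: a longer run of whitespace is claimed by WS and hidden from the parser.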
I am using following excerpt in the grammar for my DSL:
SelectDml:
'select' columnList+=FieldColumn (',' columns+=FieldColumn)* from=FromClause;
FromClause:
'from' value=ID (alias=ID)?;
FieldColumn hidden():
fieldName=ID ('.' ID)?;
If I parse the following line of my DSL, there is one FieldColumn in the column list, which is absolutely fine. But the FieldColumn has the fieldName a and not the expected value a.col.
select a.col from a
Is there a problem with my grammar? Something missing?
Per this rule
FieldColumn hidden():
fieldName=ID ('.' ID)?;
the first ID value is assigned to fieldName. Any further ID is matched but not assigned to a feature, so it is not stored in the model.
I am currently working on a java web server project, that requires the use of Natural Language processing, specifically Named Entity Recognition (NER).
I was using OpenNLP for java, since it was easy to add custom training data. It works perfectly.
However, I also need to be able to extract entities inside of entities (nested named entity recognition). I tried doing this in OpenNLP, but I got parsing errors. So my guess is that OpenNLP sadly does not support nested entities.
Here is an example of what I need to parse:
Remind me to [START:reminder] give some presents to [START:contact] John [END] and [START:contact] Charlie [END][END].
If this cannot be achieved with OpenNLP, is there any other Java NLP library that could do this? If there are no Java libraries at all, are there any NLP libraries in any other language that can do this?
Please help. Thanks!
The short answer is:
This cannot be achieved using the OpenNLP NER, which is suitable only for contiguous (non-nested) entities because it uses a BIO tagging scheme.
I don't know of any library in any language capable of doing this.
I think you are stretching the concept of an entity too far; it is usually associated with persons, places, organizations, gene names, etc.,
but not with the identification of complex structures within text.
For that purpose you need to think of a more elaborate solution that takes into account the grammatical structure of the sentence, which can be obtained using a parser like the one in OpenNLP, and maybe combine this with the output of the NER process.
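To see why a flat BIO scheme cannot express nesting, here is a minimal Python sketch. It assumes a plain one-label-per-token encoding as described above and is not OpenNLP's internal representation; the to_bio helper and the token offsets are made up for illustration.

```python
def to_bio(tokens, spans):
    """Encode (start, end, label) spans as one BIO label per token.

    `end` is exclusive. Raises ValueError if two spans claim the same token,
    which is exactly what happens with nested entities.
    """
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if labels[i] != "O":
                raise ValueError(
                    f"token {i} ({tokens[i]!r}) is already labelled {labels[i]}"
                )
            labels[i] = ("B-" if i == start else "I-") + label
    return labels


tokens = "Remind me to give some presents to John and Charlie".split()
contacts = [(7, 8, "contact"), (9, 10, "contact")]
print(to_bio(tokens, contacts))

# Adding the enclosing reminder span makes "John" need two labels at once.
nested = [(3, 10, "reminder")] + contacts
try:
    to_bio(tokens, nested)
    nested_ok = True
except ValueError as err:
    nested_ok = False
    print("cannot encode:", err)
```

Each token has room for exactly one label, so the inner contact spans and the outer reminder span cannot be encoded in the same sequence.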
For Named Entity Recognition (Java based) I use the following:
Apache UIMA
ClearTK
https://github.com/merishav/cleartk-tutorials
You can train models for your use case. I have already trained NER models for persons, places, dates of birth, and professions.
ClearTK gives you a wrapper on MalletCRFClassifier.
Use this Python 3 source code: https://gist.github.com/ttpro1995/cd8c60cfc72416a02713bb93dff9ae6f
It creates multiple un-nested versions of the nested data for you.
For the input sentence below (the input must be tokenized first, so that there are spaces between the tags and the words around them):
Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> .
It outputs multiple sentences, one per nesting level:
Remind me to give some presents to John and Charlie .
Remind me to <START:reminder> give some presents to John and Charlie <END> .
Remind me to give some presents to <START:contact> John <END> and <START:contact> Charlie <END> .
Full source code here for quick copy-paste
import sys

END_TAG = 0
START_TAG = 1
NOT_TAG = -1


def detect_tag(in_token):
    """
    Detect whether a token is a tag.
    :param in_token: a single whitespace-delimited token
    :return: START_TAG, END_TAG or NOT_TAG
    """
    if "<START:" in in_token:
        return START_TAG
    elif "<END>" == in_token:
        return END_TAG
    return NOT_TAG


def remove_nest_tag(in_str):
    """
    Split a sentence with nested tags into one sentence per nesting level.
    Example input (Vietnamese, a LOCATION nested inside an ORGANIZATION):
    với <START:ORGANIZATION> Sở Cảnh sát Phòng cháy , chữa cháy ( PCCC ) và cứu nạn , cứu hộ <START:LOCATION> Hà Nội <END> <END>
    :param in_str: a tokenized sentence, possibly with nested tags
    :return: list of sentences; entry n keeps only the tags at nesting level n
    """
    state = 0  # current nesting depth
    tag_dict = dict()  # token index -> (index, nesting level, token)
    sentence_token = in_str.split()
    # Record the nesting level of every tag token.
    max_nest = 0
    for index, token in enumerate(sentence_token):
        tag = detect_tag(token)
        if tag == START_TAG:
            state += 1
            if max_nest < state:
                max_nest = state
            tag_dict[index] = (index, state, token)
        elif tag == END_TAG:
            tag_dict[index] = (index, state, token)
            state -= 1
    # Generate one sentence per nesting level (level 0 has no tags at all).
    generate_sentences = []
    for state in range(max_nest + 1):
        generate_sentence_token = []
        for index, token in enumerate(sentence_token):
            if detect_tag(token) != NOT_TAG:  # is a tag
                token_info = tag_dict[index]
                if token_info[1] == state:
                    generate_sentence_token.append(token)
            else:  # not a tag
                generate_sentence_token.append(token)
        sentence = ' '.join(generate_sentence_token)
        generate_sentences.append(sentence)
    return generate_sentences


def test():
    tstr2 = "Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> ."
    result = remove_nest_tag(tstr2)
    print("-----")
    for sentence in result:
        print(sentence)


if __name__ == "__main__":
    # Un-nest a dataset for the OpenNLP name finder.
    # test()
    if len(sys.argv) > 1:
        inpath = sys.argv[1]
        # The context manager closes both files, even on errors.
        with open(inpath, 'r') as infile, open(inpath + ".out", 'w') as outfile:
            for line in infile:
                sentences = remove_nest_tag(line)
                for sentence in sentences:
                    outfile.write(sentence + "\n")
    else:
        print("usage: python unnest_data.py input.txt")
I am developing my own DSL in XText.
I want to do something like this:
1 AND (2 OR (3 OR 4))
Here is my current .xtext file:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(greetings+=CONDITION_LEVEL)
;
terminal NUMBER :
('1'..'9') ('0'..'9')*
;
AND:
' AND '
;
OR:
' OR '
;
OPERATOR :
AND | OR
;
CONDITION_LEVEL:
('('* NUMBER (=>')')* OPERATOR)+ NUMBER ')'*
;
The problem is that the DSL should allow an unlimited number of nested brackets, but report an error when the programmer does not close every opened bracket.
Example:
1 AND (2 OR (3 OR 4)
One bracket is missing, so this should produce an error.
I don't know how to achieve this in XText. Can anybody help?
Thanks for helping.
Try this:
CONDITION_LEVEL
: ATOM ((AND | OR) ATOM)*
;
ATOM
: NUMBER
| '(' CONDITION_LEVEL ')'
;
Note that I have no experience with XText (so I did not test this), but this does work with ANTLR, on which XText is built (or perhaps it only uses ANTLR...).
Also, you probably don't want to surround your operator tokens with spaces, but rather put whitespace on a hidden channel:
grammar org.xtext.example.mydsl.MyDsl hidden(SPACE)
...
terminal SPACE : (' '|'\t'|'\r'|'\n')+;
...
Otherwise source like this would fail:
1 AND(2 OR 3)
For details, see Hidden Terminal Symbols from the XText user guide.
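The recursive structure of the two rules above can also be tried out without XText. The following Python sketch is a hand-written recursive-descent parser for the same shape of grammar, with whitespace ignored between tokens. It is an illustration only: the Parser class, the tuple-based tree, and the simplified NUMBER pattern (any run of digits) are assumptions, not what XText generates.

```python
import re

TOKEN = re.compile(r"\s*(\d+|AND|OR|\(|\))")


def tokenize(src):
    """Split the source into NUMBER, AND, OR and bracket tokens, ignoring whitespace."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, value):
        if self.peek() != value:
            raise SyntaxError(f"expected {value!r} but found {self.peek()!r}")
        self.pos += 1

    def condition_level(self):
        # CONDITION_LEVEL : ATOM ((AND | OR) ATOM)*
        node = self.atom()
        while self.peek() in ("AND", "OR"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.atom())
        return node

    def atom(self):
        # ATOM : NUMBER | '(' CONDITION_LEVEL ')'
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.condition_level()
            self.expect(")")  # an unclosed bracket is reported here
            return node
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"expected a number or '(' but found {tok!r}")


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.condition_level()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r} after end of expression")
    return tree


print(parse("1 AND (2 OR (3 OR 4))"))
try:
    parse("1 AND (2 OR (3 OR 4)")
except SyntaxError as err:
    print("error:", err)
```

Because every opening bracket is consumed by a call that insists on its closing bracket, a missing bracket turns into an error, and a surplus closing bracket is caught by the final check in parse.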
You need to make your syntax recursive. The basic idea is that a CONDITION_LEVEL can be, for example, two CONDITION_LEVELs separated by an OPERATOR.
I don't know the specifics of the XText syntax, but using a BNF-like syntax you could have:
CONDITION_LEVEL:
NUMBER
| '(' CONDITION_LEVEL OPERATOR CONDITION_LEVEL ')'
;