I have the following grammar:
GUID : GUIDBLOCK GUIDBLOCK '-' GUIDBLOCK '-' GUIDBLOCK '-' GUIDBLOCK '-'
       GUIDBLOCK GUIDBLOCK GUIDBLOCK;
SELF : 'self(' GUID ')';
fragment
GUIDBLOCK: [A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9];
atom : SELF # CurrentGuid ;
This is my visitor
@Override
public String visitCurrentGuid(CalcParser.CurrentGuidContext ctx) {
System.out.println("Guid is : " + ctx.getText());
System.out.println("Guid is : " + ctx.getChild(0));
return ctx.getText();
}
With input "self(5827389b-c8ab-4804-8194-e23fbdd1e370)"
There's only one child, which is the whole input itself: "self(5827389b-c8ab-4804-8194-e23fbdd1e370)".
How should I go about getting the guid part?
From my understanding, if my grammar structure is constructed as an AST, I should be able to print out the tree.
How should I update my grammar?
Thanks
Fragments don't appear at all in the AST - they're basically treated as if you'd written their contents directly in the lexer rule that uses them. So moving code into fragments makes your code easier to read, but does not affect the generated AST at all.
Lexer rules that are used by other lexer rules are also treated as fragments in that context. That is, if a lexer rule uses another lexer rule, it will still produce a single token with no nested structure - just as if you had used a fragment. The fact that it's a lexer rule and not a fragment only makes a difference when the pattern occurs on its own without being part of the larger pattern.
The key is that a lexer rule always produces a single token and tokens have no subtokens. They're the leaves of the AST. Nodes with children are generated from parser rules.
The only parser rule that you have is atom, and atom only references a single token, SELF. So the generated tree will consist of an atom node that contains as its only child a SELF token and, as previously stated, tokens are leaves, so that's the end of the tree.
What you probably want to do to get a useful tree is to make GUIDBLOCK a lexer rule (your only lexer rule, in fact) and turn everything else into parser rules. That'd also mean that you can get rid of atom (possibly renaming SELF to atom if you want).
Then you'll end up with a tree consisting of a self (or atom if you renamed it) node that contains as its children a 'self(' token, a guid node (to which you might want to assign a name for easy access) and a ')' token. The guid node in turn would contain a sequence of GUIDBLOCK and '-' tokens. You can also add blocks+= before every use of GUIDBLOCK to get a list that contains only the GUIDBLOCK tokens, without the dashes.
It might also make sense to turn 'self(' into two tokens (i.e. 'self' '(') - especially if you ever want to add a rule to ignore whitespace.
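A minimal sketch of that restructuring (untested; the rule names are just suggestions):

// GUIDBLOCK is now the only lexer rule; everything else is a parser rule.
GUIDBLOCK : [A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9] ;

atom : 'self' '(' guid ')' ;

// blocks+= collects just the GUIDBLOCK tokens (no dashes) into a list.
guid : blocks+=GUIDBLOCK blocks+=GUIDBLOCK '-' blocks+=GUIDBLOCK '-'
       blocks+=GUIDBLOCK '-' blocks+=GUIDBLOCK '-'
       blocks+=GUIDBLOCK blocks+=GUIDBLOCK blocks+=GUIDBLOCK ;

In a visitor you could then read the whole GUID via ctx.guid().getText(), or iterate over ctx.guid().blocks for the individual blocks.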
Related
I have started using ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ;, it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning of the input and tries to find the rule that matches the current input best
the best match is the longest match, i.e. appending the next input character to it would yield a string that no lexer rule matches
tokens are generated from matches:
if exactly one rule produces the maximum-length match, the corresponding token is pushed onto the token stream
if multiple rules produce the maximum-length match, the token defined first in the grammar is pushed onto the token stream
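A toy illustration of both rules (a hypothetical grammar, not taken from the question):

IF : 'if' ;
ID : ('a'..'z')+ ;

For input "iffy", only ID can match all four characters, so the longest-match rule makes it a single ID token rather than IF followed by ID. For input "if", both rules match two characters; the tie goes to IF because it is defined first.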
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE. So every token that you expect to be a TITLE will come out as a FILEPATH.
A few hints:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them in the right order (in your case this will be sufficient; see the sketch after this list).
if you need a parser-driven lexer you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
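For the grammar above, swapping the order of the two conflicting rules is enough (a sketch, untested):

grammar output;
test: FILEPATH NEWLINE TITLE ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;

"c:\test.txt" still lexes as FILEPATH because FILEPATH produces the longer match there, while the tie on "x" now goes to TITLE. The caveat: an all-letter path such as "readme" would now lex as TITLE as well.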
This was not directly the OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason was that I had placed the new keyword after my VARNAME lexer rule, which matched it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
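As a sketch (the rule names are illustrative, not from my grammar), the fix is just declaration order:

// Broken: for input "while", both rules produce a match of the same
// length, and VARNAME wins the tie because it is defined first.
VARNAME : [a-zA-Z_] [a-zA-Z0-9_]* ;
WHILE   : 'while' ;

// Fixed: the keyword is defined before VARNAME, so it wins the tie.
WHILE   : 'while' ;
VARNAME : [a-zA-Z_] [a-zA-Z0-9_]* ;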
I have a problem with an ANTLR4 grammar in Java.
I would like to have a lexer rule that is able to match all of the following inputs:
Only letters
Letters and numbers
Only numbers
My code looks like this:
parser rule:
new_string: NEW_STRING+;
lexer rules:
NEW_DIGIT: [0-9]+;
STRING_CHAR : ~[;\r\n"'];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ | STRING_CHAR+ NEW_DIGIT+);
I know there must be an obvious solution, but I have been trying to find one, and I can't seem to figure out a way.
Thank you in advance!
Since the first two lexer rules are not fragments, they can (and will) be matched on their own: when equally long sequences of input can be matched, the first defined lexer rule wins. So a pure run of digits is emitted as NEW_DIGIT rather than NEW_STRING, and a single ~[;\r\n"'] character is emitted as STRING_CHAR - NEW_STRING loses every one of those ties.
You need to:
make sure STRING_CHAR does not match digits
make NEW_DIGIT and STRING_CHAR fragments
check the repetition operators - almost everything in your lexer is allowed to repeat, which doesn't look right at first glance (but you need to adjust that according to your requirements, which we do not know)
Like this:
fragment NEW_DIGIT: [0-9];
fragment STRING_CHAR : ~[;\r\n"'0-9];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?);
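With those fixes the parser rule from the question can stay as it is. The comments below show what I would expect the lexer to produce (my expectation, not tested):

new_string : NEW_STRING+ ;
// "abc"    -> one NEW_STRING token (STRING_CHAR+)
// "123"    -> one NEW_STRING token (NEW_DIGIT+)
// "abc123" -> one NEW_STRING token (STRING_CHAR+ NEW_DIGIT+)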
I have a grammar that works fine when parsing in one pass (entire file).
Now I wish to break the parsing up into components and run the parser on sub-rules. I ran into an issue that I assume others parsing sub-rules will also see, with the following rule:
thing : LABEL? THING THINGDATA thingClause?
        //{System.out.println("G4 Lexer/parser thing encountered");}
      ;
...
thingClause : ',' ID ( ',' ID)?
            ;
When the above rule is parsed from a top-level start rule that parses to EOF, everything works fine. When it is parsed as a sub-rule (not parsing to EOF), the parser gets upset when there is no thingClause, as it is expecting to see EITHER a ',' character OR an EOF:
line 8:0 mismatched input '%' expecting {<EOF>, ','}
When I parse to EOF, the % gets correctly parsed into another "thing" component, because the top-level rule looks for:
toprule : thing+
        | endOfThingsTokens
        ;
And endOfThingsTokens occurs before EOF... so I expect this is why the top-level rule works.
For parsing the sub-rule, I want the ANTLR4 parser to accept or ignore the % token, say "OK, we aren't seeing a thingClause", and then reset the token stream so the next thing object can be parsed by a different instance of the parser.
In this specific case I could change the lexer to pass newlines to the parser, which I currently skip in the lexer grammar. That would require lots of other changes to accept newlines in the token stream which are currently not needed.
Essentially I need some way to give the rule an "end of record" token. But I was wondering whether there is some way to solve this with a semantic predicate.
something like:
thing : { if comma before %}? LABEL? THING THINGDATA thingClause?
      | LABEL? THING THINGDATA
      ;
...
thingClause : ',' ID ( ',' ID)?
            ;
The above predicate pseudo-code would hide the optional thingClause? when it cannot be satisfied, so that the parser would stop after parsing one "thing" without looking for a specific "end of thing" token (i.e. a newline).
If I solve this I will post the answer.
The parser will (effectively) look ahead in the token stream to determine whether the current rule can be satisfied. The corresponding tokens are then consumed. If any look-ahead tokens remain unconsumed, the parser looks for another rule against which to consume these and additional look-ahead tokens.
The thingClause? element, when not matched, will result in unconsumed tokens in the parser. Hence the error you are seeing.
The parser look-ahead is data dependent. Meaning that the evaluation of the elements of a rule can easily read into the parser more tokens than the current rule could possibly consume.
While a predicate could help, it will not make the problem deterministic. That is, even if the parser matches the non-predicated alt, it may well have read more tokens into the parser than can be consumed by that alt.
The only way to avoid this non-determinism would be to pre-inject <EOF> tokens into the token stream at the sub-rule boundaries.
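If you go the injection route, ANTLR's ListTokenSource can do the injecting for you: it emits an <EOF> token once its token list is exhausted. A minimal sketch, assuming you have already sliced the full token stream into one record's worth of tokens (MyParser and the slicing step are illustrative, not from the question):

import org.antlr.v4.runtime.*;
import java.util.List;

// Parse a single "thing" from a pre-sliced list of tokens.
// ListTokenSource supplies an <EOF> token after the last token in
// the list, giving the sub-rule parse a deterministic boundary.
ParserRuleContext parseOneThing(List<Token> recordTokens) {
    ListTokenSource source = new ListTokenSource(recordTokens);
    CommonTokenStream tokens = new CommonTokenStream(source);
    MyParser parser = new MyParser(tokens);
    return parser.thing();
}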
My code does not retrieve the entirety of element nodes that contain special characters.
For example, for this node:
<theaterName>P&G Greenbelt</theaterName>
It would only retrieve "P" due to the ampersand. I need to retrieve the entire string.
Here's my code:
public List<String> findTheaters() {
    // Clear theaters application global
    FilmhopperActivity.tData.clearTheaters();

    ArrayList<String> theaters = new ArrayList<String>();
    NodeList theaterNodes = doc.getElementsByTagName("theaterName");
    for (int i = 0; i < theaterNodes.getLength(); i++) {
        Node node = theaterNodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // Found theater, add to return array
            Element element = (Element) node;
            NodeList children = element.getChildNodes();
            String name = children.item(0).getNodeValue();
            theaters.add(name);

            // Logging
            android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);

            // Add theater to application global
            Theater t = new Theater(name);
            FilmhopperActivity.tData.addTheater(t);
        }
    }
    return theaters;
}
I tried adding code to extend the name string by concatenating additional children items, but it didn't work. I'd only get "P&".
...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {  // note: this bound stops one child short of the end
    name += children.item(j).getNodeValue();
}
Thanks for your time.
UPDATE:
Found a function called normalize() that you can call on Nodes; it combines all adjacent text child nodes, so children.item(0) then contains the text of all the children, including the ampersands!
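In code, the fix from the update looks roughly like this (a sketch):

Element element = (Element) node;
element.normalize();  // merges adjacent text nodes under this element
String name = element.getChildNodes().item(0).getNodeValue();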
The & is an escape character in XML. XML that looks like this:
<theaterName>P&G Greenbelt</theaterName>
should actually be rejected by the parser. Instead, it should look like this:
<theaterName>P&amp;G Greenbelt</theaterName>
There are a few such characters that must be escaped: &lt; (<), &gt; (>), &quot; (") and &apos; ('). There are also other ways to escape characters, such as via their Unicode value, as in &#8226; or &#x3039;.
For more information, the XML specification is fairly clear.
Now, the other thing it might be, depending on how your tree was constructed, is that the character is escaped properly, the sample you showed isn't what's actually in the file, and what you're seeing is just how the data is represented in the tree.
For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. This is because the SAX parser tries to return contiguous chunks of data, and when it gets to the escape character, it sends what it has, and starts a new chunk with the translated &-value. So you might need to combine consecutive text nodes in your tree to get the whole value.
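If your DOM implementation supports Level 3 (the org.w3c.dom API in Java 5+ does), getTextContent() will do that combining for you; a sketch as an alternative to a manual loop over the children:

// Returns the concatenated text of the element and all its
// descendants, including text produced by expanded entities.
String name = element.getTextContent();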
The file you are trying to read is not valid XML. No self-respecting XML parser will accept it.
I'm retrieving my XML dynamically from the web. What's the best way to replace all my escape characters after fetching the Document object?
You are taking the wrong approach. The correct approach is to inform the people responsible for creating that file that it is invalid, and request that they fix it. Simply writing hacks to (try to) fix broken XML is not in your (or other peoples') long term interest.
If you decide to ignore this advice, then one approach is to read the file into a String, use String.replaceAll(regex, replacement) with a suitable regex to turn these bogus "&" characters into proper character entities ("&amp;"), then pass the "fixed" XML string to the XML parser. You need to carefully design the regex so that it doesn't break valid character entities as an unwanted side effect. A second approach is to do the parsing and replacement by hand, using appropriate heuristics to distinguish the bogus "&" characters from well-formed character entities.
But this all costs you development and test time, and slows down your software. Worse, there is a significant risk that your code will be fragile as a result of your efforts to compensate for the bad input files. (And guess who will get the blame!)
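A sketch of the first approach; the negative lookahead is my guess at a "suitable regex" (it skips anything that already looks like an entity reference), and rawXml stands for the XML you fetched into a String, so test it carefully against your data:

// Escape bare ampersands, leaving existing entity references
// (&amp;, &#38;, &#x26;, named entities, ...) untouched.
String fixed = rawXml.replaceAll(
        "&(?![A-Za-z]+;|#[0-9]+;|#x[0-9A-Fa-f]+;)", "&amp;");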
You need to either encode it properly or wrap it in a CDATA section. I'd recommend the former.
The numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:
<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
I am using CocoR to generate a Java-like scanner/parser, and I'm having some trouble creating an EBNF expression to match a code block.
I'm assuming a code block is surrounded by two well-known tokens:
<& and &>
example:
public method(int a, int b) <&
    various code
&>
If I define a nonterminal symbol
codeblock = "<&" {ANY} "&>"
then, if the code inside the two symbols contains a '<' character, the generated compiler will not handle it and reports a syntax error.
Any hint?
Edit:
COMPILER JavaLike

CHARACTERS
  nonZeroDigit = "123456789".
  digit        = '0' + nonZeroDigit.
  letter       = 'A' .. 'Z' + 'a' .. 'z' + '_' + '$'.

TOKENS
  ident = letter { letter | digit }.

PRODUCTIONS
  JavaLike          = {ClassDeclaration}.
  ClassDeclaration  = "class" ident ["extends" ident] "{" {VarDeclaration} {MethodDeclaration} "}".
  MethodDeclaration = "public" Type ident "(" ParamList ")" CodeBlock.
  CodeBlock         = "<&" {ANY} "&>".
I have omitted some productions for the sake of simplicity.
This is my actual implementation of the grammar. The main bug is that it fails if the code in the block contains either of the symbols '>' or '&'.
Nick, late to the party here ...
A number of ways to do this:
Define tokens for <& and &> so the lexer knows about them.
You may be able to use a COMMENTS directive
COMMENTS FROM <& TO &> - quoted as CoCo expects.
Or hack NextToken() in your scanner.frame file. Do something like this (pseudo-code):
if (Peek() == CODE_START)
{
    while (NextToken() != CODE_END)
    {
        // eat tokens
    }
}
Or you can override the Read() method in the Buffer and eat the characters at the lowest level.
HTH
You can expand the ANY term to include <&, &>, and another nonterminal (call it ANY_WITHIN_BLOCK say).
Then you just use
ANY = "<&" | {ANY_WITHIN_BLOCK} | "&>"
codeblock = "<&" {ANY_WITHIN_BLOCK} "&>"
And then the meaning of {ANY} is unchanged if you really need it later.
Okay, I didn't know anything about CocoR and gave you a useless answer, so let's try again.
As I started to say later in the comments, I feel that the real issue is that your grammar might be too loose and not sufficiently well specified.
When I wrote the CFG for the one language I've tried to create, I ended up using a sort of "meet-in-the-middle" approach: I wrote the top-level structure AND the immediate low-level combinations of tokens first, and then worked to make them meet in the mid-level (at about the level of conditionals and control flow, I guess).
You said this language is a bit like Java, so let me just show you the first lines I would write as a first draft to describe its grammar (in pseudocode, sorry. Actually it's like yacc/bison. And here, I'm using your brackets instead of Java's):
/* High-level stuff */
program: classes
classes: main-class inner-classes
inner-classes: inner-classes inner-class
             | /* empty */
main-class: class-modifier "class" identifier class-block
inner-class: "class" identifier class-block
class-block: "<&" class-decls "&>"
class-decls: field-decl
           | method
method: method-signature method-block
method-block: "<&" statements "&>"
statements: statements statement
          | /* empty */
class-modifier: "public"
              | "private"
identifier: /* well, you know */
And at the same time as you do all that, figure out your immediate token combinations, like for example defining "number" as a float or an int and then creating rules for adding/subtracting/etc. them.
I don't know what your approach is so far, but you definitely want to make sure you carefully specify everything and use new rules when you want a specific structure. Don't get ridiculous with creating one-to-one rules, but never be afraid to create a new rule if it helps you organize your thoughts better.