I have an ANTLR grammar consisting of a number of sub-items. The high-level grammar looks something like this:
grammar MyGrammar;
import MyLocation, MyName, MyTime;
composite
    : myname (WS+ mylocation)? (WS+ mytime)?
    ;
I compile MyGrammar.g4 to obtain the required Java code and all is well when parsing items such as John at 4:30pm. However, I now have a situation where I need to parse times separately from the composite item, for example 4:30pm.
At the moment it appears that I have to duplicate code in MyGrammarListener and MyTimeListener to handle times. Is there a way to tell MyGrammarListener to hand off to MyTimeListener when it sees a mytime, so that I can avoid the duplication, or should I be handling this differently?
The answer to the first part of your question is no, you cannot do this (as of ANTLR 4.4 at least). See my answer here:
Is it possible to make Antlr4 generate lexer from base grammar lexer instead of gener Lexer?
I'm building a program with ANTLR that takes Java code entered by the user and spits out equivalent C# code. Up until now I've been assuming that the input will parse as a valid compilation unit on its own, e.g. something like
package foo;
class A { ... }
class B { ... }
class C { ... }
However, that isn't always the case. They might just enter code from the inside of a class:
public void method1() {
...
}
public void method2() {
...
}
Or the inside of a method:
System.out.print("hello ");
System.out.println("world!");
Or even just an expression:
context.getSystemService(Context.ACTIVITY_SERVICE)
If I try to parse such snippets by calling parser.compilationUnit(), it won't work correctly because most of the code is parsed as error nodes. I need to call the correct method depending on the nature of the code, such as parser.expression() or parser.blockStatements(). However, I don't want to ask the user to explicitly indicate this. What's the best way to infer what kind of code I'm parsing?
Rather than trying to guess a valid grammar rule entry point to parse a language snippet of unknown scope, progressively add scope wrappers to the source text until a valid top-level rule parse is achieved.
That is, with each successive parse failure, progressively wrap the source text in dummy package, class, and method statements.
Whichever wrapper was added to achieve a successful parse is then a known quantity, so the parse tree node representing the original source text can be easily identified.
You will probably want a fail-fast parser; construct the parser with the BailErrorStrategy to obtain this behavior.
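Here is a minimal sketch of that loop, assuming a JavaLexer/JavaParser pair generated from the usual Java grammar; the wrapper strings and class names are illustrative, not from the original answer:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class SnippetParser {
    // Try progressively heavier wrappers until one yields a clean parse.
    static ParserRuleContext parseSnippet(String source) {
        String[] wrappers = {
            "%s",                                         // already a compilation unit
            "class Dummy { %s }",                         // class members
            "class Dummy { void m() { %s } }",            // statements
            "class Dummy { Object m() { return %s ; } }"  // a single expression
        };
        for (String wrapper : wrappers) {
            String wrapped = wrapper.replace("%s", source);
            JavaLexer lexer = new JavaLexer(CharStreams.fromString(wrapped));
            JavaParser parser = new JavaParser(new CommonTokenStream(lexer));
            parser.setErrorHandler(new BailErrorStrategy()); // fail fast on the first error
            try {
                // Success: the wrapper is now a known quantity, so the subtree
                // holding the original snippet is easy to locate.
                return parser.compilationUnit();
            } catch (ParseCancellationException e) {
                // This wrapper did not fit; try the next one.
            }
        }
        return null; // no wrapper produced a valid parse
    }
}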
Our algorithm in Swiftify tries to select the most suitable parse rule from a defined rule set. This web service converts Objective-C code fragments to Swift, so you can immediately judge the quality of the conversion for yourself.
Algorithm
We use the open-sourced ObjectiveC grammar. The detailed steps of the algorithm look like this:
1. Parse the input Objective-C code fragment with each of the following rules:
translationUnit
implementationDefinitionList
interfaceDeclarationList
expression
compoundStatement
2. If the parse result of some rule contains no errors, return that rule at once.
3. Otherwise, select the rule whose parse error is nearest to the end of the input.
4. If two or more rules share the same nearest-to-the-end error location, select the rule with the fewest syntax errors.
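A condensed sketch of this selection loop, assuming the ObjectiveCLexer/ObjectiveCParser generated from that grammar; the error-tracking listener and the loop below are our illustration, not Swiftify's actual code:

import org.antlr.v4.runtime.*;
import java.util.*;
import java.util.function.Function;

public class RuleSelector {
    // Records how many syntax errors occurred and how close to the end
    // of the input the furthest one was.
    static class ErrorTracker extends BaseErrorListener {
        int errors = 0;
        int furthestOffset = -1;
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                int line, int charPositionInLine, String msg,
                                RecognitionException e) {
            errors++;
            if (offendingSymbol instanceof Token) {
                furthestOffset = Math.max(furthestOffset,
                        ((Token) offendingSymbol).getStartIndex());
            }
        }
    }

    static String selectRule(String source) {
        Map<String, Function<ObjectiveCParser, Object>> rules = new LinkedHashMap<>();
        rules.put("translationUnit", ObjectiveCParser::translationUnit);
        rules.put("implementationDefinitionList", ObjectiveCParser::implementationDefinitionList);
        rules.put("interfaceDeclarationList", ObjectiveCParser::interfaceDeclarationList);
        rules.put("expression", ObjectiveCParser::expression);
        rules.put("compoundStatement", ObjectiveCParser::compoundStatement);

        String best = null;
        int bestOffset = -1, bestErrors = Integer.MAX_VALUE;
        for (Map.Entry<String, Function<ObjectiveCParser, Object>> r : rules.entrySet()) {
            ObjectiveCLexer lexer = new ObjectiveCLexer(CharStreams.fromString(source));
            ObjectiveCParser parser = new ObjectiveCParser(new CommonTokenStream(lexer));
            ErrorTracker tracker = new ErrorTracker();
            parser.removeErrorListeners();
            parser.addErrorListener(tracker);
            r.getValue().apply(parser);
            if (tracker.errors == 0) return r.getKey();        // step 2: clean parse wins
            if (tracker.furthestOffset > bestOffset            // step 3: error nearest the end
                    || (tracker.furthestOffset == bestOffset
                        && tracker.errors < bestErrors)) {     // step 4: fewest errors
                best = r.getKey();
                bestOffset = tracker.furthestOffset;
                bestErrors = tracker.errors;
            }
        }
        return best;
    }
}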
Demo
Here are test code samples that parse with different rules:
translationUnit: http://swiftify.me/clye5z
implementationDefinitionList: http://swiftify.me/fpasza
interfaceDeclarationList: http://swiftify.me/13rv2j
compoundStatement: http://swiftify.me/4cpl9n
Our algorithm is able to detect the most suitable parse rule even for incorrect input:
compoundStatement with errors: http://swiftify.me/13rv2j/1
This is the code sample that I want to parse. I want getSaveablePaymentMethodsSmartList() as a token when I override the function in the parserBaseListener.java file created by ANTLR.
/** #suppress */
public any function getSaveablePaymentMethodsSmartList() {
    if(!structKeyExists(variables, "saveablePaymentMethodsSmartList")) {
        variables.saveablePaymentMethodsSmartList = getService("paymentService").getPaymentMethodSmartList();
        variables.saveablePaymentMethodsSmartList.addFilter('activeFlag', 1);
        variables.saveablePaymentMethodsSmartList.addFilter('allowSaveFlag', 1);
        variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodType', 'creditCard,giftCard,external,termPayment');
        if(len(setting('accountEligiblePaymentMethods'))) {
            variables.saveablePaymentMethodsSmartList.addInFilter('paymentMethodID', setting('accountEligiblePaymentMethods'));
        }
    }
    return variables.saveablePaymentMethodsSmartList;
}
I already have a grammar that parses function declarations, but I need a new rule that associates doctype comments with a function declaration and gives the function name as a separate token when a doctype comment is present.
The grammar looks like this:
functionDeclaration
: accessType? typeSpec? FUNCTION identifier
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement
;
You want grammar rules that:
return X if something "far away" in the source is an A,
return Y if something far away is a B (or ...).
In general, this is context dependency. It is not handled well by context-free grammars, which are what ANTLR approximates with its BNF rules. In essence, what you want to do is encode the history of what the parser saw long ago so that it influences what is produced now. Generally, that is hard.
The usual solution to something like this is to not address it in the grammar at all. Instead:
have the grammar rules produce an X regardless of what is far away,
build a tree as you parse (ANTLR does this for you); this captures not only X but everything about the parsed entity, including tokens for A that are far away
walk over the tree, interpreting a found X as Y if the tree contains the A (usually far away in the tree).
For your specific case of docstring-influences-function-name, you can probably get away with encoding that history in the grammar.
You need (IMHO, ugly) grammar rules that look something like this:
functionDeclaration: documented_function | undocumented_function ;
documented_function: docstring accessType? typeSpec? FUNCTION
documented_function_identifier rest_of_function ;
undocumented_function: accessType? typeSpec? FUNCTION
identifier rest_of_function ;
rest_of_function: // avoids duplication, not pretty
LEFTPAREN parameterList? RIGHTPAREN
functionAttribute* body=compoundStatement ;
You have to recognize the docstring as an explicit token the parser can "see", which means changing your lexer so that docstrings, normally skipped like other comments and whitespace, become real tokens. [This is the first ugly thing.] Then, having seen such a docstring, the lexer has to switch to a lexical mode that picks up identifier-like text and produces a documented_function_identifier token, and then switch back to normal mode. [This is the second ugly thing.] What you are doing is literally implementing a context dependency.
The reason you can accomplish this in spite of my remarks about context dependency is that A is not very far away; it is within few tokens of X.
So, you could do it this way. I would not; you are trying to make the parser do too much. Stick to the "usual solution". (You'll have a different problem: your A is a comment/whitespace, and probably isn't stored in the tree by ANTLR. You'll have to solve that; I'm not an ANTLR expert.)
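For that last caveat: in ANTLR 4, the standard trick is to send comments to the hidden channel and fish them back out during the tree walk with BufferedTokenStream.getHiddenTokensToLeft(). A rough sketch, assuming a generated CFParser/CFParserBaseListener and a lexer that puts comments on the hidden channel (all class names here are illustrative):

import org.antlr.v4.runtime.*;
import java.util.List;

public class DocCommentListener extends CFParserBaseListener {
    private final BufferedTokenStream tokens;

    public DocCommentListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterFunctionDeclaration(CFParser.FunctionDeclarationContext ctx) {
        // Ask the token stream for hidden tokens (comments, whitespace)
        // immediately to the left of the function declaration.
        List<Token> hidden = tokens.getHiddenTokensToLeft(ctx.getStart().getTokenIndex());
        if (hidden == null) return;
        for (Token t : hidden) {
            if (t.getText().startsWith("/**")) {
                // Documented function: its identifier can now be treated specially.
                System.out.println("documented function: " + ctx.identifier().getText());
            }
        }
    }
}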
A teacher of mine said that Java cannot be LL parsed.
I don't understand this and wonder whether it is true.
I searched for a grammar of Java 8 and found this: https://github.com/antlr/grammars-v4/blob/master/java8/Java8.g4
But even when I analyze that grammar, I can't see what the problem for LL parsing would be.
Does anyone know whether this is true, know of a scientific proof, or can explain why there is a construct in Java that cannot be LL parsed?
Thanks a lot guys and girls.
The Java Language Specification for Java 7 says it is not LL(1):
The grammar presented in this chapter is the basis for the
reference implementation. Note that it is not an LL(1) grammar, though
in many cases it minimizes the necessary look ahead.
Your grammar won't be LL(1) if you find either:
left recursion, or
a pair of alternatives (A | B) whose FIRST sets intersect, i.e. FIRST(A) shares one or more symbols with FIRST(B).
For example, a rule like expr : expr '+' term | term ; is left-recursive: a naive LL parser would recurse on expr forever without consuming any input.
I think it's due to the left recursion. LL parsers cannot handle left recursion, and the current Java grammar is specified using it in some places, at least for Java 7.
Of course, it is well known that one can construct an equivalent grammar by eliminating the left recursion, but as currently specified the Java grammar cannot be LL parsed.
I'm looking for a CFG parser implemented in Java. The thing is, I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not only one of them. I have already researched many NLP parsers, such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is difficult and poorly documented.
I found some parser generators such as ANTLR and JFlex, but I'm not sure they can handle ambiguity. So which parser generator or Java library is best for me?
Thanks in advance
You want a parser that uses the Earley algorithm. I haven't used either of these two libraries, but PEN and PEP appear to implement this algorithm in Java.
Another option is Bison, which implements GLR. GLR is an LR type parsing algorithm that supports ambiguous grammars. Bison also generates Java code, in addition to C++.
Take a look at the related discussion here. In my last comment in that discussion I explain that you can make any parser generator produce all of the parse trees by cloning the parse tree derived so far before making the derivation fail.
If your grammar is:
G -> ...
You would augment it like this:
G' -> G {semantic:deal-with-complete-parse-tree} <NOT-VALID-TOKEN>.
The parsing engine will ultimately fail on all derivations, but your program will either have:
Saved clones of all the trees.
Dealt with the semantics of each of the trees as they were found.
Both ANTLR and JavaCC did well when I was teaching. My preference was for ANTLR because of its BNF lexical analysis and its much less convoluted history, vision, and licensing.
I have spent the last 5 days working to understand how the unification algorithm works in Prolog.
Now I want to implement such an algorithm in Java.
I thought maybe the best way is to manipulate the string and decompose its parts using some data structure such as a stack.
To make it clear, suppose the user's input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already take it as one string and split it into two strings (Expression1 and Expression2).
Now, how can I know whether the next char(s) is a variable, a constant, etc.?
I can do it with nested ifs, but that seems like a poor solution.
I tried to use inheritance, but the problem remains: how can I know the type of the chars being read?
First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sound of it, you could benefit from learning a bit about parsing, which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator; e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with using StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination) it is worth implementing the parser both ways as a learning exercise.
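To make that concrete, here is what a tiny hand-written recursive descent parser for terms like a(X,c(d,X)) might look like; the Node class and the uppercase-initial-means-variable check (the standard Prolog convention) are our own illustration:

import java.util.ArrayList;
import java.util.List;

class Node {
    final String name;                      // functor, constant, or variable text
    final List<Node> children = new ArrayList<>();
    Node(String name) { this.name = name; }
    // Prolog convention: identifiers starting with an uppercase letter are variables.
    boolean isVariable() { return Character.isUpperCase(name.charAt(0)); }
}

class TermParser {
    private final String input;
    private int pos = 0;
    TermParser(String input) { this.input = input.replaceAll("\\s+", ""); }

    // term := name ( '(' term (',' term)* ')' )?
    Node parseTerm() {
        Node node = new Node(readName());
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++;                          // consume '('
            node.children.add(parseTerm());
            while (pos < input.length() && input.charAt(pos) == ',') {
                pos++;                      // consume ','
                node.children.add(parseTerm());
            }
            pos++;                          // consume ')'
        }
        return node;
    }

    private String readName() {
        int start = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        return input.substring(start, pos);
    }
}

Calling new TermParser("a(X,c(d,X))").parseTerm() yields a tree in which isVariable() answers the "is this a variable or a constant?" question in one place, instead of scattering nested ifs through the code.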
It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested if", but if you mean something like reading an expression and recursing when meeting each "(", then that's actually one of the right ways to do it; this is at heart what the code ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)
Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand-written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a class Term, with CompoundTerm, Variable, and Atom all being Terms. Polymorphism allows the arguments of a compound term to be any kind of Term.
Your unification algorithm then becomes a matter of matching the names of corresponding compound terms and recursively unifying each pair of their arguments.
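A minimal sketch of those classes and the recursive unification step, with the occurs check omitted for brevity (all names are illustrative):

import java.util.List;
import java.util.Map;

abstract class Term {}

class Atom extends Term {
    final String name;
    Atom(String name) { this.name = name; }
}

class Variable extends Term {
    final String name;
    Variable(String name) { this.name = name; }
}

class CompoundTerm extends Term {
    final String name;
    final List<Term> args;
    CompoundTerm(String name, List<Term> args) { this.name = name; this.args = args; }
}

class Unifier {
    // Attempts to unify t1 and t2, extending 'bindings' with any new
    // variable assignments. The occurs check is omitted for brevity.
    static boolean unify(Term t1, Term t2, Map<Variable, Term> bindings) {
        t1 = resolve(t1, bindings);
        t2 = resolve(t2, bindings);
        if (t1 == t2) return true;          // the same unbound variable
        if (t1 instanceof Variable) { bindings.put((Variable) t1, t2); return true; }
        if (t2 instanceof Variable) { bindings.put((Variable) t2, t1); return true; }
        if (t1 instanceof Atom && t2 instanceof Atom) {
            return ((Atom) t1).name.equals(((Atom) t2).name);
        }
        if (t1 instanceof CompoundTerm && t2 instanceof CompoundTerm) {
            CompoundTerm c1 = (CompoundTerm) t1, c2 = (CompoundTerm) t2;
            if (!c1.name.equals(c2.name) || c1.args.size() != c2.args.size()) return false;
            for (int i = 0; i < c1.args.size(); i++) {
                if (!unify(c1.args.get(i), c2.args.get(i), bindings)) return false;
            }
            return true;
        }
        return false;                       // atom vs. compound term: no match
    }

    // Follows a chain of variable bindings to the term at its end.
    static Term resolve(Term t, Map<Variable, Term> bindings) {
        while (t instanceof Variable && bindings.containsKey(t)) {
            t = bindings.get(t);
        }
        return t;
    }
}

Unifying the two terms from the earlier question, a(X,c(d,X)) and a(2,c(d,Y)), then leaves the bindings X = 2 and Y = 2, which is what Prolog answers.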