Java CFG parser that supports ambiguities - java

I'm looking for a CFG parser implemented in Java. I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not just one of them. I have already looked at many NLP parsers, such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is difficult and poorly documented.
I found some parser generators such as ANTLR or JFlex, but I'm not sure they can handle ambiguities. Which parser generator or Java library would be best for me?
Thanks in advance

You want a parser that uses the Earley algorithm. I haven't used either of these two libraries, but PEN and PEP appear to implement this algorithm in Java.

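Another option is Bison, which implements GLR, an LR-type parsing algorithm that supports ambiguous grammars. Bison can also generate Java code in addition to C and C++, but check the Bison manual before relying on this: as far as I know, its GLR skeletons target C and C++ only, and the Java skeleton is limited to deterministic LR parsing.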
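To make concrete what "all possible parse trees" means, here is a minimal sketch. It is neither Earley nor GLR: it swaps in a simpler CYK-style enumeration over sentence spans and assumes the grammar has already been converted to Chomsky normal form (every rule is A -> B C or A -> word). The class name AllParses and the toy grammar are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerates every parse tree of a sentence under a
// grammar in Chomsky normal form. Not an Earley or GLR implementation.
class AllParses {
    private final List<String[]> binary = new ArrayList<>();   // {A, B, C} means A -> B C
    private final List<String[]> lexical = new ArrayList<>();  // {A, word} means A -> word
    private final Map<String, List<String>> memo = new HashMap<>();

    void addBinary(String a, String b, String c) { binary.add(new String[] {a, b, c}); }
    void addLexical(String a, String word) { lexical.add(new String[] {a, word}); }

    // Returns every tree (as a bracketed string) deriving the whole sentence from start.
    List<String> parse(String start, String[] words) {
        memo.clear();
        return trees(start, words, 0, words.length);
    }

    // All trees rooted at symbol that cover exactly words[from, to).
    private List<String> trees(String symbol, String[] words, int from, int to) {
        String key = symbol + ":" + from + ":" + to;
        List<String> cached = memo.get(key);
        if (cached != null) return cached;
        List<String> result = new ArrayList<>();
        if (to - from == 1) {
            for (String[] r : lexical) {
                if (r[0].equals(symbol) && r[1].equals(words[from])) {
                    result.add("(" + symbol + " " + words[from] + ")");
                }
            }
        } else {
            for (String[] r : binary) {
                if (!r[0].equals(symbol)) continue;
                // Try every split point; each combination of subtrees is a distinct parse.
                for (int mid = from + 1; mid < to; mid++) {
                    for (String left : trees(r[1], words, from, mid)) {
                        for (String right : trees(r[2], words, mid, to)) {
                            result.add("(" + symbol + " " + left + " " + right + ")");
                        }
                    }
                }
            }
        }
        memo.put(key, result);
        return result;
    }

    public static void main(String[] args) {
        AllParses g = new AllParses();
        g.addBinary("S", "NP", "VP");
        g.addBinary("VP", "V", "NP");
        g.addBinary("VP", "VP", "PP");
        g.addBinary("NP", "NP", "PP");
        g.addBinary("NP", "Det", "N");
        g.addBinary("PP", "P", "NP");
        g.addLexical("NP", "I");
        g.addLexical("V", "saw");
        g.addLexical("Det", "the");
        g.addLexical("N", "man");
        g.addLexical("N", "telescope");
        g.addLexical("P", "with");
        for (String tree : g.parse("S", "I saw the man with the telescope".split(" "))) {
            System.out.println(tree);
        }
    }
}
```

For this toy grammar the sentence has two readings (the prepositional phrase attaches either to the verb phrase or to "the man"), and both are returned. The number of trees can grow exponentially with sentence length, which is why Earley and GLR implementations usually return a shared parse forest rather than a flat list.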

Take a look at the related discussion here. In my last comment in that discussion I explain that you can make any parser generator produce all of the parse trees by cloning the parse tree derived so far before making the derivation fail.
If your grammar is:
G -> ...
You would augment it like this:
G' -> G {semantic:deal-with-complete-parse-tree} <NOT-VALID-TOKEN>.
The parsing engine will ultimately fail on all derivations, but by then your program will have either:
saved clones of all the trees, or
dealt with the semantics of each tree as it was found.
Both ANTLR and JavaCC did well when I was teaching. My preference was for ANTLR because of its BNF lexical analysis and its much less convoluted history, vision, and licensing.

Related

Generate code from antlr tokens

We are currently trying to generate new code using ANTLR. We have a grammar file that can recognize pretty much everything. Our problem is that we want to recreate the code from the tokens we generate, in order to create this new file.
We have a .txt file with our tokens that looks like this:
[#0,0:6=' ',<75>,channel=1,1:0]
[#1,7:20='IDENTIFICATION',<6>,1:7]
[#2,21:21=' ',<75>,channel=1,1:21]
[#3,22:29='DIVISION',<4>,1:22]
[#4,30:30='.',<3>,1:30]
[#5,31:40='\n \t ',<75>,channel=1,1:31]
[#6,41:50='PROGRAM-ID',<16>,2:9]
[#7,51:51='.',<3>,2:19]
[#8,52:52=' ',<75>,channel=1,2:20]
[#9,53:59='testpro',<76>,2:21]
[#10,60:60='.',<3>,2:28]
[#11,61:70='\n \t ',<75>,channel=1,2:29]
[#12,71:76='AUTHOR',<31>,3:9]
[#13,77:77='.',<3>,3:15]
Or is there another way to create the old code using tokens?
Thanks in advance, Viktor
The most straightforward way to make the lexer output portable is to serialize the tokenized output of the lexer for transport and storage. You could equally serialize the entire parser-generated parse tree. In either case, you will be capturing the full text of the source input.
The lexer's token stream is built from a single token class, and the parse tree involves just a handful of standard classes, so both are simple to serialize. Consequently, the cost of serialization and deserialization is almost entirely a linear function of the size of the parsed source input.
Google Gson is a simple-to-use, relatively fast Java object serialization library.
If your parser is generating some intermediate representation of the parsed source input, you could directly transport the IR using a defined record serialization library like Google FlatBuffers to save & restore IR model instances.
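Because the dump in the question also lists the hidden-channel whitespace tokens, the original text can in principle be rebuilt by concatenating the token texts in order. The following is a minimal sketch of that idea, assuming the dump format shown in the question and that the token text only escapes \n, \r and \t. It does not use ANTLR or Gson, and the class name TokenDumpRebuilder is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rebuilds source text from a textual token dump.
class TokenDumpRebuilder {
    // Matches lines such as [#1,7:20='IDENTIFICATION',<6>,1:7]; the channel field is optional.
    // Lines that do not fit (for example an EOF token with type <-1>) are skipped.
    private static final Pattern TOKEN = Pattern.compile(
        "^\\[[#@]\\d+,\\d+:\\d+='(.*)',<\\d+>(?:,channel=\\d+)?,\\d+:\\d+\\]$",
        Pattern.DOTALL);

    static String rebuild(List<String> dumpLines) {
        StringBuilder source = new StringBuilder();
        for (String line : dumpLines) {
            Matcher m = TOKEN.matcher(line.trim());
            if (m.matches()) {
                source.append(unescape(m.group(1)));
            }
        }
        return source.toString();
    }

    // Assumption: only these three escapes appear in the dumped token text.
    private static String unescape(String text) {
        return text.replace("\\n", "\n").replace("\\r", "\r").replace("\\t", "\t");
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "[#1,7:20='IDENTIFICATION',<6>,1:7]",
            "[#2,21:21=' ',<75>,channel=1,1:21]",
            "[#3,22:29='DIVISION',<4>,1:22]",
            "[#4,30:30='.',<3>,1:30]");
        System.out.println(rebuild(dump));
    }
}
```

This only works if no token was dropped by the lexer (for example with a skip action); anything skipped never reaches the dump and cannot be recovered this way.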

Java LR or LL Parsing

A teacher of mine said that Java cannot be LL parsed.
I don't understand this and wonder whether it is true.
I searched for a grammar of Java 8 and found this: https://github.com/antlr/grammars-v4/blob/master/java8/Java8.g4
But even when I try to analyze the grammar, I don't see the problem for LL parsing.
Does anyone know whether this is true, know of a formal proof, or can explain why it should not be possible to find a grammar for Java that can be LL parsed?
Thanks a lot guys and girls.
The Java Language Specification for Java 7 says it is not LL(1):
The grammar presented in this chapter is the basis for the
reference implementation. Note that it is not an LL(1) grammar, though
in many cases it minimizes the necessary look ahead.
If a grammar contains either of the following, it won't be LL(1):
left recursion, or
alternatives (A|B) whose FIRST sets overlap, that is, FIRST(A) has one or more symbols also in FIRST(B).
I think it's due to left recursion. LL parsers cannot handle left recursion, and the Java grammar uses it in some places, at least as specified for Java 7.
Of course, it is well known that one can construct an equivalent grammar without left recursion, but as currently specified, the Java grammar cannot be LL parsed.
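The standard rewrite that removes direct left recursion turns A -> A a | b into A -> b A' and A' -> a A' | epsilon. Below is a minimal sketch of that transformation for a single nonterminal. The grammar representation (lists of symbol names) and the class name LeftRecursion are assumptions made for illustration; indirect left recursion and degenerate rules such as A -> A are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: removes direct left recursion from one nonterminal.
class LeftRecursion {
    // Rewrites  A  -> A a1 | ... | b1 | ...
    // into      A  -> b1 A' | ...
    //           A' -> a1 A' | ... | epsilon   (epsilon is the empty list)
    static Map<String, List<List<String>>> eliminate(String name, List<List<String>> alternatives) {
        String tail = name + "'";
        List<List<String>> base = new ArrayList<>();
        List<List<String>> rest = new ArrayList<>();
        for (List<String> alt : alternatives) {
            if (!alt.isEmpty() && alt.get(0).equals(name)) {
                // Left-recursive alternative: drop the leading A, append A'.
                List<String> a = new ArrayList<>(alt.subList(1, alt.size()));
                a.add(tail);
                rest.add(a);
            } else {
                // Non-recursive alternative: keep it and append A'.
                List<String> b = new ArrayList<>(alt);
                b.add(tail);
                base.add(b);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (rest.isEmpty()) {
            result.put(name, alternatives); // nothing to rewrite
            return result;
        }
        rest.add(new ArrayList<String>()); // the epsilon alternative
        result.put(name, base);
        result.put(tail, rest);
        return result;
    }

    public static void main(String[] args) {
        // E -> E + T | T
        List<List<String>> e = Arrays.asList(Arrays.asList("E", "+", "T"), Arrays.asList("T"));
        System.out.println(eliminate("E", e));
    }
}
```

Removing left recursion is necessary but not sufficient for LL(1): the rewritten grammar can still have alternatives with overlapping FIRST sets, which then need left factoring or more lookahead.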

Generating modular ANTLR Java

I have an ANTLR grammar consisting of a number of sub-items. The high-level grammar looks something like this:
grammar MyGrammar;
import MyLocation, MyName, MyTime;
composite
: myname (WS+ mylocation)? (WS+ mytime)?
I compile MyGrammar.g4 to obtain the required Java code and all is well when parsing items such as John at 4:30pm. However, I now have a situation where I need to parse times separately from the composite item, for example 4:30pm.
At the moment it appears that I have to duplicate code in MyGrammarListener and MyTimeListener to handle times. Is there any way instead in which I can tell MyGrammarListener to hand off to MyTimeListener when it sees a mytime so that I can avoid code duplication, or should I be handling this in a different way?
The answer to the first part of your question is no, you cannot do this (as of ANTLR 4.4 at least). See my answer here:
Is it possible to make Antlr4 generate lexer from base grammar lexer instead of gener Lexer?

Finding a "subject" from an array of part of speech tags

I know this is more of a grammar question, but how do you determine the "subject" of a sentence if you have an array of Penn Treebank tags like:
[WP][VBZ][DT][NN]
Is there any Java library that can take in such tags and determine which one is the subject? Or which ones?
The standard way to label syntactic units of a sentence, including the subject, is with a constituent parser. A constituent tree labels substrings of the input with syntactic labels. See http://en.wikipedia.org/wiki/Parse_tree for an example.
If such a structure looks like it would serve your needs, I'd recommend you grab an off-the-shelf parser and extract the relevant phrase(s) from the output.
Most parsers I'm aware of include part-of-speech (POS) tagging during parsing, but if you're confident in the POS labels you have, you could constrain the parser to use yours.
Note that constituent parsing can be quite expensive computationally. To my knowledge, all state-of-the-art constituent parsers run at 4-80 sentences per second, although you might be able to achieve higher speeds if you're willing to sacrifice some accuracy.
A couple of recommendations (more details at Simple Natural Language Processing Startup for Java):
The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ). State-of-the-art accuracy and reasonably fast (3-5 sentences per second).
The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, giving up a bit of accuracy (about 1.5 points in F1-score for those who care) but improving efficiency to around 50-80 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code. But we're glad to have people using BUBS in the real world. If you give it a try, please contact me with problems, questions, comments, etc.
The free, Java-based Stanford Dependency Parser (part of the Stanford Parser) does this trivially. It produces a dependency parse tree with dependencies such as nsubj(makes-8, Bell-1), telling you that Bell is the subject of makes. All you'd have to do is scan the list of dependencies the parser gives you, looking for nsubj or nsubjpass entries; those are the subjects of verbs.
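The scan described above can be sketched as follows. This is only an illustration that works on the printed form of the dependencies, such as nsubj(makes-8, Bell-1); it does not call the Stanford Parser API, and the class name SubjectFinder and the sample input are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: picks subjects out of dependencies printed as
// relation(governor-index, dependent-index).
class SubjectFinder {
    private static final Pattern DEP =
        Pattern.compile("^([\\w:]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)$");

    // Returns the dependent word of every nsubj or nsubjpass relation.
    static List<String> subjects(List<String> dependencies) {
        List<String> found = new ArrayList<>();
        for (String dep : dependencies) {
            Matcher m = DEP.matcher(dep.trim());
            if (!m.matches()) continue; // not in the expected printed form
            String relation = m.group(1);
            if (relation.equals("nsubj") || relation.equals("nsubjpass")) {
                found.add(m.group(4));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> deps = Arrays.asList(
            "nsubj(makes-8, Bell-1)",
            "det(products-11, the-9)",
            "dobj(makes-8, products-11)");
        System.out.println(subjects(deps));
    }
}
```

Note that the dependent is only the head word of the subject; to recover the full subject phrase you would also have to collect the words that depend on it.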
I have been successfully classifying subjects for Portuguese using OpenNLP. I created a shallow parser by slightly tweaking the OpenNLP Chunker component.
You can use the existing OpenNLP models for POS tagging and chunking, but you will need to train a new chunk model that takes the POS tags plus chunk tags as input to classify subjects.
The data format used to train the Chunker is based on CoNLL-2000:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
...
I then created a new corpus that looks like the following
He PRP+B-NP B-SUBJ
reckons VBZ+B-VP B-V
the DT+B-NP O
current JJ+I-NP O
account NN+I-NP O
deficit NN+I-NP O
will MD+B-VP O
narrow VB+I-VP O
If you have access to Penn Treebank you can create such data by looking for subject nodes in the corpus. Maybe you can start with this Perl script used to generate the data for the CoNLL-2000 Shared Task.
The evaluation results for Portuguese are 87.07 % for precision, 75.48 % for recall, and 80.86 % for F1.

implementing unification algorithm

I have spent the last 5 days trying to understand how the unification algorithm works in Prolog.
Now I want to implement this algorithm in Java.
I thought the best way might be to manipulate the string and decompose it into parts using a data structure such as a stack.
To make it clear, suppose the user input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already read it as one string and split it into two strings (Expression1 and Expression2).
Now, how can I tell whether the next character(s) are a variable, a constant, and so on?
I can do it with nested ifs, but that does not seem like a good solution to me.
I tried to use inheritance, but the problem remains: how can I know the type of the characters being read?
First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sound of it, you could benefit from learning a bit about parsing, which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator; e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with using StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination) it is worth implementing the parser both ways as a learning exercise.
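A hand-written recursive descent parser built on StringTokenizer could look roughly like the sketch below. It assumes a tiny grammar, term := NAME [ '(' term { ',' term } ')' ], and the Prolog convention that names starting with an uppercase letter or underscore are variables. The class and method names are made up for illustration, and operators, quoted atoms and lists are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch: recursive descent parser for terms like a(X,c(d,X)).
class TermParser {
    static final class Node {
        final String name;
        final List<Node> args = new ArrayList<>();
        Node(String name) { this.name = name; }
        boolean isVariable() {
            char c = name.charAt(0);
            return args.isEmpty() && (Character.isUpperCase(c) || c == '_');
        }
        @Override public String toString() {
            if (args.isEmpty()) return (isVariable() ? "Var:" : "Atom:") + name;
            return name + args;
        }
    }

    private final StringTokenizer tokens;
    private String lookahead; // current token, or null at end of input

    private TermParser(String input) {
        // Delimiters are returned as tokens so that '(' ',' ')' are visible to the parser.
        tokens = new StringTokenizer(input, "(), \t", true);
        advance();
    }

    // Moves to the next non-whitespace token.
    private void advance() {
        lookahead = null;
        while (tokens.hasMoreTokens()) {
            String t = tokens.nextToken();
            if (!t.trim().isEmpty()) { lookahead = t; return; }
        }
    }

    private void expect(String s) {
        if (!s.equals(lookahead)) {
            throw new IllegalArgumentException("expected " + s + " but found " + lookahead);
        }
        advance();
    }

    // term := NAME [ '(' term { ',' term } ')' ]
    private Node term() {
        if (lookahead == null || "(),".contains(lookahead)) {
            throw new IllegalArgumentException("expected a name but found " + lookahead);
        }
        Node node = new Node(lookahead);
        advance();
        if ("(".equals(lookahead)) {
            advance();
            node.args.add(term());
            while (",".equals(lookahead)) {
                advance();
                node.args.add(term());
            }
            expect(")");
        }
        return node;
    }

    static Node parse(String input) {
        TermParser p = new TermParser(input);
        Node n = p.term();
        if (p.lookahead != null) {
            throw new IllegalArgumentException("unexpected trailing input: " + p.lookahead);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(parse("a(X,c(d,X))"));
        System.out.println(parse("a(2,c(d,Y))"));
    }
}
```

The kind of each token (variable, constant, or compound term) falls out of the grammar position and the first character of the name, so no nested if chain over raw characters is needed.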
It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested", but if you mean that you're doing something like trying to read an expression, and recursing when meeting each "(", then that's actually one of the right ways to do it -- this is at heart what the code that ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)
Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a base class Term with subclasses CompoundTerm, Variable, and Atom. Polymorphism allows the arguments of a compound term to be any Term.
Your unification algorithm then amounts to comparing the names of the two compound terms and recursively unifying each pair of corresponding arguments.
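A minimal sketch of unification over such a class hierarchy is shown below. The class names follow the ones suggested above, but the fields, the binding map and the walk helper are assumptions made for illustration. Like standard Prolog, the sketch performs no occurs check, and on failure the map may be left with partial bindings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: unification over a small Term hierarchy, no occurs check.
class Unifier {
    abstract static class Term {}

    static final class Variable extends Term {
        final String name;
        Variable(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Atom extends Term {
        final String name;
        Atom(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class CompoundTerm extends Term {
        final String functor;
        final List<Term> args;
        CompoundTerm(String functor, Term... args) {
            this.functor = functor;
            this.args = Arrays.asList(args);
        }
        @Override public String toString() { return functor + args; }
    }

    // Follows variable bindings until reaching an unbound variable or a non-variable.
    static Term walk(Term t, Map<String, Term> bindings) {
        while (t instanceof Variable && bindings.containsKey(((Variable) t).name)) {
            t = bindings.get(((Variable) t).name);
        }
        return t;
    }

    // Returns true if a and b unify, extending bindings with the substitution found.
    static boolean unify(Term a, Term b, Map<String, Term> bindings) {
        a = walk(a, bindings);
        b = walk(b, bindings);
        if (a instanceof Variable) {
            boolean sameVar = b instanceof Variable
                && ((Variable) b).name.equals(((Variable) a).name);
            if (!sameVar) bindings.put(((Variable) a).name, b);
            return true;
        }
        if (b instanceof Variable) {
            bindings.put(((Variable) b).name, a);
            return true;
        }
        if (a instanceof Atom && b instanceof Atom) {
            return ((Atom) a).name.equals(((Atom) b).name);
        }
        if (a instanceof CompoundTerm && b instanceof CompoundTerm) {
            CompoundTerm x = (CompoundTerm) a;
            CompoundTerm y = (CompoundTerm) b;
            if (!x.functor.equals(y.functor) || x.args.size() != y.args.size()) return false;
            for (int i = 0; i < x.args.size(); i++) {
                if (!unify(x.args.get(i), y.args.get(i), bindings)) return false;
            }
            return true;
        }
        return false; // atom against compound term
    }

    public static void main(String[] args) {
        // a(X,c(d,X)) = a(2,c(d,Y))
        Term left = new CompoundTerm("a", new Variable("X"),
            new CompoundTerm("c", new Atom("d"), new Variable("X")));
        Term right = new CompoundTerm("a", new Atom("2"),
            new CompoundTerm("c", new Atom("d"), new Variable("Y")));
        Map<String, Term> bindings = new HashMap<>();
        System.out.println(unify(left, right, bindings) + " " + bindings);
    }
}
```

For the example from the question, both X and Y end up bound to 2. Note that variables need their own case: a variable unifies with any term by being bound to it, which is the step that goes beyond comparing names and recursing into arguments.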
