Slow ANTLR4 generated Parser in Python, but fast in Java - java

I am trying to convert an ANTLR3 grammar to an ANTLR4 grammar, in order to use it with the antlr4-python2-runtime.
This grammar is a C/C++ fuzzy parser.
After converting it (basically removing tree operators and semantic/syntactic predicates), I generated the Python2 files using:
java -jar antlr4.5-complete.jar -Dlanguage=Python2 CPPGrammar.g4
And the code is generated without any error, so I import it in my python project (I'm using PyCharm) to make some tests:
import sys, time
from antlr4 import *
from parser.CPPGrammarLexer import CPPGrammarLexer
from parser.CPPGrammarParser import CPPGrammarParser
# Wall-clock time in milliseconds.
currenttimemillis = lambda: int(round(time.time() * 1000))
def is_string(value):
    return isinstance(value, str)
def parsecommandstringline(argv):
    # Expect exactly one argument: the path of the file to parse.
    if len(argv) != 2:
        raise IndexError("Invalid args size.")
    if is_string(argv[1]):
        return True
    else:
        raise TypeError("Argument must be str type.")
def doparsing(argv):
    if parsecommandstringline(argv):
        print("Arguments: OK - {0}".format(argv[1]))
        input_stream = FileStream(argv[1])
        lexer = CPPGrammarLexer(input_stream)
        stream = CommonTokenStream(lexer)
        parser = CPPGrammarParser(stream)
        print("*** Parser: START ***")
        start = currenttimemillis()
        # 'code' is the start rule of the grammar.
        tree = parser.code()
        print("*** Parser: END *** - {0} ms.".format(currenttimemillis()-start))
        return tree
def main(argv):
    tree = doparsing(argv)
if __name__ == '__main__':
    main(sys.argv)
The problem is that the parsing is very slow. With a file containing ~200 lines it takes more than 5 minutes to complete, while the parsing of the same file in antlrworks only takes 1-2 seconds.
Analyzing the antlrworks tree, I noticed that the expr rule and all of its descendants are called very often and I think that I need to simplify/change these rules to make the parser operate faster.
Is my assumption correct or did I make some mistake while converting the grammar? What can be done to make parsing as fast as on antlrworks?
UPDATE:
I exported the same grammar to Java and it only took 795 ms to complete the parsing. The problem seems more related to the Python implementation than to the grammar itself. Is there anything that can be done to speed up Python parsing?
I've read here that Python can be 20-30 times slower than Java, but in my case Python is ~400 times slower!
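One way to check the assumption about the expr rule, rather than guessing from the tree, is to profile the parse with the standard-library cProfile module. This is only a sketch: profile_call and busy are hypothetical names, and busy is a stand-in for the real work. In the script above the call would presumably be profile_call(parser.code).

```python
import cProfile
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, stats).

    stats is a pstats.Stats object sorted by cumulative time, so the
    functions (for a generated parser, the rule methods) that dominate
    the run time appear first.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    return result, stats


def busy(n):
    # Hypothetical stand-in for the real work, e.g. parser.code().
    return sum(i * i for i in range(n))


result, stats = profile_call(busy, 1000)
stats.print_stats(5)  # show the five most expensive entries
```

If most of the cumulative time sits in the runtime's prediction code rather than in the rule methods themselves, that would point at the runtime more than at the grammar.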

I confirm that the Python 2 and Python 3 runtimes have performance issues. With a few patches, I got a 10x speedup on the python3 runtime (~5 seconds down to ~400 ms).
https://github.com/antlr/antlr4/pull/1010

I faced a similar problem so I decided to bump this old post with a possible solution. My grammar ran instantly with the TestRig but was incredibly slow on Python 3.
In my case the fault was the non-greedy token that I was using to produce one line comments (double slash in C/C++, '%' in my case):
TKCOMM : '%' ~[\r\n]* -> skip ;
This is somewhat backed by this post from sharwell in this discussion here: https://github.com/antlr/antlr4/issues/658
When performance is a concern, avoid using non-greedy operators, especially in parser rules.
To test this scenario you may want to remove non-greedy rules/tokens from your grammar.
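Note that the TKCOMM rule as shown uses the negated set ~[\r\n], which consumes the rest of the line greedily and needs no non-greedy operator. As a rough analogy only (this is Python's re module, not ANTLR, and it says nothing about ANTLR's performance), the sketch below shows a greedy negated class and a non-greedy .*? with a lookahead matching the same one-line comments. The names GREEDY, NON_GREEDY and sample are made up for the illustration.

```python
import re

# Greedy form: '%' followed by everything that is not a line break.
GREEDY = re.compile(r"%[^\r\n]*")

# Non-greedy form: '%' followed by as little as possible, up to a line
# break or the end of the input.
NON_GREEDY = re.compile(r"%.*?(?=\r?\n|\Z)")

sample = "a = 1 % first comment\r\nb = 2\n% second comment"

greedy_matches = GREEDY.findall(sample)
non_greedy_matches = NON_GREEDY.findall(sample)

print(greedy_matches)
print(greedy_matches == non_greedy_matches)
```

Both patterns find the same comments on this input, so a non-greedy comment rule can usually be rewritten with a negated set.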

Posting here since it may be useful to people that find this thread.
Since this was posted, there have been several performance improvements to Antlr's Python target. That said, the Python interpreter will be intrinsically slower than Java or other compiled languages.
I've put together a Python accelerator code generator for Antlr's Python3 target. It uses Antlr C++ target as a Python extension. Lexing & parsing is done exclusively in C++, and then an auto-generated visitor is used to re-build the resulting parse tree in Python. Initial tests show a 5x-25x speedup depending on the grammar and input, and I have a few ideas on how to improve it further.
Here is the code-generator tool: https://github.com/amykyta3/speedy-antlr-tool
And here is a fully-functional example: https://github.com/amykyta3/speedy-antlr-example
Hope this is useful to those who prefer using Antlr in Python!

I have been using ANTLR with the Python3 target recently.
A file with ~500 lines takes less than 20 seconds to parse.
So switching to the Python3 target might help.

Related

Java LR or LL Parsing

A teacher of mine said that Java cannot be LL parsed.
I don't understand this and wonder whether it is true.
I searched for a grammar of Java 8 and found this: https://github.com/antlr/grammars-v4/blob/master/java8/Java8.g4
But even when I try to analyze the grammar, I don't see what prevents LL parsing.
Does anyone know whether this is true, know of a formal proof, or can simply explain why it should not be possible to find a grammar for Java that can be LL parsed?
Thanks a lot, guys and girls.
The Java Language Specification for Java 7 says it is not LL(1):
The grammar presented in this chapter is the basis for the
reference implementation. Note that it is not an LL(1) grammar, though
in many cases it minimizes the necessary look ahead.
If you find either:
left recursion, or
an alternative (A|B) whose branches have overlapping FIRST sets, i.e. FIRST(A) has one or more symbols that are also in FIRST(B),
your grammar won't be LL(1).
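The two conditions above can be checked mechanically. Here is a minimal sketch, assuming a grammar is given as a dict from nonterminal to a list of alternatives (each a list of symbols). The representation, the function names and the toy grammars are all made up for illustration, and the check is partial: it looks only at direct left recursion and FIRST/FIRST overlaps, not at FIRST/FOLLOW conflicts.

```python
EPS = ""  # stands for the empty string (epsilon)


def first_of_sequence(symbols, first):
    """FIRST set of a symbol sequence, given FIRST sets of the nonterminals."""
    result = set()
    for symbol in symbols:
        symbol_first = first.get(symbol, {symbol})  # a terminal is its own FIRST
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # every symbol in the sequence can derive epsilon
    return result


def compute_first(grammar):
    """Compute FIRST for every nonterminal by iterating to a fixed point."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, alternatives in grammar.items():
            for alternative in alternatives:
                new = first_of_sequence(alternative, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


def ll1_problems(grammar):
    """Report direct left recursion and FIRST/FIRST overlaps (partial check)."""
    first = compute_first(grammar)
    problems = []
    for nonterminal, alternatives in grammar.items():
        if any(alt and alt[0] == nonterminal for alt in alternatives):
            problems.append("%s: direct left recursion" % nonterminal)
        for i in range(len(alternatives)):
            for j in range(i + 1, len(alternatives)):
                overlap = (first_of_sequence(alternatives[i], first)
                           & first_of_sequence(alternatives[j], first))
                if overlap:
                    problems.append("%s: FIRST/FIRST conflict on %s"
                                    % (nonterminal, sorted(overlap)))
    return problems


left_recursive = {"expr": [["expr", "+", "term"], ["term"]], "term": [["NUM"]]}
common_prefix = {"stmt": [["if", "c", "then", "stmt"],
                          ["if", "c", "then", "stmt", "else", "stmt"],
                          ["other"]]}
fine = {"list": [["NUM", "rest"]], "rest": [[",", "NUM", "rest"], []]}

for grammar in (left_recursive, common_prefix, fine):
    print(ll1_problems(grammar))
```

The first toy grammar trips both checks, the second only the FIRST/FIRST check (both alternatives start with if), and the third passes.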
I think it's due to the left recursion. LL parsers cannot handle left recursion, and the current Java grammar is specified using it in some cases, at least in Java 7.
Of course, it is well known that one can construct an equivalent grammar that gets rid of the left recursion, but as currently specified the Java grammar cannot be LL parsed.
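For reference, the textbook rewrite for direct left recursion turns A -> A α | β into A -> β A' and A' -> α A' | ε. A minimal sketch of that rewrite follows, assuming alternatives are lists of symbols and an empty list stands for ε. The function name and the "_tail" suffix are made up, and indirect left recursion and degenerate rules such as A -> A are not handled.

```python
def remove_direct_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a | b into A -> b A_tail, A_tail -> a A_tail | epsilon."""
    # The 'a' parts: what follows the leading self-reference.
    recursive = [alt[1:] for alt in alternatives
                 if alt and alt[0] == nonterminal]
    # The 'b' parts: alternatives that do not start with the nonterminal.
    others = [alt for alt in alternatives
              if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}  # nothing to rewrite
    tail = nonterminal + "_tail"  # hypothetical name for the new nonterminal
    return {
        nonterminal: [alt + [tail] for alt in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],  # [] is epsilon
    }


rewritten = remove_direct_left_recursion(
    "expr", [["expr", "+", "term"], ["term"]])
for name, alts in rewritten.items():
    print(name, "->", " | ".join(" ".join(alt) or "epsilon" for alt in alts))
```

The rewritten grammar accepts the same strings, but the parse tree becomes right-leaning, so left-associative operators need extra care when building the result.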

Generating modular ANTLR Java

I have an ANTLR grammar consisting of a number of sub-items. The high-level grammar looks something like this:
grammar MyGrammar;
import MyLocation, MyName, MyTime;
composite
    : myname (WS+ mylocation)? (WS+ mytime)?
    ;
I compile MyGrammar.g4 to obtain the required Java code and all is well when parsing items such as John at 4:30pm. However, I now have a situation where I need to parse times separately from the composite item, for example 4:30pm.
At the moment it appears that I have to duplicate code in MyGrammarListener and MyTimeListener to handle times. Is there any way instead in which I can tell MyGrammarListener to hand off to MyTimeListener when it sees a mytime so that I can avoid code duplication, or should I be handling this in a different way?
The answer to the first part of your question is no, you cannot do this (as of ANTLR 4.4 at least). See my answer here:
Is it possible to make Antlr4 generate lexer from base grammar lexer instead of gener Lexer?

Analyse C++ files from a Java program

After several days of research I turn to you.
I want to analyse a C++ file in order to:
Count the number of parameters in each method/function
Count the number of lines in each method/function
etc...
I first tried to do this with regular expressions, but without success (there are too many cases to handle, and the regexes quickly become illegible).
Now I am trying ANTLR4. Unfortunately I cannot seem to find a grammar for C++ (I found a grammar for C here: https://github.com/antlr/grammars-v4)
(I also tried ANTLR3, but with this grammar I get C++ code!)
http://www.antlr3.org/grammar/1295920686207/antlr3.2_cpp_parser4.1.0.zip
So do you know where I can find a C++ grammar for ANTLR4?
Or do you know another way to do what I want?
Thank you in advance for your help
PS: sorry for my English, I'm a French student
There are some good answers here. If I were you I would use a pre-built parser. After having tried to use ANTLR, I would say it takes a long time to make anything good. Personally I would try Clang.
Clang has a library that builds an AST from which you can get the information you want.
Some existing tools compute statistics of this kind, such as
cccc
ccccc
...

Program for a three-point Gauss integration

I want to write a Java program to calculate an integral with three-point Gauss quadrature.
How can I evaluate a function that is given as a string?
For example, I want to calculate F(x) = x^4 + cos(x) + e^2x
Evaluating a string is not an easy task by itself.
You have to write your own interpreter with a lexer and a parser.
You can consider using third-party libraries for parsing and evaluating mathematical functions. I've never used any of them. A quick search turns up:
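To give an idea of what such an interpreter involves, here is a minimal sketch of a lexer plus recursive-descent evaluator. It is written in Python rather than Java to keep it short, all names are made up, and the small expression language (+, -, *, /, right-associative ^, parentheses, the variable x, the constant e and a few functions) is an assumption. Implicit multiplication is not supported, so e^2x has to be written as e^(2*x).

```python
import math
import re

TOKEN = re.compile(r"\s*(?:(\d+\.?\d*)|([A-Za-z_]\w*)|(.))")
FUNCTIONS = {"cos": math.cos, "sin": math.sin, "exp": math.exp}


def tokenize(text):
    """Lexer: split the text into (kind, value) pairs."""
    tokens = []
    for number, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("num", float(number)))
        elif name:
            tokens.append(("name", name))
        elif op.strip():  # ignore stray whitespace
            tokens.append(("op", op))
    return tokens


class Evaluator(object):
    def __init__(self, text, x):
        self.tokens, self.pos, self.x = tokenize(text), 0, x

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("end", None)

    def take(self, kind, value=None):
        token = self.peek()
        if token[0] != kind or (value is not None and token[1] != value):
            raise SyntaxError("unexpected token %r" % (token,))
        self.pos += 1
        return token[1]

    def expr(self):  # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            if self.take("op") == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):  # term := power (('*' | '/') power)*
        value = self.power()
        while self.peek() in (("op", "*"), ("op", "/")):
            if self.take("op") == "*":
                value *= self.power()
            else:
                value /= self.power()
        return value

    def power(self):  # power := atom ('^' power)?  (right-associative)
        base = self.atom()
        if self.peek() == ("op", "^"):
            self.take("op")
            return base ** self.power()
        return base

    def atom(self):  # number | x | e | func '(' expr ')' | '-' power | '(' expr ')'
        kind, value = self.peek()
        if kind == "num":
            return self.take("num")
        if kind == "name":
            name = self.take("name")
            if name == "x":
                return self.x
            if name == "e":
                return math.e
            if name not in FUNCTIONS:
                raise SyntaxError("unknown name %r" % name)
            self.take("op", "(")
            argument = self.expr()
            self.take("op", ")")
            return FUNCTIONS[name](argument)
        if (kind, value) == ("op", "-"):
            self.take("op")
            return -self.power()
        self.take("op", "(")
        inner = self.expr()
        self.take("op", ")")
        return inner


def evaluate(text, x):
    evaluator = Evaluator(text, x)
    result = evaluator.expr()
    evaluator.take("end")  # reject trailing garbage
    return result


print(evaluate("x^4 + cos(x) + e^(2*x)", 0.0))
```

Each grammar rule becomes one method, which is why even a small expression language takes this much code, and why a ready-made library is attractive.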
JbcParser
JepParser
I'm sure there are a couple of others around...
Hope this helps
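For the integration part itself, three-point Gauss-Legendre quadrature on [a, b] uses the nodes 0 and ±sqrt(3/5) with weights 8/9 and 5/9, mapped from [-1, 1] onto the interval. A minimal sketch follows, in Python rather than Java, assuming the function is already available as a callable and reading e^2x from the question as e^(2x). The names gauss3 and F are illustrative.

```python
import math

# Nodes and weights of the three-point Gauss-Legendre rule on [-1, 1].
NODES_AND_WEIGHTS = (
    (-math.sqrt(3.0 / 5.0), 5.0 / 9.0),
    (0.0, 8.0 / 9.0),
    (math.sqrt(3.0 / 5.0), 5.0 / 9.0),
)


def gauss3(f, a, b):
    """Approximate the integral of f over [a, b] with three Gauss points."""
    half = (b - a) / 2.0  # scale factor from [-1, 1] to [a, b]
    mid = (a + b) / 2.0   # image of the node t = 0
    return half * sum(w * f(mid + half * t) for t, w in NODES_AND_WEIGHTS)


def F(x):
    # The question's example, reading e^2x as e^(2x) (an assumption).
    return x ** 4 + math.cos(x) + math.exp(2 * x)


approx = gauss3(F, 0.0, 1.0)
exact = 1.0 / 5.0 + math.sin(1.0) + (math.exp(2.0) - 1.0) / 2.0
print(approx, exact)
```

The rule is exact for polynomials up to degree 5, so the x^4 term is integrated exactly and the remaining error comes from cos and exp. For wider intervals, splitting [a, b] into subintervals and summing gauss3 over them improves accuracy.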

Java CFG parser that supports ambiguities

I'm looking for a CFG parser implemented in Java. I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not only one of them. I have already researched many NLP parsers such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is rather difficult and poorly documented.
I found some parser generators such as ANTLR or JFlex but I'm not sure that they can handle ambiguities. So which parser generator or Java library is best for me?
Thanks in advance
You want a parser that uses the Earley algorithm. I haven't used either of these two libraries, but PEN and PEP appear to implement this algorithm in Java.
Another option is Bison, which implements GLR. GLR is an LR type parsing algorithm that supports ambiguous grammars. Bison also generates Java code, in addition to C++.
Take a look at the related discussion here. In my last comment in that discussion I explain that you can make any parser generator produce all of the parse trees by cloning the parse tree derived so far before making the derivation fail.
If your grammar is:
G -> ...
You would augment it like this:
G' -> G {semantic:deal-with-complete-parse-tree} <NOT-VALID-TOKEN>.
The parsing engine will ultimately fail on all derivations, but your program will either have:
Saved clones of all the trees.
Dealt with the semantics of each of the trees as they were found.
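The same idea (record each complete tree, then keep exploring the remaining alternatives as if the parse had failed) can be sketched without a parser generator. Below is a small backtracking recursive-descent enumerator in Python that uses generators, so every complete derivation is yielded and the search then carries on. The grammar representation, the names and the dangling-else toy grammar are assumptions for illustration, and the approach loops forever on left-recursive grammars.

```python
# Toy ambiguous grammar (the classic dangling else). Anything that is not
# a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "stmt": [
        ["if", "stmt"],
        ["if", "stmt", "else", "stmt"],
        ["x"],
    ],
}


def parse(symbol, tokens, pos):
    """Yield every (tree, next_pos) derivation of symbol starting at pos."""
    if symbol not in GRAMMAR:  # terminal: must match the current token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for alternative in GRAMMAR[symbol]:
        for children, end in parse_sequence(alternative, tokens, pos):
            yield (symbol, children), end


def parse_sequence(symbols, tokens, pos):
    """Yield every way of matching the symbols one after another from pos."""
    if not symbols:
        yield [], pos
        return
    for tree, middle in parse(symbols[0], tokens, pos):
        for rest, end in parse_sequence(symbols[1:], tokens, middle):
            yield [tree] + rest, end


def all_parse_trees(start, tokens):
    """Keep only the derivations that consume the whole input."""
    return [tree for tree, end in parse(start, tokens, 0)
            if end == len(tokens)]


trees = all_parse_trees("stmt", "if if x else x".split())
for tree in trees:
    print(tree)
```

For this input there are two trees, one attaching the else to the inner if and one to the outer if. For realistic natural-language grammars an Earley or GLR parser is the better fit, since plain backtracking can take exponential time.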
Both ANTLR and JavaCC did well when I was teaching. My preference was for ANTLR because of its BNF lexical analysis, and its much less convoluted history, vision, and licensing.
