I'm writing a simple scripting language on top of Java/JVM, where you can also embed Java code using the {} brackets. The problem is, how do I parse this in the grammar? I have two options:
Allow everything to be in it, such as [a-zA-Z0-9_$] characters, and go on
Get an extra java grammar and use that grammar to parse that small code (is it actually possible and efficient?)
Option 2] would basically be a double-check, since the Java code is checked again anyway when it is evaluated. Now my last question: is there a way to dynamically execute Java code, even with objects that have been created at runtime?
Thanks,
William van Doorn
1] Allow everything to be in it, such as [a-zA-Z0-9_$] characters, and go on
You can't just do that: you'll have to account for opening and closing brackets.
2] Get an extra java grammar and use that grammar to parse that small code (is it actually possible and efficient?)
Yes that's possible. But I suggest you first get something working, and then worry about efficiency (is that really an issue here?).
... is there a way to dynamically execute Java code, even with objects that have been created at runtime?
Yes, since Java 6, there's a way to compile source files dynamically. See the JavaCompiler API.
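A minimal sketch of that approach (the class names and the greet method here are mine, purely illustrative): write a source file, compile it with the system compiler, then load the resulting class with a URLClassLoader:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;
import java.io.FileWriter;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class DynamicCompile {
    public static void main(String[] args) throws Exception {
        // Write a small source file to a temp directory
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File src = new File(dir, "Hello.java");
        try (FileWriter w = new FileWriter(src)) {
            w.write("public class Hello { public String greet() { return \"hi\"; } }");
        }

        // The system compiler is only available on a JDK, not a plain JRE
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no compiler: run on a JDK");
        }
        int status = compiler.run(null, null, null, src.getPath()); // 0 = success
        System.out.println("compile status: " + status);

        // Load the freshly compiled class and call it reflectively
        try (URLClassLoader loader = new URLClassLoader(new URL[]{ dir.toURI().toURL() })) {
            Class<?> cls = loader.loadClass("Hello");
            Object obj = cls.getDeclaredConstructor().newInstance();
            Method greet = cls.getMethod("greet");
            System.out.println(greet.invoke(obj));
        }
    }
}
```

Since the compiled class is loaded into the same JVM, it can interact with objects your interpreter has already created, as long as you pass them in reflectively or through a shared interface.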
I propose enclosing your Java code in a character like '`', which is not used in Java code and rarely appears in literals.
JavaCode : '`' ( EscapeSequence | ~('\\'|'`') )* '`'
         ;
Use the Java.g grammar provided with the ANTLR examples to get the definition of EscapeSequence, etc.
The only catch is that you need to ask programmers to escape this character ('`') if it needs to appear inside a literal.
Related
Hopefully my title isn't completely terrible. I don't really know what this should be called. I'm trying to write a very basic scheme parser in Java. The issue I'm having is with implementation.
I open a file, and I want to parse individual tokens:
while (sc.hasNext()) {
    System.out.println(sc.next());
}
Generally, to get tokens, this is fine. But in Scheme, recognizing the beginning and end of a list is crucial; my program's functionality depends on this, so I need a way to treat a token such as:
(define
or
poly))
As multiple tokens, where any parentheses is its own token:
(
define
poly
)
)
If I can do that, I can properly recognize different symbols to add to my symtab, and know when/how to add nodes to my parse tree.
The Java API shows that the Scanner class doesn't have any methods for doing exactly what I want. The closest thing I could think of is using the parentheses as custom delimiters, which would make each token clean enough to be recognized more easily by my logic, but then what happens to my parentheses?
Another method I'm thinking about is forgoing the Java tokenizer, and just scanning char by char until I find a complete symbol.
What should I do? Try to work around the Java scanner methods, or just do a character by character approach?
First, you need to get your terminology straight. (define is not a single token; it's a ( token followed by a define one. Similarly, poly)) is not a single token, it's three.
Don't let java.util.Scanner (that's what you're using, right?) throw you for a loop -- when you say "Generally, to get tokens, this is fine", I say no, it's not. Don't settle for what it provides if it's not enough.
To correctly tokenize Scheme code, I'd expect you need to at least be able to deal with regular languages. That would probably be very tough to do using Scanner, so here's a couple of alternatives:
learn and apply a tried-and-true parsing tool like ANTLR or Lex; it will be beneficial for any of your future parsing projects
roll your own regular expression approach (I don't know Scheme well enough to be sure that this will work) for tokenizing, but don't forget that you need at least context-free for full parsing
learn about parser combinators and recursive descent parsing, which are relatively easy to implement by hand -- and you'll end up learning a ton about Java's type system
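To illustrate the regex route from the second bullet (the class name and pattern here are mine, and a real Scheme lexer would also need strings, comments, and character literals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeTokenizer {
    // A paren is always its own token; anything else between
    // whitespace and parens is a single atom
    private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "(define" and "x))" naturally come apart into separate tokens
        System.out.println(tokenize("(define (square x) (* x x))"));
    }
}
```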
I have not found anything of this sort on Google and would like to know if there is a quicker way of doing the following:
I need to parse build scripts for Java programs; the scripts themselves are written in Python. More specifically, I want to parse the dictionaries which are hard-coded into these build scripts.
For example, these scripts contain entries like:
config = {}
config["Project"] = \
{
"Name" : "ProjName",
"Version" : "v2",
"MinimumPreviousVersion" : "v1",
}
def actualCode ():
# Some code that actually compiles the relevant files
(The actual compiling is done via a call to another program, this script just sets the required options which I want to extract).
For example, I want to extract "Name" = "ProjName" and so on.
I am aware of the ConfigParser library that is part of Python, but it was designed for .ini files and hence has problems (it throws an exception and crashes) with the actual Python code that may appear in the build scripts I am talking about. Using it would therefore mean first reading the file in and removing the lines ConfigParser would object to.
Is there a quicker way than reading the config file in as a normal file and parsing it myself? I am looking for libraries that can do this, and I don't mind too much which language they are in.
I was trying to solve a similar problem. I converted the dictionary into a JSON object so that I could query keys via the JSON object in the simplest way possible. This solution worked for multi-level key-value pairs for me.
Here is the algorithm.
Locate config["key_name"] in the string or file using a regular expression, for example:
config(.*?)\\[(.*?)\\]
Get the data within the curly brackets into a string. Use some stack-based code, since there could be nested brackets of type {} or [] in complex dictionaries.
Replace any round brackets "()" with square brackets "[]" and backslashes "\" with spaces " ", then parse the result as JSON:
expression = expression.replace('(', '[')
                       .replace(')', ']')
                       .replace('\\', ' ');
JSONObject json = (JSONObject) parser.parse(expression); // parser: e.g. a json-simple JSONParser
Here is your JSON object. You can use it the way you want.
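The locate-and-extract steps above can be sketched in Java like this (class and method names are mine; the final JSON parse is left to whatever JSON library you use):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConfigExtractor {
    // Find 'config["key"] = {' (optionally with a line-continuation backslash)
    // and return the brace-balanced block that follows it
    static String extractBlock(String source, String key) {
        Pattern p = Pattern.compile(
                "config\\[\"" + Pattern.quote(key) + "\"\\]\\s*=\\s*\\\\?\\s*\\{");
        Matcher m = p.matcher(source);
        if (!m.find()) return null;

        // A depth counter is enough here because we only track {} nesting
        int depth = 1;
        int start = m.end() - 1; // position of the opening '{'
        for (int i = m.end(); i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) {
                return source.substring(start, i + 1);
            }
        }
        return null; // unbalanced braces
    }

    public static void main(String[] args) {
        String script =
              "config = {}\n"
            + "config[\"Project\"] = \\\n"
            + "{\n"
            + "  \"Name\" : \"ProjName\",\n"
            + "  \"Version\" : \"v2\",\n"
            + "}\n";
        System.out.println(extractBlock(script, "Project"));
    }
}
```

Note that the extracted text may still contain Python-isms (such as trailing commas) that a strict JSON parser rejects, so some cleanup along the lines of the replacements above may still be needed.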
Try Parboiled. It is written in Java and you write your grammars in... Java too.
It uses a stack to store elements, etc.; its parser class is generic and you can get the final result out of it.
I know this is an old question, but I have found an incredibly useful config parser library for Java here.
It provides a simple function getValue("sectionName", "optionName") that allows you to get the value of an option inside a section.
[sectionName]
optionName = optionValue
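If you would rather avoid a dependency, a getValue of that shape takes only a few lines over the standard library (a bare-bones sketch of my own that ignores comments and escaping):

```java
import java.util.HashMap;
import java.util.Map;

public class IniReader {
    private final Map<String, Map<String, String>> sections = new HashMap<>();

    public IniReader(String content) {
        String current = "";
        for (String line : content.split("\\R")) {
            line = line.trim();
            if (line.startsWith("[") && line.endsWith("]")) {
                // Start of a new [section]
                current = line.substring(1, line.length() - 1);
                sections.putIfAbsent(current, new HashMap<>());
            } else if (line.contains("=")) {
                // key = value inside the current section
                int eq = line.indexOf('=');
                sections.computeIfAbsent(current, k -> new HashMap<>())
                        .put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
    }

    public String getValue(String section, String option) {
        Map<String, String> s = sections.get(section);
        return s == null ? null : s.get(option);
    }

    public static void main(String[] args) {
        IniReader ini = new IniReader("[sectionName]\noptionName = optionValue\n");
        System.out.println(ini.getValue("sectionName", "optionName")); // optionValue
    }
}
```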
We are developing an Eclipse plugin tool to remove sysout statements from the workspace projects. We have only partially achieved our goal: if the sysout is on one line, we are able to delete it easily, but if the sysout spans a couple of lines (which generally happens due to code formatting), we run into trouble.
For Example :
System.out.println("Hello World");
The regular expression to remove this line would be simple:
System.out.println*
But if the code is this:
System.out.println
    ("HelloWorld");
This is where the issue comes. Can anyone please suggest how I can replace this using a java regular expression.
I suggest
String regex = "System\\.out\\.println[^)]+[)]\\s*;";
Where the [^)]+ will scan until the closing parenthesis. However, this will fail in multiple cases:
(possibly-unbalanced) parenthesis inside the output
commented-out code
the few cases where it is possible to omit the ';'
cases where System.out is assigned to another variable, instead of being used directly
Go the extra mile and use Eclipse's built-in parser (which understands lexical issues and comments, and can flag any compile-time references to System.out).
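For the regex route, note that a negated class like [^)] already matches newlines (unlike '.', which would need DOTALL), so the pattern handles the formatted call without extra flags. A quick demonstration (input snippet is mine):

```java
public class SysoutStripper {
    public static void main(String[] args) {
        // Same pattern as suggested above; [^)] also matches line breaks
        String regex = "System\\.out\\.println[^)]+[)]\\s*;";
        String code =
              "int x = 1;\n"
            + "System.out.println\n"
            + "    (\"HelloWorld\");\n"
            + "x++;\n";
        // The multi-line call is removed along with its closing ');'
        System.out.println(code.replaceAll(regex, ""));
    }
}
```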
Searching for solutions to my problem, I found this question, which suggests composite grammars to get rid of "code too large". The problem: I'm already using grammar imports, but when I further extend one of the imported grammars, the root parser grammar shows the error. Apparently, the problem lies in the many token and DFA definitions that ANTLR generates after analyzing the whole grammar. Is there a way, or what is the suggested way, to get rid of this problem? And is it scalable, i.e. does it not depend on the parts changed by the workaround being small enough?
EDIT: To make this clear (the linked question didn't make it clear): the "code too large" error is a compiler error on the generated parser code, to my understanding usually caused by a grammar so large that some generated method exceeds the size limit of the Java specification. In my case, it's the static initializer of the root parser class, which contains tons of DFA lookahead variables, all resulting in code in the initializer. So, ideally, ANTLR should be able to split that up when the grammar is too big, or when the user tells it to. Is there such an option?
(I have to admit, the asker of the linked question had an... interesting rule that caused his grammar to bloat up, and it may be my error here, too. But the possibility of this being not the grammar's author's error (in any large grammar) stands, so I see this as a valid, non-grammar specific ANTLR question)
EDIT END
My grammar parses "Magic the Gathering" rules text and is available here (git). The problem specifically appears when exchanging line 33 for 34-36 in this file. I use Maven and antlr3-maven-plugin for building, so ideally, the workaround is doable using the plugin, but if it's not, that's a smaller problem than the one I have now...
Thanks a lot, and I hope I haven't overlooked any obvious documentation that would help me.
The fragment keyword can only be used before lexer rules, not before parser rules as I see you do. First change that in all your grammars (I only looked at ObjectExpressions.g). It's unfortunate that ANTLR does not produce an error when you try it. But believe me: it's wrong, and might be causing (a part of) your problem(s).
Also, your rule from line 34-36:
qualities
: qualities0
| qualities0 (COMMA qualities0)+ -> qualities0+
| qualities0 (Or qualities0)+ -> ^(Or qualities0+)
;
should be rewritten as:
qualities
: qualities0 (COMMA qualities0)* -> qualities0+
| qualities0 (Or qualities0)+ -> ^(Or qualities0+)
;
EDIT
So, ideally, ANTLR should be able to split that up when the grammar is too big, or when the user tells it to. Is there such an option?
No, there is no such option unfortunately. You'll have to divide the grammar into (even more) smaller ones.
I'd like to parse REXX source so that I can analyse the structure of the program from Java.
I need to do things like normalise equivalent logic structures in the source that are syntactically different, find duplicate variable declarations, etc. and I already have a Java background.
Any easier ways to do this than writing a load of code?
REXX is not an easy language to parse with common tools, especially those that expect a BNF grammar. Unlike most languages designed by people exposed to C, REXX doesn't have any reserved words, making the task somewhat complicated. Every term that looks like a reserved word is actually only resolved in its specific context (e.g., "PULL" is only reserved as the first word of a PULL instruction or the second word of a PARSE PULL instruction - you can also have a variable called PULL ("PULL = 1 + 2")). Plus there are some very surprising effects of comments. But the ANSI REXX standard has the full syntax and all the rules.
If you have BNF Rexx grammar, then javacc can help you build an AST (Abstract Syntax Tree) representation of that Rexx code.
More accurately, javacc will generate the Java classes which will:
parse the Rexx code and
actually build the AST.
There would still be a "load of code", but you would not be the one writing the classes for that Rexx code parser, only generating them.
Have a look at ANTLR; it really does a nice job of building an AST, transforming it, etc.
It has a nice editor (ANTLRWorks), is built on Java, and can debug your parser / tree walkers while they run in your application. Really worth investigating for any kind of parsing job.