I have an older project, where a JavaCC grammar was used to generate classes to parse a custom language.
Now, several years later I have to adapt the grammar to add functionality (just a minor change).
This works, but when running all tests, I see I have a problem parsing UTF-8 characters.
I don't really have an idea what is causing this.
I reverted my change to the grammar and recreated the classes, but the problem remains.
As soon as I run javacc with the grammar and run my tests, the one with the UTF-8 characters fails.
This is the call I am using:
java -cp javacc-7.0.10.jar javacc -GRAMMAR_ENCODING=UTF-8 functionsGrammar.jj
I tried it with all major javacc versions from 4.x to 7.0.10, they all have the same problem.
I also tried this with different Java versions (6, 7, 8, 11), but that did not make any difference either.
Below you can find the relevant parts of the grammar:
options
{
JDK_VERSION = "1.6";
LOOKAHEAD= 2;
FORCE_LA_CHECK = true;
static = false;
}
TOKEN:
{
...
|< STRING : < QUOTES > (~["\"", "\\"])* ("\\"~[] (~["\"", "\\"])*)* < QUOTES > >
...}
TOKEN:
{
...
| < LIST :
< LCURLY_BRACE > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
( < COMMA > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
)*
...}
It fails for the string: "美丽的树" but works when changed to "slkdfj" for example.
I wonder if there are any options for JavaCC that I am missing? Or other java / javacc version combinations that might work?
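One thing worth ruling out (an assumption; nothing in the post confirms it) is how the test input is decoded before the generated lexer sees it. As far as I know, -GRAMMAR_ENCODING only describes the encoding of the .jj file itself, not of the text being parsed. The plain-Java sketch below does not involve the parser at all; it only shows how the same UTF-8 bytes turn into a different string when decoded with another charset. The class name CharsetCheck is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The failing string from the question, written with Unicode escapes
        // so that the encoding of this source file does not matter.
        String original = "\"\u7F8E\u4E3D\u7684\u6811\"";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = decode(utf8, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(decodedAsUtf8.equals(original));   // round trip is lossless
        System.out.println(decodedAsLatin1.equals(original)); // wrong charset changes the text
        System.out.println(decodedAsUtf8.length() + " vs " + decodedAsLatin1.length());
    }
}
```

If the tests hand the parser an InputStream, wrapping it in an InputStreamReader with an explicit UTF-8 charset and passing that Reader to the parser is one way to take the platform default charset out of the picture.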
So I'm learning how to create a parser in JavaCC. This will be the type of language we would be expected to parse.
bus(59) ->
"Beach Shuttle"
at 9:30 10:30 11:30 12:00
13:00 14:00 15:00
via stops 3 76 44 89 161 32
free
bus(1234) ->
"The Hills Loop"
at 7:15 7:30 7:45 8:05 8:20 8:40 9:00
via stops 99 97 77 66 145 168
bus(7) -> "City Transit"
at
16:08 16:39 16:55 17:01 17:12 17:28
via
stops
2 1 5 7 13 119
We have some rules that the parser needs to follow as well.
We must ignore whitespace except for those inside "".
We can have any number of bus declarations and the order within will always be the same.
The bus name (in double quotes) will contain any number of characters.
The times are in 24-hour format hh:mm and there must be at least one per bus declaration.
The stop numbers are all pre-defined locations and there must be at least 2 per bus declaration.
The word free may or may not be present for each bus declaration.
Below is my implementation so far; I'll try to explain my thought process.
PARSER_BEGIN(MyParser)
import java.io.*;
public class MyParser
{
public static void parse(String fileName) throws IOException, ParseException
{
MyParser parser = new MyParser(new FileInputStream(fileName));
parser.dsl();
}
}
PARSER_END(MyParser);
//Remainder of the .jj file.
//Tokens to ignore in the BNF follow.
SKIP : { ' ' | '\t' | '\n' | '\r' }
TOKEN : {
< BUSNUMBER : "bus(["0"-"9"]) |
< BUSNAME : "(["a"-"z", "A"-"Z"])* //Match a single character which can be lowercase or upper. Happens 0 or more times.
< VIA : "via" > |
< STOPS : "stops" > |
< FREE : "free" >
}
// was used as a temporary comments indicator.
So I've created my characters to skip over.
And all the tokens I can think of.
But I'm not sure what I'm missing.
Any help would be appreciated, or an explanation would be better as I actually want to learn how to do this.
Thank you.
A few comments. For
< BUSNUMBER : "bus(["0"-"9"]) |
You perhaps mean
< BUSNUMBER : "bus(" (["0"-"9"])+ ")" > |
However, if you want to allow spaces, you should treat bus, (, ), and numbers as separate tokens.
For
< BUSNAME : "(["a"-"z", "A"-"Z"])* //Match a single character which can be lowercase or upper. Happens 0 or more times.
you might want
< BUSNAME : "\"" (["a"-"z", "A"-"Z", " "])* "\""> |
(I don't know what characters are possible in a bus name, but in your example you have spaces as well as letters.)
You are missing ->, stop numbers and times.
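To make the token list concrete, here is a rough sketch in plain Java (not JavaCC, and not generated code) of a tokenizer that treats bus, the parentheses, ->, quoted names, times and numbers as separate tokens. The class name BusTokens, the token kind names and the exact regular expressions are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BusTokens {
    // One named group per token kind. TIME is listed before NUMBER so that
    // "9:30" is matched as one time rather than as the number 9.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<ARROW>->)|(?<LPAREN>\\()|(?<RPAREN>\\))"
        + "|(?<NAME>\"[^\"]*\")|(?<TIME>\\d{1,2}:\\d{2})|(?<NUMBER>\\d+)"
        + "|(?<WORD>bus|at|via|stops|free)");

    private static final String[] KINDS =
        {"ARROW", "LPAREN", "RPAREN", "NAME", "TIME", "NUMBER", "WORD"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        int end = input.length();
        while (true) {
            // Whitespace between tokens is skipped; whitespace inside quotes
            // is kept because NAME matches the whole quoted string.
            while (pos < end && Character.isWhitespace(input.charAt(pos))) pos++;
            if (pos >= end) break;
            m.region(pos, end);
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input at offset " + pos);
            }
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    tokens.add(kind + ":" + m.group(kind));
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "bus(7) -> \"City Transit\" at 16:08 via stops 2 1 free";
        for (String token : tokenize(line)) System.out.println(token);
    }
}
```

In a .jj file each of these kinds would become its own TOKEN definition, and the whitespace loop corresponds to the SKIP section.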
I mistakenly wrote the following in the Groovy console, and afterwards I realized that it should have thrown an error, but it did not. Why does Groovy not throw an error for a colon at the end of a statement? Is it reserved for documentation or something like that?
a:
String a
println a
This threw no error when I tried executing this code in https://groovyconsole.appspot.com/
It's a label, just like it would be in Java. For example:
a:
for (int i = 0; i < 10; i++)
{
String a = "hello"
println a
break a; // This refers to the label before the loop
}
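For comparison, this is a small sketch of the same label mechanism in plain Java, where a label is typically used to leave nested loops in one go. The class and method names are made up for illustration.

```java
class LabelDemo {
    // Returns the first pair (i, j) with i * j == target, or null if there is none.
    static int[] findPair(int target) {
        int[] found = null;
        outer:
        for (int i = 1; i < 10; i++) {
            for (int j = 1; j < 10; j++) {
                if (i * j == target) {
                    found = new int[] {i, j};
                    break outer; // leaves both loops, not just the inner one
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int[] pair = findPair(12);
        System.out.println(pair[0] + " * " + pair[1]);
    }
}
```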
One good use of labels in Groovy I can think of is the Spock Framework, where they are used for clauses:
def 'test emailToNamespace'() {
given:
Partner.metaClass.'static'.countByNamespaceLike = { count }
expect:
Partner.emailToNamespace( email ) == res
where:
email | res | count
'aaa.com' | 'com.aaa' | 0
'aaa.com' | 'com.aaa1' | 1
}
I'm trying to write software that executes instructions of a basic programming language with only these capabilities:
Arithmetic expression evaluator (add,minus,mult,divide,parenthesis,...)
if-else statement
Function definition
It should display the code being 'reduced' or 'simplified' one step at a time. Here is an example of a sample output:
Iteration 1:
a=3;
b=2;
c=true;
if(c && (a < 3 * (5 -2) ) || b >= 3 * (5 -2))){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 2:
if(true && (a < 3 * (5 -2) ) || b >= 3 * (5 -2))){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 3:
if(true && (3 < 3 * (5 -2) ) || 2 >= 3 * (5 -2))){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 4:
if(true && (3 < 9 ) || 2 >= 3 * (5 -2))){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 5:
if(true && (3 < 9 ) || 2 >= 3 * 3)){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 6:
if(true && (3 < 9 ) || 2 >= 9)){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 7:
if(true && true || 2 >= 9)){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 8:
if(true && (true || false)){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 9:
if(true && false){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 10:
if(false){
System.out.println("going through if");
}else{
System.out.println("going through else");
}
Iteration 11:
System.out.println("going through else");
So it should parse the input code and execute it at the same time, performing only one basic operation per iteration, substituting the result of that operation, and finishing the loop when no more simplification steps are required. Does anyone know how to do this, for example with the ANTLR tool? I've been checking the Tree Listener feature of ANTLR, as it seems the way to go, but it's not clear to me how to implement it.
Ideally it would be implemented in JavaScript so that it can be executed in a web browser, but Java code would be adequate (e.g. executing as a Java applet).
Grammar:
program
: (variable | function)*
statement*
;
variable
: IDENT ('=' expression)? ';'
;
type
: 'int'
| 'boolean'
| 'String'
| 'char'
;
statement
: assignmentStatement
| ifStatement
;
ifStatement
: 'if' '(' expression ')' '{' statement+ '}'
('else if' '(' expression ')' '{' statement+ '}')*
('else' '{' statement+ '}')?
;
assignmentStatement
: IDENT '=' expression ';'
;
returnStatement
: 'return' expression ';'
;
function
: 'function' IDENT '(' parameters? ')' '{'
(statement|returnStatement)*
'}'
;
parameters
: IDENT (',' IDENT)*
;
term
: IDENT
| '(' expression ')'
| INTEGER
;
negation
: '-' -> NEGATION
;
unary
: ('+'! | negation^)* term
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
IDENT : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};
[OP: ... (or any other tool)]
You use the phrase term rewriting and then show an interesting example of incrementally processing the source code for a program by substituting values and doing constant folding to produce the final program answer.
Abstractly, what you want from a term rewriting system is the ability to start with a set of "term rewrite" rules, that in essence specify
if you see this, replace it by that
e.g.
" if (true) then ... " ==> " ... "
and apply those rules in an organized way, repeatedly, until some stopping condition is achieved (often being "no more rules apply").
There are two ways to implement term rewriting, both starting with a term (the program's AST in your case), and producing the same result. The difference is in how the rewrite rules are actually implemented.
Procedural Tree Rewriting
The first way specifies the rewriting steps procedurally. That is, one constructs a visitor to walk over the AST, which procedurally examines the node types and the subtrees of interesting node types (the "match") and, where a match is found, modifies the tree according to the desired effect of the rule.
For the "if" example above, the visitor would find "if" statement subtree roots, check the left/condition subtree to see whether it was a "true" subtree, and if so, replace the "if" subtree by the right subtree producing a modified tree.
Constant folding is a bit special: the visitor checks for an operator node, checks that all its children are constant values, computes the result of the operator on those constants, and then replaces the operator by a node containing the computed result.
To implement rewriting with a set of rules, you have to first write down all the abstract rules, and then code this visitor, combining all the tests from all the rules. This can produce a really messy bit of code which checks node types and walks up and down the subtrees to do further checks. (You can also implement this as one visitor per rule, which makes them easier to write, but now you have a big pile of visitors and you have to run them repeatedly over the tree... this ends up being extremely slow.)
The visitor has to be a bit clever: you don't want it processing subtrees "before their time". Consider this code, being processed by a rewriter:
if (x) then y = 3/ 0; else y = 3;
You don't want to constant fold "3/0" before x has been evaluated.
You can implement procedural rewriting starting with ASTs from any parser generator, including ANTLR; it is just sweat to write the visitor. Perhaps a lot.
Procedural rewriting is awkward to implement because of the problem of composing all the rule matches into the visitor. If you have dozens of rules, this becomes unmanageable, fast. (If you are going to use rules to process a full computer language, you will have at least one rule per bit of syntax; it is easy to get dozens of rules in this case, if not hundreds).
To get the "incremental" display aspect desired by OP, you have to stop after each match/replace step and then prettyprint the AST, e.g., regenerate the surface syntax text from the AST. Modifying the visitor to call prettyprint after each tree modification isn't really hard.
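As a rough illustration of the procedural approach with a print after every step, here is a minimal Java sketch. It assumes a toy AST with only integer constants and binary operators; the node classes, the leftmost-innermost strategy and the parenthesized output format are all assumptions made for the example, and toString stands in for a real prettyprinter.

```java
import java.util.ArrayList;
import java.util.List;

class StepRewriter {
    // Toy AST (hypothetical): integer constants and binary operators only.
    static abstract class Node {}

    static final class Num extends Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Bin extends Node {
        final char op;
        final Node left, right;
        Bin(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " " + op + " " + right + ")"; }
    }

    // Performs at most one rewrite (leftmost, innermost first).
    // Returns the modified tree, or null when no rule applies.
    static Node step(Node n) {
        if (!(n instanceof Bin)) return null;
        Bin b = (Bin) n;
        Node l = step(b.left);
        if (l != null) return new Bin(b.op, l, b.right);
        Node r = step(b.right);
        if (r != null) return new Bin(b.op, b.left, r);
        // Neither child could be rewritten, so both are constants: fold them.
        int x = ((Num) b.left).value;
        int y = ((Num) b.right).value;
        switch (b.op) {
            case '+': return new Num(x + y);
            case '-': return new Num(x - y);
            case '*': return new Num(x * y);
            default: throw new IllegalArgumentException("unsupported operator " + b.op);
        }
    }

    // Rewrites until nothing applies, recording the printed tree after every step.
    static List<String> trace(Node n) {
        List<String> lines = new ArrayList<>();
        lines.add(n.toString());
        for (Node next = step(n); next != null; next = step(next)) {
            lines.add(next.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // 3 * (5 - 2), a sub-expression from the question's example
        Node expr = new Bin('*', new Num(3), new Bin('-', new Num(5), new Num(2)));
        for (String line : trace(expr)) System.out.println(line);
    }
}
```

Because step returns after the first change, each printed line differs from the previous one by exactly one folded operator, which is the shape of output the question asks for.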
Parser generators that produce ASTs usually don't provide good help for doing the pretty printing step, though. It is harder than it looks to do prettyprinting. The details are too complex to put here; see my SO answer on how to prettyprint for details on how to do this.
The next complication: when a variable in the program being evaluated is encountered, what value should be substituted in the tree? For this, one needs a symbol table for the language, and has to keep that symbol table up to date as assignments of values to variables take place.
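A deliberately tiny Java sketch of that idea follows. It assumes a flat table with integer values only and works on space-separated text instead of a real AST; a real symbol table would also handle scopes and types. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SymbolTableDemo {
    // Current value of each variable, updated as assignments are executed.
    private final Map<String, Integer> values = new LinkedHashMap<>();

    // Record the effect of an assignment such as "a = 3".
    void assign(String name, int value) {
        values.put(name, value);
    }

    // Replace every known identifier in a space-separated expression by its current value.
    String substitute(String expression) {
        StringBuilder out = new StringBuilder();
        for (String token : expression.split(" ")) {
            if (out.length() > 0) out.append(' ');
            Integer value = values.get(token);
            out.append(value != null ? value.toString() : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SymbolTableDemo table = new SymbolTableDemo();
        table.assign("a", 3);
        table.assign("b", 2);
        System.out.println(table.substitute("a < 9 || b >= 9"));
    }
}
```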
Not discussed is what happens if the program is ill-formed. Some programs will be :-{ It is likely that the rewrites will need a lot of "error checking" to prevent nonsense computations (e.g., "x / y" where y is a string).
Direct Tree Rewriting
What you want ideally is an engine that accepts the explicit term rewriting rules directly, and can apply them.
Program Transformation Systems (PTS) do this. Specific systems include Mathematica (now called "Wolfram language"?), DMS, TXL, Stratego/XT.
(OP was looking for one implemented in Java: Stratego has a Java version, I think; the others are definitely not.)
These tools accept rewrite rules written using the surface syntax of the target language, and convert the rules in essence into pairs of pattern trees (a "match" tree with variable placeholders and a "replace" tree with the same variable placeholders). The rewrite engines in these tools will take an arbitrary subset of specified rules, and apply them to the tree, checking all the matches by comparing the "match" tree against the targeted tree, and substituting the replacement tree, with its placeholders filled in from the match, when a match is found. This is a major convenience in writing complex sets of rewrites. (If you think about it, this is still procedural rewriting... just that the engine is doing it rather than you, the rule-specifier. Convenient nonetheless.)
Such PTS include parser generators (Mathematica does not) that build ASTs, and a full prettyprinter (or at least allow you to define one conveniently).
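To show what "pairs of pattern trees" means mechanically, here is a minimal Java sketch of a match/replace engine. It is not how any of the named tools is implemented; the trees are built by hand instead of being parsed from surface syntax, and the Term class, the '?' placeholder convention and the method names are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PatternRewrite {
    // A term is a head symbol plus child terms; a head starting with '?' is a placeholder.
    static final class Term {
        final String head;
        final List<Term> args;
        Term(String head, Term... args) { this.head = head; this.args = Arrays.asList(args); }
        boolean isVar() { return head.startsWith("?"); }
        @Override public boolean equals(Object o) {
            return o instanceof Term && head.equals(((Term) o).head) && args.equals(((Term) o).args);
        }
        @Override public int hashCode() { return 31 * head.hashCode() + args.hashCode(); }
        @Override public String toString() { return args.isEmpty() ? head : head + args; }
    }

    // Structural match of the "match" tree against a term, recording placeholder bindings.
    static boolean match(Term pattern, Term term, Map<String, Term> env) {
        if (pattern.isVar()) {
            Term bound = env.get(pattern.head);
            if (bound == null) { env.put(pattern.head, term); return true; }
            return bound.equals(term); // a repeated placeholder must match the same subtree
        }
        if (!pattern.head.equals(term.head) || pattern.args.size() != term.args.size()) return false;
        for (int i = 0; i < pattern.args.size(); i++) {
            if (!match(pattern.args.get(i), term.args.get(i), env)) return false;
        }
        return true;
    }

    // Copy of the "replace" tree with placeholders filled in from the match.
    // Assumes every placeholder in the replacement also occurs in the pattern.
    static Term instantiate(Term replacement, Map<String, Term> env) {
        if (replacement.isVar()) return env.get(replacement.head);
        Term[] kids = new Term[replacement.args.size()];
        for (int i = 0; i < kids.length; i++) kids[i] = instantiate(replacement.args.get(i), env);
        return new Term(replacement.head, kids);
    }

    // Applies the first matching rule, trying the root before the children; null if none applies.
    static Term rewriteOnce(Term term, Term[][] rules) {
        for (Term[] rule : rules) {
            Map<String, Term> env = new HashMap<>();
            if (match(rule[0], term, env)) return instantiate(rule[1], env);
        }
        for (int i = 0; i < term.args.size(); i++) {
            Term kid = rewriteOnce(term.args.get(i), rules);
            if (kid != null) {
                Term[] kids = term.args.toArray(new Term[0]);
                kids[i] = kid;
                return new Term(term.head, kids);
            }
        }
        return null;
    }

    // Two example rules: if(true, s, t) -> s and if(false, s, t) -> t.
    static Term[][] ifRules() {
        return new Term[][] {
            { new Term("if", new Term("true"), new Term("?s"), new Term("?t")), new Term("?s") },
            { new Term("if", new Term("false"), new Term("?s"), new Term("?t")), new Term("?t") },
        };
    }

    public static void main(String[] args) {
        Term program = new Term("seq",
            new Term("if", new Term("false"), new Term("print_if"), new Term("print_else")));
        System.out.println(program + " ==> " + rewriteOnce(program, ifRules()));
    }
}
```

The point of the design is that adding a rule means adding one more pattern/replacement pair to the table; the matching and substitution code stays untouched.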
For DMS, you can write rules like this:
rule fold_true_ifTE(s: statement, t:statement): statement->statement =
" if (true) then \s else \t " -> " \s ";
rule fold_false_ifTE(s: statement, t:statement): statement->statement =
" if (false) then \s else \t " -> " \t ";
rule fold_constants_add(x:NUMBER,y:NUMBER):sum -> sum =
" \x + \y " -> " \Add\(\x\,\y\)";
The first two rules realize the "if" statement rewrite we sketched earlier; you also need rules for just "if-then" statements. The quotes are metaquotes; they separate text of the rule specification language (RSL) from the text of the language being manipulated by the rules. The metaescaped letters (s, t, x, y) are metavariables, and represent subtrees of the rule match. These metavariables must have the AST-type specified by the rule parameter list (e.g., s: statement means s is "any statement node").
The third rule realizes constant folding for "addition". The pattern looks for a "+" operator; it gets a match only if it finds one that has both children being number constants. It works by calling an external procedure "\Add" on its operands; Add returns a tree containing the sum, which the rewriting engine splices in place.
In DMS's case, there is a hook on rewriting machinery that is called after every rewrite attempt (both failed and successful) to trace the rewrite results. This hook would be where OP would call the prettyprinter to show how the tree has changed after every step.
For a detailed example of how to write rules to evaluate algebraic expressions, see how to implement Algebra with rewrite rules.
For a more detailed description of how DMS rewrite rules work, and an example of applying them to "simplifying" (evaluating) Niklaus Wirth's programming language Oberon, see DMS Rewrite Rules.
Not shown in either case is a means to control the order in which the rules are applied. Because ordering constraints can be arbitrary, one has to step in and guide the rewriting engine. DMS provides complete procedural control of rule sequencing if needed. Often one can partition the rules into different sets: those that can be applied indiscriminately, and those that require sequencing (e.g., the if-then simplification rules).
PTS don't make the symbol table problem vanish; OP is still going to need one. Most PTS don't provide any support for this. DMS provides explicit support for this (it takes some effort to configure, but a lot less than when there isn't any!) as well as building static type checkers to help verify the program is well formed before one starts execution. As a practical matter, there are a lot of issues that need to be addressed to prepare a program for execution (for instance, one might want to construct a map of the labels to source code points to enable efficient GOTO simulation). See Life After Parsing.
In the Symja example snippet below you can try the built-in Java term rewriting engine.
The main part of the term rewriting and pattern-matching engine is implemented in the package org.matheclipse.core.patternmatching
By implementing the IEvalStepListener interface you can see the steps the engine runs through internally:
package org.matheclipse.core.examples;
import org.matheclipse.core.eval.EvalEngine;
import org.matheclipse.core.eval.ExprEvaluator;
import org.matheclipse.core.interfaces.AbstractEvalStepListener;
import org.matheclipse.core.interfaces.IExpr;
import org.matheclipse.parser.client.SyntaxError;
import org.matheclipse.parser.client.math.MathException;
public class SO_StepListenerExample {
private static class StepListener extends AbstractEvalStepListener {
/** Listens to the evaluation step in the evaluation engine. */
@Override
public void add(IExpr inputExpr, IExpr resultExpr, int recursionDepth, long iterationCounter) {
System.out.println("Depth " + recursionDepth + " Iteration " + iterationCounter + ": " + inputExpr.toString() + " ==> "
+ resultExpr.toString());
}
}
public static void main(String[] args) {
try {
ExprEvaluator util = new ExprEvaluator( );
EvalEngine engine = util.getEvalEngine();
engine.setStepListener(new StepListener());
IExpr result = util.evaluate(
"a=3;b=2;c=True;If(c && (a < 3 * (5 -2) ) || ( b >= 3 * (5 -2)),"
+ "GOINGTHROUGHIF,"
+ "GOINGTHROUGHELSE )");
System.out.println("Result: " + result.toString());
// disable trace mode if the step listener isn't necessary anymore
engine.setTraceMode(false);
} catch (SyntaxError e) {
// catch Symja parser errors here
System.out.println(e.getMessage());
} catch (MathException me) {
// catch Symja math errors here
System.out.println(me.getMessage());
} catch (Exception e) {
e.printStackTrace();
}
}
}
The example generates the following output:
Depth 4 Iteration 0: a=3 ==> 3
Depth 4 Iteration 0: b=2 ==> 2
Depth 3 Iteration 0: a=3;b=2 ==> 2
Depth 3 Iteration 0: c=True ==> True
Depth 2 Iteration 0: a=3;b=2;c=True ==> True
Depth 5 Iteration 0: c ==> True
Depth 6 Iteration 0: a ==> 3
Depth 8 Iteration 0: (-1)*2 ==> -2
Depth 7 Iteration 0: -2+5 ==> -2+5
Depth 7 Iteration 1: 5-2 ==> 3
Depth 6 Iteration 0: 3*(-2+5) ==> 3*3
Depth 6 Iteration 1: 3*3 ==> 9
Depth 5 Iteration 0: a<3*(-2+5) ==> 3<9
Depth 5 Iteration 1: 3<9 ==> True
Depth 4 Iteration 0: c&&a<3*(-2+5) ==> True
Depth 3 Iteration 0: c&&a<3*(-2+5)||b>=3*(-2+5) ==> True
Depth 2 Iteration 0: If(c&&a<3*(-2+5)||b>=3*(-2+5),goingthroughif,goingthroughelse) ==> goingthroughif
Depth 1 Iteration 0: a=3;b=2;c=True;If(c&&a<3*(-2+5)||b>=3*(-2+5),goingthroughif,goingthroughelse) ==> goingthroughif
Result: goingthroughif
My current programming project is a sort of French dictionary in Java (using SQLite). I was wondering what would happen if someone wanted to find the present tense for "avoir" but typed in "avior", and how I would handle it. So I thought I could implement some sort of closest match/did you mean functionality. So my question is:
Is there a way to use the database to search for similar matches?
When I made the same program in Python a while back (using XML instead) I used this system, but it wasn't very effective: it required a large error margin to find anything, and as a result it suggested words with no relevance. Something similar could still be useful nevertheless:
def getSimilar(self, word, Return = False):
    matches = list()
    for verb in self.data.getElementsByTagName("Verb"):
        for x in range(16):
            if x % 2 != 0 and x>0:
                if (x == 15 or x == 3 or x == 1):
                    part = Dict(self.data).removeBrackets(Dict(self.data).getAccents(verb.childNodes[x].childNodes[0].data))
                    diff = 0
                    for char in word:
                        if (not char in part):
                            diff += 1
                    if (diff < self.similarityValue) and (-self.errorAllowance <= len(part) - len(word) <= self.errorAllowance):
                        matches.append(part)
                else:
                    for y in range(14):
                        if (y % 2 != 0 and y>0):
                            part = Dict(self.data).getAccents(verb.childNodes[x].childNodes[y].childNodes[0].data)
                            diff = 0
                            for char in word:
                                if (not char in part):
                                    diff += 1
                            if (diff < self.similarityValue) and (-self.errorAllowance <= len(part) - len(word) <= self.errorAllowance):
                                matches.append(part)
    if not Return:
        for match in matches:
            print "Did you mean '" + match + "'?"
    if Return: return matches
Any help is welcomed!
Jamie
Try using
https://github.com/mateusza/SQLite-Levenshtein
Works quite well
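For reference, the Levenshtein distance itself is short enough to sketch in Java, for instance to rank candidate words after fetching them from the database. This is a generic textbook version, not code from the linked extension; the class and method names are made up.

```java
class Levenshtein {
    // Dynamic-programming edit distance: insert, delete and substitute each cost 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from the empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("avior", "avoir")); // the typo from the question
        System.out.println(distance("avior", "aller"));
    }
}
```

Unlike the character-membership count in the question, this metric is sensitive to character order, so "avior" is closer to "avoir" (distance 2) than to "aller" (distance 3).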