How to represent in Java a context-free grammar?


I have a simple grammar:
R --> R and R | R or R | atom
The only terminal we have is atom.
This is a recursive grammar because each R can be composed by nested R.
The problems I am facing are:
How to deal with recursion?
How to build a java class R that can be resolved by one of the 3 rules?
How would you represent this grammar by Java classes?

The easiest way is to normalize all rules as single choices, and then represent them as an array of arrays.
First we assign a unique code to each "atom" (token) in the grammar.
Then, rules should all be normalized as
LHS --> RHS1 RHS2 ... RHSn
e.g., rules of the form a --> b | c should be normalized as two rules, a --> b and a --> c. If you have other fancy EBNF notational devices such as Kleene star or plus, normalize them away as well.
Now you have K rules; you can define an array with K slots, each slot holds one rule. A rule slot holds a pair: a LHS, and an array of size n for that rule. (Easier: a rule slot holds an array of size n+1, with the leftmost element index 0 holding LHS, index 1 holding RHS1, etc.).
Now you have the grammar represented in Java.
[Recursion is a semantic property of the grammar, not its representation.]
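As a rough illustration of the array-of-arrays idea, here is a minimal sketch for the grammar in the question. The class name GrammarTable, the symbol codes, and the format helper are all made up for this example, and it assumes "and" and "or" are treated as tokens alongside atom.

```java
class GrammarTable {
    // Symbol codes (arbitrary values chosen for this sketch).
    static final int R = 0, AND = 1, OR = 2, ATOM = 3;

    // Printable names, indexed by symbol code.
    static final String[] NAMES = { "R", "and", "or", "atom" };

    // One row per normalized rule: index 0 holds the LHS, the rest is the RHS.
    static final int[][] RULES = {
        { R, R, AND, R },  // R --> R and R
        { R, R, OR, R },   // R --> R or R
        { R, ATOM },       // R --> atom
    };

    // Renders one rule row back into "LHS --> RHS1 RHS2 ..." form.
    static String format(int[] rule) {
        StringBuilder sb = new StringBuilder(NAMES[rule[0]]).append(" -->");
        for (int i = 1; i < rule.length; i++) {
            sb.append(' ').append(NAMES[rule[i]]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int[] rule : RULES) {
            System.out.println(format(rule));
        }
    }
}
```

Note that R appearing on both sides of a rule needs no special handling here; the table only stores symbol codes.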
An alternative: if you build a classic parser for BNF (after all, (E)BNF has a grammar, too) you can parse your BNF using that parser and build a tree for it. That's obviously also a representation. It isn't as convenient to process as the array of arrays.

Related

Unexpected Type Error when using Unary Operator --

I am learning Java and am experimenting with the unary operators --expr, expr--, and -expr.
In class, I was told that --3 should evaluate to 3. I wanted to test this concept in the following assignments:
jshell> int t = 10;
t ==> 10
| created variable t : int
jshell> int g = -3;
g ==> -3
| created variable g : int
jshell>
jshell> int d = --3;
| Error:
| unexpected type
| required: variable
| found: value
| int d = --3;
| ^
jshell> int d = --t;
d ==> 9
| created variable d : int
jshell> int f = d---t;
f ==> 0
| created variable f : int
jshell> int f = 1---t;
| Error:
| unexpected type
| required: variable
| found: value
| int f = 1---t;
| ^
| update overwrote variable f : int
My questions:
Why does assigning -3 work and not --3? I thought --3 would give 3.
Are there cases where --expr can be evaluated as double negation instead of decrement?
Why can't values suffice where the unexpected type errors were thrown?
How did Java evaluate d---t? Also, in what order?
For question 4, the way I thought of it was a right-to-left evaluation. So, if d = 9 and t is 9, the rightmost - operator is the first to act on t, making its value -9. Then the same for the second -, so then t's value becomes 9 again. Then I thought the compiler would notice that d is next to the leftmost operator and subtract the values. This would be 9-9, which evaluates to 0. Jshell shows the expression also evaluated to 0, but I want to make sure my reasoning is correct or can be improved.

-- is taken as the decrement operator. Adding spaces or using brackets will allow it to be interpreted as double negation.
int x = - -3;
//or
int x = -(-3);

Why does assigning -3 work and not --3? I thought --3 would give 3.
Java tokenizes the input. A source file consists of a sequence of relevant atomic units, which we shall call words. In y +foobar, we have 3 'words': y, +, and foobar. Tokenizing is the job of splitting them up.
This process is somewhat complicated; whitespace (obviously) separates things, but whitespace isn't necessary if the two 'words' don't share any legal characters. Thus, 5+2 is legal as is 5 + 2, and public static works, but publicstatic does not. This tokenization step occurs first and is done on limited knowledge. You could in theory surmise from context that, say, publicstatic void foo() {} can't really mean anything else, but the amount of knowledge you need to draw that conclusion is considerable, and tokenization just does not have it. Hence, publicstatic does not work, but 5+2 does.
Based on that rule on tokenization, int y = --3; and int y = - -3; are different for the same reason publicstatic and public static are different. -- is a single 'word', meaning: unary decrement. You can't split it up (you can't put spaces in between the two minus signs), and 2 consecutive minus signs without spaces in between are going to be tokenized as the unary decrement operator, and not as 2 consecutive binary-minus/unary-negative words. You COULD draw the conclusion that int y = --3; only has one non-stupid interpretation (2 minus signs), because the other obvious interpretation (unary decrement the expression '3') is a compiler error: You can't decrement a constant. But that goes back to the earlier rule: If you have to take into account all complications at all times, parsing java source files is incredibly complicated, and for no meaningful gain: You really do not want to write source code where things are interpreted correctly or not based on exactly how intelligent the compiler ended up being. It ain't English; you want consistency and clarity at all times. Poetic license is not a good thing when talking directly to computers.
CONCLUSION: --3 does not work. It never can. Whoever informed you just messed up, or you misread it, and they were talking about - -3 and you didn't notice the space, or it got lost in translation.
Are there cases where --expr can be evaluated as double negation instead of decrement?
Not like that, no. - -expr will, -(-expr) will, but -- written just like that, no spaces, no parens, nothing in between? No. Because of the tokenizer.
Why can't values suffice where the unexpected type errors were thrown?
Because there'd be absolutely no point to it, and it would make javac (the part that turns your bag-o-characters into a tree structure and from there, into a class file) a few orders of magnitude more complicated. It's a computer, not English. You don't want 'best effort', and the few languages that do this (javascript, PHP) are incredibly stupid languages, universally derided, for trying. (These languages are still popular, but that's despite the fact that they try their best instead of just having a clear spec and failing when the programmer fails to adhere to it.) You'll find plenty of talks about such languages, by fans of them, making fun of the corner cases. In javascript, there's the famous WAT talk. For java there is the Java Puzzlers book. Both are designed to teach something by showing off how you can use the language to write idiotic code that is hard to read, and both were made by fans of these languages. It's a bit much to try to give you some sort of proof that you do not want a language to take its best wild stab in the dark at what you meant, but hopefully a few popular books and talks will go some way toward making you realize why it works like this.
How did Java evaluate d---t? Also, in what order?
Now we're getting into specifics. Java's tokenizer is such that it will tokenize that as d -- - t or d - -- t, and that's because --- is not a known word, and the java tokenizer is based on splitting things up by applying a list of known symbols and keywords to the job.
So which one does it tokenize to? It doesn't matter. Why do you want to know? So that you can write it? Don't - what possible purpose does that serve? If you write intentionally obfuscated code, you will be fired, and it'll be justified. If you are trying to train yourself to find that code perfectly readable, that's great! But less than 1% of your average java coder, even expert ones, do this, so now you've written code nobody can read, and you find code others wrote aggravating because nobody writes it like you do. So, you're unemployable as a programmer / you can't employ anybody to help you. You can build an entire career writing software on your own for the rest of your life, but it's quite limiting. Why bother?
It gets tokenized into d - -- t or d -- - t.
If you find this sort of thing entertaining (I certainly do!), then know full well this is no more useful than playing a silly game, there is no academic or career point to it, at all.
The problem is, you don't seem to find it entertaining. If you did, you'd have done a trivial experiment to find out. If it's d -- - t (subtract X and Y, where X is d, post-unary-decrement, and Y is t), then after all that, d will be one smaller, which you can trivially test for. If it's d - --t, then t would be one smaller afterwards. If it's d - -(-t) (as in, d minus Z, where Z is negative Y, where Y is negative t), then neither will have changed. You didn't do this experiment. That eliminates 'you find it fun'. I've eliminated 'this is useful', which leaves us with: This question does not matter, therefore neither does the answer.
There is a small chance this question shows up in a curriculum of some sort. If it does, do yourself a giant favour and find another curriculum. It's an incredibly strong indicator of extremely low quality java teaching, if you expect your students to know how d---t breaks down.
You talk about 'reasoning', but reasoning has no place here. d -- - t, d - -- t and d - -(-t) are all equally plausible interpretations; reason isn't going to tell you which one is correct. The language designers simply had to pick a rule, and they knew the choice didn't matter much. (For the record, the Java Language Specification has the tokenizer take the longest possible token at each step, which gives d -- - t.) The last interpretation isn't possible without a 'smart' tokenizer, and you don't want one of those. But the reasoning needed to conclude that you'd need a smart tokenizer, let alone that smart tokenizers are a bad idea, is incredibly complicated and requires a ton of experience writing parsers from scratch, or possibly a full year's worth of university-level parsing courses. I think you can be excused for not picking up that detail :P
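For anyone who does want to run the experiment described above, here is a minimal sketch. The class name TripleMinus and the run helper are made up for this example. It starts from d = 9 and t = 9, the values in the question's jshell session, and reports which variable changed.

```java
class TripleMinus {
    // Evaluates f = d---t and returns {f, d, t} so the side effects are visible.
    static int[] run(int d, int t) {
        int f = d---t;
        return new int[] { f, d, t };
    }

    public static void main(String[] args) {
        int[] r = run(9, 9);
        // If d shrank by one, the expression was read as (d--) - t.
        // If t shrank by one, it was read as d - (--t).
        // If neither changed, it was read as d - (-(-t)).
        System.out.println("f=" + r[0] + " d=" + r[1] + " t=" + r[2]);
    }
}
```

With these inputs the result is f=0, d=8, t=9. Only d changes, which is consistent with the tokenizer taking the longest token it can at each step (--, then -).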

Converting infix to binary tree

How can you convert an infix expression into a tree? I would like to do it manually rather than programming first. For example, let's look at this infix expression:
b = (x * a) - y / b * (c + d)
What are the rules to turning this into a tree? Or what steps do you suggest to take to do this? I'm having trouble here because sometimes these expressions don't have explicit parentheses in them:
b = x * a - y / b * c + d

The problem that you're describing here is generally called expression parsing and typically there are two steps in the process:
First, there's scanning, where you take your input string and break it apart into a bunch of smaller logical units, each of which represents a single "piece" of the input. For example, given the input string
b = x * a - y / b * c + d
You might produce this sequence of tokens:
[b] [=] [x] [*] [a] [-] [y] [/] [b] [*] [c] [+] [d]
This way, you move from "the input is a sequence of characters" to "the input is a sequence of individual variables, operators, etc." There are many ways to do this step, and they often involve doing some manual string processing or working with regular expressions (if you're familiar with those). As a first step, see if you can get this part working.
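As one possible way to do the scanning step with regular expressions, here is a minimal sketch in Java. The class name InfixScanner and the set of recognized tokens (identifiers, integers, and the characters + - * / = ( )) are assumptions made for this example, not part of any particular tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InfixScanner {
    // Optional whitespace, then an identifier, an integer, or one operator/parenthesis.
    private static final Pattern TOKEN =
            Pattern.compile("\\s*([A-Za-z_]\\w*|\\d+|[-+*/=()])");

    // Splits the input into tokens, rejecting any character the pattern does not cover.
    static List<String> scan(String input) {
        String text = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
            tokens.add(m.group(1));
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("b = x * a - y / b * c + d"));
    }
}
```

For the example input this yields the thirteen tokens listed above, in the same order.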
The second step, which is probably the one you're most hung up on, is parsing, where you take that sequence of tokens and reconstruct the meaning of the statement. This usually involves figuring out the operator precedence and actually building up the tree structure you want. That tree, by the way, is often called an expression tree or an abstract syntax tree (AST).
There are many ways to do this. For parsing expressions, my personal go-to is Dijkstra's shunting-yard algorithm. This algorithm works by maintaining two stacks and processing the tokens one at a time, using the stacks to determine what the operators should be applied to.
If you'd like to see an example of how to do this, I built a truth table generator for a discrete math class that I regularly teach. You type in a logical expression, and the code scans it to get a token sequence, then uses the shunting-yard algorithm to build up an AST, which is then used to generate the truth table. The source code is broken down so that each step is done separately, and that might make for a good reference.
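To make the two-stack idea concrete, here is a minimal Java sketch of a shunting-yard variant that builds a tree directly. It is not the code from the truth table generator mentioned above. The class name ShuntingYardTree and the Node type are made up for this example, it only supports the left-associative binary operators + - * / plus parentheses, and it assumes the token list is well formed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class ShuntingYardTree {
    // A tree node; leaves have no children.
    static final class Node {
        final String value;
        final Node left, right;
        Node(String value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return left == null ? value : "(" + left + " " + value + " " + right + ")";
        }
    }

    static boolean isOperator(String t) {
        return t.length() == 1 && "+-*/".contains(t);
    }

    static int precedence(String op) {
        return (op.equals("*") || op.equals("/")) ? 2 : 1;
    }

    // Pops one operator and its two operands, then pushes the combined subtree.
    static void reduce(Deque<String> ops, Deque<Node> operands) {
        String op = ops.pop();
        Node right = operands.pop();
        Node left = operands.pop();
        operands.push(new Node(op, left, right));
    }

    static Node parse(List<String> tokens) {
        Deque<String> ops = new ArrayDeque<>();
        Deque<Node> operands = new ArrayDeque<>();
        for (String t : tokens) {
            if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.peek().equals("(")) reduce(ops, operands);
                ops.pop(); // discard the matching "("
            } else if (isOperator(t)) {
                // Left-associative: reduce while the stacked operator binds at least as tightly.
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && precedence(ops.peek()) >= precedence(t)) {
                    reduce(ops, operands);
                }
                ops.push(t);
            } else {
                operands.push(new Node(t, null, null)); // variable or number
            }
        }
        while (!ops.isEmpty()) reduce(ops, operands);
        return operands.pop();
    }
}
```

For the tokens of x * a - y / b * c + d, the resulting tree prints as (((x * a) - ((y / b) * c)) + d), which shows where the implicit parentheses go.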

Is it possible to get k-th element of m-character-length combination in O(1)?

Do you know any way to get k-th element of m-element combination in O(1)? Expected solution should work for any size of input data and any m value.
Let me explain this problem by example (python code):
>>> import itertools
>>> data = ['a', 'b', 'c', 'd']
>>> k = 2
>>> m = 3
>>> result = [''.join(el) for el in itertools.combinations(data, m)]
>>> print result
['abc', 'abd', 'acd', 'bcd']
>>> print result[k-1]
abd
For the given data, the k-th (2nd in this example) element of the m-element combinations is abd. Is it possible to get that value (abd) without creating the whole list of combinations?
I'm asking because I have data of ~1,000,000 characters, and it is impossible to create the full list of m-character combinations just to get the k-th element.
The solution can be pseudo code, or a link to a page describing this problem (unfortunately, I didn't find one).
Thanks!

http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Basically, express the index in the factorial number system, and use its digits as a selection from the original sequence (without replacement).
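Here is a minimal Java sketch of that idea. Note that it numbers permutations, as in the linked article, rather than combinations. The class name KthPermutation is made up for this example, k is taken as 0-based, and the factorials are kept in a long, so it assumes at most 20 items.

```java
import java.util.ArrayList;
import java.util.List;

class KthPermutation {
    // Returns the k-th (0-based) permutation of items, ordered by item position.
    static <T> List<T> kth(List<T> items, long k) {
        List<T> pool = new ArrayList<>(items);
        int n = pool.size();
        // fact[i] = i!; assumes n <= 20 so the values fit in a long.
        long[] fact = new long[n + 1];
        fact[0] = 1;
        for (int i = 1; i <= n; i++) {
            fact[i] = fact[i - 1] * i;
        }
        if (k < 0 || k >= fact[n]) {
            throw new IllegalArgumentException("k out of range");
        }
        List<T> result = new ArrayList<>();
        for (int i = n - 1; i >= 0; i--) {
            int digit = (int) (k / fact[i]); // next digit in the factorial number system
            k %= fact[i];
            result.add(pool.remove(digit));  // select without replacement
        }
        return result;
    }
}
```

For example, index 3 over [a, b, c] has the factorial-base digits 1, 1, 0, which select b, then c, then a.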

Not necessarily O(1), but the following should be very fast:
Take the original combinations algorithm:
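def combinations(elems, m):
    # The k-th element depends on what order you use for
    # the combinations. Assuming it looks something like this...
    if m == 0:
        return [[]]
    combs = []
    for i, e in enumerate(elems):
        # Only elements after e may follow it, so each combination appears once.
        combs += [[e] + rest for rest in combinations(elems[i + 1:], m - 1)]
    return combs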
For n initial elements and m combination length, we have n! / (m! (n-m)!) total combinations. We can use this fact to skip directly to our desired combination:
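from math import comb  # Python 3.8+

def kth_comb(elems, m, k):
    # High level sketch; k is a 0-based index into the ordering used above.
    if m == 0:
        return []
    for i in range(len(elems)):
        # Number of combinations whose first element is elems[i].
        count = comb(len(elems) - i - 1, m - 1)
        if k < count:
            return [elems[i]] + kth_comb(elems[i + 1:], m - 1, k)
        k -= count  # skip that whole group
    raise IndexError("k is out of range")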

First calculate r = n!/(m!*(n-m)!), with n the number of elements.
Then floor(r/k) is the index of the first element in the result;
remove it (shift everything following to the left),
do m--, n-- and k = r%k,
and repeat until m is 0 (hint: when k is 0 just copy the following chars to the result).
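One way to realize the skip-ahead idea in Java is sketched below. It is not O(1), and it is not taken from any of the answers verbatim. The class name KthCombination is made up for this example, k is 0-based, combinations are ordered by element position, and the binomial coefficients are assumed to fit in a long (for inputs as large as the question's, BigInteger would be needed).

```java
import java.util.ArrayList;
import java.util.List;

class KthCombination {
    // C(n, r) via the multiplicative formula; assumes the result fits in a long.
    static long binomial(int n, int r) {
        if (r < 0 || r > n) return 0;
        r = Math.min(r, n - r);
        long result = 1;
        for (int i = 1; i <= r; i++) {
            result = result * (n - r + i) / i; // stays an integer at every step
        }
        return result;
    }

    // Returns the positions (0-based) that make up the k-th m-element combination of n items.
    static List<Integer> kth(int n, int m, long k) {
        if (k < 0 || k >= binomial(n, m)) {
            throw new IllegalArgumentException("k out of range");
        }
        List<Integer> picked = new ArrayList<>();
        int start = 0;
        for (int remaining = m; remaining > 0; remaining--) {
            for (int i = start; i < n; i++) {
                // Combinations that take position i next and fill the rest from later positions.
                long count = binomial(n - i - 1, remaining - 1);
                if (k < count) {
                    picked.add(i);
                    start = i + 1;
                    break;
                }
                k -= count; // skip that whole group
            }
        }
        return picked;
    }
}
```

With n = 4 and m = 3, the 0-based index 1 (the question's k = 2) gives positions [0, 1, 3], which is abd for the data a, b, c, d.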

I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem appears to fall under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it too is faster than other published techniques.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to Java, Python, or C++.

Parsing an arithmetic expression and building a tree from it in Java

I needed some help with creating custom trees given an arithmetic expression. Say, for example, you input this arithmetic expression:
(5+2)*7
The result tree should look like:
    *
   / \
  +   7
 / \
5   2
I have some custom classes to represent the different types of nodes, i.e. PlusOp, LeafInt, etc. I don't need to evaluate the expression, just create the tree, so I can perform other functions on it later.
Additionally, the negative operator '-' can only have one child, and to represent '5-2', you must input it as 5 + (-2).
Some validation on the expression would be required to ensure each type of operator has the correct number of arguments/children and each opening bracket is accompanied by a closing bracket.
Also, I should probably mention my friend has already written code which converts the input string into a stack of tokens, if that's going to be helpful for this.
I'd appreciate any help at all. Thanks :)
(I read that you can write a grammar and use antlr/JavaCC, etc. to create the parse tree, but I'm not familiar with these tools or with writing grammars, so if that's your solution, I'd be grateful if you could provide some helpful tutorials/links for them.)

Assuming this is some kind of homework and you want to do it yourself..
I did this once, you need a stack
So what you do for the example is:
parse   what to do?             Stack looks like
(       push it onto the stack  (
5       push 5                  (, 5
+       push +                  (, 5, +
2       push 2                  (, 5, +, 2
)       evaluate until (        7
*       push *                  7, *
7       push 7                  7, *, 7
eof     evaluate until top      49
The symbols like "5" or "+" can just be stored as strings or simple objects, or you could store the + as a +() object without setting the values and set them when you are evaluating.
I assume this also requires an order of precedence, so I'll describe how that works.
in the case of: 5 + 2 * 7
you have to push 5, push +, push 2; the next op has higher precedence, so you push it as well, then push 7. When you encounter a ), the end of file, or an operator with lower or equal precedence, you start evaluating the stack back to the previous ( or the beginning of the file.
Because your stack now contains 5 + 2 * 7, when you evaluate it you pop the 2 * 7 first and push the resulting *(2,7) node onto the stack, then once more you evaluate the top three things on the stack (5 + *node) so the tree comes out correct.
If it was ordered the other way: 5 * 2 + 7, you would push until you got to a stack with "5 * 2", then you would hit the lower precedence +, which means evaluate what you've got now. You'd evaluate the 5 * 2 into a *node and push it, then you'd continue by pushing the + and 7 so you had *node + 7, at which point you'd evaluate that.
This means you have a "highest current precedence" variable that is storing a 1 when you push a +/-, a 2 when you push a * or / and a 3 for "^". This way you can just test the variable to see if your next operator's precedence is < = your current precedence.
If ")" is considered priority 4, you can treat it like the other operators, except that it removes the matching "("; a lower priority would not.

I wanted to respond to Bill K.'s answer, but I lack the reputation to add a comment there (that's really where this answer belongs). You can think of this as an addendum to Bill K.'s answer, because his was a little incomplete. The missing consideration is operator associativity; namely, how to parse expressions like:
49 / 7 / 7
Depending on whether division is left or right associative, the answer is:
49 / (7 / 7) => 49 / 1 => 49
or
(49 / 7) / 7 => 7 / 7 => 1
Typically, division and subtraction are considered to be left associative (i.e. case two, above), while exponentiation is right associative. Thus, when you run into a series of operators with equal precedence, you want to parse them in order if they are left associative or in reverse order if right associative. This just determines whether you are pushing or popping to the stack, so it doesn't overcomplicate the given algorithm, it just adds cases for when successive operators are of equal precedence (i.e. evaluate stack if left associative, push onto stack if right associative).
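The extra case can be captured in a single decision function. The sketch below is only an illustration: the class name OperatorTable and the method shouldReduce are made up, and the precedence values 1, 2, 3 follow the numbering suggested in the answer above.

```java
class OperatorTable {
    static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            case '^': return 3;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    // In this sketch only exponentiation is right associative.
    static boolean isRightAssociative(char op) {
        return op == '^';
    }

    // True when the operator on top of the stack should be evaluated
    // before the incoming operator is pushed.
    static boolean shouldReduce(char onStack, char incoming) {
        int stacked = precedence(onStack);
        int next = precedence(incoming);
        return stacked > next || (stacked == next && !isRightAssociative(incoming));
    }
}
```

So for 49 / 7 / 7 the first division is evaluated when the second / arrives, giving (49 / 7) / 7, while for 2 ^ 3 ^ 2 the second ^ is pushed, giving 2 ^ (3 ^ 2).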

The "Five minute introduction to ANTLR" includes an arithmetic grammar example. It's worth checking out, especially since antlr is open source (BSD license).

Several options for you:
Re-use an existing expression parser. That would work if you are flexible on syntax and semantics. A good one that I recommend is the unified expression language built into Java (initially for use in JSP and JSF files).
Write your own parser from scratch. There is a well-defined way to write a parser that takes into account operator precedence, etc. Describing exactly how that's done is outside the scope of this answer. If you go this route, find yourself a good book on compiler design. Language parsing theory is going to be covered in the first few chapters. Typically, expression parsing is one of the examples.
Use JavaCC or ANTLR to generate lexer and parser. I prefer JavaCC, but to each their own. Just google "javacc samples" or "antlr samples". You will find plenty.
Between 2 and 3, I highly recommend 3 even if you have to learn new technology. There is a reason that parser generators have been created.
Also note that creating a parser that can handle malformed input (not just fail with a parse exception) is significantly more complicated than writing a parser that only accepts valid input. You basically have to write a grammar that spells out the various common syntax errors.
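To give a feel for the hand-written option, here is a minimal recursive-descent sketch for the question's restricted syntax (integers, +, *, unary -, and parentheses). The class ExprParser and its generic Node type are made up for this example; the asker's own PlusOp and LeafInt classes would take the place of Node. It simply fails with an exception on invalid input, and it strips all whitespace before parsing.

```java
class ExprParser {
    // Generic tree node: a label plus zero, one, or two children.
    static final class Node {
        final String label;
        final Node[] children;
        Node(String label, Node... children) {
            this.label = label;
            this.children = children;
        }
        @Override public String toString() {
            if (children.length == 0) return label;
            StringBuilder sb = new StringBuilder(label).append('(');
            for (int i = 0; i < children.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(children[i]);
            }
            return sb.append(')').toString();
        }
    }

    private final String src;
    private int pos = 0;

    private ExprParser(String text) {
        this.src = text.replaceAll("\\s+", "");
    }

    static Node parse(String text) {
        ExprParser p = new ExprParser(text);
        Node tree = p.expr();
        if (p.pos != p.src.length()) throw p.error("unexpected input");
        return tree;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }

    // Consumes c if it is the next character.
    private boolean accept(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // expr := term ('+' term)*
    private Node expr() {
        Node left = term();
        while (accept('+')) left = new Node("+", left, term());
        return left;
    }

    // term := factor ('*' factor)*
    private Node term() {
        Node left = factor();
        while (accept('*')) left = new Node("*", left, factor());
        return left;
    }

    // factor := '-' factor | '(' expr ')' | INT
    private Node factor() {
        if (accept('-')) return new Node("-", factor());
        if (accept('(')) {
            Node inner = expr();
            if (!accept(')')) throw error("expected ')'");
            return inner;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw error("expected a number");
        return new Node(src.substring(start, pos));
    }
}
```

Each grammar rule becomes one method, and operator precedence falls out of which method calls which. For (5+2)*7 the tree prints as *(+(5, 2), 7), and 5 + (-2) prints as +(5, -(2)).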
Update: Here is an example of an expression language parser that I wrote using JavaCC. The syntax is loosely based on the unified expression language. It should give you a pretty good idea of what you are up against.
Contents of org.eclipse.sapphire/plugins/org.eclipse.sapphire.modeling/src/org/eclipse/sapphire/modeling/el/parser/internal/ExpressionLanguageParser.jj

The given expression (5+2)*7 can be taken as infix:
Infix : (5+2)*7
Prefix : *+527
From the above we know the preorder and inorder traversals of the tree, and from those we can easily construct it.
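For expression trees in particular, the prefix form alone is enough, because every operator is known to take exactly two operands. Here is a minimal sketch under the assumptions that operands are single characters and the only operators are + and *. The class name PrefixTree is made up for this example.

```java
class PrefixTree {
    static final class Node {
        final char symbol;
        final Node left, right;
        Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
        // Prints the tree back in fully parenthesized infix form.
        @Override public String toString() {
            return left == null ? String.valueOf(symbol) : "(" + left + symbol + right + ")";
        }
    }

    private final String prefix;
    private int pos = 0;

    private PrefixTree(String prefix) {
        this.prefix = prefix;
    }

    static Node build(String prefix) {
        PrefixTree reader = new PrefixTree(prefix);
        Node root = reader.next();
        if (reader.pos != prefix.length()) {
            throw new IllegalArgumentException("trailing symbols");
        }
        return root;
    }

    // Reads one subtree: an operator followed by its two subtrees, or a single operand.
    private Node next() {
        if (pos >= prefix.length()) {
            throw new IllegalArgumentException("incomplete expression");
        }
        char c = prefix.charAt(pos++);
        if (c == '+' || c == '*') {
            Node left = next();
            Node right = next();
            return new Node(c, left, right);
        }
        return new Node(c, null, null);
    }
}
```

Building from *+527 gives a tree that prints as ((5+2)*7).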
Thanks,

Representing Math Equations as Java Objects

I am trying to design a way to represent mathematical equations as Java Objects. This is what I've come up with so far:
Term
-Includes fields such as coefficient (which could be negative), exponent and variable (x, y, z, etc). Some fields may even qualify as their own terms altogether, introducing recursion.
-Objects that extend Term would include things such as TrigTerm to represent trigonometric functions.
Equation
-This is a collection of Terms
-The toString() method of Equation would call the toString() method of all of its Terms and concatenate the results.
The overall idea is that I would be able to programmatically manipulate the equations (for example, a derivative method that would return an equation that is the derivative of the equation it was called for, or an evaluate method that would evaluate an equation for a certain variable equaling a certain value).
What I have works fine for simple equations:
This is just two Terms: one with a variable "x" and an exponent "2" and another which is just a constant "3."
But not so much for more complex equations:
Yes, this is a terrible example but I'm just making a point.
So now for the question: what would be the best way to represent math equations as Java objects? Are there any libraries that already do this?

what would be the best way to represent math equations as Java objects?
I want you to notice that you don't have any equations. Equations look like this:
x = 3
What you have are expressions: collections of symbols that could, under some circumstances, evaluate out to some particular values.
You should write a class Expression. Expression has three subclasses: Constant (e.g. 3), Variable (e.g. x), and Operation.
An Operation has a type (e.g. "exponentiation" or "negation") and a list of Expressions to work on. That's the key idea: an Operation, which is an Expression, also has some number of Expressions.
So your x^2 + 3 is SUM(EXP(X, 2), 3) -- that is, the SUM Operation, taking two expressions, the first being the Exponentiation of the Expressions Variable X and Constant 2, and the second being the Constant 3.
This concept can be infinitely elaborated to represent any expression you can write on paper.
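A minimal sketch of that class structure might look like the following. The wrapper class ExpressionSketch, the Type enum with its four operation types, and the evaluate method are choices made for this example; only the names Expression, Constant, Variable, and Operation come from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class ExpressionSketch {
    static abstract class Expression {
        abstract double evaluate(Map<String, Double> vars);
    }

    static final class Constant extends Expression {
        final double value;
        Constant(double value) { this.value = value; }
        @Override double evaluate(Map<String, Double> vars) { return value; }
    }

    static final class Variable extends Expression {
        final String name;
        Variable(String name) { this.name = name; }
        @Override double evaluate(Map<String, Double> vars) {
            Double v = vars.get(name);
            if (v == null) throw new IllegalArgumentException("no value for " + name);
            return v;
        }
    }

    enum Type { SUM, PRODUCT, EXP, NEGATION }

    // An Operation is itself an Expression and holds the Expressions it works on.
    static final class Operation extends Expression {
        final Type type;
        final List<Expression> operands;
        Operation(Type type, Expression... operands) {
            this.type = type;
            this.operands = Arrays.asList(operands);
        }
        @Override double evaluate(Map<String, Double> vars) {
            switch (type) {
                case SUM: {
                    double sum = 0;
                    for (Expression e : operands) sum += e.evaluate(vars);
                    return sum;
                }
                case PRODUCT: {
                    double product = 1;
                    for (Expression e : operands) product *= e.evaluate(vars);
                    return product;
                }
                case EXP:
                    return Math.pow(operands.get(0).evaluate(vars), operands.get(1).evaluate(vars));
                case NEGATION:
                    return -operands.get(0).evaluate(vars);
                default:
                    throw new IllegalStateException("unhandled type " + type);
            }
        }
    }

    // Builds SUM(EXP(X, 2), 3), i.e. x^2 + 3.
    static Expression example() {
        return new Operation(Type.SUM,
                new Operation(Type.EXP, new Variable("x"), new Constant(2)),
                new Constant(3));
    }
}
```

Evaluating the example with x = 2 gives 7. Differentiation or printing could be added as further abstract methods on Expression in the same style.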
The hard part is evaluating a string that represents your expression and producing an Expression object -- as someone suggested, read some papers about parsing. It's the hardest part but still pretty easy.
Evaluating an Expression (given fixed values for all your Variables) and printing one out are actually quite easy. More complicated transforms (like differentiation and integration) can be challenging but are still not rocket science.

Consult a good compiler book for details about how to write the part of a compiler that converts input into an expression tree.
You might find this series inspirational: http://compilers.iecc.com/crenshaw/
If you "just" want to evaluate an input string, then have a look at the snippet compiler in the Javassist library.

Here I described the representation of parsed math expressions as Abstract Syntax Trees in the Symja project.
The D[f,x] function in the D.java file implements a derivative function by reading the initial Derivative[] rules from the System.mep file.
