Parsing an arithmetic expression and building a tree from it in Java

I needed some help with creating custom trees given an arithmetic expression. Say, for example, you input this arithmetic expression:
(5+2)*7
The result tree should look like:
      *
     / \
    +   7
   / \
  5   2
I have some custom classes to represent the different types of nodes, i.e. PlusOp, LeafInt, etc. I don't need to evaluate the expression, just create the tree, so I can perform other functions on it later.
Additionally, the negative operator '-' can only have one child, and to represent '5-2', you must input it as 5 + (-2).
Some validation on the expression would be required to ensure that each type of operator has the correct number of arguments/children and that each opening bracket is accompanied by a closing bracket.
Also, I should probably mention my friend has already written code which converts the input string into a stack of tokens, if that's going to be helpful for this.
I'd appreciate any help at all. Thanks :)
(I read that you can write a grammar and use antlr/JavaCC, etc. to create the parse tree, but I'm not familiar with these tools or with writing grammars, so if that's your solution, I'd be grateful if you could provide some helpful tutorials/links for them.)

Assuming this is some kind of homework and you want to do it yourself...
I did this once; you need a stack.
So what you do for the example is:
Parse   What to do?              Stack looks like
(       push it onto the stack   (
5       push 5                   (, 5
+       push +                   (, 5, +
2       push 2                   (, 5, +, 2
)       evaluate until (         7
*       push *                   7, *
7       push 7                   7, *, 7
eof     evaluate until top       49
The symbols like "5" or "+" can just be stored as strings or simple objects, or you could store the + as a +() object without setting the values and set them when you are evaluating.
I assume this also requires an order of precedence, so I'll describe how that works.
In the case of: 5 + 2 * 7
You push 5, push +, and push 2. The next operator has higher precedence, so you push it as well, then push 7. When you encounter a ), the end of file, or an operator with lower or equal precedence, you start evaluating the stack back to the previous ( or the beginning of the file.
Because your stack now contains 5 + 2 * 7, when you evaluate it you pop the 2 * 7 first and push the resulting *(2,7) node onto the stack, then once more you evaluate the top three things on the stack (5 + *node) so the tree comes out correct.
If it was ordered the other way: 5 * 2 + 7, you would push until you got to a stack with "5 * 2", then you would hit the lower-precedence +, which means evaluate what you've got now. You'd evaluate the 5 * 2 into a *node and push it, then you'd continue by pushing the + and 7 so you had *node + 7, at which point you'd evaluate that.
This means you have a "highest current precedence" variable that stores a 1 when you push a +/-, a 2 when you push a * or /, and a 3 for "^". This way you can just test the variable to see if your next operator's precedence is <= your current precedence.
If ")" is considered priority 4, you can treat it like the other operators, except that it also removes the matching "(", which a lower-priority operator would not.

I wanted to respond to Bill K.'s answer, but I lack the reputation to add a comment there (that's really where this answer belongs). You can think of this as an addendum to Bill K.'s answer, because his was a little incomplete. The missing consideration is operator associativity; namely, how to parse expressions like:
49 / 7 / 7
Depending on whether division is left or right associative, the answer is:
49 / (7 / 7) => 49 / 1 => 49
or
(49 / 7) / 7 => 7 / 7 => 1
Typically, division and subtraction are considered to be left associative (i.e. case two, above), while exponentiation is right associative. Thus, when you run into a series of operators with equal precedence, you want to parse them in order if they are left associative or in reverse order if right associative. This just determines whether you are pushing or popping to the stack, so it doesn't overcomplicate the given algorithm, it just adds cases for when successive operators are of equal precedence (i.e. evaluate stack if left associative, push onto stack if right associative).
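As a rough illustration of the stack approach with both precedence and associativity, here is a minimal sketch. It is only one way to do it, not the code from either answer above. The Node class, the ExprTreeSketch name and the rightAssoc helper are made up for the example (the asker's PlusOp/LeafInt classes would take Node's place), and it assumes non-negative integer literals and binary operators only, so the unary '-' from the question is not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ExprTreeSketch {
    // Hypothetical node type standing in for classes such as PlusOp or LeafInt.
    static final class Node {
        final String label;
        final Node left, right;
        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return left == null ? label : "(" + left + " " + label + " " + right + ")";
        }
    }

    // 1 for + and -, 2 for * and /, 3 for ^; 0 means "not an operator".
    static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            case '^': return 3;
            default: return 0;
        }
    }

    // Assumption: only exponentiation is right associative.
    static boolean rightAssoc(char op) {
        return op == '^';
    }

    // Pops one operator and two operands, pushes the combined node.
    static void reduce(Deque<Character> ops, Deque<Node> nodes) {
        char op = ops.pop();
        Node right = nodes.pop();
        Node left = nodes.pop();
        nodes.push(new Node(String.valueOf(op), left, right));
    }

    static Node parse(String input) {
        Deque<Character> ops = new ArrayDeque<>();
        Deque<Node> nodes = new ArrayDeque<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                nodes.push(new Node(input.substring(start, i), null, null));
            } else if (c == '(') {
                ops.push(c);
                i++;
            } else if (c == ')') {
                while (!ops.isEmpty() && ops.peek() != '(') reduce(ops, nodes);
                if (ops.isEmpty()) throw new IllegalArgumentException("unmatched )");
                ops.pop(); // discard the matching "("
                i++;
            } else if (precedence(c) > 0) {
                // Reduce while the stacked operator binds tighter, or equally
                // tight when the incoming operator is left associative.
                while (!ops.isEmpty() && ops.peek() != '('
                        && (precedence(ops.peek()) > precedence(c)
                            || (precedence(ops.peek()) == precedence(c) && !rightAssoc(c)))) {
                    reduce(ops, nodes);
                }
                ops.push(c);
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        while (!ops.isEmpty()) {
            if (ops.peek() == '(') throw new IllegalArgumentException("unmatched (");
            reduce(ops, nodes);
        }
        return nodes.pop();
    }

    public static void main(String[] args) {
        System.out.println(parse("(5+2)*7"));
        System.out.println(parse("49/7/7"));
        System.out.println(parse("2^3^2"));
    }
}
```

The toString output is fully parenthesized, so the grouping chosen for 49/7/7 (left) and 2^3^2 (right) is visible directly. A missing operand shows up as an exception from the stack pop rather than a friendly message, so real validation would need more work.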

The "Five minute introduction to ANTLR" includes an arithmetic grammar example. It's worth checking out, especially since antlr is open source (BSD license).

Several options for you:
1. Re-use an existing expression parser. That would work if you are flexible on syntax and semantics. A good one that I recommend is the unified expression language built into Java (initially for use in JSP and JSF files).
2. Write your own parser from scratch. There is a well-defined way to write a parser that takes into account operator precedence, etc. Describing exactly how that's done is outside the scope of this answer. If you go this route, find yourself a good book on compiler design. Language parsing theory is going to be covered in the first few chapters. Typically, expression parsing is one of the examples.
3. Use JavaCC or ANTLR to generate lexer and parser. I prefer JavaCC, but to each their own. Just google "javacc samples" or "antlr samples". You will find plenty.
Between 2 and 3, I highly recommend 3 even if you have to learn new technology. There is a reason that parser generators have been created.
Also note that creating a parser that can handle malformed input (not just fail with a parse exception) is significantly more complicated than writing a parser that only accepts valid input. You basically have to write a grammar that spells out the various common syntax errors.
Update: Here is an example of an expression language parser that I wrote using JavaCC. The syntax is loosely based on the unified expression language. It should give you a pretty good idea of what you are up against.
Contents of org.eclipse.sapphire/plugins/org.eclipse.sapphire.modeling/src/org/eclipse/sapphire/modeling/el/parser/internal/ExpressionLanguageParser.jj

The given expression (5+2)*7 can be taken as the infix form:
Infix : (5+2)*7
Prefix : *+527
From the above we know the preorder and inorder traversals of the tree, and we can easily construct the tree from them.
Thanks,
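A small sketch of the idea, under some loud assumptions: every operator is binary and every operand is a single digit, in which case the prefix string alone is enough to rebuild the tree (each operator is followed by its left subtree and then its right subtree). The PrefixTreeSketch and Node names are invented for the example.

```java
class PrefixTreeSketch {
    // Hypothetical node type: a leaf has no children.
    static final class Node {
        final char symbol;
        final Node left, right;
        Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
        // Inorder walk with explicit parentheses, to check the result.
        String toInfix() {
            return left == null
                    ? String.valueOf(symbol)
                    : "(" + left.toInfix() + symbol + right.toInfix() + ")";
        }
    }

    private final String prefix;
    private int pos = 0;

    private PrefixTreeSketch(String prefix) {
        this.prefix = prefix;
    }

    static Node build(String prefix) {
        PrefixTreeSketch builder = new PrefixTreeSketch(prefix);
        Node root = builder.next();
        if (builder.pos != prefix.length()) {
            throw new IllegalArgumentException("trailing input after position " + builder.pos);
        }
        return root;
    }

    // Reads one subtree starting at pos: a digit is a leaf, anything else
    // is treated as a binary operator followed by its two operands.
    private Node next() {
        if (pos >= prefix.length()) {
            throw new IllegalArgumentException("operator is missing an operand");
        }
        char c = prefix.charAt(pos++);
        if (Character.isDigit(c)) {
            return new Node(c, null, null);
        }
        Node left = next();
        Node right = next();
        return new Node(c, left, right);
    }

    public static void main(String[] args) {
        System.out.println(build("*+527").toInfix());
    }
}
```

Getting from the infix input to the prefix string still requires handling precedence and parentheses, so this only covers the last step.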

Unexpected Type Error when using Unary Operator --

I am learning Java and am experimenting with the unary operators --expr, expr--, and -expr.
In class, I was told that --3 should evaluate to 3. I wanted to test this concept in the following assignments:
jshell> int t = 10;
t ==> 10
| created variable t : int
jshell> int g = -3;
g ==> -3
| created variable g : int
jshell>
jshell> int d = --3;
| Error:
| unexpected type
| required: variable
| found: value
| int d = --3;
| ^
jshell> int d = --t;
d ==> 9
| created variable d : int
jshell> int f = d---t;
f ==> 0
| created variable f : int
jshell> int f = 1---t;
| Error:
| unexpected type
| required: variable
| found: value
| int f = 1---t;
| ^
| update overwrote variable f : int
My questions:
1. Why does assigning -3 work and not --3? I thought --3 would give 3.
2. Are there cases where --expr can be evaluated as double negation instead of decrement?
3. Why can't values suffice where the unexpected type errors were thrown?
4. How did Java evaluate d---t? Also, in what order?
For question 4, the way I thought of it was a right-to-left evaluation. So, if d = 9 and t is 9, the rightmost - operator is the first to act on t, making its value -9. Then the same for the second -, so then t's value becomes 9 again. Then I thought the compiler would notice that d is next to the leftmost operator and subtract the values. This would be 9-9, which evaluates to 0. Jshell shows the expression also evaluated to 0, but I want to make sure my reasoning is correct or can be improved.

-- is taken as the decrement operator. Adding spaces or using brackets will allow it to be interpreted as double negation.
int x = - -3;
//or
int x = -(-3);

Why does assigning -3 work and not --3? I thought --3 would give 3.
Java tokenizes the input. A source file consists of a sequence of relevant atomic units, which we shall call words. In y +foobar, we have 3 'words': y, +, and foobar. Tokenizing is the job of splitting them up.
This process is somewhat complicated; whitespace (obviously) separates things, but whitespace isn't necessary if the two 'words' don't share any legal characters. Thus, 5+2 is legal as is 5 + 2, and public static works, but publicstatic does not. This tokenization step occurs first and is done on limited knowledge. You could in theory surmise from context that, say, publicstatic void foo() {} can't really mean anything else, but the amount of knowledge you need to draw that conclusion is considerable, and tokenization just does not have it. Hence, publicstatic does not work, but 5+2 does.
Based on that rule on tokenization, int y = --3; and int y = - -3; are different for the same reason publicstatic and public static are different. -- is a single 'word', meaning: unary decrement. You can't split it up (you can't put spaces in between the two minus signs), and 2 consecutive minus signs without spaces in between are going to be tokenized as the unary decrement operator, and not as 2 consecutive binary-minus/unary-negative words. You COULD draw the conclusion that int y = --3; only has one non-stupid interpretation (2 minus signs), because the other obvious interpretation (unary decrement the expression '3') is a compiler error: you can't decrement a constant. But that goes back to the earlier rule: if you have to take into account all complications at all times, parsing java source files becomes incredibly complicated, and for no meaningful gain. You really do not want to write source code where things are interpreted correctly or not based on exactly how intelligent the compiler ended up being. It ain't English; you want consistency and clarity at all times. Poetic license is not a good thing when talking directly to computers.
CONCLUSION: --3 does not work. It never can. Whoever told you that just messed up, or you misread it, and they were talking about - -3 and you didn't notice the space, or it got lost in translation.
Are there cases where --expr can be evaluated as double negation instead of decrement?
Not like that, no. - -expr will, -(-expr) will, but -- written just like that, no spaces, no parens, nothing in between? No. Because of the tokenizer.
Why can't values suffice where the unexpected type errors were thrown?
Because there'd be absolutely no point to this, and it would make javac (the part that turns your bag-o-characters into a tree structure and from there, into a class file) a few orders of magnitude more complicated. It's a computer, not English. You don't want 'best effort', and the few languages that do this (javascript, PHP) are incredibly stupid languages, universally derided, for trying. These languages are still popular, but that's despite the fact that they try their best instead of just having a clear spec and failing when the programmer fails to adhere to it. You'll find plenty of talks about such languages, by fans of them, making fun of the corner cases. In javascript, there's the famous WAT talk. For java there is the java puzzlers book. Both are designed to teach something by showing off how you can use the language to write idiotic code that is hard to read. Both were written by fans of these languages. It's a bit much to try to give you some sort of proof that you do not want a language to take its best wild stab in the dark at what you meant, but hopefully a few popular books and talks will go some way in making you realize it works like this.
How did Java evaluate d---t? Also, in what order?
Now we're getting into specifics. Java's tokenizer is such that it will tokenize that as d -- - t or d - -- t, and that's because --- is not a known word, and the java tokenizer is based on splitting things up by applying a list of known symbols and keywords to the job.
So which one does it tokenize to? It doesn't matter. Why do you want to know? So that you can write it? Don't - what possible purpose does that serve? If you write intentionally obfuscated code, you will be fired, and it'll be justified. If you are trying to train yourself to find that code perfectly readable, that's great! But less than 1% of your average java coder, even expert ones, do this, so now you've written code nobody can read, and you find code others wrote aggravating because nobody writes it like you do. So, you're unemployable as a programmer / you can't employ anybody to help you. You can build an entire career writing software on your own for the rest of your life, but it's quite limiting. Why bother?
It gets tokenized into d -- - t: the tokenizer takes the longest token it can at each step, so the first two minus signs become a single --.
If you find this sort of thing entertaining (I certainly do!), then know full well this is no more useful than playing a silly game, there is no academic or career point to it, at all.
The problem is, you don't seem to find it entertaining. If you did, you'd have done a trivial experiment to find out. If it's d -- - t (subtract X and Y, where X is d, post-unary-decrement, and Y is t), then after all that, d will be one smaller, which you can trivially test for. If it's d - --t, then t would be one smaller afterwards. If it's d - -(-t) (as in, d minus Z, where Z is negative Y, where Y is negative t), then neither will have changed. You didn't do this experiment. That eliminates 'you find it fun'. I've eliminated 'this is useful', which leaves us with: This question does not matter, therefore neither does the answer.
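The experiment described above is short enough to write down. This is just a sketch with made-up names (TripleMinusSketch, run); the starting values match the jshell session in the question, where d and t are both 9. The Java Language Specification says the lexer takes the longest possible token at each step, so d---t should be read as (d--) - t, and the check is simply to see which variable changed afterwards.

```java
class TripleMinusSketch {
    // Returns {f, d, t} after evaluating f = d---t with d = 9 and t = 9.
    static int[] run() {
        int d = 9;
        int t = 9;
        int f = d---t; // read by the lexer as (d--) - t
        return new int[] { f, d, t };
    }

    public static void main(String[] args) {
        int[] r = run();
        // f is 0, d has dropped to 8, t is still 9
        System.out.println("f=" + r[0] + " d=" + r[1] + " t=" + r[2]);
    }
}
```

Since d is the one that ends up smaller and t is untouched, the d - --t and d - -(-t) readings are ruled out.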
There is a small chance this question shows up in a curriculum of some sort. If it does, do yourself a giant favour and find another curriculum. It's an incredibly strong indicator of extremely low quality java teaching, if you expect your students to know how d---t breaks down.
You talk about 'reasoning', but reasoning has no place here. d -- - t, d - -- t and d - -(-t) are all equally valid interpretations, reason isn't going to tell you which one is correct. The language designers threw some dice to decide. They knew it didn't matter (Except for the last one, which isn't possible without a 'smart' tokenizer, and you don't want one of those, but the reasoning needed to draw the conclusion that you need a smart tokenizer, let alone the reasoning needed to draw the conclusion that smart tokenizers are a bad idea, is incredibly complicated and requires a ton of experience writing parsers from scratch, or possibly a mere full year's worth of parser language courses at a uni level might give you the wherewithal to figure that one out - I think you can be excused for not picking up that detail :P).

How to break long lines in Java

In the official document of Oracle about coding conventions they write:
"Following are two examples of breaking an arithmetic expression. The first is preferred, since the break occurs outside the parenthesized expression, which is at a higher level."
longName1 = longName2 * (longName3 + longName4 - longName5)
           + 4 * longname6; // PREFER
longName1 = longName2 * (longName3 + longName4
                       - longName5) + 4 * longname6; // AVOID
What do they mean by "higher level"? Is it related to the order of evaluation in the expression?

From a previous statement from the official documentation:
"Prefer higher-level breaks to lower-level breaks"
In other words, avoid breaking nested expressions due to readability.
The more deeply nested in parentheses an expression is, the lower its level.

Yes - it sounds like they're referring to the order of operations here. Keeping operators of similar precedence grouped together on the same line increases visual readability.

The issue is whether we break the line inside or outside of the following term:
(longName3 + longName4 - longName5)
The documentation suggests that it is preferable not to break the above term wrapped in parentheses, but rather that the break should occur at a higher level. It does not say why this is preferable; both versions of the code you posted are logically identical. One possibility is that breaking at the higher level leaves the code easier to read.

"higher level" refers to a higher level of precedence. At the very end of https://www.tutorialspoint.com/java/java_basic_operators.htm there is a section on this topic.

No. The order of evaluation is the same either way; "higher level" refers to the nesting level of the expression, which in this case means "outside the parenthesized expression".
This code breaks the line at the low level:
longName1 = longName2 * (longName3 + longName4
                       - longName5) + 4 * longname6; // AVOID
This code breaks the line at the high level:
longName1 = longName2 * (longName3 + longName4 - longName5)
           + 4 * longname6; // PREFER
Breaking at the high level is preferred because the low-level (parenthesized) part then stays together on one line, which is easier to read.

I think they mean that the first string is better because of the layout and it is just more organized. I program in java sometimes and it can get messy so more organized strings are preferred even though others may work.

Converting infix to binary tree

How can you convert an infix expression into a tree? I would like to do it manually rather than programming first. For example, let's look at this infix expression:
b = (x * a) - y / b * (c + d)
What are the rules to turning this into a tree? Or what steps do you suggest to take to do this? I'm having trouble here because sometimes these expressions don't have explicit parentheses in them:
b = x * a - y / b * c + d

The problem that you're describing here is generally called expression parsing and typically there are two steps in the process:
First, there's scanning, where you take your input string and break it apart into a bunch of smaller logical units, each of which represents a single "piece" of the input. For example, given the input string
b = x * a - y / b * c + d
You might produce this sequence of tokens:
[b] [=] [x] [*] [a] [-] [y] [/] [b] [*] [c] [+] [d]
This way, you move from "the input is a sequence of characters" to "the input is a sequence of individual variables, operators, etc." There are many ways to do this step, and they often involve doing some manual string processing or working with regular expressions (if you're familiar with those). As a first step, see if you can get this part working.
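For the scanning step, a regular-expression version might look like the sketch below. The ScannerSketch class and its token pattern are assumptions made up for this example: identifiers, unsigned integers, and the single-character symbols + - * / = ( ) are the only tokens it recognises.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScannerSketch {
    // Optional leading whitespace, then one token captured in group 1:
    // an identifier, an unsigned integer, or a single-character symbol.
    private static final Pattern TOKEN =
            Pattern.compile("\\s*([A-Za-z_]\\w*|\\d+|[-+*/=()])");

    static List<String> scan(String input) {
        String s = input.trim(); // so trailing whitespace is not an error
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at index " + pos);
            }
            tokens.add(m.group(1));
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("b = x * a - y / b * c + d"));
    }
}
```

The resulting list is what the parsing step would then consume one token at a time.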
The second step, which is probably the one you're most hung up on, is parsing, where you take that sequence of tokens and reconstruct the meaning of the statement. This usually involves figuring out the operator precedence and actually building up the tree structure you want. That tree, by the way, is often called an expression tree or an abstract syntax tree (AST).
There are many ways to do this. For parsing expressions, my personal go-to is Dijkstra's shunting-yard algorithm. This algorithm works by maintaining two stacks and processing the tokens one at a time, using the stacks to determine what the operators should be applied to.
If you'd like to see an example of how to do this, I built a truth table generator for a discrete math class that I regularly teach. You type in a logical expression, and the code scans it to get a token sequence, then uses the shunting-yard algorithm to build up an AST, which is then used to generate the truth table. The source code is broken down so that each step is done separately, and that might make for a good reference.

The user input for calculating the tangent of a graph

I am making a program that calculates the equation for the tangent of a graph at a given point and ideally I'd want it to work for any type of graph. e.g. 1/x ,x^2 ,ln(x), e^x, sin, tan. I know how to work out the tangent and everything but I just don't really know how to get the input from the user.
Would I have to have options where they choose the type of graph and then fill in the coefficients for it e.g. "Choice 1: 1/(Ax^B) Enter the values of A and B"? Or is there a way so that the program recognises what the user types in so that instead of entering a choice and then the values of A and B, the user can type "1/3x^2" and the program would recognise that the A and B are 3 and 2 and that the graph is a 1/x graph.
This website is kind of an example of what I would like to be able to do: https://www.symbolab.com/solver/tangent-line-calculator
Thanks for any help :)

Looks like you want to evaluate the expression. In that case, you could look into Dijkstra's Shunting-Yard algorithm to convert the expression to postfix notation, and then evaluate the expression using stacks. Alternatively, you can use a library such as exp4j. There are multiple tutorials for it, but remember that you need to add support for both binary and unary operations (binary meaning an operation that takes 2 operands, while unary is something like sin(x)).
Then, after you evaluate the expression, you can use first principles to solve. I have an example of this system working without exp4j on my github repository. If you go back in the commit history, you can see the implementation with exp4j as well.

Parsing a formula from user input is itself a problem much harder than calculating the tangent. If this is an assignment, see if the wording allows for the choice of the functions and its parameters, as you're suggesting, because otherwise you are going to spend 10% of time writing code for calculating the derivative and 90% for reading the function from the standard input.
If it's your own idea and you'd like to try your hand at it, a teaser is that you will likely need to design a whole class structure for different operators, constants, and the unknown. Keep a stack of mathematical operations, because in 1+2*(x+1)+3 the multiplication needs to happen before the outer additions, but after the inner one. You'll have to deal with reading non-uniform input that has a high level of freedom (in whitespace, omission of * sign, implicit zero before a –, etc.) Regular expressions may be of help, but be prepared for a debugging nightmare and a ton of special cases anyway.
If you're fine with restricting your users (yourself?) to valid expressions following JavaScript syntax (which your examples are not, due to the implied multiplication and the haphazard rules of precedence thereof to the 1/...) and you can trust them absolutely in having no malicious intentions, see this question. You wouldn't have your expression represented as a formula internally, but you would still be able to evaluate it in different points x. Then you can approximate the derivative by (f(x+ε) - f(x)) / ε with some sufficiently small ε (but not too small either, using trial and error for convergence). Watch out for points where the function has a jump, but in basic principle this works, too.
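The approximation (f(x+ε) - f(x)) / ε is easy to try once the function is available as something callable. In this sketch the function is passed in as a Java lambda rather than parsed from user input, and the TangentSketch class, its method names and the choice ε = 1e-6 are all just illustrative assumptions.

```java
import java.util.function.DoubleUnaryOperator;

class TangentSketch {
    // Forward-difference approximation of f'(x).
    static double slope(DoubleUnaryOperator f, double x, double eps) {
        return (f.applyAsDouble(x + eps) - f.applyAsDouble(x)) / eps;
    }

    // Returns {m, b} for the tangent line y = m*x + b at x0.
    static double[] tangentLine(DoubleUnaryOperator f, double x0, double eps) {
        double m = slope(f, x0, eps);
        double b = f.applyAsDouble(x0) - m * x0;
        return new double[] { m, b };
    }

    public static void main(String[] args) {
        // f(x) = x^2 at x0 = 3: the exact tangent is y = 6x - 9,
        // so the printed values should be close to 6 and -9.
        double[] line = tangentLine(x -> x * x, 3.0, 1e-6);
        System.out.println("m = " + line[0] + ", b = " + line[1]);
    }
}
```

The result is only approximate, and as noted above the right ε depends on the function and the point, so some trial and error is still needed.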

Is this Shunting Yard's fault or my own?

Given the expression:
1/2/3/4*5
It reaches the end of the expression and attempts to multiply out the 4 and 5 first WHICH IS WRONG because it begins to pop off the stack. I'm not necessarily doing RPN but just evaluating on the spot. How can I prevent this?
// Expression was completely read - so we should try and make sense of
// this now
while (operatorStack.size() != 0) {
    ApplyOperation(operatorStack, operandStack);
}
At this point, I begin to pop off operators and operands. Since multiplication and division have the same precedence, they start with multiplication.
A trace:
1/2/3/4*5
Applying * to 5 and 4
Result: 20
Applying / to 20 and 3
Result: 3/20
Applying / to 3/20 and 2
Result: 40/3
Applying / to 40/3 and 1
Result: 3/40

There is a point in the shunting yard algorithm at which you compare the precedence of the operator at the top of the stack with the precedence of the operator in the input stream, and decide whether to pop the stack (evaluate the stacked operator, in your case), or push the new operator.
It makes a big difference if the comparison is < or <=. One of those will produce left-associativity, and the other will produce right-associativity. Since you're getting right-associativity and you want left-associativity, I'm guessing (without seeing your code) that you're using the wrong comparison operator.
By the way, your professor is quite correct. There is no need to explicitly produce RPN, and the evaluation algorithm will indeed pop the entire stack when it reaches the end of input. (The RPN algorithm would also do that; the evaluation algorithm is simply a shortcut.)
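To make the effect of that one comparison concrete, here is a small evaluator sketch with the choice exposed as a flag. It is not the asker's code: the AssociativitySketch class and the popOnEqual parameter are invented for the example, and it assumes single-digit operands with no whitespace or parentheses.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class AssociativitySketch {
    static int precedence(char op) {
        return (op == '+' || op == '-') ? 1 : 2;
    }

    // Pops one operator and two operands, pushes the result.
    static void apply(Deque<Character> ops, Deque<Double> vals) {
        char op = ops.pop();
        double right = vals.pop();
        double left = vals.pop();
        switch (op) {
            case '+': vals.push(left + right); break;
            case '-': vals.push(left - right); break;
            case '*': vals.push(left * right); break;
            case '/': vals.push(left / right); break;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    // popOnEqual = true  pops stacked operators of equal precedence (left associative).
    // popOnEqual = false pops only strictly higher precedence (right associative).
    static double evaluate(String expr, boolean popOnEqual) {
        Deque<Character> ops = new ArrayDeque<>();
        Deque<Double> vals = new ArrayDeque<>();
        for (char c : expr.toCharArray()) {
            if (Character.isDigit(c)) {
                vals.push((double) (c - '0'));
                continue;
            }
            while (!ops.isEmpty()
                    && (precedence(ops.peek()) > precedence(c)
                        || (popOnEqual && precedence(ops.peek()) == precedence(c)))) {
                apply(ops, vals);
            }
            ops.push(c);
        }
        while (!ops.isEmpty()) {
            apply(ops, vals);
        }
        return vals.pop();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1/2/3/4*5", true));  // about 0.2083, i.e. 5/24
        System.out.println(evaluate("1/2/3/4*5", false)); // about 0.075, i.e. 3/40
    }
}
```

With the flag off, nothing is applied until the end of input, which reproduces the 3/40 from the trace in the question; with it on, each division is applied as soon as the next operator of equal precedence arrives.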

What operatorStack? The RPN resulting from the shunting-yard algorithm is a list, not a stack. It is processed from left to right, not LIFO.
