Building a tree from left parenthetic string representation - Java

I need to build the right parenthetic representation of a tree from its left parenthetic representation. Basically this means parsing a String input and then rebuilding it as a right parenthetic representation. I need to implement two methods: one that parses the input and one that creates the needed representation from the parsed input. This is part of a homework assignment I need to do in Java.
This is how I would test it:
String s = "A(B1,C)";
Node t = Node.parse (s);
String v = t.rightParentheticRepresentation();
System.out.println (s + " ==> " + v); // A(B1,C) ==> (B1,C)A
So I need to implement two methods: Node parse(String s) and String rightParentheticRepresentation().
I have some idea in theory of how to go about it, but I am struggling to implement the parsing method.
Are there any existing implementations I could use? Any hint on an implementation approach is very welcome, as is a pointer to a good tutorial on building trees from a string representation.

First you should get an idea of the data structure that you want to build. Basically, you want a tree where each node corresponds to the content inside a pair of parentheses (the outermost parentheses are implicit in your sample: '(' A(B1,C) ')').
For the parsing method: read the input String char by char. Whenever you meet an opening parenthesis '(' you create a child of the current node and make the new node the current one, then start filling it. When you meet a closing parenthesis ')' you finalize the current node and go back to its parent.
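A minimal sketch of that char-by-char idea, adapted to the A(B1,C) format where a node's name comes before its parenthesis. The field names and the stack of "open" nodes are assumptions made for illustration, not part of the assignment, and validation of malformed input is deliberately thin.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class Node {
    private final String name;
    private final List<Node> children = new ArrayList<>();

    Node(String name) {
        this.name = name;
    }

    /** Parses a left parenthetic string such as "A(B1,C)". */
    static Node parse(String s) {
        Deque<Node> open = new ArrayDeque<>(); // nodes whose '(' is not closed yet
        StringBuilder name = new StringBuilder();
        Node root = null;
        Node last = null; // node created from the most recently completed name

        for (int i = 0; i <= s.length(); i++) {
            // Treat the end of the input like a separator so the last name is flushed.
            char c = i < s.length() ? s.charAt(i) : ',';
            if (c != '(' && c != ')' && c != ',') {
                name.append(c);
                continue;
            }
            if (name.length() > 0) { // a name just ended: create its node
                last = new Node(name.toString());
                name.setLength(0);
                if (open.isEmpty()) {
                    root = last;
                } else {
                    open.peek().children.add(last);
                }
            }
            if (c == '(') {
                open.push(last); // the names that follow are children of 'last'
            } else if (c == ')') {
                open.pop();      // this node is complete, go back to its parent
            }
        }
        if (root == null || !open.isEmpty()) {
            throw new IllegalArgumentException("Malformed input: " + s);
        }
        return root;
    }

    /** Children first (in parentheses), then the node's own name. */
    String rightParentheticRepresentation() {
        StringBuilder sb = new StringBuilder();
        if (!children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).rightParentheticRepresentation());
            }
            sb.append(')');
        }
        return sb.append(name).toString();
    }
}
```

With this sketch, Node.parse("A(B1,C)").rightParentheticRepresentation() gives (B1,C)A, and A(B(D,E),C) gives ((D,E)B,C)A.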

Related

Creating a String expression when given an expression tree

I am having trouble creating a String expression when given an expression tree. If my expression tree looks like this (in the output console):
(*(+(5)(-(2)(3)))(6))
How do I create a method that goes through this to create an expression that is in normal format? For example, like this:
(2 - 3 + 5) * 6
Should I be working with the actual expression tree or with the String form of the expression tree (shown above as (*(+(5)(-(2)(3)))(6)))?
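You should use a prefix-to-infix conversion algorithm, because your expression tree string is in prefix form and you want it in infix form.
You can remove all the braces in the input string to make it easier, as long as every operand is a single character; with multi-character operands the braces are what separates one operand from the next.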
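As a rough illustration of prefix-to-infix conversion on this string format, here is a small recursive sketch. Unlike the suggestion above, it keeps the parentheses and uses them to find where each operand ends. The class and method names are made up for the example, it assumes every operator has exactly two operands, and it emits fully parenthesized output rather than dropping redundant brackets.

```java
class PrefixToInfix {
    private final String s;
    private int pos = 0;

    private PrefixToInfix(String s) {
        this.s = s;
    }

    /** Converts e.g. "(+(2)(3))" to "(2 + 3)". Assumes well-formed input. */
    static String convert(String prefix) {
        PrefixToInfix p = new PrefixToInfix(prefix);
        String result = p.node();
        if (p.pos != prefix.length()) {
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        }
        return result;
    }

    // node ::= '(' label ')'              (a leaf such as "(5)")
    //        | '(' label node node ')'    (a binary operator)
    private String node() {
        expect('(');
        StringBuilder label = new StringBuilder();
        while (pos < s.length() && s.charAt(pos) != '(' && s.charAt(pos) != ')') {
            label.append(s.charAt(pos++));
        }
        if (pos < s.length() && s.charAt(pos) == ')') {
            pos++;
            return label.toString(); // leaf: just the operand
        }
        String left = node();
        String right = node();
        expect(')');
        return "(" + left + " " + label + " " + right + ")";
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }
}
```

For the string in the question, PrefixToInfix.convert("(*(+(5)(-(2)(3)))(6))") returns ((5 + (2 - 3)) * 6). Removing the brackets that precedence makes unnecessary would be a separate step.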
On that topic, I advise you to read the following documents.
Shunting-yard algorithm: https://en.wikipedia.org/wiki/Shunting-yard_algorithm
This algorithm stacks tokens according to their precedence; for example, an expression between parentheses is evaluated first. On precedence, read these:
https://en.wikipedia.org/wiki/Order_of_operations
http://introcs.cs.princeton.edu/java/11precedence/ (This one is specific for programming)
I hope I have helped.
Have a nice day. :)

How to get the expression tree of a regex in Java?

I'm working on an algorithm that converts a regular expression into a DFA. The algorithm is limited to the operators *, | and . (concatenation).
Those who do not know the meaning of each operator can check this.
The algorithm analyzes the nodes of a tree that is built from the regex.
I attached an image showing a table with the functions voidable (nullable), first position and last position, which are applied to each node of the tree.
For example, take the regex (a|b)*a(a|b) and analyze it with the table.
The first step of the algorithm is to add the symbol # at the end, giving (a|b)*a(a|b)#, and to number each symbol.
Then the tree is constructed (my problem) and each node is evaluated with the table above. In the resulting picture, PmraPos (first position) is shown in {} to the right of each node and UtmaPos (last position) in {} to the left.
Problem: In Java I tried to build the tree with a Stack, but did not get good results because, as you can see in the picture, not all nodes have two children. I would like help building the tree.
Note: To try to build the tree, I first converted the regular expression to postfix form.
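One possible way to deal with nodes that do not have two children is to treat * as a unary operator when consuming the postfix form: it pops a single operand, while | and . pop two. The sketch below assumes the postfix string uses . for explicit concatenation and that every other character is a literal symbol; the class name RegexTree and its fields are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RegexTree {
    final char symbol;
    final RegexTree left;  // the only child of a '*' node
    final RegexTree right; // null for leaves and for '*'

    RegexTree(char symbol, RegexTree left, RegexTree right) {
        this.symbol = symbol;
        this.left = left;
        this.right = right;
    }

    /** Builds the tree from a postfix regex such as "ab|*a.". */
    static RegexTree fromPostfix(String postfix) {
        Deque<RegexTree> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            if (c == '*') {
                // Unary operator: one operand only.
                stack.push(new RegexTree(c, stack.pop(), null));
            } else if (c == '|' || c == '.') {
                // Binary operators: the right operand is on top of the stack.
                RegexTree r = stack.pop();
                RegexTree l = stack.pop();
                stack.push(new RegexTree(c, l, r));
            } else {
                stack.push(new RegexTree(c, null, null)); // leaf symbol
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("Malformed postfix: " + postfix);
        }
        return stack.pop();
    }

    /** Fully parenthesized infix form, handy for checking the shape of the tree. */
    @Override
    public String toString() {
        if (left == null) return String.valueOf(symbol);
        if (right == null) return left + "*";
        return "(" + left + symbol + right + ")";
    }
}
```

For (a|b)*a(a|b)# the postfix form with explicit concatenation is ab|*a.ab|.#. and RegexTree.fromPostfix("ab|*a.ab|.#.") prints as ((((a|b)*.a).(a|b)).#).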

Shunting-yard functions

I am using the Shunting-Yard algorithm (https://en.wikipedia.org/wiki/Shunting-yard_algorithm) in a Java program in order to create a calculator. I am almost done, but I still have to implement functions. I have run into a problem: I want the calculator to automatically multiply variables like x and y when they are put together. Example: the calculator converts xy to x*y. Also, I want the calculator to convert (x)(y) to (x)*(y) and x(y) to x*(y). I have done all of this using the following code:
infix = infix.replaceAll("([a-zA-Z])([a-zA-Z])", "$1*$2");
infix = infix.replaceAll("([a-zA-Z])\\(", "$1*(");
infix = infix.replaceAll("\\)\\(", ")*(");
infix = infix.replaceAll("\\)([a-zA-Z])", ")*$1");
(In my calculator, variable names are always single characters.)
This works great right now, but when I implement functions this will, of course, not work. It will turn "sin(1)" into "s*i*n*(1)". How can I make this code apply the multiplication conversion only to variables, and not to function names?
Preprocessing the input to parse isn't a good way to implement what you want. The text replacement can't know what the parsing algorithm knows and you also lose the original input, which can be useful for printing helpful error messages.
Instead, you should decide what to do according to the context. Keep the type of the previously parsed token, with a special type for the beginning of the input.
If the previous token was a value token (a number, a variable name or the closing brace of a subexpression) and the current one is a value token, too, emit an extra multiplication operator.
The same logic can be used to decide whether a minus sign is a unary negation or a binary subtraction: It's a subtraction if the minus is found after a value token and a negation otherwise.
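A small sketch of this previous-token idea, under several assumptions: variables are single letters, an opening parenthesis counts as the start of a value, functions are not handled, and the token names (in particular "neg" for unary minus) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ImplicitMultiply {
    /** Tokenizes the input, inserting "*" between adjacent values and marking unary minus as "neg". */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        boolean prevIsValue = false; // false also stands for "beginning of input"
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                if (prevIsValue) out.add("*");
                out.add(input.substring(start, i));
                prevIsValue = true;
            } else if (Character.isLetter(c)) {
                if (prevIsValue) out.add("*");
                out.add(String.valueOf(c)); // single-character variable
                prevIsValue = true;
                i++;
            } else if (c == '(') {
                if (prevIsValue) out.add("*");
                out.add("(");
                prevIsValue = false; // a value is expected next
                i++;
            } else if (c == ')') {
                out.add(")");
                prevIsValue = true;  // a closed subexpression is a value
                i++;
            } else if (c == '-') {
                // Subtraction after a value, negation otherwise.
                out.add(prevIsValue ? "-" : "neg");
                prevIsValue = false;
                i++;
            } else {
                out.add(String.valueOf(c)); // any other operator
                prevIsValue = false;
                i++;
            }
        }
        return out;
    }
}
```

For example, tokenize("2x(y)-z") yields [2, *, x, *, (, y, ), -, z] and tokenize("-x") yields [neg, x]. The original input string is left untouched, so it stays available for error messages.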
Your idea to convert x(y) to x * (y) will, of course, clash with function call syntax.
We can break this down into two parts. There is one rule for bracketed expressions and another for multiplications.
Rather than the Wikipedia article, which is deliberately simplified for explanatory purposes, I would follow a more detailed example like Parsing Expressions by Recursive Descent, which deals with bracketed expressions.
This is the code I use for my parser which can work with implicit multiplication. I have multi-letter variable names and use a space to separate different variables so you can have "2 pi r".
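protected void expression() throws ParseException {
    prefixSuffix();                  // parse the first operand
    Token t = it.peekNext();
    while (t != null) {
        if (t.isBinary()) {
            pushOp(t);               // explicit binary operator
            it.consume();
            prefixSuffix();
        }
        else if (t.isImplicitMulRhs()) {
            pushOp(implicitMul);     // nothing to consume: there is no operator token
            prefixSuffix();
        }
        else
            break;
        t = it.peekNext();
    }
    // apply the operators still pending above the sentinel
    while (!sentinel.equals(ops.peek())) {
        popOp();
    }
}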
This requires a few other functions.
I've used a separate tokenizing step which breaks the input into discrete tokens. The Token class has a number of methods; in particular, Token.isBinary() tests if the token is a binary operator like +,=,*,/. Another method, Token.isImplicitMulRhs(), tests if the token can appear on the right hand side of an implicit multiplication; this will be true for numbers, variable names, and left brackets.
An Iterator<Token> is used for the input stream. it.peekNext() looks at the next token and it.consume() moves to the next token in the input.
pushOp(Token) pushes a token onto the operator stack and popOp removes one. pushOp has the logic to handle the precedence of different operators: before pushing, it pops operators off the stack for as long as compareOps(ops.peek(), op) returns true.
protected void pushOp(Token op)
{
    while (compareOps(ops.peek(), op))
        popOp();
    ops.push(op);
}
Of particular note is implicitMul, an artificial token with the same precedence as multiplication, which is pushed onto the operator stack.
prefixSuffix() handles expressions which can be numbers and variables with optional prefix or suffix operators. This will recognise "2", "x", "-2", "x++", removing tokens from the input and adding them to the output/operator stack as appropriate.
We can think of this routine in BNF as
<expression> ::=
<prefixSuffix> ( <binaryOp> <prefixSuffix> )* // normal binary ops x+y
| <prefixSuffix> ( <prefixSuffix> )* // implicit multiplication x y
Handling brackets is done in prefixSuffix(). If this detects a left bracket, it will then recursively call expression(). To detect the matching right bracket, a special sentinel token is pushed onto the operator stack. When the right bracket is encountered in the input, the main loop breaks, all operators on the operator stack are popped until the sentinel is encountered, and control returns to prefixSuffix(). Code for this might look like
void prefixSuffix() throws ParseException {
    Token t = it.peekNext();
    if (t.equals('(')) {
        it.consume();           // advance the input
        ops.push(sentinel);     // mark where this bracketed expression starts
        expression();           // parse until ')' encountered
        t = it.peekNext();
        if (t != null && t.equals(')')) {
            it.consume();       // advance the input
            return;
        } else throw new ParseException("Unmatched ("); // assumes a ParseException(String) constructor
    }
    // handle variable names, numbers etc
}
Another approach may be to use tokens, in a similar way to how a parser works.
The first phase would be to convert the input text into a list of tokens, which are objects that represent both the type of entity found and its value.
For example you can have a variable token, with its value being the name of the variable ('x', 'y', etc.), a token for open or close parenthesis, etc.
Since, I assume, you know in advance the names of the functions that can be used by the calculator, you'll also have a function token, with its value being the function name.
So the output of the tokenizing phase differentiates between variables and functions.
Implementing this is not too hard: just always try to match function names first, so "sin" will be recognized as a function and not as three variables.
Now the second phase can be to insert the missing multiplication operators. This will not be hard now, since you know you just need to insert them between:
{VAR, RIGHT_PAREN} and {VAR, LEFT_PAREN, FUNCTION}
But never between FUNCTION and LEFT_PAREN.
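A minimal sketch of the two phases, assuming single-letter variables and a fixed, illustrative list of function names (sin, cos, tan). The class, enum and method names are invented for the example. The insertion rule follows the two sets given above exactly, so numbers are tokenized but never trigger an implicit multiplication here.

```java
import java.util.ArrayList;
import java.util.List;

class ImplicitMulTokens {
    enum Type { VAR, NUMBER, FUNCTION, LEFT_PAREN, RIGHT_PAREN, OPERATOR }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    // Assumption: the calculator knows its function names in advance.
    static final String[] FUNCTIONS = {"sin", "cos", "tan"};

    /** Phase 1: split the input into typed tokens, trying function names first. */
    static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            String fn = functionAt(s, i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (fn != null) {
                tokens.add(new Token(Type.FUNCTION, fn));
                i += fn.length();
            } else if (Character.isLetter(c)) {
                tokens.add(new Token(Type.VAR, String.valueOf(c)));
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                tokens.add(new Token(Type.NUMBER, s.substring(start, i)));
            } else {
                Type type = c == '(' ? Type.LEFT_PAREN
                          : c == ')' ? Type.RIGHT_PAREN : Type.OPERATOR;
                tokens.add(new Token(type, String.valueOf(c)));
                i++;
            }
        }
        return tokens;
    }

    private static String functionAt(String s, int i) {
        for (String f : FUNCTIONS) {
            if (s.startsWith(f, i)) return f;
        }
        return null;
    }

    /** Phase 2: insert "*" between {VAR, RIGHT_PAREN} and {VAR, LEFT_PAREN, FUNCTION}. */
    static List<Token> insertMultiplication(List<Token> in) {
        List<Token> out = new ArrayList<>();
        Token prev = null;
        for (Token t : in) {
            boolean leftOk = prev != null
                    && (prev.type == Type.VAR || prev.type == Type.RIGHT_PAREN);
            boolean rightOk = t.type == Type.VAR || t.type == Type.LEFT_PAREN
                    || t.type == Type.FUNCTION;
            if (leftOk && rightOk) out.add(new Token(Type.OPERATOR, "*"));
            out.add(t);
            prev = t;
        }
        return out;
    }

    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) sb.append(t.text);
        return sb.toString();
    }
}
```

With these assumptions, "xsin(y)(z)" becomes x*sin(y)*(z) while "sin(1)" is left unchanged. One consequence of matching function names first is that the variables s, i and n can no longer be written next to each other in that order.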

Dumb it down for me: What is parsing?

Yesterday I asked about grammar, and today in Java, I am learning how to implement an algorithm for parsing the grammar, using tokens from the lexical analyzer that I finished.
For this question, I need a person to check my understanding.
Let's suppose given the Scheme grammar:
exp -> ( rest
| #f
| #t
| ' exp
| integer_constant
| string_constant
| identifier
rest -> )
| exp+ [ . exp ] )
Would the following pseudo code be correct? I studied the recursive descent parser, and would need to make a parse tree for the interpreter by creating nodes of the parse tree.
Node parseExp() {
check to see if the token is left parenthesis
if true, return a node for Cons (which is a non-terminating node in Scheme
parse tree) and call parseRest()
else check to see if the token is #f
if true, return a node for Boolean with stored value #f
else check to see if the token is #t
if true, return a node for Boolean with stored value #t
else check to see if the token is quote
if true, return a node for Quote and recursively call parseExp()
else check to see if the token is integer_constant
if true, return a node for Integer with stored value int
else check to see if the token is string_constant
if true, return a node for String with stored string value
else check to see if the token is identifier
if true, return a node for identifier with stored string value
else
print error message saying a Syntax error occurred
return null
}
Node parseRest() {
check to see if the token is right parenthesis
if true, return a node for Nil (which is a terminating () node in scheme
parse tree)
else // I am having difficulty trying to put this into an algorithm here
call parseExp() for the first expression
while (token does not equal right parenthesis) {
getNextToken()
if (token equals right parenthesis)
return a node for right parenthesis
else if (token equals dot)
return a node for dot
getNextToken()
if (token equals right parenthesis)
print error message saying a Syntax error occurred
return null
else
call parseExp()
else
parseExp()
}
}
If I have a wrong idea about it, please correct me. parseRest() is said to require a lookahead token for it to make a decision, could that be explained and probably a pseudo code example?
Thanks!
You're on the right track, but there are some issues:
check to see if the token is left parenthesis
if true, return a node for Cons (which is a non-terminating node in Scheme
parse tree) and call parseRest()
This is a bit ambiguous since you don't mention what you intend to do with the result of parseRest(), but I assume you want to store it in the Cons node. The problem with that is that a Cons node should have two children (in case of a list that'd be the head of the list and its tail - if that isn't clear, you may have to review the rules of the Scheme language), yet parseRest only gives you one node, so that doesn't work. So let's take a step back and think about what we want when we see a (:
A ( is either the start of a pair (i.e. a dotted pair or a non-empty list) or it's the empty list (). In the first case we want a Cons node, but in the second case we want a Nil node as an empty list is not a cons cell. So we have two possibilities and we don't know which one to choose until we've looked at the rest of the list. Therefore the decision shouldn't be made here, but rather inside the parseRest function. So we change the code to:
check to see if the token is left parenthesis
if true, return the result of parseRest()
So now let's look at parseRest:
Here you sometimes return nodes for dots and parentheses, but those aren't supposed to be nodes in the AST at all: they're tokens. Another issue is that when you call parseRest recursively, you again aren't clear about what you want to do with the result. One might think you want to return the result, but then your while-loop would be pointless since you return out of it right in the first iteration every time. In fact this is a problem even in the non-recursive cases: for example, you return a dot node and then continue to parse the expression after it. But after the return the function exits, so anything that comes after the return will be ignored. So this doesn't work.
Before we talk about how to make it work, let's first get a clearer picture of what the generated AST is supposed to look like:
For "()" we want a Nil node. That works fine with your current code.
For "(x)" we want Cons(Ident("x"), Nil).
For "(x . y)" we want Cons(Ident("x"), Ident("y")).
For "(x y)" we want Cons(Ident("x"), Cons (Ident("y"), Nil)).
For "(x y . z)" we want Cons(Ident("x"), Cons (Ident("y"), Ident("z"))).
I hope the pattern is now clear (else you might want to review the Scheme language). So how do we get that kind of AST?
Well, if we see a ), we return Nil. Again that already works in your code. Otherwise we parse an expression (and if there is no valid expression here, we have an error). Now what happens after that? Well if we found an expression, that expression is the first element of a Cons cell. So we want to return Cons(theExpression, ...). But what goes into the ... part? Well that depends on whether the next token is a dot or not. If it is a dot, we have a dotted expression, so there needs to be an expression after the dot and we want to return Cons(theExpressionBeforeTheDot, theExpressionAfterTheDot). If there's no dot, it means we're in a list and what follows is its tail. So we want to return Cons(theExpression, parseRest()).
parseRest() is said to require a lookahead token for it to make a decision, could that be explained and probably a pseudo code example?
Lookahead means that you have to look at the token that comes next without actually removing it from the stream. In terms of your pseudo code that means that you want to know which token will be returned when you call nextToken() without actually changing what the next call to nextToken() will return. So you'd have another built-in function like peekToken() that returns the next token without actually advancing the iterator in the token stream.
The reason why you need this in parseRest is the dot: when you check whether the next token is a dot and it turns out that it isn't, then you don't want the token to be actually gone. That is, you'll call parseExp and then parseExp will call nextToken, right? And when that happens you want it to return the token that comes right after the current expression. You don't want to skip that token because you had to check whether it's a dot. So when checking for the dot, you need to call peekToken instead of nextToken (you still need to remove the token when it is a dot, though).
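Put together, the description above could be sketched in Java roughly as follows. This is only an illustration: tokens are plain strings, only identifiers are handled as atoms (booleans, quotes, integers and strings from the grammar are left out), and the class and node names are assumptions rather than anything prescribed by the assignment.

```java
import java.util.List;

class SchemeParser {
    interface Node {}

    static final class Nil implements Node {
        @Override public String toString() { return "Nil"; }
    }

    static final class Ident implements Node {
        final String name;
        Ident(String name) { this.name = name; }
        @Override public String toString() { return "Ident(" + name + ")"; }
    }

    static final class Cons implements Node {
        final Node car, cdr;
        Cons(Node car, Node cdr) { this.car = car; this.cdr = cdr; }
        @Override public String toString() { return "Cons(" + car + ", " + cdr + ")"; }
    }

    private final List<String> tokens;
    private int pos = 0;

    SchemeParser(List<String> tokens) { this.tokens = tokens; }

    /** Lookahead: returns the next token without consuming it (null at end of input). */
    private String peekToken() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    /** Returns the next token and advances past it. */
    private String nextToken() {
        String t = peekToken();
        pos++;
        return t;
    }

    Node parseExp() {
        String t = nextToken();
        if (t == null) throw new IllegalStateException("Unexpected end of input");
        if (t.equals("(")) return parseRest(); // parseRest decides between Nil and Cons
        if (t.equals(")") || t.equals(".")) {
            throw new IllegalStateException("Unexpected token " + t);
        }
        return new Ident(t);
    }

    Node parseRest() {
        if (")".equals(peekToken())) {
            nextToken();            // consume ')'
            return new Nil();
        }
        Node first = parseExp();
        if (".".equals(peekToken())) {
            nextToken();            // it is a dot, so now it can be consumed
            Node afterDot = parseExp();
            if (!")".equals(nextToken())) {
                throw new IllegalStateException("Expected ) after dotted pair");
            }
            return new Cons(first, afterDot);
        }
        return new Cons(first, parseRest()); // the tail of the list
    }
}
```

Feeding it the tokens of (x y . z), that is "(", "x", "y", ".", "z", ")", and calling parseExp() produces Cons(Ident(x), Cons(Ident(y), Ident(z))), matching the pattern listed earlier.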

Reading XML document nodes containing special characters (&, -, etc) with Java

My code does not retrieve the entirety of element nodes that contain special characters.
For example, for this node:
<theaterName>P&G Greenbelt</theaterName>
It would only retrieve "P" due to the ampersand. I need to retrieve the entire string.
Here's my code:
public List<String> findTheaters() {
    //Clear theaters application global
    FilmhopperActivity.tData.clearTheaters();
    ArrayList<String> theaters = new ArrayList<String>();
    NodeList theaterNodes = doc.getElementsByTagName("theaterName");
    for (int i = 0; i < theaterNodes.getLength(); i++) {
        Node node = theaterNodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            //Found theater, add to return array
            Element element = (Element) node;
            NodeList children = element.getChildNodes();
            String name = children.item(0).getNodeValue();
            theaters.add(name);
            //Logging
            android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);
            //Add theater to application global
            Theater t = new Theater(name);
            FilmhopperActivity.tData.addTheater(t);
        }
    }
    return theaters;
}
I tried adding code to extend the name string to concatenate additional children.items, but it didn't work. I'd only get "P&".
...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {
    name += children.item(j).getNodeValue();
}
Thanks for your time.
UPDATE:
Found a function called normalize() that you can call on Nodes, that combines all text child nodes so doing a children.item(0) contains the text of all the children, including ampersands!
The & is an escape character in XML. XML that looks like this:
<theaterName>P&G Greenbelt</theaterName>
should actually be rejected by the parser. Instead, it should look like this:
<theaterName>P&amp;G Greenbelt</theaterName>
There are a few such characters, such as < (&lt;), > (&gt;), " (&quot;) and ' (&apos;). There are also other ways to escape characters, such as via their Unicode value, as in &#8226; (•) or &#12345; (〹).
For more information, the XML specification is fairly clear.
Now, the other possibility, depending on how your tree was constructed, is that the character is escaped properly in the file, and the sample you showed isn't what's actually there but rather how the data is represented in the tree.
For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. This is because the SAX parser tries to return contiguous chunks of data, and when it gets to the escape character, it sends what it has, and starts a new chunk with the translated &-value. So you might need to combine consecutive text nodes in your tree to get the whole value.
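If the text of an element does end up split across several text nodes, one option with the standard DOM API is to ask the element for its whole text via getTextContent() instead of reading only the first child. The sketch below is a standalone illustration, not the asker's code: the class and method names are invented, and it parses from a string rather than from the web.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TheaterNameDemo {
    /** Returns the full text of the first theaterName element. */
    static String firstTheaterName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element element = (Element) doc.getElementsByTagName("theaterName").item(0);
            // getTextContent() concatenates all text below the element,
            // however many text nodes the parser produced.
            return element.getTextContent();
        } catch (Exception e) {
            // A raw, unescaped '&' makes the document not well-formed.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Called with <theaterName>P&amp;amp;G Greenbelt</theaterName>, where the ampersand is escaped as &amp;amp; in the source, this returns P&G Greenbelt. Called with the unescaped P&G form, the JDK's default parser rejects the input and the sketch throws.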
The file you are trying to read is not valid XML. No self-respecting XML parser will accept it.
I'm retrieving my XML dynamically from the web. What's the best way to replace all my escape characters after fetching the Document object?
You are taking the wrong approach. The correct approach is to inform the people responsible for creating that file that it is invalid, and request that they fix it. Simply writing hacks to (try to) fix broken XML is not in your (or other people's) long term interest.
If you decide to ignore this advice, then one approach is to read the file into a String, use String.replaceAll(regex, replacement) with a suitable regex to turn these bogus "&" characters into proper character entities ("&amp;"), then pass the "fixed" XML string to the XML parser. You need to carefully design the regex so that it doesn't break valid character entities as an unwanted side-effect. A second approach is to do the parsing and replacement by hand, using appropriate heuristics to distinguish the bogus "&" characters from well-formed character entities.
But this all costs you development and test time, and slows down your software. Worse, there is a significant risk that your code will be fragile as a result of your efforts to compensate for the bad input files. (And guess who will get the blame!)
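For completeness, here is roughly what the regex variant of that workaround might look like. It is a heuristic sketch with the fragility already described: the pattern and the class name are assumptions, it treats any & that is not followed by something shaped like an entity or character reference as bogus, and it does not know about CDATA sections or comments.

```java
import java.util.regex.Pattern;

class AmpersandFixer {
    // An '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_:][\\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes ampersands that do not start an entity or character reference. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }
}
```

With this pattern, P&G Greenbelt becomes P&amp;amp;G Greenbelt in the source text, while existing references such as &amp;lt; or &amp;#38; are left alone. Text like A&B; would be mistaken for a valid entity reference, which is one example of why fixing the producer is the better option.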
You need to either encode it properly or wrap it in a CDATA section. I'd recommend the former.
The numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:
<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
