Dumb it down for me: What is parsing?


Yesterday I asked about grammar, and today in Java, I am learning how to implement an algorithm for parsing the grammar, using tokens from the lexical analyzer that I finished.
For this question, I need a person to check my understanding.
Suppose we are given the following Scheme grammar:
exp  -> ( rest
      | #f
      | #t
      | ' exp
      | integer_constant
      | string_constant
      | identifier
rest -> )
      | exp+ [ . exp ] )
Would the following pseudo code be correct? I studied the recursive descent parser, and would need to make a parse tree for the interpreter by creating nodes of the parse tree.
Node parseExp() {
    check to see if the token is left parenthesis
        if true, return a node for Cons (which is a non-terminating node in Scheme parse tree) and call parseRest()
    else check to see if the token is #f
        if true, return a node for Boolean with stored value #f
    else check to see if the token is #t
        if true, return a node for Boolean with stored value #t
    else check to see if the token is quote
        if true, return a node for Quote and recursively call parseExp()
    else check to see if the token is integer_constant
        if true, return a node for Integer with stored value int
    else check to see if the token is string_constant
        if true, return a node for String with stored string value
    else check to see if the token is identifier
        if true, return a node for identifier with stored string value
    else
        print error message saying a Syntax error occurred
        return null
}
Node parseRest() {
    check to see if the token is right parenthesis
        if true, return a node for Nil (which is a terminating () node in scheme parse tree)
    else // I am having difficulty trying to put this into an algorithm here
        call parseExp() for the first expression
        while (token does not equal right parenthesis) {
            getNextToken()
            if (token equals right parenthesis)
                return a node for right parenthesis
            else if (token equals dot)
                return a node for dot
                getNextToken()
                if (token equals right parenthesis)
                    print error message saying a Syntax error occurred
                    return null
                else
                    call parseExp()
            else
                parseExp()
        }
}
If I have a wrong idea about it, please correct me. parseRest() is said to require a lookahead token for it to make a decision, could that be explained and probably a pseudo code example?
Thanks!

You're on the right track, but there are some issues:
check to see if the token is left parenthesis
if true, return a node for Cons (which is a non-terminating node in Scheme
parse tree) and call parseRest()
This is a bit ambiguous since you don't mention what you intend to do with the result of parseRest(), but I assume you want to store it in the Cons node. The problem with that is that a Cons node should have two children (in case of a list that'd be the head of the list and its tail - if that isn't clear, you may have to review the rules of the Scheme language), yet parseRest only gives you one node, so that doesn't work. So let's take a step back and think about what we want when we see a (:
A ( is either the start of a pair (i.e. a dotted pair or a non-empty list) or it's the empty list (). In the first case we want a Cons node, but in the second case we want a Nil node as an empty list is not a cons cell. So we have two possibilities and we don't know which one to choose until we've looked at the rest of the list. Therefore the decision shouldn't be made here, but rather inside the parseRest function. So we change the code to:
check to see if the token is left parenthesis
if true, return the result of parseRest()
So now let's look at parseRest:
Here you sometimes return nodes for dots and parentheses, but those aren't supposed to be nodes in the AST at all - they're tokens. Another issue is that when you call parseExp from inside parseRest, you again aren't clear about what you want to do with the result. One might think you want to return the result, but then your while-loop would be pointless since you return out of it right in the first iteration every time. In fact this is a problem even in the non-recursive cases: For example you return a dot node and then continue to parse the expression after it. But after the return the function exits, so anything that comes after the return will be ignored. So this doesn't work.
Before we talk about how to make it work, let's first get a clearer picture of what the generated AST is supposed to look like:
For "()" we want a Nil node. That works fine with your current code.
For "(x)" we want Cons(Ident("x"), Nil).
For "(x . y)" we want Cons(Ident("x"), Ident("y")).
For "(x y)" we want Cons(Ident("x"), Cons (Ident("y"), Nil)).
For "(x y . z)" we want Cons(Ident("x"), Cons (Ident("y"), Ident("z"))).
I hope the pattern is now clear (else you might want to review the Scheme language). So how do we get that kind of AST?
Well, if we see a ), we return Nil. Again that already works in your code. Otherwise we parse an expression (and if there is no valid expression here, we have an error). Now what happens after that? Well if we found an expression, that expression is the first element of a Cons cell. So we want to return Cons(theExpression, ...). But what goes into the ... part? Well that depends on whether the next token is a dot or not. If it is a dot, we have a dotted expression, so there needs to be an expression after the dot and we want to return Cons(theExpressionBeforeTheDot, theExpressionAfterTheDot). If there's no dot, it means we're in a list and what follows is its tail. So we want to return Cons(theExpression, parseRest()).
parseRest() is said to require a lookahead token for it to make a decision, could that be explained and probably a pseudo code example?
Lookahead means that you have to look at the token that comes next without actually removing it from the stream. In terms of your pseudo code that means that you want to know which token will be returned when you call nextToken() without actually changing what the next call to nextToken() will return. So you'd have another built-in function like peekNext() that returned the next token without actually advancing the iterator in the token stream.
The reason why you need this in parseRest is the dot: When you check whether the next token is a dot and it turns out that it isn't, then you don't want the token to be actually gone. That is, you'll call parseExpression and then parseExpression will call nextToken, right? And when that happens you want it to return the token that comes right after the current expression - you don't want to skip that token because you had to check whether it's a dot. So when checking for the dot, you need to call peekToken instead of nextToken (you still need to remove the token when it is a dot though).
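To make the description above concrete, here is a minimal sketch of the two functions in Java. It is not the asker's code: the node classes (Nil, Atom, Quote, Cons), the SchemeParser class and the crude whitespace-based lexer are all assumptions made for illustration, and string constants containing spaces are not handled. The names peekToken() and nextToken() follow the wording used above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical AST classes; each node only knows how to print itself.
abstract class Node {}

class Nil extends Node {
    public String toString() { return "Nil"; }
}

class Atom extends Node {   // booleans, integers, strings and identifiers
    final String kind, text;
    Atom(String kind, String text) { this.kind = kind; this.text = text; }
    public String toString() { return kind + "(" + text + ")"; }
}

class Quote extends Node {
    final Node quoted;
    Quote(Node quoted) { this.quoted = quoted; }
    public String toString() { return "Quote(" + quoted + ")"; }
}

class Cons extends Node {
    final Node car, cdr;
    Cons(Node car, Node cdr) { this.car = car; this.cdr = cdr; }
    public String toString() { return "Cons(" + car + ", " + cdr + ")"; }
}

class SchemeParser {
    private final Deque<String> tokens;

    SchemeParser(String source) {
        // Stand-in for a real lexer: pad parentheses and quotes, then split on whitespace.
        String padded = source.replaceAll("([()'])", " $1 ").trim();
        tokens = new ArrayDeque<>(Arrays.asList(padded.split("\\s+")));
    }

    // Lookahead: returns the next token without removing it (null at end of input).
    private String peekToken() { return tokens.peekFirst(); }

    // Removes and returns the next token.
    private String nextToken() {
        String t = tokens.pollFirst();
        if (t == null) throw new IllegalStateException("Unexpected end of input");
        return t;
    }

    Node parseExp() {
        String t = nextToken();
        if (t.equals("(")) return parseRest();   // Cons or Nil is decided in parseRest
        if (t.equals("#t") || t.equals("#f")) return new Atom("Bool", t);
        if (t.equals("'")) return new Quote(parseExp());
        if (t.matches("-?\\d+")) return new Atom("Int", t);
        if (t.startsWith("\"")) return new Atom("Str", t);
        if (t.equals(")") || t.equals(".")) throw new IllegalStateException("Syntax error at " + t);
        return new Atom("Ident", t);
    }

    // Called right after "(" and again after each list element.
    Node parseRest() {
        if (")".equals(peekToken())) { nextToken(); return new Nil(); }
        Node first = parseExp();
        if (".".equals(peekToken())) {            // peek only: a non-dot token must stay in the stream
            nextToken();                          // it is a dot, so now consume it
            Node last = parseExp();
            if (!")".equals(peekToken())) throw new IllegalStateException("Expected ) after dotted tail");
            nextToken();
            return new Cons(first, last);
        }
        return new Cons(first, parseRest());      // ordinary list: the tail is the rest of the list
    }
}
```

With this sketch, new SchemeParser("(x y . z)").parseExp() prints as Cons(Ident(x), Cons(Ident(y), Ident(z))), which has the shape listed in the examples above.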

Related

Shunting-yard functions

I am using the Shunting-Yard algorithm (https://en.wikipedia.org/wiki/Shunting-yard_algorithm) in a Java program in order to create a calculator. I am almost done, but I still have to implement functions. I have run into a problem: I want the calculator to automatically multiply variables like x and y when put together - Example: calculator converts xy to x*y. Also, I want the calculator to convert (x)(y) to (x)*(y) and x(y) to x*(y). I have done all of this using the following code:
infix = infix.replaceAll("([a-zA-Z])([a-zA-Z])", "$1*$2");
infix = infix.replaceAll("([a-zA-Z])\\(", "$1*(");
infix = infix.replaceAll("\\)\\(", ")*(");
infix = infix.replaceAll("\\)([a-zA-Z])", ")*$1");
(In my calculator, variable names are always single characters.)
This works great right now, but when I implement functions this will, of course, not work. It will turn "sin(1)" into "s*i*n*(1)". How can I make this code do the multiplication conversion only for variables, and not for functions?

Preprocessing the input to parse isn't a good way to implement what you want. The text replacement can't know what the parsing algorithm knows and you also lose the original input, which can be useful for printing helpful error messages.
Instead, you should decide on what to do according to the context. Keep the type of the previously parsed token, with a special type for the beginning of the input.
If the previous token was a value token (a number, a variable name or the closing brace of a subexpression) and the current one is a value token, too, emit an extra multiplication operator.
The same logic can be used to decide whether a minus sign is a unary negation or a binary subtraction: It's a subtraction if the minus is found after a value token and a negation otherwise.
Your idea to convert x(y) to x * (y) will, of course, clash with function call syntax.
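As a rough sketch of this "remember the previous token" idea, the following Java class walks an already tokenized input once. Everything named here is an assumption for illustration: the class ImplicitMultiplication, the FUNCTIONS set, and the "neg" marker used for unary minus. To sidestep the clash with function call syntax, it assumes that function names are known in advance and that a known function name never counts as the end of a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ImplicitMultiplication {
    // Assumed set of known function names.
    static final Set<String> FUNCTIONS = new HashSet<>(Arrays.asList("sin", "cos", "sqrt"));

    // A number or a single-letter variable.
    static boolean isOperand(String t) {
        return t.matches("\\d+(\\.\\d+)?|[a-zA-Z]");
    }

    // Tokens that can begin a value: operand, opening parenthesis, function name.
    static boolean startsValue(String t) {
        return isOperand(t) || t.equals("(") || FUNCTIONS.contains(t);
    }

    // Tokens that can end a value: operand or closing parenthesis.
    static boolean endsValue(String t) {
        return isOperand(t) || t.equals(")");
    }

    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean afterValue = false;              // false also covers "beginning of input"
        for (String t : tokens) {
            if (afterValue && startsValue(t)) {
                out.add("*");                    // value followed by value: implicit multiplication
            }
            if (t.equals("-") && !afterValue) {
                out.add("neg");                  // minus not preceded by a value: unary negation
            } else {
                out.add(t);
            }
            afterValue = endsValue(t);
        }
        return out;
    }
}
```

For example, normalize(Arrays.asList("-", "x", "-", "2", "y")) yields [neg, x, -, 2, *, y], while the tokens of sin(1) pass through unchanged.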

We can break this down into two parts. There is one rule for bracketed expressions and another for multiplications.
Rather than the Wikipedia article, which is deliberately simplified for explanatory purposes, I would follow a more detailed example like Parsing Expressions by Recursive Descent, which deals with bracketed expressions.
This is the code I use for my parser which can work with implicit multiplication. I have multi-letter variable names and use a space to separate different variables so you can have "2 pi r".
protected void expression() throws ParseException {
    prefixSuffix();
    Token t = it.peekNext();
    while (t != null) {
        if (t.isBinary()) {
            pushOp(t);
            it.consume();
            prefixSuffix();
        }
        else if (t.isImplicitMulRhs()) {
            pushOp(implicitMul);
            prefixSuffix();
        }
        else
            break;
        t = it.peekNext();
    }
    while (!sentinel.equals(ops.peek())) {
        popOp();
    }
}
This requires a few other functions.
I've used a separate tokenizing step which breaks the input into discrete tokens. The Token class has a number of methods; in particular, Token.isBinary() tests whether the token is a binary operator like +, =, *, /. Another method, Token.isImplicitMulRhs(), tests whether the token can appear on the right-hand side of an implicit multiplication; this is true for numbers, variable names, and left brackets.
An Iterator<Token> is used for the input stream. it.peekNext() looks at the next token and it.consume() moves to the next token in the input.
pushOp(Token) pushes a token onto the operator stack and popOp removes one. pushOp has the logic to handle the precedence of different operators: it pops operators already on the stack for as long as compareOps says they must be applied before the new operator, then pushes the new one.
protected void pushOp(Token op)
{
    while (compareOps(ops.peek(), op))
        popOp();
    ops.push(op);
}
Of particular note is implicitMul, an artificial token with the same precedence as multiplication, which is pushed onto the operator stack.
prefixSuffix() handles expressions which can be numbers and variables with optional prefix or suffix operators. This will recognise "2", "x", "-2", "x++", removing tokens from the input and adding them to the output/operator stack as appropriate.
We can think of this routine in BNF as
<expression> ::=
<prefixSuffix> ( <binaryOp> <prefixSuffix> )* // normal binary ops x+y
| <prefixSuffix> ( <prefixSuffix> )* // implicit multiplication x y
Handling brackets is done in prefixSuffix(). If this detects a left bracket, it will then recursively call expression(). To detect the matching right bracket, a special sentinel token is pushed onto the operator stack. When the right bracket is encountered in the input, the main loop breaks, all operators on the operator stack are popped until the sentinel is encountered, and control returns to prefixSuffix(). Code for this might look like:
void prefixSuffix() throws ParseException {
    Token t = it.peekNext();
    if (t.equals('(')) {
        it.consume(); // advance the input
        ops.push(sentinel); // mark where this bracketed sub-expression starts
        expression(); // parse until ')' encountered
        t = it.peekNext();
        if (t.equals(')')) {
            it.consume(); // advance the input
            return;
        } else throw new ParseException("Unmatched (");
    }
    // handle variable names, numbers etc
}

Another approach may be the use of tokens, in a similar way to how a parser works.
The first phase would be to convert the input text into a list of tokens, which are objects that represent both the type of entity found and its value.
For example you can have a variable token, with its value being the name of the variable ('x', 'y', etc.), a token for open or close parenthesis, etc.
Since, I assume, you know in advance the names of the functions that can be used by the calculator, you'll also have a function token, with its value being the function name.
So the output of the tokenizing phase differentiates between variables and functions.
Implementing this is not too hard: just always try to match function names first, so "sin" will be recognized as a function and not as three variables.
Now the second phase can be to insert the missing multiplication operators. This will not be hard now, since you know to insert them only between:
{VAR, RIGHT_PAREN} and {VAR, LEFT_PAREN, FUNCTION}
But never between FUNCTION and LEFT_PAREN.
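The two phases could be sketched in Java roughly as follows. The class name FunctionAwareTokenizer, the Type enum, and the function list (sin, cos, tan) are assumptions for illustration. The regular expression lists the function names before the single-letter variable alternative, so "sin" is matched as one FUNCTION token. The insertion rule uses exactly the two sets given above; a NUMBER type would presumably belong in both sets as well, but that is not stated, so it is left out here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionAwareTokenizer {
    // The order of these constants matches the order of the capture groups below.
    enum Type { FUNCTION, VAR, NUMBER, LEFT_PAREN, RIGHT_PAREN, OPERATOR }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        public String toString() { return text; }
    }

    // Function names come first, so "sin" wins over the variables "s", "i", "n".
    private static final Pattern TOKEN = Pattern.compile(
        "(sin|cos|tan)|([a-zA-Z])|(\\d+(?:\\.\\d+)?)|(\\()|(\\))|([-+*/^])");

    // Phase 1: turn the text into typed tokens.
    static List<Token> tokenize(String input) {
        String s = input.replaceAll("\\s+", "");
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (Type type : Type.values()) {
                if (m.group(type.ordinal() + 1) != null) {
                    tokens.add(new Token(type, m.group()));
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    // Phase 2: insert "*" between {VAR, RIGHT_PAREN} and {VAR, LEFT_PAREN, FUNCTION}.
    static List<Token> insertMultiplication(List<Token> tokens) {
        EnumSet<Type> left = EnumSet.of(Type.VAR, Type.RIGHT_PAREN);
        EnumSet<Type> right = EnumSet.of(Type.VAR, Type.LEFT_PAREN, Type.FUNCTION);
        List<Token> out = new ArrayList<>();
        Token prev = null;
        for (Token t : tokens) {
            if (prev != null && left.contains(prev.type) && right.contains(t.type)) {
                out.add(new Token(Type.OPERATOR, "*"));
            }
            out.add(t);
            prev = t;
        }
        return out;
    }
}
```

Since FUNCTION is not in the left-hand set, nothing is ever inserted between a function name and its opening parenthesis: insertMultiplication(tokenize("sin(1)")) leaves the tokens as they are.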

How to Check a String in Java with a Zero Character RegEx?

The following piece of code checks for /en followed by a variable portion, (^$|.*), which is either empty or any characters. So the expression should match /en as well as /en/bla, /en/blue, etc.
But the expression doesn't work when checking for just /en.
"/en".matches("/en(^$|.*)")
Is there a way to make this empty regex check (^$) work in Java?
edit
I mean: Is there a way to make this piece of code return true?

What you're currently doing is checking whether en is followed by the start of string then the end of string (which doesn't make sense, since the start of string needs to be first) or anything else. This should work:
"/en".matches("/en(|.*)")
Or just using ? (optional):
"/en".matches("/en(.*)?")
But it's rather pointless, since * is zero or more (so a blank string will match for .*), just this should do it:
"/en".matches("/en.*")
EDIT:
Your code was already returning true, but it was not matching the ^$ part, but rather .* (similar to the above).
I should point out that you may as well use startsWith, unless your real data is more complex:
"/en".startsWith("/en")

Is there a way to make this piece of code return true?
"/en".matches("/en(^$|.*)")
That code does return true. Just try it!
However, your pattern is unnecessarily complex. Try:
"/en".matches("/en.*")
This will match /en followed by anything (including nothing).
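A small Java program makes these claims easy to check. The class EnPathDemo and the helper isEnPath are made up for this illustration. The helper uses a stricter pattern, /en(/.*)?, on the assumption that /en is meant as a whole path segment; that is a guess about the intent, not something the question states.

```java
class EnPathDemo {
    // Hypothetical helper: "/en" alone, or "/en" followed by a slash and anything.
    static boolean isEnPath(String path) {
        return path.matches("/en(/.*)?");
    }

    public static void main(String[] args) {
        System.out.println("/en".matches("/en(^$|.*)"));   // true, through the .* branch
        System.out.println("/en".matches("/en.*"));        // true
        System.out.println("/en/bla".matches("/en.*"));    // true
        System.out.println("/english".matches("/en.*"));   // true, which may not be wanted
        System.out.println(isEnPath("/en"));               // true
        System.out.println(isEnPath("/en/bla"));           // true
        System.out.println(isEnPath("/english"));          // false
    }
}
```

Note that String.matches always tests the entire string, which is why no ^ or $ anchors are needed in any of these patterns.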

expression evaluation with right-associativity in java

I am trying to solve a problem in which I have to evaluate a given expression consisting of one or more initializations in the same string with no operator precedence (although with bracketed sub-expressions). All the operators are right-associative, so I have to evaluate it from right to left. I am confused about how to proceed with the given problem. The detailed problem is given here: http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=108

I'll give you some ideas to try:
First off, you need to recursively evaluate inside brackets. You want to do brackets from most nested to least nested, so use a regex that matches brackets with no ) inside of them. Substring the result of the computation into the part of the string the bracketed expression took up.
If there are no brackets, then now you need to evaluate operators. The reason why the question requires right-associativity is to force you to think about how to answer it - you can't just read the string and do calculations. You have to consider the whole string THEN start doing calculations, which means storing some structure describing it. There are a number of strategies you could use to do this, for example:
- You could tokenize the string, either using a scanner or regexes - continually try to see if the next item in the string is a number or which of the operators it is, and push what kind of token it is and its value onto a list. Then, you can evaluate the list from right to left using some kind of case/switch structure to determine what to do for each operator (either that, or each operator is associated with what it does to numbers). = itself would address a map of variable name keys to values, and insert the value under that variable's key, and then return (to be placed into the list) the value it produced, so it can be used for another assignment. It also seems like - can be determined as to whether it's subtraction or a negative number by whether there's a space on its right or not.
- Instead of tokenization, you could use regexes on the string as a whole. But tokenization is more robust. I tried to build a calculator based on applying regexes to the whole string over and over but it's so difficult to get all the rules right and I don't recommend it.
I've written an expression evaluating calculator like this before, so you can ask me questions if you run into specific problems.
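The tokenize-then-evaluate-from-the-right strategy could look roughly like the sketch below. It is not a solution to the linked problem: it assumes tokens are separated by whitespace, ignores brackets, treats unknown variables as 0, and the class RightToLeftEvaluator and its methods are invented for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class RightToLeftEvaluator {
    // Variable names mapped to their current values.
    private final Map<String, Integer> variables = new HashMap<>();

    // A token is either an integer literal (possibly negative) or a variable name.
    private int valueOf(String token) {
        if (token.matches("-?\\d+")) return Integer.parseInt(token);
        return variables.getOrDefault(token, 0);   // assumption: unset variables are 0
    }

    int evaluate(String expression) {
        String[] t = expression.trim().split("\\s+");
        int acc = valueOf(t[t.length - 1]);        // start with the rightmost operand
        // Walk leftwards: t[i] is an operator, t[i - 1] its left operand.
        for (int i = t.length - 2; i >= 1; i -= 2) {
            String op = t[i];
            String left = t[i - 1];
            switch (op) {
                case "=": variables.put(left, acc); break;   // assignment yields the assigned value
                case "+": acc = valueOf(left) + acc; break;
                case "-": acc = valueOf(left) - acc; break;
                case "*": acc = valueOf(left) * acc; break;
                case "/": acc = valueOf(left) / acc; break;
                default: throw new IllegalArgumentException("Unknown operator: " + op);
            }
        }
        return acc;
    }

    int get(String name) {
        return variables.getOrDefault(name, 0);
    }
}
```

With right-to-left evaluation, "A = B = 4 + 2 * 3" computes 2 * 3 first, then 4 + 6, and stores 10 in both B and A; "10 - 4 - 3" evaluates to 10 - (4 - 3) = 9.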

How to do special handling for the last term in the TokenStream generated by WhitespaceTokenizer

My use case - if a Roman numeral comes at the very end of a TokenStream, then convert it to English numeral. Otherwise let it be.
Ex. "Something III" >>> "Something 3".
But "III Something" >>> "III Something" (remains same as III does not come at the very last)
How exactly do I make this logic work in Lucene?
p.s. input.incrementToken() seems to return true first, and then false for every term in the TokenStream generated by the WhitespaceTokenizer.

Could you give a bit more detail, such as the piece of code?
I suppose you already took a look at this:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/TokenStream.html
It says:
"Consumers (i.e., IndexWriter) use this method to advance the stream to the next token"
It is normal that your incrementToken returns false the second time, since there is no space after that token.
You must loop with end() to know when the string is done (excuse my French, I don't see how to write it).
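Whether a token is the last one can only be known after trying to read the token that follows it, so the usual trick is to hold one token back. The sketch below shows that idea in plain Java over an Iterator<String>; it does not use any Lucene API, and the class LastTokenRomanConverter and its methods are invented for illustration. In a Lucene TokenFilter the same one-token lookahead would have to be built around incrementToken(), which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class LastTokenRomanConverter {
    private static final Map<Character, Integer> VALUES = new HashMap<>();
    static {
        VALUES.put('I', 1);   VALUES.put('V', 5);   VALUES.put('X', 10);
        VALUES.put('L', 50);  VALUES.put('C', 100); VALUES.put('D', 500);
        VALUES.put('M', 1000);
    }

    // Standard subtractive rule: a smaller symbol before a larger one is subtracted.
    static int romanToInt(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = VALUES.get(s.charAt(i));
            boolean smallerBeforeLarger = i + 1 < s.length() && v < VALUES.get(s.charAt(i + 1));
            total += smallerBeforeLarger ? -v : v;
        }
        return total;
    }

    static List<String> convertLast(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        String held = null;                  // token not yet known to be last or not
        while (tokens.hasNext()) {
            if (held != null) out.add(held); // another token follows, so held was not last
            held = tokens.next();
        }
        if (held != null) {
            // Loose check: any run of Roman symbols counts, including malformed numerals
            // and upper-case words made only of these letters.
            out.add(held.matches("[IVXLCDM]+") ? String.valueOf(romanToInt(held)) : held);
        }
        return out;
    }
}
```

Calling convertLast on the tokens of "Something III" gives [Something, 3], while the tokens of "III Something" come back unchanged.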

Recursively verifying if a string is a valid prefix expression?

I'm rather new to the community but I've seen some helpful posts on here so I thought I'd ask.
I've got a homework question that asks us to recursively check whether a given string is a valid prefix expression given by the two following rules (standard):
Variables (a-z) are prefix expressions
If O is a binary operator and F and E are prefix expressions, then OFE is a prefix expression
Now, I kind of get the evaluation and have looked at the prefix-to-infix algorithms, but I can't for the life of me figure out how to implement just the evaluation methods (as I only need to check if it's valid, so not +a-b for example).
I know most of the implementation for these problems is done using stacks but I don't see how I would do it recursively here... some help would be tremendously appreciated.

Think of it this way. (I'm not going to write the code, since that's what you need to learn).
You want to check if a certain string is a prefix expression, so you have a function:
boolean isPrefix(string)
Now, there are two ways that string could be a prefix expression:
It's a character from a-z
It's in the format O(prefix)(prefix)
So first, you check if the string has a length of one and is between a-z, and if so, the answer is yes.
Next you can check if the string starts with an O. If it does, you need to test the rest of the string to see if it is composed of two prefix expressions (FE).
So you start iterating from 1 to length, passing each pair of substrings (1->i, i->length) into isPrefix(); the operator at index 0 is skipped. If, for some i, both substrings are also valid prefix expressions, the answer is yes.
Otherwise, the answer is no.
That's pretty much it, but the implementation, however, is up to you.
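For readers who want to compare notes after trying it themselves, one possible reading of these steps in Java is sketched below. The class name PrefixChecker and the operator set + - * / are assumptions; the question does not say which binary operators are allowed.

```java
class PrefixChecker {
    // Assumed set of binary operators.
    private static final String OPERATORS = "+-*/";

    static boolean isPrefix(String s) {
        // Rule 1: a single variable a-z is a prefix expression.
        if (s.length() == 1) {
            return s.charAt(0) >= 'a' && s.charAt(0) <= 'z';
        }
        // Rule 2: the string must start with an operator...
        if (s.isEmpty() || OPERATORS.indexOf(s.charAt(0)) < 0) {
            return false;
        }
        // ...followed by two prefix expressions; try every split point after the operator.
        for (int i = 2; i < s.length(); i++) {
            if (isPrefix(s.substring(1, i)) && isPrefix(s.substring(i))) {
                return true;
            }
        }
        return false;
    }
}
```

Here isPrefix("+a-bc") is true (a and -bc are both prefix expressions), while isPrefix("+a-b") is false because no split of a-b gives two valid parts. Trying every split point is simple but repeats work on long inputs.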

I'm not sure I entirely understand the point of this, but I imagine you should have some method like checkPrefixIn(String s) that looks at only part of the given String, returns true if it is only a prefix, false if it is only an operator (or invalid character), or the return value of checkPrefixIn(partOfS), where partOfS is a substring of the input s
