I am trying to parse through text to see if it is a valid expression. I am faced with the following problem.
((5*4) + 3) is a valid expression.
How would I parse this so that I can analyze what is in one level of parentheses at a time?
For example, I would want the following expressions returned as separate substrings, so that one substring reads "5*4" and another reads "(5*4) + 3".
I know I can use substring as follows:
String test = "test (542)";
test = test.substring(test.indexOf("(") + 1);
test = test.substring(0, test.indexOf(")"));
But how can I best handle multiple levels of parentheses in an unknown string?
Divide and Conquer would be a promising approach. You could define a recursive function, which will only need to handle a simple base case (like 5*4) explicitly. Whenever there are parentheses, call the function again with the text inside the parentheses.
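As a rough illustration of peeling off one level of parentheses at a time, here is a minimal sketch. It swaps the recursion for an explicit stack of '(' positions, which records the same groups; the class and method names (ParenLevels, groups) are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ParenLevels {
    /** Returns the contents of every parenthesised group, in the order the groups close (innermost first). */
    static List<String> groups(String text) {
        List<String> result = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>(); // indexes of '(' not yet matched
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                open.push(i);
            } else if (c == ')') {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                // Everything between the most recent '(' and this ')' is one group.
                result.add(text.substring(open.pop() + 1, i));
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unmatched '(' at index " + open.peek());
        }
        return result;
    }
}
```

Calling ParenLevels.groups("((5*4) + 3)") should give the two substrings from the question, "5*4" first and then "(5*4) + 3". It only checks that the parentheses balance; whether each group is a valid expression still has to be checked separately.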
You can do that with the Shunting-yard algorithm. If you only need to validate an expression, you do not need to implement the whole algorithm. Your expression is valid when all the required operators and operands are present. On YouTube, see "Shunting Yard Algorithm - Intro" and "Reverse Polish Notation".
I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.
For example, we know that the regular expressions below are quite different but they both match xy12:
'^xy1\d'
'[^\d]\d2$'
Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?
There's no halting problem involved here. All you need is to compute whether the intersection of ^xy1\d and [^\d]\d2$ is non-empty.
I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting to the construction of a DFA:
http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html
And then there's RAGEL
http://www.complang.org/ragel/
which can compute the intersection of regular expressions too.
UPDATE: I just tried out Ragel with OP's regexp. Ragel can generate a "dot" file for graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexp looks like this in Ragel syntax:
('xy1' digit any*) & (any* ^digit digit '2')
and has the following state machine:
While the empty intersection:
('xy1' digit any*) & ('q' any* ^digit digit '2')
looks like this:
So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.
The problem can be restated as, "do the languages described by two or more regular
expressions have a non-empty intersection"?
If you confine the question to pure regular expressions (no backreferences, lookahead,
lookbehind, or other features that allow recognition of context-free or more complex
languages), the question is at least decidable. Regular languages are closed under
intersection, and there is an algorithm that takes the two regular expressions
as inputs and produces, in finite time, a DFA that recognizes the intersection.
Each regular expression can be converted to a nondeterministic finite automaton,
and then to a deterministic finite automaton. A pair of DFAs can be converted
to a DFA that recognizes the intersection. If there is a path from the
start state to any accepting state of that final DFA, the intersection is non-empty
(a "conflict", using your terminology).
Unfortunately, there is a possibly-exponential blowup when converting the initial NFA
to a DFA, so the problem quickly becomes infeasible in practice as the size of
the input expressions grows.
And if extensions to pure regular expressions are permitted, all bets are off --
such languages are no longer closed under intersection, so this construction won't
work.
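To make the product construction concrete, here is a small sketch. Everything in it is an assumption for illustration: the two DFAs are written by hand for the question's patterns (read as 'xy1' digit any* and any* ^digit digit '2') rather than derived from the regexes, and the alphabet passed in is just a few representative characters. The search is a breadth-first walk over pairs of states that stops at the first pair where both DFAs accept.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RegexConflict {
    /** A tiny hand-rolled DFA: states are 0..size()-1 and state 0 is the start state. */
    interface Dfa {
        int size();
        int step(int state, char c);
        boolean accepting(int state);
    }

    // 'xy1' digit any*  (hand-built; state 5 is the dead state)
    static final Dfa PREFIX = new Dfa() {
        public int size() { return 6; }
        public int step(int s, char c) {
            switch (s) {
                case 0: return c == 'x' ? 1 : 5;
                case 1: return c == 'y' ? 2 : 5;
                case 2: return c == '1' ? 3 : 5;
                case 3: return Character.isDigit(c) ? 4 : 5;
                case 4: return 4;
                default: return 5;
            }
        }
        public boolean accepting(int s) { return s == 4; }
    };

    // any* ^digit digit '2'  (hand-built; the state counts how much of the suffix has been seen)
    static final Dfa SUFFIX = new Dfa() {
        public int size() { return 4; }
        public int step(int s, char c) {
            if (!Character.isDigit(c)) return 1;   // just saw a non-digit
            if (s == 1) return 2;                  // non-digit, digit
            if (s == 2 && c == '2') return 3;      // non-digit, digit, '2'
            return 0;
        }
        public boolean accepting(int s) { return s == 3; }
    };

    /** Breadth-first search over the product automaton; returns a shortest common string, or null if none. */
    static String witness(Dfa a, Dfa b, char[] alphabet) {
        int n = b.size();
        String[] reached = new String[a.size() * n]; // string that led to each state pair
        Deque<Integer> queue = new ArrayDeque<>();
        reached[0] = "";
        queue.add(0);
        while (!queue.isEmpty()) {
            int pair = queue.remove();
            int sa = pair / n, sb = pair % n;
            if (a.accepting(sa) && b.accepting(sb)) return reached[pair];
            for (char c : alphabet) {
                int next = a.step(sa, c) * n + b.step(sb, c);
                if (reached[next] == null) {
                    reached[next] = reached[pair] + c;
                    queue.add(next);
                }
            }
        }
        return null; // no reachable pair where both accept: the intersection is empty
    }
}
```

With the alphabet "xy12q", witness(PREFIX, SUFFIX, ...) should return "xy12", i.e. the two patterns conflict. The product has at most size(a) * size(b) states, so this step is cheap; the expensive part is the NFA-to-DFA conversion mentioned above, which this sketch sidesteps by starting from DFAs.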
Yes I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings. That is, all the strings they would match.
Let [R] be the set of strings generated by the regular expression R. Then, given two regular expressions R and T, we could try to match T against [R]. That is, [R] matches T iff there is a string in [R] which matches T.
It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state and then advances to 'all next states' simultaneously.
Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.
use Regexp::Optimizer;
my $re = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/
That is, optimize first and then discard all redundant patterns.
Maybe Ron Savage's Regexp::Assemble could be even more helpful.
It allows assembling an arbitrary number of regular expressions into a single regular expression that matches all that the individual REs match.* Or a combination of both.
* However, you need to be aware of some differences between Perl and Java or other PCRE-flavors.
If you are looking for a library in Java, you can use Automaton (dk.brics.automaton) with its '&' operator:
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

RegExp re = new RegExp("(ABC_123.*56.txt)&(ABC_12.*456.*\\.txt)", RegExp.INTERSECTION); // Parse RegExp
Automaton a = re.toAutomaton(); // Convert RegExp to automaton
if (a.isEmpty()) { // Test if intersection is empty
    System.out.println("Intersection is empty!");
} else {
    // Print the shortest accepted string
    System.out.println("Intersection is non-empty, example: " + a.getShortestExample(true));
}
Original Answer:
Detecting if two regexes could possibly match the same string
I am having trouble creating a String expression when given an expression tree. If my expression tree looks like this (in the output console):
(*(+(5)(-(2)(3)))(6))
How do I create a method that goes through this to create an expression that is in normal format? For example, like this:
(2 - 3 + 5) * 6
Should I be working with the actual expression tree or with the String representation of the expression tree (shown above as (*(+(5)(-(2)(3)))(6)))?
You should use a prefix-to-infix conversion algorithm.
That's because your expression tree string is in prefix form and you want it in infix form.
You can remove all the parentheses in the input string. That way it will be easier.
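Here is one way the prefix-to-infix idea could look, as a sketch only. It is a variant that keeps the parentheses and uses them to find where each node starts and ends, and it assumes every node is either "(number)" or "(operator node node)" with a single-character binary operator, as in the example string. The class name PrefixToInfix is made up.

```java
class PrefixToInfix {
    private final String text;
    private int pos;

    private PrefixToInfix(String text) {
        this.text = text;
    }

    /** Converts a string such as "(*(+(5)(-(2)(3)))(6))" to fully parenthesised infix. */
    static String convert(String tree) {
        return new PrefixToInfix(tree).node();
    }

    // node := '(' number ')'  |  '(' operator node node ')'
    private String node() {
        expect('(');
        String result;
        char c = text.charAt(pos);
        if (Character.isDigit(c)) {
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
                pos++;
            }
            result = text.substring(start, pos);
        } else {
            pos++; // consume the operator character
            String left = node();
            String right = node();
            result = "(" + left + " " + c + " " + right + ")";
        }
        expect(')');
        return result;
    }

    private void expect(char c) {
        if (pos >= text.length() || text.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at index " + pos);
        }
        pos++;
    }
}
```

PrefixToInfix.convert("(*(+(5)(-(2)(3)))(6))") should produce "((5 + (2 - 3)) * 6)". The output keeps the operand order of the tree and parenthesises every operation; dropping redundant parentheses would need an extra pass that compares operator precedence.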
About that I advise you to read these documents.
Shunting-yard algorithm: https://en.wikipedia.org/wiki/Shunting-yard_algorithm
This algorithm is about stacking 'tokens' according to their precedence; for example, an expression in parentheses is evaluated first. For that, read these:
https://en.wikipedia.org/wiki/Order_of_operations
http://introcs.cs.princeton.edu/java/11precedence/ (This one is specific for programming)
I hope I have helped.
Have a nice day. :)
I am trying to solve a problem in which I have to evaluate a given expression consisting of one or more initializations in the same string, with no operator precedence (although with bracketed sub-expressions). All the operators are right-associative, so I have to evaluate the expression from right to left. I am confused about how to proceed. The detailed problem is given here: http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=108
I'll give you some ideas to try:
First off, you need to recursively evaluate inside brackets. You want to do brackets from most nested to least nested, so use a regex that matches brackets with no ) inside of them. Substring the result of the computation into the part of the string the bracketed expression took up.
If there are no brackets, then now you need to evaluate operators. The reason why the question requires right precedence is to force you to think about how to answer it - you can't just read the string and do calculations. You have to consider the whole string THEN start doing calculations, which means storing some structure describing it. There's a number of strategies you could use to do this, for example:
-You could tokenize the string, either using a scanner or regexes - continually try to see if the next item in the string is a number or which of the operators it is, and push what kind of token it is and its value onto a list. Then, you can evaluate the list from right to left using some kind of case/switch structure to determine what to do for each operator (either that, or each operator is associated with what it does to numbers). = itself would address a map of variable name keys to values, and insert the value under that variable's key, and then return (to be placed into the list) the value it produced, so it can be used for another assignment. It also seems like - can be determined as to whether it's subtraction or a negative number by whether there's a space on its right or not.
-Instead of tokenization, you could use regexes on the string as a whole. But tokenization is more robust. I tried to build a calculator based on applying regexes to the whole string over and over but it's so difficult to get all the rules right and I don't recommend it.
I've written an expression evaluating calculator like this before, so you can ask me questions if you run into specific problems.
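The tokenize-then-evaluate idea could be sketched roughly like this. It is a simplification under several assumptions that are not taken from the problem statement: tokens are unsigned integers, identifiers, the operators + - * / =, and parentheses; negative literals are not handled; unknown characters are silently skipped; unset variables read as 0. Instead of a literal right-to-left loop it uses recursion (operand, then operator, then "the rest"), which groups everything to the right first and so gives the same result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RightToLeft {
    // Assumed token shapes: number, identifier, or a single operator/bracket character.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[A-Za-z]\\w*|[-+*/=()]");

    private final Map<String, Integer> vars = new HashMap<>();
    private List<String> tokens;
    private int pos;

    /** Tokenizes the text and evaluates it with every operator grouping to the right. */
    int eval(String text) {
        tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        pos = 0;
        return expr();
    }

    // expr := operand [ operator expr ]
    private int expr() {
        String left = tokens.get(pos++);
        int value;
        if (left.equals("(")) {
            value = expr();
            pos++; // skip the matching ')'
        } else if (Character.isDigit(left.charAt(0))) {
            value = Integer.parseInt(left);
        } else {
            value = vars.getOrDefault(left, 0);
        }
        if (pos >= tokens.size() || tokens.get(pos).equals(")")) {
            return value; // nothing left at this bracket level
        }
        String op = tokens.get(pos++);
        int right = expr(); // the whole right-hand side is evaluated first
        switch (op) {
            case "+": return value + right;
            case "-": return value - right;
            case "*": return value * right;
            case "/": return value / right;
            case "=": vars.put(left, right); return right; // assignment yields the assigned value
            default: throw new IllegalArgumentException("Unknown operator " + op);
        }
    }
}
```

For example, new RightToLeft().eval("2 * 3 + 4") should give 14 (that is, 2 * (3 + 4)), and eval("A = B = 1 + 2") should set both variables to 3. There is no error handling for malformed input, so treat it as a starting point rather than a solution to the linked problem.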
Pretty simple question and my brain is frozen today so I can't think of an elegant solution where I know one exists.
I have a formula which is passed to me in the form "A+B"
I also have a mapping of the formula variables to their "readable names".
Finally, I have a formula parser which will calculate the value of the formula, but only if it's passed with the readable names for the variables.
For example, as an input I get
String formula = "A+B"
String readableA = "foovar1"
String readableB = "foovar2"
and I want my output to be "foovar1+foovar2"
The problem with a simple find and replace is that it can easily be broken, because we have no guarantees on what the 'readable' names are. Let's say I take my example again with different parameters
String formula = "A+B"
String readableA = "foovarBad1"
String readableB = "foovarAngry2"
If I do a simple find and replace in a loop, I'll end up replacing the capital A's and B's in the readable names I have already replaced.
This looks like an approximate solution but I don't have brackets around my variables
How to replace a set of tokens in a Java String?
That link you provided is an excellent source since matching using patterns is the way to go. The basic idea here is first get the tokens using a matcher. After this you will have Operators and Operands
Then, do the replacement individually on each Operand.
Finally, put them back together using the Operators.
A somewhat tedious solution would be to scan for all occurrences of A and B and note their indexes in the string, and then use the StringBuilder.replace(int start, int end, String str) method. (In naive form this would not be very efficient though, approaching something like quadratic complexity, or more precisely "number of variables" * "number of possible replacements".)
If you know all of your operators, you could do split on them (like on "+") and then replace individual "A" and "B" (you'd have to do trimming whitespace chars first of course) in an array or ArrayList.
A simple way to do it is
String formula = "A+B".replaceAll("\\bA\\b", readableA)
.replaceAll("\\bB\\b", readableB);
Your approach does not work well that way.
Formulas (mathematical expressions) should be parsed into an expression structure (e.g. an expression tree),
so that you end up with operand nodes and operator nodes.
Later this expression is evaluated by traversing the tree, taking the mathematical precedence rules into account.
I recommend reading more on Expression parsing.
Matching Only
If you don't have to evaluate the expression after doing the substitution, you might be able to use a regex. Something like (\b\p{Alpha}\p{Alnum}*\b)
or the java string "(\\b\\p{Alpha}\\p{Alnum}*\\b)"
Then use find() over and over to find all the variables and store their locations.
Finally, go through the locations and build up a new string from the old one with the variable bits replaced.
Note that it will not do much checking that the supplied expression is reasonable. For example, it wouldn't mind at all if you gave it )A 2 B( and would just replace the A and B (giving )XXX 2 XXX( ). I don't know if that matters.
This is similar to the link you supplied in your question except you need a different regular expression than they used. You can go to http://www.regexplanet.com/advanced/java/index.html to play with regular expressions and figure out one that will work. I used it with the one I suggested and it finds what it needs in A+B and A + (C* D ) just fine.
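The find-and-rebuild steps above might look something like the sketch below. It uses the suggested regex with Matcher.appendReplacement, so each variable is looked up once in the original string and already-inserted names are never rescanned. The class name VariableRenamer and the Map parameter are made up for this example; identifiers without a mapping are left unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableRenamer {
    // Same pattern as suggested above: a letter followed by letters or digits.
    private static final Pattern IDENTIFIER = Pattern.compile("\\b\\p{Alpha}\\p{Alnum}*\\b");

    /** Replaces every identifier that has an entry in readableNames, in a single pass. */
    static String rename(String formula, Map<String, String> readableNames) {
        Matcher m = IDENTIFIER.matcher(formula);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = readableNames.getOrDefault(m.group(), m.group());
            // quoteReplacement keeps '$' and '\' in a readable name from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With A mapped to "foovarBad1" and B mapped to "foovarAngry2", rename("A+B", names) should give "foovarBad1+foovarAngry2", because the capital letters inside the inserted names are never matched again.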
Parsing
You parse the expression using one of the available parser generators (Antlr or Sable or ...) or find an algebraic expression parser available as open source and use it. (You would have to search the web to find those, I haven't used one but suspect they exist.)
Then you use the parser to generate a parsed form of the expression, replace the variables and reconstitute the string form with the new variables.
This one might work better but the amount of effort depends on whether you can find existing code to use.
It also depends on whether you need to validate the expression is valid according to the normal rules. This method will not accept invalid expressions, most likely.
I have a bunch of strings representing mathematical functions (which could be nested and have any number of arguments), and I want to be able to use regex to return an array of strings, each string being an argument of the outer-most function. Here's an example:
"f1(f2(x),f3(f4(f5(x,y,z))),f(f(1)))"
I would want a regex pattern that I could use to somehow get an array of all the arguments of f1, which in this case are the strings "f2(x)", "f3(f4(f5(x,y,z)))", and "f(f(1))". There will be no spaces in the input string.
Thank you very much to anyone who can help.
I don't think this can be done with regexes alone.
This would probably require being able to identify balanced parentheses -- for example, once we've parsed f1(f2(x), the next character could either be a ) or a , -- and that's a canonical example of something that can't be done with regexes, but requires a more sophisticated parser.
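A small hand-written scanner is enough for this, though. The sketch below assumes the input is a single call of the form name(...) with no spaces, as described in the question, and splits on commas only while the nesting depth is zero. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelArgs {
    /** Splits the arguments of the outermost call, e.g. "f(a,g(b,c))" into "a" and "g(b,c)". */
    static List<String> arguments(String call) {
        int open = call.indexOf('(');
        if (open < 0 || !call.endsWith(")")) {
            throw new IllegalArgumentException("Not a function call: " + call);
        }
        List<String> args = new ArrayList<>();
        int depth = 0;        // nesting depth inside the outermost parentheses
        int start = open + 1; // where the current argument begins
        for (int i = open + 1; i < call.length() - 1; i++) {
            char c = call.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                args.add(call.substring(start, i));
                start = i + 1;
            }
        }
        args.add(call.substring(start, call.length() - 1));
        return args;
    }
}
```

For "f1(f2(x),f3(f4(f5(x,y,z))),f(f(1)))" this should return the three strings "f2(x)", "f3(f4(f5(x,y,z)))" and "f(f(1))". It does not verify that the inner parentheses balance, and a call with no arguments such as "f()" yields a single empty string.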