Regex pattern matching not working recursively - java

I want to implement pattern matching for input of the form
(a+b)(c-or*or/d)..............
repeated any number of times.
I use the following pattern, but it is not working recursively.
It only reads the first group.
Pattern pattern;
String regex="(([0-9]*)([+,-,/,*])([0-9]*)*)";
pattern=Pattern.compile(regex);
Matcher match = pattern.matcher(userInput);

The regular expression you need to match that sort of sequence is this:
\s*-?\d+(?:\s*[-+/*]\s*-?\d+)+\s*
Let's break that down into its component parts!
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
(?: # Start sub-regex
\s* # Optional space
[-+*/] # Mandatory single arithmetic operator
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
)+ # End sub-regex: want one or more matches of it
\s* # Optional space
(If you don't want to match spaces, remove all of those \s* and be aware that it will surprise users quite a lot.)
Now, when encoding the above as a String literal in Java (before compilation) you need to be careful to escape each of the \ characters in it:
String regex="\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*";
The other thing to be aware of is that this doesn't pull the input apart into pieces for Java to parse and build an expression evaluation tree from; it just (with the rest of your code) matches the whole string or not. (Even putting in capturing parentheses wouldn't help a lot; when put inside some form of repetition, they only report the last place in the string where they matched.) The simplest way of doing that properly would be to use a parser generator like ANTLR (it would also let you do things like parenthesized subexpressions, managing operator precedence, etc.).
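As a rough sketch of how the pattern above could be used from Java: validate the whole input with matches(), then walk it with find() to pull out the numbers and operators one at a time. The class name ExpressionScanner, the tokens method and the two helper patterns FIRST and STEP are assumptions made for this illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ExpressionScanner {
    // The whole-input pattern from the answer above.
    private static final Pattern WHOLE =
            Pattern.compile("\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*");
    // Assumed helper patterns: the first operand, then one "operator operand" step.
    private static final Pattern FIRST = Pattern.compile("\\s*(-?\\d+)");
    private static final Pattern STEP = Pattern.compile("\\s*([-+/*])\\s*(-?\\d+)");

    static List<String> tokens(String input) {
        if (!WHOLE.matcher(input).matches()) {
            throw new IllegalArgumentException("Not an arithmetic sequence: " + input);
        }
        List<String> out = new ArrayList<>();
        Matcher first = FIRST.matcher(input);
        first.lookingAt(); // cannot fail: WHOLE already accepted the input
        out.add(first.group(1));
        Matcher step = STEP.matcher(input);
        int pos = first.end();
        // Each find(pos) resumes right after the previous operand.
        while (step.find(pos)) {
            out.add(step.group(1)); // operator
            out.add(step.group(2)); // operand
            pos = step.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("1 + 2*-3")); // prints [1, +, 2, *, -3]
    }
}
```

This only splits a flat sequence into tokens. Parentheses and operator precedence would still need a real parser, as the answer says.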

You will need an expression like this
[0-9]+-[0-9]+[\/*+-][0-9]+[\/*+-][0-9]+[\/*+-][0-9]+[\/*+-][0-9]+
(The - is listed last in each character class so that it is a literal minus; written as *-+ it would form a range that covers only * and +.)
You have to match the whole expression. You cannot match part of the expression and do a second search, because the pattern is repeated.
Note: In Ruby, \/ is the escape sequence for the / character, so you can omit the backslash in C# or replace it with another character.
Demo

The pattern
<!--
\((\d|[\+\-\/\\\*\^%!]+|(or|and) *)+\)
Options: ^ and $ match at line breaks
Match the character “(” literally «\(»
Match the regular expression below and capture its match into backreference number 1 «(\d|[\+\-\/\\\*\^%!]+|(or|and) *)+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «+»
Match either the regular expression below (attempting the next alternative only if this one fails) «\d»
Match a single digit 0..9 «\d»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[\+\-\/\\\*\^%!]+»
Match a single character present in the list below «[\+\-\/\\\*\^%!]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A + character «\+»
A - character «\-»
A / character «\/»
A \ character «\\»
A * character «\*»
A ^ character «\^»
One of the characters “%!” «%!»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «(or|and) *»
Match the regular expression below and capture its match into backreference number 2 «(or|and)»
Match either the regular expression below (attempting the next alternative only if this one fails) «or»
Match the characters “or” literally «or»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «and»
Match the characters “and” literally «and»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “)” literally «\)»
-->
The calculation algorithm
To parse and process the input string you have to use a stack. Visit here for the concept.
Regards
Cylian
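The stack idea mentioned above can be sketched roughly as follows, using one stack for values and one for pending operators. This is a minimal illustration under stated assumptions, not the algorithm from the linked page: it handles only non-negative integer literals, the operators + - * / and parentheses, and it does not support the implicit multiplication of adjacent groups such as (a+b)(c-d). The class and method names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical two-stack evaluator, for illustration only.
final class StackEvaluator {
    static long evaluate(String expr) {
        Deque<Long> values = new ArrayDeque<>();
        Deque<Character> ops = new ArrayDeque<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < expr.length() && Character.isDigit(expr.charAt(i))) i++;
                values.push(Long.parseLong(expr.substring(start, i)));
            } else if (c == '(') {
                ops.push(c);
                i++;
            } else if (c == ')') {
                // Reduce everything back to the matching "(".
                while (!ops.isEmpty() && ops.peek() != '(') apply(values, ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("Unbalanced )");
                ops.pop();
                i++;
            } else if ("+-*/".indexOf(c) >= 0) {
                // Apply pending operators of equal or higher precedence first.
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    apply(values, ops.pop());
                }
                ops.push(c);
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        while (!ops.isEmpty()) apply(values, ops.pop());
        if (values.size() != 1) throw new IllegalArgumentException("Malformed expression");
        return values.pop();
    }

    private static int precedence(char op) {
        return (op == '*' || op == '/') ? 2 : 1;
    }

    private static void apply(Deque<Long> values, char op) {
        if (values.size() < 2) throw new IllegalArgumentException("Missing operand");
        long right = values.pop();
        long left = values.pop();
        switch (op) {
            case '+': values.push(left + right); break;
            case '-': values.push(left - right); break;
            case '*': values.push(left * right); break;
            case '/': values.push(left / right); break; // integer division
            default: throw new IllegalArgumentException("Unexpected operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(1+2)*(3-1)")); // prints 6
    }
}
```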

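Your expression doesn't escape special chars like +, ( and ).
Try this:
\G\(\d+[-+/*]\d+\)
\G anchors each match at the end of the previous one, so calling find() repeatedly walks through consecutive groups.
I changed your [0-9]* to \d+, which I think is more correct.
I removed your , separators: inside [...] every character is literal, so neither , nor | is needed between the operators.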

Related

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may contain a decimal separator ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.matches(regex, input), but I am not able to find the correct regex.
Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
You can see this regular expression here
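For illustration, the whitelisted-operator pattern above could be applied in Java roughly like this. The class name EquationCheck and the parts helper are hypothetical; only the regex itself comes from the answer. Group 1 is the left number, group 3 the operator with any surrounding whitespace, and group 4 the right number.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from the answer above.
final class EquationCheck {
    private static final Pattern EQUATION =
            Pattern.compile("((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // Returns {left, operator, right}, or null when the whole input does not match.
    static String[] parts(String input) {
        Matcher m = EQUATION.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3).trim(), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("400,021+213.443")));
        System.out.println(Arrays.toString(parts("300 - -200")));
        System.out.println(Arrays.toString(parts("500 400"))); // no operator, so null
    }
}
```

Note that [\d,.]+ also accepts strings such as "1,,2" or "...", so a stricter number pattern may be needed if the input is not trusted.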
In your regex you have to escape the dot \. to match it literally and escape the \+ or else it would make the ? a possessive quantifier. To match 1+ digits you have to use a quantifier [0-9]+
For your example data, you could match 1+ digits followed by an optional part consisting of a dot or a comma and 1+ more digits, once at the start and once at the end. To match any single character in between, you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
Regex demo
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. Match any character (use .+ to repeat 1+ times)
Same as the first pattern

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number with a period, but the problem is that the number is only a substring of the text, so matching 1. Must have experience in Java 2. Team leader against \d*\.$ returns false.
Code
See regex in use here
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring either of the following matches
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
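In Java, the pattern and replacement explained above might be applied as in the following sketch. The class name OrderedListHtml and the toOrderedList method are assumptions for this example, and wrapping the items in <ol> is an addition so that the result is a complete list. Without the DOTALL flag, . does not match line breaks, so this assumes the whole text is on one line.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class OrderedListHtml {
    // Regex from the answer above, escaped for a Java string literal.
    private static final Pattern ITEM =
            Pattern.compile("\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)");

    static String toOrderedList(String text) {
        // $1 is the item text captured between one "N. " marker and the next.
        String items = ITEM.matcher(text).replaceAll("<li>$1</li>\n");
        return "<ol>\n" + items + "</ol>";
    }

    public static void main(String[] args) {
        System.out.println(
                toOrderedList("1. Must have experience in Java 2. Team leader..."));
    }
}
```

One caveat: an item whose own text contains a number followed by a period and a space would be split at that point, so this approach suits simple input only.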
But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need to use $ at the end of your regex.
You have to escape the dot . because it means any character in regex.
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more.
We want
<ol>
<li>one</li>
<li>two</li>
</ol>
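This can be done as (note the reluctant .*?, so that each match stops at the first period instead of running on to the last one in the text):
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");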
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.

Analysing a more complex regex

In a previous question that I asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described in the above link),
but I never managed to fully understand it. Can somebody help me? The regex I was given
is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that even after
thoroughly searching Google I could not find, like the question mark preceding the s at the
start, or how exactly the second parenthesis works with the question mark and the equals sign at the start. Is it also possible to expand it and make it work with other types of quotes, like “ ” for example?
Any help is really appreciated.
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead, it checks ahead in the regex, but matches "an empty string"(1) read more about that [here][1]
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches any character which is not a double quote sign. This means literally _any_ character, including newline characters without enabling the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
That complicated lookahead up there means the regex matches whitespace in this string only where it is not between two ":
text that has a "string in it".
When used with String.split, this splits the string into [text, that, has, a, "string in it".]
It will only match where an even number of " follows, so in the following string, with a single unbalanced ", only the spaces after the quote match:
text that nearly has a "string in it.
This splits the string into [text that nearly has a "string, in, it.]
(1) When I say that the lookahead matches "an empty string", I mean that it actually consumes nothing; it only looks ahead from the current point in the input and checks a condition. The actual match is made by \\s+, which follows the lookahead.
The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurrence that is followed by an even number of ", then followed by non-" until the end ($). In other words, it checks that there are an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |->will not split here since there is an odd number of " ahead
     ----
      |
      |->split here because there is an even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches the previous character or group one or more times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match a if it is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would not match c.
With (?s) (single-line mode) . would match newlines. So in this case there is no need for (?s), since there is no . in the pattern.
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)
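To see the behaviour described above in Java, a small sketch along these lines may help. It feeds the last regex to String.split; the class name QuoteAwareSplit and its split method are made up for the example.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class QuoteAwareSplit {
    // The regex suggested above, escaped for a Java string literal:
    // whitespace followed by an even number of double quotes up to the end.
    private static final String OUTSIDE_QUOTES =
            "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String text) {
        return text.split(OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // The space inside the quotes has one quote ahead of it, so it is kept.
        System.out.println(Arrays.toString(split("Hello hi \"Hi World\"")));
    }
}
```

Because the lookahead rescans the rest of the string at every whitespace run, this can get slow on very long inputs; a hand-written scanner may be preferable there.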

Match exact word using regex in java

I need a regular expression to match an exact word.
For example:
There is a string "draft guidance allerg Excellence" and I want to search for allerg, so I have written \ballerg\b. It gives me an exact match. But when I pass the string "draft guidance 12=allerg Excellence", it also returns true, which is wrong.
Which regular expression do I need to match only exact words?
The \b boundary would normally handle this situation, even in your case of "draft guidance 12=allerg Excellence"; however, you're saying that the = is part of the word (in normal English, this is not the case).
I'm assuming then that by "whole word", you mean a word that is surrounded by a space or normal sentence punctuation. In this case, a regex such as the following should work:
(?:^|[\s\.;\?\!,])allerg(?:$|[\s\.;\?\!,])
You can, obviously, add or remove characters as needed.
Regex Explained:
(?: # non-capturing group
^ # beginning of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
allerg # string to match
(?: # non-capturing group
$ # end of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
)
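A possible Java version of this check is sketched below. The containsWord helper and the WholeWord class are hypothetical names, and Pattern.quote is used so that the search word is treated literally. The delimiter set is the one from the regex above and can be adjusted as needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class WholeWord {
    // Characters accepted on either side of the word (from the regex above).
    private static final String DELIM = "[\\s.;?!,]";

    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile(
                "(?:^|" + DELIM + ")" + Pattern.quote(word) + "(?:$|" + DELIM + ")");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("draft guidance allerg Excellence", "allerg"));    // true
        System.out.println(containsWord("draft guidance 12=allerg Excellence", "allerg")); // false
    }
}
```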
If I understood the question correctly, you want to match the word "allerg". A word is enclosed by whitespace characters, and "=allerg" has the "=" character, which you don't want to match.
To match the word "allerg" you can use the following regex:
\s+allerg\s+

What does this Regular expression do?

I'm in the process of converting a program from Perl to Java. I have come across the line
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
I'm not very good with regular expressions, but from what I can tell this is matching something in the string $info{$host} against the regular expression ^\s*\(([^\)]+)\)\s*$ and assigning the match to $title.
My problem is that I have no clue what the regular expression is doing and what it will match. Any help would be appreciated.
Thanks
The regular expression matches a string that contains exactly one pair of matching parentheses (actually, one opening and one matching closing parenthesis, but inside any number of further opening parentheses may occur).
The string may begin and end with whitespace characters, but no others. Inside the parentheses, however, arbitrary characters may occur (at least one).
The following strings should match it:
(abc)
(()
(ab)
By the way, you may simply use the regular expression as-is in Java (after escaping the backslashes), using the Pattern class.
It will match a bunch of leading whitespace, followed by a left paren, followed by some text not including a right paren, followed by a right paren, followed by some more whitespace.
Matches:
(some stuff)
Fails:
(some stuff
some stuff)
(some stuff) asadsad
OK, step by step:
/ - quote the regex
^ - the beginning of the string
\s* - zero or more of any space-like character
\( - an actual ( character
( - begin a capture group
[^\)]+ - any character other than ), the + indicating at least one
) - end the capture group
\) - an actual ) character
\s* - zero or more space-like characters
$ - the end of the string
/ - close the regex quote
So as far as I can work out we are looking for strings like " (^) " or "(()"
methinks I am missing something here.
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
First, m// in list context returns the captured matches. my ($title) puts the right hand side in list context. Second, $info{$host} is matched against the following pattern:
/^ \s* \( ( [^\)]+) \) \s* $/x
Yes, I used the x flag so I could insert some spaces. ^\s* skips any leading whitespace. Then we have an escaped parenthesis (therefore no capture group is created). Then we have a capture group containing [^\)]+. That character class can be better written as [^)] because the right parenthesis is not special in a character class; it means anything but a right parenthesis.
If there are one or more characters other than a closing parenthesis following the opening parenthesis, followed by a closing parenthesis, optionally surrounded on either side by whitespace, that sequence of characters is captured and put into $title.
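A Java counterpart of the Perl line could look roughly like this. The class name TitleExtractor and the title method are assumptions for the example; the pattern is the same one with the backslashes doubled for a Java string literal, and null stands in for Perl's undefined value when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical translation of the Perl snippet, for illustration only.
final class TitleExtractor {
    private static final Pattern TITLE =
            Pattern.compile("^\\s*\\(([^)]+)\\)\\s*$");

    // Returns the text between the parentheses, or null if the input does not match.
    static String title(String info) {
        Matcher m = TITLE.matcher(info);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(title("  (some stuff) "));      // prints some stuff
        System.out.println(title("(some stuff) asadsad")); // prints null
    }
}
```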
