What does this Regular expression do? - java

I'm in the process of converting a program from Perl to Java. I have come across the line
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
I'm not very good with regular expressions, but from what I can tell this is matching the string $info{$host} against the regular expression ^\s*\(([^\)]+)\)\s*$ and assigning the match to $title.
My problem is that I have no clue what the regular expression is doing and what it will match. Any help would be appreciated.
Thanks

The regular expression matches a string that contains exactly one pair of matching parentheses (actually, one opening and one matching closing parenthesis, but inside any number of further opening parentheses may occur).
The string may begin and end with whitespace characters, but no others. Inside the parentheses, however, arbitrary characters may occur (at least one).
The following strings should match it:
(abc)
(()
(ab)
By the way, you may simply use the regular expression as-is in Java (after escaping the backslashes), using the Pattern class.
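As a rough sketch of what that could look like in Java (the class name TitleExtractor and the method extractTitle are made up for illustration, and returning null for "no match" is an assumption about how the caller wants failures reported):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the original program.
final class TitleExtractor {
    // Same pattern as the Perl line, with every backslash doubled for the Java string literal.
    private static final Pattern TITLE = Pattern.compile("^\\s*\\(([^\\)]+)\\)\\s*$");

    // Returns the text between the parentheses, or null when the input does not match.
    static String extractTitle(String info) {
        Matcher m = TITLE.matcher(info);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("  (some stuff)  "));     // some stuff
        System.out.println(extractTitle("(some stuff) asadsad")); // null
    }
}
```

Here group(1) plays the role of the list-context capture that Perl assigns to $title.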

It will match optional leading whitespace, followed by a left paren, followed by some text not including a right paren, followed by a right paren, followed by optional trailing whitespace, with nothing else in the string.
Matches:
(some stuff)
Fails:
(some stuff
some stuff)
(some stuff) asadsad

Ok step by step
/ - opens the regex
^ - the beginning of the string
\s* - zero or more whitespace characters
\( - an actual ( character
( - begin a capture group
[^\)]+ - any character except ), the + indicating at least one
) - end the capture group
\) - an actual ) character
\s* - zero or more whitespace characters
$ - the end of the string
/ - closes the regex
So as far as I can work out we are looking for strings like " (some title) ", and the capture group holds the text between the parentheses.

my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
First, m// in list context returns the captured matches. my ($title) puts the right hand side in list context. Second, $info{$host} is matched against the following pattern:
/^ \s* \( ( [^\)]+) \) \s* $/x
Yes, I used the x flag so I could insert some spaces. ^\s* skips any leading whitespace. Then we have an escaped parenthesis (therefore no capture group is created). Then we have a capture group containing [^\)]+. That character class can be better written as [^)] because the right parenthesis is not special in a character class; it means anything but a right parenthesis.
If there are one or more characters other than a closing parenthesis following the opening parenthesis, followed by a closing parenthesis, with the whole thing optionally surrounded on either side by whitespace, that sequence of characters is captured and put into $title.

Related

Analysing a more complex regex

In a previous question that I asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described in the question named above),
but I never managed to fully understand it. Can somebody help me? The regex I was given
is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that I could not find explained even after
thoroughly searching Google, like the question mark preceding the s at the
start, or how exactly the second parenthesis works with the question mark and the equals sign at the start. Is it also possible to expand it and make it work with other types of quotes, like “ ” for example?
Any help is really appreciated.
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead; it checks ahead in the string, but matches "an empty string" (1)
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches any character which is not a double quote sign. This means literally _any_ character, including newline characters without enabling the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
That complicated lookahead up there makes the regex match only the whitespace that is not in between two " in this string:
text that has a "string in it".
When used with String.split, splitting the string into; [text, that, has, a, "string in it".]
The lookahead only succeeds when an even number of " remain ahead, so with an unbalanced quote only the whitespace after that quote is matched:
text that nearly has a "string in it.
Splitting the string into [text that nearly has a "string, in, it.]
(1) When I say that the lookahead matches "an empty string", I mean that it consumes nothing: it only looks ahead from the current position and checks a condition. The actual match is made by \\s+, which follows the lookahead.
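A small sketch for trying the regex with String.split (the class name QuoteAwareSplit and its split helper are hypothetical; only the regex itself comes from the question):

```java
import java.util.Arrays;

// Hypothetical demo class; only REGEX is taken from the question.
final class QuoteAwareSplit {
    static final String REGEX = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    static String[] split(String text) {
        return text.split(REGEX);
    }

    public static void main(String[] args) {
        // Balanced quotes: whitespace inside the quoted part is left alone.
        System.out.println(Arrays.toString(split("text that has a \"string in it\".")));
        // prints [text, that, has, a, "string in it".]

        // One unbalanced quote: only whitespace with an even number (zero) of quotes ahead is split on.
        System.out.println(Arrays.toString(split("text that nearly has a \"string in it.")));
        // prints [text that nearly has a "string, in, it.]
    }
}
```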
The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurrence that is followed by an even number of ", then by non-" characters until the end ($). In other words, it checks that there is an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |-> will not split here since there is an odd number of " ahead
     ----
      |
      |-> split here because there is an even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches previous character or group 1 to many times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match if a is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would not match c
With (?s) (singleline mode) . would match newlines. So in this case there is no need for (?s), since the regex contains no .
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)
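A minimal sketch of that last regex used with String.split (EvenQuoteSplit is a made-up class name). Because it uses [^"]* rather than [^"]+, it also accepts an empty quoted string such as "", which the ([^\"]+\"){2} form does not:

```java
import java.util.Arrays;

// Hypothetical demo class for the suggested regex.
final class EvenQuoteSplit {
    // \s+ followed by a lookahead that requires an even number of quotes up to the end of the input.
    static final String REGEX = "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String text) {
        return text.split(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Hello hi \"Hi World\"")));
        // prints [Hello, hi, "Hi World"]
        System.out.println(Arrays.toString(split("a \"\" b")));
        // prints [a, "", b]
    }
}
```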

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"
\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
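The pieces listed above can be tried against the sample inputs from the question. This is only a sketch: QuotedWordMatcher and its methods are hypothetical names, Pattern.quote is added here so that a real word containing regex metacharacters is treated literally, and Matcher.matches() is used so the whole input has to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern "[^"]*\sSPECIFICWORD".
final class QuotedWordMatcher {
    static Pattern forWord(String word) {
        // opening quote, any non-quote characters, one whitespace, the literal word, closing quote
        return Pattern.compile("\"[^\"]*\\s" + Pattern.quote(word) + "\"");
    }

    static boolean matches(String word, String input) {
        return forWord(word).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("SPECIFICWORD", "\"RANDOMSTUFF SPECIFICWORD\""));                 // true
        System.out.println(matches("SPECIFICWORD", "\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\"")); // false
    }
}
```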
Regexes are powerful expressions and are applicable across virtually any language, so it would be a good thing to become comfortable with using them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based: by default in Java they match the start and end of the whole input, and most languages let you specify a 'multiline' mode (Pattern.MULTILINE in Java) that makes them match at the start and end of each line instead.
Will this string be matched on a line-by-line basis, or will it be found within the text? If the former, you can add anchors to ensure that it matches the whole string.
^(\".*\sSPECIFICWORD\")$
Saying, at the start of the line, look for a double quote followed by zero or more random characters followed by a single whitespace, followed by the specific word, followed by a double quote at the end of the string.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1
Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote

How does this Java regex to replace escaped characters work?

The following Java regex "seems" to be working. The intent is to remove the escape character, the backslash "\". That is, "\\{" should become "{".
My question is
Isn't the 10th char in the regex field - the closing parenthesis ")" - closing the regex group that began at char 5? So how is this working for the chars after the closing parenthesis at char 10?
Can someone break this regex down for me?
str = str.replaceAll("\\\\([{}()\\[\\]\\\\!&:^~-])", "$1");
Isn't the 10 char in the regex field - the closing parenthesis ")" - closing the regex group that began at char5? So how is this working for the chars after the closing parenthesis at char10?
No. The parentheses, both ( and ) are not meta-characters inside a character class. Note that inside a character class only these characters ^-[]\ have special meaning.
In the case of the caret (^) and the dash (-) they lose their special meaning if placed strategically within the char class: the caret if it's placed anywhere but the beginning, and the - if it's placed in the beginning or the end.
Can someone break this regex down for me?
Let's remove the double escapes needed by Java, which turns \\\\([{}()\\[\\]\\\\!&:^~-]) into:
\\([{}()\[\]\\!&:^~-]) # the actual regex
Which breaks down into:
\\ # match literal backslash
( # open capture group
[ # open character class, matching any of
{}()\[\]\\!&:^~- # these characters: {}()[]\!&:^~-
] # close character class
) # close capture group
Basically it says: match a backslash, followed by one of these characters {}()[]\!&:^~-, and put it into a capture group. This capture group is used in the replacement ($1), which replaces the whole match (backslash + character) with the character itself.
In other words, this removes leading backslashes from those special characters.
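A short sketch that exercises the replaceAll call from the question on a few inputs (Unescaper and unescape are hypothetical names used only for this illustration):

```java
// Hypothetical wrapper around the replaceAll call from the question.
final class Unescaper {
    static String unescape(String str) {
        // regex: a literal backslash, then one character from the class, captured as group 1
        return str.replaceAll("\\\\([{}()\\[\\]\\\\!&:^~-])", "$1");
    }

    public static void main(String[] args) {
        System.out.println(unescape("a\\{b\\}c")); // a{b}c
        System.out.println(unescape("x\\\\y"));    // x\y  (an escaped backslash becomes a single backslash)
        System.out.println(unescape("\\q"));       // \q   (q is not in the class, so nothing changes)
    }
}
```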
After removing the escapes, we are left with
\\([{}()\[\]\\!&:^~-])
   ^ character class
Everything within the character class here is literal, except [, ] and \ which have been escaped.

Regex pattern matching not working recursively

I want to implement the pattern matching in the form
(a+b)(c-or*or/d)..............
in any number of times.
I use the following pattern but it is not working recursively.
It is just reading the first group.
Pattern pattern;
String regex="(([0-9]*)([+,-,/,*])([0-9]*)*)";
pattern=Pattern.compile(regex);
Matcher match = pattern.matcher(userInput);
The regular expression you need to match that sort of sequence is this:
\s*-?\d+(?:\s*[-+/*]\s*-?\d+)+\s*
Let's break that down to its component parts!
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
(?: # Start sub-regex
\s* # Optional space
[-+*/] # Mandatory single arithmetic operator
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
)+ # End sub-regex: want one or more matches of it
\s* # Optional space
(If you don't want to match spaces, remove all of those \s* and be aware that it will surprise users quite a lot.)
Now, when encoding the above as a String literal in Java (before compilation) you need to be careful to escape each of the \ characters in it:
String regex="\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*";
The other thing to be aware of is that this doesn't pull the regular expression apart into pieces for Java to parse and build an expression evaluation tree from; it just (with the rest of your code) matches the whole string or not. (Even putting in capturing parentheses wouldn't help a lot; when put inside some form of repetition, they only report the last place in the string where they matched.) The simplest way of doing that properly would be to use a parser generator like Antlr (it would also let you do things like parenthesized subexpressions, managing operator precedence, etc.)
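As a sketch of how the string literal above might be used (ArithmeticSequence and isSequence are hypothetical names; matches() is used so that the whole input has to fit the pattern):

```java
import java.util.regex.Pattern;

// Hypothetical validator around the regex from this answer.
final class ArithmeticSequence {
    private static final Pattern SEQ =
            Pattern.compile("\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*");

    // True when the input is a number followed by one or more "operator number" pairs.
    static boolean isSequence(String input) {
        return SEQ.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSequence(" 12 * -3 / 4 ")); // true
        System.out.println(isSequence("1+"));            // false: operator without a right-hand number
        System.out.println(isSequence("7"));             // false: at least one operator is required
    }
}
```

This only answers yes or no; it does not evaluate or decompose the expression.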
You will need an expression like this
[0-9]+-[0-9]+[\/*-+][0-9]+[\/*-+][0-9]+[\/*-+][0-9]+[\/*-+][0-9]+
You have to match the whole expression. You can not match part of the expression and do a second search because the pattern is repeated.
Note: In Ruby, \ is the escape sequence for the / character, so you can omit it in C# or replace it with another character.
The pattern
<!--
\((\d|[\+\-\/\\\*\^%!]+|(or|and) *)+\)
Options: ^ and $ match at line breaks
Match the character “(” literally «\(»
Match the regular expression below and capture its match into backreference number 1 «(\d|[\+\-\/\\\*\^%!]+|(or|and) *)+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «+»
Match either the regular expression below (attempting the next alternative only if this one fails) «\d»
Match a single digit 0..9 «\d»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[\+\-\/\\\*\^%!]+»
Match a single character present in the list below «[\+\-\/\\\*\^%!]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A + character «\+»
A - character «\-»
A / character «\/»
A \ character «\\»
A * character «\*»
A ^ character «\^»
One of the characters “%!” «%!»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «(or|and) *»
Match the regular expression below and capture its match into backreference number 2 «(or|and)»
Match either the regular expression below (attempting the next alternative only if this one fails) «or»
Match the characters “or” literally «or»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «and»
Match the characters “and” literally «and»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “)” literally «\)»
-->
The calculation algorithm
For parsing and processing the input string you have to use a stack.
Regards
Cylian
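Your expression doesn't escape the literal parentheses ( and ).
Try this:
\G\(\d+[-+/*]\d+\)
\G anchors each match at the position where the previous match ended, so repeated find() calls only match consecutive groups
I changed your [0-9]* to \d+, which I think is more correct
I removed the commas from your character class; inside [...] no separator is needed, and a , (or |) there would be matched literally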

Java regexp error: \( is not a valid character

I was using Java regexps today and found that you are not allowed to use the following regexp sequence:
String pattern = "[a-zA-Z\\s\\.-\\)\\(]*";
If I do use it, it fails and tells me that \( is not a valid character.
But if I change the regexp to
String pattern = "[[a-zA-Z\\s\\.-]|[\\(\\)]]*";
then it works. Is this a bug in the regexp engine, or am I not understanding how to work with the engine?
EDIT: I had an error in my string: there shouldn't be two starting [[, there should be only one. This is now corrected.
Your regex has two problems.
You've not closed the character class.
The - is acting as a range operator with . on LHS and ) on RHS. But ) comes before . in Unicode, so this results in an invalid range.
To fix problem 1, close the char class or if you meant to not include [ in the allowed characters delete one of the [.
To fix problem 2, either escape the - as \\- or move the - to the beginning or to the end of the char class.
So you can use:
String pattern = "[a-zA-Z\\s\\.\\-\\)\\(]*";
or
String pattern = "[a-zA-Z\\s\\.\\)\\(-]*";
or
String pattern = "[-a-zA-Z\\s\\.\\)\\(]*";
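A small sketch for checking which of these variants java.util.regex accepts (DashInCharClass and compiles are hypothetical names; the patterns are the ones discussed above):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper that reports whether a regex compiles.
final class DashInCharClass {
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[a-zA-Z\\s\\.-\\)\\(]*"));   // false: range from . down to )
        System.out.println(compiles("[a-zA-Z\\s\\.\\-\\)\\(]*")); // true: dash escaped
        System.out.println(compiles("[a-zA-Z\\s\\.\\)\\(-]*"));   // true: dash at the end
        System.out.println(compiles("[-a-zA-Z\\s\\.\\)\\(]*"));   // true: dash at the beginning
    }
}
```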
You should only use the dash - at the end of the character class, since it is normally used to show a range (as in a-z). Rearrange it:
String pattern = "[[a-zA-Z\\s\\.\\)\\(-]*";
Also, I don't think you have to escape (.) characters inside brackets.
Update: As others pointed out, you must also escape the [ in a java regex character class.
The problem here is that \.-\) ("\\.-\\)" in a Java string literal) tries to define a range from . to ). Since the Unicode codepoint of . (U+002E) is higher than that of ) (U+0029) this is an error.
Try using this pattern and you'll see: [z-a].
The correct solution is to either put the dash - at the end of the character group (at which point it will lose its special meaning) or to escape it.
You also need to close the unclosed open square bracket or escape it, if it was not intended for grouping.
Also, escaping the fullstop . is not necessary inside a character group.
You have to escape the dash and close the unmatched square bracket. So you are going to get two errors with this regex:
java.util.regex.PatternSyntaxException: Illegal character range near index 14
because the dash is used to specify a range, and \) is obviously not a valid end for a range that starts at \. (it comes before it). If you escape the dash, making it [[a-zA-Z\s\.\-\)\(]* you'll get
java.util.regex.PatternSyntaxException: Unclosed character class near index 19
which means that you have an extra opening square bracket that is used to specify character class. I don't know what you meant by putting an extra bracket here, but either escaping or removing it will make it a valid regex.
