How does this Java regex to replace escaped characters work? - java

The following Java regex "seems" to be working. The intent is to remove the escape character, the backslash "\". That is, "\\{" should become "{".
My question is
Isn't the 10th character of the regex, the closing parenthesis ")", closing the group that began at character 5? So how does this work for the characters after that closing parenthesis?
Can someone break this regex down for me?
str = str.replaceAll("\\\\([{}()\\[\\]\\\\!&:^~-])", "$1");

Isn't the 10th character of the regex, the closing parenthesis ")", closing the group that began at character 5? So how does this work for the characters after that closing parenthesis?
No. Neither ( nor ) is a metacharacter inside a character class. Inside a character class, only the characters ^ - [ ] \ have special meaning (Java additionally treats && as the intersection operator).
The caret (^) and the dash (-) lose their special meaning when placed strategically within the character class: the caret when it appears anywhere but the beginning, and the dash when it appears at the beginning or the end.
Can someone break this regex down for me?
Let's remove the double escapes needed by Java, which turns \\\\([{}()\\[\\]\\\\!&:^~-]) into:
\\([{}()\[\]\\!&:^~-]) # the actual regex
Which breaks down into:
\\ # match literal backslash
( # open capture group
[ # open character class, matching any of
{}()\[\]\\!&:^~- # these characters: {}()[]\!&:^~-
] # close character class
) # close capture group
Basically it says: match a backslash, followed by one of these characters {}()[]\!&:^~-, and put that character into a capture group. This capture group is used in the replacement ($1), which replaces the whole match (backslash + character) with the character itself.
In other words, this removes leading backslashes from those special characters.
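The breakdown above can be checked with a small, self-contained sketch. The class and method names (UnescapeDemo, unescape) are made up for illustration; only the regex and the "$1" replacement come from the question.

```java
class UnescapeDemo {
    // The regex from the question, written as a Java string literal:
    // a literal backslash followed by one captured "special" character.
    private static final String ESCAPED_SPECIAL = "\\\\([{}()\\[\\]\\\\!&:^~-])";

    // Replaces each backslash + special character pair with the character alone.
    static String unescape(String s) {
        return s.replaceAll(ESCAPED_SPECIAL, "$1");
    }

    public static void main(String[] args) {
        System.out.println(unescape("a\\{b\\}")); // input a\{b\}  prints a{b}
        System.out.println(unescape("x\\\\y"));   // input x\\y    prints x\y
        System.out.println(unescape("p\\q"));     // input p\q     prints p\q (q is not in the class)
    }
}
```

Note that the parentheses inside the brackets are treated as plain characters, which is why "\\(" in the input loses its backslash just like "\\{" does.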

After removing the escapes, we are left with
\\([{}()\[\]\\!&:^~-])
^character class
Everything within the character class here is literal, except [, ] and \ which have been escaped.

Related

How to replace dashes with underscores within the square brackets using regex Java

I am trying to replace dashes within the square brackets with underscores but it replaces all dashes with underscores in string.
For example, I want to replace
"[a]-[a-gamma]"
with
"[a]-[a_gamma]"
but it replaces all dashes from the string with underscores.
You can use
String n="[a]-[a-gamma]";
System.out.println(n.replaceAll("-(?=[^\\[\\]]*\\])", "_"));
As for the regex itself: it matches a - only if it is followed by characters that are neither [ nor ], up to a ]. At that point we know we are "inside" a pair of []s. This can give a false positive (the 4th hyphen in [a-z]-[a-z] - ] [a-z]), but I hope that is not your case.
Output:
[a]-[a_gamma]
Use a negative lookahead:
str = str.replaceAll("-(?![^\\]]*\\[)", "_");
The regex matches dashes whose next square bracket character is not an opening square bracket.
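To compare the two lookahead answers, here is a minimal sketch. The class and method names are hypothetical; the two patterns are the ones quoted above. The inputs without any brackets are my own addition, chosen to show where the two approaches differ.

```java
class BracketDashLookahead {
    // Positive lookahead: the dash must be followed by non-bracket characters and then a ].
    static String withPositiveLookahead(String s) {
        return s.replaceAll("-(?=[^\\[\\]]*\\])", "_");
    }

    // Negative lookahead: the dash must NOT be followed by an opening [ before any ].
    static String withNegativeLookahead(String s) {
        return s.replaceAll("-(?![^\\]]*\\[)", "_");
    }

    public static void main(String[] args) {
        String input = "[a]-[a-gamma]";
        System.out.println(withPositiveLookahead(input)); // [a]-[a_gamma]
        System.out.println(withNegativeLookahead(input)); // [a]-[a_gamma]

        // With no brackets at all, the two behave differently:
        System.out.println(withPositiveLookahead("a-b")); // a-b
        System.out.println(withNegativeLookahead("a-b")); // a_b
    }
}
```

The negative-lookahead version treats "no opening bracket ahead" as good enough, so it also rewrites dashes in bracket-free text; whether that matters depends on your input.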
-(?=[^\\[]*\\])
You can use this. See the demo:
https://regex101.com/r/bN8dL3/6
If your brackets are balanced (or if an unclosed bracket is considered open until the end of the string), you can use this approach, which needs only a few steps to find each match:
pattern:
((?:\\G(?!\\A)|[^\\[]*\\[)[^\\]-]*)-
replacement:
$1_
pattern details:
( # open the capture group 1
(?: # open a non-capturing group for the 2 possible beginnings
\\G (?!\\A) # this one succeeds immediately after the last match
|
[^\\[]* \\[ # this one reaches the first opening bracket
# (so it is the first match)
)
[^\\]-]* # all that is not a closing bracket or a dash
) # close the capture group
- # the dash
The \G anchor marks the position after the last match. But at the beginning (when there has been no match yet), it matches the start of the string by default. That is why I added (?!\A): it makes this branch fail at the start of the string.
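A minimal sketch of the \G approach, for experimenting with it. The class and method names are invented for illustration; the pattern and the "$1_" replacement are the ones given above. The second and third inputs are my own test cases.

```java
class BracketDashContinuation {
    // Group 1 is either "continue right after the previous match" (\G, but not at
    // the very start of the input) or "skip ahead to the next opening bracket",
    // followed by characters that are neither ] nor -. Then comes the dash itself.
    private static final String PATTERN = "((?:\\G(?!\\A)|[^\\[]*\\[)[^\\]-]*)-";

    static String replaceDashesInBrackets(String s) {
        return s.replaceAll(PATTERN, "$1_");
    }

    public static void main(String[] args) {
        System.out.println(replaceDashesInBrackets("[a]-[a-gamma]")); // [a]-[a_gamma]
        System.out.println(replaceDashesInBrackets("[a-b-c]"));       // [a_b_c]
        System.out.println(replaceDashesInBrackets("x-y"));           // x-y
    }
}
```

In the second input the first match comes from the "skip to the bracket" branch, and the second dash is found by the \G branch continuing from where the previous match ended.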
How about this?
/\[[^\]]*?(-)[^\[]*?\]/g
Match group extracted (the captured dashes are the one inside [a-gamma] and the one inside [123-567567]):
"[a]-[a-gamma] - [[ - - [123-567567]]]"
       ^                    ^
Explanation available here: https://regex101.com/r/oC2xE0/1

How to match tab and newline but not space with REGEX?

I am trying to match the "tab" and "newline" characters, but not "space", with a regex in Java.
\s matches everything, i.e. tab, space, and newline... But I don't want the space to be matched.
How do I do that?
Thanks.
One way to do it is:
[^\\S ]
The negated character class makes this regex match anything except \\S (non-whitespace) and " " (space). In other words, it matches whatever \\s matches, except the space.
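A small sketch to see the double negation in action. The class and method names are hypothetical, and the explicit list in the second method is my own spelling-out of what \s covers besides the space (tab, newline, carriage return, form feed, vertical tab).

```java
class WhitespaceExceptSpace {
    // "Not (non-whitespace or space)" is the same as "whitespace other than space".
    static String markWithNegatedClass(String s) {
        return s.replaceAll("[^\\S ]", "_");
    }

    // The same set written out explicitly; \x0B is the vertical tab.
    static String markWithExplicitList(String s) {
        return s.replaceAll("[\\t\\n\\r\\f\\x0B]", "_");
    }

    public static void main(String[] args) {
        String input = "a b\tc\nd";
        System.out.println(markWithNegatedClass(input)); // a b_c_d
        System.out.println(markWithExplicitList(input)); // a b_c_d
    }
}
```

The space between "a" and "b" survives in both cases, while the tab and the newline are replaced.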
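Explicitly list them inside [...] (a set of characters):
"[\\t\\n\\r\\f\\v]"
Note that in Java 8 and later, \v stands for any vertical whitespace (it also covers \u0085, \u2028 and \u2029), so write \\x0B instead if you want only the vertical tab that \s covers.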

Java regexp error: \( is not a valid character

I was using Java regexp today and found that the following pattern is not allowed:
String pattern = "[a-zA-Z\\s\\.-\\)\\(]*";
If I do use it, it fails and tells me that \( is not a valid character.
But if I change the regexp to
String pattern = "[[a-zA-Z\\s\\.-]|[\\(\\)]]*";
Then it will work. Is this a bug in the regexp engine, or am I not understanding how to work with the engine?
EDIT: I had an error in my string: there shouldn't be two opening brackets ([[), only one. This is now corrected.
Your regex has two problems.
1. You've not closed the character class (the pattern originally started with [[).
2. The - is acting as a range operator with . on the LHS and ) on the RHS. But ) (U+0029) comes before . (U+002E) in Unicode, so this results in an invalid range.
To fix problem 1, close the char class or if you meant to not include [ in the allowed characters delete one of the [.
To fix problem 2, either escape the - as \\- or move the - to the beginning or to the end of the char class.
So you can use:
String pattern = "[a-zA-Z\\s\\.\\-\\)\\(]*";
or
String pattern = "[a-zA-Z\\s\\.\\)\\(-]*";
or
String pattern = "[-a-zA-Z\\s\\.\\)\\(]*";
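The behaviour described above can be checked with a small sketch. The class and helper names (RangeCheck, compiles) are made up for illustration; the patterns are the ones from this thread. Pattern.compile throws PatternSyntaxException for an invalid pattern, and the helper turns that into a boolean.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeCheck {
    // Returns true if the regex compiles, false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println(regex + " -> " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[a-zA-Z\\s\\.-\\)\\(]*"));   // false: range from . to ) is reversed
        System.out.println(compiles("[z-a]"));                    // false: same kind of reversed range
        System.out.println(compiles("[a-zA-Z\\s\\.\\-\\)\\(]*")); // true: dash escaped
        System.out.println(compiles("[a-zA-Z\\s\\.\\)\\(-]*"));   // true: dash at the end
        System.out.println(compiles("[-a-zA-Z\\s\\.\\)\\(]*"));   // true: dash at the beginning
    }
}
```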
A literal dash - is safest at the end of the character class, since elsewhere it normally denotes a range (as in a-z). Rearrange it:
String pattern = "[a-zA-Z\\s\\.\\)\\(-]*";
Also, I don't think you have to escape the (, . and ) characters inside brackets.
Update: As others pointed out, you must also escape a literal [ in a Java regex character class.
The problem here is that \.-\) ("\\.-\\)" in a Java string literal) tries to define a range from . to ). Since the Unicode codepoint of . (U+002E) is higher than that of ) (U+0029) this is an error.
Try using this pattern and you'll see: [z-a].
The correct solution is to either put the dash - at the end of the character group (at which point it will lose its special meaning) or to escape it.
You also need to close the unclosed open square bracket or escape it, if it was not intended for grouping.
Also, escaping the fullstop . is not necessary inside a character group.
You have to escape the dash and close the unmatched square bracket. So you are going to get two errors with this regex:
java.util.regex.PatternSyntaxException: Illegal character range near index 14
because the dash is used to specify a range, and \.-\) is not a valid range (. comes after ) in Unicode). If you escape the dash, making it [[a-zA-Z\s\.\-\)\(]*, you'll get
java.util.regex.PatternSyntaxException: Unclosed character class near index 19
which means that you have an extra opening square bracket, which starts a character class that is never closed. I don't know what you meant by putting an extra bracket here, but either escaping or removing it will make it a valid regex.

problem understanding a string pattern

I'm learning GWT by following this tutorial, but there's something I don't fully understand in step 4. The following line checks that a string matches a pattern:
if (!str.matches("^[0-9A-Z\\.]{1,10}$")) {...}
After checking the documentation for the Pattern class I understand that the characters ^ and $ represent the beginning and the end of the line, and that [...]{1,10} means that the part in brackets [...] has to be present at least once but not more than 10 times. What I don't understand is the final characters of the part in brackets. 0-9A-Z means a range of characters from 0 to 9 or from A to Z. But what does \\. mean?
It matches a dot character. Since dot has a special meaning in regexp, it must be escaped with a backslash. And because backslash has a special meaning in Java strings, it must be escaped with another backslash.
It is the dot (.). It is escaped because the dot is a special character in regexp syntax, and it takes two backslashes because \ is itself a special character in Java strings.
The dot "." in regex means "any character". An escaped dot ("\\." in a Java string literal, i.e. \. in the regex itself) means the dot character itself, without the special regex behaviour of the unescaped dot.
So, for example, "123.ABC" could be a line that matches the given regex (line breaks etc. not included).
It matches a dot character. A double backslash '\\' simply means a single '\', as you have to escape '\'s in Java strings. So '\\.' is translated to '\.', which means match just a '.' character. If you just used '.' by itself, without escaping, it would match any character. So you have to escape it to match a '.' character.
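A quick sketch to try the pattern from the question. The class and method names (SymbolCheck, isValidSymbol) are hypothetical; the regex is the one quoted from the tutorial.

```java
class SymbolCheck {
    // 1 to 10 characters, each of which is a digit, an uppercase letter, or a literal dot.
    static boolean isValidSymbol(String s) {
        return s.matches("^[0-9A-Z\\.]{1,10}$");
    }

    public static void main(String[] args) {
        System.out.println(isValidSymbol("123.ABC"));     // true
        System.out.println(isValidSymbol("1?3"));         // false: the dot is literal here, not "any character"
        System.out.println(isValidSymbol("abc"));         // false: lowercase letters are not allowed
        System.out.println(isValidSymbol(""));            // false: at least one character is required
        System.out.println(isValidSymbol("ABCDEFGHIJK")); // false: 11 characters is too long
    }
}
```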

What does this Regular expression do?

I'm in the process of converting a program from Perl to Java. I have come across the line
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
I'm not very good with regular expressions, but from what I can tell this is matching something in the string $info{$host} against the regular expression ^\s*\(([^\)]+)\)\s*$ and assigning the match to $title.
My problem is that I have no clue what the regular expression is doing and what it will match. Any help would be appreciated.
Thanks
The regular expression matches a string that contains exactly one pair of matching parentheses (actually, one opening and one matching closing parenthesis, but inside any number of further opening parentheses may occur).
The string may begin and end with whitespace characters, but no others. Inside the parentheses, however, arbitrary characters may occur (at least one).
The following strings should match it:
(abc)
(()
(ab)
By the way, you may simply use the regular expression as-is in Java (after escaping the backslashes), using the Pattern class.
It will match a bunch of leading whitespace, followed by a left paren, followed by some text not including a right paren, followed by a right paren, followed by some more whitespace.
Matches:
(some stuff)
Fails:
(some stuff
some stuff)
(some stuff) asadsad
Ok step by step
/ - quote the regex
^ - the beginning of the string
\s* - zero or more whitespace characters
\( - an actual ( character
( - begin a capture group
[^\)]+ - one or more characters that are not ) (the leading ^ negates the character class)
) - end the capture group
\) - an actual ) character
\s* - zero or more whitespace characters
$ - the end of the string
/ - close the regex quote
So we are looking for strings like " (abc) " or "(()", and the text between the outer parentheses is what gets captured.
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
First, m// in list context returns the captured matches. my ($title) puts the right hand side in list context. Second, $info{$host} is matched against the following pattern:
/^ \s* \( ( [^\)]+) \) \s* $/x
Yes, I used the x flag so I could insert some spaces. ^\s* skips any leading whitespace. Then we have an escaped parenthesis (therefore no capture group is created). Then we have a capture group containing [^\)]+. That character class can be better written as [^)], because the right parenthesis is not special in a character class; it means anything but a right parenthesis.
If one or more characters other than a closing parenthesis sit between an opening and a closing parenthesis, optionally surrounded on either side by whitespace, that sequence of characters is captured and put into $title.
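For the Perl-to-Java conversion, a minimal sketch of an equivalent could look like this. The class and method names (TitleExtractor, extractTitle) and the use of Optional are my own choices; only the regex comes from the Perl line, with backslashes doubled for the Java string literal.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // Same regex as the Perl line; [^)] is used instead of [^\)] since ) needs no escape in a class.
    private static final Pattern TITLE = Pattern.compile("^\\s*\\(([^)]+)\\)\\s*$");

    // Returns the text between the parentheses, or an empty Optional if the input does not match.
    static Optional<String> extractTitle(String info) {
        Matcher m = TITLE.matcher(info);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("  (some stuff)  "));     // Optional[some stuff]
        System.out.println(extractTitle("(some stuff"));          // Optional.empty
        System.out.println(extractTitle("(some stuff) asadsad")); // Optional.empty
    }
}
```

Where the Perl code leaves $title undefined on a failed match, this sketch returns an empty Optional; returning null instead would be an equally reasonable choice.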
