java regex escape all reserved characters - java

I understand that you can use Pattern.quote to escape characters within a string that are reserved by regex. But I do not understand why the following is not working:
String s="and this)";
String ps = "\\b("+Pattern.quote(s)+")\\b";
//String pp = Pattern.quote(pat);
Pattern p=Pattern.compile(ps);
Matcher mm = p.matcher("oh and this) is");
System.out.println(mm.find()); //print false, but expecting true?
When String s="and this)" is changed to String s="and this", i.e., without the ), it works. How should I change the code so that it also works as expected with the ")"?
Thanks

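\b only matches between a word character and a non-word character, so it cannot match after ")" when a space follows. Use negative look-arounds instead, to check that no word character comes directly before or after the keyword: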
String ps = "(?<!\\w)"+Pattern.quote(s)+"(?!\\w)";
This way you will still match s as a whole word, and it won't be a problem if the keyword has non-word characters at the beginning or end.
Demo:
String s="and this)";
String ps = "(?<!\\w)"+Pattern.quote(s)+"(?!\\w)";
Pattern p=Pattern.compile(ps);
Matcher mm = p.matcher("oh and this) is");
System.out.println(mm.find());
Result: true
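A minimal sketch comparing the \b pattern from the question with the look-around pattern from the answer. The class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class QuotedKeywordDemo {
    // Original approach: \b on both sides of the quoted keyword.
    static boolean findWithWordBoundary(String keyword, String text) {
        String regex = "\\b(" + Pattern.quote(keyword) + ")\\b";
        return Pattern.compile(regex).matcher(text).find();
    }

    // Look-around approach: no word character directly before or after the keyword.
    static boolean findWithLookarounds(String keyword, String text) {
        String regex = "(?<!\\w)" + Pattern.quote(keyword) + "(?!\\w)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "oh and this) is";
        // ")" followed by a space is not a word boundary, so \b fails here.
        System.out.println(findWithWordBoundary("and this)", text)); // false
        System.out.println(findWithLookarounds("and this)", text));  // true
        // Still a whole-word match: "and thi" is followed by the word character 's'.
        System.out.println(findWithLookarounds("and thi", text));    // false
    }
}
```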

Related

Regex java : replaceAll whitespace except between hyphen

I've got a string:
-----test test----- testestestest testestest -----test test-----
I'd like to replace each whitespace with \n, but I'd have to keep the whitespaces between the hyphens. Here is perfect result:
-----test test-----\ntestestestest\ntestestest\n-----test test-----
I've tried a lot of different regex but none of them work, here is my best try..
Pattern ws = Pattern.compile("\\s(?![\\-]*\\-)");
Matcher matcher = ws.matcher(myString);
String result = matcher.replaceAll("\n");
Could somebody help me?
PS: What I really don't understand is that when I replace the hyphens with curly braces (in the string as well as in the regex), it works correctly: \s(?![^\{]*\})
Just match whitespace at the end of a line:
/\s$/
Here's the code:
String result = myString.replaceAll("(?m)\\s$", "\\\\n");
Result:
-----test test-----\n
testestestest\n
testestest\n
-----test test-----\n
Applied to your code:
Pattern ws = Pattern.compile("\\s$", Pattern.MULTILINE);
Matcher matcher = ws.matcher(myString);
String result = matcher.replaceAll("\\\\n");
Do you know whether there is always a single space at the end of every line? If so, use this:
String text = "-----test test----- ";
text = text.substring(0, text.length() - 1) + "\\n";
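Another way to approach the original question, sketched here under the assumption that every protected block starts and ends with a run of hyphens and contains no hyphens in between: match either a whole hyphen block or a run of whitespace, copy the block through unchanged, and replace everything else that matched with a newline. The class and method names are hypothetical, and the sketch inserts real newline characters rather than the two-character text \n.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenBlockDemo {
    // Either a hyphen-delimited block (captured as group 1) or a run of whitespace.
    private static final Pattern TOKEN = Pattern.compile("(-+[^-]*-+)|\\s+");

    static String breakOutsideHyphens(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String block = m.group(1);
            // Keep hyphen blocks as they are; turn any other whitespace into a newline.
            m.appendReplacement(sb, block != null ? Matcher.quoteReplacement(block) : "\n");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "-----test test----- testestestest testestest -----test test-----";
        System.out.println(breakOutsideHyphens(s));
    }
}
```

Because the hyphen block is tried first at each position, the whitespace inside it is consumed as part of the block and never reaches the whitespace alternative.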

Java Pattern / Matcher not finding word break

I am having trouble with Java Pattern and Matcher. I've included a very simplified example of what I'm trying to do.
I had expected the pattern ".\b" to find the last character of the first word (or "4" in the example), but as I step through the code, m.find() always returns false. What am I missing here?
Why does the following Java code always print out "Not Found"?
Pattern p = Pattern.compile(".\b");
Matcher m = p.matcher("102939384 is a word");
int ixEndWord = 0;
if (m.find()) {
ixEndWord = m.end();
System.out.println("Found: " + ixEndWord);
} else {
System.out.println("Not Found");
}
You need to escape the backslash in the string literal: ".\\b"
Basically, in a Java String literal the backslash has to be escaped. So "\\" becomes the single character '\'.
So the String literal ".\\b" becomes the literal String .\b, which is what the Pattern sees.
To expand upon AntonH's comment, whenever you want the "\" character to appear in a regular expression, you have to escape it so that it actually appears in the string you are passing in.
As is, ".\b" is a string consisting of a dot . followed by the backspace character, which is what \b represents in a Java string literal. Compare that with ".\\b", which is the regex .\b.
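The difference between the two literals can be checked directly. This is a small sketch; the class and helper names are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBreakDemo {
    // Returns the end index of the first match, or -1 if there is no match.
    static int endOfFirstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String text = "102939384 is a word";
        // ".\b" is a dot followed by the backspace character (U+0008).
        System.out.println((int) ".\b".charAt(1));           // 8
        System.out.println(endOfFirstMatch(".\b", text));    // -1
        // ".\\b" is the regex .\b: any character followed by a word boundary.
        System.out.println(endOfFirstMatch(".\\b", text));   // 9
    }
}
```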

Find a substring in a string using a regular expression - JAVA

Suppose I have a string " kk a.b.cjkmkc jjkocc a.b.c. jjj 'a.b.ckkkkkkkkkkkkkkkk ' "
I want to replace only the occurrences of the substring a.b.c that are outside the single quotes, but it is not working.
Here is my code:
String str = " kk a.b.cjkmkc jjkocc a.b.c. jjj 'a.b.ckkkkkkkkkkkkkkkk ' ";
Pattern p = Pattern.compile("a\\.b\\.c");
Matcher m = p.matcher(str);
boolean found = m.find();
Use this pattern: a\.b\.c(?=(([^']*'){2})*[^']*$)
To search for a substring outside quotes, you can do something like this:
Pattern pat = Pattern.compile("^(?:[^']|'[^']*')*?a\\.b\\.c");
The first part will skip over:
every character that isn't a quote mark ([^']), or
every sequence of non-quote-mark characters enclosed in quotes ('[^']*').
Once those are skipped, then if it sees the pattern you want, it will know that it isn't inside quote marks.
This will handle a simple case. If things start getting more complicated, e.g. you want to allow \' to quote a quote mark in your input string the way C or Java does in a string literal, the regex starts getting more complicated, and you can quickly reach a point where either your regex is unreadable or regexes aren't a suitable solution.
EDIT: fixed to put "reluctant" qualifier after second *, so that the first a.b.c will be found.
EDIT 2: If you want to replace the substring you find, it gets trickier. The above pattern matches the entire beginning of the string up through a.b.c, and I couldn't get a look-behind to work so that the match would be only the a.b.c part. I think you'll need to put the beginning of the string in a group, and then use $1 in the replacement string to copy the beginning:
Pattern pat = Pattern.compile("^((?:[^']|'[^']*')*?)a\\.b\\.c");
Matcher m = pat.matcher(source);
if (m.find()) {
result = m.replaceFirst("$1replacement");
}
I'm not sure replaceAll works with this, so if you want to replace all of them, you may need to loop.
I wouldn't mess with REGEX.
public static void main(String[] args) {
String str = " kk a.b.cjkmkc jjkocc a.b.c. jjj 'a.b.ckkkkkkkkkkkkkkkk ' ";
String[] s = str.split("'");
str = s[0].replaceAll("[abc]", "") + "'"+ s[1]+"'"
+ s[2].replaceAll("[abc]", "");
System.out.println(str);
}
Output:
kk ..jkmk jjko ... jjj 'a.b.ckkkkkkkkkkkkkkkk '
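Inefficient, but it works. (Note that this removes the individual letters a, b and c outside the quotes rather than the sequence a.b.c, and it assumes exactly one quoted section.)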
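The lookahead pattern suggested earlier in this thread can be combined with replaceAll, since it matches only the a.b.c itself. A minimal sketch, assuming the quotes in the input are balanced; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;

class OutsideQuotesDemo {
    // a.b.c followed by an even number of single quotes up to the end of the input,
    // which means the match itself is not inside a quoted section.
    private static final String OUTSIDE_QUOTES = "a\\.b\\.c(?=(([^']*'){2})*[^']*$)";

    static String replaceOutsideQuotes(String input, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return input.replaceAll(OUTSIDE_QUOTES, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String str = " kk a.b.cjkmkc jjkocc a.b.c. jjj 'a.b.ckkkkkkkkkkkkkkkk ' ";
        System.out.println(replaceOutsideQuotes(str, "X"));
    }
}
```

The lookahead rescans the rest of the input for every candidate match, so this is convenient for short strings but not efficient for long ones.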

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can appear in any combination, for example Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
StringBuffer sb = new StringBuffer();
while(m.find()){
m.appendReplacement(sb,REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere in this forum, as Java regex supposedly does not recognize \s).
The problem I am facing with this code is that the search stops at the first match, and escapes the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
Your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
does not do what you intend. The e* quantifier applies only to the letter e, and the colon is optional, so it matches strings like:
Reee :
R
which makes no sense for what you are trying to do.
You'd better use a regex like the following:
yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )
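The suggested replaceAll call can be tried on the sample title from the question. A short sketch; the class and method names are made up for the example:

```java
class ReplyPrefixDemo {
    // Strips any leading run of "Re:" prefixes, ignoring case and surrounding whitespace.
    static String stripReplyPrefixes(String title) {
        return title.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        // The "Re:" inside the sentence is kept because the pattern is anchored with ^.
        System.out.println(stripReplyPrefixes(title)); // This is a Re: sample string.
    }
}
```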

Getting dialogue snippets from text using regular expressions

I'm trying to extract snippets of dialogue from a book text. For example, if I have the string
"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."
Then I want to extract "What's the matter with the flag?" and "Seems all right to me.".
I found a regular expression to use here, which is "[^"\\]*(\\.[^"\\]*)*". This works great in Eclipse when I'm doing a Ctrl+F find regex on my book .txt file, but when I run the following code:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
if(m.find())
System.out.println(m.group(1));
The only thing that prints is null. So am I not converting the regex into a Java string properly? Do I need to take into account the fact that Java Strings have a \" for the double quotes?
In a natural language text, it's not likely that " is escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".
As a Java string literal, this is "\"([^\"]*)\"".
Here it is in Java:
String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
System.out.println(m.group(1));
}
The above prints:
What's the matter with the flag?
Seems all right to me.
On escape sequences
Given this declaration:
String s = "\"";
System.out.println(s.length()); // prints "1"
The string s only has one character, ". The \" is an escape sequence present at the Java source code level; the string itself contains no backslash.
See also
JLS 3.10.6 Escape Sequences for Character and String Literals
The problem with the original code
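There's actually nothing wrong with the pattern per se, but you're not capturing the right portion. Group 1 isn't capturing the quoted text; since the input contains no backslashes, that group never participates in the match, so m.group(1) returns null. Here's the pattern with the correct capturing group: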
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
System.out.println(m.group(1));
}
For visual comparison, here's the original pattern, as a Java string literal:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                            why capture this part?
And here's the modified pattern:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  we want to capture this part!
As mentioned before, though: this complicated pattern isn't necessary for natural language text, which isn't likely to contain escaped quotes.
See also
regular-expressions.info/Grouping and backreferences
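To see the null from the question directly, the original and the corrected pattern can be compared on the same input. A small sketch; the class and helper names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Returns group 1 of the first match; null if there is no match
    // or if group 1 did not participate in the match.
    static String firstGroupOne(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "\"Seems all right to me.\"";
        String original = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
        String corrected = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
        // The text has no backslashes, so the original group 1 matches zero times.
        System.out.println(firstGroupOne(original, text));  // null
        System.out.println(firstGroupOne(corrected, text)); // Seems all right to me.
    }
}
```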
