I tried this code:
string.replaceAll("\\(.*?)","");
But it returns null. What am I missing?
Try:
string.replaceAll("\\(.*?\\)","");
You didn't escape the second parenthesis and you didn't add an additional "\" to the first one.
First, do you wish to remove the parentheses along with their content? Although the title of the question indicates no, I am assuming that you do wish to remove the parentheses as well.
Secondly, can the content between the parentheses contain nested matching parentheses? This solution assumes yes. Since the Java regex flavor does not support recursive expressions, the solution is to first craft a regex which matches the "innermost" set of parentheses, and then apply this regex in an iterative manner replacing them from the inside-out. Here is a tested Java program which correctly removes (possibly nested) parentheses and their contents:
import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "stuff1 (foo1(bar1)foo2) stuff2 (bar2) stuff3";
        // Matches an innermost pair: no parentheses between "(" and ")".
        String re = "\\([^()]*\\)";
        Pattern p = Pattern.compile(re);
        Matcher m = p.matcher(s);
        // Each pass removes the current innermost pairs; repeat until none remain.
        while (m.find()) {
            s = m.replaceAll("");
            m = p.matcher(s);
        }
        System.out.println(s);
    }
}
Test Input:
"stuff1 (foo1(bar1)foo2) stuff2 (bar2) stuff3"
Test Output:
"stuff1  stuff2  stuff3" (two spaces remain wherever a parenthesized group was removed)
Note that the lazy-dot-star solution does not work for nested parentheses, because it fails to match the innermost set. (For instance, it erroneously matches (foo1(bar1) in the example above.) This is a very common regex mistake: never use the dot when a more precise expression is available! In this case, the content between an "innermost" pair of matching parentheses consists of any characters that are not an opening or closing parenthesis (i.e. use [^()]* instead of .*?).
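To make the difference concrete, here is a minimal sketch that runs both patterns on a small nested input. The class name NestedParens and the helper stripParens are made up for this illustration; the loop is a variation on the program above, not a replacement for it.

```java
import java.util.regex.Pattern;

class NestedParens {
    // Matches one innermost pair: "(" + no parentheses + ")".
    private static final Pattern INNERMOST = Pattern.compile("\\([^()]*\\)");

    // Hypothetical helper: removes innermost pairs until nothing changes.
    static String stripParens(String s) {
        String previous;
        do {
            previous = s;
            s = INNERMOST.matcher(s).replaceAll("");
        } while (!s.equals(previous));
        return s;
    }

    public static void main(String[] args) {
        String s = "a (b(c)d) e";
        // Lazy dot-star stops at the first ")", so "d)" is left behind.
        System.out.println(s.replaceAll("\\(.*?\\)", "")); // a d) e
        // Inside-out removal handles the nesting (two spaces remain).
        System.out.println(stripParens(s));                // a  e
    }
}
```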
Try string.replaceAll("\\(.*?\\)","").
string.replaceAll("\\([^\\)]*\\)","");
This way you are saying match a bracket, then all non-closing bracket chars, and then a closing bracket. This is usually faster than reluctant or greedy .* matchers.
I'm trying to fetch first paragraph content from HTML snippet... nothing easier, huh? But for some reason, .*? operator seems to work greedy:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test
{
public static void main(String[] args)
{
Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
System.out.println(match.matches());
System.out.println(match.group(1));
}
}
I expect to match just the content of the first paragraph (foo), but here is the result:
$ javac test.java && java test
true
foo</p> <p>bar
Any reason why the .*? continues to match after first </p>?
As explained by npinti in the comments, the problem is caused by calling match.matches(). This attempts to match your pattern against the entire input string. It only succeeds if the regex engine finds some way to express your whole string as an instance of your pattern. The only way to achieve this is for it to match (.*?) against foo</p> <p>bar.
There are two ways to solve this:
The easiest is to switch to match.find(). This finds the first match of your pattern within the string. Since there is no requirement for the whole string to match, the non-greedy quantifier ensures you get foo as required.
Adjust your pattern to match the whole string. I.e. "<p(?: [^>]*)?>(.*?)</p>.*".
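A minimal sketch of the second option, assuming the input starts with the first paragraph tag (matches() still fails if anything precedes it). The class name FirstParagraph and the helper firstParagraph are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstParagraph {
    // The trailing ".*" consumes the rest of the input, so matches() can
    // succeed while the reluctant group stops at the first "</p>".
    private static final Pattern WHOLE =
        Pattern.compile("<p(?: [^>]*)?>(.*?)</p>.*", Pattern.DOTALL);

    // Returns the content of the leading paragraph, or null if the input
    // does not start with a <p> element.
    static String firstParagraph(String html) {
        Matcher m = WHOLE.matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"baz\">foo</p> <p>bar</p>";
        System.out.println(firstParagraph(html)); // foo
    }
}
```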
Inevitably, however, these "simple" plans to parse some HTML grow more and more unwieldy as requirements change. It really is quite simple to parse HTML with something like JSoup. Switch to that now and don't look back. Look how easy it is:
Document doc = Jsoup.parseBodyFragment("<p class=\"baz\">foo</p> <p>bar</p>");
Elements paragraphs = doc.getElementsByTag("p");
if (paragraphs.size() > 0) {
System.out.println(paragraphs.get(0).text());
}
Prints: foo.
Sorry for not posting this earlier; I did not have access to a Java environment.
The problem is that matches() tries to match the entire string, as if the pattern were implicitly wrapped in ^ and $. Replacing matches() with find() should fix the issue:
Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
System.out.println(match.find());
System.out.println(match.group(1));
Yields:
true
foo
I want a regular expression pattern that will match with the end of a string.
I'm implementing a stemming algorithm that will remove suffixes of a word.
E.g. for a word 'Developers' it should match 's'.
I can do it using following code :
Pattern p = Pattern.compile("s");
Matcher m = p.matcher("Developers");
m.replaceAll(" "); // it will replace all 's' with ' '
I want a regular expression that will match only a string's end something like replaceLast().
You need to match "s", but only if it is the last character of the input. This is achieved with the end-of-input anchor $:
input.replaceAll("s$", " ");
If you enhance the regular expression, you can replace multiple suffixes with one call to replaceAll:
input.replaceAll("(ed|s)$", " ");
Use $:
Pattern p = Pattern.compile("s$");
public static void main(String[] args)
{
String message = "hi this message is a test message";
message = message.replaceAll("message$", "email");
System.out.println(message);
}
Check this,
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
When matching a character at the end of string, mind that the $ anchor matches either the very end of string or the position before the final line break char if it is present even when the Pattern.MULTILINE option is not used.
That is why it is safer to use \z as the very end of string anchor in a Java regex.
For example:
Pattern p = Pattern.compile("s\\z");
will match s at the end of string.
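The difference only shows up when the input ends with a line break. The following sketch compares the two anchors on such an input; the class name EndAnchors and both helper methods are invented for this illustration.

```java
class EndAnchors {
    // "$" also matches just before a final line terminator.
    static String stripWithDollar(String s) {
        return s.replaceAll("s$", "");
    }

    // "\z" matches only at the very end of the input.
    static String stripWithSlashZ(String s) {
        return s.replaceAll("s\\z", "");
    }

    public static void main(String[] args) {
        String word = "Developers\n";
        // Make the newline visible when printing.
        System.out.println(stripWithDollar(word).replace("\n", "\\n")); // Developer\n
        System.out.println(stripWithSlashZ(word).replace("\n", "\\n")); // Developers\n
    }
}
```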
See the related post "Whats the difference between \z and \Z in a regular expression and when and how do I use it?".
NOTE: Do not put \z or $ after a pattern that can match an empty string, because String.replaceAll(regex, replacement) then makes the replacement twice: once for the non-empty match and once for the empty match at the very end. That is, do not use input.replaceAll("s*\\z", " ");, since for an input ending in s you will get two spaces at the end, not one. Either use "s\\z" to replace one s, or use "s+\\z" to replace one or more.
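A small sketch of this pitfall, with brackets added so the trailing spaces are visible. The class name EmptyMatchAtEnd and the helper replaceSuffix are hypothetical.

```java
class EmptyMatchAtEnd {
    // Thin wrapper so the patterns can be compared side by side.
    static String replaceSuffix(String regex, String input) {
        return input.replaceAll(regex, " ");
    }

    public static void main(String[] args) {
        String word = "Developers";
        // "s*" first matches the final "s", then matches the empty string
        // at the very end, so the replacement is applied twice.
        System.out.println("[" + replaceSuffix("s*\\z", word) + "]"); // [Developer  ]
        // "s+" cannot match the empty string, so there is one replacement.
        System.out.println("[" + replaceSuffix("s+\\z", word) + "]"); // [Developer ]
    }
}
```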
If you still want to use replaceAll with a zero-length pattern anchored at the end of string to replace with a single occurrence of the replacement, you can use a workaround similar to the one in the How to make a regular expression for this seemingly simple case? post (writing "a regular expression that works with String replaceAll() to remove zero or more spaces from the end of a line and replace them with a single period (.)").
Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
// group() throws IllegalStateException unless find() succeeded first.
if (m.find()) {
    System.out.println(m.group());
}
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
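As a rough sketch of that loop, the following collects the start index of every whole-word occurrence. The class name WordFinder, the helper positionsOf, and the sample sentence are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    // Returns the start index of each whole-word occurrence of word in text.
    static List<Integer> positionsOf(String word, String text) {
        // Pattern.quote keeps any regex metacharacters in word literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {       // each call resumes after the previous match
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "thistle" is skipped because no word boundary follows "this" there.
        System.out.println(positionsOf("this", "this is not thistle, but this is")); // [0, 25]
    }
}
```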
Full example method for matcher:
public static final String REGEX_FIND_WORD = "(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
    // Pattern.quote (java.util.regex.Pattern) keeps metacharacters in word literal.
    String regex = String.format(REGEX_FIND_WORD, Pattern.quote(word));
    return text.matches(regex);
}
Explanation:
(?i) - ignore case
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be changed by String.format (quoted to avoid regex errors)
\b - word boundary
.*? - allow (optionally) any characters after
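A short usage sketch of this method, repeated inside a wrapper class so it compiles on its own. The class name ContainsWordDemo and the sample inputs are assumptions. Note that . does not match line breaks by default, so this form is only suited to single-line text.

```java
import java.util.regex.Pattern;

class ContainsWordDemo {
    static final String REGEX_FIND_WORD = "(?i).*?\\b%s\\b.*?";

    static boolean containsWord(String text, String word) {
        String regex = String.format(REGEX_FIND_WORD, Pattern.quote(word));
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(containsWord("Print this", "THIS"));   // true (case-insensitive)
        System.out.println(containsWord("Print thistle", "this")); // false (not a whole word)
        // Quoting keeps "." literal, so "a.b" does not match "axb".
        System.out.println(containsWord("cost: axb", "a.b"));      // false
    }
}
```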
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
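The abc example from the quote can be checked directly. This sketch also shows Matcher.lookingAt(), which anchors only at the start; the class name MatchesVsFind and its helper names are invented here.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // Whole-input match, the same behavior as String.matches().
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Match anywhere in the input.
    static boolean matchAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Match anchored at the start only.
    static boolean matchAtStart(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("bc", "abc"));    // false
        System.out.println(matchAnywhere("bc", "abc")); // true
        System.out.println(matchAtStart("ab", "abc"));  // true
        System.out.println(wholeMatch(".*bc", "abc"));  // true
    }
}
```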
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This statement consists of three groups, which are indexed from 0.
0th group - ABCD
1st group - BC
2nd group - C
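The numbering can be verified with a few lines of Java. This is only a sketch; the class name GroupDemo and the helper groupsOf are hypothetical. Note that Matcher.groupCount() reports 2 here, because group 0 (the whole match) is not included in the count.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns groups 0..groupCount() for a whole-input match,
    // or an empty array if the input does not match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("A(B(C))D", "ABCD");
        for (int i = 0; i < groups.length; i++) {
            System.out.println(i + ": " + groups[i]); // 0: ABCD, 1: BC, 2: C
        }
    }
}
```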
So to find a specific word, you can use two methods of the Matcher class: find() to locate the text described by the regex, and then group(int) to get the String captured by a given group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
// Only read the group if find() actually succeeded.
if (m.find()) {
    System.out.println(m.group(1));
}
The above code prints "beautiful".
Is your searchString going to be a regular expression? If not, simply use String.contains(CharSequence s) (note that this finds substrings, not whole words).
System.out.println(input.matches(".*\\bthis$"));
This also works. Here the .* matches everything before the word (including the space), and then this is matched as a whole word at the end of the input.
What would be a convenient and reliable way to extract all the "{...}" tags from a given string? (Using Java).
So, to give an example:
Say I have: http://www.something.com/{tag1}/path/{tag2}/else/{tag3}.html
I want to get all the "{}" tags; I was thinking about using the Java .split() functions, but not sure what the correct regex would be for this.
Note also: tags can be called anything, not just tagX!
I would use regular expressions to match this. Something like this could work for your expression:
String regex = "\\{.*?\\}";
This will "reluctantly" match any substring that has { and } surrounding it. The .*? makes it find any characters between the { and }, but reluctantly, so it doesn't match the bigger String:
{tag1}/path/{tag2}/else/{tag3}
which would be a "greedy" match. Note that the curly braces in the regex are escaped with double backslashes: curly braces have a special meaning inside a regular expression (they delimit quantifiers such as {2,3}), so a literal brace needs a backslash, and that backslash must itself be doubled inside a Java string literal.
e.g.,
public static void main(String[] args) {
    String test = "http://www.something.com/{tag1}/path/{tag2}/else/{tag3}.html";
    String regex = "\\{.*?\\}";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(test);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}
With an output of:
{tag1}
{tag2}
{tag3}
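If only the names inside the braces are wanted, a capturing group can be added. The sketch below is a variation on the code above, not part of the original answer: it uses a negated character class instead of .*? and returns the names as a list. The class name TagExtractor and the helper tagNames are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // Group 1 captures the text between "{" and "}" (no braces inside).
    private static final Pattern TAG = Pattern.compile("\\{([^{}]*)\\}");

    static List<String> tagNames(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            names.add(m.group(1)); // group(0) would include the braces
        }
        return names;
    }

    public static void main(String[] args) {
        String url = "http://www.something.com/{tag1}/path/{tag2}/else/{tag3}.html";
        System.out.println(tagNames(url)); // [tag1, tag2, tag3]
    }
}
```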
You can read more about regular expressions here:
Oracle Regular Expressions Tutorial
and for greater detail, here:
www.regular-expressions.info/tutorial
I am trying to match everything inside double curly brackets in a string. I am using the following expression:
\{\{.*\}\}
Some examples:
The {{dog}} is not a cat. This correctly matches {{dog}}
However,
The {{dog}} is a {{cat}} matches everything after the first match instead of returning two matches. I want it to match twice, once for {{dog}} and once for {{cat}}
Does anyone know how to do this?
Thanks.
The greedy .* matches anything (except line breaks), so when there is more than one }} in the string, it always extends to the last }} (provided there is no \r or \n between the two }}).
Try to make the .* match reluctant (ungreedy) like this:
\{\{.*?}}
That's correct, you needn't escape the }.
You could also do:
\{\{[^}]*}}
if a {{ ... }} cannot contain a single } itself.
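The three variants can be compared side by side in Java. This is a sketch only: the class name BraceMatching and the helper findAll are hypothetical, and the closing braces are escaped here purely as a conservative choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceMatching {
    // Collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "The {{dog}} is a {{cat}}";
        // Greedy: one match running to the last "}}".
        System.out.println(findAll("\\{\\{.*\\}\\}", s));     // [{{dog}} is a {{cat}}]
        // Reluctant: stops at the first "}}" each time.
        System.out.println(findAll("\\{\\{.*?\\}\\}", s));    // [{{dog}}, {{cat}}]
        // Negated class: same result, as long as the content has no "}".
        System.out.println(findAll("\\{\\{[^}]*\\}\\}", s));  // [{{dog}}, {{cat}}]
    }
}
```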
Try with \{\{.*?\}\}
I believe it's because the pattern you have is greedy.
Wikipedia explains it pretty well.
You have to use non-greedy match:
\{\{.*?\}\}
to match everything between braces, use:
\{\{(.*?)\}\}
What you need is the "non-greedy" modifier - so your regex is \{\{.+?\}\}
Try this, it worked for me:
Pattern pattern = Pattern.compile("\\{\\{(.*?)\\}\\}");
Matcher matchPattern = pattern.matcher("The {{cat}} loves the {{dog}}");
while (matchPattern.find()) {
    System.out.println(matchPattern.group(1));
}