Regex dot operator in Java seems to always work greedy


I'm trying to fetch the content of the first paragraph from an HTML snippet... nothing easier, huh? But for some reason, the .*? quantifier seems to work greedily:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test
{
    public static void main(String[] args)
    {
        Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
        Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
        System.out.println(match.matches());
        System.out.println(match.group(1));
    }
}
I expect to match just the content of the first paragraph (foo), but here is the result:
$ javac test.java && java test
true
foo</p> <p>bar
Any reason why the .*? continues to match after first </p>?

As explained by npinti in the comments, the problem is caused by calling match.matches(). This attempts to match your pattern against the entire input string. It only succeeds if the regex engine finds some way to express the whole string as an instance of your pattern. The only way to achieve this is for it to match (.*?) against foo</p> <p>bar.
There are two ways to solve this:
1. The easiest is to switch to match.find(). This finds the first match of your pattern within the string. Since there is no requirement for the whole string to match, the non-greedy quantifier ensures you get foo as required.
2. Adjust your pattern to match the whole string, i.e. "<p(?: [^>]*)?>(.*?)</p>.*".
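The two fixes above can be sketched roughly like this. It is only an illustration: the class name ParagraphFixes and its two helper methods are made up for this example, and the input is the snippet from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class ParagraphFixes {
    static final String HTML = "<p class=\"baz\">foo</p> <p>bar</p>";

    // Fix 1: find() looks for the first match anywhere in the input.
    static String firstParagraphWithFind(String html) {
        Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
        Matcher match = regex.matcher(html);
        return match.find() ? match.group(1) : null;
    }

    // Fix 2: keep matches(), but let a trailing .* absorb the rest of the input.
    static String firstParagraphWithMatches(String html) {
        Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>.*", Pattern.DOTALL);
        Matcher match = regex.matcher(html);
        return match.matches() ? match.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphWithFind(HTML));    // foo
        System.out.println(firstParagraphWithMatches(HTML)); // foo
    }
}
```

Both variants rely on the lazy quantifier stopping at the first </p>; the difference is only in who consumes the rest of the string.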
Inevitably, however, these "simple" plans to parse some HTML grow more and more unwieldy as requirements change. It really is quite simple to parse HTML with something like JSoup. Switch to that now and don't look back. Look how easy it is:
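// Requires the jsoup library on the classpath.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.parseBodyFragment("<p class=\"baz\">foo</p> <p>bar</p>");
Elements paragraphs = doc.getElementsByTag("p");
if (paragraphs.size() > 0) {
    System.out.println(paragraphs.get(0).text());
}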
Prints: foo.

Sorry for not posting this earlier; I did not have access to a Java environment.
The problem is that matches() tries to match the entire string, as if the pattern were implicitly wrapped in ^ and $. Replacing matches() with find() should fix the issue:
Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
System.out.println(match.find());
System.out.println(match.group(1));
Yields:
true
foo

Related

Java regex to parse a particular semicolon delimited param from a URL?

I have a URL I'm expecting like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213
I want to parse out
session-id=1FSDSF2132FSADASD13213
Using a regular expression in Java, what would be the best approach to take for this?
Using a regex testing website I've experimented with some different ways, but I'm wondering which approach is the most fail-safe and still works in case the URL is actually formed like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
or
www.somewebsite.com/misc-session/extra-path/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
I am always just looking for the value of "session-id".
EDIT:
The value of session-id is NOT limited to digits; it is guaranteed to contain a combination of both letters and digits.

What is the best approach that is the most fail safe, and protected.
Well, I think matching a word boundary on both sides will be enough.
Regex: \bsession-id=\d+\b
Note: use \\d and \\b if the regex flavor you are using needs double escaping (as Java string literals do).
In case session-id can contain characters in the range [A-Za-z0-9], use this regex instead.
Regex: \bsession-id=[A-Za-z0-9]+\b
Remember to include
import java.util.regex.Matcher;
import java.util.regex.Pattern;
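Putting the alphanumeric variant into Java could look roughly like the sketch below. The class name SessionIdExtractor and the extra capturing group around the value are my own additions for illustration; the sample URLs are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the capturing group is added so only the value is returned.
class SessionIdExtractor {
    private static final Pattern SESSION_ID =
        Pattern.compile("\\bsession-id=([A-Za-z0-9]+)\\b");

    // Returns the value after "session-id=", or null if the URL has no such parameter.
    static String extract(String url) {
        Matcher m = SESSION_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213",
            "www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false",
            "www.somewebsite.com/misc-session/extra-path/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false"
        };
        for (String url : urls) {
            System.out.println(extract(url)); // 1FSDSF2132FSADASD13213 each time
        }
    }
}
```

Use m.group(0) instead of m.group(1) if you want the whole session-id=... pair rather than just the value.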

Try this one:
String str = "www.somewebsite.com/misc-session/;session-id=213213213";
Pattern p = Pattern.compile("(session-id=\\d+)");
Matcher m = p.matcher(str);
if (m.find()) {
    System.out.println(m.group(0));
}
Note that session-id= is always present and you are interested in the number that follows it, which is represented by \d (written as \\d in a Java string). The + stands for one or more digits.
However, it is better to look at the detailed description at Regex101.

Regex extract string in java

I'm trying to extract a substring from a String using a regex in Java:
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I am going wrong, but I don't get any output from
if (matcher.find())
{
    System.out.println(matcher.group(1));
}

You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
    System.out.println(matcher.group(2));
}

You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a newline, any number of times. It is capturing, so your first group will always be everything before <_1:InsurerId>, or an empty string. You could write .* instead, although in Java the dot only matches newlines when the pattern is compiled with Pattern.DOTALL. You can even leave this part out, as it isn't actually part of the string you want to match; using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
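For completeness, here is a small sketch of how that pattern could be used with find(). The class name InsurerIdExtractor and the surrounding <a> element in the sample are invented for the example; only the InsurerId line comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from this answer.
class InsurerIdExtractor {
    private static final Pattern INSURER_ID =
        Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");

    // Returns the text of the first InsurerId element, or null if none is found.
    static String extract(String xml) {
        Matcher matcher = INSURER_ID.matcher(xml);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The <a> wrapper is made-up sample data.
        String abc = "<a>\n<_1:InsurerId>F2021633_V1</_1:InsurerId>\n</a>";
        System.out.println(extract(abc)); // F2021633_V1
    }
}
```

Because the pattern no longer has the leading and trailing ((.|\n)*) parts, the value is now in group 1, and calling find() in a loop would return every InsurerId in the input.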

If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
    System.out.println(matcher.group("id"));
}
Because of the backreference, matches() fails when the closing tag's prefix differs from the opening one, for example on this text:
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is the correct behavior.
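To see the backreference at work, a quick check along these lines can help. The class name NamedGroupDemo and the helper idOrNull are made up; the pattern and both inputs are the ones from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the named-group pattern above.
class NamedGroupDemo {
    private static final Pattern TAG = Pattern.compile(
        "(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");

    // Returns the captured id, or null when the whole element does not match.
    static String idOrNull(String element) {
        Matcher matcher = TAG.matcher(element);
        return matcher.matches() ? matcher.group("id") : null;
    }

    public static void main(String[] args) {
        System.out.println(idOrNull("<_1:InsurerId>F2021633_V1</_1:InsurerId>")); // F2021633_V1
        System.out.println(idOrNull("<_1:InsurerId>F2021633_V1</_2:InsurerId>")); // null
    }
}
```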
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
You might also consider using a different tool (an XML parser) instead of a regex, since other people have to support your code and a complex regex is usually difficult to understand.
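As a rough idea of what the XML parser route could look like, here is a sketch using the DOM parser that ships with the JDK. It assumes the snippet sits inside a well-formed document (the <root> wrapper is invented) and that the tag is literally named _1:InsurerId; the class name InsurerIdDom is hypothetical, and namespace handling is deliberately switched off to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; assumes well-formed XML and a fixed tag name.
class InsurerIdDom {
    static String extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Not namespace-aware, so "_1:InsurerId" is treated as a plain tag name.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("_1:InsurerId");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The <root> wrapper is made-up sample data.
        String xml = "<root><_1:InsurerId>F2021633_V1</_1:InsurerId></root>";
        System.out.println(extract(xml)); // F2021633_V1
    }
}
```

For real documents with declared namespaces you would probably want a namespace-aware parser and getElementsByTagNameNS instead.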

X? regex quantifier doesn't work as expected (by me)

Input string:
aaa---foo---ccc---ddd
aaa---bar---ccc---ddd
aaa---------ccc---ddd
Regex: aaa.*(foo|bar)?.*ccc.*(ddd)
This regex never finds the first group (foo|bar). It always returns null for capture group 1.
My question is why, and how I can avoid that.
It's a very oversimplified example of my regex, just for demonstration. It works if I remove the ? quantifier, but the input string can lack this group entirely (aaa---------ccc---ddd) and I still need to determine whether it is foo, bar or null. But group 1 is always null.
Page with this regex and test strings: http://fiddle.re/45c766

Here's why it doesn't work: When you have .* in a pattern, the matcher's algorithm is to try to match as many characters as it can to make the rest of the pattern work. In this case, if it tries starting with the entire remainder of the string as .* and removing one character until it matches, it finds that (for "aaa---foo---ccc---ddd") it will work to have .* match 9 characters; then (foo|bar)? doesn't match anything, which is OK because it's optional; and the next .* matches 0 characters, and then the rest of the pattern matches. So that's the one it selects.
The reason changing .* to .*?:
aaa.*?(foo|bar)?.*?ccc.*(ddd)
doesn't work is that the matcher does the same thing in reverse. It starts with a 0-character match and then figures out if it can make the pattern work. When it tries this, it will find that it works to make .*? match 0 characters; then (foo|bar)? doesn't match anything; then the second .*? matches 9 characters; then the rest of the pattern matches ccc---ddd. So either way, it won't do what you want.
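If you want to check this behaviour yourself, a small experiment like the following should do. The class name OptionalGroupDemo and the helper firstGroup are invented for the illustration; the patterns and the input are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the greedy and lazy variants.
class OptionalGroupDemo {
    static final String INPUT = "aaa---foo---ccc---ddd";
    static final String GREEDY = "aaa.*(foo|bar)?.*ccc.*(ddd)";
    static final String LAZY = "aaa.*?(foo|bar)?.*?ccc.*(ddd)";
    static final String REQUIRED = "aaa.*(foo|bar).*ccc.*(ddd)";

    // Returns capture group 1 of the first match (null if the group did not participate).
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match");
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(firstGroup(GREEDY, INPUT));   // null
        System.out.println(firstGroup(LAZY, INPUT));     // null
        System.out.println(firstGroup(REQUIRED, INPUT)); // foo
    }
}
```

In both optional variants the engine finds a complete match without ever needing the group, so it never goes back to try matching foo.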
There are a couple solutions in the answers, both involving lookahead. Here's another solution:
aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)
This basically checks for two patterns, in order; first it checks to see if there's a pattern with foo|bar in it, and if that doesn't match, it will then search for the other possibility, without foo|bar. This will always find foo|bar if it's there.
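Here is a minimal sketch of how the alternation could be used from Java on the three sample lines. The class name AlternationDemo and the method middles are made up. One thing to keep in mind is that ddd ends up in group 2 when the first alternative matches and in group 3 when the second one does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the two-alternative pattern.
class AlternationDemo {
    private static final Pattern P =
        Pattern.compile("aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)");

    // Collects group 1 (foo, bar or null) for every match in the input.
    static List<String> middles(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "aaa---foo---ccc---ddd\n"
                     + "aaa---bar---ccc---ddd\n"
                     + "aaa---------ccc---ddd";
        System.out.println(middles(input)); // [foo, bar, null]
    }
}
```

Since the dot does not match line breaks by default, each match stays within one line of the input.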
All of these solutions involve rather difficult-to-read regexes, though. This is how I might code it:
Pattern pat1 = Pattern.compile("aaa(.*)ccc.*ddd");
Pattern pat2 = Pattern.compile("foo|bar");
Matcher m1 = pat1.matcher(source);
String foobar = null; // stays null when there is no foo or bar
if (m1.matches()) {
    // Search only the part between aaa and ccc.
    Matcher m2 = pat2.matcher(m1.group(1));
    if (m2.find()) {
        foobar = m2.group(0);
    }
}
Often, attempting to use one whiz-bang regex to solve a problem results in less-readable (and possibly less-efficient) code than just breaking the problem into parts.

Change your regex to the one below if you want to capture the foo or bar string in between.
aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)
Because .* would also eat up the in-between foo or bar, you can use (?:(?!foo|bar).)* in its place. This (?:(?!foo|bar).)* matches any character, zero or more times, but stops as soon as foo or bar begins at the current position.
String s = "aaa---foo---ccc---ddd\n" +
           "aaa---bar---ccc---ddd\n" +
           "aaa---------ccc---ddd";
Pattern regex = Pattern.compile("aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
foo
bar
null

Try:
.{3}\-{3}(.{3})\-{3}.{3}\-{3}(.{3})

Whitespace in Java's regular expression

I'm trying to write a regular expression to match an IRC PRIVMSG string. It is something like:
:nick!name#some.host.com PRIVMSG #channel :message body
So I wrote the following code:
Pattern pattern = Pattern.compile("^:.*\\sPRIVMSG\\s#.*\\s:");
Matcher matcher = pattern.matcher(msg);
if (matcher.matches()) {
    System.out.println(msg);
}
It does not work. I got no matches. When I test the regular expression using online javascript testers, I got matches.
I tried to find the reason, why it doesn't work and I found that there's something wrong with the whitespace symbol. The following pattern will give me some matches:
Pattern.compile("^:.*");
But the pattern with \s will not:
Pattern.compile("^:.*\\s");
It's confusing.

The Java matches method strikes again! That method only returns true if the entire input matches the pattern. You didn't include anything that consumes the message body after the second colon, so the entire string is not a match. It works in online testers because there a regex counts as a match if any part of the input matches.
Pattern pattern = Pattern.compile("^:.*?\\sPRIVMSG\\s#.*?\\s:.*$");
Should match

If you look at the documentation for matches(), you will notice that it tries to match the entire string. You need to fix your regexp or use find() to iterate through the substring matches.
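The difference is easy to see with the pattern and message from the question. The sketch below is just an illustration; the class name PrivmsgDemo and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting matches() and find().
class PrivmsgDemo {
    static final String MSG = ":nick!name#some.host.com PRIVMSG #channel :message body";
    static final Pattern PREFIX = Pattern.compile("^:.*\\sPRIVMSG\\s#.*\\s:");
    static final Pattern WHOLE = Pattern.compile("^:.*?\\sPRIVMSG\\s#.*?\\s:.*$");

    // matches() requires the pattern to cover the entire input.
    static boolean matchesWhole(Pattern p, String msg) {
        return p.matcher(msg).matches();
    }

    // find() is satisfied by a match on any part of the input.
    static boolean findsSomewhere(Pattern p, String msg) {
        return p.matcher(msg).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(PREFIX, MSG));   // false
        System.out.println(findsSomewhere(PREFIX, MSG)); // true
        System.out.println(matchesWhole(WHOLE, MSG));    // true
    }
}
```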

Regular expression to replace content between parentheses ()

I tried this code:
string.replaceAll("\\(.*?)","");
But it returns null. What am I missing?

Try:
string.replaceAll("\\(.*?\\)","");
You didn't escape the closing parenthesis. Inside a Java string literal each regex backslash has to be written as \\, so the two parentheses become \\( and \\).
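A quick way to see the difference is to try compiling both patterns. As far as I can tell, java.util.regex rejects the unescaped version with a PatternSyntaxException rather than returning anything. The class name ParenEscapeDemo and its helpers are made up for this sketch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; checks whether a regex compiles at all.
class ParenEscapeDemo {
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // e.g. an unmatched closing parenthesis
        }
    }

    // Removes each parenthesized group (non-nested) together with its parentheses.
    static String strip(String s) {
        return s.replaceAll("\\(.*?\\)", "");
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\(.*?)"));    // false
        System.out.println(compiles("\\(.*?\\)"));  // true
        System.out.println(strip("a (b) c"));       // prints "a  c" (two spaces remain)
    }
}
```

Also remember that replaceAll returns a new string, so the result has to be assigned to something.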

First, do you wish to remove the parentheses along with their content? Although the title of the question suggests otherwise, I am assuming that you do wish to remove the parentheses as well.
Secondly, can the content between the parentheses contain nested matching parentheses? This solution assumes yes. Since the Java regex flavor does not support recursive expressions, the solution is to first craft a regex which matches the "innermost" set of parentheses, and then apply this regex iteratively, replacing them from the inside out. Here is a tested Java program which correctly removes (possibly nested) parentheses and their contents:
import java.util.regex.*;

public class TEST {
    public static void main(String[] args) {
        String s = "stuff1 (foo1(bar1)foo2) stuff2 (bar2) stuff3";
        String re = "\\([^()]*\\)";
        Pattern p = Pattern.compile(re);
        Matcher m = p.matcher(s);
        while (m.find()) {
            s = m.replaceAll("");
            m = p.matcher(s);
        }
        System.out.println(s);
    }
}
Test Input:
"stuff1 (foo1(bar1)foo2) stuff2 (bar2) stuff3"
Test Output:
"stuff1  stuff2  stuff3"
(The spaces on either side of each removed group are kept, so the remaining words are separated by two spaces.)
Note that the lazy-dot-star solution does not work for nested input, because it fails to match the innermost set of parentheses (i.e. it erroneously matches (foo1(bar1) in the example above). This is a very common regex mistake: never use the dot when there is a more precise expression! In this case, the content between an "innermost" pair of matching parentheses consists of any characters that are not an opening or closing parenthesis (i.e. use [^()]* instead of .*?).
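To compare the two approaches side by side on the nested example, something like this sketch can be used. The class name NestedParens and both method names are invented; the loop is a slightly different way of writing the same inside-out idea as the program above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing lazy dot-star with the innermost-first approach.
class NestedParens {
    private static final Pattern INNERMOST = Pattern.compile("\\([^()]*\\)");

    // Lazy dot-star: stops at the first closing parenthesis, even inside nesting.
    static String lazy(String s) {
        return s.replaceAll("\\(.*?\\)", "");
    }

    // Repeatedly removes innermost groups until nothing changes any more.
    static String insideOut(String s) {
        String previous;
        do {
            previous = s;
            s = INNERMOST.matcher(s).replaceAll("");
        } while (!s.equals(previous));
        return s;
    }

    public static void main(String[] args) {
        String s = "stuff1 (foo1(bar1)foo2) stuff2 (bar2) stuff3";
        System.out.println(lazy(s));      // stuff1 foo2) stuff2  stuff3
        System.out.println(insideOut(s)); // stuff1  stuff2  stuff3
    }
}
```

The lazy version leaves a stray foo2) behind, while the inside-out loop removes both levels.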

Try string.replaceAll("\\(.*?\\)","").

string.replaceAll("\\([^\\)]*\\)","");
This way you are saying match a bracket, then all non-closing bracket chars, and then a closing bracket. This is usually faster than reluctant or greedy .* matchers.
