Regular Expression to extract content from inside div - java

I am trying to extract from a webpage which has the following markup
<div id="div">
content
content
content
content
</div>
The regex I currently have is
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>");
This works when there is only one line, but with new lines it doesn't recognise anything inside the div tag.
Any help would be appreciated (I am using Java, by the way).

Personally, I would strongly discourage you from using regular expressions in this case. It is well documented as being a bad idea to attempt to suck information out of an HTML document with regular expressions. Take a look at a proper HTML parser instead!

The fact that it doesn't work when there are line breaks is because . (DOT) does not match any type of line break character. To let . match line breaks as well, do:
Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL)
or:
Pattern.compile("<div id=\"div\">([\\s\\S]*?)</div>")
or:
Pattern.compile("(?s)<div id=\"div\">(.*?)</div>")
See: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#DOTALL
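A minimal sketch of the three variants above applied to a multi-line input. The class name DivContentDemo, the helper firstGroup and the sample markup are made up for illustration, and the usual caveat about parsing HTML with regex still applies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DivContentDemo {
    // Original pattern: '.' does not match line terminators by default.
    static final Pattern PLAIN = Pattern.compile("<div id=\"div\">(.*?)</div>");
    // Variant 1: DOTALL flag passed to compile().
    static final Pattern DOT_ALL = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
    // Variant 2: a character class that matches every character, no flag needed.
    static final Pattern CHAR_CLASS = Pattern.compile("<div id=\"div\">([\\s\\S]*?)</div>");
    // Variant 3: the embedded (?s) flag, equivalent to DOTALL.
    static final Pattern INLINE_FLAG = Pattern.compile("(?s)<div id=\"div\">(.*?)</div>");

    // Returns the text captured by group 1, or null when the pattern does not match.
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div id=\"div\">\ncontent\ncontent\n</div>";
        System.out.println(firstGroup(PLAIN, html));       // no match, so null
        System.out.println(firstGroup(DOT_ALL, html));     // div content, line breaks included
        System.out.println(firstGroup(CHAR_CLASS, html));  // same content
        System.out.println(firstGroup(INLINE_FLAG, html)); // same content
    }
}
```

All three variants capture the same text, so the choice between them is mostly a matter of readability.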

I think this should work (you need to add the DOTALL modifier):
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);

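Note that the Pattern.MULTILINE option will not help here: it only changes where ^ and $ match, not what . matches. The option you need is Pattern.DOTALL:
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
or add the (?s) flag to your regex (at the start).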
Hope this helps
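One thing worth checking before relying on a flag here: in java.util.regex, MULTILINE changes only how ^ and $ behave, while DOTALL changes what . matches. A small sketch of the difference follows; the class name MultilineVsDotall and the helper finds are illustrative only.

```java
import java.util.regex.Pattern;

class MultilineVsDotall {
    // True when the regex, compiled with the given flags, matches somewhere in the input.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String html = "<div id=\"div\">\ncontent\n</div>";
        String regex = "<div id=\"div\">(.*?)</div>";

        // MULTILINE leaves '.' untouched, so the line break still blocks the match.
        System.out.println(finds(regex, Pattern.MULTILINE, html)); // false
        // DOTALL is the flag that lets '.' cross line terminators.
        System.out.println(finds(regex, Pattern.DOTALL, html));    // true

        // Where MULTILINE does matter: anchoring to an individual line.
        System.out.println(finds("^content$", 0, html));                 // false
        System.out.println(finds("^content$", Pattern.MULTILINE, html)); // true
    }
}
```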

Related

Regex to find XML tag in multiline string

Here is a simple function I wrote to get the value from a tag.
public static String getTagAValue(String xmlAsString) {
    Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
    Matcher matcher = pattern.matcher(xmlAsString);
    if (matcher.find()) {
        return matcher.group(1);
    } else {
        return null;
    }
}
It is not finding a match and returning null.
XML Sample
<xml>
<sample>
<TagA>result</TagA>
</sample>
</xml>
Note, here I used 4 spaces for tabs, but the real string would contain tabs.
Don't use regular expressions to parse XML: it's the wrong tool for the job.
Classic answer here: RegEx match open tags except XHTML self-contained tags
The answer you have accepted gives wrong answers, for example:
It doesn't accept whitespace in places where whitespace is allowed, such as before ">"
It will match a commented-out element, or one that appears in a CDATA section
It does a greedy match, so it will find the LAST matching end tag, not the first one.
However hard you try, you will never get it 100% right.
And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.
To do the job properly and professionally, use an XML parser.
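As a rough sketch of what the parser route could look like with the DOM API that ships with the JDK: the class name TagAReader is made up, and wrapping the checked exceptions in IllegalArgumentException is just one possible choice, not something the question prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TagAReader {
    // Returns the text of the first <TagA> element, or null if there is none.
    static String getTagAValue(String xmlAsString) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmlAsString)));
            NodeList nodes = doc.getElementsByTagName("TagA");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption for this sketch: treat unparsable input as a caller error.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml>\n\t<sample>\n\t\t<TagA>result</TagA>\n\t</sample>\n</xml>";
        System.out.println(getTagAValue(xml)); // result
    }
}
```

Because the parser understands XML syntax, whitespace before ">" and commented-out elements are handled without any extra work.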
You probably want the regex to work across multiple lines:
Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);
The documentation explains the Pattern.DOTALL flag:
Enables dotall mode. In dotall mode, the expression . matches any
character, including a line terminator. By default this expression
does not match line terminators.
Edit: While this works in this particular case, please refer to Michael Kay's answer if you want to solve such problems professionally, efficiently and correctly.

Java Regex doesn't work with special chars

I have a problem with my parser. I want to read an image link on a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.
This is what my code looks like.
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of html, that causes trouble
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe that it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there are newlines in the img tag.
Use Pattern.compile(..., Pattern.DOTALL) or prepend your pattern with (?s).
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
Your regex should look like this:
String regex = "<img .*src=\"(.*?)\" .*>";
The problem is probably caused by the newline within the tag. The . character won't match it.
Have you considered not using regex to parse HTML? Parsing HTML with regex is notoriously fragile. Please consider using a parsing library such as jsoup for this.
You should actually use <img\\s.*?\\bsrc=["'](.*?)["'].*?> with the (?s) modifier. (The dots must not be escaped: \\. would match only a literal dot.)
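A sketch of how that suggestion might be applied, with the dots left unescaped so they act as wildcards and the (?s) flag embedded in the pattern. The class name ImgSrcDemo, the example.com URL and the extra \s* around the captured value (to drop the line break after the opening quote) are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // (?s) lets '.' cross the line break inside the tag;
    // \s* on both sides of the group keeps surrounding whitespace out of the capture.
    static final Pattern IMG_SRC =
            Pattern.compile("(?s)<img\\s.*?\\bsrc=[\"']\\s*(.*?)\\s*[\"'].*?>");

    // Returns the src value of the first img tag, or null if none is found.
    static String firstSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img class=\"zoomLink productImage\" src=\"\n"
                + "http://example.com/is/image/template?$size$&$image=is{X/1_prod_001}&$flag=1\""
                + " alt=\"Produkt\" itemprop=\"image\" />";
        System.out.println(firstSrc(html));
    }
}
```

The special characters in the link need no escaping, because they appear in the input rather than in the pattern. Pattern.quote is only useful for text that becomes part of the regex itself.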

Replace string inside tags?

I want to replace the content inside some tags, e.g.:
<p>this it to be replaced</p>
I can extract the content in between with groups like this, but can I actually replace the group?
str = str.replaceAll("<p>([^<]*)</p>", "replacement");
You can use lookaround (positive lookahead and lookbehind) for this:
Change the regex to: "(?<=<p>)(.*?)(?=</p>)" and you will be fine.
Example:
String str = "<p>this it to be replaced</p>";
System.out.println(str.replaceAll("(?<=<p>)(.*?)(?=</p>)", "replacement"));
Output:
<p>replacement</p>
Note, however, that if you are parsing HTML you should be using some kind of HTML parser; regular expressions are often not good enough.
Change the regex to this:
(?<=<p>).*?(?=</p>)
i.e.
str = str.replaceAll("(?<=<p>).*?(?=</p>)", "replacement");
This uses a "look behind" and a "look ahead" to assert, but not capture, the input before and after the (non-greedy) match.
Just in case anyone is wondering, this answer is different to dacwe's: His uses unnecessary brackets. This answer is the more elegant :)
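For comparison, here is a sketch that sets the lookaround approach next to a different technique not proposed in the answers above: capturing the tags and putting them back with $1 and $2. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphReplaceDemo {
    private static final Pattern LOOKAROUND = Pattern.compile("(?<=<p>).*?(?=</p>)");
    private static final Pattern WITH_GROUPS = Pattern.compile("(<p>)[^<]*(</p>)");

    // Lookaround version: only the text between the tags is part of the match.
    static String replaceWithLookaround(String html, String replacement) {
        return LOOKAROUND.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Group version: the tags are matched too, then restored via $1 and $2.
    static String replaceWithGroups(String html, String replacement) {
        // quoteReplacement stops '$' and '\' in the new text from being read as group references.
        return WITH_GROUPS.matcher(html)
                .replaceAll("$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        String str = "<p>this it to be replaced</p>";
        System.out.println(replaceWithLookaround(str, "replacement")); // <p>replacement</p>
        System.out.println(replaceWithGroups(str, "costs $5"));        // <p>costs $5</p>
    }
}
```

Quoting the replacement matters in both versions as soon as the new text can contain a dollar sign or a backslash.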

Regex in java question, multiple matches

I am trying to match multiple CSS style code blocks in an HTML document. This code will match the first one but won't match the second. What code would I need to match the second? Can I just get a list of the groups that are inside of my 'style' brackets? Should I call the 'find' method to get the next match?
Here is my regex pattern
^.*(<style type="text/css">)(.*)(</style>).*$
Usage:
final Pattern pattern_css = Pattern.compile(css_pattern_buf.toString(),
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
final Matcher match_css = pattern_css.matcher(text);
if (match_css.matches() && (match_css.groupCount() >= 3)) {
    System.out.println("Woot ==>" + match_css.groupCount());
    System.out.println(match_css.group(2));
} else {
    System.out.println("No Match");
}
I am trying to match multiple CSS style code blocks in a HTML document.
Standard Answer: don't use regex to parse HTML. Regex cannot parse HTML reliably, no matter how complicated and clever you make your expression. Unless you are absolutely sure the exact format of the target document is totally fixed, string or regex processing is insufficient and you must use an HTML parser.
(<style type="text/css">)(.*)(</style>)
That's a greedy expression. The (.*) in the middle will match as much as it possibly can. If you have two style blocks:
<style type="text/css">1</style> <style type="text/css">2</style>
then it will happily match '1</style> <style type="text/css">2'.
Use (.*?) to get a non-greedy expression, which will allow the trailing (</style>) to match at the first opportunity.
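The difference is easy to see on the two-block example from this answer. A minimal sketch follows; the class name GreedyVsReluctant and the helper middleGroup are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns what group 2 (the part between the tags) captured, or null on no match.
    static String middleGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";

        // Greedy: (.*) runs to the last </style> in the input.
        System.out.println(middleGroup("(<style type=\"text/css\">)(.*)(</style>)", html));
        // prints: 1</style> <style type="text/css">2

        // Reluctant: (.*?) stops at the first </style>.
        System.out.println(middleGroup("(<style type=\"text/css\">)(.*?)(</style>)", html));
        // prints: 1
    }
}
```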
Should I call the 'find' method to get the next match?
Yes, and you should have used it to get the first match too. The usual idiom is:
while (matcher.find()) {
    s = matcher.group(n);
}
Note that standard string processing (indexOf, etc) may be a simpler approach for you than regex, since you're only using completely fixed strings. However, the Standard Answer still applies.
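Putting the find() idiom and the non-greedy group together, collecting every style block might look like the sketch below. The class name StyleBlocks and the sample input are made up, and the Standard Answer about real HTML parsers still applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleBlocks {
    // DOTALL lets the block span several lines; no ^ or $ anchors, so find() can scan the input.
    private static final Pattern STYLE = Pattern.compile(
            "<style type=\"text/css\">(.*?)</style>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the content of each style block, in document order.
    static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = STYLE.matcher(html);
        while (m.find()) {          // each call moves on to the next match
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<html><style type=\"text/css\">\nbody{}\n</style>"
                + "<p>x</p><STYLE TYPE=\"text/css\">p{}</STYLE></html>";
        System.out.println(extract(html).size()); // 2
    }
}
```

Unlike matches(), which succeeds only if the whole input fits the pattern, find() looks for the next occurrence anywhere in the input, which is why it suits repeated blocks.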
You can simplify the regex as follows:
(<style type="text/css">)(.*?)(</style>)
And if you don't need groups 1 and 3 (you probably don't), I would drop their parentheses, leaving only:
<style type="text/css">(.*?)</style>

Using Condition in Regular Expressions

Source:
<TD>
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
</TD>
Regex:
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
Result:
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
What does the "?(1)" mean?
When I run it in Java, it causes an exception (java.util.regex.PatternSyntaxException); the
"?(1)" can't be recognized.
The explanation in the book is:
This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.
The syntax is correct. The strange looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an html <a> tag, if there is one since that capture group is optional. Since the back-reference to the captured tag follows the "if" part of the regex, what it is doing is making sure that there was an opening <a> tag captured before trying to match the closing one. A pretty clever way of making both tags optional, but forcing both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them just have <img> tags.
As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.
EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html
What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions, to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something.)
I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:
(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a>|<img\s+[^>]+>
One advantage of the conditional version is that you can capture the IMG tag with a single capture group:
(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)
In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:
Pattern p = Pattern.compile(
    "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
    Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
    System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}
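To check both points from this answer in one place, the sketch below tests whether a pattern compiles at all and then collects the IMG tags with the alternation. The class and method names, and the sample input with one linked image, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConditionalAlternative {
    private static final Pattern IMG = Pattern.compile(
            "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
            Pattern.CASE_INSENSITIVE);

    // True if java.util.regex accepts the pattern, false if it throws PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Collects every IMG tag, whether or not it is wrapped in an A tag.
    static List<String> imgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            // Only one of the two groups takes part in any given match.
            tags.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tags;
    }

    public static void main(String[] args) {
        // The conditional from the book is rejected by Java's regex engine.
        System.out.println(compiles(
                "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>(?(1)\\s*</[Aa]>)")); // false
        String html = "<A HREF=\"/home\"><IMG SRC=\"/images/home.gif\"></A>\n"
                + "<IMG SRC=\"/images/spacer.gif\">";
        System.out.println(imgTags(html)); // both IMG tags, without the surrounding A tag
    }
}
```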
Could it be a non capturing group as described here:
There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)
Java Regex Tutorial
The short answer: it doesn't mean anything. The problem lies in this whole snippet:
(?(1)\s*)
() creates a back reference, so you can reuse any text matched inside. They also allow you to apply operators to everything inside of them (but this isn't done in your example).
? means that the item before it should be matched if it's there but it is also OK if it's not. This simply doesn't make sense when it appears after (
(?:MoreTextHere)
Can be used to speed up RegExs when you don't need to reuse the matched text. But that still doesn't really make sense, why match a 1 when your input is HTML?
Try:
(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>
You never said exactly what you were trying to match so if this answer doesn't satisfy you, please explain what you're trying to do with RegEx.
