I have a problem with my parser. I want to read an image link from a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.
This is what my code looks like:
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of the HTML that causes trouble:
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe that it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried:
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there are newlines inside the img tag.
Use Pattern.compile(..., Pattern.DOTALL) or prepend your pattern with (?s).
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
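A minimal sketch of the DOTALL fix applied to markup like the one in the question. The names ImgSrcDemo and extractSrc are made up for illustration, the HTML is shortened, and the trim() call is an addition to drop the line break that precedes the URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // (?s) turns on DOTALL, so '.' also matches the line break inside the tag.
    private static final Pattern IMG_SRC =
            Pattern.compile("(?s)<img\\s.*?\\bsrc=\"(.*?)\"");

    // Returns the trimmed src value of the first <img> tag, or null if none is found.
    static String extractSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the question's markup, with the newline after src=".
        String html = "<img class=\"zoomLink productImage\" src=\"\n"
                + "http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$\""
                + " alt=\"Produkt\" />";
        System.out.println(extractSrc(html));
    }
}
```

Without the (?s) prefix the same pattern finds nothing for this input, because the captured group would have to cross the line break.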
Your regex should look like this:
String regex = "<img .*src=\"(.*?)\" .*>";
This is probably caused by the newline within the tag. The . character won't match it.
Did you consider not using regex to parse HTML? Using regex for HTML parsing is notoriously fragile. Please consider using a parsing library such as jsoup for this.
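You should actually use <img\\s.*?\\bsrc=["'](.*?)["'].*?> with the (?s) modifier (the backslashes are doubled as in a Java string literal; the dots must stay unescaped so they match any character).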
I was trying to match the link in the following example:
<p>LinkToPage</p>
With rubular.com I could get something like <a href=\"(.*)?\/index.html\">.*<\/a>.
I'll be using this in Pattern.compile in Java. I know that \ has to be escaped as well, and I've come up with <a href=\\\"(.*)?\\\/index.html\\\">.*<\\\/a> and a few more variations but I'm getting it wrong. I tested on regexplanet. Can anyone help me with this?
Use "<a href=\"(.*)/index.html\">.*</a>" in your Java code.
You only need to escape " because it's a Java string literal.
You don't need to escape /, because you aren't delimiting your regex with slashes (as you would be in Ruby).
Also, (.*)? makes no sense. Just use (.*). * can already match "nothing", so there's no point in having the ?.
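As a rough sketch of the points above: only the double quotes are escaped, the slashes are left alone, and the group is a plain (.*). The class name, method name, and sample input are hypothetical, since the question's original markup is not shown in full.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefDemo {
    // Only the double quotes need a backslash, and only because this is a Java string literal.
    static final Pattern LINK = Pattern.compile("<a href=\"(.*)/index.html\">.*</a>");

    // Returns the part of the href before /index.html, or null if the input does not match.
    static String hrefPrefix(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input for illustration.
        System.out.println(hrefPrefix("<p><a href=\"docs/index.html\">LinkToPage</a></p>"));
    }
}
```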
Pattern.compile("<a href=\"(.*)/index.html\">.*</a>");
That should fix your regex. You do not need to escape the forward slashes.
However I am obligated to present you with the standard caution against parsing HTML with regex:
RegEx match open tags except XHTML self-contained tags
You can tell Java what to match and call Pattern.quote(str) to make it escape the correct things for you.
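A small sketch of what Pattern.quote does, assuming you want to search for a fixed piece of text. Note that it is the text you search for that gets quoted, not the input you search in. The class and method names and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Builds a regex that matches 'literal' character for character.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "price (USD): 1.50$";
        // Quoted: parentheses, '.' and '$' are all taken literally.
        System.out.println(containsLiteral(text, "(USD)"));
        System.out.println(containsLiteral(text, "1.50$"));
        // Quoted: the '.' no longer matches an arbitrary character such as 'x'.
        System.out.println(containsLiteral("1x50$", "1.50$"));
    }
}
```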
I have a Java string representing a div element:
String source = "<div class = \"ads\">\n" +
"\t<dl style = \"font-size:14px; color:blue;\">\n" +
"\t\t<li>\n" +
"\t\t\tGgicci's Blog\n" +
"\t\t</li>\n" +
"\t</dl>\n" +
"</div>\n";
which in HTML form is:
<div class = "ads">
<dl style = "font-size:14px; color:blue;">
<li>
Ggicci's Blog
</li>
</dl>
</div>
And I wrote this regex to extract the dl element:
<dl[.\\s]*?>[.\\s]*?</div>
But it finds nothing, so I modified it to:
<dl(.|\\s)*?>(.|\\s)*?</div>
Then it works. So I tested it like this:
System.out.println(Pattern.matches("[.\\s]", "a")); // false
System.out.println(Pattern.matches("[abc\\s]", "a")); // true
So why can't the '.' match 'a'?
Inside square brackets, the characters are treated literally. [.\\s] means "match a dot, or a backslash, or an s".
(.|\\s) matches nearly the same thing as . does in DOTALL mode.
I think you really want the following regex (with DOTALL enabled, so that . can cross the line breaks):
<dl[^>]*>.*?</div>
When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.
"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.
"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.
But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything including line separator characters.
(?s)<dl\b[^>]*>.*?</dl>
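A short sketch that puts both observations side by side: the dot is a literal dot inside a character class, and (?s) lets the dot outside the class cross line breaks. The class name DlDemo and the method firstDl are made up for illustration; the source string is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DlDemo {
    // (?s) lets '.' cross line breaks; [^>]* keeps the first part inside the opening tag.
    static final Pattern DL = Pattern.compile("(?s)<dl\\b[^>]*>.*?</dl>");

    // Returns the first dl element including its tags, or null if there is none.
    static String firstDl(String source) {
        Matcher m = DL.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "<div class = \"ads\">\n"
                + "\t<dl style = \"font-size:14px; color:blue;\">\n"
                + "\t\t<li>\n\t\t\tGgicci's Blog\n\t\t</li>\n"
                + "\t</dl>\n"
                + "</div>\n";
        // Inside [...] the dot only matches a literal dot.
        System.out.println(Pattern.matches("[.\\s]", "a"));
        System.out.println(Pattern.matches("[.\\s]", "."));
        System.out.println(firstDl(source));
    }
}
```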
+1 for above.
I would do:
<dl[^>]*>(.*?)</dl>
to match the content of the dl element.
The syntax [.\\s] makes no sense because, as Daniel said, the . just means "a dot" in this context.
Why can't you replace your [.\\s] with a much simpler . ?
I want to replace the content inside some tags, e.g.:
<p>this it to be replaced</p>
I could extract the content between the tags with a group like this, but can I actually replace the group?
str = str.replaceAll("<p>([^<]*)</p>", "replacement");
You can use lookaround (positive lookahead and lookbehind) for this:
Change the regex to: "(?<=<p>)(.*?)(?=</p>)" and you will be fine.
Example:
String str = "<p>this it to be replaced</p>";
System.out.println(str.replaceAll("(?<=<p>)(.*?)(?=</p>)", "replacement"));
Output:
<p>replacement</p>
Note, however, that if you are parsing HTML you should be using some kind of HTML parser; regular expressions are often not good enough.
Change the regex to this:
(?<=<p>).*?(?=</p>)
i.e.
str = str.replaceAll("(?<=<p>).*?(?=</p>)", "replacement");
This uses a "look behind" and a "look ahead" to assert, but not capture, the input before and after the matching (non-greedy) regex.
Just in case anyone is wondering, this answer is different to dacwe's: His uses unnecessary brackets. This answer is the more elegant :)
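As an alternative to lookaround, and not something the answers above propose, the tags themselves can be captured and written back with $1 and $2 in the replacement. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;

class ReplaceGroupDemo {
    // Captures the tags as groups 1 and 2 and writes them back around the new content.
    static String replaceParagraphText(String html, String replacement) {
        // quoteReplacement keeps '$' and '\' in the new text from being read as group references.
        return html.replaceAll("(<p>)[^<]*(</p>)",
                "$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        System.out.println(replaceParagraphText("<p>this it to be replaced</p>", "replacement"));
        System.out.println(replaceParagraphText("<p>old price</p>", "costs $5"));
    }
}
```

The quoteReplacement call matters for either approach: a replacement string containing $ would otherwise be interpreted as a group reference by replaceAll.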
I am trying to extract from a webpage which has the following markup
<div id="div">
content
content
content
content
</div>
The regex I currently have is
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>");
This works when there is only one line, but with new lines it doesn't recognise the content inside the div tag.
Any help would be appreciated (I am using Java, by the way).
Personally, I would strongly discourage you from using regular expressions in this case. It is well documented as being a bad idea to attempt to suck information out of an HTML document with regular expressions. Take a look at a proper HTML parser instead!
The fact that it doesn't work when there are line breaks is because . (DOT) does not match any type of line break character. To let . match line breaks as well, do:
Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL)
or:
Pattern.compile("<div id=\"div\">([\\s\\S]*?)</div>")
or:
Pattern.compile("(?s)<div id=\"div\">(.*?)</div>")
See: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#DOTALL
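The three variants above can be checked against each other with a small sketch like this one. The class name, constant names, and the shortened input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DivContentDemo {
    static final Pattern WITH_FLAG =
            Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
    static final Pattern WITH_CLASS =
            Pattern.compile("<div id=\"div\">([\\s\\S]*?)</div>");
    static final Pattern WITH_INLINE =
            Pattern.compile("(?s)<div id=\"div\">(.*?)</div>");

    // Returns the captured div content, or null if the pattern does not match.
    static String content(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div id=\"div\">\ncontent\ncontent\n</div>";
        String expected = content(WITH_FLAG, html);
        // Both comparisons print true if the variants capture the same text.
        System.out.println(expected.equals(content(WITH_CLASS, html)));
        System.out.println(expected.equals(content(WITH_INLINE, html)));
    }
}
```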
I think, this should work (you need to add the DOTALL modifier):
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
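Note that Pattern.MULTILINE (the (?m) flag) does not help here: it only changes where ^ and $ match and leaves . unchanged. What you need is Pattern.DOTALL:
Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
or the equivalent (?s) flag at the start of your regex.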
Hope this helps
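For comparison, a small sketch of what MULTILINE and DOTALL each change in java.util.regex. The helper name finds and the shortened input are made up for illustration.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True if the regex, compiled with the given flags, matches somewhere in the input.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String html = "<div id=\"div\">\ncontent\n</div>";
        String regex = "<div id=\"div\">(.*?)</div>";
        // MULTILINE leaves '.' alone, so the match still cannot cross the line breaks.
        System.out.println(finds(regex, Pattern.MULTILINE, html));
        // DOTALL is the flag that lets '.' match line terminators.
        System.out.println(finds(regex, Pattern.DOTALL, html));
        // MULTILINE only affects the anchors: ^ and $ then match at every line.
        System.out.println(finds("^content$", Pattern.MULTILINE, html));
        System.out.println(finds("^content$", 0, html));
    }
}
```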
I need to parse a string and escape all html tags except <a> links.
For example:
"Hello, this is <b>A BOLD</b> bit and this is a google link"
When printed out in my jsp, I want to see the tags printed out as is (i.e. escaped so "A BOLD" is not actually in bold on the page) but the <a> link to be an actual link to google on the page.
I have a little method that splits the incoming string based on a regex to match <a> links in various formats (with whitespace, single or double quotes, etc.). The regex is as follows:
myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");
Yes it's horrid and probably hopelessly inefficient so open to alternative suggestions, but it does work up to a point. Where it falls down is parsing the link text bit. I want it to accept zero or more occurrences of any characters other than the </a> closing tag but it is parsing it as zero or more occurrences of any characters other than a "<" or "/" or "a" or ">", i.e. as individual characters rather than the complete </a> word. So it matches with any text that has an "e" in it for example.
The bit in question is: [^<\\/a>]*
How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.
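A character class can only list single characters, so "anything except the sequence </a>" needs a different construct. One possible sketch uses a negative lookahead that is checked before each character; a reluctant .*? followed by </a> would be another option. The pattern below is simplified compared to the one in the question, and the class name, method name, and sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextDemo {
    // (?:(?!</a>).)* consumes one character at a time, but only while "</a>" does not start there.
    static final Pattern LINK = Pattern.compile("(?s)<a\\s[^>]*>((?:(?!</a>).)*)</a>");

    // Returns the text between <a ...> and </a>, or null if no link is found.
    static String linkText(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input whose link text contains 'a', 'e', '/' and a nested tag.
        System.out.println(linkText(
                "see <a href='http://www.google.com'>a <b>great</b> search engine</a> here"));
    }
}
```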
You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist:
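String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// addTags is an instance method, so start from an empty whitelist;
// href has to be allowed explicitly, otherwise the link target is stripped as well.
// (Recent jsoup releases call this class Safelist.)
String safe = Jsoup.clean(unsafe, Whitelist.none().addTags("a").addAttributes("a", "href"));
// now: the <p> tags and the onclick attribute are gone; the <a> element keeps its href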
Although I agree with the consensus that regexes were not designed to parse x*ml, I feel that sometimes you just don't have the time to learn, practice and implement new concepts, and that a simple regex might well suffice in your case.
If you have enough time, learn to use XML parsers. Otherwise, here is an untested and maybe not user-proof regex suggestion for your problem (escape the backslashes for Java strings):
<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>
Which translates into:
<\s* # less-than character with optional space
(?: # non capturing group of
[^aA]\b # a single letter which is not a nor A
| # or
[a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
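One way to apply this idea in Java is sketched below. It does not use the regex above as is: it swaps in a negative lookahead so that closing </a> tags are left alone too, and it escapes each non-link tag with Matcher.appendReplacement. The class name, method name, and the google URL are assumptions for illustration, and the sketch is not robust against stray < characters in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeTagsDemo {
    // Any tag whose name is not exactly "a" (opening or closing, either case).
    static final Pattern NON_LINK_TAG = Pattern.compile("<(?!\\s*/?\\s*[aA]\\b)[^>]*>");

    // Escapes the angle brackets of every tag except <a ...> and </a>.
    static String escapeNonLinkTags(String html) {
        Matcher m = NON_LINK_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = m.group().replace("<", "&lt;").replace(">", "&gt;");
            // quoteReplacement guards against '$' or '\' inside the matched tag.
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonLinkTags(
                "Hello, this is <b>A BOLD</b> bit and this is a "
                + "<a href=\"http://www.google.com\">google</a> link"));
    }
}
```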