Java .split() with regex to match html <a> links - java

I need to parse a string and escape all html tags except <a> links.
For example:
"Hello, this is <b>A BOLD</b> bit and this is a google link"
When printed out in my jsp, I want to see the tags printed out as is (i.e. escaped so "A BOLD" is not actually in bold on the page) but the <a> link to be an actual link to google on the page.
I have got a little method that splits the incoming string based on a regex to match <a> links in various formats (with whites spaces, single or double quotes, etc). The regex is as follows:
myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");
Yes it's horrid and probably hopelessly inefficient so open to alternative suggestions, but it does work up to a point. Where it falls down is parsing the link text bit. I want it to accept zero or more occurrences of any characters other than the </a> closing tag but it is parsing it as zero or more occurrences of any characters other than a "<" or "/" or "a" or ">", i.e. as individual characters rather than the complete </a> word. So it matches with any text that has an "e" in it for example.
The bit in question is: [^<\\/a>]*
How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.

You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist:
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
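// addTags is an instance method, so start from an empty Whitelist (called Safelist in recent jsoup versions)
String safe = Jsoup.clean(unsafe, new Whitelist().addTags("a").addAttributes("a", "href"));
// now only the <a> tag and its href remain; the <p> tag and the onclick attribute are removed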

Although I agree with the consensus that regexes were not designed to parse x*ml, sometimes you just don't have the time to learn, practice and implement new concepts, and a simple regex might well suffice in your case.
If you have enough time, learn to use an XML parser. Otherwise, here is an untested and possibly not foolproof regex for your problem (escape the backslashes for Java strings):
<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>
Which translates into:
<\s* # less-than character with optional space
(?: # non capturing group of
[^aA]\b # a single letter which is not a nor A
| # or
[a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
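For what it's worth, here is a minimal sketch of how a tag-matching pattern could be applied from Java to escape every tag except links. It does not use the pattern above: that one also matches </a> (the / satisfies [^aA]), so the sketch swaps in a negative lookahead instead. The class and method names are made up for illustration, and like any regex approach it will be fooled by unusual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeExceptLinks {
    // Assumption: a "tag" is anything from < to the next >.
    // The lookahead rejects tags whose name is exactly "a" (opening or closing).
    private static final Pattern NON_LINK_TAG =
            Pattern.compile("<(?!/?a\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String escapeExceptLinks(String html) {
        Matcher m = NON_LINK_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape only the angle brackets of the matched tag.
            String escaped = m.group().replace("<", "&lt;").replace(">", "&gt;");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Using find() with appendReplacement avoids split() entirely, so the link text never has to be matched at all.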

Related

Regex to match any / in HTML except for <p> tag

Basically, I need to match any / in HTML that isn't part of a closed <p> tag.
This is what i got so far, but it doesn't really work as expected and I've been trying for some time now.
((?<!(p))\/(?!(>))) | ((?<!(<))\/(?!(p)))
I also need the regex to work in Java.
As an example:
<div>test</div> <span>test</span> <p>something<p/> </p>
I would like it to match every / except for the ones in the <p> tags at the end!
Fortunately, Java supports both lookbehind and lookahead (in contrast, the language I spend most of my time in, JavaScript, supports only lookahead).
So the pattern you're looking for is:
(?<!<p)/(?!p>)
This pattern will match any slash that's neither preceded by a <p or followed by a p>. Therefore it excludes <p/> as well as </p>.
The lookahead/lookbehind assertions (often called "zero-width" assertions) are not actually included in the match, which sounds like what you want. It basically asserts that the thing you are trying to match is preceded by (lookbehind) or followed by (lookahead) a sub-expression. In this case we're using negative assertions (not preceded by / not followed by).
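A quick way to check this is to count the matches on the sample input from the question. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlashCounter {
    // Negative lookbehind for "<p" and negative lookahead for "p>".
    private static final Pattern SLASH = Pattern.compile("(?<!<p)/(?!p>)");

    static int countSlashes(String html) {
        Matcher m = SLASH.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<div>test</div> <span>test</span> <p>something<p/> </p>";
        // Only the slashes in </div> and </span> are counted.
        System.out.println(countSlashes(html));
    }
}
```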
Parsing HTML with regex is a tricky business. As one answer pointed out, HTML is context-free, and therefore cannot be completely parsed by a regular expression, leaving open the possibility of HTML that will confound the match. Let's not even get started on ill-formed HTML.
I would consider the following common variation on an empty tag, though:
<p />
To handle this, I would add some whitespace to the match:
(?<!<p\s*)/(?!p>)
Where you might run into problems is weird whitespace (still valid HTML). The following slashes WILL match with the above regex:
< p/>
<p/ >
This can be dealt with by adding more whitespace repetitions to your regex. As mentioned before, this will also match slashes in text, so the following input will match only one slash (the one in the text):
<p>some text / other text</p>
Lastly, of course, there are CDATA groups. The following input will match NO slashes:
<![CDATA[This <p/> isn't actually a tag...it's just text.]]>
/(?!p)
This seems to work, but I'm not sure what the question is.
<div>test</div> <span>test</span> <p>something<p/> </p>
matches: / / /

Java Regex doesn't work with special chars

I've got a problem with my parser. I want to read an image link on a website, and this normally works fine. But today I got a link that contains special chars and the usual regex did not work.
This is how my code looks like.
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of html, that causes trouble
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe that it has something to do with all the special characters inside the link. But I'm not sure how to escape all of them. I already tried
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there is a newline inside the img tag.
Use Pattern.compile(..., Pattern.DOTALL) or prepend your pattern with (?s).
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
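To see the difference, here is a small sketch that runs the regex from the question with and without DOTALL. The shortened HTML string and the helper name extractSrc are made up for the example; trim() is there because the captured value starts with the newline.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotallDemo {
    // Returns the trimmed src value, or null when the pattern does not match.
    static String extractSrc(String html, int flags) {
        Pattern p = Pattern.compile("<img .*src=\"(.*?)\" .*>", flags);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical tag with a newline right after src=", as in the question.
        String html = "<img class=\"x\" src=\"\nhttp://example.com/a.png\" alt=\"a\" />";
        System.out.println(extractSrc(html, 0));              // no match, . stops at the newline
        System.out.println(extractSrc(html, Pattern.DOTALL)); // the URL is captured
    }
}
```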
Your regex should be like:
String regex = "<img .*src=\"(.*?)\" .*>";
This is probably caused by the newline within the tag. The . character won't match it.
Did you consider not using regex to parse HTML? Parsing HTML with regex is notoriously fragile. Please consider using a parsing library such as JSoup for this.
You should actually use "<img\\s.*?\\bsrc=[\"'](.*?)[\"'].*?>" with the (?s) modifier.

java regex why these two regular expressions are different

I have a java string demonstrating a div element:
String source = "<div class = \"ads\">\n" +
"\t<dl style = \"font-size:14px; color:blue;\">\n" +
"\t\t<li>\n" +
"\t\t\tGgicci's Blog\n" +
"\t\t</li>\n" +
"\t</dl>\n" +
"</div>\n";
which in html form is:
<div class = "ads">
<dl style = "font-size:14px; color:blue;">
<li>
Ggicci's Blog
</li>
</dl>
</div>
And I write such a regex to extract dl element:
<dl[.\\s]*?>[.\\s]*?</div>
But it finds nothing and I modified it to be:
<dl(.|\\s)*?>(.|\\s)*?</div>
then it works. So I tested like this:
System.out.println(Pattern.matches("[.\\s]", "a")); --> false
System.out.println(Pattern.matches("[abc\\s]", "a")); --> true
So why can't the '.' match 'a'?
Inside the square brackets, the characters are treated literally. [.\\s] means "Match a dot, or a backslash or a s".
(.|\\s) is equivalent to ..
I think you really want the following regex:
<dl[^>]*>.*?</div>
When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.
"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.
"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.
But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything including line separator characters.
(?s)<dl\b[^>]*>.*?</dl>
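As a rough sketch of both points (the dot being literal inside a character class, and DOTALL letting . cross line breaks), using the source string from the question. The class and method names are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DlExtractor {
    // (?s) turns on DOTALL so that .*? can span the line breaks inside the element.
    private static final Pattern DL = Pattern.compile("(?s)<dl\\b[^>]*>.*?</dl>");

    // Returns the whole <dl>...</dl> element, or null if there is none.
    static String extractDl(String source) {
        Matcher m = DL.matcher(source);
        return m.find() ? m.group() : null;
    }

    // Inside [...] the dot is a literal dot, so this is true for "." but false for "a".
    static boolean matchesDotOrWhitespace(String s) {
        return Pattern.matches("[.\\s]", s);
    }
}
```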
+1 for above.
I would do:
<dl[^>]*>(.*?)</dl>
To match the content of dl
The syntax [.\\s] makes no sense, because, as Daniel said, the . just means "a dot" in this context.
Why can't you replace your [.\\s] with a much simpler . ?

RegEx - match the whole <a> tag in java

I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks
If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
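A minimal sketch of reading capturing group 2 from that pattern. The helper name extractHref is hypothetical, and the pattern is taken unchanged from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    // Group 1 is the opening quote, \1 requires the same quote to close, group 2 is the link.
    private static final Pattern A_HREF =
            Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Returns the href of the first <a> start tag, or null if none is found.
    static String extractHref(String html) {
        Matcher m = A_HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }
}
```

For example, extractHref("<a class=\"x\" href=\"http://example.com\">") gives back http://example.com, and the same works with single quotes.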
To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s* < and possibly some spaces
a "a"
(?: 0 or more...
\s+ some spaces
[a-z]+ attribute name (simplified)
(?: and maybe...
\s*=\s* an equal sign, possibly with surrounding spaces
(?: and one of:
[a-z0-9]+ - a simple attribute value (simplified)
|"[^"]*" - a double-quoted attr value
|'[^']*' - a single-quoted attr value
)
)?
)*
\s*> possibly more spaces and then >
(The comments at the start of each group also talk about the operator at
the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from
memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
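In Java, a breakdown like this can be kept inside the pattern itself by compiling with Pattern.COMMENTS, which ignores whitespace in the regex and treats # as the start of a comment. The following is only a sketch of that idea using the same simplified expression; the class and method names are invented.

```java
import java.util.regex.Pattern;

final class CommentedTagPattern {
    // Same simplified expression as above, laid out over several lines.
    static final Pattern A_TAG = Pattern.compile(
            "<\\s*a                 # opening bracket, optional spaces, tag name\n" +
            "(?:                    # zero or more attributes\n" +
            "  \\s+[a-z]+           # attribute name (simplified)\n" +
            "  (?:\\s*=\\s*         # optionally an equal sign and a value\n" +
            "    (?:[a-z0-9]+|\"[^\"]*\"|'[^']*')\n" +
            "  )?\n" +
            ")*\n" +
            "\\s*>                  # optional spaces and the closing bracket\n",
            Pattern.COMMENTS);

    // True if the whole string is a single <a ...> start tag.
    static boolean isATag(String s) {
        return A_TAG.matcher(s).matches();
    }
}
```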
you can just match against:
"<a[^>]*>"
Since * is greedy in Java (which I believe it is), this works.
But you cannot match < a whatever="foo" > with that, because of the whitespace.
The following is better, though harder to read:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special character in Java strings.)
This handles optional whitespace before the a and requires at least one whitespace character after it.
So you don't match <abcdef> which is not a correct a tag.
(I assume your a tag stands isolated in one line and you are not working with multiline mode enabled. Else it gets far far more complicated.)
your last *[^/]*?> seems a little bit strange, maybe it doesn't work cause of that.
Ok lets check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
This matches a < followed by one or more characters from [a-z]. The parentheses group (capture) the tag name.
Next comes a space followed by *, which allows zero or more spaces.
[^/]*
This matches any run of characters other than /, or nothing at all (because of the *).
The question mark makes that * lazy, so it matches as few characters as possible.
The last character, >, is matched literally and must appear.
To sum up, your expression matches any tag name, not just a, and it fails as soon as the tag contains a / (for example in a URL) :)
Take a look at: http://www.regular-expressions.info/
This is a good starting point.

Java regular expression to match _all_ whitespace characters

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression which matches all (common) whitespace characters which can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;". I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar whitespace meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), Word joiner encoded in Unicode 3.2 and above as U+2060, "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) and any other character that can be regarded as whitespace.
[Answer]
For my purpose, i.e. catching all whitespace characters, Unicode + traditional, the following expression does the job:
[\p{Z}\s]
The answer is in the comments below but since it is a bit hidden I repeat it here.
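As a small sketch of that expression in use (the method name is made up): \p{Z} covers the Unicode separator categories, such as U+00A0 and U+202F, while \s adds tab, newline and the other ASCII whitespace characters.

```java
final class UnicodeWhitespace {
    // Removes ASCII whitespace (\s) and Unicode separators (\p{Z}) from the input.
    static String stripAllWhitespace(String s) {
        return s.replaceAll("[\\p{Z}\\s]", "");
    }

    public static void main(String[] args) {
        // Contains U+00A0, U+202F, a plain space and a tab.
        System.out.println(stripAllWhitespace("a\u00A0b\u202Fc d\te"));
    }
}
```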
&nbsp; is not a whitespace character, as far as regexps are concerned. You need to either modify the regexp to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or previously parse the string contents to get the ASCII or Unicode representation of the data.
You are mixing abstraction levels here.
If, as a careful reread of the question suggests, you are after a way to match all whitespace characters, meaning standard ASCII plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the job.
You should really clarify your question because it has misled a lot of people (even making the correct answer to have some downvotes).
You clarified the question the way I expected: you're actually not looking for the String literal "&nbsp;" as many here seem to think, and for which the solution is too obvious.
Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.
The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.
In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392
The short version: \\p{javaSpaceChar}
Why: Per the Pattern class, this maps the Character.isSpaceChar method:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
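A short sketch of what that property does and does not cover, assuming single-character strings as input (the class and method names are made up). Note that Character.isSpaceChar is false for tab and newline, so \s may still be needed alongside it.

```java
final class SpaceCharDemo {
    // True if the one-character string is a space character per Character.isSpaceChar.
    static boolean isSpaceChar(String oneChar) {
        return oneChar.matches("\\p{javaSpaceChar}");
    }

    public static void main(String[] args) {
        System.out.println(isSpaceChar("\u00A0")); // no-break space
        System.out.println(isSpaceChar(" "));      // plain space
        System.out.println(isSpaceChar("\t"));     // tab is a control character, not a space character
    }
}
```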
Click here for a summary I made of several competing definitions of "whitespace".
You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.
&nbsp; is not whitespace. It is a character encoding sequence that represents whitespace in HTML. You most likely want to convert HTML-encoded text into plain text before running your string match against it. If that is the case, go look up
javax.swing.text.html
The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:
How many non-printing characters are in common use?
