I have a Java string representing a div element:
String source = "<div class = \"ads\">\n" +
"\t<dl style = \"font-size:14px; color:blue;\">\n" +
"\t\t<li>\n" +
"\t\t\tGgicci's Blog\n" +
"\t\t</li>\n" +
"\t</dl>\n" +
"</div>\n";
which in html form is:
<div class = "ads">
<dl style = "font-size:14px; color:blue;">
<li>
Ggicci's Blog
</li>
</dl>
</div>
And I wrote this regex to extract the dl element:
<dl[.\\s]*?>[.\\s]*?</div>
But it finds nothing and I modified it to be:
<dl(.|\\s)*?>(.|\\s)*?</div>
then it works. So I tested like this:
System.out.println(Pattern.matches("[.\\s]", "a")); --> false
System.out.println(Pattern.matches("[abc\\s]", "a")); --> true
So why can't the '.' match 'a'?
Inside the square brackets, the characters are treated literally. [.\\s] means "Match a dot, or a backslash, or an s".
(.|\\s) is equivalent to ..
I think you really want the following regex:
<dl[^>]*>.*?</div>
When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.
"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.
"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.
But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything including line separator characters.
(?s)<dl\b[^>]*>.*?</dl>
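A minimal Java sketch of the two points above, using the HTML string from the question. The class and method names (DlExtractor, extractDl, classMatches) are made up for illustration and do not come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DlExtractor {
    // (?s) turns on DOTALL, so '.' also matches line terminators.
    private static final Pattern DL = Pattern.compile("(?s)<dl\\b[^>]*>.*?</dl>");

    // Returns the first <dl>...</dl> element, or null if none is found.
    static String extractDl(String html) {
        Matcher m = DL.matcher(html);
        return m.find() ? m.group() : null;
    }

    // True if the whole input matches the character class [.\s].
    static boolean classMatches(String input) {
        return Pattern.matches("[.\\s]", input);
    }

    public static void main(String[] args) {
        String source = "<div class = \"ads\">\n"
                + "\t<dl style = \"font-size:14px; color:blue;\">\n"
                + "\t\t<li>\n"
                + "\t\t\tGgicci's Blog\n"
                + "\t\t</li>\n"
                + "\t</dl>\n"
                + "</div>\n";

        // Inside a character class '.' is a literal dot, so "a" is rejected.
        System.out.println(classMatches("a"));
        System.out.println(classMatches("."));
        System.out.println(extractDl(source));
    }
}
```

The pattern is compiled once and reused, which matters if the extraction runs in a loop.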
+1 for above.
I would do:
<dl[^>]*>(.*?)</dl>
To match the content of dl
The syntax [.\\s] makes no sense because, as Daniel said, the . just means "a dot" in this context.
Why can't you replace your [.\\s] with a much simpler . ?
I have such a string in the velocity template file:
<a id="superurl_${getItemid()}" href="http://example.com?$param1=345&$param2=abf&param3=${par3}">link1</a>
That renders as
<a id="superurl_1288" href="http://example.com?$param1=345&$param2=abf&param3=${par3}">link1</a>
However, it should be rendered as
<a id="superurl_1288" href="http://example.com?$param1=345&$param2=abf&param3=zzz">link1</a>
How can I modify the source so that ${par3} is rendered as its value and is not treated as part of the string?
You can use #[[ .. ]]# to escape strings in Velocity. I think in your case the preceding $ may be conflicting with what goes after them. Try the following:
<a id="superurl_${getItemid()}" href="#[[http://example.com$param1=345&$param2=abf&param3=]]#${par3}">link1</a>
Also, make sure you actually pass a variable called "par3". (This is more likely to be the reason why it's not parsed?)
You could set a variable for the dollar sign like this
#set ( $d = "$")
<a id="superurl_${d}{getItemid()}" href="http://example.com?$param1=345&$param2=abf&param3=${par3}">link1</a>
This answer does not apply to the OP's specific example, but in general dollar signs only need to be escaped when they are followed by an (upper or lower case) letter. A dollar sign that is followed by any other character already stands for itself and does not require escaping. If there is some flexibility in the template, another simple solution is to insert a non-letter character (space, comma, underscore, ...) after the dollar sign.
I have a problem with my parser. I want to read an image link on a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.
This is what my code looks like:
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of the HTML that causes trouble:
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there are newlines inside the img tag.
Use Pattern.compile(..., Pattern.DOTALL) or prepend your pattern with (?s).
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
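As a rough illustration of the DOTALL advice, the sketch below runs the question's regex with and without the flag against a shortened stand-in for the markup (the example.com URL and the ImgSrcDemo/findSrc names are assumptions for the demo, not taken from the original page).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample markup are illustrative only.
class ImgSrcDemo {
    // The regex from the question, as a Java string literal.
    static final String REGEX = "<img .*src=\"(.*?)\" .*>";

    // Returns the trimmed src value, or null when the pattern does not match.
    static String findSrc(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Note the line break right after src=" as in the question's HTML.
        String html = "<img class=\"zoomLink productImage\" src=\"\n"
                + "http://example.com/is/image/template?$plus$&$image=1\""
                + " alt=\"Produkt\" />";

        // Without DOTALL, '.' cannot cross the newline, so nothing is found.
        System.out.println(findSrc(html, 0));
        // With DOTALL, the same regex captures the URL.
        System.out.println(findSrc(html, Pattern.DOTALL));
    }
}
```

The trim() call removes the leading newline that is captured along with the URL.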
Your regex should look like this:
String regex = "<img .*src=\"(.*?)\" .*>";
This is probably caused by the newline within the tag. The . character won't match it.
Have you considered not using regex to parse HTML? Parsing HTML with regex is notoriously fragile. Please consider using a parsing library such as JSoup for this.
You should actually use <img\\s.*?\\bsrc=["'](.*?)["'].*?> with the (?s) modifier.
I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks
If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
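Here is a small sketch of how the two patterns above might be used from Java. The class and method names (AnchorTagDemo, stripStartTags, firstHref) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class AnchorTagDemo {
    // Matches an <a ...> start tag but not tags such as <abbr>.
    static final Pattern START_TAG = Pattern.compile("<a(?=[>\\s])[^>]*>");
    // Group 1 is the quote character, group 2 is the href value.
    static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Removes every <a ...> start tag, as the question's replaceAll call intends.
    static String stripStartTags(String html) {
        return START_TAG.matcher(html).replaceAll("");
    }

    // Returns the first href value, or null if no anchor with an href is found.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "Go <a class='x' href=\"http://example.com/\">here</a>";
        System.out.println(stripStartTags(html));
        System.out.println(firstHref(html));
    }
}
```

The backreference \\1 ensures the closing quote is the same kind as the opening one.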
To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s*                      < and possibly some spaces
a                         "a"
(?:                       0 or more...
  \s+                     some spaces
  [a-z]+                  attribute name (simplified)
  (?:                     and maybe...
    \s*=\s*               an equal sign, possibly with surrounding spaces
    (?:                   and one of:
      [a-z0-9]+           - a simple attribute value (simplified)
      |"[^"]*"            - a double-quoted attr value
      |'[^']*'            - a single-quoted attr value
    )
  )?
)*
\s*>                      possibly more spaces and then >
(The comments at the start of each group also talk about the operator at
the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from
memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
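To try the simplified expression from Java, something like the following sketch could be used. The regex is the one given above, written as a Java string literal; the SimpleATagDemo class and isATag method are hypothetical names for the demo.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class SimpleATagDemo {
    // The simplified <a> tag regex from above, with backslashes and
    // double quotes escaped for a Java string literal.
    static final Pattern A_TAG = Pattern.compile(
            "<\\s*a(?:\\s+[a-z]+(?:\\s*=\\s*(?:[a-z0-9]+|\"[^\"]*\"|'[^']*'))?)*\\s*>");

    // True if the whole input is a single <a ...> start tag per this regex.
    static boolean isATag(String s) {
        return A_TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isATag("<a href=\"http://example.com/\">"));
        System.out.println(isATag("< a href = 'x' target=blank >"));
        System.out.println(isATag("<abcdef>"));
    }
}
```

Because attribute names are limited to [a-z] here, upper-case or hyphenated attributes are rejected, which is one of the simplifications mentioned above.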
you can just match against:
"<a[^>]*>"
If * is "greedy" in Java (which I think it is), this is correct.
But you cannot match < a whatever="foo" > with that, because of the whitespaces.
The following is better, though harder to understand:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special character in Java strings.)
This handles optional whitespaces before a and at minimum one whitespace after a.
So you don't match <abcdef> which is not a correct a tag.
(I assume your a tag stands isolated in one line and you are not working with multiline mode enabled. Else it gets far far more complicated.)
Your trailing *[^/]*?> looks a little strange; maybe that is why it doesn't work.
Ok, let's check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
This matches a < followed by one or more characters from [a-z]. The parentheses make it a capturing group.
The " *" that follows applies to the space, not to the group: it matches zero or more spaces.
[^/]*
This matches any run of characters other than /, or nothing (because of the *).
The question mark after the * makes the quantifier lazy, so it matches as few characters as possible.
The final > must appear as the last character of the match.
To sum up, the expression fails as soon as the tag contains a /, which a URL in an href usually does :)
Take a look at: http://www.regular-expressions.info/
This is a good starting point.
I need to parse a string and escape all html tags except <a> links.
For example:
"Hello, this is <b>A BOLD</b> bit and this is a google link"
When printed out in my jsp, I want to see the tags printed out as is (i.e. escaped so "A BOLD" is not actually in bold on the page) but the <a> link to be an actual link to google on the page.
I have got a little method that splits the incoming string based on a regex to match <a> links in various formats (with whites spaces, single or double quotes, etc). The regex is as follows:
myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");
Yes it's horrid and probably hopelessly inefficient so open to alternative suggestions, but it does work up to a point. Where it falls down is parsing the link text bit. I want it to accept zero or more occurrences of any characters other than the </a> closing tag but it is parsing it as zero or more occurrences of any characters other than a "<" or "/" or "a" or ">", i.e. as individual characters rather than the complete </a> word. So it matches with any text that has an "e" in it for example.
The bit in question is: [^<\\/a>]*
How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.
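For reference, one common way to express "any characters up to the closing </a>" is a negative lookahead applied at each position, rather than a negated character class. The sketch below is not from the thread; it uses a simplified opening-tag pattern and hypothetical names (LinkTextDemo, charClassMatches, linkText) purely to contrast the two approaches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the simplified patterns are illustrative.
class LinkTextDemo {
    // Negated character class: excludes the single characters <, /, a and >.
    static final Pattern CHAR_CLASS =
            Pattern.compile("<a\\s[^>]*>[^</a>]*</a>");
    // Negative lookahead: consumes one character at a time as long as the
    // sequence "</a>" does not start at the current position.
    static final Pattern LOOKAHEAD =
            Pattern.compile("<a\\s[^>]*>((?:(?!</a>).)*)</a>");

    static boolean charClassMatches(String html) {
        return CHAR_CLASS.matcher(html).find();
    }

    // Returns the link text, or null if no link is found.
    static String linkText(String html) {
        Matcher m = LOOKAHEAD.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.google.com\">a google link</a>";
        // Fails because the link text contains the letter 'a'.
        System.out.println(charClassMatches(html));
        System.out.println(linkText(html));
    }
}
```

A lazy .*? in front of </a> would also work for simple cases; the lookahead form just makes the "not this whole word" intent explicit.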
You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist:
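import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// addTags is an instance method, so start from an empty whitelist
String safe = Jsoup.clean(unsafe, Whitelist.none().addTags("a"));
// safe keeps only whitelisted markup; onclick is dropped (allow href via addAttributes if needed)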
Although I agree with the consensus that regexes were not designed to parse x*ml, I feel that sometimes you just don't have the time to learn, practice and implement new concepts, and a simple regex might well suffice in your case.
If you have enough time, learn XML parsers. Otherwise, here is an untested and possibly not foolproof regex for your problem (escape the backslashes for Java strings):
<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>
Which translates into:
<\s* # less-than character with optional space
(?: # non capturing group of
[^aA]\b # a single letter which is not a nor A
| # or
[a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$
However, unless you need to do some special checking on the characters in between, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
which I suspect would be faster than a regex.
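A small side-by-side sketch of both approaches follows. The class and method names are hypothetical, and the length check in quotedPlain is an addition of this sketch: without it, a string consisting of a single " would pass the startsWith/endsWith test, whereas the regex requires two quote characters.

```java
// Hypothetical demo class; names are illustrative only.
class DelimiterCheck {
    // Regex version: first character is [ and last is ].
    static boolean bracketedRegex(String s) {
        return s.matches("^\\[.*\\]$");
    }

    // Plain string version of the same check.
    static boolean bracketedPlain(String s) {
        return s.startsWith("[") && s.endsWith("]");
    }

    // Regex version: first and last characters are both ".
    static boolean quotedRegex(String s) {
        return s.matches("^\".*\"$");
    }

    // Plain version; the length guard (an assumption of this sketch)
    // rejects a lone quote character.
    static boolean quotedPlain(String s) {
        return s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
    }

    public static void main(String[] args) {
        System.out.println(bracketedRegex("[word]") + " " + bracketedPlain("[word]"));
        System.out.println(quotedRegex("\"word\"") + " " + quotedPlain("\"word\""));
        System.out.println(bracketedRegex("word]") + " " + bracketedPlain("word]"));
    }
}
```

Note that . does not match line terminators by default, so the regex versions reject multi-line input while the plain versions accept it.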
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
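The last point can be tried in Java roughly as follows. This is a sketch with invented names (DelimitedFinder, findAll); it uses the non-greedy pattern from the item above, translated into a Java string literal, and assumes no nesting or escaped delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DelimitedFinder {
    // Lazy quantifiers stop at the nearest closing delimiter.
    private static final Pattern DELIMITED =
            Pattern.compile("\".*?\"|\\[.*?\\]");

    // Collects every quoted or bracketed run, delimiters included.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DELIMITED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("say \"hi\" to [bob] and \"al\""));
    }
}
```

Calling find() in a loop plays the role of the /g flag in the Perl-style notation.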
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
[[A-Za-z]+]
\"[A-Za-z]+\"
If they are in a single expression:
[[\"]+[a-zA-Z]+[]\"]+
Remember that in .net you'll need to escape the double quotes " by ""