Using Condition in Regular Expressions


Source:
<TD>
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
</TD>
Regex:
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
Result:
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
What does "?(1)" mean?
When I run it in Java, it throws an exception, java.util.regex.PatternSyntaxException: the "?(1)" isn't recognized.
The explanation in the book is :
This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.

The syntax is correct. The strange looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an html <a> tag, if there is one since that capture group is optional. Since the back-reference to the captured tag follows the "if" part of the regex, what it is doing is making sure that there was an opening <a> tag captured before trying to match the closing one. A pretty clever way of making both tags optional, but forcing both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them just have <img> tags.
As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.
EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html
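To see the difference between flavors for yourself, here is a minimal sketch that asks Java whether it accepts the pattern with and without the conditional. It assumes the standard java.util.regex engine; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConditionalSupportCheck {
    // Returns true if java.util.regex accepts the pattern, false if it rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question, including the (?(1)...) conditional.
        String conditional =
            "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>(?(1)\\s*</[Aa]>)";
        // The same pattern with the conditional part removed.
        String plain = "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>";

        System.out.println("conditional compiles: " + compiles(conditional));
        System.out.println("plain compiles: " + compiles(plain));
    }
}
```

The check only tells you whether the syntax is accepted; it says nothing about what the pattern would match in a flavor that does support conditionals.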

What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions, to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something.)
I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:
(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a>|<img\s+[^>]+>
One advantage of the conditional version is that you can capture the IMG tag with a single capture group:
(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)
In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:
Pattern p = Pattern.compile(
"<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}

Could it be a non-capturing group, as described here:
There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)
Java Regex Tutorial

The short answer: it doesn't mean anything. The problem lies in this whole snippet:
(?(1)\s*)
() creates a back reference, so you can reuse any text matched inside. They also allow you to apply operators to everything inside of them (but this isn't done in your example).
? means that the item before it should be matched if it's there but it is also OK if it's not. This simply doesn't make sense when it appears after (
(?:MoreTextHere)
Can be used to speed up RegExs when you don't need to reuse the matched text. But that still doesn't really make sense, why match a 1 when your input is HTML?
Try:
(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>
You never said exactly what you were trying to match so if this answer doesn't satisfy you, please explain what you're trying to do with RegEx.

Related

Jsoup Selector Regex matching

I want to get just the elements with this id pattern "answer-[0-9]*"
I'm using this regex in select "div[id~=answer-[0-9]*]"
The matching elements are:
<div class="post-text" id="answer-45881">
and
<div class="hidden modal modal-flag" id="answer-flag-modal45881">
What must I change to get only the first one?

Based on example from official tutorial
[attr~=regex]: elements with attribute values that match the regular expression;
e.g. img[src~=(?i)\.(png|jpe?g)]
it looks like jsoup simply checks if attribute contains some part which can be matched with regex (like in this example .png or .jpg), not if entire value of attribute is matched by regex.
To check if regex matches entire string you need to place anchors representing start of the string ^ and end of the string $.
Also instead of * you probably should use + if you want to make number part mandatory.
So try with div[id~=^answer-[0-9]+$]
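The effect of the anchors can be checked without jsoup. The sketch below assumes, as described above, that the selector behaves like a "contains a match" test, which plain Java expresses with Matcher.find(); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class IdPatternDemo {
    // Mimics a "contains a match" check: true if the regex matches anywhere in the value.
    static boolean containsMatch(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String[] ids = { "answer-45881", "answer-flag-modal45881" };
        String[] regexes = { "answer-[0-9]*", "^answer-[0-9]+$" };
        for (String regex : regexes) {
            for (String id : ids) {
                System.out.println(regex + " on " + id + ": " + containsMatch(regex, id));
            }
        }
    }
}
```

The unanchored pattern finds "answer-" followed by zero digits inside the second id, which is why both elements were selected.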

The * operator means "zero or more" times so it will still match the second example. You need to use the + operator instead meaning "one or more" times. So, your syntax would be:
div[id~=answer-[0-9]+]

It looks like it searches id to contain this pattern, not to match.
"div[id~=answer-[0-9]*$]"
should work then.

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on this string "MM:";d it comes as matched as I wanted
and on this string "MM:";c it comes as not matched as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
for this string "MM:";c it comes as matched when I expected it to not like before.
I tried this on Java and then on Javascript using Regex tool debuggex:
This link contains a snippet of the above
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's suggestion of using a possessive quantifier,
\"(\S\S+)\";?+(?!c)
it worked; here is a link to it on RegexPlanet

I believe the regex engine will do whatever it can to find a match. Since your expression makes the semicolon optional, the engine can still match the whole expression: if the group does not consume the semicolon, the character after the match is ';' rather than 'c', so the negative lookahead succeeds. This comes from the backtracking way regex engines work: they keep trying alternatives until they find a match...
In other words, the process goes like this:
MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops - negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - next character is ';', not 'c', so the negative lookahead succeeds. We have a match
An update (based on your comment). The following code may work better:
\"(\S\S+)\"(;|)((?!c)|(?!;c))
Debuggex Demo

The problem is that you don’t want to make the semicolon optional in the sense of regular expression. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there the matcher can ignore it creating an empty match for the group letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
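Here is a small sketch that runs both versions against the two sample strings from the question, so the effect of the possessive quantifier is visible. The class, constant, and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Optional semicolon via an alternation: the engine may backtrack and skip it.
    static final String OPTIONAL_GROUP = "\"(\\S\\S+)\"(;|)(?!c)";
    // Possessive optional semicolon: once consumed, it is never given back.
    static final String POSSESSIVE = "\"(\\S\\S+)\";?+(?!c)";

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = { "\"MM:\";c", "\"MM:\";d" };
        for (String input : inputs) {
            System.out.println(input
                + "  optional group: " + found(OPTIONAL_GROUP, input)
                + "  possessive: " + found(POSSESSIVE, input));
        }
    }
}
```

With the alternation, the input ending in ;c still matches because the engine retries with an empty group; with ;?+ it has no such fallback.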

Regex to match any / in HTML except for <p> tag

Basically I need to match any / in an HTML document that isn't part of a closed <p> tag.
This is what I have so far, but it doesn't really work as expected and I've been trying for some time now.
((?<!(p))\/(?!(>))) | ((?<!(<))\/(?!(p)))
I also need the regex to work in Java.
As an example:
<div>test</div> <span>test</span> <p>something<p/> </p>
I would like it to match every / except for the ones in the <p> tags at the end!

Fortunately, Java supports both lookbehind and lookahead (in contrast, JavaScript, the language I spend most of my time in, historically supported only lookahead; lookbehind was added in ES2018).
So the pattern you're looking for is:
(?<!<p)/(?!p>)
This pattern will match any slash that's neither preceded by a <p or followed by a p>. Therefore it excludes <p/> as well as </p>.
The lookahead/lookbehind assertions (often called "zero-width" assertions) are not actually included in the match, which sounds like what you want. It basically asserts that the thing you are trying to match is preceded by (lookbehind) or followed by (lookahead) a sub-expression. In this case we're using negative assertions (not preceded by / not followed by).
Parsing HTML with regex is a tricky business. As one answer pointed out, HTML is context-free, and therefore cannot be completely parsed by a regular expression, leaving open the possibility of HTML that will confound the match. Let's not even get started on ill-formed HTML.
I would consider the following common variation on an empty tag, though:
<p />
To handle this, I would add some whitespace to the match:
(?<!<p\s*)/(?!p>)
Where you might run into problems is weird whitespace (still valid HTML). The following slashes WILL match with the above regex:
< p/>
<p/ >
This can be dealt with by adding more whitespace repetitions to your regex. As mentioned before, this will also match slashes in text, so the following input will match only one slash (the one in the text):
<p>some text / other text</p>
Lastly, of course, there are CDATA groups. The following input will match NO slashes:
<![CDATA[This <p/> isn't actually a tag...it's just text.]]>
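A quick way to try the basic pattern (the one without the whitespace handling) is to count the slashes it matches in the sample inputs. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashFinder {
    // A slash that is neither preceded by "<p" nor followed by "p>".
    static final Pattern SLASH = Pattern.compile("(?<!<p)/(?!p>)");

    // Counts how many slashes in the input the pattern accepts.
    static int countSlashes(String html) {
        Matcher m = SLASH.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSlashes(
            "<div>test</div> <span>test</span> <p>something<p/> </p>"));
        System.out.println(countSlashes("<p>some text / other text</p>"));
        System.out.println(countSlashes(
            "<![CDATA[This <p/> isn't actually a tag...it's just text.]]>"));
    }
}
```

For the question's example it counts the slashes in </div> and </span> and skips the ones in <p/> and </p>.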

/(?!p)
This seems to work, but I'm not sure what the question is.
<div>test</div> <span>test</span> <p>something<p/> </p>
matches: / / /

RegEx - match the whole <a> tag in java

I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks

If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
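Both patterns can be wrapped into small helpers, roughly like this. It is a sketch only: the class and method names are invented, and it assumes lowercase tags with quoted href values, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTagDemo {
    // Start tag only: "<a" must be followed by whitespace or ">".
    static final Pattern START_TAG = Pattern.compile("<a(?=[>\\s])[^>]*>");
    // Start tag with a quoted href; group 1 is the quote, group 2 the link.
    static final Pattern HREF = Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Removes every <a ...> start tag, as the question's replaceAll call intended.
    static String stripStartTags(String html) {
        return START_TAG.matcher(html).replaceAll("");
    }

    // Collects the href value of every matching <a> start tag.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/x\">link</a> <abbr>text</abbr>";
        System.out.println(stripStartTags(html));
        System.out.println(extractHrefs(html));
    }
}
```

The lookahead in the first pattern is what keeps tags such as <abbr> from being treated as <a> tags.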

To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s*                  < and possibly some spaces
a                     "a"
(?:                   0 or more...
  \s+                   some spaces
  [a-z]+                attribute name (simplified)
  (?:                   and maybe...
    \s*=\s*               an equal sign, possibly with surrounding spaces
    (?:                   and one of:
      [a-z0-9]+             - a simple attribute value (simplified)
      |"[^"]*"              - a double-quoted attr value
      |'[^']*'              - a single-quoted attr value
    )
  )?
)*
\s*>                  possibly more spaces and then >
(The comments at the start of each group also talk about the operator at
the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from
memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
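In Java the same breakdown can be kept inside the pattern itself by compiling it with Pattern.COMMENTS, which ignores whitespace and treats # as the start of a comment. The sketch below uses the simplified expression from above unchanged; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CommentedTagPattern {
    // The simplified <a> tag regex, one piece per line, compiled in COMMENTS mode.
    static final Pattern A_TAG = Pattern.compile(String.join("\n",
        "<\\s*                 # < and possibly some spaces",
        "a                     # the tag name",
        "(?:                   # zero or more attributes, each being...",
        "  \\s+                #   some spaces",
        "  [a-z]+              #   attribute name (simplified)",
        "  (?:                 #   and maybe...",
        "    \\s*=\\s*         #     an equal sign, possibly with surrounding spaces",
        "    (?:               #     and one of:",
        "      [a-z0-9]+       #       a simple attribute value (simplified)",
        "      | \"[^\"]*\"    #       a double-quoted value",
        "      | '[^']*'       #       a single-quoted value",
        "    )",
        "  )?",
        ")*",
        "\\s*>                 # possibly more spaces and then >"),
        Pattern.COMMENTS);

    // True if the whole string is an <a> start tag according to the simplified regex.
    static boolean isAnchorStartTag(String s) {
        return A_TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAnchorStartTag("<a href=\"x\" class='y' id=z1>"));
        System.out.println(isAnchorStartTag("<abbr>"));
    }
}
```

The same simplifications apply as before: lowercase only, and attribute names and unquoted values limited to a few characters.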

you can just match against:
"<a[^>]*>"
The * quantifier is greedy in Java, but because [^>] cannot match a >, the match still stops at the first >.
However, this does not match < a whatever="foo" >, because of the whitespace after the <.
Although the following is better, but more complicated to understand:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special character in Java strings.)
This handles optional whitespaces before a and at minimum one whitespace after a.
So you don't match <abcdef> which is not a correct a tag.
(I assume your a tag stands isolated in one line and you are not working with multiline mode enabled. Else it gets far far more complicated.)
your last *[^/]*?> seems a little bit strange, maybe it doesn't work cause of that.
Ok lets check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
This matches a < followed by one or more characters from [a-z]. The parentheses make it a capture group.
Next comes " *" (a space followed by *), which means the space may appear any number of times, or not at all.
[^/]*
This means now match everything, but a / or nothing (because of the *)
The question mark after the * makes the quantifier reluctant (lazy): it matches as few characters as possible.
Last char > matched as last element, which must appear.
To sum up, the expression matches any lowercase start tag, not just <a>, and it fails as soon as an attribute value contains a / (as most href values do) :)
Take a look at: http://www.regular-expressions.info/
This is a good starting point.

Regex in java question, multiple matches

I am trying to match multiple CSS style code blocks in a HTML document. This code will match the first one but won't match the second. What code would I need to match the second. Can I just get a list of the groups that are inside of my 'style' brackets? Should I call the 'find' method to get the next match?
Here is my regex pattern
^.*(<style type="text/css">)(.*)(</style>).*$
Usage:
final Pattern pattern_css = Pattern.compile(css_pattern_buf.toString(),
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
final Matcher match_css = pattern_css.matcher(text);
if (match_css.matches() && (match_css.groupCount() >= 3)) {
System.out.println("Woot ==>" + match_css.groupCount());
System.out.println(match_css.group(2));
} else {
System.out.println("No Match");
}

I am trying to match multiple CSS style code blocks in a HTML document.
Standard Answer: don't use regex to parse HTML. Regex cannot parse HTML reliably, no matter how complicated and clever you make your expression. Unless you are absolutely sure the exact format of the target document is totally fixed, string or regex processing is insufficient and you must use an HTML parser.
(<style type="text/css">)(.*)(</style>)
That's a greedy expression. The (.*) in the middle will match as much as it possibly can. If you have two style blocks:
<style type="text/css">1</style> <style type="text/css">2</style>
then it will happily match '1</style> <style type="text/css">2'.
Use (.*?) to get a non-greedy expression, which will allow the trailing (</style>) to match at the first opportunity.
Should I call the 'find' method to get the next match?
Yes, and you should have used it to get the first match too. The usual idiom is:
while (matcher.find()) {
s= matcher.group(n);
}
Note that standard string processing (indexOf, etc) may be a simpler approach for you than regex, since you're only using completely fixed strings. However, the Standard Answer still applies.
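Putting the non-greedy group and the find() loop together might look like the sketch below. The class and method names are made up, and it assumes the style tags are written exactly as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleBlockExtractor {
    // Non-greedy (.*?) stops at the first </style>; DOTALL lets . span line breaks.
    static final Pattern STYLE = Pattern.compile(
        "<style type=\"text/css\">(.*?)</style>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the contents of every style block, in document order.
    static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = STYLE.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html =
            "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        for (String block : extract(html)) {
            System.out.println(block);
        }
    }
}
```

Unlike matches(), which has to cover the whole input, find() moves on to the next block on each call, so no ^.* or .*$ wrapping is needed.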

You can simplify the regex as follows:
(<style type="text/css">)(.*?)(</style>)
And if you don’t need the groups 1 and 3 (probably not), I would drop the parentheses, remaining only:
<style type="text/css">(.*?)</style>
