Jsoup Selector Regex matching


I want to select only the elements whose id matches the pattern "answer-[0-9]*".
I'm using this regex in select: "div[id~=answer-[0-9]*]"
The matching elements are:
<div class="post-text" id="answer-45881">
and
<div class="hidden modal modal-flag" id="answer-flag-modal45881">
What must I change to get only the first one?

Based on this example from the official tutorial:
[attr~=regex]: elements with attribute values that match the regular expression;
e.g. img[src~=(?i)\.(png|jpe?g)]
it looks like jsoup simply checks whether the attribute value contains some part that the regex can match (in this example, .png or .jpg), not whether the regex matches the entire value.
To check whether the regex matches the entire string, add the anchors ^ (start of the string) and $ (end of the string).
Also, instead of * you should probably use + if you want the number part to be mandatory.
So try div[id~=^answer-[0-9]+$]
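The difference between a "contains" check and a whole-value match can be reproduced with plain java.util.regex, without jsoup. The sketch below assumes that jsoup's [attr~=regex] behaves like Matcher.find(), which is what the behaviour described above suggests. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Mimics a "contains" style check: true if the regex matches any part of the value.
    static boolean containsMatch(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String[] ids = {"answer-45881", "answer-flag-modal45881"};
        for (String id : ids) {
            // Unanchored: "answer-" followed by zero digits is enough, so both ids pass.
            boolean unanchored = containsMatch("answer-[0-9]*", id);
            // Anchored with +: the whole id must be "answer-" plus at least one digit.
            boolean anchored = containsMatch("^answer-[0-9]+$", id);
            System.out.println(id + " unanchored=" + unanchored + " anchored=" + anchored);
        }
    }
}
```

With the anchors in place, only answer-45881 passes, which is the element the question asks for.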

The * operator means "zero or more" times so it will still match the second example. You need to use the + operator instead meaning "one or more" times. So, your syntax would be:
div[id~=answer-[0-9]+]

It looks like it checks whether the id contains this pattern, not whether the whole id matches it.
"div[id~=answer-[0-9]*$]"
should work then.

Related

Freemarker regex is not matching on all lowercase substrings

I am following the user guide, which seems straightforward, so I'm not sure what I am doing wrong. I want to use the matches built-in to find all lowercase words in a string. Taking the example straight from the docs into my code (with some obvious changes), I always get the "Does not match" output. Any help is much appreciated:
<#assign res = "<UPPERCASE_WORD<lowercase_word>>"?matches("[a-z]+")>
<#if res>
Matches
<#else>
Does not match
</#if>
One difference I've noticed between my code and the docs is that the example has spaces and mine does not, but I doubt that's the issue: a quick test with < > replaced by spaces shows no difference. I wondered whether the regex is incorrect or not supported by Freemarker, but the docs link directly to Oracle's regex Pattern docs, so I think that's OK.

Don't use matches if you don't expect an exact match:
This built-in determines if the string exactly matches the pattern
If you know a regex that matches the whole string, use it.
For example, for lowercase letters followed by uppercase letters, use:
?matches("[a-z]+[A-Z]+")>

If you want to check if the string contains [a-z] somewhere, then the regular expression should be ".*[a-z]+.*", because ?matches checks if the pattern matches the whole string.
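The same distinction exists in Java, whose Pattern syntax the Freemarker docs link to. As a rough analogy (this is Java, not Freemarker code): String.matches requires the whole string to match, as ?matches does, while Matcher.find looks for matching substrings. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeVsPartial {
    // Collects every run of lowercase ASCII letters found anywhere in the input.
    static List<String> lowercaseWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "<UPPERCASE_WORD<lowercase_word>>";
        System.out.println(s.matches("[a-z]+"));      // false: the whole string is not lowercase
        System.out.println(s.matches(".*[a-z]+.*"));  // true: the pattern now spans the whole string
        System.out.println(lowercaseWords(s));        // [lowercase, word]
    }
}
```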

How to find optional group with some prefix using Regex

This is my pattern regex:
"subcategory.html?.*id=(.*?)&.*title=(.+)?"
for below input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale
I want to capture the groups below:
group one (id) : 3000080292
group two (title) : BabySale
For which it is working fine. The problem is I want to make second group i.e. value of title to be optional, so that even if title is not present, regex should match and get me value of group 1(id). But for input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&
Regex match is failing even if group one is present. So my question is how to make second group optional here?

Maybe make the entire substring optional?
Try subcategory.html?.*id=(.*?)&.*(?:title=(.+)?)?
Also note that your (and my) regex might be matching too much. For example, the dot here should probably be escaped: subcategory\.html instead of subcategory.html or you will match subcategory€html, too. Your question mark says the l of html is optional; you are probably saved by the .* ("match anything"), that follows.
Last but not least, the final .* means that even this will match (which you probably don't want to match):
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale&Lorem Ipsum Sit Atem http://&%$
It's usually a bad idea to match .* as it will nearly always match too much. Consider using character classes instead of the dot, and anchoring the beginning (^) and end ($) of the string... :)

One of the possible ways is to use something like:
subcategory\.html\?.*id=(.*?)&(.*title=(.+)?)?
(.*title=(.+)?)? is optional now.
As suggested by @Christian, it is better to make .*title a non-capturing group so that it won't be part of the result.
subcategory\.html\?.*id=(.*?)&(?:.*title=(.+)?)?

If you know that parameter id comes before optional title then you can use this regex to capture id and optional title parameters:
subcategory\.html\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?
In Java use this regex:
final String regex = "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?";
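A minimal sketch of how that regex could be applied to the two URLs from the question. The class name, method name, and the choice to return a two-element array are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalTitleDemo {
    private static final Pattern QUERY = Pattern.compile(
        "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?");

    // Returns {id, title}; title is null when the optional group did not take part.
    // Returns null when the URL does not match at all.
    static String[] idAndTitle(String url) {
        Matcher m = QUERY.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String withTitle =
            "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale";
        String withoutTitle =
            "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&";
        System.out.println(String.join(" / ", idAndTitle(withTitle))); // 3000080292 / BabySale
        System.out.println(idAndTitle(withoutTitle)[1]);               // null
    }
}
```

In Java, Matcher.group(n) returns null for a group that did not participate in the match, so the caller has to check for null before using the title.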

How to replace a continuous occurrence of a substring with a single substring?

I have an HTML string parsed in Android from a spannable string:
<p dir="ltr"><b><b><b><b><b>qwert</b></b></b></b></b><b><b><b><b><b><b>y</b></b></b></b></b></b></p>
As you can see, there are multiple occurrences of the tags.
I have tried methods like replaceAll() by trial and error, but they replace all occurrences.
What I want is to pass a substring to find, say "<b>", and have it replace, say, the first five consecutive bold tags in the above string with a single "<b>" tag.
Any suggestions?
Required result: <p dir="ltr"><b>qwert</b><b>y</b></p>

If I understand your problem correctly, you can try this regex then:
(<[^>]+>)\\1+
And replace with:
\\1
In code...
String test = "<p dir=\"ltr\"><b><b><b><b><b>qwert</b></b></b></b></b><b><b><b><b><b><b>y</b></b></b></b></b></b></p>";
String out = test.replaceAll("(<[^>]+>)\\1+", "$1");
Output:
<p dir="ltr"><b>qwert</b><b>y</b></p>
(<[^>]+>) matches and catches in group 1, the first tag that it finds.
\\1 in the regex refers to the first captured tag. The + indicates unlimited repetition (well, the limit is a big number I don't think you need to worry about).
The replacement $1 then also refers to the first captured tag.
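The question also asks about passing in the substring to collapse. One possible variant, shown here as a sketch rather than as part of the answer above, quotes the given token so that it is treated literally and replaces each run of it with a single copy. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CollapseRuns {
    // Replaces every run of one or more consecutive copies of token with a single copy.
    static String collapse(String input, String token) {
        // Pattern.quote makes characters such as < > / match literally.
        String run = "(?:" + Pattern.quote(token) + ")+";
        // quoteReplacement guards against $ or \ in the token being read as group references.
        return input.replaceAll(run, Matcher.quoteReplacement(token));
    }

    public static void main(String[] args) {
        String test = "<p dir=\"ltr\"><b><b><b><b><b>qwert</b></b></b></b></b>"
            + "<b><b><b><b><b><b>y</b></b></b></b></b></b></p>";
        String out = collapse(collapse(test, "<b>"), "</b>");
        System.out.println(out); // <p dir="ltr"><b>qwert</b><b>y</b></p>
    }
}
```

Collapsing only the opening tag would leave the closing tags unbalanced, so the usage above calls the method once for "<b>" and once for "</b>".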

You want something like this:
find: (<b>)\1+|(<\/b>)\2+
replace: \1\2
Demo here:
http://regex101.com/r/aC6iP4

Regular Expression - Return all matches as a single match

I'm working with a piece of code that applies a regex to a string and returns the first match. I don't have access to modify the code to return all matches, nor do I have the ability to implement alternative code.
I have the following example target string:
usera,userb,,userc,,userd,usere,userf,
This is a list of comma delimited usernames joined from multiple sources, some of which were blank resulting in two commas in some places. I'm trying to write a regex that will return all of the comma delimited usernames except for specific values.
For example, consider the following expression:
[^,]\w{1,},(?<!(userb|userc|userd),)
This results in three matches:
usera,
usere,
userf,
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,' ?
If I could write code in any language this would be trivial, but I'm limited to input of only the target string and the pattern, and I need a single match that has all items except for the ones I'm omitting. I'm not sure if this is even possible, everything I've ever done with regex's involves processing multiple items in a match collection.
Regex Coach finds the three matches I want, but my requirement is to have the text in a single match, not three separate matches.
EDIT1:
To clarify, this question is specifically about solving the use case using only regular expression syntax. Solving this problem in code is trivial, but solving it using only a regex was the requirement, because the executing code is part of a 3rd party product that I didn't want to reverse engineer, wrap, or replace.

Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,'?
No. Regex matches are consecutive.
A regular expression matches a (sub)string from start to finish. You cannot drop the middle part, this is not how regex engines work. But you can apply the expression again to find another matching substring (incremental search - that's what Regex Coach does). This would result in a match collection.
That being said, you could also just match everything you don't want to keep and remove it, e.g.
,(?=[\s,]+)|(userb|userc|userd)[\s,]*
http://rubular.com/r/LOKOg6IeBa
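For readers who can run code (unlike the asker), the removal approach can be tried as follows. This is only a sketch: it applies the pattern above through Java's String.replaceAll, and the class and method names are hypothetical.

```java
public class RemoveUnwanted {
    // Deletes the excluded usernames (with their trailing separators)
    // and any comma that is directly followed by another comma or whitespace.
    static String keepAllowed(String users) {
        return users.replaceAll(",(?=[\\s,]+)|(userb|userc|userd)[\\s,]*", "");
    }

    public static void main(String[] args) {
        String users = "usera,userb,,userc,,userd,usere,userf,";
        System.out.println(keepAllowed(users)); // usera,usere,userf,
    }
}
```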

Using Condition in Regular Expressions

Source:
<TD>
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
</TD>
Regex:
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
Result:
<IMG SRC="/images/home.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/search.gif">
<IMG SRC="/images/spacer.gif">
<IMG SRC="/images/help.gif">
What does the "?(1)" mean?
When I run it in Java, it causes an exception, java.util.regex.PatternSyntaxException: the "?(1)" isn't recognized.
The explanation in the book is:
This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.

The syntax is correct. The strange looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an html <a> tag, if there is one since that capture group is optional. Since the back-reference to the captured tag follows the "if" part of the regex, what it is doing is making sure that there was an opening <a> tag captured before trying to match the closing one. A pretty clever way of making both tags optional, but forcing both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them just have <img> tags.
As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.
EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html

What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions, to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something.)
I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:
(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a>|<img\s+[^>]+>
One advantage of the conditional version is that you can capture the IMG tag with a single capture group:
(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)
In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:
Pattern p = Pattern.compile(
    "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
    Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
    // Group 1 is set when the IMG tag was wrapped in <a>...</a>, group 2 otherwise.
    System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}
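A self-contained version of the snippet above, for trying it out. The class name, the helper methods, and the sample input with an added <A> wrapper are assumptions made for illustration. The compiles helper also shows the PatternSyntaxException that the question reports for the conditional syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ImgTagDemo {
    private static final Pattern IMG = Pattern.compile(
        "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
        Pattern.CASE_INSENSITIVE);

    // Returns every IMG tag, whether or not it is wrapped in an <a> element.
    static List<String> imgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            tags.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tags;
    }

    // True if java.util.regex accepts the pattern, false if it rejects the syntax.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String html = "<TD>\n<IMG SRC=\"/images/home.gif\">\n"
            + "<A HREF=\"/search\"><IMG SRC=\"/images/search.gif\"></A>\n</TD>";
        imgTags(html).forEach(System.out::println);
        System.out.println(compiles("(a)?b(?(1)c)")); // false: Java rejects the conditional
    }
}
```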

Could it be a non capturing group as described here:
There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)
Java Regex Tutorial

The short answer: it doesn't mean anything. The problem lies in this whole snippet:
(?(1)\s*)
() creates a back reference, so you can reuse any text matched inside. They also allow you to apply operators to everything inside of them (but this isn't done in your example).
? means that the item before it should be matched if it's there but it is also OK if it's not. This simply doesn't make sense when it appears after (
(?:MoreTextHere)
Can be used to speed up RegExs when you don't need to reuse the matched text. But that still doesn't really make sense, why match a 1 when your input is HTML?
Try:
(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>
You never said exactly what you were trying to match so if this answer doesn't satisfy you, please explain what you're trying to do with RegEx.
