Using pattern matcher to extract html

I have a piece of HTML:
<div class="content" itemprop="softwareVersion"> 2.3 </div>
(This is the version of my app in the Play Store.) What I am trying to do is get the latest version using pattern matching.
What I have thus far for matching the pattern is:
String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(htmlString);
matcher.find();
How do I now go about extracting 2.3 from htmlString?

Using JSoup xhtml parser
It's well known that you should not parse XHTML with regex unless you know exactly what HTML you are going to parse. You should use an HTML parser such as JSoup instead. So, you could use something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("div[itemprop=softwareVersion]").first();
System.out.println(element.text());
Regex approach
However, if you want to use regex, you have to use a capturing group and then read its content. Note that the pattern has to end with </div (not </dd) and allow for the space before the closing tag.
String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) *</div");
// The parentheses around [^ <]* form capturing group 1.
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Try capturing it in a capture group:
Pattern.compile("softwareVersion\"> ([^ <]*) *</div");
Then access the value with matcher.group(1).

I had to tweak a few things to make this work:
String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
// else: the pattern did not match
The () in the regex make it possible to use matcher.group(1).

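Try this regex: \"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>
Note that it expects exactly two whitespace characters before </div>, and that the unescaped . matches any character, not just a literal dot. Token by token: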
\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the characters > literally
\s match any white space character [\r\n\t\f ]
1st Capturing group ([0-9].?[0-9]?+)
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
.? matches any character (except newline)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[0-9]?+ match a single character present in the list below
Quantifier: ?+ Between zero and one time, as many times as possible, without giving back [possessive]
0-9 a single character in the range between 0 and 9
\s match any white space character [\r\n\t\f ]
\s match any white space character [\r\n\t\f ]
< matches the characters < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)
https://regex101.com/r/kR7lC2/1

First, as comments point out, you can't parse HTML with a regex (thanks to Jeff Burka for linking to the canonical answer).
Second, since you are looking at a very limited and particular situation you can match using a capturing group to get the version.
Assuming that the div in question is not broken across lines, my strategy would be much like your posted attempt; look for the string softwareVersion and the tag close > character, optional whitespace, the version string, optional whitespace, and the closing tag.
That gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</
From debuggex (which needs the .* to match the leading part):
.*softwareVersion[^>]*>\s*([0-9.]+)\s*</
This will give you the version in a capturing group, which will be matcher.group(1)
As a Java string, that's softwareVersion[^>]*>\\s*([0-9.]+)\\s*</
I omitted the div after </ because, while it's in a div now, maybe it'll be a span or something else in the future.
I went simple with [0-9.] so it can match 2.3 but also 3.0.1; however, it would also match ..382.1...33. You could write one that matches a limited or arbitrary number of n(.n)* dotted numbers if that were important.
softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</ matches a version number n with zero to three .n point releases, so 3.0.2.1 but not 1.2.3.4.5.
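A minimal sketch of how the stricter pattern could be used from Java, assuming the page source is already available as a string. The class name VersionExtractor and the method extractVersion are made up for illustration; only Pattern, Matcher and Optional come from the standard library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VersionExtractor {
    // Version number n with zero to three ".n" point releases, captured in group 1.
    private static final Pattern VERSION = Pattern.compile(
            "softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</");

    // Returns the first version found, or an empty Optional when nothing matches.
    static Optional<String> extractVersion(String html) {
        Matcher matcher = VERSION.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>";
        System.out.println(extractVersion(html).orElse("no version found"));
    }
}
```

Returning an Optional makes the "no match" case explicit, so the caller never reads group(1) from a matcher that did not match.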

Related

Match starting and ending character using Java Matcher class

I want to get words from a string that start with # and end with a space. I've tried Pattern.compile("#\\s*(\\w+)"), but it doesn't include characters like ' or :.
I want a solution that uses only pattern matching.

We can try matching using the pattern (?<=\\s|^)#\\S+, which matches a # that is preceded by whitespace or the start of the input, followed by one or more non-whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(0));
}
#hashtag
#another
Note: The above solution may pull in punctuation that appears at the end of a hashtag. If you don't want this, the regex can be restricted to match only certain characters, e.g. letters and digits. But maybe this is not a concern for you.
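One way the stricter variant from the note could look, as a sketch: the \\S+ is swapped for [A-Za-z0-9]+ so that trailing punctuation is left out. The class and method names (HashtagFinder, findHashtags) are hypothetical, and the character set is an assumption about what should count as part of a hashtag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // '#' at the start of input or after whitespace, then letters and digits only.
    private static final Pattern HASHTAG = Pattern.compile("(?<=\\s|^)#[A-Za-z0-9]+");

    // Collects every hashtag in the order it appears in the text.
    static List<String> findHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(findHashtags("Here is a #hashtag, and here is #another."));
    }
}
```

With this character class the comma and the full stop in the sample sentence are not part of the matches.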

The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not a white space.
If you want to stop at a space character rather than at any whitespace, change the \S to [^ ].
The ^ inside the brackets negates the character class, so [^ ] matches any character except a space.
Pattern.compile("#\\s*([^ ]+)")

Java regex for string pattern

I want to write a regex for this string pattern:
<Col name="SKU_UPC_NBR">85634546495</Col>
I want to fetch the value between the Col tags.
I tried the pattern below:
Pattern TAG_REGEX = Pattern.compile("<Col name='SKU_UPC_NBR'>(.+?)</col>");
Matcher matcher = TAG_REGEX.matcher(str);
The above does not match my string and returns nothing.
Please help me with this problem.

You can try:
<Col[^>]*>(.+?)<\/Col>
<Col[^>]*> will match the opening tag. [^>]* means match any character but >, so that the match ends at the first > encountered.
(.+?) means grab 1 or more characters between the opening and closing tag
<\/Col> this matches the closing tag
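As a rough sketch, the pattern above could be applied in Java like this. The \/ escape is not needed in a Java regex, so the string literal uses a plain /. The names ColValueReader and firstValue are invented for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColValueReader {
    // Opening <Col ...> tag with any attributes, then the value in group 1, then </Col>.
    private static final Pattern COL = Pattern.compile("<Col[^>]*>(.+?)</Col>");

    // Returns the text of the first Col element, or an empty Optional if there is none.
    static Optional<String> firstValue(String input) {
        Matcher matcher = COL.matcher(input);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String str = "<Col name=\"SKU_UPC_NBR\">85634546495</Col>";
        System.out.println(firstValue(str).orElse("no match"));
    }
}
```

The match is case sensitive, so a closing tag written as </col> would not be found unless Pattern.CASE_INSENSITIVE is passed to compile.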

A regex matches exactly what you type. It does not generalize: it does not know that you consider ' and " interchangeable, and it does not ignore case unless told to.
The data format you've specified is open tag, space, name attribute, equals, double quote, name attr data ...
The regex format you've specified is open tag, space, name attribute, equals, single quote, name attr data ...
What you need is
Pattern TAG_REGEX = Pattern.compile("<Col name=\"SKU_UPC_NBR\">(.+?)</Col>");
NOTE: You may want to use (\d+?) instead of (.+?), as \d matches any digit, so the regex is more specific to the data you're matching and easier to read. This won't work, however, if some Col tags contain more than just digits.
You may want to refer to this neat interactive Regex tutorial for practice with regexes.
You may also want to refer to the Java documentation for Regex patterns, which is useful when you need special characters.

Try this please:
(?<=">)\d*(?=<\/)
It will match 0 or more digits preceded by "> (quotation mark and greater-than sign) and followed by </ (less-than sign and forward slash).
You can test this here:
https://regex101.com/
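For illustration, a small sketch of the lookaround version in Java. Because the lookbehind and lookahead do not consume characters, the whole match (group 0) is the number itself. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDigits {
    // Digits that are preceded by "> and followed by </, without matching either.
    private static final Pattern DIGITS = Pattern.compile("(?<=\">)\\d*(?=</)");

    // Returns the first run of digits found between the two delimiters, if any.
    static Optional<String> digitsBetweenTags(String input) {
        Matcher matcher = DIGITS.matcher(input);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String str = "<Col name=\"SKU_UPC_NBR\">85634546495</Col>";
        System.out.println(digitsBetweenTags(str).orElse("no match"));
    }
}
```

Note that this pattern does not check the tag name at all, so it would pick up digits from any element whose opening tag ends in a quoted attribute.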

Java regexp in matcher input

I'm trying to get quoted strings using regexp.
String regexp = "('([^\\\\']+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*'|\"([^\\\\\"]+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*\")";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(source);
while (m.find()) {
String newElement = m.group(1);
//...
}
It works well, but if the source text contains
' onkeyup="this.value = this.value.replace (/\D/, \'\')">'
the program goes into an endless loop.
How can I correctly get this string?
For example, I have this text (PHP code):
'qty'=>'<input type="text" maxlength="3" class="qty_text" id='.$key.' value ='
The result should be
'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='

Your regex seems to work okay when presented with a string it matches; it's when it can't match that it goes into the endless loop. (In this case it's the \D that's causing it to choke.) But that regex is much more complicated than it needs to be; you're trying to match them, not validate them. Here's the quintessential regex for a string literal in C-style languages:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
...and the single-quoted version, for languages that support that style:
'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'
It uses Friedl's "unrolled loop" technique for maximum efficiency. Here's the Java code for it, as generated by RegexBuddy 4:
Pattern regex = Pattern.compile(
"\"[^\"\\\\\r\n]*(?:\\\\.[^\"\\\\\r\n]*)*\"|'[^'\\\\\r\n]*(?:\\\\.[^'\\\\\r\n]*)*'"
);
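A short sketch of how that compiled pattern might be used to collect every quoted string, tried on the two inputs from the question. The wrapper class QuotedStringFinder and its method findQuoted are assumptions made for the example; the regex is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // Double-quoted or single-quoted literal; a backslash escapes any following character.
    private static final Pattern QUOTED = Pattern.compile(
            "\"[^\"\\\\\r\n]*(?:\\\\.[^\"\\\\\r\n]*)*\"|'[^'\\\\\r\n]*(?:\\\\.[^'\\\\\r\n]*)*'");

    // Returns all quoted strings, including their surrounding quotes, in order of appearance.
    static List<String> findQuoted(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String php = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
        String tricky = "' onkeyup=\"this.value = this.value.replace (/\\D/, \\'\\')\">'";
        findQuoted(php).forEach(System.out::println);
        findQuoted(tricky).forEach(System.out::println);
    }
}
```

On the PHP sample this yields the three strings listed in the question, and the onkeyup input comes back as a single string because \D and \' are treated as escapes.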

Maybe I misunderstand the principle, but that looks rather trivial now that you added the example.
Consider this for instance:
String input = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
String otherInput = "' onkeyup=\"this.value = this.value.replace (/\\D/, \'\')\">'";
// match anything from one single quote up to and including the next one,
// using a reluctant quantifier
Pattern p = Pattern.compile("'.+?'");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group());
}
m = p.matcher(otherInput);
System.out.println();
while (m.find()) {
System.out.println(m.group());
}
Output:
'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='
' onkeyup="this.value = this.value.replace (/\D/, '
')">'
See the Java Pattern documentation for more detailed explanations.

The character classes that match neither backslashes nor quotes shouldn't be followed by a +. Remove the + quantifiers to fix the hang (which was due to catastrophic backtracking).
Also, your original regex wasn't recognizing \D as a valid backslash escape - therefore the string constant in your test input containing \D wasn't being matched. If you make the rules of your regex more liberal to recognize any character immediately following a backslash as part of the string constant, it will behave the way you expect.
"('([^\\\\']|\\\\.)*'|\"([^\\\\\"]|\\\\.)*\")"

You can do it all in one line using split() with the right regex:
String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");
There's a reasonable amount of regex kung fu going on here, so I'll break it down:
The delimiter is wrapped by even/odd quotes, but can not contain the quotes because split() consumes the delimiter, so a look behind (?<=') and look ahead (?=') (which are non-consuming) is used to match the quotes instead of a literal quote in the regex
a reluctant match .*? for characters between the quotes ensures that it stops at the next quote (instead of matching through to the last quote)
I added an alternate match for end of input to the look ahead (?='|$) in case there's no trailing close quote
And saving the best for last, the regex that is key to making this all work is the negative look behind (?<!\\G.) which means "don't match on the end of the previous match" and ensures the next match advances past the end of the previous delimiter, without which you would end up with just the quote characters in your array. \G matches the end of the previous match, but also matches start of input for the first match, so it rather neatly automatically handles not matching on the first quote - thus making the delimiter wrapped in even/odd quote instead of odd/even as it would be otherwise.
To cater for the input's first character not being a quote, you need to strip off the leading characters before splitting - that's why the replaceAll() is needed
Here's some test code using your sample input:
String source = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");
System.out.println(Arrays.toString(array));
Output:
['qty', '<input type="text" maxlength="3" class="qty_text" id=', ' value =']

Match word in String in Java

I'm trying to match Strings that contain the word "#SP" (sans quotes, case insensitive) in Java. However, I'm finding using Regexes very difficult!
Strings I need to match:
"This is a sample #sp string",
"#SP string text...",
"String text #Sp"
Strings I do not want to match:
"Anything with #Spider",
"#Spin #Spoon #SPORK"
Here's what I have so far: http://ideone.com/B7hHkR. Could someone guide me through building my regexp?
I've also tried: "\\w*\\s*#sp\\w*\\s*" to no avail.
Edit: Here's the code from IDEone:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("\\b#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}

(edit: positive lookbehind not needed, only matching is done, not replacement)
You are yet another victim of Java's misnamed regex matching methods.
.matches() unfortunately tries to match the whole input, which contradicts the usual meaning of "regex matching" (a regex can match anywhere in the input). The method you need to use is .find().
This is a braindead API, and unfortunately Java is not the only language having such misguided method names. Python also pleads guilty.
Also, you have the problem that \\b matches only at word boundaries, and # is not a word character. You need to use an alternation that detects either the beginning of input or a whitespace character.
Your code would need to look like this (non fully qualified classes):
Pattern p = Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}

You're doing fine, but the \b in front of the # is misleading. \b is a word boundary, but # is already not a word character (i.e. it isn't in the set [0-9A-Za-z_]). Therefore, the space before the # isn't considered a word boundary. Change to:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(^|\\s)#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
The (^|\s) means: match either ^ OR \s, where ^ means the beginning of your string (e.g. "#SP String"), and \s means a whitespace character.
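To check the pattern against the sample strings from the question, something like the following sketch could be used. SpTagMatcher and containsSpTag are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class SpTagMatcher {
    // "#SP" at the start of input or after whitespace, ending at a word boundary.
    private static final Pattern SP_TAG =
            Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);

    // find() looks for the tag anywhere in the input, unlike matches().
    static boolean containsSpTag(String input) {
        return SP_TAG.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample #sp string",
            "#SP string text...",
            "String text #Sp",
            "Anything with #Spider",
            "#Spin #Spoon #SPORK"
        };
        for (String sample : samples) {
            System.out.println(containsSpTag(sample) + " <- " + sample);
        }
    }
}
```

The trailing \\b is what rejects #Spider and #Spin: after "#Sp" the next character is still a word character, so there is no boundary there.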

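The regular expression "\\w*\\s*#sp\\w*\\s*" will match 0 or more word characters, followed by 0 or more whitespace characters, followed by #sp, followed by 0 or more word characters, followed by 0 or more whitespace characters. Because of the \\w* after #sp, it would also accept longer tags such as #spider. My suggestion is to not use \\s* to break words up in your expression; instead, use \\b after the tag. A \\b directly before the # only matches when a word character precedes it, so anchor the front with the start of input or whitespace:
"(^|\\s)#sp\\b"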

What's wrong with this regex?

I need to match Twitter-Hashtags within an Android-App, but my code doesn't seem to do what it's supposed to.
What I came up with is:
ArrayList<String> tags = new ArrayList<String>(0);
Pattern p = Pattern.compile("\b#[a-z]+", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(tweet); // tweet contains the tweet as a String
while(m.find()){
tags.add(m.group());
}
The variable tweet contains a regular tweet including hashtags - but find() doesn't trigger. So I guess my regular expression is wrong.

Your regex fails because of the \b word boundary anchor. This anchor only matches between a non-word character and a word-character (alphanumeric character). So putting it directly in front of the # causes the regex to fail unless there is an alphanumeric character before the #! Your regex would match a hashtag in foobarfoo#hashtag blahblahblah but not in foobarfoo #hashtag blahblahblah.
Use #\w+ instead, and remember, inside a string, you need to double the backslashes:
Pattern p = Pattern.compile("#\\w+");

Your pattern should be "#(\\w+)" if you are trying to just match the hash tag. Using this and the tweet "retweet pizza to #pizzahut", doing m.group() would give "#pizzahut" and m.group(1) would give "pizzahut".
Edit: Note, the html display is messing with the backslashes for escape, you'll need to have two for the w in your string literal in Java.
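A small sketch of the difference between m.group() and m.group(1) with the "#(\\w+)" pattern, using the tweet from the answer above. The class name TweetTags and the method tagNames are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetTags {
    // Group 0 is the whole hashtag, group 1 is the name without the leading '#'.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // Collects the tag names (without '#') in the order they appear.
    static List<String> tagNames(String tweet) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(tweet);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        Matcher m = TAG.matcher("retweet pizza to #pizzahut");
        if (m.find()) {
            System.out.println(m.group());   // whole match, including '#'
            System.out.println(m.group(1));  // captured name only
        }
    }
}
```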
