Java RegExp - Extracting only numbers from a webpage

I have a webpage converted to a string, and I'm trying to extract three numbers from this line of it:
<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>
From the line above, I can already extract the first '1' using this:
String filePattern = "<td class=\"col_stat\">(.+)</td>";
pattern = Pattern.compile(filePattern);
matcher = pattern.matcher(text);
if(matcher.find()){
String number = matcher.group(1);
System.out.println(number);
}
Now I want to extract the 0 and the last 1, but whenever I try to edit the regular expression above, it just prints the complete webpage to the console. Does anyone have any suggestions?
Thanks

Regex matching is greedy. Try this instead, looking only for (\d+) instead of (.+), which matches everything up to the last </td>:
String text =
"<td class=\"col_stat\">1</td>" +
"<td class=\"col_stat\">0</td>" +
"<td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
String number = matcher.group(1);
System.out.println(number);
}
On a related note, I completely agree with others' suggestions to use a more structured approach to interpreting HTML.

Given that using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), I'd recommend doing this reliably using an HTML parser (e.g. JTidy; although it's an HTML pretty-printer, it also provides a DOM interface to the document).

<td class=\"col_stat\">(.+)</td>
This regex is greedy. If you only need it to match numbers, change it to:
<td class=\"col_stat\">(\\d+?)</td>
I'd rather suggest using XPath for this kind of matching; see Saxon and TagSoup.
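A rough sketch of the XPath route, using only the JDK's built-in javax.xml APIs rather than Saxon or TagSoup. It assumes the fragment is well-formed XML (real-world HTML often is not, which is where a tidying parser would come in); the wrapping <tr> element and the names XPathCells and cellValues are illustrative, not from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCells {
    // Returns the text of every <td class="col_stat"> in a well-formed XML fragment.
    static List<String> cellValues(String wellFormedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            NodeList cells = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//td[@class='col_stat']", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Assumption: callers treat unparseable input as a bad argument.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical wrapper row so the fragment has a single root element.
        String row = "<tr><td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td></tr>";
        System.out.println(cellValues(row));
    }
}
```

The XPath expression selects cells by attribute, so extra attributes or whitespace in the markup should not matter the way they do for a literal regex.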

This is because your matcher is greedy. You need a non-greedy matcher to fix this.
String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String number = matcher.group(1);
System.out.println(number);
}
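To see the difference side by side, here is a small sketch that runs the greedy, lazy and digit-only patterns from the answers above against the same input. The class name GreedyVsLazy and the helper captures are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Collects capture group 1 of every match of regex in text.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        // Greedy: (.+) runs to the LAST </td>, so there is a single long match.
        System.out.println(captures("<td class=\"col_stat\">(.+)</td>", text));
        // Lazy: (.+?) stops at the FIRST </td> after each cell.
        System.out.println(captures("<td class=\"col_stat\">(.+?)</td>", text));
        // Digit-only: (\d+) cannot cross a tag, so greediness is harmless.
        System.out.println(captures("<td class=\"col_stat\">(\\d+)</td>", text));
    }
}
```

The greedy pattern yields one capture spanning all three cells, while the other two yield 1, 0 and 1 separately.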

Try this regular expression:
<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)
This does the following:
search for your start string
capture a run of digits
skip any non-digits
capture a run of digits
skip any non-digits
capture a run of digits
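The steps above can be sketched in Java as a single find() whose three groups hold the three numbers. The names ThreeNumbers and extract are illustrative, and the sketch assumes exactly three cells follow each other as in the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreeNumbers {
    // Start string, then three digit runs separated by non-digit runs.
    private static final Pattern ROW = Pattern.compile(
            "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Returns the three captured numbers, or an empty array if nothing matches.
    static String[] extract(String text) {
        Matcher m = ROW.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String text = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        System.out.println(Arrays.toString(extract(text)));
    }
}
```

This only works because the text between the cells contains no digits; a class name such as col2 would be picked up as a number.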

Related

JAVA regex to find string

I have a string like this:
font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman;
How can I get the value of the color and the value of background-color?
color:#ffffff;
background-color:#ff0000;
I have tried the following code, but the result is not what I expected.
Pattern pattern = Pattern.compile("^.*(color:|background-color:).*;$");
The result will display:
font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;

If you want multiple matches in a string, don't anchor the pattern with ^ and $: once those match, the whole string has been consumed, so nothing is left to match again.
Also, use a lazy quantifier like *?. It stops matching as soon as the part of the pattern after it can match.
This is the regex you should use:
(color:|background-color:)(.*?);
Group 1 is either color: or background-color:, group 2 is the color code.
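A minimal sketch of that regex in use, collecting each property and its value into a map. The class name ColorProps and the method colors are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColorProps {
    private static final Pattern COLOR =
            Pattern.compile("(color:|background-color:)(.*?);");

    // Maps group 1 (the property, including its colon) to group 2 (the value).
    static Map<String, String> colors(String style) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = COLOR.matcher(style);
        while (m.find()) {
            found.put(m.group(1), m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String style = "font-size:36pt;color:#ffffff;"
                + "background-color:#ff0000;font-family:Times New Roman;";
        for (Map.Entry<String, String> e : colors(style).entrySet()) {
            System.out.println(e.getKey() + e.getValue() + ";");
        }
    }
}
```

One caveat: because the pattern is unanchored, the color: alternative would also match the tail of a property such as border-color:, so a stricter pattern may be needed for arbitrary CSS.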

To do this you can use a lookbehind, (?<=abc), in the regex. It requires the preceding text to match but doesn't include it in the result. After that you can simply select the hex code, like this:
String s = "font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman";
Pattern pattern = Pattern.compile("(?<=color:)#.{6}");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group());
}

Pattern pattern = Pattern.compile("color\\s*:\\s*([^;]+)\\s*;\\s*background-color\\s*:\\s*([^;]+)\\s*;");
Matcher matcher = pattern.matcher("font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;");
if (matcher.find()) {
System.out.println("color:" + matcher.group(1));
System.out.println("background-color:" + matcher.group(2));
}

No need to describe the whole input, only the relevant part(s) that you're looking to extract.
The regex color:(#[\\w\\d]+); does the trick for me:
String input = "font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman;";
String regex = "color:(#[\\w\\d]+);";
Matcher m = Pattern.compile(regex).matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Notice that m.group(1) returns the capturing group, i.e. the part inside the parentheses in the regex. So the regex actually matches the whole color:#ffffff; and color:#ff0000; parts, but only the color code itself is printed.

Use a CSS parser like ph-css
String input = "font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;";
final CSSDeclarationList cssPropertyList =
CSSReaderDeclarationList.readFromString(input, ECSSVersion.CSS30);
System.out.println(cssPropertyList.get(1).getProperty() + " , "
+ cssPropertyList.get(1).getExpressionAsCSSString());
System.out.println(cssPropertyList.get(2).getProperty() + " , "
+ cssPropertyList.get(2).getExpressionAsCSSString());
Prints:
color , #ffffff
background-color , #ff0000
Find more about ph-css on GitHub.

Regex to extract particular strings from response data

As a response I am getting the following string:
String response = "<span class=\"timeTempText\">6:35a</span><span class=\"dividerText\"> | </span><span class=\"timeTempText\">59°F</span>";
From this I have to fetch only 6:35a and 59°F.
Using the substring and indexOf methods I can get the values from the string, but it seems like a lot of code.
Is there an easier way to get them, for example with a regular expression?

Try this:
timeTempText">(.*?)<\/span>
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "timeTempText\">(.*?)<\\/span>";
final String string = "<span class=\"timeTempText\">6:35a</span><span class=\"dividerText\"> | </span><span class=\"timeTempText\">59°F</span>"
+ "asdfasdf asdfasdf timeTempText\">59°F</span> asdfasdf\n";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
/*
1st Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
*/
Capturing Group 1 has the value

You can do it with regexes, but that's a lot of code:
String response = "<span class=\"timeTempText\">6:35a</span><span class=\"dividerText\"> | </span><span class=\"timeTempText\">59°F</span>";
Matcher matcher = Pattern.compile("\"timeTempText\">(.*?)</span>").matcher(response);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Explanation:
You create a Pattern
Then retrieve the Matcher
You loop through the matches with the while(matcher.find()) idiom.
matcher.group(1) returns the 1st group of the pattern, i.e. the text matched between the first pair of parentheses.
Please note that this code is very brittle. You're better off with XPath.

How to delete <script>..</script> by regexp in Java?

Now I have this:
String s = "1<script type='text/javascript'>2</script>3<script type='text/javascript'>3</script>5";
Pattern pattern = Pattern.compile("<script.*</script>");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
s = s.replace(matcher.group(), "");
}
System.out.println(s);
The result is
15
But I need
135
In PHP we have the /U modifier, but what should I do in Java? I thought about something like this, but it is incorrect:
Pattern pattern = Pattern.compile("<script[^(script)]*</script>");

<script([^>]*)?>.*?<\/script>
Try this. You need a ? after the quantifier to get a lazy (shortest) match.
See demo.
http://regex101.com/r/kO7lO2/3
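In Java the lazy match can be applied directly with String.replaceAll, without the find/replace loop from the question. This is a sketch, not a robust HTML sanitizer: the (?is) flags (case-insensitive, dot matches newlines) and the names StripScripts and stripScripts are assumptions added for the example.

```java
public class StripScripts {
    // Removes each <script ...>...</script> block individually.
    // .*? is lazy, so it stops at the first closing tag instead of the last.
    static String stripScripts(String html) {
        return html.replaceAll("(?is)<script\\b[^>]*>.*?</script>", "");
    }

    public static void main(String[] args) {
        String s = "1<script type='text/javascript'>2</script>"
                + "3<script type='text/javascript'>3</script>5";
        System.out.println(stripScripts(s)); // prints 135
    }
}
```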

Use replaceAll to replace everything matching the regex below with an empty string:
<script [^>]*>[^<]*</script>

How do I use regex in Java to pull this from html?

I'm trying to pull data from the ESPN box scores, and one of the html files has:
<td style="text-align:left" nowrap>Channing Frye, PF</td>
and I'm only interested in grabbing the name (Channing Frye) and the position (PF)
Right now I've been using Pattern.quote(start) + "(.*?)" + Pattern.quote(end) to grab the text between start and end, but I'm not sure how to grab text that starts with the pattern .../http://espn.go.com/nba/player/_/id/, then contains (any integer)/anyfirst-anylast">, then the name I need (Channing Frye), then </a>, then the position I need (PF), and ends with the pattern </td>.
Thanks!

Here is the pattern:
http://espn.go.com/nba/player/_/id/(\d+)/([\w-]+)">(.*?)</a>,\s*(\w+)</td>
You can use this tool to verify regular expressions: http://www.regexplanet.com/advanced/java/index.html
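A sketch of how that pattern might be used from Java. The sample cell is an assumption: the player id 12345 and the slug channing-frye are hypothetical placeholders following the URL shape described in the question, not values from the real page. The names EspnCell and nameAndPosition are likewise made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EspnCell {
    // Groups: 1 = numeric id, 2 = url slug, 3 = player name, 4 = position.
    private static final Pattern PLAYER = Pattern.compile(
            "http://espn.go.com/nba/player/_/id/(\\d+)/([\\w-]+)\">(.*?)</a>,\\s*(\\w+)</td>");

    // Returns {name, position}, or an empty array when the cell does not match.
    static String[] nameAndPosition(String cell) {
        Matcher m = PLAYER.matcher(cell);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Hypothetical markup: id and slug are placeholders.
        String cell = "<td style=\"text-align:left\" nowrap>"
                + "<a href=\"http://espn.go.com/nba/player/_/id/12345/channing-frye\">"
                + "Channing Frye</a>, PF</td>";
        String[] result = nameAndPosition(cell);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```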

You could use this pattern:
\\/nba\\/player\\/_\\/.*\\\">(.*)<.+>,\\s(.*)<
This will match any link in the HTML that contains /nba/player/:
String re = "\\/nba\\/player\\/_\\/.*\">(.*)<.+>,\\s(.*)<";
String str = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern p = Pattern.compile(re, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
example: http://regex101.com/r/hA3uV0

Use this regex:
[A-Z\sa-z0-9]+(?=</a>)|\w+(?=</td>)

Here is one regex:
. matches any character; .+ matches one or more characters
.* matches zero or more characters
\s matches whitespace
String str = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern pattern = Pattern.compile("<td.+>.*<a.+>(.+)</a>[\\s,]+(.+)</td>");
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}

You can use:
String lString = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern lPattern = Pattern.compile("<td.+><a.+id/\\d+/.+\\-.+>(.+)</a>, (.+)</td>");
Matcher lMatcher = lPattern.matcher(lString);
while(lMatcher.find()) {
System.out.println(lMatcher.group(1));
System.out.println(lMatcher.group(2));
}
This will give you:
Channing Frye
PF

Using Regex in Java based on working php code

I've been using this in PHP:
preg_match_all('|<img src="(.*?)" alt="(.*?)" />|is',$xx, $matches, PREG_SET_ORDER);
where $xx is the entire webpage content as a string, to find all occurrences of matches.
This sets $matches to a two-dimensional array, which I can then loop through with a for statement based on the length of $matches, using for example:
$matches[$i][1], which would be the first (.*?)
$matches[$i][2], which would be the second (.*?)
and so on.
My question is: how can this be replicated in Java?
I've been reading tutorials and blogs on Java regex and have been using Pattern and Matcher, but I can't seem to figure it out. Also, the matcher never finds anything, so my while(matcher.find()) loops have been futile, and it usually throws an error saying no match has been found yet.
This is my Java code for the pattern to be matched:
String pattern = new String(
"<img src=\"(w+)\" alt=\"(w+)\" />");
I've also tried ..
String pattern = new String(
"<img src=\"(.*?)\" alt=\"(.*?)\" />");
and
String pattern = new String(
"<img src=\"(\\w+)\" alt=\"(\\w+)\" />");
No matches are ever found.

The regex you posted worked for me, so perhaps the fault is in how you use it:
String test = "<html>\n<img src=\"quack-quack\" alt=\"hi\" />\n</html>";
// This is exactly the pattern code you posted :
String pattern = new String(
"<img src=\"(.*?)\" alt=\"(.*?)\" />");
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(test);
m.find(); // returns true
See the Java Tutorial for how this should be used.
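A rough Java counterpart to preg_match_all with PREG_SET_ORDER might look like the sketch below: each match becomes one row, with index 0 holding the full match and 1..n the groups. The CASE_INSENSITIVE and DOTALL flags stand in for PHP's i and s modifiers; the names ImgMatches and findAll are made up. As for the "no match found" error, Matcher.group() throws IllegalStateException when it is called before a successful find() or matches(), which may be what the question ran into.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgMatches {
    private static final Pattern IMG = Pattern.compile(
            "<img src=\"(.*?)\" alt=\"(.*?)\" />",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // One String[] per match: [0] = whole match, [1] = src, [2] = alt.
    static List<String[]> findAll(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) { // group() is only valid after find() returned true
            String[] row = new String[m.groupCount() + 1];
            for (int g = 0; g <= m.groupCount(); g++) {
                row[g] = m.group(g);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<html>\n<img src=\"a.png\" alt=\"first\" />\n"
                + "<img src=\"b.png\" alt=\"second\" />\n</html>";
        for (String[] row : findAll(html)) {
            System.out.println(row[1] + " -> " + row[2]);
        }
    }
}
```

Here findAll(html).get(i)[1] plays the role of $matches[$i][1] in the PHP version.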

I'm not an expert on Java, but shouldn't the string escape its double quotes and backslashes?
"<img src=\"(.*?)\" alt=\"(.*?)\" />"
or
"<a\\ href=\"http://www.mysite.com/photoid/(.*?)\"><img\\ src=\"(.*?)\"\\ alt=\"(.*?)\"\\ /></a>"
