Using Regex in Java based on working php code - java

I've been using this in php...
preg_match_all('|<img src="(.*?)" alt="(.*?)" />|is',$xx, $matches, PREG_SET_ORDER);
where $xx is the entire webpage content as a string, to find all occurrences of matches.
This sets $matches to a two dimensional array which I can then loop through with a for statement based on the length of $matches and use for example ..
$matches[$i][1] which would be the first (.*?)
$matches[$i][2] which would be the second (.*?)
and so on....
My question is how can this be replicated in java?
I've been reading tutorials and blogs on Java regex and have been using Pattern and Matcher, but can't seem to figure it out. The matcher never finds anything, so my while(matcher.find()) loops have been futile, and I usually get an error saying that no match has been found yet.
This is my Java code for the pattern to be matched:
String pattern = new String(
"<img src=\"(w+)\" alt=\"(w+)\" />");
I've also tried ..
String pattern = new String(
"<img src=\"(.*?)\" alt=\"(.*?)\" />");
and
String pattern = new String(
"<img src=\"(\\w+)\" alt=\"(\\w+)\" />");
no matches are ever found.

The regex you posted worked for me, so perhaps the fault is in how you use it:
String test = "<html>\n<img src=\"quack-quack\" alt=\"hi\" />\n</html>";
// This is exactly the pattern code you posted :
String pattern = new String(
"<img src=\"(.*?)\" alt=\"(.*?)\" />");
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(test);
m.find(); // returns true
See Java Tutorial on how this should be used.
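As a rough sketch of how the PHP preg_match_all(..., PREG_SET_ORDER) loop could be carried over to Java: each call to find() advances to the next match, and group(1) and group(2) play the role of $matches[$i][1] and $matches[$i][2]. The class name ImgTagMatches, the helper findAll, and the second (upper-case) sample tag are made up for illustration; the CASE_INSENSITIVE and DOTALL flags are used here as stand-ins for the PHP "i" and "s" modifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgTagMatches {
    // CASE_INSENSITIVE and DOTALL stand in for the PHP "i" and "s" modifiers.
    private static final Pattern IMG = Pattern.compile(
            "<img src=\"(.*?)\" alt=\"(.*?)\" />",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: one String[] per match, [0] whole match, [1] src, [2] alt.
    static List<String[]> findAll(String html) {
        List<String[]> matches = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        // group() may only be called after find() has returned true.
        while (m.find()) {
            matches.add(new String[] { m.group(), m.group(1), m.group(2) });
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<html>\n<img src=\"quack-quack\" alt=\"hi\" />\n"
                + "<IMG SRC=\"moo\" ALT=\"there\" />\n</html>";
        for (String[] match : findAll(html)) {
            System.out.println(match[1] + " / " + match[2]);
        }
    }
}
```

Calling group() on a Matcher before find() has succeeded throws an IllegalStateException, which is one possible source of a "no match found" error.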

Not an expert on Java, but shouldn't the strings escape double quotes and escapes?
"<img src=\"(.*?)\" alt=\"(.*?)\" />"
or
"<a\\ href=\"http://www.mysite.com/photoid/(.*?)\"><img\\ src=\"(.*?)\"\\ alt=\"(.*?)\"\\ /></a>"

Related

JAVA regex to find string

I have a string like this:
font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman;
How can I get the value of the color and the value of background-color?
color:#ffffff;
background-color:#ff0000;
I have tried the following code, but the result is not what I expected.
Pattern pattern = Pattern.compile("^.*(color:|background-color:).*;$");
The result will display:
font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;
If you want to have multiple matches in a string, don't anchor the pattern with ^ and $: if those match, the whole string is consumed by a single match, so nothing is left to match again.
Also, use a lazy quantifier like *?. It stops matching as soon as the rest of the pattern can match.
This is the regex you should use:
(color:|background-color:)(.*?);
Group 1 is either color: or background-color:, group 2 is the color code.
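A minimal Java sketch of that regex in use, assuming the style string from the question. The class name ColorProperties and the helper extract are hypothetical; the map keys are the group 1 text including the trailing colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorProperties {
    private static final Pattern COLOR =
            Pattern.compile("(color:|background-color:)(.*?);");

    // Hypothetical helper: maps group 1 (property with colon) to group 2 (value).
    static Map<String, String> extract(String style) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = COLOR.matcher(style);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String style = "font-size:36pt;color:#ffffff;"
                + "background-color:#ff0000;font-family:Times New Roman;";
        // prints {color:=#ffffff, background-color:=#ff0000}
        System.out.println(extract(style));
    }
}
```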
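To do this you can use a lookbehind, (?<=abc), in the regex. It requires abc to appear immediately before the match without making it part of the match. (The (?!abc) form is a negative lookahead: it only asserts that abc does not follow, so it has no effect in front of a #.) After that you can simply select the hex code, like this:
String s = "font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman";
Pattern pattern = Pattern.compile("(?<=color:)#.{6}");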
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group());
}
Pattern pattern = Pattern.compile("color\\s*:\\s*([^;]+)\\s*;\\s*background-color\\s*:\\s*([^;]+)\\s*;");
Matcher matcher = pattern.matcher("font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;");
if (matcher.find()) {
System.out.println("color:" + matcher.group(1));
System.out.println("background-color:" + matcher.group(2));
}
No need to describe the whole input, only the relevant part(s) that you're looking to extract.
The regex color:(#[\\w\\d]+); does the trick for me:
String input = "font-size:36pt;color:#ffffff;background-color:#ff0000;font-family:Times New Roman;";
String regex = "color:(#[\\w\\d]+);";
Matcher m = Pattern.compile(regex).matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Notice that m.group(1) returns the capturing group, i.e. the part inside the parentheses in the regex. So the regex actually matches the whole color:#ffffff; and color:#ff0000; parts, but only the color code itself is printed.
Use a CSS parser like ph-css
String input = "font-size:36pt; color:#ffffff; background-color:#ff0000; font-family:Times New Roman;";
final CSSDeclarationList cssPropertyList =
CSSReaderDeclarationList.readFromString(input, ECSSVersion.CSS30);
System.out.println(cssPropertyList.get(1).getProperty() + " , "
+ cssPropertyList.get(1).getExpressionAsCSSString());
System.out.println(cssPropertyList.get(2).getProperty() + " , "
+ cssPropertyList.get(2).getExpressionAsCSSString());
Prints:
color , #ffffff
background-color , #ff0000
Find more about ph-css on GitHub

Java Regex Multiline issue

I have a String read from a file via apache commons FileUtils.readFileToString, which has the following format:
<!--LOGHEADER[START]/-->
<!--HELP[Manual modification of the header may cause parsing problem!]/-->
<!--LOGGINGVERSION[2.0.7.1006]/-->
<!--NAME[./log/defaultTrace_00.trc]/-->
<!--PATTERN[defaultTrace_00.trc]/-->
<!--FORMATTER[com.sap.tc.logging.ListFormatter]/-->
<!--ENCODING[UTF8]/-->
<!--FILESET[0, 20, 10485760]/-->
<!--PREVIOUSFILE[defaultTrace_00.19.trc]/-->
<!--NEXTFILE[defaultTrace_00.1.trc]/-->
<!--ENGINEVERSION[7.31.3301.368426.20141205114648]/-->
<!--LOGHEADER[END]/-->
#2.0#2015 03 04 11:04:19:687#+0100#Debug#...(few lines to follow)
I am trying to filter out everything between the LOGHEADER[START] and LOGHEADER[END] line. Therefore I created a java regex:
String fileContent = FileUtils.readFileToString(file);
String logheader = "LOGHEADER\\[START\\].*LOGHEADER\\[END\\]";
Pattern p = Pattern.compile(logheader, Pattern.DOTALL);
Matcher m = p.matcher(fileContent);
System.out.println(m.matches());
(DOTALL, since the input spans multiple lines and I want . to cover line breaks as well.)
However this pattern does not match the String. If I try to remove the LOGHEADER\[END\] part of the regex I get a match, that contains the whole String. I don't get why it is not matching for the original RegEx.
Any help is appreciated - thanks a lot!
The important thing to remember about Java's matches() method is that your regular expression must match the entire input string, not just a part of it.
So, you have to use find() this way to capture everything in between <!--LOGHEADER[START]/--> and <!--LOGHEADER[END]/-->:
String logheader = "(?<=LOGHEADER\\[START\\]/-->).*(?=<!--LOGHEADER\\[END\\])";
Pattern p = Pattern.compile(logheader, Pattern.DOTALL);
Matcher m = p.matcher(fileContent);
while(m.find()) {
System.out.println(m.group());
}
Or, to follow the logic you suggest (just using matches()), we need to add ^.* and .*$:
String logheader = "^.*LOGHEADER\\[START\\].*LOGHEADER\\[END\\].*$";
Pattern p = Pattern.compile(logheader, Pattern.DOTALL);
Matcher m = p.matcher(fileContent);
System.out.println(m.matches());
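The difference between matches() and find() can be seen with the question's original pattern. This is a small sketch on a shortened, made-up log string; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // The question's original pattern, unchanged.
    private static final Pattern HEADER = Pattern.compile(
            "LOGHEADER\\[START\\].*LOGHEADER\\[END\\]", Pattern.DOTALL);

    // matches() succeeds only if the pattern covers the whole input.
    static boolean wholeInputMatches(String s) {
        return HEADER.matcher(s).matches();
    }

    // find() succeeds if the pattern occurs anywhere in the input.
    static boolean containsMatch(String s) {
        return HEADER.matcher(s).find();
    }

    public static void main(String[] args) {
        String log = "<!--LOGHEADER[START]/-->\n"
                + "<!--ENCODING[UTF8]/-->\n"
                + "<!--LOGHEADER[END]/-->\n"
                + "#2.0#...";
        System.out.println(wholeInputMatches(log)); // false: input starts with "<!--"
        System.out.println(containsMatch(log));     // true
    }
}
```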
You actually need to use the Pattern and Matcher classes along with the find method. The regex below fetches all the lines that exist between LOGHEADER[START] and LOGHEADER[END].
String s = "<!--LOGHEADER[START]/-->\n" +
"<!--HELP[Manual modification of the header may cause parsing problem!]/-->\n" +
"<!--LOGGINGVERSION[2.0.7.1006]/-->\n" +
"<!--NAME[./log/defaultTrace_00.trc]/-->\n" +
"<!--PATTERN[defaultTrace_00.trc]/-->\n" +
"<!--FORMATTER[com.sap.tc.logging.ListFormatter]/-->\n" +
"<!--ENCODING[UTF8]/-->\n" +
"<!--FILESET[0, 20, 10485760]/-->\n" +
"<!--PREVIOUSFILE[defaultTrace_00.19.trc]/-->\n" +
"<!--NEXTFILE[defaultTrace_00.1.trc]/-->\n" +
"<!--ENGINEVERSION[7.31.3301.368426.20141205114648]/-->\n" +
"<!--LOGHEADER[END]/-->\n" +
"#2.0#2015 03 04 11:04:19:687#+0100#Debug#...(few lines to follow)";
Matcher m = Pattern.compile("(?s)\\bLOGHEADER\\[START\\][^\\n]*\\n(.*?)\\n[^\\n]*\\bLOGHEADER\\[END\\]").matcher(s);
while(m.find())
{
System.out.println(m.group(1));
}
Output:
<!--HELP[Manual modification of the header may cause parsing problem!]/-->
<!--LOGGINGVERSION[2.0.7.1006]/-->
<!--NAME[./log/defaultTrace_00.trc]/-->
<!--PATTERN[defaultTrace_00.trc]/-->
<!--FORMATTER[com.sap.tc.logging.ListFormatter]/-->
<!--ENCODING[UTF8]/-->
<!--FILESET[0, 20, 10485760]/-->
<!--PREVIOUSFILE[defaultTrace_00.19.trc]/-->
<!--NEXTFILE[defaultTrace_00.1.trc]/-->
<!--ENGINEVERSION[7.31.3301.368426.20141205114648]/-->
If you also want to match the LOGHEADER lines, the capturing group is unnecessary.
Matcher m = Pattern.compile("(?s)[^\\n]*\\bLOGHEADER\\[START\\].*?\\bLOGHEADER\\[END\\][^\\n]*").matcher(s);
while(m.find())
{
System.out.println(m.group());
}

How do I use regex in Java to pull this from html?

I'm trying to pull data from the ESPN box scores, and one of the html files has:
<td style="text-align:left" nowrap>Channing Frye, PF</td>
and I'm only interested in grabbing the name (Channing Frye) and the position (PF)
Right now, I've been using Pattern.quote(start) + "(.*?)" + Pattern.quote(end) to grab the text in between start and end. But I'm not sure how to grab text that starts with the pattern .../http://espn.go.com/nba/player/_/id/, then contains (any integer)/anyfirst-anylast">, then the name I need (Channing Frye), then </a>, then the position I need (PF), and ends with the pattern </td>.
Thanks!
Here is the pattern:
http://espn.go.com/nba/player/_/id/(\d+)/([\w-]+)">(.*?)</a>,\s*(\w+)</td>
You can use this tool - http://www.regexplanet.com/advanced/java/index.html for verifying regular expressions.
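A sketch of how this pattern might be used from Java. The sample cell below is made up to fit the structure described in the question: the id 12345 and the slug are placeholders, not real ESPN data. The class name EspnPlayerCell and the helper parse are hypothetical too, and the dots in the host name are escaped here, which the pattern above does not do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EspnPlayerCell {
    // Groups: 1 = numeric id, 2 = URL slug, 3 = player name, 4 = position.
    private static final Pattern PLAYER = Pattern.compile(
            "http://espn\\.go\\.com/nba/player/_/id/(\\d+)/([\\w-]+)\">(.*?)</a>,\\s*(\\w+)</td>");

    // Hypothetical helper: returns {name, position}, or null if the cell does not match.
    static String[] parse(String cell) {
        Matcher m = PLAYER.matcher(cell);
        return m.find() ? new String[] { m.group(3), m.group(4) } : null;
    }

    public static void main(String[] args) {
        // Made-up sample; the id 12345 is a placeholder.
        String cell = "<td style=\"text-align:left\" nowrap>"
                + "<a href=\"http://espn.go.com/nba/player/_/id/12345/channing-frye\">"
                + "Channing Frye</a>, PF</td>";
        String[] player = parse(cell);
        System.out.println(player[0]); // Channing Frye
        System.out.println(player[1]); // PF
    }
}
```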
You could use this pattern:
\\/nba\\/player\\/_\\/.*\\\">(.*)<.+>,\\s(.*)<
This will match any link in the HTML that contains /nba/player/_/:
String re = "\\/nba\\/player\\/_\\/.*\\\">(.*)<.+>,\\s(.*)<";
String str = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern p = Pattern.compile(re, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
example: http://regex101.com/r/hA3uV0
Use this regex:
[A-Z\sa-z0-9]+(?=</a>)|\w+(?=</td>)
Here is one regex:
. is used for any item, .+ is used for any 1+ items
.* means 0 or more items
\s is used for space
String str = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern pattern = Pattern.compile("<td.+>.*<a.+>(.+)</a>[\\s,]+(.+)</td>");
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
You can use :
String lString = "<td style=\"text-align:left\" nowrap>Channing Frye, PF</td>";
Pattern lPattern = Pattern.compile("<td.+><a.+id/\\d+/.+\\-.+>(.+)</a>, (.+)</td>");
Matcher lMatcher = lPattern.matcher(lString);
while(lMatcher.find()) {
System.out.println(lMatcher.group(1));
System.out.println(lMatcher.group(2));
}
This will give you :
Channing Frye
PF

Extracting a pattern from String

I have a random string from which I need to match a certain pattern and parse it out.
My string:
{"sid":"zw9cmv1pzybexi","parentId":null,"time":1373271966311,"color":"#e94d57","userId":"255863","st":"comment","type":"section","cType":"parent"},{},null,null,null,null,{"sid":"zwldv1lx4f7ovx","parentId":"zw9cmv1pzybexi","time":1373347545798,"color":"#774697","userId":"5216907","st":"comment","type":"section","cType":"child"},{},null,null,null,null,null,{"sid":"zw76w68c91mhbs","parentId":"zw9cmv1pzybexi","time":1373356224065,"color":"#774697","userId":"5216907","st":"comment","type":"section","cType":"child"},
From the above I want to parse out (using regex) all the values of the userId attribute. Can anyone help me out on how to do this? It is a random string and not JSON. Can you provide me with a regex solution for this?
Is that a random string? It looks like JSON to me, and if it is I would recommend a JSON parser in preference to a regexp. The right thing to do when faced with a particular language/grammar is to use the corresponding parser, rather than a (potentially) fragile regexp.
To get the user Ids, you can use this pattern:
String input = "{\"sid\":\"zw9cmv1pzybexi\",\"parentId\":null,\"time\":1373271966311,\"color\":\"#e94d57\",\"userId\":\"255863\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"parent\"},{},null,null,null,null,{\"sid\":\"zwldv1lx4f7ovx\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373347545798,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},{},null,null,null,null,null,{\"sid\":\"zw76w68c91mhbs\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373356224065,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},";
Pattern p = Pattern.compile("\"userId\":\"(.*?)\"");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
which outputs:
255863
5216907
5216907
If you want the full string "userId":"xxxx", you can use m.group(); instead of m.group(1);.
Use a JSON parser instead of regex; your code will be much more readable and maintainable.
http://json.org/java/
https://code.google.com/p/json-simple/
As others have already told you, it looks like a JSON string, but if you really want to parse this string on your own, you could use this piece of code:
final Pattern pattern = Pattern.compile("\"userId\":\"(\\d+)\"");
final Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
The matcher will match every "userId":"12345" pattern. matcher.group(1) will return every userId, 12345 in this case (matcher.group() without a parameter returns the entire match, i.e. "userId":"12345").
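To make the difference between group() and group(1) concrete, here is a small sketch on a shortened, made-up input. The class name UserIdGroups and the helper collect are hypothetical; group(0) is the same as group().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdGroups {
    private static final Pattern USER_ID =
            Pattern.compile("\"userId\":\"(\\d+)\"");

    // Hypothetical helper: collects the given group from every match.
    static List<String> collect(String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = USER_ID.matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "{\"userId\":\"255863\",\"st\":\"comment\"},{},null,"
                + "{\"userId\":\"5216907\"},";
        System.out.println(collect(input, 1)); // [255863, 5216907]
        System.out.println(collect(input, 0)); // ["userId":"255863", "userId":"5216907"]
    }
}
```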
Here's the regex-code you're asking for ..
//assign subject
String subject = "{\"sid\":\"zw9cmv1pzybexi\",\"parentId\":null,\"time\":1373271966311,\"color\":\"#e94d57\",\"userId\":\"255863\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"parent\"},{},null,null,null,null,{\"sid\":\"zwldv1lx4f7ovx\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373347545798,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},{},null,null,null,null,null,{\"sid\":\"zw76w68c91mhbs\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373356224065,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},";
//specify pattern and matcher
Pattern pat = Pattern.compile( "userId\":\"(\\d+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL );
Matcher mat = pat.matcher( subject );
//browse all
while ( mat.find() )
{
System.out.println( "result [" + mat.group( 1 ) + "]" );
}
But of course I'd suggest solving this with a JSON parser like
http://json.org/java/
Greetings
Christopher
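It's JSON format, so you can use a JSON parser (this assumes the org.json classes):
// yourString must be a valid JSON array, i.e. wrapped in [ ]
JSONArray array = new JSONArray(yourString);
for (int i = 0; i < array.length(); i++) {
    // optJSONObject returns null for entries that are not objects, such as null
    JSONObject jo = array.optJSONObject(i);
    if (jo != null && jo.has("userId")) {
        String userId = jo.getString("userId");
    }
}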
EDIT : Regex pattern
"userId"[ :]+((?=\[)\[[^]]*\]|(?=\{)\{[^\}]*\}|\"[^"]*\")
Result :
"userId" : "Some user ID (numeric or letters)"

Java RegExp - Extracting only numbers from a webpage

I have a webpage converted to a string and I'm trying to extract three numbers from it from this line.
<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>
From the line above I already have it extracting the first '1' using this
String filePattern = "<td class=\"col_stat\">(.+)</td>";
pattern = Pattern.compile(filePattern);
matcher = pattern.matcher(text);
if(matcher.find()){
String number = matcher.group(1);
System.out.println(number);
}
Now what I want to do is extract the 0 and the last 1 but anytime I try edit the regular expression above it just outputs the complete webpage on the console. Anyone have any suggestions??
Thanks
Regex matching is greedy. Try this instead, looking only for (\d+) instead of (.+), which matches everything up to the last </td>:
String text =
"<td class=\"col_stat\">1</td>" +
"<td class=\"col_stat\">0</td>" +
"<td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
String number = matcher.group(1);
System.out.println(number);
}
On a related note, I completely agree with others' suggestions to use a more structured approach to interpreting HTML.
Given that using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), I'd recommend doing this reliably using an HTML parser (e.g. JTidy; although it's an HTML pretty-printer, it also provides a DOM interface to the document).
<td class=\"col_stat\">(.+)</td>
This regex is greedy. If you wish to make it work with numbers, change it to:
<td class=\"col_stat\">(\\d+?)</td>
I'd rather suggest using XPath for this kind of matching; see Saxon and TagSoup.
This is because your matcher is greedy. You need a non-greedy matcher to fix this.
String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String number = matcher.group(1);
System.out.println(number);
}
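To see what greedy and non-greedy quantifiers actually capture, the two patterns can be compared on the first match only. This is a sketch on the line from the question; the class name GreedyVersusReluctant and the helper firstGroup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusReluctant {
    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        // Greedy: (.+) runs to the end and backtracks to the LAST </td>.
        System.out.println(firstGroup("<td class=\"col_stat\">(.+)</td>", text));
        // Reluctant: (.+?) stops at the FIRST </td>, so this prints 1.
        System.out.println(firstGroup("<td class=\"col_stat\">(.+?)</td>", text));
    }
}
```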
Try this regular expression:
<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)
This does the following:
search for your start string
select a chain of decimals
skip any NON-decimals
select a chain of decimals
skip any NON-decimals
select a chain of decimals
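The steps above could be wired up in Java roughly as follows, with one match yielding all three numbers. The class name ThreeStats and the helper parse are hypothetical, and the sketch assumes the three cells appear back to back as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeStats {
    // Start string, then three digit runs separated by non-digits.
    private static final Pattern ROW = Pattern.compile(
            "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Hypothetical helper: the three numbers, or null if the line does not match.
    static int[] parse(String html) {
        Matcher m = ROW.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        String line = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        int[] stats = parse(line);
        System.out.println(stats[0] + " " + stats[1] + " " + stats[2]); // 1 0 1
    }
}
```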
