Java Regex - Extract link from HTML anchor - java

I have the following code
private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?
Regex drives me crazy sometimes.

You need to delimit your capturing group from the following .*?. There are probably double quotes " around the href, so use those:
<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>
Your regex contains:
([^\s]*?).*?
The ([^\s]*?) says to reluctantly find all non-whitespace characters and save them in a group. But a reluctant *? matches only as much as the rest of the pattern forces it to, and the next part is .*?, which accepts any character. So the group stops at the first possible chance, capturing nothing, and it is the .*? which matches the rest of the URL.
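A minimal sketch of that difference, assuming a made-up one-anchor input. The class and method names (ReluctantGroupDemo, firstGroup) are illustrative and not from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantGroupDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com\">link</a>"; // hypothetical sample

        // Original: the reluctant group is followed directly by .*?,
        // so the group is satisfied with zero characters.
        String original = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";

        // Fixed: the closing quote forces the group to run up to it.
        String fixed = "<\\s*a\\s+.*?href\\s*=\\s*\"(\\S*?)\".*?>";

        System.out.println("[" + firstGroup(original, html) + "]"); // prints []
        System.out.println("[" + firstGroup(fixed, html) + "]");    // prints [http://example.com]
    }
}
```

The brackets are only there to make the empty capture visible in the output.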

The regex you should be using is this:
String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";

This should be able to pull out the href without too much trouble.
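As a rough sketch of how that pattern could be used to collect every link, assuming quoted href values without spaces (HrefExtractor, extractHrefs and the sample HTML are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // The pattern from the answer; CASE_INSENSITIVE also covers <A HREF=...>.
    private static final Pattern ANCHOR = Pattern.compile(
            "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]",
            Pattern.CASE_INSENSITIVE);

    // Collects group 1 (the href value) of every match, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical input: one anchor split across lines, one in upper case.
        String html = "<a\n class=\"nav\" href=\"/home\">Home</a> "
                + "<A HREF='http://example.com/a?b=1'>x</A>";
        System.out.println(extractHrefs(html)); // prints [/home, http://example.com/a?b=1]
    }
}
```

Unquoted href values, or values containing whitespace, are not matched by this pattern.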
The link is in capture group 2. The pattern is shown expanded (free-spacing, so it needs the x flag or Pattern.COMMENTS) and assumes dot-all.
Use Java delimiters as necessary.
(?s)
<a
(?=\s)
(?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1
(?:".*?"|'.*?'|[^>]*?)+
>
Or, not expanded and not relying on dot-all:
<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>
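A sketch of the one-line version written as a Java string literal, assuming a made-up tag whose title attribute contains a > character (QuotedAttrHref and hrefOf are illustrative names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedAttrHref {
    // The non-expanded pattern above, with Java string escaping applied.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a(?=\\s)(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*"
            + "(['\"])([\\s\\S]*?)\\1"
            + "(?:\"[\\s\\S]*?\"|'[\\s\\S]*?'|[^>]*?)+>");

    // Returns the href value (capture group 2), or null if no anchor matches.
    static String hrefOf(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical tag: the quoted title contains a '>' that must be skipped.
        String tag = "<a title=\"a > b\" href='x.html' id=z>text</a>";
        System.out.println(hrefOf(tag)); // prints x.html
    }
}
```

Group 1 holds the opening quote character, which the \1 backreference then requires as the closing quote.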

Related

Java regex only finding one match

I'm using the following regex:
(?<=<((Pswrd>)|([^/]{1,2147483646}?:Pswrd>)))((?s).+?)(?=</(\\1))
And I have the following text to match:
<abc:Pswrd>PASSWORD_ONE</abc:Pswrd>
<Pswrd>PASSWORD_TWO</Pswrd>
I need to match the content of both XML tags, but it only works for the second one.
The output is:
PASSWORD_TWO
And it should be:
PASSWORD_ONE
PASSWORD_TWO
It seems the OR is not working for some reason?
String message = " <abc:Pswrd>PASSWORD_ONE</abc:Pswrd>\n" +
" <Pswrd>PASSWORD_TWO</Pswrd>";
String regex = "(?<=<((Pswrd>)|([^/]{1,2147483646}?:Pswrd>)))((?s).+?)(?=</(\\1))";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(message);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
Thanks
Update: It needs to be the matching group 0.
So in order to match <Pswrd>, <abc:Pswrd> or <something:Pswrd>, the regex would need to look something like <\w*:*Pswrd>. The problem, however, is that the lookbehind does not like non-fixed-width quantifiers, so you can't create a lookbehind that caters for a "dynamic" prefix.
Instead, I would suggest going for something simple, such as:
(?<=Pswrd>)(.*)(?=<\/)
Essentially, here you just look for the last bit of the opening tag (namely "Pswrd>"), then you match anything between that and the closing portion of the tag.
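A small sketch of this suggestion applied to the message from the question, assuming each element sits on its own line as in that sample (PswrdExtractor and extract are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PswrdExtractor {
    // Fixed-width lookbehind on the tail of the opening tag,
    // lookahead on the start of any closing tag.
    private static final Pattern PSWRD = Pattern.compile("(?<=Pswrd>)(.*)(?=</)");

    // Collects group 0 of every match, as the question requires.
    static List<String> extract(String message) {
        List<String> values = new ArrayList<>();
        Matcher m = PSWRD.matcher(message);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String message = " <abc:Pswrd>PASSWORD_ONE</abc:Pswrd>\n"
                + " <Pswrd>PASSWORD_TWO</Pswrd>";
        System.out.println(extract(message)); // prints [PASSWORD_ONE, PASSWORD_TWO]
    }
}
```

Because . does not match a newline by default, the greedy .* stays within one line. If several elements can share a line, a reluctant .*? would be the safer choice. Note that the lookbehind also accepts any other tag name ending in Pswrd.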

Search and Extract a string with a specific keyword from a string

I am processing a tsv file. I have a bunch of urls in one entry and I am looking for a specific url with '.ab.' keyword in it.
This is my data: http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexamplewith.AB.jpg
and I want output to be http://this/is/anexamplewith.ab.jpg
This is what I am using: '^http://.*[.AB.jpg]' but it's giving me the entire string.
What RegEx can I use?
Thank you!
Note that in ^http://.*[.AB.jpg], ^http:// matches http:// at the beginning of the string, and the greedy .* then consumes every character other than a newline up to the end of the line, backtracking only far enough for the character class [.AB.jpg] to match a single one of the characters ., A, B, j, p or g. Since the string ends in g, the whole string is matched.
You can use
http:\/\/(?:(?!http:\/\/).)*\.ab\.(?:(?!http:\/\/).)*(?=$|http)
Regex matches:
http:\/\/ - matches http://
(?:(?!http:\/\/).)* - matches any symbol that is not starting the substring http:// (thus ensuring the shortest window between the first http:// and .ab.)
\.ab\. - literal .ab.
(?:(?!http:\/\/).)* - see above
(?=$|http) - a lookahead that tells the engine to stop in front of end of string ($) or http://.
A Java implementation:
String str = "http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexample.jpg,http://this/is/anexamplewith.AB.jpg";
Pattern ptrn = Pattern.compile("(?i)http://(?:(?!http://).)*\\.ab\\.(?:(?!http://).)*(?=$|http)");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Output of the sample demo program:
http://this/is/anexamplewith.AB.jpg
REPLACEMENT
To replace that match, you just need to use a replaceAll:
str = str.replaceAll("(?i)http://(?:(?!http://).)*\\.ab\\.(?:(?!http://).)*(?=$|http)", "");
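If the entry is always a comma-separated list, a non-regex alternative is to split it and filter the pieces. This is a different technique from the lookahead pattern above, sketched under that assumption; UrlFilter and urlsContaining are illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UrlFilter {
    // Returns every comma-separated URL that contains the keyword, ignoring case.
    static List<String> urlsContaining(String entry, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String url : entry.split(",")) {
            if (url.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String str = "http://this/is/anexample.jpg,http://this/is/anexample.jpg,"
                + "http://this/is/anexamplewith.AB.jpg";
        System.out.println(urlsContaining(str, ".ab.")); // prints [http://this/is/anexamplewith.AB.jpg]
    }
}
```

This keeps the original casing of the URL and would not work for URLs that themselves contain commas.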

How Java regex match for non-subsequence string?

Example:
String url = "http://www.google.com/abcd.jpg";
String url2 = "http://www.google.com/abcd.jpg_xxx.jpg";
I want to match "http://www.google.com/abcd" for both url and url2.
I wrote a regex:
(http://.*\.com/[^(.jpg)]*).jpg
But [^(.jpg)]* doesn't look correct. What should the regex be?
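Escaping the forward slashes is harmless but not required in Java. The real issue is that [^(.jpg)] is a character class: it excludes the individual characters (, ., j, p, g and ), not the substring .jpg. Use this regex: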
^(http:\/\/.+?\.com\/[^.]+)\.jpg
Alternatively, use the reluctant quantifier .*?, which stops at the first ".jpg":
(http:\/\/.*\.com\/.*?)\.jpg.*
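A quick check of the first pattern against both URLs from the question, written here without escaping the slashes, which Java's regex engine accepts either way (JpgPrefixDemo and prefix are illustrative names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JpgPrefixDemo {
    // [^.]+ stops at the first dot after the path, so ".jpg_xxx.jpg" is not captured.
    private static final Pattern PREFIX =
            Pattern.compile("^(http://.+?\\.com/[^.]+)\\.jpg");

    // Returns the part before the first ".jpg", or null if the URL does not match.
    static String prefix(String url) {
        Matcher m = PREFIX.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.google.com/abcd.jpg";
        String url2 = "http://www.google.com/abcd.jpg_xxx.jpg";
        System.out.println(prefix(url));  // prints http://www.google.com/abcd
        System.out.println(prefix(url2)); // prints http://www.google.com/abcd
    }
}
```

find() is used rather than matches() because the pattern deliberately leaves the rest of url2 unmatched.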

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was, after being given the right regex, it had to be taken into account that the String I was feeding to the regex was always going to have two \" sequences (\"MY_HEADER\") that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah, that's a scary number of backslashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>).
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
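A short sketch of that caveat, assuming the line read from the file really contains literal backslashes (HeaderNameDemo and headerName are illustrative names; the second input is the hypothetical Fun!\Yay! case):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderNameDemo {
    // Regex as seen by the engine: <%=Pages\.getString\(\\"([^\\]*)
    private static final Pattern HEADER =
            Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");

    // Returns the text between \" and the next backslash, or null if absent.
    static String headerName(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // In these literals, \\ is one backslash and \" is one quote in the actual text.
        String normal = "<jsp:include page=\"<%=Pages.getString(\\\"MY_HEADER\\\")%>\" flush=\"true\">";
        String withBackslash = "<%=Pages.getString(\\\"Fun!\\Yay!\\\")%>";
        System.out.println(headerName(normal));        // prints MY_HEADER
        System.out.println(headerName(withBackslash)); // prints Fun!
    }
}
```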
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}

How to match the beginning of a String in Java?

I want to match an HTML file:
If the file starts with spaces and then an end tag </sometag>, return true.
Else return false.
I used the "(\\s)*</(\\w)*>.*", but it doesn't match \n </p>\n </blockquote> ....
Thanks to Gabe for the help; he is correct. The . doesn't match \n by default, so I need to turn DOTALL mode on.
To do it, add the (?s) to the beginning of the regex, i.e. (?s)(\\s)*</(\\w)*>.*.
You can also do this:
Pattern p = Pattern.compile("(\\s)*</(\\w)*>");
Matcher m = p.matcher(s);
return m.lookingAt();
It just checks if the string starts with the pattern, rather than checking the whole string matches the pattern.
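A minimal comparison of the three approaches on an input like the one in the question (StartsWithEndTag and startsWithEndTag are illustrative names):

```java
import java.util.regex.Pattern;

public class StartsWithEndTag {
    private static final Pattern END_TAG = Pattern.compile("(\\s)*</(\\w)*>");

    // True when the string begins with optional whitespace followed by an end tag.
    static boolean startsWithEndTag(String s) {
        return END_TAG.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        String s = "\n </p>\n </blockquote> ...";

        // matches() must consume the whole string; without DOTALL the
        // trailing .* cannot cross the second newline.
        System.out.println(s.matches("(\\s)*</(\\w)*>.*"));     // prints false
        System.out.println(s.matches("(?s)(\\s)*</(\\w)*>.*")); // prints true

        // lookingAt() only anchors at the start, so no trailing .* is needed.
        System.out.println(startsWithEndTag(s));                // prints true
        System.out.println(startsWithEndTag("<p>text</p>"));    // prints false
    }
}
```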
