attributes pattern matcher takes a long time


I have a regex to get the src and the remaining attributes for all the images present in the content.
<img *((.|\s)*?) *src *= *['"]([^'"]*)['"] *((.|\s)*?) */*>
If the content I am matching against is like
<img src=src1"/> <img src=src2"/>
the find(index) call hangs, and I see the following in the thread dump:
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
Is there a solution or a workaround for solving this issue?

A workaround is to use an HTML parser such as jsoup, for example:
Document doc =
Jsoup.parse("<html><img src=\"src1\"/> <img src=\"src2\"/></html>");
Elements elements = doc.select("img[src]");
for (Element element: elements) {
System.out.println(element.attr("src"));
System.out.println(element.attr("alt"));
System.out.println(element.attr("height"));
System.out.println(element.attr("width"));
}

It looks like what you've got is an "evil regex", which is not uncommon when you try to construct one complicated regex to match one thing (src) within another thing (img). In particular, evil regexes usually happen when you apply repetition to a complex subexpression, which you are doing with (.|\s)*?. Both alternatives of (.|\s) can match the same whitespace character, so on a failed match the engine may have to try a very large number of ways to split the text between them.
A better approach would be to use two regexes: one to match all <img> tags, and then another to match the src attribute within each tag.
My Java's rusty, so I'll just give you the pseudocode solution:
foreach( imgTag in input.match( /<img .*?>/ig ) ) {
src = imgTag.match( /\bsrc *= *(['\"])(.*?)\1/i );
// if you want to get other attributes, you can do that the same way:
alt = imgTag.match( /\balt *= *(['\"])(.*?)\1/i );
// even better, you can get all the attributes in one go:
attrs = imgTag.match( /\b(\w+) *= *(['\"])(.*?)\2/g );
// attrs is now an array where the first group is the attr name
// (alt, height, width, src, etc.) and the second group is the
// attr value
}
Note the use of a backreference to match the appropriate type of closing quote (i.e., this will match both src='abc' and src="abc"). Also note that the quantifiers are lazy here (*? instead of just *); this is necessary to prevent too much from being consumed.
EDIT: even though my Java's rusty, I was able to crank out an example. Here's the solution in Java:
import java.util.regex.*;
public class Regex {
public static void main( String[] args ) {
String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> <img alt='another image' src=\"foo.jpg\" />";
Pattern attrPat = Pattern.compile( "\\b(\\w+) *= *(['\"])(.*?)\\2" );
Matcher imgMatcher = Pattern.compile( "<img .*?>" ).matcher( input );
while( imgMatcher.find() ) {
String imgTag = imgMatcher.group();
System.out.println( imgTag );
Matcher attrMatcher = attrPat.matcher( imgTag );
while( attrMatcher.find() ) {
String attr = attrMatcher.group(1);
String value = attrMatcher.group(3);
System.out.format( "\tattr: %s, value: %s\n", attr, value );
}
}
}
}
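To check that the two-step approach also copes with the malformed input from the question, here is a minimal sketch. The names ImgSrcExtractor and extractSrcs are made up for illustration, and the tag pattern uses [^>]* rather than .*? (which assumes attribute values never contain a '>' character).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    // Step 1: one whole <img ...> tag. [^>]* cannot overlap with the closing '>',
    // so there is no ambiguous repetition like (.|\s)*? to backtrack through.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 2: a quoted src attribute; \1 requires the same quote that opened the value.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    public static List<String> extractSrcs(String html) {
        List<String> result = new ArrayList<>();
        Matcher img = IMG_TAG.matcher(html);
        while (img.find()) {
            // Search for src only inside the tag that was just matched.
            Matcher src = SRC_ATTR.matcher(img.group());
            if (src.find()) {
                result.add(src.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractSrcs("<img alt='a' src=\"foo.jpg\" /> <img src='bar.png'>"));
        // Malformed input from the question: no properly quoted src, so the list is empty.
        System.out.println(extractSrcs("<img src=src1\"/> <img src=src2\"/>"));
    }
}
```

On the malformed input this returns an empty list instead of hanging, because neither pattern has a repeated group whose alternatives overlap.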

Related

GWT - extract text in between two characters

In GWT I have a servlet that returns an image from the database to the client. I need to extract part of the returned string to show the image properly. What Chrome, Firefox, and IE return has a backslash in the src part (e.g. String s = "src=\"";), which is not visible in the string below. Maybe the backslash is adding more quotes around the http string; I'm not sure.
What those three browsers return is: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
The Edge browser doesn't have the backslash in the src, so my method to extract the image doesn't work in Edge.
What Edge returns:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
Problem: I need to extract the string below.
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
either with src= or src=\
What I tried and works with the browsers that return without the parentheses "src=\":
String s = "src=\"";
int index = returned.indexOf(s) + s.length();
image.setUrl(returned.substring(index, returned.indexOf("\"", index + 1)));
But this fails in Edge because Edge doesn't return a backslash.
I do not have access to Pattern and Matcher in GWT.
How can I extract
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
from the returned string above, keeping in mind that the entityId number will change?
EDIT:
I need a generic way to extract http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
when the string might look either of these two ways:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
The other three browsers: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">

public static void main(String[] args) {
String toParse = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
String delimiter = "src=\"";
int index = toParse.indexOf(delimiter) + delimiter.length();
System.out.println(toParse.substring(index, toParse.length()).split("\"")[0]);
}
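A more generic variant, as a rough sketch: it uses only String methods (indexOf, charAt, substring), so it should not need Pattern or Matcher. The names SrcExtractor, extractSrc and isDelimiter are hypothetical, and treating curly quotes and a stray backslash as delimiters is an assumption based on the two sample strings in the question.

```java
public class SrcExtractor {
    // Characters that may wrap the URL: straight quotes, curly quotes, or a stray backslash.
    private static boolean isDelimiter(char c) {
        return c == '"' || c == '\'' || c == '\\' || c == '\u201C' || c == '\u201D';
    }

    public static String extractSrc(String tag) {
        String key = "src=";
        int start = tag.indexOf(key);
        if (start < 0) {
            return null; // no src attribute found
        }
        start += key.length();
        // Skip whatever opening delimiters are present (none, a quote, or \").
        while (start < tag.length() && isDelimiter(tag.charAt(start))) {
            start++;
        }
        // Read up to the next delimiter, a space, or the end of the tag.
        int end = start;
        while (end < tag.length()) {
            char c = tag.charAt(end);
            if (isDelimiter(c) || c == ' ' || c == '>') {
                break;
            }
            end++;
        }
        return tag.substring(start, end);
    }

    public static void main(String[] args) {
        String others = "<img style=\"-webkit-user-select: none;\" "
                + "src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
        String edge = "<img src=\u201Dhttp://localhost:8080/dashboardmanager/downloadfile?entityId=4886\u201D>";
        System.out.println(extractSrc(others));
        System.out.println(extractSrc(edge));
    }
}
```

Because the value is read up to the next delimiter rather than to a fixed position, a changing entityId does not matter.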

Java regex for google maps url?

I want to parse all Google Maps links inside a String. The formats are as follows:
1st example
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z
https://www.google.com/maps/place//#38.8976763,-77.0387185,17z
https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z
https://www.google.com/maps/place/#38.8976763,-77.0387185,17z
https://google.com/maps/place/#38.8976763,-77.0387185,17z
http://google.com/maps/place/#38.8976763,-77.0387185,17z
https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z
These are all valid Google Maps URLs (linking to the White House).
Here is what I tried:
String gmapLinkRegex = "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/.*)?#(.*z)[^ ]*";
Pattern patternGmapLink = Pattern.compile(gmapLinkRegex , Pattern.CASE_INSENSITIVE);
Matcher m = patternGmapLink.matcher(s);
while (m.find()) {
logger.info("group0 = {}" , m.group(0));
String place = m.group(4);
place = StringUtils.stripEnd(place , "/"); // remove trailing '/'
place = StringUtils.removeStart(place , "place/"); // remove leading "place/" (stripStart would strip any of the characters p, l, a, c, e, /)
logger.info("place = '{}'" , place);
String latLngZ = m.group(5);
logger.info("latLngZ = '{}'" , latLngZ);
}
It works in simple situations, but it is still buggy. For example:
It needs post-processing to grab the optional place information.
It cannot extract two URLs from one line, such as:
s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z " +
" and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
There should be two URLs, but the regex matches the whole line.
The requirements:
The whole URL should be matched in group(0) (including the trailing data part in the 1st example).
In the 1st example, if the zoom level (17z) is removed, it is still a valid Google Maps URL, but my regex cannot match it.
Extracting the optional place info should be easier.
Lat/Lng extraction is a must; the zoom level is optional.
It should be able to parse multiple URLs in one line.
It should be able to process maps.google.com(.xx)/maps. I tried (www|maps\.)? but it still seems buggy.
Any suggestions to improve this regex? Thanks a lot!

The dot-asterisk
.*
will always match anything up to the end of the last URL.
You need "tighter" regexes, which match a single URL but not several URLs with anything in between.
The "[^ ]*" might also include the next URL if it is separated by something other than " ", such as a line break, tab, or shift-space.
I propose (sorry, not tested in Java) using "anything but #", "digit, minus, comma or dot", and "an optional special string followed by a tailored character set, many times":
"(http|https)://(www\.)?google\.com(\.\w*)?/maps/(place/[^#]*)?#([0123456789\.,-]*z)(\/data=[\!:\.\-0123456789abcdefmsx]+)?"
I tested the one above on a Perl-compatible regex engine (np++).
Please adapt it yourself if I guessed anything wrong. The explicit list of digits can probably be replaced by "\d"; I tried to minimise assumptions about the regex flavor.
In order to match "URL" or "URL and URL", store the regex in a variable, then build "(URL and )*URL", replacing "URL" with the regex variable (assuming this is possible in Java). If the question is how to then retrieve the multiple matches: that is Java, and I cannot help. Let me know and I will delete this answer, so as not to provoke deserved downvotes ;-)
(Edited to catch the data part in, previously not seen, first example, first line; and the multi URLs in one line.)
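For the Java side, a minimal sketch of retrieving several matches from one line: the pattern is the one proposed in this answer, only escaped for a Java string literal, and repeated calls to Matcher.find() return the URLs one by one. GmapLinkFinder and findLinks are illustrative names, and like the proposed pattern this sketch does not handle maps.google.com hosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GmapLinkFinder {
    // Groups: 4 = optional "place/..." part, 5 = lat,lng,zoom, 6 = optional data part.
    private static final Pattern GMAP_LINK = Pattern.compile(
            "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/[^#]*)?#([0-9.,-]*z)"
            + "(/data=[!:.\\-0-9abcdefmsx]+)?");

    public static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = GMAP_LINK.matcher(text);
        // Each find() resumes right after the previous match, so no "(URL and )*URL" wrapper is needed.
        while (m.find()) {
            links.add(m.group(0));
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z "
                + " and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
        for (String link : findLinks(s)) {
            System.out.println(link);
        }
    }
}
```

With the two-URL line from the question, this yields two separate matches rather than one match spanning the whole line.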

I wrote this regex to validate google maps links:
"(http:|https:)?\\/\\/(www\\.)?(maps.)?google\\.[a-z.]+\\/maps/?([\\?]|place/*[^#]*)?/*#?(ll=)?(q=)?(([\\?=]?[a-zA-Z]*[+]?)*/?#{0,1})?([0-9]{1,3}\\.[0-9]+(,|&[a-zA-Z]+=)-?[0-9]{1,3}\\.[0-9]+(,?[0-9]+(z|m))?)?(\\/?data=[\\!:\\.\\-0123456789abcdefmsx]+)?"
I tested with the following list of google maps links:
String location1 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location2 = "https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z";
String location3 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location4 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298";
String location5 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z";
String location6 = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location7 = "https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location8 = "https://www.google.com/maps/place/#38.8976763,-77.0387185,17z";
String location9 = "https://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location10 = "http://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location11 = "https://www.google.com/maps/place/#/data=!4m2!3m1!1s0x3135abf74b040853:0x6ff9dfeb960ec979";
String location12 = "https://maps.google.com/maps?q=New+York,+NY,+USA&hl=no&sll=19.808054,-63.720703&sspn=54.337928,93.076172&oq=n&hnear=New+York&t=m&z=10";
String location13 = "https://www.google.com/maps";
String location14 = "https://www.google.fr/maps";
String location15 = "https://google.fr/maps";
String location16 = "http://google.fr/maps";
String location17 = "https://www.google.de/maps";
String location18 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location19 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location20 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location21 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location22 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location23 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location24 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location25 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location26 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location27 = "http://google.com/maps/bylatlng?lat=21.01196022&lng=105.86298748";
String location28 = "https://www.google.com/maps/place/C%C3%B4ng+vi%C3%AAn+Th%E1%BB%91ng+Nh%E1%BA%A5t,+354A+%C4%90%C6%B0%E1%BB%9Dng+L%C3%AA+Du%E1%BA%A9n,+L%C3%AA+%C4%90%E1%BA%A1i+H%C3%A0nh,+%C4%90%E1%BB%91ng+%C4%90a,+H%C3%A0+N%E1%BB%99i+100000,+Vi%E1%BB%87t+Nam/#21.0121535,105.8443773,13z/data=!4m2!3m1!1s0x3135ab8ee6df247f:0xe6183d662696d2e9";

RSS Feed - Parse/Extract src image tag inside Description tag in JAVA

This extends the question
How to extract an image src from RSS feed
to Java. That question already has an answer for iOS, but there are not enough solutions for doing the same in Java.
I know how to parse a direct tag in an RSS feed, but parsing a tag nested inside another tag, as below, is quite complicated:
<description>
<![CDATA[
<img width="745" height="410" src="http://example.com/image.png" class="attachment-large wp-post-image" alt="alt tag" style="margin-bottom: 15px;" />description text
]]>
</description>
How do I extract the src attribute alone?

Take a look at jsoup. I think it's what you need.
EDIT:
private String extractImageUrl(String description) {
Document document = Jsoup.parse(description);
Elements imgs = document.select("img");
for (Element img : imgs) {
if (img.hasAttr("src")) {
return img.attr("src");
}
}
// no image URL
return "";
}

You could try using a regular expression to get the value.
Take a look at this little example; I hope it helps you.
You can find more info about regular expressions here:
http://www.tutorialspoint.com/java/java_regular_expressions.htm
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test{
public static void main(String []args){
String regularExpression = "src=\"(.*)\" class";
String html = "<description> <![CDATA[ <img width=\"745\" height=\"410\" src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" alt=\"alt tag\" style=\"margin-bottom: 15px;\" />description text ]]> </description>";
// Create a Pattern object
Pattern pattern = Pattern.compile(regularExpression);
// Now create matcher object.
Matcher matcher = pattern.matcher(html);
if (matcher.find( )) {
System.out.println("Found value: " + matcher.group(1) );
// Prints: Found value: http://example.com/image.png
}
}
}
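The pattern above only works when class happens to follow src. A slightly more tolerant sketch, assuming only that the src value is wrapped in matching single or double quotes (SrcFromDescription and firstSrc are illustrative names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SrcFromDescription {
    // Group 1 is the opening quote, group 2 the value; \1 requires the same closing quote.
    private static final Pattern SRC = Pattern.compile("\\bsrc\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the first src value found, or an empty string when there is none.
    public static String firstSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(2) : "";
    }

    public static void main(String[] args) {
        String html = "<description> <![CDATA[ <img width=\"745\" height=\"410\" "
                + "src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" "
                + "alt=\"alt tag\" style=\"margin-bottom: 15px;\" />description text ]]> </description>";
        System.out.println(firstSrc(html));
    }
}
```

The lazy (.*?) stops at the first matching quote, so the order of the remaining attributes no longer matters.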

How can I adjust this regex to filter out "

I got the following regex working to search for video links in a page
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately it does not stop at the end of the link if there is another match right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after the regex is applied, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!

Here is how you can do something similar with JSoup parser.
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of a page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]");//find all anchors with href attribute
for (Element el : elements) {
URL url = new URL(el.attr("href"));
if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
System.out.println("url: " + url);
//System.out.println("file: " + url.getPath());
System.out.println("file name: "
+ new File(url.getPath()).getName());
System.out.println("------");
}
}

I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b

If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href" followed by either a double-quote or single-quote character, then capture everything up to (and including) the next character which matches the starting quote character. Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
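Put into Java, this could look roughly like the sketch below. VideoHrefFinder and findVideoHrefs are made-up names, and the middle group is narrowed from (.*?) to ([^"']*?) so that a non-video href cannot stretch across into a later link; that narrowing is my assumption, not part of the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoHrefFinder {
    // Group 1 = opening quote, group 2 = path without extension, group 3 = extension.
    private static final Pattern VIDEO_HREF =
            Pattern.compile("href=(\"|')([^\"']*?)\\.(avi|flv|mp4)\\1");

    public static List<String> findVideoHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = VIDEO_HREF.matcher(html);
        while (matcher.find()) {
            // Re-join path and extension, as described above.
            hrefs.add(matcher.group(2) + "." + matcher.group(3));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://x.example/page.html\">p</a> "
                + "<a href=\"http://somevideo.flv\">somevideoname.avi</a> "
                + "<a href='clip.mp4'>c</a>";
        System.out.println(findVideoHrefs(html));
    }
}
```

The link text somevideoname.avi is not picked up because only quoted href values are matched.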

Your regex is greedy.
Limit its greediness with a lazy quantifier:
(http(s?):/)(/[^/]+?)\\S+.\\.(?:avi|flv|mp4)

Java : replacing text URL with clickable HTML link

I am trying to replace a String containing a URL with one in which the URL is a browser-compatible link.
My initial String looks like this :
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String looking like :
"hello, i'm some text with an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and I need to have an hypertext link !"
I can catch URL with this code line :
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
Maybe the regexp needs some correction, but it's working fine so far; it needs further testing.
So the question is how to keep the expression caught by the regexp and just add what's needed to create the link: <a href="caught string">caught string</a>
Thanks in advance for your interest and responses!

Try:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "<a href=\"$1\">HereWasAnURL</a>");
I didn't check your regex.
By using () you can create groups. $1 refers to a group by its index, so $1 will be replaced with the URL.
I asked a similar question: my question
Some examples: Capturing Text in a Group in a regular expression
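As a runnable sketch of the $1 idea: the regex here is my own adaptation, not the one from the question. It uses \w+ for the scheme so that text before the URL is not swallowed by a leading .*, and \s and \p{Alnum} as the java.util.regex spellings of the POSIX-style [:space:] and [:alnum:]. UrlLinker and linkify are illustrative names.

```java
public class UrlLinker {
    public static String linkify(String text) {
        // Group 1 is the whole URL; the final class keeps trailing punctuation
        // such as '.' or ',' out of the link.
        return text.replaceAll(
                "(\\w+://[^<>\\s]+[\\p{Alnum}/])",
                "<a href=\"$1\">$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify(
                "hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"));
    }
}
```

Note that this does not HTML-escape the surrounding text, so it is only suitable for input that is already safe to render.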

public static String textToHtmlConvertingURLsToLinks(String text) {
if (text == null) {
return text;
}
String escapedText = HtmlUtils.htmlEscape(text);
return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
"$1<a href=\"$2\">$2</a>$4");
}
There may be better regexes out there, but this does the trick as long as there is whitespace after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.

For anybody who is searching a more robust solution I can suggest the Twitter Text Libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)

The code below replaces links starting with "http" or "https", links starting with just "www.", and finally email links.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^#]]+#([\\S&&[^.#]]+)(\\.[\\S&&[^#]]+)");
String textWithHttpLinksEnabled =
"ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";
if (Objects.nonNull(textWithHttpLinksEnabled)) {
Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("$0");
final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("$0");
final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("$0");
System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze#asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda

Assuming your regex captures the correct info, you can use backreferences in your substitution. See the Java regexp tutorial.
In Java's replaceAll, a captured group is referenced in the replacement string as $1 (not \1), so you'd do
myString.replaceAll(....., "$1")

In case of multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is a full example from my code, where I need to show users' posts that contain URLs:
private static final Pattern urlPattern = Pattern.compile(
"(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");
String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1<a href='$2'>$2</a>$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);
