JSoup: Replacing a String adds new lines - java

I have the following issue with JSoup.
I want to parse and modify the following html code:
<code>
<style type="text/css" media="all">
@import url("http://hakkon-aetterni.at/modules/system/system.base.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.menus.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.messages.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.theme.css?ll3lgd");
</style>
</code>
I'm using the following code to achieve that:
Elements cssImports = doc.select("style");
for (Element src : cssImports) {
    String regex = "url\\(\"(.)*\"\\)";
    String data = src.data();
    String link;
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(data);
    while (m.find()) {
        link = m.group().substring(5, m.group().length() - 2);
        doc = Jsoup.parse(doc.html().replace(link, ""));
    }
}
First, it works. All the import URLs are replaced with the string "FOUND". The issue I'm having is that I get a lot of new lines between the last import statement and the closing </style> tag, which were not there before.
Any clues why this is happening and how I can avoid it?
Sorry for the bad formatting, but it seems like some parts of my code are just getting removed on posting. There is a style tag surrounding the first code block...

Well, I landed on this page today looking to do a very similar thing, and I believe that I've solved it. Hopefully someone's still watching this now that it's a month later. ;)
What I found to work well was, instead of doing string replaces and re-parsing the document on every loop iteration, to rebuild the content of the style element. One of the places where JSoup really shines is in how easy its API makes editing a parsed document.
The other trick is to use the data() function. JSoup differentiates between data nodes (e.g. the contents of script and style elements) and html/text nodes. The main difference is that HTML escaping is not applied to data nodes.
Putting all this together, the following code snippet should replace your imported stylesheet refs with your FOUND text, but without changing the formatting of your document:
// Compile the regex once, before entering the loop, as compilation is relatively expensive.
Pattern pattern = Pattern.compile("url\\(\"(.)*\"\\)");
for (Element styleElem : doc.getElementsByTag("style")) {
    String data = styleElem.data();
    StringBuffer newData = new StringBuffer();
    Matcher matcher = pattern.matcher(data);
    while (matcher.find()) {
        matcher.appendReplacement(newData, "FOUND");
    }
    matcher.appendTail(newData);
    // Remove the old data node first; otherwise the rewritten CSS is appended after the original.
    styleElem.empty();
    // 'base' is assumed to be a java.net.URL declared elsewhere (e.g. the document's base URL).
    styleElem.appendChild(new DataNode(newData.toString(), base.toExternalForm()));
}
P.S. I'm assuming that you've turned pretty-printing off. Since your document parsing code isn't displayed, though, make doubly sure to call document.outputSettings().prettyPrint(false); after parsing.
P.P.S. In my own code, I'm using a more tolerant (and slightly uglier) regex to find the imports. It lets the user get away with omitting the URL declaration, quotes, parens, etc...because HTML in the wild tends to do all of those things. I have it declared in my code as follows:
public static final Pattern CSS_IMPORT_PATTERN = Pattern.compile("(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");
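To see what the tolerant pattern picks up, here is a small sketch that runs it over three differently written imports. It assumes the directive is spelled @import, as in standard CSS; the class name, the importTargets helper and the example.com URL are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssImportDemo {
    // The tolerant pattern from the answer, with the directive spelled "@import".
    static final Pattern CSS_IMPORT_PATTERN = Pattern.compile(
            "(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");

    // Hypothetical helper: collects every import target found in a block of CSS.
    static List<String> importTargets(String css) {
        List<String> targets = new ArrayList<>();
        Matcher m = CSS_IMPORT_PATTERN.matcher(css);
        while (m.find()) {
            targets.add(m.group(2)); // group 2 is the bare URL or file name
        }
        return targets;
    }

    public static void main(String[] args) {
        String css = "@import url(\"http://example.com/a.css\");\n"
                + "@import 'b.css';\n"
                + "@import url(c.css);\n";
        System.out.println(importTargets(css));
    }
}
```

Group 1 holds everything up to and including the optional opening quote, and group 3 the terminating character, so the three groups can be glued back together around a replacement URL.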

Related

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
The capturing group fetches whatever sits between two dots (\\.); because the leading .+ is greedy, these are the last two dots of the input.
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex URL. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, let me know and I will adjust the pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way backwards until you find the first dot, then continue until you find the second dot. Then extract the part between the two dots with String#substring.
There is already a built-in method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
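String input = ...
int indexLastDot = input.lastIndexOf('.');
// Start the backward search just before the last dot; otherwise the same dot is found again.
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
// substring takes (beginIndex, endIndex): skip the second-to-last dot, stop before the last one.
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Don't forget bounds checking: if the input contains no dot at all, lastIndexOf returns -1 and substring throws.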
The advantage of this approach is that it is really fast: it scans the characters of the String directly and creates only the final substring, instead of allocating an array of all the parts as split does.
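A bounds-checked version of the lastIndexOf idea could look like the sketch below. The class and method names are made up for illustration, and returning null for an input without any dot is just one possible convention.

```java
final class DomainLabel {
    // Hypothetical helper: returns the text between the last two dots,
    // or null when the input contains no dot at all.
    static String secondToLastLabel(String host) {
        int lastDot = host.lastIndexOf('.');
        if (lastDot < 0) {
            return null; // no dot, so there is no second-to-last word
        }
        // Search backwards starting just before the last dot.
        // The result is -1 when there is only one dot, which makes
        // the substring below start at index 0.
        int previousDot = host.lastIndexOf('.', lastDot - 1);
        return host.substring(previousDot + 1, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(secondToLastLabel("www.table.google.com"));
        System.out.println(secondToLastLabel("google.com"));
        System.out.println(secondToLastLabel("localhost"));
    }
}
```

Like the split variant, this treats a country-code suffix such as co.uk as two ordinary labels, so it is only suitable for simple inputs.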
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
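In Java, the named groups of this pattern can be read with Matcher.group(String). The following is only a sketch: the class and method names are invented for illustration, and the pattern is copied unchanged apart from the doubled backslashes a Java string literal needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainWordDemo {
    // The pattern from above, written as a Java string literal.
    static final Pattern URL_PARTS = Pattern.compile(
            "(?<scheme>https?:\\/\\/)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
            + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=\\/|$)");

    // Hypothetical helper: returns the "domainword" group, or null if nothing matches.
    static String domainWord(String url) {
        Matcher m = URL_PARTS.matcher(url);
        return m.find() ? m.group("domainword") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainWord("http://www.foo.stackoverflow.com"));
        System.out.println(domainWord("foo.www.stackoverflow.co.uk/a/b/c"));
        System.out.println(domainWord("www.table.google.com"));
    }
}
```

Note that the tld alternatives only accept lowercase letters, so mixed-case hosts would need Pattern.CASE_INSENSITIVE or a lowercased input.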
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
        Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);

String sURL = "www.table.google.com";
// One matcher and one find() call are enough; the dots are escaped so they match literally.
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
    String sFirst = matchURL.group(1);  // "table"
    String sSecond = matchURL.group(2); // "com"
}

Regex Lookahead & Lookbehind with Java

I am trying to parse out data from a HTML page using a Java RegEx but have not had much luck. The data is dynamic and often includes zero to many instances of spaces, tabs, new lines. Also, depending on the number of hits the structure of the string I'm parsing may change. Here is a sample in the cleanest format:
<div class="center">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>
However it can also look like this:
<div class="center">Showing 2343098 (search took 1.245 seconds)</div>
or
<div class="center">
Showing 125
of 2,343,098
(search took 1.245 seconds)</div>
What I'm trying to parse is the 2,343,098, but since the page is HTML I have to use either "Showing" or "(search took" as anchors to search between. The spaces, tabs and new lines are tripping me up, and I've been trying to use lookahead & lookbehind, but so far no luck. Here are a few patterns I've tried:
String pattern1 = "Showing [0-9]*\\S"; // not useful
String pattern2 = "[[\\d,+\\.?\\d+]*[\\s*\\n]\\(search took"; //fails
String pattern3 = "(/i)(Showing)(.+?)(\\(search took)"; //fails
String pattern4 = "([\\s\\S]*)\\(search took"; //fails
String pattern5 = "(?s)[\\d].*?(?=\\(search took)"; //close...but fails
Pattern pattern = Pattern.compile(pattern5);
Matcher matcher = pattern.matcher(text); // text = the string I'm parsing
while(matcher.find()) {
System.out.println(matcher.group(0));
}
HTML is not a regular language, and cannot be accurately parsed using regular expressions. Regex-based solutions are likely to break when the format of the markup changes in future, but a parser-based solution will be more accurate.
However, if this is a one-off job, you can get away with the following regex:
Showing\s+(?:\d+\s+of\s+)?([\d,.]+)\s+\(search
Demo
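As a Java string literal every backslash of that regex has to be doubled. The sketch below applies it to the three sample snippets from the question; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HitCountDemo {
    // Same regex as above; \s+ also matches tabs and line breaks,
    // so no DOTALL flag or lookaround is needed.
    static final Pattern HITS = Pattern.compile(
            "Showing\\s+(?:\\d+\\s+of\\s+)?([\\d,.]+)\\s+\\(search");

    // Hypothetical helper: returns the total hit count as text, or null if absent.
    static String totalHits(String html) {
        Matcher m = HITS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(totalHits(
                "<div class=\"center\">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>"));
        System.out.println(totalHits(
                "<div class=\"center\">Showing 2343098 (search took 1.245 seconds)</div>"));
        System.out.println(totalHits(
                "<div class=\"center\">\nShowing 125\nof 2,343,098\n(search took 1.245 seconds)</div>"));
    }
}
```

The optional non-capturing group absorbs the "25 of" part when it is present, so group 1 is always the total.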
The examples suggest
"Showing\\s+\\d+\\s+(of\\s+[\\d,.]+\\s+)?\\(search"

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part, once I had been given the right regex, was taking into account that the String I was feeding to the regex always contains two backslashes, as in (\"MY_HEADER\"), and that they need to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah that's a scary number of back slashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>.)
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
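If it is unclear whether the lines read from the file contain the backslashes or not, one option is to make the backslash optional in the pattern. This is a variation on the regex above, not the poster's exact solution, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JspHeaderDemo {
    // Regex as the engine sees it: <%=Pages\.getString\(\\?"([^\\"]*)
    // The backslash before the quote is optional, and the capture stops
    // at the next backslash or quote.
    static final Pattern HEADER = Pattern.compile(
            "<%=Pages\\.getString\\(\\\\?\"([^\\\\\"]*)");

    // Hypothetical helper: returns the header name, or null if the line has none.
    static String headerName(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Line content with literal backslashes before the quotes.
        String escaped = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
        // Line content without backslashes.
        String plain = "<%=Pages.getString(\"MY_HEADER\")%>";
        System.out.println(headerName(escaped));
        System.out.println(headerName(plain));
    }
}
```

Compiling the pattern once as a constant also avoids recompiling it for every line that is read.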

Extracting an "encompassing" string based on a term within the string

I have a java function to extract a string out of the HTML Page source for any website...The function basically accepts the site name, along with a term to search for. Now, this search term is always contained within javascript tags. What I need to do is to pull the entire javascript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.
You could use a regular expression along the lines of
String someHTML = //get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>",Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String result = myMatcher.group(1);
I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser, that's usually the better solution. That said, for limited scopes, I often use regEx. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^<]+
That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = //get your HTML from wherever
String lKeyword = "rtTop";
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)"+lKeyword +"(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern ,Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String lPreKeyword = myMatcher.group(3);
String lPostKeyword = myMatcher.group(5);
String result = lPreKeyword + lKeyword + lPostKeyword;
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
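A simpler alternative to a single large pattern, sketched here under the same "limited scope" caveat, is to match each script element separately and test its body for the keyword in plain Java. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFinder {
    // Matches one <script ...>...</script> element; DOTALL lets '.' span line breaks,
    // and the lazy .*? stops at the first closing tag.
    static final Pattern SCRIPT = Pattern.compile(
            "<script[^>]*>(.*?)</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the body of the first script element
    // that contains the keyword, or null if there is none.
    static String scriptContaining(String html, String keyword) {
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.contains(keyword)) {
                return body;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script><p>x</p>"
                + "<script type=\"text/javascript\">\nrtTop = Number(new Date());\n</script>";
        System.out.println(scriptContaining(html, "rtTop"));
    }
}
```

Because every match is confined to one element, a keyword in a later script cannot drag an earlier, unrelated script into the result.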

Java Regex Matcher Question

How do I match a URL string like this:
img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"
where only the domain name and the file extension (jpg) are fixed while the other parts vary?
The following code does not seem to work:
Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
String s = m.toString();
}
There were a couple of issues with the regex matching the sample string you gave. You were close, though. Here's your code fixed to make it work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TCPChat {
    static public void main(String[] args) {
        String url = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
        Pattern p = Pattern.compile("<img src=\"http://stackoverflow.com/.*jpg\">");
        // Create a matcher with an input string
        Matcher m = p.matcher(url);
        while (m.find()) {
            String s = m.toString();
            System.out.println(s);
        }
    }
}
First, I would use the group() method to retrieve the matched text, not toString(). But it's probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
Second, I wouldn't assume src was the first attribute in the <img> tag. On SO, for example, it's usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can't match beyond the end of the tag. [^<>]+ will probably suffice.
Third, I would use something more restrictive than .* to match the unknown part of the path. There's always a chance that you'll find two URLs on one line, like this:
<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">
In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?
...and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it's essential that you understand their limitations.
Here's my revised version of your regex (as a Java string literal):
"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"

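For illustration, the revised regex can be used with find() and group(1) as sketched below. The class and method names and the sample markup are made up, and the sketch shares the limitations of regex-based HTML handling described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcDemo {
    // The revised regex: tolerates other attributes before src, whitespace
    // around '=', and single, double or missing quotes.
    static final Pattern IMG_SRC = Pattern.compile(
            "(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)");

    // Hypothetical helper: collects the captured URL of every matching <img> tag.
    static List<String> imageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the URL only, not the whole tag
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img class=\"x\" src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">"
                + " blah <img src='http://stackoverflow.com/e.jpg'>";
        System.out.println(imageUrls(html));
    }
}
```

Since [^<>] cannot cross a tag boundary, two images on one line produce two separate matches instead of one long one.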