Extracting an "encompassing" string based on a term within the string - java

I have a Java function to extract a string out of the HTML page source for any website. The function accepts the site name, along with a term to search for. The search term is always contained within <script> tags. What I need to do is pull out the entire JavaScript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.

You could use a regular expression along the lines of
String someHTML = ...; // get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>", Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
// find() returns false when nothing matches; calling group() after that throws IllegalStateException
String result = myMatcher.find() ? myMatcher.group(1) : null;

I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser; that's usually the better solution. That said, for limited scopes I often use regex. It's a mean beast, though. One change I would make to JacobM's pattern is to replace the attributes within the opening tag with [^>]*
That will allow you to match even if the "type" attribute isn't present or if the tag has some other oddities. I'd also wrap the .*? parts in parentheses to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = ...; // get your HTML from wherever
String lKeyword = "rtTop";
// Pattern.quote keeps any regex metacharacters in the keyword literal; (?!</) stops each half at a closing tag
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)" + Pattern.quote(lKeyword) + "(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern, Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
if (myMatcher.find()) { // group() throws IllegalStateException if nothing matched
    String lPreKeyword = myMatcher.group(3);
    String lPostKeyword = myMatcher.group(5);
    String result = lPreKeyword + lKeyword + lPostKeyword;
}
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
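As an alternative to folding the keyword into one large pattern, the sketch below walks over every script element with a small regex and checks each body with String.contains. This is a swapped-in technique, not the pattern from either answer; the names ScriptFinder and findScriptContaining are made up for illustration, and it shares the usual limits of regex-based HTML handling (a literal closing script tag inside a JavaScript string would cut the match short).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFinder {
    // Matches one <script ...>...</script> element; group 1 is the body.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
        "<script[^>]*>(.*?)</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the body of the first script element containing keyword, or null if there is none. */
    static String findScriptContaining(String html, String keyword) {
        Matcher m = SCRIPT_BLOCK.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.contains(keyword)) {
                return body;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>"
            + "<script type=\"text/javascript\">rtTop = Number(new Date());</script>";
        System.out.println(findScriptContaining(html, "rtTop"));
    }
}
```

Because the keyword is compared with contains rather than spliced into the regex, it needs no escaping.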

Related

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\. matches a literal dot).
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
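In Java, that pattern could be applied roughly as follows. This is only a sketch: the class name SecondToLastLabelRegex and the method extract are invented for the example, and returning null on a failed match is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SecondToLastLabelRegex {
    private static final Pattern LABEL = Pattern.compile(".+\\.(.+)\\..+");

    /** Returns the text between the last two dots, or null when the pattern does not match. */
    static String extract(String host) {
        Matcher m = LABEL.matcher(host);
        // matches() requires the whole input to fit the pattern
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("www.table.google.com")); // google
    }
}
```

An input with only one dot, such as google.com, does not match at all, so the caller has to handle the null.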
But why reinvent the wheel? There are plenty of really good libraries that correctly parse complex URLs. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, let me know and I will adjust the pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
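With the bounds check added, the split variant might look like this. The helper name SecondToLastLabelSplit.extract is hypothetical, and null is an arbitrary choice for "no such element".

```java
final class SecondToLastLabelSplit {
    /** Returns the second-to-last dot-separated element, or null if there are fewer than two. */
    static String extract(String host) {
        String[] elements = host.split("\\.");
        if (elements.length < 2) {
            return null; // nothing before the last element
        }
        return elements[elements.length - 2];
    }

    public static void main(String[] args) {
        System.out.println(extract("www.table.google.com")); // google
    }
}
```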
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way back until you find the first dot, then continue until you find the second dot. Then extract the part between the two dots with input.substring(index1, index2), where index1 is just after the earlier dot and index2 is the position of the later one.
There is also already a delegate method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
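String input = "www.table.google.com"; // the example from the question
int indexLastDot = input.lastIndexOf('.');
// Start the backward search one position before the last dot, otherwise the same dot is found again
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
// substring takes (beginIndex, endIndex); the + 1 skips the dot itself
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
This assumes the input contains at least one dot; without one, substring throws, so don't forget bounds checking.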
The advantage of this approach is that it is really fast: it scans the string directly, without a regex engine and without the intermediate array that split creates.
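A bounds-checked version of the lastIndexOf approach could be sketched like this. The class and method names are invented for the example; treating an input with a single dot as "everything before that dot" is an assumption, not something the question specifies.

```java
final class SecondToLastLabelIndex {
    /** Returns the text between the last two dots, or null if there is no usable dot. */
    static String extract(String host) {
        int lastDot = host.lastIndexOf('.');
        if (lastDot <= 0) {
            return null; // no dot at all, or nothing in front of the only dot
        }
        // Search backwards from just before the last dot; yields -1 if there is no earlier dot
        int previousDot = host.lastIndexOf('.', lastDot - 1);
        return host.substring(previousDot + 1, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(extract("www.table.google.com")); // google
    }
}
```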
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
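Used from Java, the pattern above might be wrapped as follows. This is a sketch with an invented class name (DomainWordExtractor); the only change to the regex is that the forward slashes are left unescaped, which Java allows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainWordExtractor {
    // Same pattern as above, written as a Java string literal with named groups
    private static final Pattern URL_PARTS = Pattern.compile(
        "(?<scheme>https?://)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
        + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=/|$)");

    /** Returns the "domainword" group, or null when the pattern does not match. */
    static String domainWord(String url) {
        Matcher m = URL_PARTS.matcher(url);
        return m.find() ? m.group("domainword") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainWord("http://www.stackoverflow.co.uk"));    // stackoverflow
        System.out.println(domainWord("foo.www.stackoverflow.co.uk/a/b/c")); // stackoverflow
    }
}
```

One caveat: the two-part TLD alternative cannot tell a suffix like .co.uk from a short domain followed by a TLD, so a host such as www.foo.com is reported as www.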
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
    Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
// Create the matcher once; calling find() on two separate matchers only repeats the work
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
    String sFirst = matchURL.group(1);  // "table"
    String sSecond = matchURL.group(2); // "com"
}

Replace &amp; only in links in a partial html document

I've tried a few methods (jsoup shown below) to turn &amp; into & only in links. The difficulty I'm encountering suggests I'm going about this all wrong. I suspect I'll be facepalming when solutions are offered, but maybe good old regex is the best answer (as I need to do the replacing only in hrefs) unless the reader code is modified?
The parsing libraries (also tried NekoHTML) want to convert all &amp;s to &, so I'm having trouble using them to even get the true link hrefs with which to use String's replace method.
Input:
String toParse = "The Link with an encoded ampersand (&) is challenging."
Desired output:
The Link with an encoded ampersand (&) is challenging.
I'm encountering this trying to read an RSS feed that is rendering <link>s with &amp; instead of &.
Update
I ended up using regex to identify the links, then using replace to insert a decoded link in place of the one with &amp;s. Pattern.quote() turned out to be very handy, but I had to manually close and re-open the quoted portions so I could regex-or my ampersand condition:
final String cleanLink = StringUtils.strip(link).replaceAll(" ", "%20").replaceAll("'", "%27");
String regex = Pattern.quote(link);
// end and re-start literal matching around my or condition
regex = regex.replaceAll("&", "\\\\E(&amp;|&)\\\\Q");
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
int index = result.indexOf(matcher.group());
while (index != -1) {
// this replaces the links with &amp; with the same links with &
// because cleanLink is from the DOM and has been properly decoded
result.replace(index, index + matcher.group().length(), cleanLink);
index += cleanLink.length();
index = result.indexOf(matcher.group(), index);
linkReplaced = true;
}
}
I'm not thrilled with this approach, but I had to handle too many conditions myself without using a DOM tool to identify links.
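For comparison, a smaller regex-only sketch of the same goal is shown below: it looks for quoted href values and decodes &amp; only inside them, leaving the rest of the text alone. It is not the code from the update above; the class name HrefAmpersandFixer is made up, and it assumes every href value is quoted and sits on one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefAmpersandFixer {
    // group 1: href= plus opening quote, group 2: the quote, group 3: the value, group 4: closing quote
    private static final Pattern HREF = Pattern.compile(
        "(href\\s*=\\s*([\"']))(.*?)(\\2)", Pattern.CASE_INSENSITIVE);

    /** Replaces &amp; with & inside href attribute values only. */
    static String decodeAmpersandsInHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + m.group(3).replace("&amp;", "&") + m.group(4);
            // quoteReplacement keeps any $ or \ in the URL from being read as a group reference
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeAmpersandsInHrefs(
            "<a href=\"http://x.test/?a=1&amp;b=2\">A &amp; B</a>"));
    }
}
```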
Have a look at the StringEscapeUtils. Try unescapeHtml() on your String.

How to change the width and height of an html file using java

I want to change width="xyz", where xyz can be any value, to width="300". I read up on regular expressions, and this is what I am using:
String holder = "width=\"340\"";
String replacer="width=\"[0-9]*\"";
theWeb.replaceAll(replacer,holder);
where theWeb is the string. But the value is not being replaced. Any help would be appreciated.
Your regex is correct. One thing you might be forgetting is that Java strings are immutable: string methods never modify the current string, they only return a new string with the transformation applied. Try this instead:
String replacement = "width=\"340\"";
String regex = "width=\"[0-9]*\"";
String newWeb = theWeb.replaceAll(regex, replacement); // newWeb holds the new text; theWeb is unchanged
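The same point as a self-contained sketch, with the target width passed in as a parameter. The class and method names (WidthReplacer, setWidth) are invented for the example.

```java
final class WidthReplacer {
    /** Returns a copy of html in which every width="<digits>" attribute is set to the given width. */
    static String setWidth(String html, int width) {
        // replaceAll returns a new String; the html argument itself is never modified
        return html.replaceAll("width=\"[0-9]*\"", "width=\"" + width + "\"");
    }

    public static void main(String[] args) {
        String theWeb = "<img width=\"120\" height=\"80\">";
        String newWeb = setWidth(theWeb, 300);
        System.out.println(theWeb); // still the original text
        System.out.println(newWeb);
    }
}
```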
A better option is to use jsoup for manipulating and extracting data from HTML.
See this link for more details:
http://jsoup.org/

JSoup: Replacing a String adds new lines

I have the following issue with JSoup.
I want to parse and modify the following html code:
<code>
<style type="text/css" media="all">
@import url("http://hakkon-aetterni.at/modules/system/system.base.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.menus.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.messages.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.theme.css?ll3lgd");
</style>
</code>
I'm using the following code to achieve that:
Elements cssImports= doc.select("style");
for (Element src : cssImports) {
String regex ="url\\(\"(.)*\"\\)";
String data =src.data();
String link;
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find()){
link=m.group().substring(5,m.group().length()-2);
doc=Jsoup.parse(doc.html().replace(link, "FOUND"));
}
}
First, it works. All the import URLs are replaced with the string "FOUND". The issue I'm having is that I get a lot of new lines between the last import statement and the closing </style> tag which were not there before.
Any clues why this is happening and how I can avoid it?
Sorry for the bad formatting, but it seems like some parts of my code are just getting removed on posting. There is a style tag surrounding the first code block...
Well, I landed on this page today looking to do a very similar thing, and I believe that I've solved it. Hopefully someone's still watching this now that it's a month later. ;)
What I found to work well was, instead of doing string replaces and re-parsing the document on every loop, to rebuild the content of the style element. One of the places where JSoup really shines is in how easy its API makes editing a parsed document.
The other trick is to use the data() function. JSoup differentiates between data (e.g. script and style) and html/text nodes. The main difference is that HTML escaping is not applied to data nodes.
Putting all this together, the following code snippet should replace your imported stylesheet refs with your FOUND text but without changing the formatting of your document:
// compile the regex before entering the loop, as it's a relatively expensive operation
Pattern pattern = Pattern.compile("url\\(\"(.)*\"\\)");
for (Element styleElem : doc.getElementsByTag("style")) {
    String data = styleElem.data();
    StringBuffer newData = new StringBuffer();
    Matcher matcher = pattern.matcher(data);
    while (matcher.find()) {
        matcher.appendReplacement(newData, "FOUND");
    }
    matcher.appendTail(newData);
    // remove the old data node first, otherwise the element ends up holding both the old and the new content
    styleElem.empty();
    // base is not shown in this snippet; it is assumed to be the page's java.net.URL
    styleElem.appendChild(new DataNode(newData.toString(), base.toExternalForm()));
}
P.S. I'm assuming that you've turned pretty-printing off. Since your document parsing code isn't displayed, though, make doubly sure to call document.outputSettings().prettyPrint(false); after parsing.
P.P.S. In my own code, I'm using a more tolerant (and slightly uglier) regex to find the imports. It lets the user get away with omitting the URL declaration, quotes, parens, etc...because HTML in the wild tends to do all of those things. I have it declared in my code as follows:
public static final Pattern CSS_IMPORT_PATTERN = Pattern.compile("(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");
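To show how the three groups of such a tolerant pattern can be used, here is a sketch that rewrites only the import target (group 2) and keeps the surrounding syntax. It uses plain java.util.regex and no jsoup calls; the pattern is written with @import, the standard CSS at-rule, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssImportRewriter {
    // group 1: everything up to the target, group 2: the target, group 3: the terminating character
    private static final Pattern CSS_IMPORT_PATTERN = Pattern.compile(
        "(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");

    /** Replaces each import target with the given text, leaving the rest of the CSS untouched. */
    static String replaceImportTargets(String css, String replacement) {
        Matcher m = CSS_IMPORT_PATTERN.matcher(css);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rebuilt = m.group(1) + replacement + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String css = "@import url(\"http://a.test/x.css?v=1\");\n@import 'y.css';\n";
        System.out.print(replaceImportTargets(css, "FOUND"));
    }
}
```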

Java Regex Matcher Question

How do I match a URL string like this:
img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"
where only the domain name and the file extension (jpg) are fixed while the rest varies?
The following code does not seem to work:
Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
String s = m.toString();
}
There were a couple of issues with the regex matching the sample string you gave. You were close, though. Here's your code fixed to make it work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TCPChat {
static public void main(String[] args) {
String url = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
Pattern p = Pattern.compile("<img src=\"http://stackoverflow.com/.*jpg\">");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
String s = m.group(); // group() returns the matched text; toString() only describes the Matcher
System.out.println(s);
}
}
}
First, I would use the group() method to retrieve the matched text, not toString(). But it's probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
Second, I wouldn't assume src was the first attribute in the <img> tag. On SO, for example, it's usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can't match beyond the end of the tag. [^<>]+ will probably suffice.
Third, I would use something more restrictive than .* to match the unknown part of the path. There's always a chance that you'll find two URLs on one line, like this:
<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">
In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?
...and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it's essential that you understand their limitations.
Here's my revised version of your regex (as a Java string literal):
"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"

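A short sketch of the revised regex in use, collecting group(1) for every match. The class name ImgSrcExtractor and the sample markup are made up for the example; the pattern itself is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcExtractor {
    private static final Pattern IMG_SRC = Pattern.compile(
        "(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)");

    /** Returns the captured .jpg URL of every matching img tag, in document order. */
    static List<String> jpgUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the URL alone, without the tag or quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img class=\"a\" src=\"http://stackoverflow.com/a/foo.jpg\"> blah "
            + "<img src=\"http://stackoverflow.com/b/bar.jpg\">";
        System.out.println(jpgUrls(html));
    }
}
```

Because [^<>] cannot cross a tag boundary, the two images on one line come back as two separate matches.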