Replace all URLs in an HTML page - Java

I'm crawling some HTML files with crawler4j and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links with this code:
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
However, a simple foreach loop with search & replace won't get me what I want. The problem is that WebURL.getURL() returns the absolute URL, but in the page source the links are sometimes relative and sometimes absolute.
I want to handle all links (Images, URLs, JavaScript files, etc.). For instance I want to replace images/img.gif with view.php?url=http://www.domain.com/images/img.gif.
The only solution that comes to mind is a somewhat complicated regex, but I'm afraid I'm going to miss some rare cases. Has this been done already? Is there a library or some tool to achieve this?
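For the relative-versus-absolute part specifically, java.net.URI can normalize each link against the page URL before rewriting. A minimal sketch (the URLs here are made up):

import java.net.URI;

public class ResolveDemo {
    public static void main(String[] args) {
        // URL of the page the link was found on
        URI base = URI.create("http://www.domain.com/gallery/index.html");
        // resolve() absolutizes relative links and passes absolute ones through
        System.out.println(base.resolve("images/img.gif"));
        // -> http://www.domain.com/gallery/images/img.gif
        System.out.println(base.resolve("http://cdn.example.com/app.js"));
        // -> http://cdn.example.com/app.js
    }
}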

I think you can make use of a regular expression for this:
For example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
// `text` holds the page HTML; `prefix` is whatever you want to prepend
String regex = "\\/[^.]*\\/[^.]*\\.";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    String imageLink = matcher.group();
    // appendReplacement rewrites each match in place, so a link that occurs
    // more than once is not prefixed twice (as a plain String.replace would be)
    matcher.appendReplacement(sb, Matcher.quoteReplacement(prefix + imageLink));
}
matcher.appendTail(sb);
text = sb.toString();

Does it have to be a Java solution? PhantomJS in combination with pjscrape can scrape a page and find all of its URLs.
You just have to create a configuration JavaScript file.
getlinks.js:
pjs.addSuite({
    url: 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html',
    noConflict: true,
    scraper: function() {
        var links = _pjs.$('a').map(function() {
            // convert relative URLs to absolute
            var link = _pjs.toFullUrl($(this).attr('href'));
            return link;
        });
        return links.toArray();
    }
});
pjs.config({
    // options: 'stdout' or 'file' (set in config.outFile)
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
Then run the command phantomjs pjscrape.js getlinks.js. In this example the output is written to a file (it can also be logged to the console).
Here is the (partial) output:
* Suite 0 starting
* Opening http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Scraping http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items

Related

Jsoup check if string is valid HTML

I am having difficulties with the Jsoup parser. How can I tell if a given string is valid HTML code?
String input = "Your vote was successfully added."
boolean isValid = Jsoup.isValid(input);
// isValid = true
The isValid flag is true because Jsoup first runs the input through HtmlTreeBuilder: if any of the html, head or body tags is missing, it adds them itself. Then it uses the Cleaner class and checks the result against the given Whitelist.
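For illustration, a small sketch of that whitelist step; it shows the check is about allowed elements rather than well-formedness (Whitelist.relaxed() is an arbitrary choice here):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class IsValidDemo {
    public static void main(String[] args) {
        // The tree builder repairs the unclosed <p> before the Cleaner runs,
        // so this should still count as valid
        System.out.println(Jsoup.isValid("<p>unclosed", Whitelist.relaxed()));
        // <script> is not in the relaxed whitelist, so this fails even though
        // it is syntactically fine HTML
        System.out.println(Jsoup.isValid("<script>hi()</script>", Whitelist.relaxed()));
    }
}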
Is there any simple way to check whether a string is valid HTML, without Jsoup attempting to make it HTML?
My case is an AJAX response that comes with a "text/html" content type. It then goes to the parser, Jsoup adds these tags, and as a result the response is not displayed properly.
Thanks for your help.
First of all, the solution proposed by Reuben does not work as expected. The pattern has to be compiled with the Pattern.DOTALL flag, because the input HTML may (and probably will) contain newlines etc.
So it should be something like this:
Pattern htmlPattern = Pattern.compile(".*\\<[^>]+>.*", Pattern.DOTALL);
boolean isHTML = htmlPattern.matcher(input).matches();
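A quick demonstration of why DOTALL matters here (the input string is made up):

String multiline = "first line\n<div>second line</div>";
// Without DOTALL, '.' does not match '\n', so matches() fails on multiline input
System.out.println(multiline.matches(".*\\<[^>]+>.*"));         // false
System.out.println(Pattern.compile(".*\\<[^>]+>.*", Pattern.DOTALL)
        .matcher(multiline).matches());                          // true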
I also think the pattern should find any HTML tag, not only <html>. Next, a bare <html> is not the only valid option: the tag may also carry attributes, e.g. <html lang="en">. This has to be handled too.
I chose to modify the Jsoup source. If HtmlTreeBuilder (actually the BeforeHtml state) tries to add an <html> element, I throw a ParseException, and then I can be sure the input was not a valid HTML file.
Use a regex to check whether a String contains HTML or not:
boolean isHTML = input.matches(".*\\<[^>]+>.*");
If your String contains HTML it will return true, e.g. for
String input = "<html><body></body></html>";
But for String input = "Hello World <>"; it will return false, since <> holds no tag name.

Download all images like wget does with Java on client-side

It's perfectly easy to download all images from a website using wget.
But I need this feature on client-side, best would be in Java.
I know wget's source can be accessed online, but I don't know any C and the source is quite complex. Of course, wget also has other features which "blow up the source" for me.
Java has a built-in HttpClient, but I don't know how sophisticated wget really is. Could you tell me if it is hard to re-implement the "download all images recursively" feature in Java?
How is this done, exactly? Does wget fetch the HTML source code of the given URL, extract all URLs with the given file endings (.jpg, .png) from the HTML, and download them? Does it also search for images in the stylesheets linked from that HTML document?
How would you do this? Would you use regular expressions to search for (both relative and absolute) image URLs within the HTML document and let HttpClient download each of them? Or is there already some Java library that does something similar?
In Java you could use the Jsoup library to parse any web page and extract anything you want.
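For the "download all images" part specifically, a minimal Jsoup sketch (the start page and the "images" output directory are placeholders; this covers <img> tags on one page, not recursion or CSS images):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageGrabber {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get(); // placeholder URL
        Files.createDirectories(Paths.get("images"));
        for (Element img : doc.select("img[src]")) {
            String src = img.absUrl("src"); // resolves relative src values
            String name = src.substring(src.lastIndexOf('/') + 1);
            if (name.isEmpty()) continue;
            try (InputStream in = new URL(src).openStream()) {
                Files.copy(in, Paths.get("images", name), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}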
For me, crawler4j was the open-source library of choice to recursively crawl (and replicate) a site, e.g. like this (their QuickStart example). It also supports crawling URLs found in CSS:
public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
            + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
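To actually run it, the same QuickStart wires the crawler up with a controller, roughly like this (the storage folder and thread count below are placeholders):

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data goes here

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

controller.addSeed("http://www.ics.uci.edu/");
controller.start(MyCrawler.class, 7); // 7 concurrent crawler threads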
More webcrawlers and HTML parsers can be found here.
Found this program which downloads images. It is open source.
You could get the images in a website using the <IMG> tags. Look into the following question. It might help you.
Get all Images from WebPage Program | Java

How do I manipulate strings with regex?

I'm fairly new to Java and I'm trying to get a part of a string:
Say I have a URL and I want a specific part of it, such as a filename:
String url = "http://example.com/filename02563.zip";
The 02563 will be generated at random every time, and it's not always 5 characters long.
I want Java to find what's between "m/" (from ".com/") and the end of the line, to get the filename alone.
Now consider this example:
Say I have an html file that I want a snippet extracted from. Below would be the extracted example:
<applet name=someApplet id=game width="100%" height="100%" archive=someJarFile0456799.jar code=classInsideAJarFile.class mayscript>
I want to extract the jar filename, so I want to get the text between "ve=" and ".jar". The extension will always be ".jar", so including this is not important.
How would I do this? If possible, could you comment the code so I understand what's happening?
Use the Java URI class, which gives you access to the individual elements:
URI uri = new URI("http://example.com/filename02563.zip");
String filename = uri.getPath(); // "/filename02563.zip" (note the leading slash)
Granted, this will need a little more work if the resource no longer resides in the root path.
You can use the lastIndexOf() and substring() methods from the String class to extract a specific piece of a String:
String url = "http://example.com/filename02563.zip";
String filename = url.substring(url.lastIndexOf("/") + 1); //+1 skips ahead of the '/'
You have answers for your first question, so this is for the second one. Normally I would use an XML parser, but your example is not a valid XML file, so this will be solved with regex (as you wanted).
String url = "<applet name=someApplet id=game width=\"100%\" height=\"100%\" archive=someJarFile0456799.jar code=classInsideAJarFile.class mayscript>";
Pattern pattern= Pattern.compile("(?<=archive=).*?(?= )");
Matcher m=pattern.matcher(url);
if(m.find())
System.out.println(m.group());
output:
someJarFile0456799.jar

How to read content of a web page through a Java program?

I am planning to write a Java program to read some exchange rates from a web site (http://www.doviz.com), and I was wondering what the best approach is to read only the content I need (or to read the whole page and strip out the parts needed).
Any help is appreciated.
My advice is to use the Jsoup library.
It's very easy to parse external content with a CSS/jQuery-like syntax:
// Only one line to parse an external content
Document doc = Jsoup.connect("http://jsoup.org").get();

// "JavaScript-like" syntax
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// "jQuery/CSS-like" syntax
Elements resultLinks = doc.select("h3.r > a");
Elements pngs = doc.select("img[src$=.png]");

Just add the jsoup.jar library to your classpath and enjoy!
Open source and free to use, of course.
I'd suggest implementing an RSS-reading mechanism for the webpage (programmatically) and extracting the content of the RSS XML using standard parsers, along the lines of the sketch below.
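If the site exposes a feed, a bare-bones sketch with the JDK's built-in XML parser (the feed URL is hypothetical; doviz.com may or may not offer one):

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssReader {
    public static void main(String[] args) throws Exception {
        URL feed = new URL("http://www.doviz.com/rss"); // hypothetical feed URL
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(feed.openStream());
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            System.out.println(title);
        }
    }
}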

Extracting an "encompassing" string based on a term within the string

I have a Java function to extract a string out of the HTML page source of any website. The function accepts the site name along with a term to search for. Now, this search term is always contained within script tags. What I need to do is pull the entire JavaScript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.
You could use a regular expression along the lines of
String someHTML = //get your HTML from wherever
Pattern pattern = Pattern.compile(
        "<script type=\"text/javascript\">(.*?rtTop.*?)</script>", Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
if (myMatcher.find()) {
    String result = myMatcher.group(1); // everything between the script tags
}
I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser, that's usually the better solution. That said, for limited scopes, I often use regEx. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^<]+
That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = //get your HTML from wherever
String lKeyword = "rtTop";
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)" + lKeyword + "(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern, Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
if (myMatcher.find()) {
    String lPreKeyword = myMatcher.group(3);  // script body before the keyword
    String lPostKeyword = myMatcher.group(5); // script body after the keyword
    String result = lPreKeyword + lKeyword + lPostKeyword;
}
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
