jsoup : Absolute path while working with files - java

I have a repository of pages stored as HTML files. I want to process them with jsoup, but when I try to get the absolute paths of all links, jsoup gives me empty strings (""). Is there a way to set the baseUri to a file path?
The suggested solution link.get(i).baseUri + link.get(i).attr("href") is not sufficient for me, because I need some way to recognize whether a link is relative or not.
The jsoup documentation says:
There is a sister method parse(File in, String charsetName) which uses
the file's location as the baseUri. This is useful if you are working
on a filesystem-local site and the relative links it points to are
also on the filesystem.
But it doesn't work on my PC.
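For reference, the overload the docs describe would be called like this (the path is just an example):
// jsoup is supposed to take the base URI from the file's location with this overload.
// (imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File)
Document doc = Jsoup.parse(new File("/pages/index.html"), "UTF-8");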

I'm "solving" the same issue with the following code. While I prefer jsoup functions worked on my local file system, I need something in the meantime. That solution is sending the file locations into the parser as the baseURI then concatenating each relative path to that base. Unfortunately, that means I lose the nesting functionality of HTML's "../" that jsoup normally handles with its built-in functions. Furthermore, I can never be as certain about the results as if the built-in functions worked.
Luckily, I use this mainly for JUnit Testing and it should add minor risks to my production code. The context is that I built a local "Internet" to test crawling offline. I create the JSoup Document by sending a local HTML file to it in my JUnit Test Class:
// From my JUnit Test
String testFileName = "HTMLTest_RelativeReferences.html";
String testFilePath = getClass().getResource(testFileName).getPath();
String testFileBaseURI = testFilePath.replace(testFileName, "");
// ...
// Sends filePath and baseURI to testing class that creates JSoup Doc with:
siteDoc = Jsoup.parse(new File(testFilePath), "UTF-8", testFileBaseURI);
Now that I had created my Document with a baseURI, you and I both thought the relative paths would use that baseURI to build an absolute path. Since that failed, I run a simple check for empty-string abs:href values and concatenate my own URLs.
Elements links = siteDoc.select("a[href]"); // extract link collection
for (Element link : links) { // iterate through links
    String linkString = link.attr("abs:href"); // ftr, neither this nor absUrl("href") works
    if (linkString.isEmpty()) { // check if it returned "" (i.e., the problem at hand)
        URLs.add(siteDoc.baseUri() + link.attr("href")); // concatenate baseURI to relative ref
    } else { // for all the properly returned absolute refs
        URLs.add(link.attr("abs:href"));
    }
}
All my JUnit Tests continue to pass with both absolute and relative local references - good luck!
HTML Doc I used for reference with all 3 links representing other HTML files in the same folder:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>HTML Test using Relative References</title>
</head>
<body>
<a href="...">Link1</a>
<a href="...">Link2</a>
<a href="...">Link3</a>
</body>
</html>
Edit: A little digging into the jsoup library leads me to believe our local file "URLs" will never work, because jsoup resolves actual URLs during its attr("abs:href") processing and will hit MalformedURLException and return "" since we are actually using local file paths rather than true URLs. I consider this outside the scope of the above answer but thought I would mention the discovery.
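Following up on that discovery, here is a sketch of a possible workaround (my own idea, not verified on the setup above): hand jsoup a real file: URL as the baseUri so that the abs:href machinery has a true URL to resolve against.
// (imports: java.io.File, org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element)
File in = new File(testFilePath); // same local test file as above
// File.toURI() on the parent directory yields a file:/ URL with a trailing slash.
Document doc = Jsoup.parse(in, "UTF-8", in.getParentFile().toURI().toString());
for (Element link : doc.select("a[href]")) {
    System.out.println(link.absUrl("href")); // should now print file:/... URLs instead of ""
}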

You can use the absUrl() function on jsoup Elements.
String path = linkEl.absUrl("href");

Related

Static images not displayed with Flying Saucer and Thymeleaf for generated PDF files

I am using Thymeleaf as my template engine to map XHTML to HTML and Flying Saucer to generate a PDF file afterwards.
I have failed to display my static images located at /src/main/resources/ inside my generated PDF file. The file itself is displayed fine; only the images disappear.
Even other locations like /src/main/resources/static or /src/main/resources/public didn't help.
My HTML / XHTML looks like:
<img src="images/logo_black.png"></img>
<img src="/images/logo_black.png"></img>
<img alt="mastercard" th:src="#{classpath:static/images/logo_black.png}" />
<div data-src="images/logo_black.png"></div>
<div data-src="/images/logo_black.png"></div>
<div data-src="#{classpath:static/images/logo_black.png}"></div>
None of them works properly.
The images themselves are visible at localhost:8048/logo_black.png.
I don't want to refer to my images with a full URL (http://...).
You can include resources from any URL (from the Internet or from your file system). Either way, there are several steps involved:
When generating the HTML from the Thymeleaf template, you can use
@{/some/url} to resolve a path relative to your Web context (assuming you have a Web context), or
@{classpath:/some/url}, which will just leave the URL as classpath:/some/url, or
simply a string constant or a value from a variable (${var}); it doesn't matter whether it's an absolute URL like https://some/url or a relative one, Thymeleaf will leave it unchanged in the resulting HTML.
Before you pass the HTML to Flying Saucer, make sure the URLs are correct. Then Flying Saucer will process all URLs with a UserAgentCallback, by default ITextUserAgent.
The relevant methods in UserAgentCallback are resolveURI and setBaseURL.
There is some weird logic going on in the default resolveURI method of ITextUserAgent (inherited from NaiveUserAgent). If the baseURL is null, it will try to set it, so it's best to always set it yourself. I had better results with overriding the resolveURI, the following is enough to keep absolute URLs and resolve relative URLs relative to the baseURL:
@Override
public String resolveURI(String uri) {
    if (URI.create(uri).isAbsolute()) {
        return uri;
    } else {
        return Paths.get(getBaseURL(), uri).toUri().toString();
    }
}
Finally, in order to resolve the classpath: protocol, you need to define a URLStreamHandler, unless there is already one defined (for example, the embedded Tomcat of Spring Boot already supports this).
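For illustration, a rough sketch of such a handler (the class name and the registration snippet are my own; frameworks like Spring Boot may already wire something equivalent):
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public final class ClasspathHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        // Strip a leading slash so the class loader lookup works for both URL forms.
        String path = u.getPath().startsWith("/") ? u.getPath().substring(1) : u.getPath();
        URL resource = Thread.currentThread().getContextClassLoader().getResource(path);
        if (resource == null) {
            throw new IOException("classpath resource not found: " + u);
        }
        return resource.openConnection();
    }
}
Registered once per JVM, for example at application startup:
URL.setURLStreamHandlerFactory(p -> "classpath".equals(p) ? new ClasspathHandler() : null);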
You can render the image with the help of Base64. Just convert your image to Base64 and it will show on your web page as well as in the mobile view. The tag is:
<img th:src="@{data:image/png;base64,your base 64}"/>
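If you go that route, the Base64 payload can be produced in plain Java (the file path is just an example):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

byte[] bytes = Files.readAllBytes(Paths.get("src/main/resources/static/images/logo_black.png")); // throws IOException
String base64 = Base64.getEncoder().encodeToString(bytes);
String src = "data:image/png;base64," + base64; // value to place in the img tag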

What is the safest way to convert scraped URLs to real URLs?

I scrape a website and find these links on a page:
index.html
bla.html
/index.html
A.com/test.html
http://wwww.B.com/bla.html
If I know the current page is www.A.com/some/path, how can I convert these links into "real URLs" effectively? So, in each case, the URLs should translate to:
index.html => http://www.A.com/some/path/index.html
bla.html => http://www.A.com/some/path/bla.html
/index.html => http://www.A.com/index.html
A.com/test.html => http://www.A.com/test.html
http://wwww.B.com/bla.html => http://wwww.B.com/bla.html
What is the most effective way to convert these on-page links to their fully qualified url names?
Use the java.net.URL class:
URL BASE_PATH = new URL("http://www.A.com/some/path");
String RELATIVE_PATH = "index.html";
URL absolute = new URL(BASE_PATH, RELATIVE_PATH);
It will resolve the relative URL against the base path. If the relative URL is actually an absolute URL, it will return it instead.
@Brigham's answer is correct but incomplete.
The problem is that the page you scraped the URLs from could include a <base> element in the <head>. This base URL may be significantly different from the URL that you fetched the page from.
For example:
<!DOCTYPE html>
<html>
<head>
<base href="http://www.example.com/">
...
</head>
<body>
...
</body>
</html>
In the ... sections, any relative URLs will be resolved relative to the base URL rather than the original page URL.
What this means is that if you want to resolve "scraped" URLs correctly in all cases, you also need to look for any <base> elements as you are "scraping".
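A minimal sketch of that combined approach, assuming you captured the <base href> value (if any) while scraping; the helper name is mine:
import java.net.MalformedURLException;
import java.net.URL;

static URL resolve(String pageUrl, String baseHref, String scrapedHref) throws MalformedURLException {
    URL context = new URL(pageUrl);               // the URL the page was fetched from
    if (baseHref != null && !baseHref.isEmpty()) {
        context = new URL(context, baseHref);     // a <base> element overrides the page URL
    }
    return new URL(context, scrapedHref);         // absolute hrefs pass through unchanged
}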

How to present a pretty printed Java source located outside the HTML file?

I am looking at writing a tutorial for a Java concept where it would be really nice if I could write the tutorial as a HTML-document with pretty printed Java sources.
I understand I can do this with e.g. http://code.google.com/p/google-code-prettify/ if I copy the various Java sources in my HTML document where I want them to be and put a styling class on the surrounding tag.
However, in order to ensure that the snippets are up to date I would really like to have the HTML page refer to the actual, real Java source files instead of a manually maintained copy.
To my understanding - which may be wrong - this is not supported directly by the Google Prettyprint library, but perhaps some trickery with Javascript pulling in the file and putting it in the DOM tree inside a <pre> tag could do it? I would like the HTML file to be present in the local file system, so doing server side scripting is not an option.
My question is - how can I do this?
(I intend to have the HTML file physically placed at the root of the source tree. This mean that all references from HTML to Java sources will be relative and without '..'. I do not know if that is important or not.)
There is no way to access files directly using JavaScript. JavaScript is restricted in this way for obvious security reasons.
You will need your web server to serve the Java files. You don't need to do server-side scripting, but the content of your Java files has to be available at some web address. If it is, you can load the content of the Java files with AJAX and insert it into your web page.
Using jQuery, loading the text could be done as follows:
$.get('java/somefile.java', function(data) {
    $('#sourceCodeDestination').html(data);
    // Prettyprint needs to run again in order to see the newly added code
    prettyPrint();
}, "text");
This will load the URL java/somefile.java, get its content as plain text, and insert it into the DOM element with the id sourceCodeDestination. For more information see the jQuery documentation on get() and ajax().
Here is a demo. As you can see, it loads a minified version of the Prettyprint source code from a CDN and pretty-prints it.
If your users can accept the requirement of online access while reading your document, you could host your code somewhere like a gist (https://gist.github.com/) and embed it in your HTML document (see the example: put this into your document: <script src="https://gist.github.com/sangohan/6494440.js"></script>).
Assuming prettify.js has been loaded previously you can invoke the function prettyPrint which takes arguments callback and rootNode.
<div id="foo">
<pre id="bar"></pre>
</div>
var pre = document.getElementById('bar');
pre.textContent = 'function () {\n return;\n}'; // assign code
pre.className = 'prettyprint'; // assign class
prettyPrint(null, document.getElementById('foo')); // prettify
DEMO

Image Extraction from webpage in java

I have just started working on a content-extraction project. First I am trying to get the image URLs in a webpage. In some cases, the "src" attribute of "img" has a relative URL, but I need the complete URL.
I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library to achieve this easily?
If you just need to get the complete URL from a relative one, the solution is simple in Java:
URL pageUrl = base_url_of_the_html_page;
String src = src_attribute_value; //relative or absolute URL
URL imgUrl = new URL(pageUrl, src);
The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document header may specify a different base URL (though it's not used very frequently).
You may use Jsoup or just a DOM parser to obtain the src attribute values and to find a <base> tag if one is present.
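For example, a small sketch with Jsoup, which sets the document's base URI from the connection URL (and also picks up a <base> tag); the page URL here is just a placeholder:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com/page.html").get(); // throws IOException
for (Element img : doc.select("img[src]")) {
    System.out.println(img.absUrl("src")); // absolute image URL, resolved against the base URI
}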

Get url minus current filename in JSP or CQ5

I wish to get the current URL minus the file name that is currently being referenced. Whether the solution is in JSP or CQ5 doesn't matter; however, I am trying to use the latter more to get used to it.
I'm using this documentation but it's not helping. CQ5 Docs.
The example I found retrieves the full current path, but I don't know how to strip the file name from it:
<% Page containingPage = pageManager.getContainingPage(resourceResolver.getResource(currentNode.getPath()));
%>
Profile
Assuming you are accessing the following resource URL:
/content/mywebsite/english/mynode
If your current node is "mynode" and you want the part of the URL without the current node, the simplest way is to call getParent() on currentNode (mynode). You can then get the parent's path like this:
currentNode.getParent().getPath() will give you "/content/mywebsite/english/"
full code :
<% Page containingPage = pageManager.getContainingPage(resourceResolver.getResource(currentNode.getParent().getPath()));
%>
Profile
A much simpler approach.
You can use the currentPage object to get the parent Page.
The code builds the "Profile" link's href from currentPage.getParent().getPath().
In case you are getting an error while using this code, check whether you have included the global.jsp file in the page, i.e. the one shown below.
<%@include file="/libs/foundation/global.jsp"%>
I don't know anything about CQ5, but since getPath() returns an ordinary Java string I expect you could just take the prefix up to the last slash, which for a string s can be done with s.substring(0, s.lastIndexOf('/')+1). If you have to make it into a one-liner, you could do containingPage.getPath().substring(0, containingPage.getPath().lastIndexOf('/')+1).
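For instance (variable names are illustrative):
String path = containingPage.getPath();                     // e.g. "/content/mywebsite/english/mynode"
String base = path.substring(0, path.lastIndexOf('/') + 1); // "/content/mywebsite/english/"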
