I have just started working on a content extraction project. First, I am trying to get the image URLs in a webpage. In some cases, the "src" attribute of an "img" element holds a relative URL, but I need the complete URL.
I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that achieves this easily?
If you just need to get the complete URL from a relative one, the solution is simple in Java:
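URL pageUrl = base_url_of_the_html_page; // the URL the HTML was fetched from
String src = src_attribute_value; // relative or absolute URL
URL imgUrl = new URL(pageUrl, src); // resolves src against pageUrl; an absolute src is kept as-is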
The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document head may specify a different base URL (though it is not used very frequently).
You may use Jsoup or just a DOM parser to obtain the src attribute values and to find the base tag, if there is one.
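As a minimal sketch of the resolution step described above, the following uses java.net.URI.resolve instead of the URL constructor; both follow the same relative-reference rules. The class name, method name, and example.com URLs are made up for illustration, and the optional baseHref parameter stands for the href of a <base> tag if the page has one. Note that URI.create rejects src values containing illegal characters such as spaces, so real-world input may need escaping first.

```java
import java.net.URI;

class ImageUrlResolver {
    /**
     * Resolves an img src against the page URL.
     * baseHref is the href of a <base> tag, or null if the page has none.
     */
    static String resolve(String pageUrl, String baseHref, String src) {
        URI base = URI.create(pageUrl);
        if (baseHref != null && !baseHref.isEmpty()) {
            // The <base> href may itself be relative to the page URL
            base = base.resolve(baseHref);
        }
        // An absolute src is returned unchanged by resolve()
        return base.resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/articles/2020/index.html"; // hypothetical page
        System.out.println(resolve(page, null, "images/logo.png"));
        System.out.println(resolve(page, null, "/static/logo.png"));
        System.out.println(resolve(page, null, "../banner.jpg"));
        System.out.println(resolve(page, "/assets/", "images/logo.png"));
    }
}
```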
Right now I'm trying to implement a web crawler that lists every file (images, etc.) and file extension (.jpg, .png, etc.) contained within a website, using Jsoup. I don't know how to extract file elements from the URL.
Right now I know how to get all the text contained in the URL by doing something like this.
val doc = Jsoup.connect(link).get
val body: Element = doc.body()
val allText: String = body.text
I am using Thymeleaf as my template engine to map XHTML to HTML and Flying Saucer to generate a PDF file afterwards.
Now I have failed to display my static images located at /src/main/resources/ inside my generated PDF file. The file itself is displayed fine; only the images disappear.
Even other locations like /src/main/resources/static or /src/main/resources/public didn't help.
My HTML / XHTML looks like:
<img src="images/logo_black.png"></img>
<img src="/images/logo_black.png"></img>
<img alt="mastercard" th:src="@{classpath:static/images/logo_black.png}" />
<div data-src="images/logo_black.png"></div>
<div data-src="/images/logo_black.png"></div>
<div data-src="@{classpath:static/images/logo_black.png}"></div>
None of them works properly.
The images themselves are reachable at localhost:8048/logo_black.png
I don't want to refer to my images with a full URL (http://...)
You can include resources from any URL (from the Internet or from your file system). Either way, there are several steps involved:
When generating the HTML from the Thymeleaf template, you can use
@{/some/url} to resolve a path relative to your Web context (assuming you have a Web context), or
@{classpath:/some/url}, which will just leave the URL as classpath:/some/url, or
simply a string constant or a value from a variable (${var}). It doesn't matter whether it's an absolute URL (https://some/url) or a relative one; Thymeleaf will leave it unchanged in the resulting HTML.
Before you pass the HTML to Flying Saucer, make sure the URLs are correct. Then Flying Saucer will process all URLs with a UserAgentCallback, by default ITextUserAgent.
The relevant methods in UserAgentCallback are resolveURI and setBaseURL.
There is some weird logic going on in the default resolveURI method of ITextUserAgent (inherited from NaiveUserAgent). If the baseURL is null, it will try to set it, so it's best to always set it yourself. I had better results with overriding resolveURI; the following is enough to keep absolute URLs and resolve relative URLs against the baseURL:
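@Override
public String resolveURI(String uri) {
    // URI.create avoids the checked URISyntaxException thrown by new URI(...)
    if (URI.create(uri).isAbsolute())
        return uri; // keep absolute URLs as they are
    else
        return Paths.get(getBaseURL(), uri).toUri().toString(); // resolve against the base URL
}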
Finally, in order to resolve the classpath: protocol, you need to define a URLStreamHandler unless one is already defined (for example, the embedded Tomcat of Spring Boot already supports this).
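A handler for the classpath: protocol could look roughly like the sketch below. It is an assumption-laden illustration rather than what Tomcat or Flying Saucer ship: the class and helper names are made up, the handler is attached per URL through the three-argument URL constructor (deprecated in recent JDKs, but still available), and resources are looked up through the thread context class loader.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class ClasspathUrlHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL url) throws IOException {
        String name = resourceName(url);
        // Look the resource up on the classpath and delegate to its real URL
        URL resource = Thread.currentThread().getContextClassLoader().getResource(name);
        if (resource == null) {
            throw new FileNotFoundException("Not on classpath: " + name);
        }
        return resource.openConnection();
    }

    /** Class loaders expect resource names without a leading slash. */
    static String resourceName(URL url) {
        String path = url.getPath();
        return path.startsWith("/") ? path.substring(1) : path;
    }

    /** Builds a classpath: URL bound to this handler (hypothetical helper). */
    static URL classpathUrl(String spec) {
        try {
            return new URL(null, spec, new ClasspathUrlHandler());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this in place, ClasspathUrlHandler.classpathUrl("classpath:/static/images/logo_black.png").openStream() would read the image from the classpath. To make plain new URL("classpath:...") calls work everywhere, the handler has to be registered globally instead, for example through URL.setURLStreamHandlerFactory, which can only be called once per JVM.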
You can render the image with the help of base64. Just convert your image to base64 and it will show on your web page as well as in the mobile view. The tag is:
<img th:src="@{ base 64}"/>
I'm programming a desktop application using SWT and I use the browser in parts of the interface because of the flexibility.
I can easily include external images. An image in the file system:
<img src="/home/user/image.jpg" />
Or an image on the web:
<img src="http://some.cl/image.jpg" />
Can I obtain the images from a stream? Somewhere in my code I want to program something like this:
OutputStream getExternalResource(String resourcePath)
I want to arbitrarily control the origin of the request.
I don't know of a direct way to do this. All I can think of is using JavaScript to set the image data as a base64 string in the src of the image,
using org.eclipse.swt.browser.Browser.execute(String) or maybe org.eclipse.swt.browser.BrowserFunction.
The images should have an id which can be used in JavaScript:
<img id="image1" />
Edit: on the other hand, maybe it's easier to just parse the HTML beforehand and set the image base64 string there.
Depending on how you get the HTML you could do one of the following:
If you create the HTML yourself, just use <img src="data:image/png;base64.... Convert the image to base64 and put it in the src attribute.
If you read the HTML from an external source, you could use JSoup to parse the HTML and replace each image src attribute with a base64 string. Afterwards, use Browser.setText(String) to set the HTML of the browser. Be aware that in that case relative paths (in links or images) don't work.
String html = "html";
Document doc = Jsoup.parse(html);
Elements img = doc.getElementsByTag("img");
for (Element element : img) {
    String src = element.attr("src");
    // READ image using the existing src, convert to base64 (using java.util.Base64)
    String base64 = "";
    // replace the original src with an inline data URI
    element.attr("src", "data:image/png;base64," + base64);
}
String newHtml = doc.html();
browser.setText(newHtml);
If you have control over the HTML page, i.e. it is generated by your code, possibly from a template, then you can embed the image.
The bytes of the image need to be base64 encoded and placed in the src attribute of the image tag as described here: http://www.techerator.com/2011/12/how-to-embed-images-directly-into-your-html/
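The encoding step itself needs only java.util.Base64. The following is a small sketch; the class and method names are invented for illustration, and the caller is assumed to know the image's MIME type.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

class DataUri {
    /** Builds a data: URI that can be used directly as an img src value. */
    static String of(String mimeType, byte[] bytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DataUri path/to/image.png  (the image is assumed to be a PNG)
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("<img src=\"" + of("image/png", bytes) + "\" />");
    }
}
```

Keep in mind that base64 makes the data about a third larger than the original bytes, so this suits small images better than large ones.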
I have a page repository with HTML files. I want to process them using jsoup, but when I try to get the absolute paths of all links, jsoup gives me empty strings (""). Is it possible to set the baseUri to a file path?
The solution link.get(i).baseUri + link.get(i).attr("href") is not sufficient for me because I need to somehow recognize which links are relative and which are not.
The jsoup documentation says:
There is a sister method parse(File in, String charsetName) which uses
the file's location as the baseUri. This is useful if you are working
on a filesystem-local site and the relative links it points to are
also on the filesystem.
But it doesn't work on my PC.
I'm "solving" the same issue with the following code. While I would prefer that the jsoup functions worked on my local file system, I need something in the meantime. The workaround is to pass the file location to the parser as the baseURI and then concatenate each relative path to that base. Unfortunately, that means I lose the handling of HTML's "../" that jsoup normally provides with its built-in functions. Furthermore, I can never be as certain about the results as I would be if the built-in functions worked.
Luckily, I use this mainly for JUnit testing, so it should add only minor risk to my production code. The context is that I built a local "Internet" to test crawling offline. I create the JSoup Document by sending a local HTML file to it in my JUnit test class:
// From my JUnit Test
String testFileName = "HTMLTest_RelativeReferences.html";
String testFilePath = getClass().getResource(testFileName).getPath();
String testFileBaseURI = testFilePath.replace(testFileName, "");
// ...
// Sends filePath and baseURI to testing class that creates JSoup Doc with:
siteDoc = Jsoup.parse(new File(testFilePath), "UTF-8", testFileBaseURI);
Now that I have created my Document with a baseURI, you and I both thought the relative paths should use that baseURI to create an absolute path. Since that failed, I run a simple test for empty-string abs:href results and concatenate my own URLs.
Elements links = siteDoc.select("a[href]"); // extract link collection
for (Element link : links) { // iterate through links
    String linkString = link.attr("abs:href"); // ftr, neither this nor absUrl("href") works
    if (linkString.isEmpty()) { // check if returned "" (i.e., the problem at hand)
        URLs.add(siteDoc.baseUri() + link.attr("href")); // concatenate baseURI to relative ref
    }
    else { // for all the properly returned absolute refs
        URLs.add(link.attr("abs:href"));
    }
}
All my JUnit Tests continue to pass with both absolute and relative local references - good luck!
HTML Doc I used for reference with all 3 links representing other HTML files in the same folder:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>HTML Test using Relative References</title>
</head>
<body>
Link1
Link2
Link3
</body>
</html>
Edit: A little digging into the jsoup library leads me to believe our local file "URL"s will never work, because jsoup handles actual URLs during its attr("abs:href") process and will throw a MalformedURLException and return "", since we are actually using local file paths rather than true URLs. I consider this outside the scope of the above answer but thought I would mention my discovery.
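Following that observation, one possible alternative is to turn the file path into a real file: URI first. The sketch below uses only java.io.File and java.net.URI; the class and method names are made up, and the paths are hypothetical. It keeps "../" handling and also tells relative links from absolute ones. Passing such a file: URI as the baseUri to Jsoup.parse may let abs:href work as well, but that is an untested assumption here.

```java
import java.io.File;
import java.net.URI;

class LocalLinkResolver {
    /** Converts a local file into a proper file: URI string usable as a base. */
    static String baseUriOf(File htmlFile) {
        return htmlFile.toURI().toString();
    }

    /** A link is absolute if it carries a scheme such as http: or file:. */
    static boolean isAbsolute(String href) {
        return URI.create(href).isAbsolute();
    }

    /** Resolves href against the base, normalizing "../" segments. */
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = baseUriOf(new File("site/pages/index.html")); // hypothetical file
        System.out.println(resolve(base, "../img/a.png"));
        System.out.println(isAbsolute("page2.html"));
    }
}
```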
You can use the absUrl() function on a jsoup Element.
String path = linkEl.absUrl("href");
Hi, I am looking for a simple URL and title extractor for HTML files in Java. I am trying to parse bookmarks.html (IE, Firefox, etc.) and add the title and URL to a database. I need to do this in Java (no third-party libraries allowed), so I probably have to use SAX/DOM/regex.
You can load the file into a DOM document and then use an XPath expression to find all the instances of an <A> tag. Extracting the HREF attribute and the tag contents should do what you want. The XPath would probably be something as simple as '//A'.
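Here is a rough sketch of that approach using only the JDK's DOM and XPath classes. It assumes the input is well-formed XML, which real bookmarks.html exports often are not, so some cleanup may be needed before parsing. The class and method names are invented for illustration. Since XML names are case-sensitive, the expression matches both A and a.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookmarkLinks {
    /** Returns {url, title} pairs for every anchor in a well-formed document. */
    static List<String[]> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//A | //a", doc, XPathConstants.NODESET);
            List<String[]> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                // Attribute names are case-sensitive too, so check both spellings
                String href = a.hasAttribute("HREF") ? a.getAttribute("HREF") : a.getAttribute("href");
                links.add(new String[] { href, a.getTextContent().trim() });
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```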