Convert relative to absolute links using jsoup - java

I'm using jsoup to clean an HTML page; the problem is that when I save the HTML locally, the images do not show because their links are all relative.
Here's some example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class so2 {
    public static void main(String[] args) {
        String html = "<html><head><title>The Title</title></head>"
                + "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
        Document doc = Jsoup.parse(html, "https://whatever.com"); // baseUri seems to be ignored??
        System.out.println(doc);
    }
}
Output:
<html>
 <head>
  <title>The Title</title>
 </head>
 <body>
  <p><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></p>
 </body>
</html>
The output still shows the link as src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
I would like it converted to src="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
Can anyone show me how to get jsoup to convert all the links to absolute links?

You can select all the links and transform their hrefs into absolute URLs using Element.absUrl().
Example in your code:
EDIT (added processing of images)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    String html = "<html><head><title>The Title</title></head>"
            + "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
    Document doc = Jsoup.parse(html, "https://whatever.com");
    Elements select = doc.select("a");
    for (Element e : select) {
        // baseUri will be used by absUrl
        String absUrl = e.absUrl("href");
        e.attr("href", absUrl);
    }
    // now we process the imgs
    select = doc.select("img");
    for (Element e : select) {
        e.attr("src", e.absUrl("src"));
    }
    System.out.println(doc);
}
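As a side note, jsoup also supports the abs: attribute prefix as a shorthand for absUrl(); a minimal sketch of the same rewrite, assuming the doc parsed above:
for (Element a : doc.select("a[href]")) {
    a.attr("href", a.attr("abs:href")); // abs: resolves against the baseUri
}
for (Element img : doc.select("img[src]")) {
    img.attr("src", img.attr("abs:src"));
}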

Related

How to scrape the hyperlinks from a webpage using JSOUP

When I scrape with the following code it does not show any element within the body tag, but when I check manually with view-source the elements are there in the body. How can I scrape the hyperlinks from the following URL?
public static void main(String[] args) throws SQLException, IOException {
    String search_url = "http://www.manta.com/search?search=geico";
    Document doc = Jsoup.connect(search_url).userAgent("Mozilla").get();
    System.out.println(doc);
    Elements links = doc.select("a[href]");
    System.out.println(links);
    for (Element a : links) {
        System.out.println(a);
        String linkhref = a.attr("href");
        System.out.println(linkhref);
    }
}

XML getting text around a tag

I have an XML file with the schema below, and I want to retrieve the text around (both left and right of) a tag (using Java + dom4j):
<article>
    <article-meta></article-meta>
    <body>
        <p>
        Extensible Markup Language (XML) is a markup language that defines a set of
        rules for encoding documents in a format that is both human-readable and machine-
        readable <ref id="1">1</ref>. It is defined in the XML 1.0 Specification produced
        by the W3C, and several other related specifications
        </p>
        <p>
        Many application programming interfaces (APIs) have been developed to aid
        software developers with processing XML <ref id="2">2</ref>. data, and several schema
        systems exist to aid in the definition of XML-based languages.
        </p>
    </body>
</article>
I want to retrieve the text around the <ref> tag. For example, the output for this XML would be:
<ref id="1">1</ref>
left : both human-readable and machine-
readable
right : It is defined in the XML 1.0 Specification
Try
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class TestDom4j {
    public static Document getDocument(final String xmlFileName) {
        Document document = null;
        SAXReader reader = new SAXReader();
        try {
            document = reader.read(xmlFileName);
        } catch (DocumentException e) {
            e.printStackTrace();
        }
        return document;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        String xmlFileName = "data.xml";
        String xPath = "//article/body/p";
        Document document = getDocument(xmlFileName);
        List<Node> nodes = document.selectNodes(xPath);
        for (Node node : nodes) {
            String nodeXml = node.asXML();
            // asXML() returns "<p>...</p>", so skip the leading "<p>" (3 chars)
            // and drop the trailing "</p>" (4 chars) when slicing around <ref>
            System.out.println("Left >> " + nodeXml.substring(3, nodeXml.indexOf("<ref")).trim());
            System.out.println("Right >> " + nodeXml.substring(nodeXml.indexOf("</ref>") + 6, nodeXml.length() - 4).trim());
        }
    }
}

get body content of html file in java

I'm trying to get the body content of an HTML page.
Suppose this HTML file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <link href="../Styles/style.css" rel="STYLESHEET" type="text/css" />
    <title></title>
</head>
<body>
    <p> text 1 </p>
    <p> text 2 </p>
</body>
</html>
What I want is:
<p> text 1 </p>
<p> text 2 </p>
So I thought that using SAXParser would do that (if you know a simpler way, please tell me).
This is my code, but I always get null as the body content:
private final String HTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";
private final String HTML_TAG = "html";
private final String BODY_TAG = "body";

public static void parseHTML(InputStream in, ContentHandler handler) throws IOException, SAXException, ParserConfigurationException {
    if (in != null) {
        try {
            SAXParserFactory parseFactory = SAXParserFactory.newInstance();
            XMLReader reader = parseFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            InputSource source = new InputSource(in);
            source.setEncoding("UTF-8");
            reader.parse(source);
        } finally {
            in.close();
        }
    }
}

public ContentHandler constrauctHTMLContentHandler() {
    RootElement root = new RootElement(HTML_NAME_SPACE, HTML_TAG);
    root.setStartElementListener(new StartElementListener() {
        @Override
        public void start(Attributes attributes) {
            String body = attributes.getValue(BODY_TAG);
            Log.d("html parser", "body: " + body);
        }
    });
    return root.getContentHandler();
}
then
parseHTML(inputStream, constrauctHTMLContentHandler()); // inputStream is html file as stream
What is wrong with this code?
You always get null because attributes.getValue(BODY_TAG) looks for an attribute named body on the <html> element, which does not exist. How about using Jsoup instead? Your code can look like:
Document doc = Jsoup.parse(html);
Elements elements = doc.select("body").first().children();
// or only <p> elements
// Elements elements = doc.select("p");
for (Element el : elements)
    System.out.println("element: " + el);
Not sure how you're grabbing the HTML. If it's a local file then you can load it directly into Jsoup (see the sketch at the end of this answer). If you have to fetch it from some URL then I normally use Apache's HttpClient; its quick start guide does a good job of getting you started.
That will allow you to get the data back doing something like this:
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(URL);
//
// here you can do things like add parameters used when connecting to the remote site
//
HttpResponse response = client.execute(post);
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
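The response body then needs to be read into a string before parsing; a minimal sketch continuing from the reader above (the HTML variable feeds the Jsoup step below):
StringBuilder sb = new StringBuilder();
String line;
while ((line = rd.readLine()) != null) {
    sb.append(line).append('\n');
}
String HTML = sb.toString();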
Then (as has been suggested by Pshemo) I use Jsoup to parse and extract the data:
Document document = Jsoup.parse(HTML);
// OR
Document doc = Jsoup.parseBodyFragment(HTML);
Elements elements = doc.select("p"); // p for <p>text</p>
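And if the page is already a local file, a minimal sketch of loading it straight into Jsoup without HttpClient (the file name and charset are assumptions):
File input = new File("page.html"); // hypothetical local file
Document doc = Jsoup.parse(input, "UTF-8");
Elements paragraphs = doc.select("p");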

how to extract the text after certain tags using jsoup

I am using jsoup to extract tweet text. The HTML structure is:
<p class="js-tweet-text tweet-text">#sexyazzjas There is so much love in the air, Jasmine! Thanks for the shout out. <a href="/search?q=%23ATTLove&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>ATTLove</b></a></p>
What I want to get is: There is so much love in the air, Jasmine! Thanks for the shout out.
And I want to extract all the tweet text on the entire page.
I am new to Java and the code has bugs; please help me, thank you.
Below is my code:
package htmlparser;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class tweettxt {
    public static void main(String[] args) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect("https://twitter.com/ATT/").get();
            // get page title
            String title = doc.title();
            System.out.println("title : " + title);
            Elements links = doc.select("p class="js-tweet-text tweet-text""); // this line does not compile: broken quoting, wrong selector syntax
            for (Element link : links) {
                System.out.println("\nlink : " + link.attr("p"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Although I do agree with Robin Green about using the API and not Jsoup on this occasion, I will provide a working solution for what you asked, both to close this topic and to help future viewers who have a problem with:
a selector for a tag that has two or more classes
getting the direct text of a Jsoup element that contains other elements
public static void main(String[] args) {
    Document doc;
    try {
        // need http protocol
        doc = Jsoup.connect("https://twitter.com/ATT/").get();
        // get page title
        String title = doc.title();
        System.out.println("title : " + title);
        // select this: <p class="js-tweet-text tweet-text"></p>
        Elements links = doc.select("p.js-tweet-text.tweet-text");
        for (Element link : links) {
            System.out.println("\nlink : " + link.attr("p"));
            /* use ownText() instead of text() in order to grab the direct text of
               <p> and not the text that belongs to <p>'s children */
            System.out.println("text : " + link.ownText());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
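To see the ownText() versus text() difference in isolation, a minimal sketch with a made-up snippet:
Document doc = Jsoup.parse("<p>direct text <b>bold child</b> more direct text</p>");
Element p = doc.select("p").first();
System.out.println(p.text());    // prints: direct text bold child more direct text
System.out.println(p.ownText()); // prints: direct text more direct text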

Using Flying Saucer to Render Images to PDF In Memory

I'm using Flying Saucer to convert XHTML to a PDF document. I've gotten the code to work with just basic HTML and in-line CSS, however, now I'm attempting to add an image as a sort of header to the PDF. What I'm wondering is if there is any way whatsoever to add the image by reading in an image file as a Java Image object, then adding that somehow to the PDF (or to the XHTML -- like it gets a virtual "url" representing the Image object that I can use to render the PDF). Has anyone ever done anything like this?
Thanks in advance for any help you can provide!
I had to do that last week so hopefully I will be able to answer you right away.
Flying Saucer
The easiest way is to add the image you want as markup in your HTML template before rendering with Flying Saucer. Within Flying Saucer you will have to implement a ReplacedElementFactory so that you can replace any markup before rendering with the image data.
/**
 * Replaced element factory in order to replace elements like
 * <tt><div class="media" data-src="image.png" /></tt> with the real
 * media content.
 */
public class MediaReplacedElementFactory implements ReplacedElementFactory {
    private final ReplacedElementFactory superFactory;

    public MediaReplacedElementFactory(ReplacedElementFactory superFactory) {
        this.superFactory = superFactory;
    }

    @Override
    public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
        Element element = blockBox.getElement();
        if (element == null) {
            return null;
        }
        String nodeName = element.getNodeName();
        String className = element.getAttribute("class");
        // Replace any <div class="media" data-src="image.png" /> with the
        // binary data of `image.png` in the PDF.
        if ("div".equals(nodeName) && "media".equals(className)) {
            if (!element.hasAttribute("data-src")) {
                throw new RuntimeException("An element with class `media` is missing a `data-src` attribute indicating the media file.");
            }
            InputStream input = null;
            try {
                input = new FileInputStream("/base/folder/" + element.getAttribute("data-src"));
                final byte[] bytes = IOUtils.toByteArray(input);
                final Image image = Image.getInstance(bytes);
                final FSImage fsImage = new ITextFSImage(image);
                if (fsImage != null) {
                    if ((cssWidth != -1) || (cssHeight != -1)) {
                        fsImage.scale(cssWidth, cssHeight);
                    }
                    return new ITextImageElement(fsImage);
                }
            } catch (Exception e) {
                throw new RuntimeException("There was a problem trying to read a template embedded graphic.", e);
            } finally {
                IOUtils.closeQuietly(input);
            }
        }
        return this.superFactory.createReplacedElement(layoutContext, blockBox, userAgentCallback, cssWidth, cssHeight);
    }

    @Override
    public void reset() {
        this.superFactory.reset();
    }

    @Override
    public void remove(Element e) {
        this.superFactory.remove(e);
    }

    @Override
    public void setFormSubmissionListener(FormSubmissionListener listener) {
        this.superFactory.setFormSubmissionListener(listener);
    }
}
You will notice that I have hardcoded /base/folder here, which is the folder where the HTML file is located, as it will be the root URL for Flying Saucer when resolving media. You may change it to the correct location, coming from anywhere you want (Properties, for example; a sketch follows).
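For instance, a minimal sketch of reading that base folder from a properties file instead of hardcoding it (the file name and key are assumptions):
Properties props = new Properties();
try (InputStream in = new FileInputStream("media.properties")) { // hypothetical config file
    props.load(in);
}
// fall back to the hardcoded folder if the key is missing
String baseFolder = props.getProperty("media.base.folder", "/base/folder");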
HTML
Within your HTML markup you indicate somewhere a <div class="media" data-src="somefile.png" /> like so:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>My document</title>
    <style type="text/css">
        #logo { /* something if needed */ }
    </style>
</head>
<body>
    <!-- Header -->
    <div id="logo" class="media" data-src="media/logo.png" style="width: 177px; height: 60px" />
    ...
</body>
</html>
Rendering
And finally you just need to indicate your ReplacedElementFactory to Flying-Saucer when rendering:
String content = loadHtml();
ITextRenderer renderer = new ITextRenderer();
renderer.getSharedContext().setReplacedElementFactory(
        new MediaReplacedElementFactory(renderer.getSharedContext().getReplacedElementFactory()));
renderer.setDocumentFromString(content);
renderer.layout();
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
renderer.createPDF(baos);
// baos.toByteArray();
I have been using Freemarker to generate the HTML from a template and then feeding the result to FlyingSaucer with great success. This is a pretty neat library.
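For reference, a minimal sketch of that Freemarker step (template directory, template name, and model keys are assumptions):
Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
cfg.setDirectoryForTemplateLoading(new File("/templates")); // hypothetical template folder
Template template = cfg.getTemplate("document.ftl");        // hypothetical template name
Map<String, Object> model = new HashMap<>();
model.put("title", "My document");
StringWriter out = new StringWriter();
template.process(model, out);
String content = out.toString(); // feed to renderer.setDocumentFromString(content) as above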
What worked for me is putting it in as an embedded image. So convert the image to Base64 first and then embed it:
byte[] image = ...
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString("<html>\n" +
        "  <body>\n" +
        "    <h1>Image</h1>\n" +
        "    <div><img src=\"data:image/png;base64," + Base64.getEncoder().encodeToString(image) + "\"/></div>\n" +
        "  </body>\n" +
        "</html>");
renderer.layout();
renderer.createPDF(response.getOutputStream());
Thanks Alex for the detailed solution. I'm using this solution and found there are two more lines to be added to make it work.
public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
    Element element = blockBox.getElement();
    ....
    ....
    final Image image = Image.getInstance(bytes);
    final int factor = ((ITextUserAgent) userAgentCallback).getSharedContext().getDotsPerPixel(); // need to add this line
    image.scaleAbsolute(image.getPlainWidth() * factor, image.getPlainHeight() * factor); // need to add this line
    final FSImage fsImage = new ITextFSImage(image);
    ....
    ....
We need to read the dots-per-pixel value from the SharedContext and scale the image accordingly so it renders at the right size in the PDF.
Another suggestion:
We can directly extend ITextReplacedElementFactory instead of implementing ReplacedElementFactory from scratch. In that case we can set the ReplacedElementFactory in the SharedContext as follows:
renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getOutputDevice()));
