get body content of html file in java

I'm trying to get the body content of an HTML page.
Suppose this is the HTML file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style.css" rel="STYLESHEET" type="text/css" />
<title></title>
</head>
<body>
<p> text 1 </p>
<p> text 2 </p>
</body>
</html>
What I want is:
<p> text 1 </p>
<p> text 2 </p>
So I thought that using SAXParser would do that (if you know a simpler way, please tell me).
This is my code, but I always get null as the body content:
private final String HTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";
private final String HTML_TAG = "html";
private final String BODY_TAG = "body";
public static void parseHTML(InputStream in, ContentHandler handler) throws IOException, SAXException, ParserConfigurationException
{
if(in != null)
{
try
{
SAXParserFactory parseFactory = SAXParserFactory.newInstance();
XMLReader reader = parseFactory.newSAXParser().getXMLReader();
reader.setContentHandler(handler);
InputSource source = new InputSource(in);
source.setEncoding("UTF-8");
reader.parse(source);
}
finally
{
in.close();
}
}
}
public ContentHandler constrauctHTMLContentHandler()
{
RootElement root = new RootElement(HTML_NAME_SPACE, HTML_TAG);
root.setStartElementListener(new StartElementListener()
{
@Override
public void start(Attributes attributes)
{
String body = attributes.getValue(BODY_TAG);
Log.d("html parser", "body: " + body);
}
});
return root.getContentHandler();
}
Then I call it like this:
parseHTML(inputStream, constrauctHTMLContentHandler()); // inputStream is html file as stream
What is wrong with this code?

How about using Jsoup? Your code could look like this:
Document doc = Jsoup.parse(html);
Elements elements = doc.select("body").first().children();
//or only `<p>` elements
//Elements elements = doc.select("p");
for (Element el : elements)
System.out.println("element: "+el);

I'm not sure how you're grabbing the HTML. If it's a local file, you can load it directly into Jsoup. If you have to fetch it from a URL, I normally use Apache's HttpClient; its quick start guide does a good job of getting you started.
That will let you get the data back with something like this:
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(URL);
//
// here you can do things like add parameters used when connecting to the remote site
//
HttpResponse response = client.execute(post);
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
Then (as Pshemo suggested) I use Jsoup to parse and extract the data:
Document document = Jsoup.parse(HTML);
// OR
Document doc = Jsoup.parseBodyFragment(HTML);
Elements elements = doc.select("p"); // p for <p>text</p>
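If adding Jsoup is not an option and the input is well-formed XHTML like the file in the question, the JDK's own DOM parser can pull out the body as well. For context, the handler in the question calls attributes.getValue(BODY_TAG) on the start of the html element, which looks up an attribute named "body" rather than the child element, so it returns null. The sketch below is only an illustration: the class and method names are made up, and the sample input leaves out the DOCTYPE because a parser may try to download the referenced DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class XhtmlBodyExtractor {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns the serialized child elements of <body>, one per line, or "" if there is no body. */
    static String bodyContent(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to match elements by the XHTML namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList bodies = doc.getElementsByTagNameNS(XHTML_NS, "body");
            if (bodies.getLength() == 0) {
                return "";
            }
            // An identity transformer serializes DOM nodes back to markup.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringBuilder out = new StringBuilder();
            NodeList children = bodies.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace-only text nodes between elements
                }
                StringWriter writer = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(writer));
                out.append(writer).append('\n');
            }
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not extract body content", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title></title></head>"
                + "<body><p> text 1 </p><p> text 2 </p></body></html>";
        System.out.print(bodyContent(xhtml));
    }
}
```

Because the elements live in the XHTML namespace, the serializer may repeat the xmlns declaration on each printed element. This route only works for well-formed XHTML; for real-world HTML, a tolerant parser such as Jsoup is the safer choice.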

Related

How to display string as xml in jsp page

I'm working on a Struts2 project. In the action class I'm passing a string to a JSP page, and I want to display that string's content as XML in the JSP page.
JSP page: response.jsp
<%@ taglib prefix="s" uri="/struts-tags" %>
<s:property value="sampleStr" />
Action class: ResponseAction
public class ResponseAction extends ActionSupport {
private static final long serialVersionUID = 1L;
public String sampleStr;
public String execute() throws IOException {
String responseStr = readStringFile();
setSampleStr(responseStr);
return SUCCESS;
}
@SuppressWarnings({ "rawtypes", "unchecked" })
public String readStringFile() throws IOException{
String xmlStr = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"+
"<response>"+ "$$" +
"</response>";
InputStream inputStream = XmlFormatter.class.getResourceAsStream("/sample.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, Charset.forName("UTF-16")));
String s = "";
List list = new ArrayList();
String line;
while ((line = reader.readLine()) != null) {
list.add(line);
}
for (Object s1: list) {
s= s + s1;
}
xmlStr = xmlStr.replace("$$", s);
return xmlStr;
}
public String getSampleStr() {
return sampleStr;
}
public void setSampleStr(String sampleStr) {
this.sampleStr = sampleStr;
}
}
Struts.xml :
<package name="default" namespace="/" extends="struts-default">
<action name="PEConsolidation" class="com.metlife.ibit.pe.web.controller.actions.ResponseAction">
<interceptor-ref name="defaultStack" />
<result name="success">/WEB-INF/jsps/response.jsp</result>
</action>
</package>
When I look at response.jsp, it displays the returned string as plain text. Can anyone help me display it as XML content?

s:property has built-in escaping functionality for HTML, JavaScript and XML.
By default it escapes HTML.
I think what you want to do is no escaping at all:
<s:property value="sampleStr" escapeHtml="false" />
You should also check the HTTP headers of the response ("Content-Type: text/html" would be wrong in your case).
Instead of using a JSP, you could look into using a different result type, or perhaps write your own.
https://struts.apache.org/core-developers/result-types.html

I think that the browser is trying to interpret the XML tags as HTML tags and, failing to do so, is ignoring them.
You will need to replace each < and > character with &lt; and &gt; respectively. You can use the String.replaceAll() method from the Java API for this.
Additionally, Oracle's documentation on using JSP with XML technologies may be helpful during development.
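The replacement described above can be sketched in plain Java as follows. This is a minimal illustration, not Struts API: the class and method names are hypothetical. It uses String.replace (literal replacement) instead of replaceAll, since no regular expression is needed, and it also escapes the ampersand, which has to be handled first so that the entities produced by the later steps are not escaped a second time.

```java
// Hypothetical helper for showing XML source as visible text in an HTML page.
class XmlEscaper {
    /** Escapes the characters that a browser would otherwise treat as markup. */
    static String escape(String xml) {
        return xml
                .replace("&", "&amp;")  // must come first to avoid double escaping
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escape("<response>1 & 2</response>"));
    }
}
```

The escaped string can then be written to the page without further escaping, for example inside a pre element so that line breaks are preserved.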

How to parse xml and get the data from xml string?

I am receiving an XML string that I want to parse to extract its data. I tried converting it to JSON, but I get empty braces as the result.
public class ResultsActivity extends Activity {
String outputPath;
TextView tv;
public static int PRETTY_PRINT_INDENT_FACTOR = 4;
public static String TEST_XML_STRING;
DocumentBuilder builder;
InputStream is;
Document dom;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
tv = new TextView(this);
setContentView(tv);
String imageUrl = "unknown";
Bundle extras = getIntent().getExtras();
if( extras != null) {
imageUrl = extras.getString("IMAGE_PATH" );
outputPath = extras.getString( "RESULT_PATH" );
}
// Starting recognition process
new AsyncProcessTask(this).execute(imageUrl, outputPath);
}
public void updateResults(Boolean success) {
if (!success)
return;
try {
StringBuffer contents = new StringBuffer();
FileInputStream fis = openFileInput(outputPath);
try {
Reader reader = new InputStreamReader(fis, "UTF-8");
BufferedReader bufReader = new BufferedReader(reader);
String text = null;
while ((text = bufReader.readLine()) != null) {
contents.append(text).append(System.getProperty("line.separator"));
}
} finally {
fis.close();
}
XmlToJson xmlToJson = new XmlToJson.Builder(contents.toString()).build();
// convert to a JSONObject
JSONObject jsonObject = xmlToJson.toJson();
// OR convert to a Json String
String jsonString = xmlToJson.toString();
// OR convert to a formatted Json String (with indent & line breaks)
String formatted = xmlToJson.toFormattedString();
Log.e("xml",contents.toString());
Log.e("json",jsonObject.toString());
} catch (Exception e) {
displayMessage("Error: " + e.getMessage());
}
}
public void displayMessage( String text )
{
tv.post( new MessagePoster( text ) );
}
@Override
public boolean onCreateOptionsMenu(Menu menu) {
// Inflate the menu; this adds items to the action bar if it is present.
getMenuInflater().inflate(R.menu.activity_results, menu);
return true;
}
class MessagePoster implements Runnable {
public MessagePoster( String message )
{
_message = message;
}
public void run() {
tv.append( _message + "\n" );
setContentView( tv );
}
private final String _message;
}
}
I followed this link: https://github.com/smart-fun/XmlToJson
Can I just parse the XML directly? How can I get the data out of the XML string?
The XML string is as follows:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ocrsdk.com/schema/recognizedBusinessCard-1.0.xsd http://ocrsdk.com/schema/recognizedBusinessCard-1.0.xsd" xmlns="http://ocrsdk.com/schema/recognizedBusinessCard-1.0.xsd">
<businessCard imageRotation="noRotation">
<field type="Mobile">
<value>•32147976</value>
</field>
<field type="Address">
<value>Timing: 11:00 a.m. to 5.00 p.m</value>
</field>
<field type="Address">
<value>MULTOWECIALITY HOSPITAL Havnmg Hotel MwyantfwfMf), TOL: 1814 7»7» / 0454 7575 fax: 2514 MSS MtoMte t wvHwJaMtur0Mapttal.com</value>
</field>
<field type="Name">
<value>M. S. (Surgery), Fais, Fics</value>
</field>
<field type="Company">
<value>KASTURI MEDICARE PVT. LTD.</value>
</field>
<field type="Job">
<value>Consulting General Surgeon Special Interest: Medical Administrator: KsturiSecretary: IMA - Mira</value>
</field>
<field type="Text">
<value>Mob.: •32114976
Dr. Rakhi R
M. S. (Surgery), Surgeon
Special Interest: Medical
President: Bhayander Medical Association
Scientific Secretary: IMA - Mira Bhayander
Timing: 11:00 a.m. to 5.00 p.m
%
*
KASTURI MEDICARE PVT. LTD.
ISO 9001:2008 Certified, ASNH Cliniq 21 Certified,
MtoMte t wvHwJaMtur0Mapttal.com
mkhLkasturi0gmoiH.com</value>
</field>
</businessCard>
</document>
I checked this link on parsing XML: http://androidexample.com/XML_Parsing_-_Android_Example/index.php?view=article_discription&aid=69
But this string does not contain a list, and I can't work out how to parse it. Can anyone help, please? Thank you.

JSON is easier to parse than XML, so I suggest you first convert the XML to JSON and then parse the JSONObject.
Here is a step-by-step reference for converting XML to JSON:
https://stackoverflow.com/a/18339178/6676466

For XML parsing you can use either the XML Pull Parser or the XML DOM parser.
Both approaches are quite lengthy and involve a lot of code, because the XML has to be parsed manually.
Another way is to use the gson-xml library in your project, which does most of the job for you. It parses your XML much as GSON parses JSON.
All you need to do is create an instance of the parser and use it like this:
XmlParserCreator parserCreator = new XmlParserCreator() {
@Override
public XmlPullParser createParser() {
try {
return XmlPullParserFactory.newInstance().newPullParser();
} catch (Exception e) {
throw new RuntimeException(e);
}
}
};
GsonXml gsonXml = new GsonXmlBuilder()
.setXmlParserCreator(parserCreator)
.create();
String xml = "<model><name>my name</name><description>my description</description></model>";
SimpleModel model = gsonXml.fromXml(xml, SimpleModel.class);
Remember that you need to create a POJO class for your response, just as you do for GSON.
Include the library in your Gradle build using:
compile 'com.stanfy:gson-xml-java:0.1.+'
Please read the library's GitHub page carefully to learn its usage and limitations.
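For comparison, the manual DOM route mentioned at the start of this answer can be sketched with the JDK parser alone. The class and method names below are hypothetical, and the code assumes the structure from the question: field elements that carry a type attribute and contain one value child. The standard javax.xml and org.w3c.dom classes it relies on are also part of the Android SDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: groups the <value> texts by the "type" attribute of their <field>.
class FieldCollector {
    static Map<String, List<String>> collectFields(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the document declares a default namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Map<String, List<String>> result = new LinkedHashMap<>();
            // "*" matches any namespace, so the schema URI does not have to be repeated here.
            NodeList fields = doc.getElementsByTagNameNS("*", "field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                NodeList values = field.getElementsByTagNameNS("*", "value");
                String value = values.getLength() > 0
                        ? values.item(0).getTextContent().trim()
                        : "";
                // A type such as "Address" can occur several times, so keep a list per type.
                result.computeIfAbsent(field.getAttribute("type"), k -> new ArrayList<>())
                        .add(value);
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse business card XML", e);
        }
    }
}
```

Calling collectFields(xml).get("Company") on the string from the question would then give a one-element list with the company name.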

From your question I don't see a reason to convert the XML to JSON; you just need a way to fetch some fields out of the XML directly.
If there is no need to process JSON data at a later step, I recommend XPath. With XPath you can get data out of your XML with a simple path query like "/document/businessCard/field[@type='Mobile']/value"
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(URI_TO_YOUR_DOCUMENT);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/document/businessCard/field[@type='Mobile']/value");
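A fuller sketch of the XPath approach, including the evaluation step, might look like the following. The class and method names are hypothetical. Note one assumption that differs from the query above: the sample document declares a default namespace, so this version parses namespace-aware and matches elements with local-name() instead of bare names, which sidesteps registering a NamespaceContext. The type is passed in as an XPath variable rather than concatenated into the query.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper around the JDK XPath API.
class BusinessCardXPath {
    /** Returns the text of the first <value> under a <field> of the given type, or "" if none. */
    static String firstFieldValue(String xml, String type) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Supplies the value of $type used in the query below.
            xpath.setXPathVariableResolver(
                    name -> "type".equals(name.getLocalPart()) ? type : null);

            String query = "/*[local-name()='document']/*[local-name()='businessCard']"
                    + "/*[local-name()='field'][@type=$type]/*[local-name()='value']";
            // Evaluating to a String yields the text of the first matching node.
            return xpath.evaluate(query, doc);
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }
}
```

For example, firstFieldValue(xml, "Mobile") would return the mobile number from the string in the question.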

get image url of rss with rome library

I have an RSS file as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title> سایپا نیوز </title>
<link>http://www.saipanews.com/</link>
<description></description>
<language>fa</language>
<item>
<author></author>
<pretitle></pretitle>
<title>پیام تبریک دکتر جمالی به مناسبت فرارسیدن سالروز ولادت حضرت علی(ع) و روز پدر</title>
<link>http://www.saipanews.com/view-6751.html</link>
<pubdate>2016-04-20 10:58:00</pubdate>
<description>سایپا نیوز: مدیرعامل گروه خودروسازی سایپا همزمان با فرارسیدن سالروز میلاد باسعادت حضرت علی(ع) و روز پدر، طی پیامی به تمامی پدران متعهد و پرتلاش ایران زمین تبریک گفت.</description>
<secid>0</secid>
<typid>8</typid>
<image>http://www.saipanews.com/media/image/jamali/jmali.JPG</image>
</item>
<item>
<author></author>
<pretitle></pretitle>
<title>فرهنگ رانندگی بین خطوط در معابر شهری در حال گسترش است </title>
<link>http://www.saipanews.com/view-6748.html</link>
<pubdate>2016-04-19 11:27:00</pubdate>
<description>سایپا نیوز: به گزارش سایپا نیوز و به نقل از فرارو، از آنجایی که فرهنگ رانندگی مجموعه ای از رفتارهای درست رانندگی و آداب زندگی اجتماعی بهنگام تردد در شهرها و جاده ها است، رانندگی در بین خطوط معابر شهری یکی از نمادهای فرهنگ رانندگی در کشورهای درحال توسعه و توسعه یافته می باشد.</description>
<secid>0</secid>
<typid>8</typid>
<image>http://www.saipanews.com/media/image/farhang%20ranandegi/252887_331.jpg</image>
</item>
</channel>
</rss>
I want to get the image URLs.
I am using the Rome library but have not found a solution.
How can I get the image URL of each item with the Rome library?

To get the image tag, I built a new RSS parser as follows:
public class NewRssParser extends RSS094Parser implements WireFeedParser {
public NewRssParser() {
this("rss_2.0");
}
protected NewRssParser(String type) {
super(type);
}
protected String getRSSVersion() {
return "2.0";
}
protected boolean isHourFormat24(Element rssRoot) {
return false;
}
protected Description parseItemDescription(Element rssRoot, Element eDesc) {
Description desc = super.parseItemDescription(rssRoot, eDesc);
desc.setType("text/html"); // change as per
// https://rome.dev.java.net/issues/show_bug.cgi?id=26
return desc;
}
public boolean isMyType(Document document) {
boolean ok;
Element rssRoot = document.getRootElement();
ok = rssRoot.getName().equals("rss");
if (ok) {
ok = false;
Attribute version = rssRoot.getAttribute("version");
if (version != null) {
// At this point, as far ROME is concerned RSS 2.0, 2.00 and
// 2.0.X are all the same, so let's use startsWith for leniency.
ok = version.getValue().startsWith(getRSSVersion());
}
}
return ok;
}
@Override
public Item parseItem(Element arg0, Element arg1) {
Item item = super.parseItem(arg0, arg1);
Element imageElement = arg1.getChild("image", getRSSNamespace());
if (imageElement != null) {
String imageUrl = imageElement.getText();
Element urlElement = imageElement.getChild("url");
imageUrl = urlElement != null ? urlElement.getText() : imageUrl;
Enclosure enc = new Enclosure();
enc.setType("image");
enc.setUrl(imageUrl);
item.getEnclosures().add(enc);
}
return item;
}
}
The class overrides the parseItem method, reads the image element, and adds the image URL to the item's enclosures.
Then add the following line to the rome.properties file:
WireFeedParser.classes=[package name].NewRssParser
Example:
WireFeedParser.classes=ir.armansoft.newscommunity.newsgathering.parser.impl.NewRssParser

Rome won't provide the <image> tag because it does not belong to the namespace it is in, so the feed isn't valid:
line 18, column 3: Undefined item element: image (29 occurrences) [help]
<image>http://www.saipanews.com/media/image/%D8%AA%D9%88%D9%84%D9%8A%D8%A ...
If the image tag were in a different namespace, like this:
<image:image>http://www.saipanews.com/media/image/%D8%AA%D9%88%D9%84%D9%8A%D8%AF/2.jpg</image:image>
you could get the foreign markup this way:
for(SyndEntry entry : feed.getEntries()) {
for (Element element : entry.getForeignMarkup()) {
System.out.println("element: " + element.toString());
}
}
And the result would be
element: [Element: <image:image [Namespace: http://purl.org/rss/1.0/modules/image/]/>]
Unless the feed is fixed, it seems there is no way to get the image URL with the Rome library at the moment.

The answer is simple.
First get the SyndContent using the Rome API.
Here is the code for reading the images and all content from the RSS feed:
<%@ page import="com.rometools.rome.feed.synd.SyndFeed"%>
<%@ page import="com.rometools.rome.feed.synd.SyndEntry"%>
<%@ page import="com.rometools.rome.feed.synd.SyndContent"%>
<%@ page import="com.rometools.modules.mediarss.MediaEntryModule"%>
<%@ page import="com.rometools.rome.feed.module.Module"%>
<%@ page import="com.rometools.modules.mediarss.types.Thumbnail"%>
<%@ page import="java.util.Iterator"%>
<%@ page import="java.util.List"%>
<html>
<head>
<title>website</title>
<link href="/css/style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<h1>Home</h1>
<%
HttpSession session1=request.getSession(false);
SyndFeed syndFeed11= (SyndFeed) session1.getAttribute("syndFeed");
%>
<h2><%=syndFeed11.getTitle()%></h2>
<ul>
<%
Iterator it = syndFeed11.getEntries().iterator();
while (it.hasNext())
{
SyndEntry entry = (SyndEntry) it.next();
%>
<li><%=entry.getTitle()%> <%
List<SyndContent> syndContents=entry.getContents();
System.out.println(syndContents.size());
for(SyndContent syndContent:syndContents)
{
System.out.println(syndContent.getMode());
System.out.println("This is content"+syndContent.getValue());
%>
<%-- This string contains the link to the image; apply a regular expression to extract the link from the <img src="LINK"> markup --%>
<%=syndContent.getValue() %>
<%
}
//SyndContent syndContent=syndContents.get(0);
for (Module module : entry.getModules()) {
if (module instanceof MediaEntryModule) {
MediaEntryModule media = (MediaEntryModule)module;
for (Thumbnail thumb : media.getMetadata().getThumbnail()) {
%><img src="<%=thumb.getUrl() %>" />
<%
}
}
}
%></li>
<% } %>
</ul>
</body>
</html>
Below is the servlet class:
package website.web;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletConfig;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import org.apache.log4j.Logger;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.FeedException;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;
public class HomeServlet extends HttpServlet {
/**
*
*/
private static final long serialVersionUID = 1L;
private Logger logger = Logger.getLogger(this.getClass());
@Override
public void init(ServletConfig config) throws ServletException {
super.init(config);
}
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException {
String rssUrl=(String)req.getAttribute("rss");
logger.debug("Retrieving yahoo news feed");
URL url = new URL("https://www.reddit.com/.rss");
SyndFeedInput syndFeedInput = new SyndFeedInput();
HttpSession session=req.getSession();
SyndFeed syndFeed = null;
XmlReader xmlReader = new XmlReader(url);
try {
syndFeed = syndFeedInput.build(xmlReader);
System.out.println("Donr");
} catch (IllegalArgumentException e) {
logger.error("", e);
} catch (FeedException e) {
logger.error("", e);
}
logger.debug("Forwarding to home.jsp");
req.setAttribute("syndFeed11", syndFeed);
PrintWriter out = resp.getWriter();
out.println("<h1>");
out.println();
session.setAttribute("syndFeed", syndFeed);
out.println("</h1>");
ServletContext context = getServletContext();
RequestDispatcher dispatcher = context.getRequestDispatcher("/WEB-INF/jsp/home.jsp");
dispatcher.forward(req,resp);
}
}

I solved this problem by parsing the feed with Rome and then parsing it again to get the raw JDOM Document. Then I can get the item elements from the feed and look for images. It's a bit hacky, but it is easier than extending the RSS parsers and so on. (The snippets below are Groovy.)
byte[] data = ... bytes for the feed ...
SyndFeedInput input = new SyndFeedInput()
input.allowDoctypes = true
SyndFeed sf = input.build(new XmlReader(new ByteArrayInputStream(data)))
Document doc = new MyWireFeedInput().getDocument(new XmlReader(new ByteArrayInputStream(data)))
Element channel = doc.rootElement.getChild("channel")
List<Element> items = channel ? channel.getChildren("item") : null
List<SyndEntry> entries = sf.entries
for (int i = 0; i < entries.size(); i++) {
SyndEntry entry = entries[i]
Element item = items ? items[i] : null
if (item) {
Element image = item.getChild("image")
... add it to enclosures or whatever ...
}
}
Here is the class that gets the jdom Document:
/**
* This is a hack to get at the protected {@link WireFeedInput#createSAXBuilder()} method so we can get the
* raw jdom document for the feed to extract elements (e.g. 'image') not parsed by the built in feed parsers.
*/
public class MyWireFeedInput extends WireFeedInput {
Document getDocument(Reader reader) {
final SAXBuilder saxBuilder = createSAXBuilder();
try {
if (xmlHealerOn) reader = new XmlFixerReader(reader)
return saxBuilder.build(reader);
} catch (final JDOMParseException ex) {
throw new ParsingFeedException("Invalid XML: " + ex.getMessage(), ex);
} catch (final IllegalArgumentException ex) {
throw ex;
} catch (final Exception ex) {
throw new ParsingFeedException("Invalid XML", ex);
}
}
}
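The same idea, reading the raw item elements instead of relying on Rome's object model, can also be sketched in plain Java with only the JDK DOM parser. This is an illustration under the assumption that each item carries the image URL as the text of a direct image child, as in the feed from the question; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: collects the text of the <image> child of every <item>.
class RssImageUrls {
    static List<String> imageUrls(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            List<String> urls = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                // Only look at direct children, so a channel-level <image> is never picked up.
                NodeList children = items.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "image".equals(child.getNodeName())) {
                        urls.add(child.getTextContent().trim());
                        break; // at most one image per item
                    }
                }
            }
            return urls;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse RSS feed", e);
        }
    }
}
```

Items without an image element are simply skipped, so the returned list can be shorter than the list of entries; pairing URLs with Rome's SyndEntry objects would need an index-based match like the one in the Groovy snippet.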

Convert relative to absolute links using jsoup

I'm using jsoup to clean an HTML page. The problem is that when I save the HTML locally, the images do not show because they all use relative links.
Here's some example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class so2 {
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com"); // baseUri seems to be ignored??
System.out.println(doc);
}
}
Output:
<html>
<head>
<title>The Title</title>
</head>
<body>
<p><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></p>
</body>
</html>
The output still shows the image source as src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
I would like it to be converted to src="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
Can anyone show me how to get jsoup to convert all the links to absolute links?

You can select all the links and convert their hrefs to absolute URLs using Element.absUrl().
Here is an example based on your code (edited to also process images):
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com");
Elements select = doc.select("a");
for (Element e : select){
// baseUri will be used by absUrl
String absUrl = e.absUrl("href");
e.attr("href", absUrl);
}
//now we process the imgs
select = doc.select("img");
for (Element e : select){
e.attr("src", e.absUrl("src"));
}
System.out.println(doc);
}
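Conceptually, making a link absolute means resolving it against the base URI. A minimal sketch of that resolution step on its own, using java.net.URI from the JDK rather than jsoup (the class and method names are illustrative, and jsoup's own handling may differ in edge cases):

```java
import java.net.URI;

// Hypothetical helper showing plain URI resolution against a base.
class AbsoluteLinks {
    /** Resolves a possibly relative reference against the given base URI. */
    static String toAbsolute(String baseUri, String reference) {
        // URI.create throws IllegalArgumentException for invalid input, e.g. unescaped spaces.
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("https://whatever.com",
                "/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif"));
    }
}
```

References that are already absolute come back unchanged, which is why applying the conversion to every src and href attribute is safe.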

Using Flying Saucer to Render Images to PDF In Memory

I'm using Flying Saucer to convert XHTML to a PDF document. I've gotten the code to work with just basic HTML and in-line CSS, however, now I'm attempting to add an image as a sort of header to the PDF. What I'm wondering is if there is any way whatsoever to add the image by reading in an image file as a Java Image object, then adding that somehow to the PDF (or to the XHTML -- like it gets a virtual "url" representing the Image object that I can use to render the PDF). Has anyone ever done anything like this?
Thanks in advance for any help you can provide!

I had to do that last week, so hopefully I can answer you right away.
Flying Saucer
The easiest way is to add the image you want as markup in your HTML template before rendering with Flying Saucer. Within Flying Saucer you will have to implement a ReplacedElementFactory so that you can replace any markup before rendering with the image data.
/**
* Replaced element in order to replace elements like
* <tt><div class="media" data-src="image.png" /></tt> with the real
* media content.
*/
public class MediaReplacedElementFactory implements ReplacedElementFactory {
private final ReplacedElementFactory superFactory;
public MediaReplacedElementFactory(ReplacedElementFactory superFactory) {
this.superFactory = superFactory;
}
@Override
public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
Element element = blockBox.getElement();
if (element == null) {
return null;
}
String nodeName = element.getNodeName();
String className = element.getAttribute("class");
// Replace any <div class="media" data-src="image.png" /> with the
// binary data of `image.png` into the PDF.
if ("div".equals(nodeName) && "media".equals(className)) {
if (!element.hasAttribute("data-src")) {
throw new RuntimeException("An element with class `media` is missing a `data-src` attribute indicating the media file.");
}
InputStream input = null;
try {
input = new FileInputStream("/base/folder/" + element.getAttribute("data-src"));
final byte[] bytes = IOUtils.toByteArray(input);
final Image image = Image.getInstance(bytes);
final FSImage fsImage = new ITextFSImage(image);
if (fsImage != null) {
if ((cssWidth != -1) || (cssHeight != -1)) {
fsImage.scale(cssWidth, cssHeight);
}
return new ITextImageElement(fsImage);
}
} catch (Exception e) {
throw new RuntimeException("There was a problem trying to read a template embedded graphic.", e);
} finally {
IOUtils.closeQuietly(input);
}
}
return this.superFactory.createReplacedElement(layoutContext, blockBox, userAgentCallback, cssWidth, cssHeight);
}
@Override
public void reset() {
this.superFactory.reset();
}
@Override
public void remove(Element e) {
this.superFactory.remove(e);
}
@Override
public void setFormSubmissionListener(FormSubmissionListener listener) {
this.superFactory.setFormSubmissionListener(listener);
}
}
You will notice that I have hardcoded /base/folder here. This is the folder where the HTML file is located, since it serves as the root URL Flying Saucer uses to resolve media. You may change it to the correct location, loaded from wherever you want (Properties, for example).
HTML
Within your HTML markup you indicate somewhere a <div class="media" data-src="somefile.png" /> like so:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<style type="text/css">
#logo { /* something if needed */ }
</style>
</head>
<body>
<!-- Header -->
<div id="logo" class="media" data-src="media/logo.png" style="width: 177px; height: 60px" />
...
</body>
</html>
Rendering
And finally you just need to register your ReplacedElementFactory with Flying Saucer when rendering:
String content = loadHtml();
ITextRenderer renderer = new ITextRenderer();
renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getSharedContext().getReplacedElementFactory()));
renderer.setDocumentFromString(content.toString());
renderer.layout();
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
renderer.createPDF(baos);
// baos.toByteArray();
I have been using Freemarker to generate the HTML from a template and then feeding the result to FlyingSaucer with great success. This is a pretty neat library.

What worked for me was embedding the image: convert the image to Base64 first and then embed it:
byte[] image = ...
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString("<html>\n" +
" <body>\n" +
" <h1>Image</h1>\n" +
" <div><img src=\"data:image/png;base64," + Base64.getEncoder().encodeToString(image) + "\"></img></div>\n" +
" </body>\n" +
"</html>");
renderer.layout();
renderer.createPDF(response.getOutputStream());
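Since the question starts from a Java Image object rather than a byte array, the missing step is turning the image into bytes. A minimal sketch using only the JDK, assuming the image is available as a BufferedImage and that PNG is an acceptable format (the class and method names are hypothetical):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper: builds an <img> tag whose source is an in-memory image.
class InlineImage {
    /** Encodes an in-memory image as PNG bytes. */
    static byte[] toPngBytes(BufferedImage image) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps raw image bytes in a self-closing XHTML img tag with a Base64 data URI. */
    static String toImgTag(byte[] imageBytes, String mimeType) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + base64 + "\" />";
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        System.out.println(toImgTag(toPngBytes(image), "image/png"));
    }
}
```

The resulting tag can be concatenated into the XHTML string that is passed to setDocumentFromString, so no file or URL is needed for the image.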

Thanks, Alex, for the detailed solution. I'm using it and found that two more lines need to be added to make it work.
public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
Element element = blockBox.getElement();
....
....
final Image image = Image.getInstance(bytes);
final int factor = ((ITextUserAgent)userAgentCallback).getSharedContext().getDotsPerPixel(); //Need to add this line
image.scaleAbsolute(image.getPlainWidth() * factor, image.getPlainHeight() * factor); //Need to add this line
final FSImage fsImage = new ITextFSImage(image);
....
....
We need to read the dots per pixel (DPP) from the SharedContext and scale the image so that it renders at the right size in the PDF.
Another suggestion:
We can directly extend ITextReplacedElementFactory instead of implementing ReplacedElementFactory. In that case we can set the ReplacedElementFactory in the SharedContext as follows:
renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getOutputDevice()));
