Using Flying Saucer to Render Images to PDF In Memory - java

I'm using Flying Saucer to convert XHTML to a PDF document. I've gotten the code to work with just basic HTML and in-line CSS, however, now I'm attempting to add an image as a sort of header to the PDF. What I'm wondering is if there is any way whatsoever to add the image by reading in an image file as a Java Image object, then adding that somehow to the PDF (or to the XHTML -- like it gets a virtual "url" representing the Image object that I can use to render the PDF). Has anyone ever done anything like this?
Thanks in advance for any help you can provide!

I had to do that last week so hopefully I will be able to answer you right away.
Flying Saucer
The easiest way is to add the image you want as markup in your HTML template before rendering with Flying Saucer. Within Flying Saucer you will have to implement a ReplacedElementFactory so that you can replace any markup before rendering with the image data.
/**
* Replaced element in order to replace elements like
* <tt><div class="media" data-src="image.png" /></tt> with the real
* media content.
*/
public class MediaReplacedElementFactory implements ReplacedElementFactory {
private final ReplacedElementFactory superFactory;
public MediaReplacedElementFactory(ReplacedElementFactory superFactory) {
this.superFactory = superFactory;
}
#Override
public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
Element element = blockBox.getElement();
if (element == null) {
return null;
}
String nodeName = element.getNodeName();
String className = element.getAttribute("class");
// Replace any <div class="media" data-src="image.png" /> with the
// binary data of `image.png` into the PDF.
if ("div".equals(nodeName) && "media".equals(className)) {
if (!element.hasAttribute("data-src")) {
throw new RuntimeException("An element with class `media` is missing a `data-src` attribute indicating the media file.");
}
InputStream input = null;
try {
input = new FileInputStream("/base/folder/" + element.getAttribute("data-src"));
final byte[] bytes = IOUtils.toByteArray(input);
final Image image = Image.getInstance(bytes);
final FSImage fsImage = new ITextFSImage(image);
if (fsImage != null) {
if ((cssWidth != -1) || (cssHeight != -1)) {
fsImage.scale(cssWidth, cssHeight);
}
return new ITextImageElement(fsImage);
}
} catch (Exception e) {
throw new RuntimeException("There was a problem trying to read a template embedded graphic.", e);
} finally {
IOUtils.closeQuietly(input);
}
}
return this.superFactory.createReplacedElement(layoutContext, blockBox, userAgentCallback, cssWidth, cssHeight);
}
#Override
public void reset() {
this.superFactory.reset();
}
#Override
public void remove(Element e) {
this.superFactory.remove(e);
}
#Override
public void setFormSubmissionListener(FormSubmissionListener listener) {
this.superFactory.setFormSubmissionListener(listener);
}
}
You will notice that I have hardcoded here /base/folder which is the folder where the HTML file is located as it will be the root url for Flying Saucer for resolving medias. You may change it to the correct location, coming from anywhere you want (Properties for example).
HTML
Within your HTML markup you indicate somewhere a <div class="media" data-src="somefile.png" /> like so:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<style type="text/css">
#logo { /* something if needed */ }
</style>
</head>
<body>
<!-- Header -->
<div id="logo" class="media" data-src="media/logo.png" style="width: 177px; height: 60px" />
...
</body>
</html>
Rendering
And finally you just need to indicate your ReplacedElementFactory to Flying-Saucer when rendering:
String content = loadHtml();
ITextRenderer renderer = new ITextRenderer();
renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getSharedContext().getReplacedElementFactory()));
renderer.setDocumentFromString(content.toString());
renderer.layout();
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
renderer.createPDF(baos);
// baos.toByteArray();
I have been using Freemarker to generate the HTML from a template and then feeding the result to FlyingSaucer with great success. This is a pretty neat library.

what worked for me is putting it as a embedded image. So converting image to base64 first and then embed it:
byte[] image = ...
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString("<html>\n" +
" <body>\n" +
" <h1>Image</h1>\n" +
" <div><img src=\"data:image/png;base64," + Base64.getEncoder().encodeToString(image) + "\"></img></div>\n" +
" </body>\n" +
"</html>");
renderer.layout();
renderer.createPDF(response.getOutputStream());

Thanks Alex for detailed solution. I'm using this solution and found there is another line to be added to make it work.
public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
Element element = blockBox.getElement();
....
....
final Image image = Image.getInstance(bytes);
final int factor = ((ITextUserAgent)userAgentCallback).getSharedContext().getDotsPerPixel(); //Need to add this line
image.scaleAbsolute(image.getPlainWidth() * factor, image.getPlainHeight() * factor) //Need to add this line
final FSImage fsImage = new ITextFSImage(image);
....
....
We need to read the DPP from SharedContext and scale the image to display render the image on PDF.
Another suggestion:
We can directly extend ITextReplacedElement instead of implementing ReplacedElementFactory. In that case we can set the ReplacedElementFactory in the SharedContext as follows:
renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getOutputDevice());

Related

How to get a landscape orientation for only some pages while converting html to pdf in itext 7?

I convert HTML to PDF using iText7 with the convertToPDF() method of pdfHTML. I would like to change the page orientation for a few specific pages in my PDF document. The content of these pages is dynamic, and we cannot guess how many pages that should be in landscape (i.e. content of dynamic table could take more than one page)
Current situation: I create a custom worker (implements ITagWorker) which landscape the page that following the tag <landscape/>
public byte[] generatePDF(String html) {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfWriter pdfWriter = new PdfWriter(byteArrayOutputStream);
PdfDocument pdfDocument = new PdfDocument(pdfWriter);
try {
ConverterProperties properties = new ConverterProperties();
properties.setTagWorkerFactory(
new DefaultTagWorkerFactory() {
#Override
public ITagWorker getCustomTagWorker(
IElementNode tag, ProcessorContext context) {
if ("landscape".equalsIgnoreCase(tag.name())) {
return new LandscapeDivTagWorker();
}
return null;
}
} );
MediaDeviceDescription mediaDeviceDescription = new MediaDeviceDescription(MediaType.PRINT);
properties.setMediaDeviceDescription(mediaDeviceDescription);
HtmlConverter.convertToPdf(html, pdfDocument, properties);
} catch (IOException e) {
e.printStackTrace();
}
pdfDocument.close();
return byteArrayOutputStream.toByteArray();
}
The custom worker :
public class LandscapeDivTagWorker implements ITagWorker {
#Override
public void processEnd(IElementNode element, ProcessorContext context) {
}
#Override
public boolean processContent(String content, ProcessorContext context) {
return false;
}
#Override
public boolean processTagChild(ITagWorker childTagWorker, ProcessorContext context) {
return false;
}
#Override
public IPropertyContainer getElementResult() {
return new AreaBreak(new PageSize(PageSize.A4).rotate());
}
}
Is there a way to define all the content that should be displayed in landscape?
Something like:
<p>Display in portrait</p>
<landscape>
<div>
<p>display in landscape</p>
…
<table>
..
</table>
</div>
</landscape>
or with a CSS class :
<p>Display in portrait</p>
<div class="landscape">
<p>display in landscape</p>
…
<table>
..
</table>
</div>
Result => 1 page in portrait and other pages in landscape (All the div content should be in landscape)
PS: I follow this hint Change page orientation for only some pages in the resulting PDF (created out of html) by using a custom CssApplierFactory but the result was the same => just the first page where the landscape class is used was in landscape and the other content of the table was in portrait
Doing it is actually quite tricky, but the whole mechanism is still flexible enough to accommodate this requirement.
We will be working on supporting the following syntax:
<p>Display in portrait</p>
<landscape>
<div>
<p>display in landscape</p>
<p>content</p>
.....
<p>content</p>
</div>
</landscape>
<p> After portrait </p>
First off, we will need to convert the HTML content into elements first and then add those elements into a document instead direct HTML -> PDF conversion. This is needed because in case of HTML there is a separate mechanism for page size handling as dictated by CSS specification and it's not flexible enough to accommodate your requirement so we will be using native iText Layout mechanism for that.
The idea is that apart from customizing the new page size by passing an argument to AreaBreak we will also change the default page size for PdfDocument so that all the consequent pages are created with that custom new page size. For that we will need to pass PdfDocument all along. The high-level code looks as follows:
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outFilePath));
ConverterProperties properties = new ConverterProperties();
properties.setTagWorkerFactory(new CustomTagWorkerFactory(pdfDocument));
Document document = new Document(pdfDocument);
List<IElement> elements = HtmlConverter.convertToElements(new FileInputStream(inputHtmlPath), properties);
for (IElement element : elements) {
if (element instanceof IBlockElement) {
document.add((IBlockElement) element);
}
}
pdfDocument.close();
The custom tag worker factory is also almost unchanged - it just passes PdfDocument along to the tag worker:
private static class CustomTagWorkerFactory extends DefaultTagWorkerFactory {
PdfDocument pdfDocument;
public CustomTagWorkerFactory(PdfDocument pdfDocument) {
this.pdfDocument = pdfDocument;
}
#Override
public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
if ("landscape".equalsIgnoreCase(tag.name())) {
return new LandscapeDivTagWorker(tag, context, pdfDocument);
}
return null;
}
}
The idea of LandscapeDivTagWorker is to create a Div wrapper and put there the inner content of <landscape> tag but also surround it with AreaBreak elements - the preceding one will force the new page break with landscape orientation and change the default page size for the whole document while the succeeding one will revert everything back - force the split to portrait page size and set default page size to portrait as well. Note that we are also setting a custom renderer for AreaBreak with setNextRenderer to actually set the default page size when it comes to that break:
private static class LandscapeDivTagWorker extends DivTagWorker {
private PdfDocument pdfDocument;
public LandscapeDivTagWorker(IElementNode element, ProcessorContext context, PdfDocument pdfDocument) {
super(element, context);
this.pdfDocument = pdfDocument;
}
#Override
public IPropertyContainer getElementResult() {
IPropertyContainer baseElementResult = super.getElementResult();
if (baseElementResult instanceof Div) {
Div div = new Div();
AreaBreak landscapeAreaBreak = new AreaBreak(new PageSize(PageSize.A4).rotate());
landscapeAreaBreak.setNextRenderer(new DefaultPageSizeChangingAreaBreakRenderer(landscapeAreaBreak, pdfDocument));
div.add(landscapeAreaBreak);
div.add((IBlockElement) baseElementResult);
AreaBreak portraitAreaBreak = new AreaBreak(new PageSize(PageSize.A4));
portraitAreaBreak.setNextRenderer(new DefaultPageSizeChangingAreaBreakRenderer(portraitAreaBreak, pdfDocument));
div.add(portraitAreaBreak);
baseElementResult = div;
}
return baseElementResult;
}
}
The implementation of the custom area break renderer is quite straightforward - we only set the default page size to PdfDocument - the rest is done under the hood by default implementations which we extend from:
private static class DefaultPageSizeChangingAreaBreakRenderer extends AreaBreakRenderer {
private PdfDocument pdfDocument;
private AreaBreak areaBreak;
public DefaultPageSizeChangingAreaBreakRenderer(AreaBreak areaBreak, PdfDocument pdfDocument) {
super(areaBreak);
this.pdfDocument = pdfDocument;
this.areaBreak = areaBreak;
}
#Override
public LayoutResult layout(LayoutContext layoutContext) {
pdfDocument.setDefaultPageSize(areaBreak.getPageSize());
return super.layout(layoutContext);
}
}
As a result you will get the page set up similar to the one on the screenshot:

Display dynamically generated image to the browser using jsp

I am doing a small project with images using jsp/servlets.In that I generate some image dynamically(actually I'll decrypt two image shares as one).That decrypted image must be displayed directly to browser without saving it as file in filesystem.
Crypting c=new Crypting();
BufferedImage imgKey;
BufferedImage imgEnc;
imgKey = ImageIO.read(new File("E:/Netbeans Projects/banking/web/Key.png"));
imgEnc=ImageIO.read(new File("E:/Netbeans Projects/banking/build/web/upload/E.png"));
BufferedImage imgDec=Crypting.decryptImage(imgKey,imgEnc);
When I store it in filesystem and display it using <img> it does not show the image.When reloaded it shows the image.So it is problem with the backend work of IDE.
Any help pls...
Make a servlet to generate images.
Use html img tag with attribute src, as a path to your genarated resource.
Example in spring boot (QR Codes).
Servlet
public class QRCodeServlet extends HttpServlet {
#Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
String url = req.getParameter("url");
String format = req.getParameter("format");
QRCodeFormat formatParam = StringUtils.isNotEmpty(format) && format.equalsIgnoreCase("PDF") ?
QRCodeFormat.PDF : QRCodeFormat.JPG;
if(formatParam.equals(QRCodeFormat.PDF))
resp.setContentType("application/pdf");
else
resp.setContentType("image/jpeg");
if(StringUtils.isNotBlank(url)) {
ByteArrayOutputStream stream = QRCodeService.getQRCodeFromUrl(url, formatParam);
stream.writeTo(resp.getOutputStream());
}
}
}
Configuration:
#Configuration
public class WebMvcConfig {
#Bean
public ServletRegistrationBean qrCodeServletRegistrationBean(){
ServletRegistrationBean qrCodeBean =
new ServletRegistrationBean(new QRCodeServlet(), "/qrcode");
qrCodeBean.setLoadOnStartup(1);
return qrCodeBean;
}
}
Conroller:
String qrcodeServletPrefix = "http://localhost:8082/qrcode?url="
String encodedUrl = URLEncoder.encode("http://exmaple.com?param1=value1&param2=value2", "UTF-8");
modelAndView.addObject("qrcodepage", qrcodeServletPrefix + encodedUrl);
modelAndView.setViewName("view");
view.jsp
<%# taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<img src="<c:url value='${qrcodepage}'/>" />

Multi-page PDF generation from SVG with Java and Apache Batik

I have two simple SVG documents that I want to convert to a PDF such that each document is on one page in the PDF.
My first SVG document has two rectangles that look as follow:
and the second one is a black circle.
The code looks as follow:
import java.io.*;
import org.apache.batik.anim.dom.*;
import org.apache.batik.transcoder.*;
import org.w3c.dom.*;
public class MultiPagePdf {
public static void main(String[] args) {
MultiPagePDFTranscoder transcoder = new MultiPagePDFTranscoder();
try {
final DOMImplementation impl = SVGDOMImplementation.getDOMImplementation();
SVGOMDocument doc1 = (SVGOMDocument) impl.createDocument(SVGDOMImplementation.SVG_NAMESPACE_URI, "svg", null);
// 1st rectangle in doc1
Element el = doc1.createElementNS(SVGDOMImplementation.SVG_NAMESPACE_URI, "rect");
el.setAttributeNS(null, "width", "60");
el.setAttributeNS(null, "height", "60");
el.setAttributeNS(null, "fill", "none");
el.setAttributeNS(null, "stroke", "blue");
SVGOMSVGElement docEl = (SVGOMSVGElement) doc1.getDocumentElement();
docEl.appendChild(el);
// 2nd rectangle in doc1
Element ell = doc1.createElementNS(SVGDOMImplementation.SVG_NAMESPACE_URI, "rect");
ell.setAttributeNS(null, "x", "50");
ell.setAttributeNS(null, "y", "50");
ell.setAttributeNS(null, "width", "25");
ell.setAttributeNS(null, "height", "25");
ell.setAttributeNS(null, "fill", "green");
docEl.appendChild(ell);
final DOMImplementation impl2 = SVGDOMImplementation.getDOMImplementation();
SVGOMDocument doc2 = (SVGOMDocument) impl2.createDocument(SVGDOMImplementation.SVG_NAMESPACE_URI, "svg", null);
// circle in doc2
Element el2 = doc2.createElementNS(SVGDOMImplementation.SVG_NAMESPACE_URI, "circle");
el2.setAttributeNS(null, "cx", "130");
el2.setAttributeNS(null, "cy", "100");
el2.setAttributeNS(null, "r", "50");
SVGOMSVGElement docEl2 = (SVGOMSVGElement) doc2.getDocumentElement();
docEl2.appendChild(el2);
OutputStream outputStream = new FileOutputStream(new File("/C:/Users/ah/Documents/simpleMulti.pdf"));
TranscoderOutput transcoderOutput = new TranscoderOutput(outputStream);
Document[] doccs = { doc1, doc2 };
transcoder.transcode(doccs, null, transcoderOutput); // generate PDF doc
} catch (Exception e) {
e.printStackTrace();
}
}
}
I've printed both SVG documents and they look as they should:
1st SVG Document:
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" contentScriptType="text/ecmascript" zoomAndPan="magnify" contentStyleType="text/css" preserveAspectRatio="xMidYMid meet" version="1.0">
<rect fill="none" width="60" height="60" stroke="blue"/>
<rect fill="green" x="50" width="25" height="25" y="50"/>
</svg>
2nd SVG Document:
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" contentScriptType="text/ecmascript" zoomAndPan="magnify" contentStyleType="text/css" preserveAspectRatio="xMidYMid meet" version="1.0">
<circle r="50" cx="130" cy="100"/>
</svg>
I found a code that should be doing what I'm after using Apache FOP; I have the following code to generate the PDF:
import java.io.*;
import java.util.*;
import org.apache.batik.transcoder.*;
import org.apache.fop.*;
import org.apache.fop.svg.*;
import org.w3c.dom.*;
public class MultiPagePDFTranscoder extends AbstractFOPTranscoder {
protected PDFDocumentGraphics2D graphics = null;
protected Map<String, Object> params = null;
public MultiPagePDFTranscoder() {
super();
}
protected void transcode(Document[] documents, String uri, TranscoderOutput output) throws TranscoderException {
graphics = new PDFDocumentGraphics2D(isTextStroked());
graphics.getPDFDocument().getInfo().setProducer("Apache FOP Version " + Version.getVersion() + ": PDF Transcoder for Batik");
try {
OutputStream out = output.getOutputStream();
if (!(out instanceof BufferedOutputStream)) {
out = new BufferedOutputStream(out);
}
for (int i = 0; i < documents.length; i++) {
Document document = documents[i];
super.transcode(document, uri, null);
int tmpWidth = 300;
int tmpHeight = 300;
if (i == 0) {
graphics.setupDocument(out, tmpWidth, tmpHeight);
} else {
graphics.nextPage(tmpWidth, tmpHeight);
}
graphics.setGraphicContext(new org.apache.xmlgraphics.java2d.GraphicContext());
graphics.transform(curTxf);
this.root.paint(graphics);
}
graphics.finish();
} catch (IOException ex) {
throw new TranscoderException(ex);
}
}
}
The PDF file is generated, but I have two problems with it.
The translation and scale of the SVG elements are not correct.
On the second page, the first document is present (I've tried with multiple pages, and all the previous documents are present on the current page).
I use Apache FOP 2.1 with Apache Batik 1.8.
Any help with either problem would be highly appreciated.
I'm also open to other solutions to my overall task (converting SVGs to multi-paged PDF).
I had a similar problem. The way I went about it was to use the SVGConverter app in Batik (org.apache.batik.apps.rasterizer.SVGConverter) to convert the single SVG to PDF then stick them together into one file using PDFBox (org.apache.pdfbox.multipdf.PDFMergerUtility).
Convert SVG to single page PDF(s):
File outputFile = new File(pdfPath);
SVGConverter converter = new SVGConverter();
converter.setDestinationType(DestinationType.PDF);
converter.setSources(new String[] { svgPath });
converter.setDst(outputFile);
converter.execute();
Join the PDFs together:
File pdffile = new File(multipagepdfPath);
PDFMergerUtility pdf = new PDFMergerUtility();
pdf.setDestinationFileName(pdffile.getPath());
pdf.addSource(page1pdffile);
pdf.addSource(page2pdffile);
pdf.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
//The memory settings depend on if you want to use RAM or Temp files.
If you find a better solution please let me know.
The main problem I had was that the converter app is the only one in Batik that converts to pdf and keeps the sizes correctly.
rsvg-convert can turn multiple svg-documents into a single PDF.
rsvg-convert -f pdf -o out.pdf file1.svg file2.svg file3.svg
or for all files in a folder:
rsvg-convert -f pdf -o out.pdf *.svg
Don't know how to do this with Java though. Just a thought...

How to retrieve images from server to user using Struts 2

I have a Product entity, which has a imageUrl String field.
Products images after obtaining from user will be saved in directory:
System.getProperty("user.home") + "shop/data/product/"
And when user wants to see some Product I need to get this image from "user.home"+... to JSP page.
I've tried to read the image into the byte array, convert it to Base64 encoding, and then refer in JSP like this:
<img alt="image from user home" src="data:image/png, base64;${requestScope.image}">
But this solution is not working, and as far as I understand, it has a limitation on image size.
Could you suggest me a way how to do such thing?
Try this ( i think you have some typo )
<img alt="image from user home" src="data:image/png;base64,${requestScope.image}">
Also use this site: http://www.askapache.com/online-tools/base64-image-converter/ to make sure that your output Base64 code is correct.
There's an example of ImageAction that serves image from the file system. It's called
Struts 2 dynamic image example.
Instead of base64 encoding/decoding which increases the content length two times and slows down page loading you can use the action that returnes image bytes from the file. It could be a database, in this way it should fetch bytes from Blob.
In your <img> tag that is using src attribute can contain the URL to the action that returns response with a header Content-Type: image/jpeg and bytes written to the body.
This is the code of the ImageAction:
#Result(type = "stream", params = {"contentType", "${type}"})
public class ImageAction extends ActionSupport implements ServletRequestAware {
byte[] imageInByte = null;
String imageId;
private HttpServletRequest servletRequest;
private final static String type = "image/jpeg";
public getInputStream() { return new ByteArrayInputStream(getCustomImageInBytes()); }
public String getType() { return type; }
private String getFilename() {
return this.filename;
}
public String getImageId() {
return imageId;
}
public void setImageId(String imageId) {
this.imageId = imageId;
}
public ImageAction() {
System.out.println("ImageAction");
}
public byte[] getCustomImageInBytes() {
System.out.println("imageId" + imageId);
BufferedImage originalImage;
try {
originalImage = ImageIO.read(getImageFile(this.imageId));
// convert BufferedImage to byte array
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(originalImage, "jpeg", baos);
baos.flush();
imageInByte = baos.toByteArray();
baos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return imageInByte;
}
private File getImageFile(String imageId) {
String filePath = servletRequest.getSession().getServletContext().getRealPath("/");
File file = new File(filePath + "/Image/", imageId);
System.out.println(file.toString());
return file;
}
#Override
public void setServletRequest(HttpServletRequest request) {
this.servletRequest = request;
}
}
This action supposed to have configuration created by convention-plugin. So it could be used in HTML like this
<img src="<s:url action='Image?imageId=darksouls.jpg' />" alt=""/>
So Alireza Fattahi was right that I had mistakes in my code. The first one is typo in img tag (see answer by Alireza Fattahi), the second one is after reading image to bytes array
byte[] image = ...;
I used
Base64.getEncoder().encode(image);
instead of
Base64.getEncoder().encodeToString(image));
So eventually this method with returning Base64 encoded image works. If there is a better choices - please left comments and answers.

get body content of html file in java

i'm trying to get body content of html page.
suppose this html file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style.css" rel="STYLESHEET" type="text/css" />
<title></title>
</head>
<body>
<p> text 1 </p>
<p> text 2 </p>
</body>
</html>
what i want is :
<p> text 1 </p>
<p> text 2 </p>
so, i thought that using SAXParser would do that (if you know simpler way please tell me)
this is my code, but always i get null as body content:
private final String HTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";
private final String HTML_TAG = "html";
private final String BODY_TAG = "body";
public static void parseHTML(InputStream in, ContentHandler handler) throws IOException, SAXException, ParserConfigurationException
{
if(in != null)
{
try
{
SAXParserFactory parseFactory = SAXParserFactory.newInstance();
XMLReader reader = parseFactory.newSAXParser().getXMLReader();
reader.setContentHandler(handler);
InputSource source = new InputSource(in);
source.setEncoding("UTF-8");
reader.parse(source);
}
finally
{
in.close();
}
}
}
public ContentHandler constrauctHTMLContentHandler()
{
RootElement root = new RootElement(HTML_NAME_SPACE, HTML_TAG);
root.setStartElementListener(new StartElementListener()
{
#Override
public void start(Attributes attributes)
{
String body = attributes.getValue(BODY_TAG);
Log.d("html parser", "body: " + body);
}
});
return root.getContentHandler();
}
then
parseHTML(inputStream, constrauctHTMLContentHandler()); // inputStream is html file as stream
what is wrong with this code?
How about using Jsoup? Your code can look like
Document doc = Jsoup.parse(html);
Elements elements = doc.select("body").first().children();
//or only `<p>` elements
//Elements elements = doc.select("p");
for (Element el : elements)
System.out.println("element: "+el);
Not sure how your grabbing the HTML. If its a local file then you can load it directly into Jsoup. If you have to fetch it from some URL then I normally use Apache's HttpClient. A quick start guide is here: HttpClient and does a good job of getting you started.
That will allow you to get the data back doing something like this:
HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(URL);
//
// here you can do things like add parameters used when connecting to the remote site
//
HttpResponse response = client.execute(post);
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
Then (as has been suggested by Pshemo) I use Jsoup to parse and extract the data Jsoup
Document document = Jsoup.parse(HTML);
// OR
Document doc = Jsoup.parseBodyFragment(HTML);
Elements elements = doc.select("p"); // p for <p>text</p>

Categories