I am trying to export an HTML page into a PDF using Flying Saucer. For some reason, the pages have a large white space after the header (id = "divTemplateHeaderPage1") divisions.
The jsFiddle link to my HTML code that is being used by PDF renderer: https://jsfiddle.net/Sparks245/uhxqdta6/.
Below is the Java code used for rendering the PDF (Test.html is the same HTML code in the fiddle) and rendering only one page.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.json.HTTP;
import org.json.JSONException;
import org.json.*;
import org.json.simple.JSONArray;
import org.json.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.json.simple.parser.*;
import org.xhtmlrenderer.pdf.ITextRenderer;
import com.lowagie.text.DocumentException;
import com.lowagie.text.List;
import com.sun.xml.internal.bind.v2.runtime.unmarshaller.XsiNilLoader.Array;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;
#WebServlet("/PLPDFExport")
public class PLPDFExport extends HttpServlet
{
//Option for Serialization
private static final long serialVersionUID = 1L;
public PLPDFExport()
{
super();
}
//Get method
protected void doGet(HttpServletRequest request,
HttpServletResponse response)
throws ServletException,
IOException
{
}
//Post method
protected void doPost(HttpServletRequest request,
HttpServletResponse response)
throws ServletException,
IOException
{
StringBuffer jb = new StringBuffer();
String line = null;
int Pages;
String[] newArray = null;
try
{
BufferedReader reader = request.getReader();
while ((line = reader.readLine()) != null)
{ jb.append(line);
}
} catch (Exception e) { /*report an error*/ }
try
{
JSONObject obj = new JSONObject(jb.toString());
Pages = obj.getInt("Pages");
newArray = new String[1];
for(int cnt = 1; cnt <= 1; cnt++)
{
StringBuffer buf = new StringBuffer();
String base = "C:/Users/Sparks/Desktop/";
buf.append(readFile(base + "Test.html"));
newArray[0] = buf.toString();
}
}
catch (JSONException e)
{
// crash and burn
throw new IOException("Error parsing JSON request string");
}
//Get the parameters
OutputStream os = null;
try {
final File outputFile = File.createTempFile("FlyingSacuer.PDFRenderToMultiplePages", ".pdf");
os = new FileOutputStream(outputFile);
ITextRenderer renderer = new ITextRenderer();
// we need to create the target PDF
// we'll create one page per input string, but we call layout for the first
renderer.setScaleToFit(true);
renderer.isScaleToFit();
renderer.setDocumentFromString(newArray[0]);
renderer.layout();
try {
renderer.createPDF(os, false);
} catch (DocumentException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// each page after the first we add using layout() followed by writeNextDocument()
for (int i = 1; i < newArray.length; i++) {
renderer.setScaleToFit(true);
renderer.isScaleToFit();
renderer.setDocumentFromString(newArray[i]);
renderer.layout();
try {
renderer.writeNextDocument();
} catch (DocumentException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// complete the PDF
renderer.finishPDF();
System.out.println("PDF Downloaded to " + outputFile );
System.out.println(newArray[0]);
}
finally {
if (os != null) {
try {
os.close();
} catch (IOException e) { /*ignore*/ }
}
}
//Return
response.setContentType("application/json");
response.setCharacterEncoding("UTF-8");
response.getWriter().write("File Uploaded");
}
String readFile(String fileName) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(fileName));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append("\n");
line = br.readLine();
}
return sb.toString();
} finally {
br.close();
}
}
}
The link for exported PDF: https://drive.google.com/file/d/13CmlJK0ZDLolt7C3yLN2k4uJqV3TX-4B/view?usp=sharing
I tried adding css properties like page-break-inside: avoid to the header divisions but it didn't work. Also I tried adding absolute positions and top margins to the body division (id = "divTemplateBodyPage1") just below the header div, but the white space continues to exist.
Any suggestions would be helpful.
Please take a look at the metadata of your PDF:
You are using an old third party tool that is not endorsed by iText Group, and that uses iText 2.1.7, a version of iText dating from 2009 that should no longer be used.
It would probably have been OK to complain and to write "My code isn't working" about 7 years ago, but if you would use the most recent version of iText, the result of converting your HTML to PDF would look like this:
I only needed a single line of code to get this result:
HtmlConverter.convertToPdf(new File(src), new File(dest));
In this line src is the path the the source HTML and dest is the path to the resulting PDF.
I only had to apply one minor change to your HTML. I change the #page properties like this:
#page {
size: 27cm 38cm;
margin: 0.2cm;
}
If I hadn't changed this part of the CSS, the page size would have been A4, and in that case, not all the content would have fitted the page. I also added a small margin because I didn't like the fact that the border was sticking to close to the sides of the page.
Morale: don't use old versions of libraries! Download the latest version of iText and the pdfHTML add-on. You need iText 7 core and the pdfHTML add-on. You might also want to read the HTML to PDF tutorial.
Related
I under gone a situation of converting html to pdf, Thankfully I can achieved this through flying saucer api. But My HTML consists of svg tags while converting I am unable to get the svg in pdf. It can be achieved using a Stackoverflow question
and Tutorial.
What is meant by the replacedElementFactory?
ChainingReplacedElementFactory chainingReplacedElementFactory
= new ChainingReplacedElementFactory();
chainingReplacedElementFactory.addReplacedElementFactory(replacedElementFactory);
chainingReplacedElementFactory.addReplacedElementFactory(new SVGReplacedElementFactory());
renderer.getSharedContext().setReplacedElementFactory(chainingReplacedElementFactory);
It's just an error in the tutorial, the line with replacedElementFactory is not needed.
Here is my working example.
Java:
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xhtmlrenderer.pdf.ITextRenderer;
public class PdfSvg {
public static void main(String[] args) throws Exception {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document inputDoc = builder.parse("svg.html");
ByteArrayOutputStream output = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
ChainingReplacedElementFactory chainingReplacedElementFactory = new ChainingReplacedElementFactory();
chainingReplacedElementFactory.addReplacedElementFactory(new SVGReplacedElementFactory());
renderer.getSharedContext().setReplacedElementFactory(chainingReplacedElementFactory);
renderer.setDocument(inputDoc, "");;
renderer.layout();
renderer.createPDF(output);
OutputStream fos = new FileOutputStream("svg.pdf");
output.writeTo(fos);
}
}
HTML:
<html>
<head>
<style type="text/css">
svg {display: block;width:100mm;height:100mm}
</style>
</head>
<body>
<div>
<svg xmlns="http://www.w3.org/2000/svg">
<circle cx="50" cy="50" r="40" stroke="black" stroke-width="3"
fill="red" />
</svg>
</div>
</body>
</html>
The ChainingReplacedElementFactory, SVGReplacedElement and SVGReplacedElementFactory comes from the tutorial.
If you wanted an in page solution, here's an alternate using #cloudformatter which is a remote formatting service. I added their Javascript to your fiddle along with some text and your Highchart chart.
http://jsfiddle.net/yk0Lxzg0/1/
var click="return xepOnline.Formatter.Format('printme', {render:'download'})";
jQuery('#buttons').append('<button onclick="'+ click +'">PDF</button>');
The above code placed in the fiddle will format the div with 'id' printme to PDF for download. That div includes your chart and some text.
http://www.cloudformatter.com/CSS2Pdf.APIDoc.Usage shows usage instructions and has many more samples of charts in SVG formatted to PDF either by themselves or as part of pages combined with text, tables and such.
#Rajesh I hope you already found a solution to your problem. If not (or anyone having issues working with flying saucer, batik and svg tags) then you might want to consider this-
removing all clip-path="url(#highcharts-xxxxxxx-xx)" from <g> tags did the trick for me.
My code is referring to the missing code part "SVGReplacedElementFactory".
And I use it like this:
renderer
.getSharedContext()
.setReplacedElementFactory( new B64ImgReplacedElementFactory() );
import com.itextpdf.text.BadElementException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.codec.Base64;
import org.apache.batik.transcoder.TranscoderException;
import org.apache.batik.transcoder.TranscoderInput;
import org.apache.batik.transcoder.TranscoderOutput;
import org.apache.batik.transcoder.image.JPEGTranscoder;
import org.apache.batik.transcoder.image.PNGTranscoder;
import org.w3c.dom.Element;
import org.xhtmlrenderer.extend.FSImage;
import org.xhtmlrenderer.extend.ReplacedElement;
import org.xhtmlrenderer.extend.ReplacedElementFactory;
import org.xhtmlrenderer.extend.UserAgentCallback;
import org.xhtmlrenderer.layout.LayoutContext;
import org.xhtmlrenderer.pdf.ITextFSImage;
import org.xhtmlrenderer.pdf.ITextImageElement;
import org.xhtmlrenderer.render.BlockBox;
import org.xhtmlrenderer.simple.extend.FormSubmissionListener;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
public class B64ImgReplacedElementFactory implements ReplacedElementFactory
{
public ReplacedElement createReplacedElement(LayoutContext c, BlockBox box, UserAgentCallback uac, int cssWidth, int cssHeight)
{
Element e = box.getElement();
if(e == null)
{
return null;
}
String nodeName = e.getNodeName();
if(nodeName.equals("img"))
{
String attribute = e.getAttribute("src");
FSImage fsImage;
try
{
fsImage = buildImage(attribute, uac);
}
catch(BadElementException e1)
{
fsImage = null;
}
catch(IOException e1)
{
fsImage = null;
}
if(fsImage != null)
{
if(cssWidth != -1 || cssHeight != -1)
{
fsImage.scale(cssWidth, cssHeight);
}
return new ITextImageElement(fsImage);
}
}
return null;
}
protected FSImage buildImage(String srcAttr, UserAgentCallback uac) throws IOException, BadElementException
{
if(srcAttr.startsWith("data:image/"))
{
// BASE64Decoder decoder = new BASE64Decoder();
// byte[] decodedBytes = decoder.decodeBuffer(b64encoded);
// byte[] decodedBytes = B64Decoder.decode(b64encoded);
byte[] decodedBytes = Base64.decode(srcAttr.substring(srcAttr.indexOf("base64,") + "base64,".length(), srcAttr.length()));
return new ITextFSImage(Image.getInstance(decodedBytes));
}
FSImage fsImage = uac.getImageResource(srcAttr).getImage();
if(fsImage == null)
{
return convertToPNG(srcAttr);
}
return null;
}
private FSImage convertToPNG(String srcAttr) throws IOException, BadElementException
{
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PNGTranscoder t = new PNGTranscoder();
// t.addTranscodingHint(JPEGTranscoder.KEY_PIXEL_UNIT_TO_MILLIMETER, (25.4f / 72f));
t.addTranscodingHint(JPEGTranscoder.KEY_WIDTH, 4000.0F);
t.addTranscodingHint(JPEGTranscoder.KEY_HEIGHT, 4000.0F);
try
{
t.transcode(
new TranscoderInput(srcAttr),
new TranscoderOutput(byteArrayOutputStream)
);
}
catch(TranscoderException e)
{
e.printStackTrace();
}
byteArrayOutputStream.flush();
byteArrayOutputStream.close();
return new ITextFSImage(Image.getInstance(byteArrayOutputStream.toByteArray()));
}
public void remove(Element e)
{
}
#Override
public void setFormSubmissionListener(FormSubmissionListener formSubmissionListener)
{
}
public void reset()
{
}
}
My code is
import java.awt.image.BufferedImage;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import org.dcm4che2.imageio.plugins.dcm.DicomImageReadParam;
import com.sun.image.codec.jpeg.JPEGCodec;
import com.sun.image.codec.jpeg.JPEGImageEncoder;
public class DicomToJpeg {
public static void main(String args[]) throws IOException, Exception
{
dicomToJpeg("d:/F74AFBC7");
}
public static void dicomToJpeg(String args) throws IOException, Exception {
// TODO Auto-generated method stub
try
{
File myDicomFile = new File(args);
BufferedImage myJpegImage = null;
Iterator<ImageReader> iter = ImageIO.getImageReadersByFormatName("DICOM");
ImageReader reader = (ImageReader) iter.next();
DicomImageReadParam param = null;
try{
param = (DicomImageReadParam) reader.getDefaultReadParam();
}
catch (Exception e) {
e.printStackTrace();
}
ImageInputStream iis=ImageIO.createImageInputStream(myDicomFile);
reader.setInput(iis, false);
myJpegImage = reader.read(0, param);
iis.close();
if (myJpegImage == null) {
System.out.println("\nError: couldn't read dicom image!");
return;
}
File myJpegFile = new File("d:/demo.jpg");
OutputStream output = new BufferedOutputStream(new FileOutputStream(myJpegFile));
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(output);
encoder.encode(myJpegImage);
System.out.println("Image Create successufully");
output.close();
}
catch(IOException e){
System.out.println("\nError: couldn't read dicom image!"+ e.getMessage());
return;
}
}
}
When i execute in java project using eclipse it work fine...
But when i execute using web application and in this i call it from controller page like
DicomToJpeg.dicomToJpeg("d:/F74AFBC7");
then it gives error like...
java.util.NoSuchElementException
at javax.imageio.spi.FilterIterator.next(Unknown Source)
at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
at com.lifecare.controller.DicomToJpeg.dicomToJpeg(DicomToJpeg.java:32)
How to solve this error please help me....
The javadoc to ImageIO.getImageREadersByFormatName says:
Returns an Iterator containing all currently registered ImageReaders
that claim to be able to decode the named format.
If you access the iterator without checking if it has an element, you will get an exception.
Since it runs in you IDE and not on the server, you may have a look if the image readers for the DICOM is in the application's classpath on the server.
However, I would also like to know how do you call the above class. Is it from a servlet?
I solved it by calling ImageIO.scanForPlugins() before ImageIO.getImageReadersByFormatName()
ImageIO.scanForPlugins()
Iterator<ImageReader> iter = ImageIO.getImageReadersByFormatName("DICOM");
ImageReader reader = (ImageReader) iter.next();
This works perfectly on servlets
Try like this
BufferImage bi = ImageIO.read(dcm file name with path);
ImageIO.write(enter pathname with filename, format);
I have some pdf files, Using pdfbox i have converted them into text and stored into text files, Now from the text files i want to remove
Hyperlinks
All special characters
Blank lines
headers footers of pdf files
“1)”,“2)”, “a)”, “bullets”, etc.
I want to get valid text line by line like this:
We propose OntoGain, a method for ontology learning from multi-word concept terms extracted from plain text. OntoGain follows an ontology learning process dened by distinct processing layers. Building upon plain term extraction a con-cept hierarchy is formed by clustering the extracted concepts. The derived term taxonomy is then enriched with non-taxonomic relations. Several dierent state-of-the-art methods have been examined for implementing each layer. OntoGain is based upon multi-word term concepts, as multi-word or compound terms are vested with more solid and distinctive semantics than plain single word terms. We opted for a hierarchical clustering method and Formal Concept Analysis (FCA) algorithm for building the term taxonomy. Furthermore an association rule algorithm is applied for revealing non-taxonomic relations. A method which tries to carry out the most appropriate generalization level between a relation's concepts is also implemented. To show proof of concept, a system prototype is implemented. The OntoGain allows transformation of the derived ontology into OWL using Jena Semantic Web Frame-work1. OntoGain is applied on two separate data sources a medical and computer corpus and its results are compared with similar results obtained by Text2Onto, a state-of-the-art-ontology learning method. The analysis of 11.5 CCD1.1 results indicates that OntoGain performs better than Text2Onto in terms of precision extracts more correct concepts while being more selective extracts fewer but more reasonable concepts.
How can I achieve this?
Using pdfbox we can achive this
Example :
public static void main(String args[]) {
PDFParser parser = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
PDFTextStripper pdfStripper;
String parsedText;
String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
File file = new File(fileName);
try {
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
} catch (Exception e) {
e.printStackTrace();
try {
if (cosDoc != null)
cosDoc.close();
if (pdDoc != null)
pdDoc.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
}
Hi we can extract the pdf files using Apache Tika
The Example is :
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
public class WebPagePdfExtractor {
public Map<String, Object> processRecord(String url) {
DefaultHttpClient httpclient = new DefaultHttpClient();
Map<String, Object> map = new HashMap<String, Object>();
try {
HttpGet httpGet = new HttpGet(url);
HttpResponse response = httpclient.execute(httpGet);
HttpEntity entity = response.getEntity();
InputStream input = null;
if (entity != null) {
try {
input = entity.getContent();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parser.parse(input, handler, metadata, parseContext);
map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
map.put("title", metadata.get(TikaCoreProperties.TITLE));
map.put("pageCount", metadata.get("xmpTPg:NPages"));
map.put("status_code", response.getStatusLine().getStatusCode() + "");
} catch (Exception e) {
e.printStackTrace();
} finally {
if (input != null) {
try {
input.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
} catch (Exception exception) {
exception.printStackTrace();
}
return map;
}
public static void main(String arg[]) {
WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://math.about.com/library/q20.pdf");
System.out.println(extractedMap.get("text"));
}
}
You can use iText for do such things
//iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
for example:
try {
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
String str=PdfTextExtractor.getTextFromPage(reader, 2); //Extracting the content from a particular page.
System.out.println(str);
reader.close();
} catch (Exception e) {
System.out.println(e);
}
another one
try {
PdfReader reader = new PdfReader("c:/temp/test.pdf");
System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
String page = PdfTextExtractor.getTextFromPage(reader, 2);
System.out.println("Page Content:\n\n"+page+"\n\n");
System.out.println("Is this document tampered: "+reader.isTampered());
System.out.println("Is this document encrypted: "+reader.isEncrypted());
} catch (IOException e) {
e.printStackTrace();
}
the above examples can only extract the text, but you need to do some more to remove hyperlinks, bullets, heading & numbers.
For the newer versions of Apache pdfbox. Here is the example from the original source
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.pdfbox.examples.util;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;
/**
* This is a simple text extraction example to get started. For more advance usage, see the
* ExtractTextByArea and the DrawPrintTextLocations examples in this subproject, as well as the
* ExtractText tool in the tools subproject.
*
* #author Tilman Hausherr
*/
public class ExtractTextSimple
{
private ExtractTextSimple()
{
// example class should not be instantiated
}
/**
* This will print the documents text page by page.
*
* #param args The command line arguments.
*
* #throws IOException If there is an error parsing or extracting the document.
*/
public static void main(String[] args) throws IOException
{
if (args.length != 1)
{
usage();
}
try (PDDocument document = PDDocument.load(new File(args[0])))
{
AccessPermission ap = document.getCurrentAccessPermission();
if (!ap.canExtractContent())
{
throw new IOException("You do not have permission to extract text");
}
PDFTextStripper stripper = new PDFTextStripper();
// This example uses sorting, but in some cases it is more useful to switch it off,
// e.g. in some files with columns where the PDF content stream respects the
// column order.
stripper.setSortByPosition(true);
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
// Set the page interval to extract. If you don't, then all pages would be extracted.
stripper.setStartPage(p);
stripper.setEndPage(p);
// let the magic happen
String text = stripper.getText(document);
// do some nice output with a header
String pageStr = String.format("page %d:", p);
System.out.println(pageStr);
for (int i = 0; i < pageStr.length(); ++i)
{
System.out.print("-");
}
System.out.println();
System.out.println(text.trim());
System.out.println();
// If the extracted text is empty or gibberish, please try extracting text
// with Adobe Reader first before asking for help. Also read the FAQ
// on the website:
// https://pdfbox.apache.org/2.0/faq.html#text-extraction
}
}
}
/**
* This will print the usage for this document.
*/
private static void usage()
{
System.err.println("Usage: java " + ExtractTextSimple.class.getName() + " <input-pdf>");
System.exit(-1);
}
}
Extracting all keywords from PDF(from a web page) file on your local machine or Base64 encoded string:
import org.apache.commons.codec.binary.Base64;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class WebPagePdfExtractor {
public static void main(String arg[]) {
WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
System.out.println("From file: " + webPagePdfExtractor.processRecord(createByteArray()).get("text"));
System.out.println("From string: " + webPagePdfExtractor.processRecord(getArrayFromBase64EncodedString()).get("text"));
}
public Map<String, Object> processRecord(byte[] byteArray) {
Map<String, Object> map = new HashMap<>();
try {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(false);
stripper.setShouldSeparateByBeads(true);
PDDocument document = PDDocument.load(byteArray);
String text = stripper.getText(document);
map.put("text", text.replaceAll("\n|\r|\t", " "));
} catch (Exception exception) {
exception.printStackTrace();
}
return map;
}
private static byte[] getArrayFromBase64EncodedString() {
String encodedContent = "data:application/pdf;base64,JVBERi0xLjMKJcTl8uXrp/Og0MTGCjQgMCBvYmoKPDwgL0xlbmd0aCA1IDAgUiAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeAGF0E0OgjAQBeA9p3hL3UCHlha2Gg9A0sS1AepPxIDl/rFFErVESDddvPlm8nqU6EFpzARjBCVkLHNkipBzPBsc8UCyt4TKgmCr/9HI+GDqg2x8Luzk8UtfYwX5DVWLnQaLmd+qHTsF3V5QEekWidZuDNpgc7L1FvqGg35fOzPlqslFYJrzZdnkq6YI77TXtrs3GBo7oKvNss9mfhT0IAV+e6CUL5pSTWb0t1tVBKbI5McsXxNmciYKZW5kc3RyZWFtCmVuZG9iago1IDAgb2JqCjE4NQplbmRvYmoKMiAwIG9iago8PCAvVHlwZSAvUGFnZSAvUGFyZW50IDMgMCBSIC9SZXNvdXJjZXMgNiAwIFIgL0NvbnRlbnRzIDQgMCBSIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdC" +
"j4+CmVuZG9iago2IDAgb2JqCjw8IC9Qcm9jU2V0IFsgL1BERiAvVGV4dCBdIC9Db2xvclNwYWNlIDw8IC9DczEgNyAwIFIgL0NzMiA4IDAgUiA+PiAvRm9udCA8PAovVFQxIDkgMCBSID4+ID4+CmVuZG9iagoxMCAwIG9iago8PCAvTGVuZ3RoIDExIDAgUiAvTiAxIC9BbHRlcm5hdGUgL0RldmljZUdyYXkgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCngBhVVdaBxVFD67c2cDEgcftA0ttIM/bQnpMolWE4u12026SRO362ZTmyrKdHY2O81kZpyZ3SahT6XgmxYE6augPsaCCLYqNi/2paXFkko1DwoRWowgKH1S8Dsz22R2QTLDnfnuueeee8537rmXqOtv3fPstEo054R+oZybPjl9Su26TWlSqJvw6Ebg5UqlCcaO65j8b38e3qUUS+7sZ1vtY1v25KoZGNC6huZWA2OOKKURZWqG54dEXZcgHzwbeoxvAz85WynngdeAldZcQHqqYDqmbxlqwdcX1JLv1i" +
"w76etW42xjy2fObrCv/OxG6w5mJ8fx74XPF0xnahJ4H/CSoY8w7gO+27ROFGOcTnvhkXKsn842ZqdyLfnJmn90qiW/UG+MMs4SpZcW65U3gJ8AXnVOF4+39Ndn3XG200Mk9RhB/hTws8Ba3RzjPKnAFd8tsz7Lw6o5PAL8MvAlKxyrAMO+9EPQnGQ5sKDFep79xFoie0Y/VgLeBnzItAu8FuyIiheW2OYg8LxjF3ktxC4um0EUL2IXP4X1ymisL6dDv8JznyaS99Sso2PA4EQerfujLIc/cujZ0d56EXjJb5Q59j3Aa7o/UgCGzcxjVX2YeX4BeIBOpHQyyaXT+Brk0L+INyCLmhHyyMdYDX2bCtBw0Hz0DGgVgHRaAColtEz0WCeeo1IVPZVmollBhNjK/ahvUH7Xp9SAtE7rkNaBXqNfIsk8/Upz6OchbWBspsNuHl44tAgP2BO2+aBl0xXbhSaeRzsoJsQrYlAMkSpeFYfFITEM6ZA4GM2JvU/6zn4+2LD0LtZN+r4MDkKsZ8MzB6xwNAE8+AfrzkaaCbYu7mjs87yP3j/vv2MZtz74s429APoxJ7/BogtrJiXmXj/3TU/CQ3VFfPXWne7r5+h4MktR3qqdWZLX5PvyCr735NWkDflneRXvvbZcPcoL/5O5zSFGO5LNQc48m1G0ccYbwCG4qUVz9rdZTLLptmK0YMlClJ2ruP/LCfPDPLexUnMu7vC8tz9jNs33ig+LdL5Pu6y" +
"ta59oP2p/aCvax0C/Sx9KX0rfSlekq9INUqVr0rL0nfS99Ln0NXpfQLosXenYSXHsG7sHfsZ71mjtMGaGsxQQ88LazApLH/F3BmOb+TOh1V4Dnbt/Yy3liLJTeUYZVnYrzykTSq9yQDmsbFcG0PqVUWUvRnZusGRjPc6AhX+SZ4umI67iPLFXdbDnw0sd76ZfXMPWhjXYST0Ontnapg6vEVe/FVVjvDtdnAY6TSFii84ich86nB8nqv7O2VyTODVSb+KUsMQu0S/GWjWYEwdQheNt9TjIVZoZyQxncqRmejNDmf7MMcZRrNH5ktmL0SF8RxLeM8sx/5s1xGcY7x3mqAlso4dbKzTncd8R5V1vwbdm6qE6oGkvqTlcr6Y65hjZPlW3bTUaClTfDEy/aVazxHc3zyP66/XoTk5tu2E0/GYso1TqJtF/t4+TNAplbmRzdHJlYW0KZW5kb2JqCjExIDAgb2JqCjExMTYKZW5kb2JqCjcgMCBvYmoKWyAvSUNDQmFzZWQgMTAgMCBSIF0KZW5kb2JqCjEyIDAgb2JqCjw8IC9MZW5ndGgg" +
"MTMgMCBSIC9OIDMgL0FsdGVybmF0ZSAvRGV2aWNlUkdCIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4AYVVW4gbVRj+kznJCrvO09rVLaRDvXQpu0u2Fd2ltJpbk7RrGrLZ1RZBs5OTZMzsJM5M0gt9KoLii6u+SUG8vS0IgtJ6wdYH+1KpUFZ36yIoPrR4QSj0RbfxO5NkJllqm2XPfPP93/lv558ZooG1Qr2u+xWiJcM2c8mo8tzRY8rAOvnpIRqkURosqFY9ks3OEn5CK679v1s/kE8wVyfubO9Xb7kbLHJLJfLdB75WtNQl4BNEgbNq3bSJBobBTx+36wKLHIZNJAj8osDlNoaNhhfb+DVHk8/FoDkLLKuVQhF4BXh8sYcv9+B2DlDAT5Ib3NRURfQia9ZKms4dQ3u5h7lHeTe4pDdQs/PbgXXIqs4dxnUMtb9SLMQFngReUQuJOeBHgK81tYVMB9+u29Ec8GNE/p2N6nwEeDdwqmQenAeGH79ZaaS6+J1Tlfyz4LeB/8ZYzBzp7F1TrRh6STvB367wtOhviEhSN" +
"DudB4Yf6YBZywk9cpBKRR5PAI8Dv16tHRY5wKf0mdWcE7zIZ+1UJSbyFPzllwqHssCjwL9yPSn0iCX9W7eznRxYyNAzIi5isTi3nHrhh4XsSj4FHnGZbpv5zl62XNIOpjv6TypmSvBi77W67swocgv4zUZO1I5YgcmCmUgCw2cgy4150U+Bm7TgKxCnGi1iVcmgTVIoR0mK4lonE5YSaaSD4bByMBx3Xc2Es8+iKniNmo7Nwpp1lO2dXa1CZbAGXXe0KsVCH1EDnir0B9iK61OhGO4a4Mr/46edy42OnxobYWG2F//72Czbz6bZDCnsKfY0O8DiYGfYPtd3Fnu6FYl8biBK28/LiMgd3QJqv4gabSpg/QWKGlmuh76uLI82xjzLGfMFTb3yxt89vdKws+oqJvo6euRePQ/8FrgeWMW6HthwfSiBnwIb+FtHb7xaap6902VxUhpOtNan23oWXVUElerOziV0QUPNvKfmiV4fl05/+aAXbZWde/7q0KXTJWN51GNFF/irmVsZOjPuseEfw3+GV8PvhT8M/y69LX0qfSWdlz6XLpMiXZ" +
"AuSl9L30ofS1+4+rvNkHv2JDIXcyXyFtPVrbC315hYOSpvlx+W4/IO+VF51lUp8og8JafkXbBsd8/Nm2+lt3L05Siidftz51jiWdFcTzgD3/2YAM2L2DcD88hYo+PwaaLfYt4MOglt75PXqYiF2BRLb5nuaTHzXd/BRDAejJAS3B2cCU4FDwncfZaDu2CbwZrozQ3z4Sr6KuU2PyG+JxSr1U+aWrliK3vC4SeVCD59XEkb6uS4UtB1xTFZisktbjZ5cZLEd1PsI7qZc76Hvm1XPM5+hmj/X3j3fe9xxxpEKxbRyOMeN4Z35QPvEp17Qm2YzbY/8vm+I7JKe/c4976hKN5fP7daN/EeG3iLaPPNVuuf91utzQ/gf4Pogv4foJ98VQplbmRzdHJlYW0KZW5kb2JqCjEzIDAgb2JqCjEwNzkKZW5kb2JqCjggMCBvYmoKWyAvSUNDQmFzZWQgMTIgMCBSIF0KZW5kb2JqCjMgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdIC9Db3VudCAxIC9LaWR" +
"zIFsgMiAwIFIgXSA+PgplbmRvYmoKMTQgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDMgMCBSID4+CmVuZG9iago5IDAgb2JqCjw8IC9UeXBlIC9Gb250IC9TdWJ0eXBlIC9UcnVlVHlwZSAvQmFzZUZvbnQgL0NOVFpYVStNZW5" +
"sby1SZWd1bGFyIC9Gb250RGVzY3JpcHRvcgoxNSAwIFIgL0VuY29kaW5nIC9NYWNSb21hbkVuY29kaW5nIC9GaXJzdENoYXIgMzIgL0xhc3RDaGFyIDExNiAvV2lkdGhzIFsgNjAyCjAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDYwMiA2MDIgNjAyIDYwMiA2MDIgMCAwIDAgMCAwIDAgMCAwIDAKMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDAgMAo2MDIgNjAyIDYwMiA2MDIgNjAyIDYwMiAwIDAgNjAyIDYwMiAwIDAgNjAyIDAgMCA2MDIgNjAyIF0gPj4KZW5kb2JqCjE1IDAgb2JqCjw8IC9UeXBlIC9Gb250RGVzY3JpcHRvciAvRm9udE5hbWUgL0NOVFpYVStNZW5sby1SZWd1bGFyIC9GbGFncyAzMyAvRm9udEJCb3gKWy01NTggLTM3NSA3MTggMTA0MV0g" +
"L0l0YWxpY0FuZ2xlIDAgL0FzY2VudCA5MjggL0Rlc2NlbnQgLTIzNiAvQ2FwSGVpZ2h0IDcyOQovU3RlbVYgOTkgL1hIZWlnaHQgNTQ3…/ZfICj5JcLdi/ATmQZKogDPg0lIDBunI0ZGOB1OB/Lpyce1TbJqCpBThycVs3GyQPZSLKexbMGyFss8LF4sNb2lElu5HPlJ2439G1jKsbRh6cTyPNpx8I6AFxa8P+xD2E4e/G+5PqJ/8aDzERFvGBJR/WLkfwcM3kRCiZpokDMdxhn5MeD9Rn5MSm0mYUpLSF98J5HXaQgtpJvoDWGesEe4C4NgK3woWsQ88RgzszXsMM4WyALeIC5gO5B/FYk/pNxVCJGoZT8NYc8LIknrONeVQYznus51pYeZHCaXw+RYIJLAEogJfMEbVPrvv31S6icvTMlp1EQhO41cOuXb0EEkSYkmGaMXSzuIfhCKAA4Y/YScTs9ASizblWVyWB1UT4fwNfSp9+mgwLFd4oI3D++9++kuheYWpOnEeBhLJrv7kVg" +
"Xk1hkVDRExLgkieUZTTt1jZYGkTTiXU8tULUtIsEIfeKMgY5AV3u7yZyTQdK6Mm923fwgHe1GZWTfmCJy5CYi05PgwqWzB5HBw2n2wL7OBEmVPZxmZYpWi6TSU7pM2BNY1kojs0sLN1bPOLZ4/nuzL1CNp/S+zt27dx+lA4Y/2/hA1fq8kR9kZF57u6R96YgvZRmsvfe5OBj5TSKjkN+wRqu6LrRF1yjF19lbYhudDVKTdVe/8DAClihbX6MNEuItofH9kF9k+FwXMofC7rqCDHcZu293G7tz0qmNWi2iM6FvYrYN2RuEvCbT7GDnZ0xDyMZt/Otb8z+aP+/dOS17927ZurVu24ZVnrYFz7w95jxlayE+8b3Nf/m6b5/j2QMb1v26qeXZ8iWVSUkH7fYLb1bKBw7aA57LYgVGYAGtLM8dT3WgIwC6PAIaVSOjsCaUatXEFiJKBm0fvTEQODe0K9Mki/mK3DPnBOUsHkchH/ckhFIHZJmyrE6T0+TIFi7xfvRjx/X33jves5rFBb6Gk4GsHXwbLX1Hlp0XZZeKa8eRYe4EURUX3agy" +
"1RnXWxp1QiNZo2tS7baBjUTYqDqBGONtspI7UEwosSsoMUVevAM5CEO9mmRVEquF/ExwsrxOCTd7OpKnpXxFjfzzO8uPTnz44Oydb7bunLy1kHXu5huMBt59vYvfsNtPZmb4tjfvdblQGjXI23jUayTpg9w5VfFRjer4RqP6DRGPs/ViY3iDscmVYCN9dQkqKZaGxbuMga6uwBXZeYLq/MKI6jShPq0DqDNBUBg0Wy2C0y6YjMSRGU4TJKslPKhYuJS7fkL7u+m7F33yzc3PeOBb6qSWsZv4Zys3bVq5as0atu+gK5Ff4ldLH+d3vvuW36bL6Ab6LF0X37Pw4I4dB//4+z0+RZ8y306xEuNGP7LI3V+tItF2baRBRfZHqurNjjr7O3H1fdrMTZE6GilG6dWSNt8uStbh/Y03u9AkM1G3snI7rtwMyCKWd2DKMeegZ6W749Lj0+3pjvSEZtJMm4VmdbNme3hzRHNkc1RztH4m7rJ3Q4OzB5uc2XpE9M0eOOh+mi1LoNfdwlGfQtuwV159duGWPfTAgfv/VP3GBz98d4eu2jirfca81uK6o8P62oWsJxaXLT57sN/4npUtpY/8eXvr4bhVzwwa6E9MnDIlc2PQditxr2bMIowYLdLd0YxYouv1lvqQJn0bfREiRCIJo0xmzeg43Ju8NTk2KIaDe0ynpqxeHlEd5ixUx790gazCdL9/QFPpiWvX3y/byg1ramr" +
"q6mpq1sAZYeQ/utYVTaP3Uys10cHTuOaj8xfPdV44L/uSzE8Jyt6K/GQFIyKGKSUIChgEY09jScPcUUJf02C4DCNRShuL32qS0zOo1WGVZIMYbEXZ2QmaSVamWRUUnlgS+DzknT3F7eWPHpnBf+Dnqf3GR3d82g1ran4XItRPl744dl/OW8nJNIeGUS118782Lt3lWyT72RH08USUUxgZiFIyUm3IfonWkxf10mG1EKYioUzSGTQW47mhHYGhHZmKAVzJDKD60b9lA0aJxFHZyWSndqBKs8TEM3Mn0JV8hZ930uRdf5IsTZPnz/UG0uCMd6JfTq9lefDRornXFke7E6O0tpjEUDDXhYWH1tvC6w2AlmgzHEk63D8xikjaUZLZ7BiNhtjRqy10846gERo7u9EC0RKRGyV0Bx0nDL3pxzg5TJANrleZEdlZMH31ytXrvWtWrPZ3Xx3fUjSneeTmNSlbyjuuX+9Y2JDmF3JOffzxqVOfnuefBXggNmb/gJTtvpCqWQ/TIVSFZ+mQh6ZvkPcRlF+MIr8Ud2SoHvC3OKne1KZ9UU0FiYzVhUqaQgvaGIoMTWwoxnKM67KFOU1BZrGTpfh/uBhz4LEnVtb5/RmvL3ljl7C/Z6ywv3H9W2/0rJYsPTtK5l6W5daN+qqWDHh+6oicYqKpyD9kyocpRTtSXcR8Em1J7muwlW1LexrtSs5JZLuScwa5pXgGyx/hdZxIeA" +
"JTPMvxBMSaYoSmQ+nHtDywiJbzyzTe7xcfDmR5vTBcyPsKv7yBPEyXopGDxCAHOoUoppRILPQiribnP/IqOjlLka3XEo5eIbu8bCTCqRmej6+99ib/lF6im3/13ItnD8OtF4LyxN+YxQq0iwTyqjsx0mwIFVUkLkZSWbX1dmiLORxlVBGTIWSC" +
"NNE0wTAxNnJCdIHTeHOcTzt1nM80dUbxARJ9r/0+T2BoAA0leOYPHXrlpnIwoYmg8NPdo9LFdJYupavSQ9JD09Xpmtzw3IjcyNyo3OjcmNzY3LhcWzVUi9WsWqpWVYdUh1arqzXecG+EN9Ib5Y32xnhjvXFem5POpPLJEh5Ff6LMf2vVqgwKOx" +
"IeHbu64vXswkn3v54zdkzOzp2Oubnjy6B7dMEZfqlnubDymyWVX/SsEFbeWCy3YknJ0NxCWddt/CFxKspCjmFZ7tgfY1ibvokegcNxGL9GKZGsUI5imYqJoV/8GMZcsm0pXJhNRtkbfuofdPmBA3IYu/rV+/Oa6I3VNatqa1fVrF7Xc1xSe4um" +
"8Xf5df53fnwavfXR+Qud5y5iFJPtvRPjmIQ8JZJlbrdOK+g1EfG2kFBBpY6wxdvy4myRao0tXrSSOtouWuqs7ZH1JrHe1WZqSopTa+JjVOSBGEk/RiVZEgqSgu58RXZfObDIh6OR3+o23uo2RyjHipKn6ZU8Tak9CcETUz5L4pVkSPq3k2cPTBMGYPo2CFUCJx9oLqqqfPitsWvXdX1YtP+x+YemPrvqVkjBy789//70FjFn34ABk4vGjXXqo7dVtbQ6nW3Z2XM91RmCPn7jilf+4FD2incAMYS9hLExwx2pZyEG2E9M9HDIfnWIJhTzYclo1v88MnbdHNohp0Cyg8sx8Wdmb8Lr7nY+a9ayU5dP7ZZDI3uJH/b2NP9qzsaWE0KJlw5HnSvPvefwlwRZ2r988CqMnmzBUyScRGD+EUXyySgymowhY8k4Mp48gLl+EXmITFM+pPjvhSANSb5Ej5w4dXrxg8kTyhYtrEidUjZ/2cLZTxLyT2S78dEKZW5kc3RyZWFtCmVuZG9iagoxNyAwIG9iago0ODAxCmVuZG9iagoxOCAwIG9iagooKQplbmRvYmoKMTkgMCBvYmoKKE1hYyBPUyBYIDEwLjEyLjYgUXVhcnR6IFBERkNvbnRleHQpCmVuZG9iagoyMCAwIG9iagooKQplbmRvYmoKMjEgMCBvYmoKKCkKZW5kb2JqCjIyIDAgb2JqCihUZXh0TWF0ZSkKZW5kb2JqCjIzIDAgb2JqCihEOjIwMTcxMjEyMTMwMzQ4WjAwJzAwJykKZW5kb2JqCjI0IDAgb2JqCigpCmVuZG9iagoyNSAwIG9iagpbICgpIF0KZW5kb2JqCjEgMCBvYmoKPDwgL1RpdGxlIDE4IDAgUiAvQXV0aG9yIDIwIDAgUiAvU3ViamVjdCAyMSAwIFIgL1Byb2R1Y2VyIDE5IDAgUiAvQ3JlYXRvcgoyMiAwIFIgL0NyZWF0aW9uRGF0ZSAyMyAwIFIgL01vZERhdGUgMjMgMCBSIC9LZXl3b3JkcyAyNCAwIFIgL0FBUEw6S2V5d29yZHMKMjUgMCBSID4+CmVuZG9iagp4cmVmCjAgMjYKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDA4OTI5IDAwMDAwIG4gCjAwMDAwMDAzMDAgMDAwMDAgbiAKMDAwMDAwMzAyOCAwMDAwMCBuIAowMDAwMDAwMDIyIDAwMDAwIG4gCjAwMDAwMDAyODEgMDAwMDAgbiAKMDAwMDAwMDQwNCAwMDAwMCBuIAowMDAwMDAxNzUzIDAwMDAwIG4gCjAwMDAwMDI5OTIgMDAwMDAgbiAKMDAwMDAwMzE2MSAwMDAwMCBuIAowMDAwMDAwNTEyIDAwMDAwIG4gCjAwMDAwMDE3MzIgMDAwMDAgbiAKMDAwMDAwMTc4OSAwMDAwMCBuIAowMDAwMDAyOTcxIDAwMDAwIG4gCjAwMDAwMDMxMTEgMDAwMDAgbiAKMDAwMDAwMzU0NCAwMDAwMCB" +
"uIAowMDAwMDAzNzk2IDAwMDAwIG4gCjAwMDAwMDg2ODcgMDAwMDAgbiAKMDAwMDAwODcwOCAwMDAwMCBuIAowMDAwMDA4NzI3IDAwMDAwIG4gCjAwMDAwMDg3ODAgMDAwMDAgbiAKMDAwMDAwODc5OSAwMDAwMCBuIAowMDAwMDA4ODE4IDAwMDAwIG4gCjAwMDAwMDg4NDUgMDAwMDAgbiAKMDAwMDAwODg4NyAwMDAwMCBuIAowMDAwMDA4OTA2IDAwMDAwIG4gCnRyYWlsZXIKPDwgL1NpemUgMjYgL1Jvb3QgMTQgMCBSIC9JbmZvIDEgMCBSIC9JRCBbIDxkYjc4M2NhNDM2Mzg4YzI5ZDc5MDQ2NzY3NjUxNjE3OT4KPGRiNzgzY2E0MzYzODhjMjlkNzkwNDY3Njc2NTE2MTc5PiBdID4+CnN0YXJ0eHJlZgo5MTA0CiUlRU9GCg==";
String content = encodedContent.substring("data:application/pdf;base64," .length());
return Base64.decodeBase64(content);
}
public static byte[] createByteArray() {
String pathToBinaryData = "/bla-bla/src/main/resources/small.pdf";
File file = new File(pathToBinaryData);
if (!file.exists()) {
System.out.println(" could not be found in folder " + pathToBinaryData);
return null;
}
FileInputStream fin = null;
try {
fin = new FileInputStream(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
byte fileContent[] = new byte[(int) file.length()];
try {
fin.read(fileContent);
} catch (IOException e) {
e.printStackTrace();
}
return fileContent;
}
}
I have images that are stored in the database as a BLOB. Now I can display them in my jsf pages using Richfaces mediaOutput tag.
Is it possible for images to have path like "/images/image.jpg" while images are stored in the database.
While searching for an answer I came around something like this:
#GET
#Path("/files/{filename}")
#Produces(MediaType.WILDCARD)
Best regards,
Ilya Sidorovich
You could write a servlet picking up every request to /image/* or something that suits you.
And in your servlet you retrieve the correct data from your database via request parameters.
And you write out the data via
response.getOutputStream().write(content);
(content being the bytearray of you image)
Thank you roel and BalusC!
If anyone comes around this issue, here is what you can do.
package org.gicm.test;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.inject.Inject;
import java.io.BufferedOutputStream;
import java.io.BufferedInputStream;
import org.gicm.cms.CMSDao;
import org.gicm.model.UploadedImage;
#WebServlet("/images/*")
public class TestServlet extends HttpServlet {
#Inject
private CMSDao cms;
#Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
String imageId = String.valueOf(request.getPathInfo().substring(1)); // Gets string that goes after "/images/".
UploadedImage image = cms.findImage(imageId); // Get Image from DB.
response.setHeader("Content-Type", getServletContext().getMimeType(image.getName()));
response.setHeader("Content-Disposition", "inline; filename=\"" + image.getName() + "\"");
BufferedInputStream input = null;
BufferedOutputStream output = null;
try {
input = new BufferedInputStream(image.getData()); // Creates buffered input stream.
output = new BufferedOutputStream(response.getOutputStream());
byte[] buffer = new byte[8192];
for (int length = 0; (length = input.read(buffer)) > 0;) {
output.write(buffer, 0, length);
}
} finally {
if (output != null) try { output.close(); } catch (IOException logOrIgnore) {}
if (input != null) try { input.close(); } catch (IOException logOrIgnore) {}
}
}
}
I want to have a program that reads metadata from an MP3 file. My program should also able to edit these metadata. What can I do?
I got to search out for some open source code. But they have code; but not simplified idea for my job they are going to do.
When I read further I found the metadata is stored in the MP3 file itself. But I am yet not able to make a full idea of my baby program.
Any help will be appreciated; with a program or very idea (like an algorithm). :)
The last 128 bytes of a mp3 file contains meta data about the mp3 file., You can write a program to read the last 128 bytes...
UPDATE:
ID3v1 Implementation
The Information is stored in the last 128 bytes of an MP3. The Tag
has got the following fields, and the offsets given here, are from
0-127.
Field Length Offsets
Tag 3 0-2
Songname 30 3-32
Artist 30 33-62
Album 30 63-92
Year 4 93-96
Comment 30 97-126
Genre 1 127
WARINING- This is just an ugly way of getting metadata and it might not actually be there because the world has moved to id3v2. id3v1 is actually obsolete. Id3v2 is more complex than this, so ideally you should use existing libraries to read id3v2 data from mp3s . Just putting this out there.
You can use apache tika Java API for meta-data parsing from MP3 such as title, album, genre, duraion, composer, artist and etc.. required jars are tika-parsers-1.4, tika-core-1.4.
Sample Program:
package com.parse.mp3;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.mp3.Mp3Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class AudioParser {
/**
* #param args
*/
public static void main(String[] args) {
String fileLocation = "G:/asas/album/song.mp3";
try {
InputStream input = new FileInputStream(new File(fileLocation));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
// List all metadata
String[] metadataNames = metadata.names();
for(String name : metadataNames){
System.out.println(name + ": " + metadata.get(name));
}
// Retrieve the necessary info from metadata
// Names - title, xmpDM:artist etc. - mentioned below may differ based
System.out.println("----------------------------------------------");
System.out.println("Title: " + metadata.get("title"));
System.out.println("Artists: " + metadata.get("xmpDM:artist"));
System.out.println("Composer : "+metadata.get("xmpDM:composer"));
System.out.println("Genre : "+metadata.get("xmpDM:genre"));
System.out.println("Album : "+metadata.get("xmpDM:album"));
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
}
}
For J2ME(which is what I was struggling with), here's the code that worked for me..
import java.io.InputStream;
import javax.microedition.io.Connector;
import javax.microedition.io.file.FileConnection;
import javax.microedition.lcdui.*;
import javax.microedition.media.Manager;
import javax.microedition.media.Player;
import javax.microedition.media.control.MetaDataControl;
import javax.microedition.midlet.MIDlet;
public class MetaDataControlMIDlet extends MIDlet implements CommandListener {
private Display display = null;
private List list = new List("Message", List.IMPLICIT);
private Command exitCommand = new Command("Exit", Command.EXIT, 1);
private Alert alert = new Alert("Message");
private Player player = null;
public MetaDataControlMIDlet() {
display = Display.getDisplay(this);
alert.addCommand(exitCommand);
alert.setCommandListener(this);
list.addCommand(exitCommand);
list.setCommandListener(this);
//display.setCurrent(list);
}
public void startApp() {
try {
FileConnection connection = (FileConnection) Connector.open("file:///e:/breathe.mp3");
InputStream is = null;
is = connection.openInputStream();
player = Manager.createPlayer(is, "audio/mp3");
player.prefetch();
player.realize();
} catch (Exception e) {
alert.setString(e.getMessage());
display.setCurrent(alert);
e.printStackTrace();
}
if (player != null) {
MetaDataControl mControl = (MetaDataControl) player.getControl("javax.microedition.media.control.MetaDataControl");
if (mControl == null) {
alert.setString("No Meta Information");
display.setCurrent(alert);
} else {
String[] keys = mControl.getKeys();
for (int i = 0; i < keys.length; i++) {
list.append(keys[i] + " -- " + mControl.getKeyValue(keys[i]), null);
}
display.setCurrent(list);
}
}
}
public void commandAction(Command cmd, Displayable disp) {
if (cmd == exitCommand) {
notifyDestroyed();
}
}
public void pauseApp() {
}
public void destroyApp(boolean unconditional) {
}
}