PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML while preserving characteristics such as font size, boldness, etc. An example of this conversion is shown below:
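private void generateHTMLFromPDF(String filename) throws IOException, ParserConfigurationException {
    // try-with-resources closes both the document and the writer, even on failure
    try (PDDocument pdf = PDDocument.load(new File(filename));
         Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
        new PDFDomTree().writeText(pdf, output);
    }
}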
I'm trying to parse an existing PDF and extract these characteristics on a line-by-line basis, and I wonder whether there are any existing methods within PDF2Dom/PDFBox that parse these directly from the PDF.
Another approach would be to just use the HTML output and proceed from there, but that seems like an unnecessary detour.
Related
I'm currently parsing a PDF with the PDFBox library like this:
File f = new File("myFile.pdf");
PDDocument doc = PDDocument.load(f);
PDFTextStripper pdfStripper = new PDFLayoutTextStripper(); //For better printing in console I used a PDFLayoutTextStripper
String text = pdfStripper.getText(doc);
System.out.println(text);
doc.close();
I'm getting really nice-looking output. My PDF files will have a structure like this:
My super pdf file which is first one of all
someKey1 someValue1
someKey2 someValue2
someKey3 someValue3
....
someKey1 someValue4
someKey2 someValue5
someKey3 someValue6
....
some header over here
and here would be my next pair
someKey4 someValue7
....
Is there any library which could get me all values for a given key, e.g. someKey1? Or is there perhaps a better solution for parsing PDFs in Java?
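One possible approach that needs no additional library is to post-process the text returned by the stripper. The following is only a minimal sketch, assuming that every key/value pair sits on its own line as two whitespace-separated tokens, as in the sample structure above. The class name KeyValueCollector and the method collect are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyValueCollector {

    // Groups the second token by the first token for every line that consists
    // of exactly two whitespace-separated tokens. All other lines (titles,
    // longer headers, "....", blank lines) are skipped.
    public static Map<String, List<String>> collect(String text) {
        Map<String, List<String>> valuesByKey = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 2) {
                valuesByKey.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens[1]);
            }
        }
        return valuesByKey;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfStripper.getText(doc)
        String text = String.join("\n",
                "My super pdf file which is first one of all",
                "someKey1 someValue1",
                "someKey2 someValue2",
                "....",
                "someKey1 someValue4",
                "some header over here");
        System.out.println(collect(text).get("someKey1")); // prints [someValue1, someValue4]
    }
}
```

Note that this heuristic would mistake a two-word header for a pair and would miss values that contain spaces, so real documents will probably need a stricter rule, for example a known set of keys.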
I am trying to extract tabular data from a PDF, and the first step of my algorithm is to convert the PDF to an HTML document.
How can I convert a PDF to HTML using the Pdf2Dom library?
You can convert it using this:
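private void generateHTMLFromPDF(String filename) throws IOException, ParserConfigurationException {
    // try-with-resources closes both the document and the writer, even on failure
    try (PDDocument pdf = PDDocument.load(new File(filename));
         Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
        new PDFDomTree().writeText(pdf, output);
    }
}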
I need to split a PDF file into its pages and load each page separately into a byte[]. I use the iText library.
I download a file consisting of one page with this code:
public Document addPageInTheDocument(String namePage, MultipartFile pdfData, Long documentId) throws IOException {
notNull(namePage, INVALID_PARAMETRE);
notNull(pdfData, INVALID_PARAMETRE);
notNull(documentId, INVALID_PARAMETRE);
byte[] in = pdfData.getBytes(); // size file 88747
Page page = new Page(namePage);
Document document = new Document();
document.setId(documentId);
PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfData.getBytes()));
PdfDocument pdfDocument = new PdfDocument(reader);
if (pdfDocument.getNumberOfPages() != 1) {
throw new IllegalArgumentException();
}
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
page.setPageData(pdfDocument.getPage(1).getContentBytes());
return addPageInTheDocument(document, page);
}
I'm trying to restore the file with this code:
ByteBuffer byteContent = new ByteBuffer() ;
for (Map.Entry<String, Page> page : pages.entrySet()) {
byteContent.append(page.getValue().getPageData());
}
PdfWriter writer = new PdfWriter(new FileOutputStream(book.getName() + modification + FORMAT));
byte[] df = byteContent.toByteArray();
PdfReader reader = new PdfReader(new ByteArrayInputStream(byteContent.toByteArray()));
com.itextpdf.layout.Document itextDocument = new com.itextpdf.layout.Document(new PdfDocument(reader, writer));
itextDocument.close();
Why is there such a difference in size?
And why can't the file be rebuilt from the pages' byte[] data?
Let's start with your size question:
byte[] in = pdfData.getBytes(); // size file 88747
...
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
...
Why is there such a difference in size?
Because PdfPage.getContentBytes() does not return what you expect.
You seem to expect it to return a complete representation of the contents of the given page, and the Javadocs of that method might be interpreted ("Get decoded bytes for the whole page content.") to mean that.
This is not the case. PdfPage.getContentBytes() returns the contents of the content stream(s) of the page. These content stream(s) contain a sequence of commands which build the page. But these commands take parameters which reference data outside the content stream, e.g.:
when text is drawn on a PDF page, the content stream contains an operation selecting a font but the data describing the font and in case of embedded fonts the font program itself are outside the content stream;
when bitmap images are drawn, the content stream usually contains an operation for it which references image data outside the content stream;
there are operations which reference so-called XObjects, which essentially are independent content streams that can be called upon from any page; these XObjects are not contained in the page content stream either.
Furthermore there are annotations (e.g. form fields) with their own content streams which are stored in separate structures. And lots of page properties are outside, too.
Thus, there are such differences in size because you get only a minute part of the page definition using getContentBytes.
Now let's look at your code "restoring the file".
As a corollary of the above it is obvious that your code merely concatenates some content streams but does not provide the external resources these streams refer to.
But aside from that, your code also points to a misunderstanding concerning the nature of PDF pages: they are not merely blobs you can split and concatenate again as you want. They are collections of PDF objects which are spread throughout the PDF file; different pages can share some of their objects (e.g. fonts or often-used images).
What you can do instead...
As representations of a single page you should use a PDF containing the data referenced by that single page. The iText example Burst.java shows how to do that.
To join these single page PDFs again you can use an iText PdfMerger. Remember to set smart mode (PdfWriter.setSmartMode(true)) to prevent resource duplication in the result.
I have written software that generates a PDF as part of its functionality, using the iText Java library. For the demo version of the software, I added a text watermark (like "demo software") using the following code:
PdfContentByte under = writer.getDirectContentUnder();
BaseFont baseFont = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.EMBEDDED);
under.beginText();
under.setColorFill(BaseColor.RED);
under.setFontAndSize(baseFont, 25);
under.showTextAligned(PdfContentByte.ALIGN_CENTER," demo software",250, 470,55);
under.endText();
I then converted the PDF to .docx format using a PDF-to-Word converter. The resulting .docx file does not contain the watermark, and its contents are easily editable, so the whole purpose of providing a demo version is defeated.
How can I achieve permanent watermarking so that a PDF-to-Word converter won't be able to remove it?
One idea that comes to mind is to first convert all the text of a page into an image and then build the PDF from those images. But I am unsure how to achieve this using iText.
You can encrypt your PDF so that it cannot be modified without an owner password. After you have generated your PDF, create a PdfStamper with your PDF as input
and encrypt the PDF like this:
final PdfReader reader = new PdfReader(your_input_stream);
final PdfStamper stamper = new PdfStamper(reader, your_output_stream);
stamper.setEncryption(PdfWriter.ENCRYPTION_AES_128 | PdfWriter.DO_NOT_ENCRYPT_METADATA,
"your_user_password", "your_owner_password", PdfWriter.ALLOW_PRINTING);
stamper.close();
As a side note, I would recommend not using a hardcoded owner password. Since you have no need for the owner password after the file has been generated, I would suggest making it a SHA hash of a random string of, say, 20 alphanumeric characters.
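The suggestion about the owner password could be sketched roughly as follows, using only the JDK. This is an illustration under assumptions: SHA-256 is chosen here as the concrete SHA variant, the digest is hex-encoded so that it can be passed as a String, and the class name OwnerPasswordGenerator and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class OwnerPasswordGenerator {

    private static final String ALPHANUMERIC =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds a random alphanumeric string of the requested length.
    public static String randomAlphanumeric(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHANUMERIC.charAt(RANDOM.nextInt(ALPHANUMERIC.length())));
        }
        return sb.toString();
    }

    // Returns the SHA-256 digest of a 20-character random string as lowercase hex.
    public static String generate() {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                    randomAlphanumeric(20).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(generate()); // 64 hex characters, different on each run
    }
}
```

The returned string could then take the place of "your_owner_password" in the setEncryption call above. Since it is never stored, nobody (including you) can later unlock the file with owner rights.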
Can I format a substring of a string (for example, the string passed to a Paragraph via new Paragraph(someString)) using some kind of markers in iText? Is something like that supported?
For example:
new Paragraph("Congrats, you've [formatMarker]gained[/formatMarker] the privilege") ?
You can have iText parse HTML tags in order to format your text. Here is an example:
Reader reader = new StringReader("<b>Here is Some HTML</b><h1>Hello World</h1>");
HTMLWorker worker = new HTMLWorker(document);
worker.parse(reader);
When you parse, the content is added to the document directly, so there is no need to store it in a Paragraph. If you want more functionality and control over the individual elements of the HTML, you can try using the static method HTMLWorker.parseToList()
The API for iText is here http://api.itextpdf.com/itext/ and it has lots of details on both methods used above