PDFBox: convert PDF to text including chapter headline information - java

I am currently working on a project that extracts the content of PDF files and searches it for certain keywords.
I use PDFBox for the extraction, and this works fine.
The problem I have now run into is that I want to search for certain keywords only within chapter headlines.
At the moment my extraction code looks like this:
PDDocument doc = PDDocument.load(pdfFile);
String text = new PDFTextStripper().getText(doc);
doc.close();
This only extracts the raw text of the file, with no information about headlines. I could not figure out how to get PDFBox to include such information, so I am not sure whether it is even possible.
Does anybody have experience with this tool and can tell me whether this is possible with PDFBox and, if so, how to achieve it?
Kind regards
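One possible approach, sketched under assumptions: PDF text usually carries no explicit "headline" flag, so a common heuristic is to treat lines set in a larger font than the body text as headlines. With PDFBox 2.x, the font size of each text run can be collected by subclassing PDFTextStripper, overriding writeString(String, List<TextPosition>) and reading TextPosition.getFontSizeInPt(). The sketch below shows only the classification step in plain Java. The HeadlineFinder and Line names and the "larger than the most common size" rule are illustrative assumptions, not part of PDFBox, and the rule will miss headlines that differ from body text only by weight or position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeadlineFinder {

    /** One extracted line of text with its font size in points (hypothetical holder). */
    static final class Line {
        final String text;
        final float fontSize;

        Line(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Returns the font size that covers the most characters; assumed to be the body size. */
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Line line : lines) {
            charsPerSize.merge(line.fontSize, line.text.length(), Integer::sum);
        }
        float body = 0f;
        int max = -1;
        for (Map.Entry<Float, Integer> entry : charsPerSize.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                body = entry.getKey();
            }
        }
        return body;
    }

    /** Returns the text of every line set in a larger font than the body text. */
    static List<String> headlines(List<Line> lines) {
        float body = bodyFontSize(lines);
        List<String> result = new ArrayList<>();
        for (Line line : lines) {
            if (line.fontSize > body) {
                result.add(line.text);
            }
        }
        return result;
    }

    /** Case-insensitive keyword search restricted to the detected headlines. */
    static boolean headlineContains(List<Line> lines, String keyword) {
        String needle = keyword.toLowerCase();
        for (String headline : headlines(lines)) {
            if (headline.toLowerCase().contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

A call such as HeadlineFinder.headlineContains(lines, "methods") then searches only the detected headlines. Whether font size alone is a reliable signal depends on the documents, so the threshold may need tuning per corpus.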

Related

Why does extracting tables from a converted docx work better than from the original PDF?

I'm trying to extract tables from PDFs automatically. I know there are several libraries and methods for Java and Python, but to my surprise, the method that has worked best for me is to convert my PDF to a docx document and extract the tables from there (thanks to: How to get pictures and tables from .docx document using apache poi?).
My question is this: given that the format conversion may lose information, why are my results better this way? Tabula has not been able to do better automatically. To understand this, I have looked for information (e.g. Extracting table contents from a collection of PDF files), but I'm still very confused.
PS: So far I have used https://github.com/thoqbk/traprange (a method based on PDFBox), How to extract table as text from the PDF using Python? (PyPDF2) and Tabula. When I get home I will add code and test cases; I'm writing from my smartphone.

How to test the PDF generated using iText?

I have used iText in Java to convert HTML to PDF.
Now I want to test whether the generated PDF is correct, i.e. that all contents are correct and at the correct positions.
Is there a way to test my code?
Basically, your question is about validating iText output.
If you do not trust the library to convert HTML to PDF, you probably do not trust it to read raw PDF data either. You can therefore use another library (PDF Clown) to parse the PDF for validation.
You have two approaches.
The first one requires rasterizing the PDF (Ghostscript) and comparing the result to the HTML. The performance overhead is significant.
The second one parses the document format. I have gone into depth in my previous answer about searching for text inside a PDF file.
There I covered searching for text as well as finding its position on the page.
I would suggest simply not validating the output unless you know something is wrong.
These libraries are widely used and well tested.

Creating complex pdf using java

I have a Java/Java EE based application in which I need to create PDF certificates for various services that will be provided to the users. I am looking for a way to create the PDFs (no need for digital certificates for now).
What is the easiest and most convenient way of doing that? I have tried
XSL to PDF conversion
HTML to PDF conversion using iText
the crude Java way (using PdfWriter, PdfPCell etc.)
Which of these is the best, or is there another way that is easier and more convenient?
When you talk about Certificates, I think of standard sheets that look identical for every receiver of the certificate, except for:
the name of the receiver
the course that was followed by the receiver
a date
If this is the case, I would use any tool that allows you to create a fancy certificate (Acrobat, Open Office, Adobe InDesign,...) and create a static form (sometimes referred to as an AcroForm) containing three fields: name, course, date.
I would then use iText to fill in the fields like this:
PdfReader reader = new PdfReader(pathToCertificateTemplate);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(pathToCertificate));
AcroFields form = stamper.getAcroFields();
form.setField("name", name);
form.setField("course", course);
form.setField("date", date);
stamper.setFormFlattening(true);
stamper.close();
reader.close();
Creating such a certificate from code is "the hard way"; creating such a certificate from XML is "a pain" (because XML isn't well-suited for defining a layout), creating a certificate from (HTML + CSS) is possible with iText's XML Worker, but all of these solutions have the disadvantage that it's hard work to position every item correctly, to make sure everything fits on the same page, etc...
It's much easier to maintain a template with fixed fields. This way, you only have to code once. If for some reason you want to move the fields to another place, you only have to change the template, you don't have to worry about messing around in code, XML, HTML or CSS.
See http://www.manning.com/lowagie2/samplechapter6.pdf for some more info (section 6.3.5).
Try using Jasper Reports mate. Check it out at http://community.jaspersoft.com/
I recommend the first method, XSL to PDF conversion, which is the most powerful. I have produced a lot of PDF reports (each with thousands of pages) without trouble using Apache FOP. I think it is good enough and fairly easy, but it requires some knowledge of XSL-FO.
Even though this is an old question, I think it should be answered.
To create very complex PDFs such as certificates, reports or payment slips, you can definitely use the DynamicReports library. It depends on JasperReports (which is also a very popular and old library). DynamicReports lets you design your documents in Java code, so you can easily change them as required.
There are lots of examples on their site, and it is very easy to learn from them.
Below is the link:
http://www.dynamicreports.org/
Bruno Lowagie pointed out a great way to generate a template that is basically the same for all data and only needs to be populated. However, Bruno Lowagie recommends iText as the library to populate the fields. For me, as for Ankit, the iText license was the reason I had to choose another library. Below is a step-by-step guide on how to create a template and populate it with data using Apache PDFBox.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.16</version>
</dependency>
Create a template with LibreOffice Writer. For placeholders use text boxes (View >> Toolbars >> Form Controls). This will create a PDF with AcroForms as suggested by Bruno Lowagie.
Set a name for each text box. Set read-only to true.
Save the document as PDF.
Read the PDF template with PDFBox and set the values for the text boxes.
// Load the template from the classpath; try-with-resources closes the stream and the document.
try (InputStream is = getClass().getClassLoader().getResourceAsStream("Template.pdf");
     PDDocument pDDocument = PDDocument.load(is)) {
    PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
    PDField fieldName = pDAcroForm.getField("name");
    fieldName.setValue("FirstName Surname"); // <-- Replacement
    pDDocument.save(outStream); // outStream: an OutputStream supplied by the caller
} catch (IOException e) {
    e.printStackTrace();
}
Use the iText PDF library for creating the PDFs. It will be easy for you to generate PDFs with that API. Here is the link:
http://itextpdf.com/
iText® is a library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web and other applications with dynamic PDF document generation and/or manipulation.
Developers can use iText to:
Serve PDF to a browser
Generate dynamic documents from XML files or databases
Use PDF's many interactive features
Add bookmarks, page numbers, watermarks, etc.
Split, concatenate, and manipulate PDF pages
Automate filling out of PDF forms
Add digital signatures to a PDF file
You mentioned the PDFs can be complex. If this is to do with variability or layout, one option that provides reasonably sophisticated template-based layouts and controls is Docmosis. You provide Docmosis with doc or odt files as templates, so they are very easy to change, and then call Docmosis to mail-merge them into PDF or other formats. Please note I work for the company that created Docmosis.
Hope that helps.

PDF Handling in Java

I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page, it can have many more; the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF creation needs. However, if I understood correctly, it retrieves the text of a whole page as one string, which won't work for me, because I need to detect the different fields/paragraphs/or something to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one that fits my needs.
Or maybe to recommend how to add invisible parts to a PDF that would help my program determine where exactly each piece of text is situated inside the PDF file...
Just to be clear (I phrased my question badly before), the only thing I need to put in my PDF is text, and that's all I later need to get out. My program should be able to read the PDFs it created itself...
Also, because of the designated use of the files created with this program, they need to be in PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
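As a minimal sketch of the intermediate-format idea, the contents of the editor fields can be written to XML and read back with nothing but the JDK. Properties.storeToXML and Properties.loadFromXML are standard library methods; the FieldStore class and the field names used with it are made up for illustration, and a real editor would probably want a richer schema (field order, page numbers).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

final class FieldStore {

    /** Serializes "field name -> text" pairs to UTF-8 encoded XML. */
    static byte[] toXml(Map<String, String> fields) {
        Properties props = new Properties();
        props.putAll(fields);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.storeToXML(out, "editor fields", "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads the pairs back; the result is sorted by field name. */
    static Map<String, String> fromXml(byte[] xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> fields = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            fields.put(name, props.getProperty(name));
        }
        return fields;
    }
}
```

The byte array can be saved next to the PDF or used as the single source from which the PDF is generated, so the program never has to parse its own PDF output.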
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

How to append/paste bufferedImage into a Word or RTF document using Java?

I created a Microsoft Word document and tried to write the buffered image to it but all I got was garbled text. Is there a way to write (preferably append) a buffered image to a doc or RTF file?
I want to avoid using docx4j or iText or any external package for that matter due to some constraints. But if there is no other way then please do let me know.
My code in case anyone needs for reference:
ps_file = new File("ps_file.doc");
ImageIO.write(i1, "jpg", ps_file);
Word documents have their own syntax for storing their data, so you can't just append content to them and expect it to work.
You will have to use a third-party library unless you're willing to reinvent the wheel.
You can however create an RTF file which stores the image. There's a question similar to it that's been answered here:
Programmatically adding Images to RTF Document
Obviously it's for C# but the same procedures can easily be applied in Java.
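A rough Java sketch of that procedure, using only the JDK: the image is encoded as PNG, hex-dumped, and wrapped in an RTF \pict group. The control words (\pict, \pngblip, \picw, \pich, \picwgoal, \pichgoal) come from the RTF specification; the RtfImageWriter class name and the 96 dpi assumption behind the twips conversion are illustrative choices, and the result is a bare document containing nothing but the picture.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class RtfImageWriter {

    // 1440 twips per inch divided by an assumed screen resolution of 96 dpi.
    private static final int TWIPS_PER_PIXEL = 15;

    /** Builds a minimal RTF document that contains only the given image, embedded as PNG. */
    static String toRtf(BufferedImage image) {
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", png);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder rtf = new StringBuilder("{\\rtf1\\ansi{\\pict\\pngblip");
        rtf.append("\\picw").append(image.getWidth());
        rtf.append("\\pich").append(image.getHeight());
        rtf.append("\\picwgoal").append(image.getWidth() * TWIPS_PER_PIXEL);
        rtf.append("\\pichgoal").append(image.getHeight() * TWIPS_PER_PIXEL);
        rtf.append('\n');
        // RTF stores the picture bytes as pairs of hexadecimal digits.
        for (byte b : png.toByteArray()) {
            rtf.append(String.format("%02x", b & 0xff));
        }
        return rtf.append("}}").toString();
    }
}
```

The returned string is plain ASCII, so it can be written to a .rtf file with something like Files.write(Paths.get("ps_file.rtf"), RtfImageWriter.toRtf(i1).getBytes(StandardCharsets.US_ASCII)). Appending to an existing RTF file would mean inserting the {\pict ...} group before the document's final closing brace rather than writing a new document.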
