This question already has answers here:
How to read a Table in a PDF using iText java?
(3 answers)
Closed 7 years ago.
I am very tried to trying to read table with rows, cells of a pdf file to get records in systematic order.
I have done a lot of google but i could not find best ways to do this.
So i want to ask one question about it -
Q 1- Can we read data from pdf file ?
Q 2- Can we read data from any cell of pdf table ?
I am using itext of java to do this.
Please give me any example to do this.
Thanks
The answer to both your questions is: It depends.
Suppose that you have a ZUGFeRD invoice. In that case, the invoice is a PDF/A-3 document that has an embedded file in the CII XML format. It is very easy to extract this XML and read it to get all the necessary information about the invoice. The concept of embedded or attached file that contain the source of the data used to create the PDF, or the data in an alternative form than PDF, is a technique that is used to allow what you need.
You can extract text from a PDF. This is explained in questions such as PDF text extraction using iText but you only get the raw text without formatting. In many cases, a PDF consists of a bunch of text and lines put on a canvas at absolute positions. A word on the page does not know if it's part of a sentence, part of a cell, etc. Unless:
If the PDF is a Tagged PDF, then the PDF also contains information about the structure of the content. For instance: the content will contain tags that indicate structures such as tables, table headers, table rows, table cells. If you are talking about Tagged PDFs, then it's possible to extract the text in a structured way.
In the past, we have done project where we received credit card statements from VISA, MasterCard, AmEx,... We had to extract all the expenses and store them as records in a database. We were able to achieve this, because the format of the statements was predictable: all VISA statements are created alike, hence we were able to find the pattern that allowed us to extract the data.
It goes without saying that we do not share the code we used to do this. The company that paid us for doing that project would not be pleased.
Related
I have a system that ultimately creates a PDF files from html file. It works very similar to a mail merge. It grabs data from a database, merge's the data into palceholders in the html document and then converts the html file to a pdf.
When I am unit testing the html file I can look at the values in my place holder. For example if I had a John Smith and I want to validate that the name is "John Smith" I simply look the value of the div after the merge.
I need to do something similar with validating the data in the pdf. Using pdfbox and itext I was able to extract text from a location as well as text from the document but I can't find anything that would let me create a "tag/placeholder/..." and extract information from it similar to what I do with the html file.
Is this possible with a pdf?
That's perfectly possible using pdf2Data, which is a solution from the iText suite.
You can find the demo here
http://pdf2data.online/
It essentially does exactly what you described, you are given a viewer and some tools that allow you to define areas of interest (what you called 'placeholders').
Areas of interest can be defined using:
coordinates
relative to other areas of interest
relative to text or regular expressions
matching a certain regular exression
matching a table
etc
The tool then stores your template as an XML file, and you can use java or .NET code to extract information from a PDF that matches the template.
You are given either a json-like datastructure, or an XML file.
That should make it relatively straightforward to test whether a given area of interest contains a piece of text.
Good day to all. I am currently building a program that covers the review of product warranty applications. I'm doing it in javaFX using Netbeans. The program has the following scenes:
a screen where the information of each guarantee request is entered. all the information is stored in a table in a database. The interaction between the program and the database is done, in effect, through JDBC.
a screen where you can see a table that shows all the requests that have been saved. if a row is selected, a button that carries the third scene all the data of the request that was selected is enabled.
a screen where all the data of the tests that are made to the selected guarantee application are entered. The results are also stored in another table in the database.
After the application is evaluated, a warranty review report must be generated. Currently this format is generated in pdf from excel. What I want to do is that from the data results of the tests stored in the database I can dynamically generate the pdf formats from the program in javaFX. Is there a plugin to write these documents automatically? I'm good at writing texts in LaTEX, so if there is a way to generate the latex format from the program and call the necessary information from the database, it would be perfect. Thanks in advance for the help. Any indication or idea is welcome.
It seems like you have two core requirements:
Fetch data from the database suitable for reporting
Generate the report(s) in PDF from JavaFX but can fall back to LaTEX
What you really need seems like a PDF library for Java. I can suggest iText and Docmosis as good options (please note I work for Docmosis) - both are commercial for commercial products so you would have to buy.
Assuming you are using one of these libraries, the process for each report is:
execute the query to fetch the appropriate data for the report
manipulate the data if required to make the reporting stage simple
generate the report
Using iText you would write the query, the manipulation code and then the code to layout the report including the data.
Using Docmosis you would write the query, possibly some manipulation code (Docmosis can also work directly with your ResultSet) and the code to execute the report. The layout is designed in the template (Word or Libre Office Writer).
When you mention writing "these documents automatically" I assume you mean creating the PDF file format, which iText and Docmosis can do. If you mean creating the report layout itself, then you always need to design/write something to make the report do what you require.
I hope that helps.
Thank you very much for your response Paul! I had found something related to the libraries you mentioned, and indeed something like what I'm looking for. I notice that you are more in the subject. then, you do not know bookstore, preferably free, that gives me the possibility of doing the following (pseudo code):
take the row from the database
Save the information of that row in the attributes of a created class.
create text1: "the guarantee with reference" + object.attribute1 + "was not approved in view of the physical revision test indicated that" + object.attribute2 + "
create text2: "..."
...
create the text n: "..."
take text 1 and place it in the header of the pdf document
Take text 2, put it in bold and place it in the subtitle
Generate a table and fill it with the content of text 3, 4 ...
compile all information as a pdf, (word file, xls or others if possible)
I am clear that with the libraries that you recommend you can easily make items from
1 to 8, but I do not know if it is possible to enter the texts within a template created, so that the library accommodates all the texts in the respective zones of the template file. I imagine that this can easily be done with Latex, since everything is written in plain text.
I found a library called Java LaTeX Report (JLR) that allows me to do what I want. This information may be useful to someone. Thank you again for your answer Paul, if you consider the libraries that you mention do the job more easily than JLR please let me know!
I'm trying to perform automaticaly table extraction inside PDF. I know there are several libraries and methods Java and Python, but to my surprise, the method that has worked best for me is to convert my Pdf to a Docx document and from there to extract the tables (thanks to: How to get pictures and tables from .docx document using apache poi?).
My question is this: Assuming that within the format conversion there may be loss of information, why are my results better this way? Tabula hasn't been able to do better automatically. To understand this, I have looked for information (e.g. Extracting table contents from a collection of PDF files) but I'm still very confused.
PD: For the moment, I have used https://github.com/thoqbk/traprange (A method based on Pdfbox), How to extract table as text from the PDF using Python? (PyPdf2) and Tabula. When I get to my home I going to put code and cases, I'm writing from my smartphone.
I have a program which will be used for building questions database. I'm making it for a site that want user to know that contet was donwloaded from that site. That's why I want the output be PDF - almost everyone can view it, almost nobody can edit it (and remove e.g. footer or watermark, unlike in some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users which will create new databases or expand existing ones. That's why having output formed as multple files is extremly sloppy and inefficient way of achieving what I want to achieve (it would complicate things for the user).
And what I want to do is to create PDF files which are still editable with my program once created.
I want to achieve this by implementing my custom file type readable with my program into the output PDF.
I came up with three ways of doing that:
Attach the file to PDF and then corrupting the part of PDF which contains it in a way it just makes the PDF unaware that it contains the file, thus making imposible for user to notice it (easely). Upon reading the document I'd revert the corruption and extract file using one of may PDF libraries.
Hide the file inside an image which would be added to the PDF somwhere on the first or last page, somehow (that is still need to work out) hidden from the public eye. Knowing it's location, it should be relativley easy to retrieve it using PDF library.
I have learned that if you add "%" sign as a first character in line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (atleast Adobe reader), making possible for me to add as many lines as I want to the PDF (if I know where, and I do) whitout the end user being aware of that. I could implement my whole custom file into PDF that way. The problem here is that I actually have to read the PDF using one of the Java's input readers, but I'm not sure which one. I understand that PDF can't be read like a text file since it's a binary file (Right?).
In the end, I decided to go with the method number 3.
Unless someone has any better ideas, and the conditions are:
1. One file only. And that file is PDF.
2. User must not be aware of the addition.
The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea?
If not, how do I read PDF as a FILE, so the output is array of characters (with newline detection), and then rewrite the whole file with my content addition?
In Java, there is no real difference between text and binary files, you can read them both as an inputstream. The difference is that for binary files, you can't really create a Reader for it, because that assumes there's a way to convert the byte stream to unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the files in byte buffers and possibly loop over them to scan for bytes representing the '%' and end-of-line character in PDF.
A better way is to use another existing way of encoding data in a PDF: XMP tags. This is allows any sort of complex Key-Value pairs to be encoded in XML and embedded in PDF's, JPEGs etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from another guy who succeeded in it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what a valid newline character(s) is/are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful of where you insert your lines though - you'll need to add them towards the end of the file where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
Go to the end of the file
Search backwards for startxref
Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
Save the PDF
You can (and should) do this manually in a hex editor.
Really important: are your users going to be saving changes to these files? i.e. if they fill in the form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217
I have created a program that should one day become a PDF editor
It's purpose will be saving GUI's textual content to the PDF, and loading it from it. GUI resembles text editor, but it only has certain fields(JTextAreas, actually).
It can look like this (this is only one page, it can have many more, also upper and lower margins are cut out of the picture) It should actually resemble A4 in pixel size.
I have looked around for a bit for PDF libraries and found out that iText could suit my PDF creating needs, however, if I understood it correct, it retirevs text from a whole page as a string which won't work for me, because I will need to detect diferent fields/paragaphs/orsomething to be able to load them back into the program.
Now, I'm a bit lazy, but I don't want to spend hours going trough numerus PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend me one according to my needs.
Or maybe recommend me how to add invisible parts to PDF which will help my program to determine where is it exactly situated insied a PDF file...
Just to be clear (I formed my question wrong before), only thing I need to put in my PDF is text, and that's all I need to later be able to get out. My program should be able to read PDF's which he created himself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDF's in a manner that they wasn't designed for. PDF's were not designed to store data; they were designed to present and format data in an portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.