Extracting data from a PDF

Extracting data from a PDF - java

I have a system that ultimately creates a PDF file from an HTML file. It works much like a mail merge: it grabs data from a database, merges the data into placeholders in the HTML document, and then converts the HTML file to a PDF.
When I am unit testing the HTML file, I can look at the values in my placeholders. For example, if I had a div containing John Smith and I wanted to validate that the name is "John Smith", I would simply look at the value of the div after the merge.
I need to do something similar to validate the data in the PDF. Using PDFBox and iText I was able to extract text from a location, as well as text from the whole document, but I can't find anything that would let me create a "tag/placeholder/..." and extract information from it the way I do with the HTML file.
Is this possible with a PDF?

That's perfectly possible using pdf2Data, which is a solution from the iText suite.
You can find the demo here: http://pdf2data.online/
It does essentially what you described: you are given a viewer and some tools that allow you to define areas of interest (what you called 'placeholders').
Areas of interest can be defined using:
coordinates
relative to other areas of interest
relative to text or regular expressions
matching a certain regular expression
matching a table
etc
The tool then stores your template as an XML file, and you can use Java or .NET code to extract information from a PDF that matches the template.
The result is returned as either a JSON-like data structure or an XML file.
That should make it relatively straightforward to test whether a given area of interest contains a piece of text.
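To illustrate the idea of an area of interest defined "relative to text or regular expressions", here is a minimal sketch in plain Java. It is not the pdf2Data API. It assumes the page text has already been extracted (for example with PDFBox or iText, as in the question) and that each value appears on one line after a label such as "Name:". The class name, method name and label format are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredFieldExtractor {

    // Returns the text that follows "<label>:" on the same line, if the label is found.
    // Assumption: the PDF text was extracted beforehand and labels look like "Name: value".
    static Optional<String> valueAfterLabel(String pageText, String label) {
        Pattern p = Pattern.compile(
                "(?m)^\\s*" + Pattern.quote(label) + "\\s*:\\s*(.+?)\\s*$");
        Matcher m = p.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from the generated PDF.
        String pageText = "Invoice 2041\nName: John Smith\nCity: Springfield\n";
        System.out.println(valueAfterLabel(pageText, "Name").orElse("<missing>"));
    }
}
```

In a unit test, the returned value can be compared with the value that was merged into the HTML placeholder.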

Related

How to extract data from HTML-formatted EMail files via OpenNLP?

I am working on a project where I receive emailed receipts from various courier agents. The emails are in HTML format.
However, they do not share a common structure; each email has a different format. I tried jsoup to extract the data, but it is difficult to write the extraction for each specific type of HTML. I need to extract the name, from location, to location, organization and a few other details from each mail. I tried OpenNLP, but it does not recognize all locations and names. It catches some of the locations if they appear in sentence form.
Can I create my own training data with HTML content in it, annotate it, and train a model to detect locations and names based on the HTML structure in the training data?

I think your initial approach is worth pursuing. I see an option for 2 steps here:
Get the 'text' content of the mail using Jsoup. An example of that is here: Get Text from html Using Jsoup.
Use OpenNLP or StanfordNLP NER to extract the named entities: locations, names, etc.
Another option involves playing around with the parse tree generated from the sentences to see if there is a pattern for the data you want to extract.
As regards getting the from location and to location, you can try to generate a parse tree for the sentences; there is an excellent example of that here: Extract noun phrase from Sentences OpenNLP. Just change the code to get the PP (Prepositional Phrase) in line 65, as it currently gets the NP (Noun Phrase).
You'll notice that from location and to location are prepositional phrases (from and to are prepositions). Once you get the prepositional phrases from the sentences, you can try to extract the noun component (after the preposition) and use other heuristics to determine if they are locations.
Something that can also be very useful is if you have a lexicon of the possible locations. If there is a lexicon then your 'search space' is smaller, you can check your prepositional phrases to see if they are known locations.
As someone mentioned in the comment, no entity recognizer can do a perfect job out of the box. These things usually need a lot of tweaking so you have to be keen on experimenting and looking at what the data says.
Hope this helps
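The lexicon idea above can be sketched in plain Java. This is a simplified stand-in, not OpenNLP: instead of a real parse tree it uses a regular expression to pick up capitalised words after "from" or "to", then keeps only the candidates found in a lexicon. The class name, the lexicon contents and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RouteCandidates {

    // Hypothetical lexicon of known locations, stored in lower case.
    static final Set<String> KNOWN_LOCATIONS = Set.of("chennai", "new delhi", "mumbai");

    // Rough approximation of a prepositional phrase: "from"/"to" followed by
    // one to three capitalised words. A real parser would be more reliable.
    static final Pattern PREP_PHRASE = Pattern.compile(
            "\\b((?i:from|to))\\s+([A-Z][a-z]+(?:\\s+[A-Z][a-z]+){0,2})");

    // Returns entries such as "from: Chennai" for candidates present in the lexicon.
    static List<String> findLocations(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PREP_PHRASE.matcher(text);
        while (m.find()) {
            String preposition = m.group(1).toLowerCase(Locale.ROOT);
            String candidate = m.group(2);
            if (KNOWN_LOCATIONS.contains(candidate.toLowerCase(Locale.ROOT))) {
                found.add(preposition + ": " + candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Text as it might look after stripping the HTML with jsoup.
        System.out.println(findLocations("Shipped from Chennai to New Delhi on Monday."));
    }
}
```

The lexicon check is what keeps "to Bob" or "from accounting" out of the result; the regular expression alone would over-match.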

how to create a new word from template with docx4j

I have the following scenario and need some advice:
The user will supply a Word document as a template and provide some parameters at runtime, so I can query my database and get data to fill the document.
So, there are two basic things I need to do:
Replace every key in the document with its respective result from the current query row.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) depending on how many rows I got from the query, replacing the keys in each new copy with the next row's values.
What is the best approach to do this? I've managed to do the replace part for now, by using unmarshallFromTemplate and providing it a HashMap.
But this way is a little tricky, because I need to add "${variable_name}" in the document, and sometimes Word splits "${" and "}" into different tags, causing issues.
I've read about custom XML binding, but didn't understand it completely. Do I need to generate a custom XML, inject it into the document (all of this at runtime) and call applyBindings? If so, how would I bind the fields in the document to the XML? By name?

docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
1. Dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
2. Modify your template so that the content controls are bound to elements in your XML document. Ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID.)
3. Now that you have your XML file and a suitable input docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/
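The first step above (converting query results into an XML format of your own) could look roughly like the following sketch. It uses only the JDK's DOM and transformer classes, not docx4j. The element names "rows" and "row", and the use of column names as element names, are hypothetical choices; a List of Maps stands in for a JDBC ResultSet.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RowsToXml {

    // Builds <rows><row><column>value</column>...</row>...</rows>.
    // Assumption: every column name is a valid XML element name.
    static String toXml(List<Map<String, String>> rows) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("rows");
        doc.appendChild(root);
        for (Map<String, String> row : rows) {
            Element rowElement = doc.createElement("row");
            root.appendChild(rowElement);
            for (Map.Entry<String, String> column : row.entrySet()) {
                Element cell = doc.createElement(column.getKey());
                cell.setTextContent(column.getValue());
                rowElement.appendChild(cell);
            }
        }
        // Serialize the DOM tree to a string without the XML declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(List.of(Map.of("name", "John Smith"))));
    }
}
```

With this shape, a content control for the first row's name would be bound with an XPath along the lines of /rows/row[1]/name, and a repeat would be bound to /rows/row.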

PDF Handling in Java

I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page; it can have many more, and the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF creation needs. However, if I understood correctly, it retrieves the text of a whole page as a single string, which won't work for me, because I need to detect different fields/paragraphs/or something similar to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one according to my needs.
Or maybe recommend how to add invisible parts to the PDF which will help my program determine where exactly each piece of text is situated inside the PDF file...
Just to be clear (I phrased my question wrongly before), the only thing I need to put in my PDF is text, and that's all I need to be able to get out later. My program should be able to read PDFs that it created itself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.

Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.

If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
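A minimal sketch of the "store the data as XML and read it back" part, using the JDK's java.util.Properties XML format as a simple stand-in for a custom schema. Attaching the resulting bytes to the PDF is not shown, since that depends on the PDF library used. The class name and the field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

class FieldStore {

    // Serializes the editor fields (field name -> text) to XML bytes.
    static byte[] save(Properties fields) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        fields.storeToXML(out, "editor fields", "UTF-8");
        return out.toByteArray();
    }

    // Reads the XML bytes back into a Properties object.
    static Properties load(byte[] xml) throws IOException {
        Properties fields = new Properties();
        fields.loadFromXML(new ByteArrayInputStream(xml));
        return fields;
    }

    public static void main(String[] args) throws IOException {
        Properties fields = new Properties();
        fields.setProperty("title", "Chapter 1");   // hypothetical field names
        fields.setProperty("body", "Some text typed into a JTextArea");
        Properties restored = load(save(fields));
        System.out.println(restored.getProperty("title"));
    }
}
```

Because the fields are keyed by name, the program can load each value back into the matching JTextArea without having to locate text inside the PDF page content.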

Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

Print xml in pdf using itext

I want to print XML in a PDF using iText in Java, well formatted and displayed with color and indentation, as shown in Notepad++.
Is there any API or suggestion for this?

I have converted XHTML to pdf, via iText, using flying saucer for the rendering (previously xhtml renderer).
http://code.google.com/p/flying-saucer/
You can format using CSS, though I do remember it's slightly temperamental, however you can tweak it to get what you want, and end up with something nicely formatted.
I wasn't sure what you meant regarding Notepad++. I don't have PDF support there; a PDF just opens as binary file contents, unless there is a PDF plugin you use?
::Answer updated after comments below.
Thanks for the comment, I understand the question much better now. I thought you wanted to output the data in the XML in the PDF, now I understand you want to see the raw XML itself in the PDF, formatted as you'd see XML formatted in Notepad, colours and all.
XML is a markup language designed to describe data, so you want to get this into a language that can describe the presentation and style as well as the data. I'd suggest
1) Convert the XML to XHTML - so all the XML (tags, attributes) is your content, and you have classes describing each type (for example, attribute names, attribute values, start tag, end tag). I don't know if you can use an XSLT library to transform it this way; otherwise you can write something yourself in Java, walking through the DOM and outputting it in the way you want. This way you can
2) Create CSS to style your classes as you want - e.g. have all attribute names as text color "red"
3) Use iText and flying saucer as above to convert the XHTML and CSS into PDF using Java, as described in original answer
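Step 1 above (walking the DOM and emitting XHTML with a class per token type) might be sketched as follows. This uses only the JDK's DOM parser; the class names "tag", "attr-name", "attr-value" and "text" are hypothetical, and comments, processing instructions and namespaces are ignored for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlToXhtml {

    // Renders the XML source as an XHTML <pre> block with one span per token.
    static String render(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder("<pre>");
        walk(doc.getDocumentElement(), 0, out);
        return out.append("</pre>").toString();
    }

    static void walk(Node node, int depth, StringBuilder out) {
        String indent = "  ".repeat(depth);
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                out.append(indent).append(span("text", text)).append('\n');
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments etc. are skipped in this sketch
        }
        out.append(indent).append("&lt;").append(span("tag", node.getNodeName()));
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            out.append(' ').append(span("attr-name", attr.getNodeName()))
               .append("=&quot;").append(span("attr-value", attr.getNodeValue()))
               .append("&quot;");
        }
        out.append("&gt;\n");
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
        out.append(indent).append("&lt;/").append(span("tag", node.getNodeName())).append("&gt;\n");
    }

    static String span(String cssClass, String text) {
        return "<span class=\"" + cssClass + "\">" + escape(text) + "</span>";
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

A CSS rule such as .attr-name { color: red; } then covers step 2, and the resulting XHTML plus CSS is what gets handed to the renderer in step 3.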

Print XML as PDF using PCL

I have an XML file which I want to print as a PDF using PCL. I am new to PCL. Can I use PCL to get the XML printed in PDF format directly, or should I have some intermediate process to create a PDF file and then use PCL to print it?

If you have an XML file, there are two ways to get a PDF file.
1. Create a stylesheet for your XML, and use XEP
or
2. use just your XML and VisualXSL, which will help you create your PDF for print.
In addition: if you create your own XSL stylesheet, XEP can format many types of PDF, for example PDF/1A, or other levels.
Both XEP and VisualXSL are RenderX products (http://www.renderx.com/tools/index.html) and they have trial versions that you can use. I have used both products many times, and was satisfied.
You can also visit the forum, where you can find answers about how to use the products described above and how useful they are: http://cooltools.renderx.com

PCL is a printer control language: in other words, command bytes that you send to a (usually HP) printer, which converts them to ink on a page. This is normally not the way to generate a PDF, since too much information from the original will be lost.
You will normally want to convert your XML to something describing the actual printed output you want. A reasonable choice for this is the XSL-FO XML dialect, which, however, is not very pleasant to write by hand. You can instead convert your XML into DocBook XML, which in turn has very good style sheets for converting further to XSL-FO and other formats.
You can then use Apache FOP to convert XSL-FO into many formats, one being PDF. This also allows you, if you outgrow FOP, to replace it with one of several commercial XSL-FO rendering engines at a later date.
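The XML to XSL-FO step can be sketched with the XSLT support built into the JDK. The stylesheet below is a hypothetical minimal one that writes one fo:block per "item" element (the element name is an assumption about the input XML). Rendering the resulting XSL-FO to PDF with Apache FOP or a commercial engine is a separate step and is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XmlToFo {

    // Hypothetical stylesheet: a single A4 page master and one fo:block per <item>.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/'>"
        + "<fo:root><fo:layout-master-set>"
        + "<fo:simple-page-master master-name='A4' page-height='29.7cm'"
        + " page-width='21cm' margin='2cm'>"
        + "<fo:region-body/></fo:simple-page-master></fo:layout-master-set>"
        + "<fo:page-sequence master-reference='A4'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<xsl:for-each select='//item'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:for-each>"
        + "</fo:flow></fo:page-sequence></fo:root>"
        + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to the input XML and returns the XSL-FO document as a string.
    static String toFo(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toFo("<list><item>Hello</item><item>World</item></list>"));
    }
}
```

Keeping the XSLT step separate from the rendering step is what makes it possible to swap the FO engine later without touching the stylesheet.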
