Facing issues extracting text from a PDF file using Java

I am not able to extract the text from a PDF that uses fonts with a custom encoding, which can be identified via File -> Properties -> Fonts in Adobe Reader.
One of the fonts is listed as:
C0EX02Q0_22
Type: Type 3
Encoding: Custom
Actual Font: C0EX02Q0_22
Actual Font type: Type 3
Please let me know if there is any way to extract the text content from such PDF files.
Currently I am using PDFText2HTML from PDFBox.
I get values like 'ÁÙÅ#ÅÕãÉ' when extracting text from such PDF files.
Sample PDF: tesis completa.pdf
In this PDF you can see that the fonts used have a custom encoding, e.g. T3Font_1 (see File -> Properties -> Fonts in Adobe Reader).
Since I could not upload my own PDF, I have provided a sample that has the same issue.

Extraction as described in the standard
The PDF specification ISO 32000-1 describes in section 9.10 Extraction of Text Content how text extraction can be done if the PDF provides the required information and does so correctly.
Using this algorithm, though, only works in a few page ranges of the document (namely the summaries, the content lists, the thank-yous, and the section Publicación 7), but in the other ranges it results in gibberish, e.g. 8QLYHUVLWDWGH/OHLGD instead of Universitat de Lleida. Looking at the PDF objects in question makes clear that the required information is missing (there is no ToUnicode map, and while the Encoding is based on WinAnsiEncoding, all positions in use are mapped via Differences to non-standard names).
Also trying to extract the text using copy&paste from Adobe Reader returns that gibberish. This generally is a sign that generic extraction is not possible.
A work-around
Inspecting the PDF objects and the outputs of the generic text extraction attempt, though, gives rise to the idea that the actual encoding for the text extracted as gibberish is the same for all fonts used, and that it is some ASCII-based encoding shifted by a constant: Adding 'U' - '8' to each character of the extracted 8QLYHUVLWDWGH/OHLGD results in Universitat de Lleida. Adding the same constant to the chars from text extracted elsewhere in the document also results in correct text as long as the text only uses ASCII characters.
Characters outside the ASCII range are not mapped correctly by that simple method, but they also always seem to be extracted as the same wrong character, e.g. the glyph 'ó' always is extracted as 'y'.
Thus, you can extract the text from this document (and similarly created ones) by first extracting the text using the standard algorithm and then, in the gibberish sections (which can probably be identified by font name), replacing each character by adding 'U' - '8' for small values and by applying some mapping table for higher values.
As you mentioned Java in your question, I have run your document through iText and PDFBox text extraction with and without shifting by 'U' - '8', and the results look promising. I assume other general-purpose Java PDF libraries will also work.
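As an illustration, here is a minimal sketch of that work-around using PDFBox 2.x (the class names, the file name and the shift handling are assumptions based on the observations above; in PDFBox 1.x the stripper lives in org.apache.pdfbox.util instead). Per the above, only the gibberish ranges, identified e.g. by font name, should actually be passed through the shift:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class ShiftedTextExtraction {
        // The constant offset observed in the sample document: 'U' - '8' == 29
        private static final int SHIFT = 'U' - '8';

        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("tesis completa.pdf"))) {
                String raw = new PDFTextStripper().getText(doc);
                // For brevity the shift is applied to the whole output here;
                // in practice only the gibberish ranges should be unshifted.
                System.out.println(unshift(raw));
            }
        }

        // Shift printable ASCII characters back by the constant; leave spaces,
        // line breaks and everything outside ASCII untouched (those need a
        // separate mapping table, see above).
        static String unshift(String gibberish) {
            StringBuilder sb = new StringBuilder(gibberish.length());
            for (char c : gibberish.toCharArray()) {
                int shifted = c + SHIFT;
                sb.append(c > ' ' && shifted <= 0x7E ? (char) shifted : c);
            }
            return sb.toString();
        }
    }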
Another work-around
Instead of creating custom extraction routines, you can try to fix the PDFs in question by adding ToUnicode map entries to the fonts in question. After that normal text extraction programs should be able to properly extract the contents.
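If you go that route, a rough sketch with PDFBox 2.x might look like the following. It attaches a ToUnicode CMap expressing the constant shift found above as a single bfrange; the CMap name, the code range and the file names are assumptions, characters outside the ASCII range would still need individual bfchar entries, and in practice you would only touch the fonts that actually produce gibberish:

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;

    public class AddToUnicode {
        // A ToUnicode CMap mapping every single-byte code c to U+(c + 0x1D),
        // i.e. the 'U' - '8' shift expressed as one bfrange.
        private static final String CMAP =
              "/CIDInit /ProcSet findresource begin\n"
            + "12 dict begin\n"
            + "begincmap\n"
            + "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
            + "/CMapName /Custom-Shift-UCS def\n"
            + "/CMapType 2 def\n"
            + "1 begincodespacerange\n<00> <FF>\nendcodespacerange\n"
            + "1 beginbfrange\n<00> <E2> <001D>\nendbfrange\n"
            + "endcmap\n"
            + "CMapName currentdict /CMap defineresource pop\n"
            + "end\nend\n";

        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("tesis completa.pdf"))) {
                for (PDPage page : doc.getPages()) {
                    for (COSName name : page.getResources().getFontNames()) {
                        // In practice, filter here by font name so that only the
                        // fonts that extract as gibberish are touched.
                        PDStream toUnicode = new PDStream(doc,
                                new ByteArrayInputStream(CMAP.getBytes(StandardCharsets.US_ASCII)));
                        page.getResources().getFont(name).getCOSObject()
                                .setItem(COSName.getPDFName("ToUnicode"), toUnicode);
                    }
                }
                doc.save("tesis-completa-fixed.pdf");
            }
        }
    }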

Related

Java: Converting a PostScript File into text

Is there a Java library that converts a PostScript file (".ps") into a String or text file (or something I can read with an InputStream)?
I have these files and need to read them and handle them according to the text in them. They always contain only text, and usually it's just one line like
date:SWYgeW91IHJlYWQgdGhpcyB5b3UncmUgcHJvYmFibGUgdG8gY3VyaW91cyAgYnV0IG5pY2UgdHJ5IGFueXdheS4gUGxlYXNlIEhlbHA=
in it.
Right now I convert it into a PDF and "read" it with an OCR engine, but that seems a little over the top for just one line.
Is there another way to do it?
If you could point me in the right direction, that would be great.
PostScript is a language for describing graphical output on paper, to a printer device. As such it does not really contain plain text, and "extracting" text from it poses problems: the text could, for instance, be generated programmatically in places, or be interspersed with PS code, making the text data useless.
Normally you would send a modified PS file to a printer (real or virtual) with a specific configuration that causes the result to be output as a plain text sequence (without the graphical formatting).
This is often done by altering the PS file to redefine the text output commands.
A description of this method can be found in part 3 of the following University of Waikato paper:
http://www.cs.waikato.ac.nz/~ihw/papers/98NM-Reed-IHW-Extract-Text.pdf
If you convert the PostScript file to PDF (for example, with Ghostscript's ps2pdf or with Acrobat Distiller), you can then read that file using iText (http://itextpdf.com). You could also convert the PDF into a more readable form using RUPS, one of the iText tools.
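As a rough sketch of that route (assuming Ghostscript's ps2pdf is on the PATH, iText 5 on the classpath, and placeholder file names):

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class PsToText {
        public static void main(String[] args) throws Exception {
            // 1. Convert the PostScript file to PDF with Ghostscript's ps2pdf
            Process p = new ProcessBuilder("ps2pdf", "input.ps", "input.pdf")
                    .inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("ps2pdf failed");
            }

            // 2. Read the generated PDF with iText 5 and pull out the text
            PdfReader reader = new PdfReader("input.pdf");
            StringBuilder text = new StringBuilder();
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, i)).append('\n');
            }
            reader.close();
            System.out.println(text);
        }
    }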

How to test the PDF generated using iText?

I have used iText in Java to convert HTML to PDF.
Now I want to test whether the generated PDF is correct, i.e. whether the contents are all present and at the correct positions.
Is there a way to test my code?
Basically, your question is about validating iText output.
If you do not trust the library for converting HTML to PDF, you probably should not trust it to read the raw PDF data back either. You can therefore use another library (e.g. PDF Clown) to parse the PDF for validation.
You have two approaches.
The first one requires rasterizing the PDF (e.g. with Ghostscript) and comparing the result to the HTML. The performance overhead is significant, though.
The second one parses the document format. I have gone into depth in my previous answer about searching for text inside a PDF file.
There I mentioned searching for text as well as finding its position on the page.
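If you go with the second approach, a minimal sketch using iText 5 and JUnit 4 (the file name and the expected text are placeholders) could look like this:

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import org.junit.Test;
    import static org.junit.Assert.assertTrue;

    public class GeneratedPdfTest {
        @Test
        public void firstPageContainsHeadline() throws Exception {
            PdfReader reader = new PdfReader("generated.pdf");   // hypothetical output file
            try {
                // LocationTextExtractionStrategy keeps the reading order close to the layout
                String page1 = PdfTextExtractor.getTextFromPage(
                        reader, 1, new LocationTextExtractionStrategy());
                assertTrue(page1.contains("Expected headline"));  // placeholder text
            } finally {
                reader.close();
            }
        }
    }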
That said, I would suggest simply not validating the output unless you know something is wrong.
These libraries are widely used and well tested.

Two fields (out of 18) in a PDF form don't appear after filling and flattening with iText

I'm using the latest version of iText (5.5.0) to fill a PDF form and flatten it.
For some reason, when I fill the form using an FDF, everything works except special characters like 'ç'. But when I use an XFDF, the special characters appear, yet two fields do not ('comment' and 'datelicFormatted').
-> see https://upl.cases.lu/?action=d&id=82085749396826225834 for the template, the XFDF and a result
The form was created by converting a Word document and adding text fields with Acrobat 10.
The really strange part: when I ask not to flatten the form, the fields contain the right values; they just seem to vanish when flattened.
Thank you for any help you could provide.
Your template uses the font Arial Unicode MS, which iText can interpret correctly only with an extra resource (the CMap UniKS-UTF16-H). The ordinary iText JARs do not contain such resources by design, but there is one more iText JAR, itext-asia.jar, that contains the CMap resources. So just add itext-asia.jar to your classpath and the problematic fields will be flattened correctly. You can download itext-asia.jar at http://sourceforge.net/projects/itext/files/extrajars/extrajars-2.3.zip/download
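For reference, a minimal fill-and-flatten sketch with iText 5 (file names and field values are placeholders; the field names are the two from your question), which should work once itext-asia.jar is on the classpath:

    import java.io.FileOutputStream;
    import com.itextpdf.text.pdf.AcroFields;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;

    public class FillAndFlatten {
        public static void main(String[] args) throws Exception {
            // itext-asia.jar must be on the classpath so iText can load the
            // UniKS-UTF16-H CMap needed for Arial Unicode MS.
            PdfReader reader = new PdfReader("template.pdf");
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("filled.pdf"));
            AcroFields fields = stamper.getAcroFields();
            fields.setField("comment", "ça marche with special characters");
            fields.setField("datelicFormatted", "2014-05-20");
            stamper.setFormFlattening(true);
            stamper.close();
            reader.close();
        }
    }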

Reading a PDF in Java as a file and making the "PDF" editable

I have a program which will be used for building a questions database. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF: almost everyone can view it, and almost nobody can edit it (and remove e.g. a footer or watermark, unlike with some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users who will create new databases or expand existing ones. That's why having the output spread over multiple files is an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).
And what I want to do is to create PDF files which are still editable with my program once created.
I want to achieve this by embedding my custom file format, readable by my program, into the output PDF.
I came up with three ways of doing that:
Attach the file to the PDF and then corrupt the part of the PDF which contains it, in a way that just makes the PDF unaware that it contains the file, thus making it impossible for the user to notice it (easily). Upon reading the document I'd revert the corruption and extract the file using one of the many PDF libraries.
Hide the file inside an image which would be added to the PDF somewhere on the first or last page, somehow (I still need to work that out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve it using a PDF library.
I have learned that if you add a "%" sign as the first character of a line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (at least Adobe Reader), making it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of it. I could embed my whole custom file into the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, but I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).
In the end, I decided to go with method number 3.
Unless someone has a better idea. The conditions are:
1. One file only. And that file is PDF.
2. User must not be aware of the addition.
The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea?
If not, how do I read the PDF as a FILE, so the output is an array of characters (with newline detection), and then rewrite the whole file with my added content?
In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for binary files you can't really create a Reader, because that assumes there's a way to convert the byte stream to Unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the file into byte buffers and loop over them to scan for the bytes representing '%' and the end-of-line characters in PDF.
A better way is to use another existing mechanism for encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open source Java library that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from someone who succeeded with it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
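As a rough sketch of the XMP route, assuming PDFBox 2.x with its xmpbox module (file names are placeholders, and the Dublin Core title is used here only as a simple carrier for the payload; a real application would rather define a custom schema as in the linked question):

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.common.PDMetadata;
    import org.apache.xmpbox.XMPMetadata;
    import org.apache.xmpbox.schema.DublinCoreSchema;
    import org.apache.xmpbox.xml.XmpSerializer;

    public class EmbedXmp {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("questions.pdf"))) {
                // Build an XMP packet with a Dublin Core schema carrying the payload
                XMPMetadata xmp = XMPMetadata.createXMPMetadata();
                DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
                dc.setTitle("my-custom-question-data");   // placeholder payload
                dc.addCreator("MyQuestionTool");

                ByteArrayOutputStream out = new ByteArrayOutputStream();
                new XmpSerializer().serialize(xmp, out, true);

                // Attach the packet as the document-level metadata stream
                PDMetadata metadata = new PDMetadata(doc);
                metadata.importXMPMetadata(out.toByteArray());
                doc.getDocumentCatalog().setMetadata(metadata);
                doc.save("questions-with-metadata.pdf");
            }
        }
    }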
It's all just 1s and 0s: just use RandomAccessFile and start reading. The PDF specification defines which newline characters are valid (there are several). Grab a hex editor, open a PDF, and you can at least start getting a feel for things. Be careful where you insert your lines, though: you'll need to add them towards the end of the file, where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
Go to the end of the file
Search backwards for startxref
Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
Save the PDF
You can (and should) try this out manually in a hex editor first.
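Once you have verified the idea in a hex editor, a minimal Java sketch of the same steps (reading the whole file into memory for simplicity; the file name and payload are placeholders) could look like this:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class InsertPdfComment {
        public static void main(String[] args) throws Exception {
            Path pdf = Paths.get("output.pdf");
            byte[] bytes = Files.readAllBytes(pdf);

            // Find the last occurrence of "startxref", searching backwards from the end
            byte[] marker = "startxref".getBytes(StandardCharsets.US_ASCII);
            int pos = lastIndexOf(bytes, marker);
            if (pos < 0) {
                throw new IllegalStateException("startxref not found");
            }

            // The hidden payload, as a PDF comment line ending with a newline
            byte[] comment = "%MYAPP-DATA: base64-payload-goes-here\n"
                    .getBytes(StandardCharsets.US_ASCII);

            // Splice the comment in immediately before the startxref keyword
            byte[] result = new byte[bytes.length + comment.length];
            System.arraycopy(bytes, 0, result, 0, pos);
            System.arraycopy(comment, 0, result, pos, comment.length);
            System.arraycopy(bytes, pos, result, pos + comment.length, bytes.length - pos);
            Files.write(pdf, result);
        }

        // Index of the last occurrence of pattern in data, or -1 if absent
        private static int lastIndexOf(byte[] data, byte[] pattern) {
            outer:
            for (int i = data.length - pattern.length; i >= 0; i--) {
                for (int j = 0; j < pattern.length; j++) {
                    if (data[i + j] != pattern[j]) continue outer;
                }
                return i;
            }
            return -1;
        }
    }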
Really important: are your users going to be saving changes to these files? i.e. if they fill in the form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217

Rendering PDF on a website à la Google Documents

In a current project I need to display PDFs in a web page. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Documents, where PDFs are displayed as images but text can still be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on the server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java based project, using wicket.
I have some suggestions, but it will definitely be hard to implement this. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into an image. Store these images on your server or use a caching technique. Converting a PDF into an image is not hard.
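As a rough sketch of that PDF-to-image step with pdf-renderer (file names are placeholders; this follows the library's standard pattern of mapping the file into a ByteBuffer and rendering a page):

    import java.awt.Image;
    import java.awt.Rectangle;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import javax.imageio.ImageIO;
    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;

    public class PdfPageToImage {
        public static void main(String[] args) throws Exception {
            // Map the PDF into memory, as pdf-renderer expects a ByteBuffer
            try (RandomAccessFile raf = new RandomAccessFile("document.pdf", "r");
                 FileChannel channel = raf.getChannel()) {
                ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                PDFFile pdfFile = new PDFFile(buf);

                PDFPage page = pdfFile.getPage(0);   // first page
                Rectangle rect = new Rectangle(0, 0,
                        (int) page.getBBox().getWidth(),
                        (int) page.getBBox().getHeight());

                // Render the page; the last 'true' blocks until drawing is finished
                Image img = page.getImage(rect.width, rect.height, rect, null, true, true);
                BufferedImage image = new BufferedImage(rect.width, rect.height,
                        BufferedImage.TYPE_INT_RGB);
                image.getGraphics().drawImage(img, 0, 0, null);
                ImageIO.write(image, "png", new File("page-1.png"));
            }
        }
    }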
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay textual data over the image. This overlay text is selectable, while the real text is still only in the original image. To get the original text, see the next approach, or do it yourself (see the conclusion).
The original text then must be overlaid on the image, which is a pain.
Second approach:
The PDF specification allows textual information to be linked to a font. Most documents use a subset of Type 3 or Type 1 fonts which (often) use a standard character set (I thought it was Unicode, but I'm not sure). If your PDF document does not use a standard character set (i.e. it has defined its own), it's impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert to a textual representation.
Read the PDF document, read the graphics objects, and parse the instructions for rendering text (use the PDF specification for more insight into this process), converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts used (their names and attributes) and the instructions (letter spacing, line spacing, size, face) in the graphics objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code an HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
In this approach you will lose the original look of the document. There are some PDF generation libraries which do not use the Adobe font techniques. This is also a problem with the first approach: even though you can see the text, you cannot select it (but that matches the behavior of the official Adobe Reader, so not a big deal, you might say).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR), since it's really overkill for such a problem and it also has several drawbacks. This is the approach Google uses; if there are characters which are not recognized, a human being does the processing.
If you are into the human-processing thing, you can use just the Type Select library and the PDF-to-image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.
