Cannot read generated text of pdf file in Java

Cannot read generated text of pdf file in Java - java

I am trying to read the text in Java and it isn't doing well.
Here is my code
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File pdfFile = new File("1.pdf");
PDFParser parser = new PDFParser(new RandomAccessFile(pdfFile,"rw"));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
But the result like this
Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries.
I found this error occurred because of xfa pdf document.
But I don't know about xfa format of my pdf document.
So please Let me know how can I know about xfa format.
Someone help me please.
Thank you!

To sum up what has been said or hinted at in the comments...
The text quoted by the OP,
Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.
...
is the content of the single PDF page Adobe software commonly puts into PDFs with a pure XFA form.
XFA forms constitute an alternative way to describe forms in PDFs. In contrast to the AcroForm way, XFA forms only use PDFs as an envelope carrying a XML stream describing properties, behavior, and values of the form in a way unrelated to any other PDF structure.
Thus, many PDF processors offer a rudimentary support for XFA forms only (or none at all), the main exception being (obviously) Adobe products.
As a result XFA has been marked deprecated in the current PDF specification ISO 32000-2.
In case of PDFBox the XFA support is restricted to the feature of retrieval of the XFA XML data. Text extraction using the PdfTextStripper and related classes only operates on the regular PDF content and, therefore, only retrieves the text reported by the OP.
To access the content of XFA forms, you can retrieve the XFA resource using PDAcroForm.getXFA().

Related

Save as print to pdf option using Java 8

I am watermarking pdf document using itext7 library. it is preserving layers and shows one of its signature invalid.
I want to flatten the created document.
When i tried saving the document manually using Adobe print option, it flattens all signature and makes the document as valid document. Same functionality i want with java program.
Is there any way using java program, we can flatten pdf document?

According to your tag selection you appear to be using iText 7 for Java.
How to flatten a PDF AcroForm form using iText 7 is explained in the iText 7 knowledge base example Flattening a form. The pivotal code is:
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST));
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
// If no fields have been explicitly included, then all fields are flattened.
// Otherwise only the included fields are flattened.
form.flattenFields();
pdfDoc.close();
(https://kb.itextpdf.com/home/it7kb/examples/flattening-a-form visited 2021-05-24)

java - generate unicode pdf with Apache PDFBox

I have to generate pdf in my spring mvc application. recently I tested iTextPdf library, but i could not generate unicode pdf document. in fact I didn't see non-latin characters in the generated document. I decided to use Apache PDFBox for my purpose, but I don't know has it support unicode characters? If has, is there any good tutorial for learning pdfBox? And If not, which library should I use?
Thanks in advance.

The 1.8.* versions don't support PDF generation with Unicode, but the 2.0.* versions do. This is the example EmbeddedFonts.java:
public class EmbeddedFonts
{
public static void main(String[] args) throws IOException
{
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
String dir = "../pdfbox/src/main/resources/org/apache/pdfbox/resources/ttf/";
PDType0Font font = PDType0Font.load(document, new File(dir + "LiberationSans-Regular.ttf"));
PDPageContentStream stream = new PDPageContentStream(document, page);
stream.beginText();
stream.setFont(font, 12);
stream.setLeading(12 * 1.2);
stream.newLineAtOffset(50, 600);
stream.showText("PDFBox Unicode with Embedded TrueType Font");
stream.newLine();
stream.showText("Supports full Unicode text ?");
stream.newLine();
stream.showText("English русский язык Tiếng Việt");
stream.endText();
stream.close();
document.save("example.pdf");
document.close();
}
}
Note that unlike iText, PDFBox support for PDF creation is very low level, i.e. we don't support paragraphs or tables out of the box. There is no tutorial, but a lot of examples. The API orients itself on the PDF specification.

The current version of Apache PDFBox can't deal with Unicode, see:
https://pdfbox.apache.org/ideas.html
iTextPdf v. 5.x generates pdf files with Unicode. There is an exemple here:
iText in Action: Chapter 11: Choosing the right font
part3.chapter11.UnicodeExample
http://itextpdf.com/examples/iia.php?id=199
To run it, you just need to adapt the value of EncodingExample.FONT and to add some code to create the output file.

pdf conversion Using java library

I am willing to convert xhtml files into pdf/a format or pdf files to pdf/a format.. Can anyone please suggest which java library I can use..
Thank you
I will make my example more specific
I have a simple html file xyz.html
<html><body>
hello
<br>
<font style = "Helvetica">hello</font>
<br>
</body></html>
java code :
Document document = new Document(PageSize.A4);
FileOutputStream fout = new FileOutputStream(pdffile);
PdfWriter pdfWriter = PdfWriter.getInstance(document, fout);
pdfWriter.setPDFXConformance(PdfWriter.PDFA1B);
FileReader fr = new FileReader(xyz.html);
document.open();
HashMap<String, Object> Provider = new HashMap<String, Object>();
DefaultFontProvider def = new
Provider.put(HTMLWorker.FONT_PROVIDER, def);
HTMLWorker htmlWorker = new HTMLWorker(document);
htmlWorker.setProviders(Provider);
htmlWorker.parse(fr);
I get the error com.itextpdf.text.pdf.PdfXConformanceException: All the fonts must be embedded. This one isn't: Helvetica

try the flying soucer: http://code.google.com/p/flying-saucer/

Check for iText library which has support for both Java and .net
http://itextpdf.com/
Few examples in the below link :
http://itextpdf.com/book/examples.php
http://www.rgagnon.com/javadetails/java-html-to-pdf-using-itext.html
This is proprietory but Its really a smart enterprise library and has good customer support.

Consider Apache FOP project, it supports conversion of xml files to pdf files.

I work at Expected Behavior, and we've developed a SaaS application called DocRaptor that converts HTML to PDF using Prince XML as our rendering engine. DocRaptor uses HTTP POST requests to generate PDF files, and can be used with Java.
Here's a link to our Java example:
DocRaptor Java example
And a link to DocRaptor's home page:
DocRaptor
DocRaptor IS a subscription based service, but our free plan allows you to create up to 5 documents per month, and we don't embed watermarks or restrict the size of free documents.

How to automate PDF form-filling in Java

I am doing some "pro bono" development for a food pantry near where I live. They are inundated with forms and paperwork, and I would like to develop a system that simply reads data from their MySQL server (which I set up for them on a previous project) and feeds data into PDF versions of all the forms they are required to fill out. This will help them out enormously and save them a lot of time, as well as get rid of a lot of human errors that are made when filling out these forms.
Not knowing anything about the internals of PDF files, I can foresee two avenues here:
Harder Way: It is possible to scan a paper document, turn it into a PDF, and then have software that "fills out" the PDF simply by saying "add text except blah to the following (x,y) coordinates..."; or
Easier Way: PDF specification already allows for the construct of "fields" that can be filled out; this way I just write code that says "add text excerpt blah to the field called *address_value*...", etc.
So my first question is: which of the two avenues am I facing? Does PDF have a concept of "fields" or do I need to "fill out" these documents by telling the PDF library the pixel coordinates of where to place data?
Second, I obviously need an open source (and Java) library to do this. iText seems to be a good start but I've heard it can be difficult to work with. Can anyone lend some ideas or general recommendations here? Thanks in advance!

You can easily merge data into PDF's fields using the FDF(Form Data Format) technology.
Adobe provides a library to do that : Acrobat Forms Data Format (FDF) Toolkit
Also Apache PDFBox can be used to do that.

Please take a look at the chapter about interactive forms in the free ebook The Best iText Questions on StackOverflow. It bundles the answers to questions such as:
How to fill out a pdf file programatically?
How can I flatten a XFA PDF Form using iTextSharp?
Checking off pdf checkbox with itextsharp
How to continue field output on a second page?
finding out required fields to fill in pdf file
and so on...
Or you can watch this video where I explain how to use forms for reporting step by step.
See for instance:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream(dest));
AcroFields fields = stamper.getAcroFields();
fields.setField("name", "CALIFORNIA");
fields.setField("abbr", "CA");
fields.setField("capital", "Sacramento");
fields.setField("city", "Los Angeles");
fields.setField("population", "36,961,664");
fields.setField("surface", "163,707");
fields.setField("timezone1", "PT (UTC-8)");
fields.setField("timezone2", "-");
fields.setField("dst", "YES");
stamper.setFormFlattening(true);
stamper.close();
reader.close();
}

public void fillPDF()
{
try {
PDDocument pDDocument = PDDocument.load(new File("D:/pdf/pdfform.pdf")); // pdfform.pdf is input file
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
PDField field = pDAcroForm.getField("Given Name Text Box");
field.setValue("firstname");
field = pDAcroForm.getField("Family Name Text Box");
field.setValue("lastname");
field = pDAcroForm.getField("Country Combo Box");
field.setValue("Country");
System.out.println("country combo" );
field = pDAcroForm.getField(" Driving License Check Box");
field = pDAcroForm.getField("Favourite Colour List Box");
System.out.println("country combo"+ field.isRequired());
pDDocument.save("D:/pdf/pdf-java-output.pdf");
pDDocument.close();
} catch (IOException e) {
e.printStackTrace();
}
}

Securing PDF Generated from iTextPdf

I have made a software that generate a pdf as the part of its function, I am using iTextPDF Java library to generate PDF. For a demo version of my software, I added text watermarking (like "demo software") by use of following code
PdfContentByte under = writer.getDirectContentUnder();
BaseFont baseFont = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.EMBEDDED);
under.beginText();
under.setColorFill(BaseColor.RED);
under.setFontAndSize(baseFont, 25);
under.showTextAligned(PdfContentByte.ALIGN_CENTER," demo software",250, 470,55);
under.endText();
After it I converted it to .docx format using PDF to Word converter and the resultant docx file does not contain the watermark also the contents are easily editable so as a result the sole purpose of giving demo software is vanished.
How can I achieve permanent watermarking so that pdf to word converter wont be able to remove it.
One idea come to my mind is that instead of putting the text in the pdf there should be a way of converting all the text of a page first into an image then making the pdf comprising of those images. But I am unsure on how to achieve this using iTextPdf.

You can encrypt your PDF so that it cannot be modified without an owner password, after you have generated your PDF, create a PDFStamper with your PDF as input
and encrypt the pdf like the following:
final PdfReader reader = new PdfReader(your_input_stream);
final PdfStamper stamper = new PdfStamper(reader, your_output_stream);
stamper.setEncryption(PdfWriter.ENCRYPTION_AES_128 | PdfWriter.DO_NOT_ENCRYPT_METADATA,
"your_user_password", "your_owner_password", PdfWriter.ALLOW_PRINTING);
stamper.close();
As a side note, i would recommend not using a hardcoded owner password; since you have no need for the owner password after the file has been generated, I would suggest making it a SHA hash of a random string of say 20 alphanumeric characters.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Cannot read generated text of pdf file in Java - java

Related

Save as print to pdf option using Java 8

java - generate unicode pdf with Apache PDFBox

pdf conversion Using java library

How to automate PDF form-filling in Java

Securing PDF Generated from iTextPdf

Categories

Resources