I do some acrofield manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes, which my code does not touch. Yet when I store the manipulated PDF to disk and inspect the value of the checkbox, I can see that the value of cb_a.0 has been changed from ÄÖÜ?ß to ?????.
My further processing fails because of this unintended change. Any idea how to prevent that?
My test case:
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);

    pdfMani.changeBoxedFieldsToOne();

    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of field name {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null) {
            LOG.debug("found field which has a dot but does not start with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF: https://ufile.io/gr01f or here: https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here is the output: https://ufile.io/k8cr3 or here: https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (in particular outside the range 0..127, and your umlauts are outside).
The first error in PDF Name handling is that PDFBox internally represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong: according to the PDF specification, a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation.
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities, so it can indeed falsify data by making unequal names equal.
This is not the problem in your case, though; the names you used can be interpreted.
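To illustrate how such a mixed decoding strategy can collapse two distinct names into the same string, here is a standalone Java sketch (not PDFBox code; the byte values are chosen only for illustration):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedDecodingCollision {
    public static void main(String[] args) {
        // Two different PDF Name byte sequences:
        byte[] utf8Name   = {(byte) 0xC3, (byte) 0xA4}; // valid UTF-8 for "ä"
        byte[] cp1252Name = {(byte) 0xE4};              // invalid UTF-8; a CP-1252 fallback also yields "ä"

        String first  = new String(utf8Name, StandardCharsets.UTF_8);
        String second = new String(cp1252Name, Charset.forName("windows-1252"));

        // Both decode to the same string, although the underlying name bytes differ.
        System.out.println(first.equals(second)); // true
    }
}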
The second error, though, is that while serializing the PDF, PDFBox only properly encodes those characters in the strings representing names which are US_ASCII; everything else is replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example showing the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key all characters are replaced by '?' in hex encoded form (#3F) while in the string value the characters are appropriately encoded.
After a bit of searching I stumbled over an answer I gave on this topic almost two years ago. Back then, the PDF Name object bytes were always interpreted as UTF-8 encoded, which led to the issues in that question.
As a consequence, the issue PDFBOX-3347 was created. To resolve it, the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I am not a fan of that strategy.
In that Stack Overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization, but that aspect has not been addressed at all yet.
Another related issue is PDFBOX-3519, but its resolution was also reduced to trying to fix the parsing of PDF Names, ignoring their serialization.
Yet another related issue is PDFBOX-2836.
I have a function that places my text into the document in something like a table.
private static void addCenteredParagraph(Document document, float width, String text) {
    PdfFont timesNewRomanBold = null;
    try {
        timesNewRomanBold = PdfFontFactory.createFont(StandardFonts.TIMES_BOLD);
    } catch (IOException e) {
        LOGGER.error("Failed to create Times New Roman Bold font.");
        LOGGER.error(e);
    }

    List<TabStop> tabStops = new ArrayList<>();
    // Create a TabStop at the middle of the page
    tabStops.add(new TabStop(width / 2, TabAlignment.CENTER));
    // Create a TabStop at the end of the page
    tabStops.add(new TabStop(width, TabAlignment.LEFT));

    Paragraph p = new Paragraph().addTabStops(tabStops).setFontSize(14);
    if (timesNewRomanBold != null) {
        p.setFont(timesNewRomanBold);
    }
    p.add(new Tab()).add(text).add(new Tab());
    document.add(p);
}
But my problem is that the exported PDF shows empty characters instead of the letters ő, Ő, ű, Ű.
Times New Roman supports these characters, so I think I need to set the encoding to UTF-8, but I couldn't find a way to do that on Google.
Can someone please explain how to make these characters appear properly in the PDF?
I tried the following, but some of them use deprecated functions, are not applicable to the arguments I'm passing, or I don't use Chunk:
Itext PDF writer, Is there any way to allow unicode subscript symbol in the pdf? (Without setTextRise)
How can I set encoding for iText when I insert value to placeholder in pdf form?
Edit: I figured out that timesNewRomanBold is null, even if I set it to HELVETICA or COURIER.
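For reference, a minimal sketch of how an encoding can be specified when creating a font in iText 7 (the font file path and the choice of IDENTITY_H are assumptions for illustration, not part of the original question; PdfEncodings lives in com.itextpdf.io.font):

    // Hypothetical alternative inside the existing try block: load an embedded
    // TrueType font with a Unicode-capable encoding so that glyphs such as
    // ő, Ő, ű, Ű can be rendered. The .ttf path is a placeholder.
    timesNewRomanBold = PdfFontFactory.createFont(
            "C:/Windows/Fonts/timesbd.ttf", PdfEncodings.IDENTITY_H);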
I am reading PDF documents via the iTextSharp library.
But these documents are in the Czech language, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any idea? Or is there some solution for replacing these characters with plain r e z s c?
This is the code in my method. Thanks.
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
    text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();
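As a side note on the "replace these characters with plain r e z s c" part of the question: in Java, one common way to strip diacritics is Unicode normalization. A minimal sketch (not from the original question; the input string is just an example):

import java.text.Normalizer;

public class StripDiacritics {
    public static void main(String[] args) {
        String czech = "ř ě ž š č"; // example input
        // Decompose characters into base letters plus combining marks (NFD),
        // then drop all combining marks, leaving the plain letters.
        String plain = Normalizer.normalize(czech, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        System.out.println(plain); // r e z s c
    }
}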
I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    FileOutputStream fos = new FileOutputStream(DEST);
    for (int page = 1; page <= 1; page++) {
        fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
    }
    fos.flush();
    fos.close();
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on Stack Overflow can check whether you fail to extract text because of an error in your code, or because the PDF doesn't allow the text to be extracted.
How do you convert a specific charset to unicode in Java?
Charsets have been discussed quite a lot here, but I think this one hasn't been covered yet.
I have a hex string that meets the criterion length % 4 == 0 (e.g. \ud3faef8e). Usually I just display it in an HTML container and add &#x to the front and ; to the back of each hex quadruple.
But in this case the following procedure led to the correct output (non-Java):
1) Paste the hex string into a hex editor and save the file to test.txt (UTF-8).
2) Open the file with Notepad++.
3) Change the encoding to Simplified Chinese (GB2312).
Now I'm trying to do the same in Java.
// having hex, convert to ascii
String ascii = "";
for (int cnt = 0; cnt <= unicode.length() - 2; cnt += 2) {
    String tmp = unicode.substring(cnt, cnt + 2);
    int decimal = Integer.parseInt(tmp, 16);
    ascii += (char) decimal;
}
// writing ascii to file at this point leads to the same result as in step 2 before
try {
    // get the bytes
    byte[] utf8 = ascii.getBytes("UTF-8"); // == UTF8
    // convert to gb2312
    String converted = new String(utf8, "GB2312"); // == EUC_CN
    // write to file (writer with declared UTF-8)
    writeToFile(converted, 20 + cntu);
    cntu++;
} catch (Exception e) {
    System.err.println(e.getMessage());
}
The output matches the expected output, except that the following character randomly shows up: �. Why does this one come up, and how can I get rid of it?
In the end, what I'd like to get is the converted text as Unicode again, to be able to display it with my original approach (폴), but I haven't figured out a way to get back to the hex values (they don't meet the criterion length % 4 == 0). How do I get the hex values of the characters?
Update 1
To be more precise regarding the input: I'm assuming that it is Unicode, because the String starts with \u, which would be sufficient for my usual approach, but not in the case I am describing above.
Update 2
The writeToFile method:
FileOutputStream fos = new FileOutputStream("test" + id + ".txt");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(str);
out.close();
I tried with GB2312 as well, but there is no change; I still get the ? in between the correct characters.
Update 3
The expected output for \ud3f6ef8e is 遇飵; you get to it by following steps 1) to 3) above (with HxD as an example of a hex editor).
There was no indication that I should delete my question, so I'm writing my final comment as the answer.
I was misinterpreting the incoming hex digits. They were in a specific charset and not Unicode, so they represented the hex values of characters in that charset. What I'm doing now is new String(byteArray, "CharsetName"); and then (int) s.charAt(i) to get the Unicode value and write it to HTML. Thanks for your ideas and hints.
For more details see this answer: https://stackoverflow.com/a/4049781/1338732, and this question: How to convert UTF-8 to unicode in Java?
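A minimal sketch of that approach, assuming the incoming hex digits are GB2312-encoded bytes (the class and helper names are made up for illustration):

import java.nio.charset.Charset;

public class HexToHtmlEntities {
    // Hypothetical helper: decode a hex string as bytes in the given charset,
    // then emit each resulting character as an HTML numeric entity (&#x...;).
    static String toHtmlEntities(String hex, String charsetName) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        String decoded = new String(bytes, Charset.forName(charsetName));
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            html.append("&#x").append(Integer.toHexString(decoded.charAt(i))).append(';');
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // "d3f6ef8e" is the example value from the question, decoded as GB2312
        System.out.println(toHtmlEntities("d3f6ef8e", "GB2312"));
    }
}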
How to check whether a PDF file is password protected or not in Java?
I know of several tools/libraries that can do this, but I want to know if this is possible with just a program in Java.
Update
As per mkl's comment below this answer, it seems that there are two types of PDF structures permitted by the specs: (1) cross-reference tables and (2) cross-reference streams. The following solution only addresses the first type of structure. This answer needs to be updated to address the second type.
====
All of the answers provided refer to third-party libraries, which the OP is already aware of. The OP is asking for a native Java approach. My answer is yes, you can do it, but it will require a lot of work.
It will require a two step process:
Step 1: Figure out if the PDF is encrypted
As per Adobe's PDF 1.7 spec (pages 97 and 115), if the trailer dictionary contains the key "/Encrypt", the PDF is encrypted (the encryption could be simple password protection, or RC4, or AES, or some custom encryption). Here's some sample code:
Boolean isEncrypted = Boolean.FALSE;
try {
    byte[] byteArray = Files.readAllBytes(Paths.get("Resources/1.pdf"));
    // Convert the binary bytes to a String. Caution: this can result in loss of data,
    // but for our purposes we are simply interested in the String portion of the
    // binary PDF data, so we should be fine.
    String pdfContent = new String(byteArray);
    int lastTrailerIndex = pdfContent.lastIndexOf("trailer");
    if (lastTrailerIndex >= 0 && lastTrailerIndex < pdfContent.length()) {
        String newString = pdfContent.substring(lastTrailerIndex, pdfContent.length());
        int firstEOFIndex = newString.indexOf("%%EOF");
        String trailer = newString.substring(0, firstEOFIndex);
        if (trailer.contains("/Encrypt"))
            isEncrypted = Boolean.TRUE;
    }
} catch (Exception e) {
    System.out.println(e);
    // Do nothing
}
Step 2: Figure out the encryption type
This step is more complex. I don't have a code sample yet. But here is the algorithm:
Read the value of the key "/Encrypt" from the trailer as read in step 1 above. E.g. the value is 288 0 R.
Look for the bytes "288 0 obj". This is the location of the "encryption dictionary" object in the document. The object ends at the string "endobj".
Look for the key "/Filter" in this object. The "Filter" is the one that identifies the document's security handler. If the value of "/Filter" is "/Standard", the document uses the built-in password-based security handler.
If you just want to know whether the PDF is encrypted, without worrying about whether the encryption uses owner/user passwords or some advanced algorithm, you don't need step 2 above.
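For illustration only, a rough string-based sketch of that step 2 algorithm (a hypothetical helper continuing the step 1 code above and sharing its caveats):

// Hypothetical sketch of step 2: locate the encryption dictionary referenced by
// /Encrypt and check whether it uses the standard (password-based) security handler.
static boolean usesStandardSecurityHandler(String pdfContent, String trailer) {
    int encryptIndex = trailer.indexOf("/Encrypt");
    if (encryptIndex < 0) {
        return false; // not encrypted at all
    }
    // The value is an indirect reference such as "288 0 R"; grab the object number.
    String afterEncrypt = trailer.substring(encryptIndex + "/Encrypt".length()).trim();
    String objectNumber = afterEncrypt.split("\\s+")[0];
    // Locate the encryption dictionary object, e.g. "288 0 obj ... endobj".
    int objStart = pdfContent.indexOf(objectNumber + " 0 obj");
    if (objStart < 0) {
        return false;
    }
    int objEnd = pdfContent.indexOf("endobj", objStart);
    if (objEnd < 0) {
        objEnd = pdfContent.length();
    }
    String encryptionDictionary = pdfContent.substring(objStart, objEnd);
    // /Filter /Standard identifies the built-in password-based security handler.
    return encryptionDictionary.contains("/Filter")
            && encryptionDictionary.contains("/Standard");
}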
Hope this helps.
You can use PDFBox:
http://pdfbox.apache.org/
Code example:
try
{
    document = PDDocument.load(yourPDFfile);
    if (document.isEncrypted())
    {
        // IT'S ENCRYPTED!
    }
}
Using Maven?
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0</version>
</dependency>
Using the iText PDF API, we can identify a password-protected PDF.
Example:
try {
    new PdfReader("C:\\Password_protected.pdf");
} catch (BadPasswordException e) {
    System.out.println("PDF is password protected..");
} catch (Exception e) {
    e.printStackTrace();
}
You can validate a PDF, i.e. check whether it is readable and writable, by using iText.
Following is the code snippet:
boolean isValidPdf = false;
try {
    InputStream tempStream = new FileInputStream(new File("path/to/pdffile.pdf"));
    PdfReader reader = new PdfReader(tempStream);
    isValidPdf = reader.isOpenedWithFullPermissions();
} catch (Exception e) {
    isValidPdf = false;
}
The correct "how to do it in Java" answer is the one by @vhs.
However, in any application, by far the simplest approach is to use the very lightweight pdfinfo tool to filter the encryption status. Here, using the Windows cmd, I can instantly get a report showing that two different copies of the same file are encrypted:
>forfiles /m *.pdf /C "cmd /c echo #file &pdfinfo #file|find /i \"Encrypted\""
"Certificate (9).pdf"
Encrypted: no
"ds872 source form.pdf"
Encrypted: AES 128-bit
"ds872 filled form.pdf"
Encrypted: AES 128-bit
"How to extract data from a particular area in a PDF file - Stack Overflow.pdf"
Encrypted: no
"Test.pdf"
Encrypted: no
>
The solution:
1) Install PDF Parser: http://www.pdfparser.org/
2) Edit Parser.php in this section:
if (isset($xref['trailer']['encrypt'])) {
    echo('Your alert message');
    exit();
}
3) In your .php form post (e.g. upload.php), insert this:
First, require '...yourdir.../vendor/autoload.php';
then write this function:
function pdftest_is_encrypted($form) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($form);
}
and then call the function:
pdftest_is_encrypted($_FILES["upfile"]["tmp_name"]);
That is all. If you try to load a password-protected PDF, the system returns the error "Your alert message".
I need to parse a file (actually a .pdf) into a String in Java and then go back to a file. Between those processes I'll apply some patches to the given string, but that is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encodings, but in FileUtils there is not much detailed info about this.
Can anyone help?
Thanks!
Added for further understanding:
Why do I need this?
I have very large PDFs on one machine that are replicated on another one. The first machine is in charge of creating those PDFs. Due to the low connectivity of the second machine and the big size of the PDFs, I don't want to sync the whole PDFs, but only the changes made.
To create patches and apply them, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a string, apply a generated patch, and put it back into a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input byte stream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short: don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
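A minimal sketch of such a byte-based round trip using Commons IO (file names are placeholders):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class ByteRoundTrip {
    public static void main(String[] args) throws IOException {
        File source = new File("input.pdf"); // placeholder path
        // Read and write raw bytes; no charset decoding is involved,
        // so the copy is byte-for-byte identical to the source.
        byte[] data = FileUtils.readFileToByteArray(source);
        File copy = File.createTempFile("copy", ".pdf");
        FileUtils.writeByteArrayToFile(copy, data);
        System.out.println(FileUtils.contentEquals(source, copy)); // true
    }
}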
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in the file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files are created on, i.e. one might have CR LF while the other only has LF or CR (1 byte difference per line break).
According to the JavaDoc, both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (the JVM's default encoding can be queried using System.getProperty("file.encoding")).
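For example, the test from the question could be run with an explicit charset on both sides (UTF-8 assumed here; note that this only helps for genuine text files, not for binary PDFs):

// Same round trip as in the question, but with an explicit encoding for both
// reading and writing, so the two operations agree on the charset.
String f1String = FileUtils.readFileToString(f1, "UTF-8");
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String, "UTF-8");
assertTrue(FileUtils.contentEquals(f1, temp));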
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }
    File temp = File.createTempFile("deleteme", "deleteme");
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By casting between byte and char, I get symmetry in the conversion.
Thank you all!
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
//How to decode it
public void testConversionPDFtoBase64() throws IOException
{
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);
    // or actually give the path to the pdf file.
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}