How to check whether a PDF file is password protected or not in Java?
I know of several tools/libraries that can do this, but I want to know if this is possible with just a plain Java program.
Update
As per mkl's comment below this answer, it seems that there are two types of PDF structures permitted by the specs: (1) cross-reference tables and (2) cross-reference streams. The following solution only addresses the first type of structure; this answer needs to be updated to address the second type.
====
All of the answers provided above refer to third-party libraries, which the OP is already aware of. The OP is asking for a native Java approach. My answer is yes, you can do it, but it will require a lot of work.
It is a two-step process:
Step 1: Figure out if the PDF is encrypted
As per Adobe's PDF 1.7 specs (pages 97 and 115), if the trailer record contains the key "/Encrypt", the PDF is encrypted (the encryption could be simple password protection or RC4 or AES or some custom encryption). Here's a sample code:
boolean isEncrypted = false;
try {
    byte[] byteArray = Files.readAllBytes(Paths.get("Resources/1.pdf"));
    // Convert the binary bytes to a String. Caution: this can lose data, but for our
    // purposes we are only interested in the String portion of the binary PDF data,
    // so we should be fine.
    String pdfContent = new String(byteArray);
    int lastTrailerIndex = pdfContent.lastIndexOf("trailer");
    if (lastTrailerIndex >= 0) {
        String newString = pdfContent.substring(lastTrailerIndex);
        int firstEOFIndex = newString.indexOf("%%EOF");
        String trailer = firstEOFIndex >= 0 ? newString.substring(0, firstEOFIndex) : newString;
        if (trailer.contains("/Encrypt"))
            isEncrypted = true;
    }
}
catch (Exception e) {
    System.out.println(e);
    // Do nothing
}
Step 2: Figure out the encryption type
This step is more complex. Here is the algorithm (a rough sketch follows the list below):
Read the value of the key "/Encrypt" from the trailer as read in step 1 above. E.g., the value might be 288 0 R.
Look for the bytes "288 0 obj". This is the location of the "encryption dictionary" object in the document. This object boundary ends at the string "endobj".
Look for the key "/Filter" in this object. The "Filter" is the one that identifies the document's security handler. If the value of the "/Filter" is "/Standard", the document uses the built-in password-based security handler.
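A rough, untested sketch of that algorithm in plain Java (it reuses the pdfContent and trailer strings from step 1; names such as encryptionDictionary are just illustrative, and PDFs using cross-reference streams or object streams are not handled):

// Sketch only: pdfContent and trailer are the strings computed in step 1.
int encryptKeyIndex = trailer.indexOf("/Encrypt");
if (encryptKeyIndex >= 0) {
    // The value is an indirect reference such as "288 0 R"
    String[] reference = trailer.substring(encryptKeyIndex + "/Encrypt".length())
            .trim().split("\\s+");
    String objectHeader = reference[0] + " " + reference[1] + " obj"; // e.g. "288 0 obj"
    int objStart = pdfContent.indexOf(objectHeader);
    if (objStart >= 0) {
        int objEnd = pdfContent.indexOf("endobj", objStart);
        String encryptionDictionary = pdfContent.substring(objStart, objEnd);
        // "/Filter /Standard" identifies the built-in password-based security handler
        if (encryptionDictionary.contains("/Standard")) {
            // standard (password-based) encryption
        }
    }
}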
If you just want to know whether the PDF is encrypted, without worrying about whether the encryption is in the form of an owner/user password or some advanced algorithm, you don't need step 2 above.
Hope this helps.
You can use PDFBox:
http://pdfbox.apache.org/
Code example:
try
{
    PDDocument document = PDDocument.load(yourPDFfile);
    if (document.isEncrypted())
    {
        // It's encrypted!
    }
    document.close();
}
catch (IOException e)
{
    e.printStackTrace();
}
Using Maven?
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0</version>
</dependency>
Using the iText PDF API we can identify a password-protected PDF.
Example:
try {
    new PdfReader("C:\\Password_protected.pdf");
} catch (BadPasswordException e) {
    System.out.println("PDF is password protected..");
} catch (Exception e) {
    e.printStackTrace();
}
You can validate a PDF, i.e. check that it can be read and written, by using iText.
Following is the code snippet:
boolean isValidPdf = false;
try {
    InputStream tempStream = new FileInputStream(new File("path/to/pdffile.pdf"));
    PdfReader reader = new PdfReader(tempStream);
    isValidPdf = reader.isOpenedWithFullPermissions();
} catch (Exception e) {
    isValidPdf = false;
}
The correct "how to do it in Java" answer is the one by #vhs.
However, in any application, by far the simplest approach is to use the very lightweight pdfinfo tool to filter out the encryption status. Here, using the Windows cmd shell, I can instantly get a report showing that two different copies of the same file are encrypted:
>forfiles /m *.pdf /C "cmd /c echo #file &pdfinfo #file|find /i \"Encrypted\""
"Certificate (9).pdf"
Encrypted: no
"ds872 source form.pdf"
Encrypted: AES 128-bit
"ds872 filled form.pdf"
Encrypted: AES 128-bit
"How to extract data from a particular area in a PDF file - Stack Overflow.pdf"
Encrypted: no
"Test.pdf"
Encrypted: no
>
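If you would rather call pdfinfo from Java instead of the shell, a minimal sketch could look like this (it assumes pdfinfo is installed and on the PATH; the class and method names are only illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PdfInfoCheck {
    // Runs the external pdfinfo tool and scans its output for the "Encrypted:" line.
    public static boolean isEncrypted(String pdfPath) throws Exception {
        Process process = new ProcessBuilder("pdfinfo", pdfPath)
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("Encrypted:")) {
                    String value = line.substring("Encrypted:".length()).trim();
                    return !value.startsWith("no");
                }
            }
        }
        return false;
    }
}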
The solution:
1) Install PDF Parser http://www.pdfparser.org/
2) Edit Parser.php in this section:
if (isset($xref['trailer']['encrypt'])) {
    echo('Your Alert message');
    exit();
}
3) In your .php form post (e.g. upload.php) insert this:
First, add: require '...yourdir.../vendor/autoload.php';
then write this function:
function pdftest_is_encrypted($form) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($form);
}
and then call the function
pdftest_is_encrypted($_FILES["upfile"]["tmp_name"]);
That's all. If you try to load a password-protected PDF, the system returns the error "Your Alert message".
Related
I do some AcroField manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes, which should not be changed. But when I store the manipulated PDF to disk and inspect the value of the checkbox, I can see that the value of cb_a.0 has been changed from ÄÖÜ?ß to ?????
My further processing fails because of this unintended change; any idea how to prevent that?
My test case:
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of fieldname {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null) {
            LOG.debug("found field which has a dot but does not start with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF: https://ufile.io/gr01f or here https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here is the output: https://ufile.io/k8cr3 or here https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (in particular outside the range 0..127, and your umlauts are outside).
The first error in PDF Name handling is that PDFBox internally represents names as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong; according to the PDF specification, a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation.
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities (for example, the byte pair 0xC3 0xA4 is UTF-8 for "ä", while the single byte 0xE4 fails UTF-8 decoding and falls back to CP-1252, also yielding "ä"), so this can indeed falsify documents by making unequal names equal.
This is not the problem in your case, though; the names you used can be interpreted.
The second error, though, is that while serializing the PDF, PDFBox only properly encodes the characters of name strings which are US_ASCII; everything else is replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example to show the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key all characters are replaced by '?' in hex encoded form (#3F) while in the string value the characters are appropriately encoded.
After a bit of searching I stumbled over an answer on this topic I gave almost two years ago. Back then the PDF Name object bytes were always interpreted as UTF-8 encoded which led to issues in that question.
As a consequence the issue PDFBOX-3347 was created. To resolve it the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I'm not a friend of that strategy.
In that stack overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization but that aspect has not yet been addressed at all.
Another related issue is PDFBOX-3519 but its resolution also was reduced to trying to fix the parsing of PDF Names, ignoring the serialization of it.
Yet another related issue is PDFBOX-2836.
I have no idea how I can insert a boolean sign into an RTF document from a Java program. I'm thinking about √ or ✓ and –. I tried inserting these signs into an empty document, saving it as *.rtf and then opening it in Notepad++, but there is a lot of code (~160 lines) and I cannot understand it. Do you have any idea?
After a short search I found this:
Writing unicode to rtf file
So a final code version would be:
public void writeToFile() {
    String strJapanese = "日本語✓";
    try {
        FileOutputStream fos = new FileOutputStream("test.rtf");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(strJapanese);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please read about RTF.
√ or ✓ and – are not available in every charset, so specify one. I advise you to output in UTF-8 (check here on how to do this). You might need to encode the sign in RTF as well; check Wikipedia.
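If you write a real RTF document (rather than plain text saved with an .rtf extension), non-ASCII characters such as ✓ are normally written with the \uN escape, where N is the decimal code point followed by a fallback character. A minimal sketch (the file name and the bare-bones RTF skeleton are just for illustration):

import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class RtfCheckMark {
    public static void main(String[] args) throws IOException {
        char check = '\u2713'; // CHECK MARK
        // \uN escape: decimal code point, then '?' as the fallback for old readers
        String rtf = "{\\rtf1\\ansi Checked: \\u" + (int) check + "?}";
        try (Writer out = new FileWriter("test.rtf")) {
            out.write(rtf);
        }
    }
}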
Recently I've implemented an application in Java that uses the Google Docs API v3.0. New entries are created like this:
DocumentListEntry newEntry = new DocumentListEntry();
newEntry.setFile(file, Common.resolveMimeType(file)); //Common is a custom class
newEntry.setFilename(entryTitle.getPlainText()); //entryTitle is a TextConstruct
newEntry.setTitle(entryTitle);
newEntry.setDraft(false);
newEntry.setHidden(file.isHidden());
newEntry.setMd5Checksum(Common.getMD5HexDigest(file));
Trust me when I tell you that Common.getMD5HexDigest(file) returns a valid and unique MD5 Hexadecimal hash.
Now, the file uploads properly yet when retrieving the file and checking the MD5 checksum through the entry.getMd5Checksum() method, it always returns null.
I've tried EVERYTHING, even setting the ETag, ResourceID and VersionID, but they all get overridden with default values (null or server-generated strings).
I would guess that you need to set the checksum to the MD5 hash of the file's contents, not the hash of the path name.
Why would they (Google) care about the path? It makes no sense at all. Forgive me if I misinterpreted your code, but I think you have misconceived the concept of file checksums.
Anyway, what you need to do is eat (digest) the file and not the path:
import java.security.*;
import java.util.*;
import java.math.*;
import java.io.*;

public class MD5 {
    private MessageDigest mDigest;
    private File openFile;
    private FileInputStream ofis;
    private int fSize;
    private byte[] fBytes;

    public MD5(String filePath) {
        try { mDigest = MessageDigest.getInstance("MD5"); }
        catch (NoSuchAlgorithmException e) { System.exit(1); }
        openFile = new File(filePath);
    }

    public String toString() {
        try {
            // Read the whole file into memory and feed its bytes to the digest
            ofis = new FileInputStream(openFile);
            fSize = ofis.available();
            fBytes = new byte[fSize];
            ofis.read(fBytes);
        } catch (Throwable t) {
            return "Can't read file or something";
        }
        mDigest.update(fBytes);
        return new BigInteger(1, mDigest.digest()).toString(16);
    }

    public static void main(String[] argv) {
        MD5 md5 = new MD5("someFile.ext");
        System.out.println(md5);
    }
}
So the error in your snippet above is here:
messageDigest.update(String.valueOf(file.hashCode()).getBytes());
Now, I can show that my class gives the correct md5sum of the file which is most likely what you need. Just read the javadoc of the method if you don't trust me:
http://gdata-java-client.googlecode.com/svn/trunk/java/src/com/google/gdata/data/docs/DocumentListEntry.java
What it says is:
* Set the MD5 checksum calculated for the document.
... nothing about the path's checksum :)
here:
$ echo "Two dogs are sleeping on my couch" > someFile.ext
$ echo "Two dogs are sleeping on my couch" |md5sum
1d81559b611e0079bf6c16a2c09bd994 -
$ md5sum someFile.ext
1d81559b611e0079bf6c16a2c09bd994 someFile.ext
$ javac MD5.java && java MD5
1d81559b611e0079bf6c16a2c09bd994
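For reference, a more compact sketch of the same idea using java.nio (it reads the whole file into memory; the %032x format also keeps the leading zeros that BigInteger.toString(16) would drop):

import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Md5Hex {
    // Computes the MD5 of the file's contents (not its path) as a 32-character hex string.
    public static String md5Hex(String filePath) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(filePath));
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("someFile.ext"));
    }
}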
After struggling a few weeks with the MD5 checksum problem (to verify whether the content of the file changed over time), I came up with a solution that doesn't rely on the MD5 checksum of the file but on the client's last-update attribute of the file.
This solution is for everyone who wants to check whether a file has changed over time. However, on any operating system "an update" can be considered the act of opening the file and saving it, with or without making any changes to its content. So it's not perfect, but it does save some time and bandwidth.
Solution:
long lastModified = new DateTime(
        new Date(file.lastModified()), TimeZone.getDefault()
).getValue();

if (lastModified > entry.getUpdated().getValue()) {
    //update the file
}
Where file is a File instance of the desired file and entry is the DocumentListEntry associated with the local file.
I need to parse a file (actually a .pdf) to a String and go back to a file. Between those processes I'll apply some patches to the given string, but this is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encodings, but FileUtils does not give much detailed info about this.
Anyone can help?
Thanks!
Added for further understanding:
Why do I need this?
I have very large PDFs on one machine that are replicated on another one. The first one is in charge of creating those PDFs. Due to the low connectivity of the second machine and the big size of the PDFs, I don't want to sync the whole PDFs, but only the changes done.
To create patches and apply them, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a String, apply a generated patch, and put it back into a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
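A minimal sketch of that byte-based round trip (the file names here are only placeholders):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class BinaryRoundTrip {
    public static void main(String[] args) throws IOException {
        // Read the PDF as raw bytes -- no character decoding involved.
        byte[] data = FileUtils.readFileToByteArray(new File("original.pdf"));
        // ... apply binary-safe changes to data here if needed ...
        FileUtils.writeByteArrayToFile(new File("copy.pdf"), data);
    }
}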
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in the file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files were created on, i.e. one might have CR LF while the other only has LF or CR. (1 byte difference per line break)
According to the JavaDoc both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (JVM's default encoding would be queried using System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }

    File temp = File.createTempFile("deleteme", "deleteme");
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By using a cast between byte and char, I get symmetry in the conversion.
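For what it's worth, the same byte-preserving symmetry can probably be obtained more concisely by decoding and re-encoding with ISO-8859-1, which maps each byte 0x00-0xFF to the code point of the same value (a sketch, using the same f1 file as in the test above):

// Sketch: ISO-8859-1 is a 1:1 byte <-> char mapping, so the round trip is lossless.
String f1String = new String(FileUtils.readFileToByteArray(f1), java.nio.charset.StandardCharsets.ISO_8859_1);
byte[] restoredBytes = f1String.getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);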
Thank you all!
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
//How to decode it
public void testConversionPDFtoBase64() throws IOException
{
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);

    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);

    // or actually give the path to pdf file.
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));

    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}
The database of my application needs to be filled with a lot of data, so during onCreate() it's not only some create table sql instructions; there are also a lot of inserts. The solution I chose is to store all these instructions in a sql file located in res/raw, which is loaded with Resources.openRawResource(id).
It works well, but I face an encoding issue: I have some accented characters in the sql file which appear incorrectly in my application. This is my code to do this:
public String getFileContent(Resources resources, int rawId) throws IOException
{
    InputStream is = resources.openRawResource(rawId);
    int size = is.available();
    // Read the entire asset into a local byte buffer.
    byte[] buffer = new byte[size];
    is.read(buffer);
    is.close();
    // Convert the buffer into a string.
    return new String(buffer);
}
public void onCreate(SQLiteDatabase db) {
    try {
        // get file content
        String sqlCode = getFileContent(mCtx.getResources(), R.raw.db_create);
        // execute code
        for (String sqlStatements : sqlCode.split(";"))
        {
            db.execSQL(sqlStatements);
        }
        Log.v("Creating database done.");
    } catch (IOException e) {
        // Should never happen!
        Log.e("Error reading sql file " + e.getMessage(), e);
        throw new RuntimeException(e);
    } catch (SQLException e) {
        Log.e("Error executing sql code " + e.getMessage(), e);
        throw new RuntimeException(e);
    }
}
The solution I found to avoid this is to load the sql instructions from a huge static final String instead of a file, and then all accented characters appear correctly.
But isn't there a more elegant way to load sql instructions than a big
static final String attribute with all sql instructions?
I think your problem is in this line:
return new String(buffer);
You're converting the array of bytes into a java.lang.String but you're not telling Java/Android the encoding to use. So the bytes for your accented characters aren't being converted correctly, as the wrong encoding is being used.
If you use the String(byte[],<encoding>) constructor you can specify the encoding your file has and your characters will be converted correctly.
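A sketch of the read method with an explicit charset (it assumes the raw resource file is saved as UTF-8; the buffered copy loop also avoids relying on available() for the size):

public String getFileContent(Resources resources, int rawId) throws IOException {
    InputStream is = resources.openRawResource(rawId);
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = is.read(chunk)) != -1) {
            out.write(chunk, 0, read);
        }
        // Decode with the encoding the file was actually saved in.
        return new String(out.toByteArray(), "UTF-8");
    } finally {
        is.close();
    }
}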
The SQL file solution seems perfect; it's just that you need to make sure the file is saved in UTF-8 encoding, otherwise all the accented characters will be lost. If you don't want to change the file's encoding, then you need to pass an extra argument to new String(bytes, charset) defining the file's encoding.
Do prefer file resources instead of a static final String to avoid having all those unnecessary bytes loaded into memory. On mobile phones you want to save all the memory possible!
I am using a different approach:
Instead of executing loads of sql statements (which will take a long time to complete), I build my sqlite database on the desktop, put it in the assets folder, create an empty sqlite db in Android and copy the db from the assets folder into the database folder. This gives a huge increase in speed. Note, you need to create an empty database first in Android, and then you can copy and overwrite it; otherwise, Android will not allow you to write a db into the database folder. There are several examples on the internet, and a rough sketch of the copy step follows below.
BTW, it seems this approach works best if the db has no file extension.
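A rough sketch of that copy step (the helper name and buffer size are illustrative; the real code also has to make sure the empty database exists first and should run the copy only once):

private void copyDatabaseFromAssets(Context context, String dbName) throws IOException {
    // Overwrite the (empty) database file with the prebuilt one shipped in assets/
    InputStream in = context.getAssets().open(dbName);
    OutputStream out = new FileOutputStream(context.getDatabasePath(dbName));
    byte[] buffer = new byte[4096];
    int length;
    while ((length = in.read(buffer)) > 0) {
        out.write(buffer, 0, length);
    }
    out.flush();
    out.close();
    in.close();
}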
It looks like you are passing all your sql statements in one string. That's a problem because execSQL expects "a single statement that is not a query" (see documentation [here][1]). Following is a somewhat-ugly-but-working solution.
I have all my sql statements in a file like this:
INSERT INTO table1 VALUES (1, 2, 3);
INSERT INTO table1 VALUES (4, 5, 6);
INSERT INTO table1 VALUES (7, 8, 9);
Notice the new lines in between the text (a semicolon followed by 2 new lines).
Then, I do this:
String text = new String(buffer, "UTF-8");
for (String command : text.split(";\n\n")) {
    try {
        command = command.trim();
        //Log.d(TAG, "command: " + command);
        if (command.length() > 0)
            db.execSQL(command.trim());
    }
    catch (Exception e) {
        // do whatever you need here
    }
}
My data columns contain blobs of text with new lines AND semicolons, so I had to find a different command-separator. Just be sure to get creative with the split str: use something you know doesn't exist in your data.
HTH
Gerardo
[1]: http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#execSQL(java.lang.String, java.lang.Object[])