Google Docs API "setMd5Checksum" not working - java

Recently I've implemented an application in Java that uses the Google Docs API v3.0. New entries are created like this:
DocumentListEntry newEntry = new DocumentListEntry();
newEntry.setFile(file, Common.resolveMimeType(file)); //Common is a custom class
newEntry.setFilename(entryTitle.getPlainText()); //entryTitle is a TextConstruct
newEntry.setTitle(entryTitle);
newEntry.setDraft(false);
newEntry.setHidden(file.isHidden());
newEntry.setMd5Checksum(Common.getMD5HexDigest(file));
Trust me when I tell you that Common.getMD5HexDigest(file) returns a valid and unique MD5 Hexadecimal hash.
Now, the file uploads properly, yet when I retrieve the file and check the MD5 checksum through the entry.getMd5Checksum() method, it always returns null.
I've tried EVERYTHING, even setting the ETag, ResourceID and VersionID, but they all get overridden with default values (null or server-generated strings).

I would guess that you need to set the checksum to the MD5 hash of the file's contents, not the hash of the path name.
Why would they (Google) care about the path? It makes no sense at all. Forgive me if I misinterpreted your code, but I think you have misconceived the concept of file checksums.
Anyway, what you need to do is eat (digest) the file, not the path:
import java.io.File;
import java.io.FileInputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5 {
    private MessageDigest mDigest;
    private File openFile;

    public MD5(String filePath) {
        try {
            mDigest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            System.exit(1); // every JRE is required to ship MD5, so this should never happen
        }
        openFile = new File(filePath);
    }

    public String toString() {
        byte[] fBytes = new byte[(int) openFile.length()];
        try (FileInputStream ofis = new FileInputStream(openFile)) {
            // loop because read() is not guaranteed to fill the buffer in one call
            int off = 0;
            while (off < fBytes.length) {
                int n = ofis.read(fBytes, off, fBytes.length - off);
                if (n < 0) break;
                off += n;
            }
        } catch (Throwable t) {
            return "Can't read file or something";
        }
        mDigest.update(fBytes);
        // pad to 32 hex digits; BigInteger.toString(16) drops leading zeros
        return String.format("%032x", new BigInteger(1, mDigest.digest()));
    }

    public static void main(String[] argv) {
        MD5 md5 = new MD5("someFile.ext");
        System.out.println(md5);
    }
}
So the error in your snippet above is here:
messageDigest.update(String.valueOf(file.hashCode()).getBytes());
Now, I can show that my class gives the correct md5sum of the file, which is most likely what you need. Just read the javadoc of the method if you don't trust me:
http://gdata-java-client.googlecode.com/svn/trunk/java/src/com/google/gdata/data/docs/DocumentListEntry.java
What it says is:
* Set the MD5 checksum calculated for the document.
... nothing about the path's checksum :)
And here is a demonstration:
$ echo "Two dogs are sleeping on my couch" > someFile.ext
$ echo "Two dogs are sleeping on my couch" |md5sum
1d81559b611e0079bf6c16a2c09bd994 -
$ md5sum someFile.ext
1d81559b611e0079bf6c16a2c09bd994 someFile.ext
$ javac MD5.java && java MD5
1d81559b611e0079bf6c16a2c09bd994

After struggling for a few weeks with the MD5 checksum problem (to verify whether the content of the file changed over time), I came up with a solution that doesn't rely on the MD5 checksum of the file but on the last-update attribute of the local file.
This solution is for everyone who wants to check whether a file has changed over time. However, "an update" on any operating system can simply mean the act of opening and saving the file, with or without making any changes to its content. So it's not perfect, but it does save some time and bandwidth.
Solution:
long lastModified = new DateTime(
        new Date(file.lastModified()), TimeZone.getDefault()
).getValue();
if (lastModified > entry.getUpdated().getValue()) {
    // update the file
}
Where file is a File instance of the desired file and entry is the DocumentListEntry associated with the local file.

Related

Calculate md5 hash based on file contents (that is, without the filename)

I'm trying to calculate an MD5 hash based on the file contents, not the file name. In my code below, calculating the MD5 hash of two files with different names but identical contents generates two different hash values. I expected the same hash value.
Code
def computeMD5Hash(path: String): String = {
  val buffer = new Array[Byte](8192)
  val md5 = MessageDigest.getInstance("MD5")
  val dis = new DigestInputStream(new FileInputStream(new File(path)), md5)
  try {
    while (dis.read(buffer) != -1) {}
  } finally {
    dis.close()
  }
  md5.digest.map("%02x".format(_)).mkString
}
println(computeMD5Hash("/Users/xxxx/Documents/Project/yyy/de/src/main/resources/input/xxxxx_list_01.txt"))
println(computeMD5Hash("/Users/xxxx/Documents/Project/yyy/de/src/main/resources/input/xxxxx_list_03.txt"))
Output
10d34fcb95ca6714fb00dae12527be4e
651c8eaf62016182d2a39c5442a339a8
Expected Output
10d34fcb95ca6714fb00dae12527be4e
10d34fcb95ca6714fb00dae12527be4e
Tried your code and it works for me. Are you sure that the files are equal?
Does it work if you take one file, explicitly copy-paste it to another location, and run your program?
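To rule out an actual content difference, a quick byte-for-byte comparison can help. Here is a sketch in Java (the file names are placeholders mirroring the question):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CompareFiles {
    public static void main(String[] args) throws Exception {
        // read both files fully and compare the raw bytes
        byte[] a = Files.readAllBytes(Paths.get("xxxxx_list_01.txt"));
        byte[] b = Files.readAllBytes(Paths.get("xxxxx_list_03.txt"));
        System.out.println(Arrays.equals(a, b) ? "identical" : "different");
    }
}
If the files differ even by a trailing newline or a BOM, the MD5 values will differ.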

SequenceFile compactor of several small files into only one file.seq

Novice in HDFS and Hadoop:
I am developing a program which should get all the files of a specific directory, where we can find several small files of any type.
Get every file and append it to a compressed SequenceFile, where the key must be the path of the file and the value must be the file's content. For now my code is:
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.BZip2Codec;

public class Compact {
    public static void main(String[] args) throws Exception {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
            Path destino = new Path("/user/cloudera/data/testPractice.seq"); // test args[1]
            if (fs.exists(destino)) {
                System.out.println("exist : " + destino);
                return;
            }
            BZip2Codec codec = new BZip2Codec();
            SequenceFile.Writer outSeq = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(fs.makeQualified(destino)),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(FSDataInputStream.class));
            FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
            for (int i = 0; i < status.length; i++) {
                FSDataInputStream in = fs.open(status[i].getPath());
                outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()),
                        new FSDataInputStream(in));
                fs.close();
            }
            outSeq.close();
            System.out.println("End Program");
        } catch (Exception e) {
            System.out.println(e.toString());
            System.out.println("File not found");
        }
    }
}
But after every execution I receive this exception:
java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
File not found
I understand the error must be in the type of the file I am creating and the type of object I define for adding to the SequenceFile, but I don't know which one I should add. Can anyone help me?
FSDataInputStream, like any other InputStream, is not intended to be serialized. What should serializing an "iterator" over a stream of bytes even do?
What you most likely want to do is store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and just get all the bytes out of the FSDataInputStream. One drawback of using a key/value SequenceFile for such a purpose is that the content of each file has to fit in memory. That is fine for small files, but you have to be aware of this issue.
I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?
Thanks a lot for your comments. The problem was the serializer, like you say, and finally I used BytesWritable:
// the writer is now created with SequenceFile.Writer.valueClass(BytesWritable.class)
FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
for (int i = 0; i < status.length; i++) {
    FSDataInputStream in = fs.open(status[i].getPath());
    // read the whole file into memory, then wrap it in a BytesWritable
    byte[] content = new byte[(int) fs.getFileStatus(status[i].getPath()).getLen()];
    in.readFully(0, content);
    in.close();
    outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()),
            new org.apache.hadoop.io.BytesWritable(content));
}
outSeq.close();
Probably there are other, better solutions in the Hadoop ecosystem, but this problem was an exercise for a degree I am doing, and for now we are remaking the wheel to understand the concepts ;-).

computing checksum for an input stream

I need to compute a checksum for an InputStream (or a file) to check whether the file contents have changed. I have the code below, which generates a different value for each execution even though I'm using the same stream. Can someone help me do this right?
public class CreateChecksum {
    public static void main(String args[]) {
        String test = "Hello world";
        ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
    }

    public static String checkSum(InputStream fis) {
        String checksum = null;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Use MessageDigest update() to feed in the input
            byte[] buffer = new byte[8192];
            int numOfBytesRead;
            while ((numOfBytesRead = fis.read(buffer)) > 0) {
                md.update(buffer, 0, numOfBytesRead);
            }
            byte[] hash = md.digest();
            checksum = new BigInteger(1, hash).toString(16); // don't use this, truncates leading zeros
        } catch (Exception ex) {
        }
        return checksum;
    }
}
You're using the same stream object for both calls: after you've called checkSum once, the stream has no more data to read, so the second call creates a hash of an empty stream. The simplest approach is to create a new stream each time:
String test = "Hello world";
byte[] bytes = test.getBytes(StandardCharsets.UTF_8);
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
Note that your exception handling in checkSum really needs fixing, along with your hex conversion...
Check out the code in org.apache.commons.codec.digest.DigestUtils.
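For instance, assuming commons-codec is on the classpath, the whole checkSum method collapses to something like this sketch:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.codec.digest.DigestUtils;

public class DigestUtilsExample {
    public static String md5Of(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            return DigestUtils.md5Hex(in); // buffering and zero-padded hex are handled for you
        }
    }
}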
Changes to a file are relatively easy to monitor: File.lastModified() changes each time the file is changed (and closed). There is even a built-in API to get notified of selected changes to the file system: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
The hashCode of an InputStream is not suitable for detecting changes (there is no definition of how an InputStream should calculate its hashCode; quite likely it uses Object.hashCode, meaning the hashCode depends on nothing but object identity).
Building an MD5 like you do works, but it requires reading the entire file every time. That is quite a performance killer if the file is large and/or you are watching multiple files.
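A minimal sketch of that notification API (java.nio.file.WatchService, JDK 7+); the watched directory is a placeholder:
import java.nio.file.*;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("."); // directory to watch
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println("modified: " + event.context());
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}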
You are confusing two related, but different responsibilities.
First you have a stream, which provides stuff to be read. Then you have a checksum on that stream; however, your implementation is a static method call, effectively divorcing it from a class, meaning that nobody has the responsibility of maintaining the checksum.
Try reworking your solution like so:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumInputStream extends FilterInputStream {
    private final MessageDigest digest;

    public ChecksumInputStream(InputStream source) throws NoSuchAlgorithmException {
        super(source);
        this.digest = MessageDigest.getInstance("MD5");
    }

    @Override
    public int read() throws IOException {
        int value = in.read();
        if (value != -1) {
            digest.update((byte) value); // keep the checksum in step with what was read
        }
        return value;
    }

    // ... and repeat for all the other read methods.
}
Note that now you only do one read, with the checksum calculator decorating the original input stream.
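For what it's worth, the JDK already ships such a decorator as java.security.DigestInputStream; a sketch of leaning on it instead of rolling your own:
import java.io.FileInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class DecoratorExample {
    public static String readAndHash(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(new FileInputStream(path), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // normal reads; the digest updates as a side effect
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }
}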
The issue is that after you first read the InputStream, its position has reached the end. The quick way to resolve your issue is:
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));

How to check if a PDF file is password protected?

How can I check whether a PDF file is password protected or not in Java?
I know of several tools/libraries that can do this, but I want to know if it is possible with just a plain Java program.
Update
As per mkl's comment below this answer, it seems that there are two types of PDF structure permitted by the specs: (1) cross-reference tables and (2) cross-reference streams. The following solution only addresses the first type of structure. This answer needs to be updated to address the second type.
====
All of the answers provided above refer to third-party libraries, which the OP is already aware of. The OP is asking for a native Java approach. My answer is yes, you can do it, but it will require a lot of work.
It will require a two step process:
Step 1: Figure out if the PDF is encrypted
As per Adobe's PDF 1.7 specs (pages 97 and 115), if the trailer dictionary contains the key "/Encrypt", the PDF is encrypted (the encryption could be simple password protection, or RC4 or AES, or some custom encryption). Here's a sample code:
Boolean isEncrypted = Boolean.FALSE;
try {
    byte[] byteArray = Files.readAllBytes(Paths.get("Resources/1.pdf"));
    // Convert the binary bytes to a String. Caution: this can lose data, but we are
    // only interested in the textual portion of the binary PDF data, so we should be fine.
    String pdfContent = new String(byteArray);
    int lastTrailerIndex = pdfContent.lastIndexOf("trailer");
    if (lastTrailerIndex >= 0 && lastTrailerIndex < pdfContent.length()) {
        String newString = pdfContent.substring(lastTrailerIndex, pdfContent.length());
        int firstEOFIndex = newString.indexOf("%%EOF");
        String trailer = newString.substring(0, firstEOFIndex);
        if (trailer.contains("/Encrypt")) {
            isEncrypted = Boolean.TRUE;
        }
    }
} catch (Exception e) {
    System.out.println(e);
    // Do nothing
}
Step 2: Figure out the encryption type
This step is more complex. I don't have a code sample yet, but here is the algorithm:
1) Read the value of the key "/Encrypt" from the trailer found in step 1 above, e.g. the value "288 0 R".
2) Look for the bytes "288 0 obj". This is the location of the "encryption dictionary" object in the document; the object ends at the string "endobj".
3) Look for the key "/Filter" in this object. The "Filter" is what identifies the document's security handler. If the value of "/Filter" is "/Standard", the document uses the built-in password-based security handler.
If you just want to know whether the PDF is encrypted, without worrying about whether the encryption is owner/user password protection or some more advanced algorithm, you don't need step 2 above.
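Since no sample is given for step 2, here is a rough, untested sketch of the algorithm above. It reuses pdfContent and trailer from step 1; the regex and the helper name are mine, and cross-reference streams are still not handled:
static boolean usesStandardSecurityHandler(String pdfContent, String trailer) {
    // 1) e.g. "/Encrypt 288 0 R" -> indirect reference to object "288 0"
    java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("/Encrypt\\s+(\\d+)\\s+(\\d+)\\s+R")
            .matcher(trailer);
    if (!m.find()) {
        return false; // no encryption dictionary referenced
    }
    // 2) locate "288 0 obj" ... "endobj", the encryption dictionary object
    String objHeader = m.group(1) + " " + m.group(2) + " obj";
    int start = pdfContent.indexOf(objHeader);
    if (start < 0) {
        return false;
    }
    int end = pdfContent.indexOf("endobj", start);
    String dict = pdfContent.substring(start, end < 0 ? pdfContent.length() : end);
    // 3) "/Filter /Standard" means the built-in password-based security handler
    return dict.contains("/Filter") && dict.contains("/Standard");
}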
Hope this helps.
You can use PDFBox:
http://pdfbox.apache.org/
Code example:
try
{
    PDDocument document = PDDocument.load(yourPDFfile);
    if (document.isEncrypted())
    {
        // IT'S ENCRYPTED!
    }
}
Using Maven?
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0</version>
</dependency>
Using the iText PDF API we can identify password-protected PDFs.
Example :
try {
    new PdfReader("C:\\Password_protected.pdf");
} catch (BadPasswordException e) {
    System.out.println("PDF is password protected..");
} catch (Exception e) {
    e.printStackTrace();
}
You can validate a PDF, i.e. check that it is readable and writable, by using iText.
The following is the code snippet:
boolean isValidPdf = false;
try {
    InputStream tempStream = new FileInputStream(new File("path/to/pdffile.pdf"));
    PdfReader reader = new PdfReader(tempStream);
    isValidPdf = reader.isOpenedWithFullPermissions();
} catch (Exception e) {
    isValidPdf = false;
}
The correct "how to do it in Java" answer is given by @vhs above.
However, in any application, by far the simplest approach is to use the very lightweight pdfinfo tool to filter the encryption status. Here, using the Windows cmd shell, I can instantly get a report showing that two different copies of the same file are encrypted:
>forfiles /m *.pdf /C "cmd /c echo #file &pdfinfo #file|find /i \"Encrypted\""
"Certificate (9).pdf"
Encrypted: no
"ds872 source form.pdf"
Encrypted: AES 128-bit
"ds872 filled form.pdf"
Encrypted: AES 128-bit
"How to extract data from a particular area in a PDF file - Stack Overflow.pdf"
Encrypted: no
"Test.pdf"
Encrypted: no
>
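If shelling out is acceptable from Java too, the same check can be scripted with ProcessBuilder. A sketch, assuming pdfinfo is installed and on the PATH:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class PdfInfoCheck {
    public static boolean isEncrypted(String pdfPath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("pdfinfo", pdfPath).redirectErrorStream(true).start();
        String encryptedValue = "no";
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("Encrypted:")) {
                    // e.g. "Encrypted: no" or "Encrypted: AES 128-bit"
                    encryptedValue = line.substring("Encrypted:".length()).trim();
                }
            }
        }
        p.waitFor();
        return !encryptedValue.startsWith("no");
    }
}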
The solution:
1) Install PDF Parser http://www.pdfparser.org/
2) Edit Parser.php in this section:
if (isset($xref['trailer']['encrypt'])) {
    echo('Your alert message');
    exit();
}
3) In your .php form handler (e.g. upload.php), first require '...yourdir.../vendor/autoload.php';
then write this function:
function pdftest_is_encrypted($form) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($form);
}
and then call the function:
pdftest_is_encrypted($_FILES["upfile"]["tmp_name"]);
That is all. If you try to load a password-protected PDF, the system returns the error "Your alert message".

Java Apache FileUtils readFileToString and writeStringToFile problems

I need to read a file (actually a .pdf) in Java into a String and then go back to a file. Between those steps I'll apply some patches to the given string, but this is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test fails.
I think it may be because of encodings, but FileUtils does not document this in much detail.
Anyone can help?
Thanks!
Added for further understanding:
Why I need this?
I have very large PDFs on one machine that are replicated on another. The first machine is in charge of creating those PDFs. Due to the low connectivity of the second machine and the large size of the PDFs, I don't want to sync the whole PDFs, only the changes made.
To create patches and apply them, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a string, apply a generated patch, and write it back to a file.
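For reference, the patch round trip I mean looks roughly like this (a sketch; class and method names as in the Java port of diff-match-patch, treat this as an outline rather than a drop-in):
import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class PatchSketch {
    public static String patchForward(String oldText, String newText) {
        diff_match_patch dmp = new diff_match_patch();
        // Build patches that turn oldText into newText; dmp.patch_toText(patches)
        // serializes them for transfer to the low-connectivity machine.
        LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(oldText, newText);
        Object[] result = dmp.patch_apply(patches, oldText); // {patchedText, boolean[] applied}
        return (String) result[0];
    }
}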
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input byte stream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short: don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
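Concretely, the failing test becomes symmetric once it is rewritten byte-for-byte; a sketch reusing the f1 and temp names from the question:
byte[] f1Bytes = FileUtils.readFileToByteArray(f1);
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeByteArrayToFile(temp, f1Bytes);
// passes: there is no decode/encode step in between
assertTrue(FileUtils.contentEquals(f1, temp));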
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped during reading or added during writing. Is there a difference in file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files were created on, i.e. one might have CR LF while the other has only LF or CR (1 byte of difference per line break).
According to the JavaDoc, both methods use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (the JVM's default encoding can be queried using System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }
    File temp = File.createTempFile("deleteme", "deleteme");
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By using a cast between byte and char, I have symmetry in the conversion.
Thank you all!
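A side note on this approach: decoding with ISO-8859-1 gives the same lossless byte-to-char symmetry without the manual loops, since that charset maps every byte value 0x00-0xFF to exactly one char. A sketch (not tested against DiffMatchPatch):
byte[] bytes = FileUtils.readFileToByteArray(f1);
String f1String = new String(bytes, java.nio.charset.StandardCharsets.ISO_8859_1);
// ... apply the string patches here ...
FileUtils.writeByteArrayToFile(temp,
        f1String.getBytes(java.nio.charset.StandardCharsets.ISO_8859_1));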
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
// How to decode it
public void testConversionPDFtoBase64() throws IOException {
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);
    // or actually give the path to a PDF file
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}
