Calculate MD5 hash based on file contents (that is, without the file name) - Java

I'm trying to calculate an MD5 hash based on a file's contents, not its name. With the code below, two files that have different names but identical contents produce two different MD5 hash values. I expected the same hash value.
Code
def computeMD5Hash(path: String): String = {
val buffer = new Array[Byte](8192)
val md5 = MessageDigest.getInstance("MD5")
val dis = new DigestInputStream(new FileInputStream(new File(path)), md5)
try {
while (dis.read(buffer) != -1) {}
} finally {
dis.close()
}
md5.digest.map("%02x".format(_)).mkString
}
println(computeMD5Hash("/Users/xxxx/Documents/Project/yyy/de/src/main/resources/input/xxxxx_list_01.txt"))
println(computeMD5Hash("/Users/xxxx/Documents/Project/yyy/de/src/main/resources/input/xxxxx_list_03.txt"))
Output
10d34fcb95ca6714fb00dae12527be4e
651c8eaf62016182d2a39c5442a339a8
Expected Output
10d34fcb95ca6714fb00dae12527be4e
10d34fcb95ca6714fb00dae12527be4e

I tried your code and it works for me. Are you sure the files are identical?
Does it work if you copy one file to another location and run your program on the original and the copy?
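One way to rule out a hidden difference (a trailing newline, a BOM, different line endings) is to compare the two files byte for byte. The following is a minimal sketch, assuming Java 12 or later for Files.mismatch; the class name FileDiffCheck and the command-line arguments are placeholders, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileDiffCheck {

    /** Returns -1 if both files hold identical bytes, else the offset of the first difference. */
    static long firstDifference(Path a, Path b) throws IOException {
        return Files.mismatch(a, b);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FileDiffCheck <file1> <file2>");
            return;
        }
        long offset = firstDifference(Path.of(args[0]), Path.of(args[1]));
        System.out.println(offset == -1
                ? "Files are byte-for-byte identical"
                : "Files first differ at byte offset " + offset);
    }
}
```

If this reports an offset, the two MD5 values are expected to differ, and the hashing code is not at fault.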

Related

Java: What is the best way to check if file needs to be updated before rewriting it?

We generate a file in our code. Sometimes the file coincides with one we have generated before. The question is: how can we check whether the files are the same and skip writing?
The only way I see is:
read the saved file into a string and generate its hash
generate the hash of the string we want to save into the new file
compare the hashes
Maybe there are better ways?
In my opinion, a hash is the best way to detect modifications. Alternatively, if a specific line or character always changes on an update, you can check just that against the newly generated file and decide whether to proceed with the write. You could also introduce a marker such as a counter when you write the file, but updating the counter requires logic that tracks the changes made before writing. The answer depends on the context and on how the application works.
An MD5 checksum is the easiest way. I think your approach is valid.
Example I use in a unit test:
/** Returns an MD5 checksum for a file
*
* @param filename name of the file to read
* @return the checksum as a hex string
* @throws Exception if the file cannot be read
*/
private static String createChecksumForFile(String filename) throws Exception
{
InputStream fis = new FileInputStream(filename);
byte[] buffer = new byte[1024];
MessageDigest complete = MessageDigest.getInstance("MD5");
int numRead;
do {
numRead = fis.read(buffer);
if (numRead > 0) {
complete.update(buffer, 0, numRead);
}
} while (numRead != -1);
fis.close();
byte[] b = complete.digest();
String result = "";
for (byte aB : b) {
result +=
Integer.toString((aB & 0xff) + 0x100, 16).substring(1);
}
return result;
}
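Building on the same idea, a write-only-if-changed helper could be sketched as follows. This is an illustration under assumptions, not code from the thread: the class and method names are hypothetical, the content is assumed to be UTF-8 text, and both the old and the new content are assumed to fit in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WriteIfChanged {

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is available on standard JDKs; a missing provider is treated as a setup error.
            throw new IllegalStateException(e);
        }
    }

    /** Writes content to target unless the file already has the same digest. Returns true if it wrote. */
    static boolean writeIfChanged(Path target, String content) throws IOException {
        byte[] newBytes = content.getBytes(StandardCharsets.UTF_8);
        if (Files.exists(target)
                && MessageDigest.isEqual(md5(Files.readAllBytes(target)), md5(newBytes))) {
            return false; // same digest: treat as unchanged and skip the write
        }
        Files.write(target, newBytes);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("report", ".txt");
        System.out.println(writeIfChanged(target, "hello")); // first call writes
        System.out.println(writeIfChanged(target, "hello")); // second call skips
    }
}
```

When both versions are already in memory, as here, comparing the byte arrays directly with Arrays.equals is simpler and exact. Hashing mainly pays off when the digest of the previous version is stored somewhere, so the old file does not have to be read again.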
Unless there is an easy way to determine whether the data is still up to date, it will be more efficient to simply overwrite it with the existing data, since reading and hashing a complete file is quite likely to be slower than overwriting it. This is highly dependent on the file size, though.

What is a good way to load many pictures and their reference in an array? - Java + ImageJ

I have, for example, 1000 images whose names are all very similar and differ only in the number: "ImageNmbr0001", "ImageNmbr0002", ..., "ImageNmbr1000".
I would like to load every image and store them in an ImageProcessor array.
Then, if I call a method on an element of this array, the method is applied to that picture, for example to count its black pixels.
I can use a for loop to get the numbers from 1 to 1000, turn them into strings, and append each one to the common file name prefix to build the name of the image to load.
However, I would still have to turn the result into something I can store in an array, and I don't yet have a method that takes a string (the file path) and returns the ImageProcessor for the image stored there.
My current approach also seems rather clumsy and inelegant, so I would be very happy if someone could show me a better way to do this using methods from these packages:
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;
I think I found a solution:
import ij.io.Opener;

Opener opener = new Opener();
String imageFilePath = "somePath";
ImagePlus imp = opener.openImage(imageFilePath);
ImageProcessor ip = imp.getProcessor();
That does the job, but thank you for your time and effort.
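The file-name loop described in the question can be written with a zero-padded format string instead of substring handling. This is a minimal sketch using only the standard library; the directory "images" and the extension ".png" are placeholders, since the post does not state them. Each resulting path could then be passed to opener.openImage(path).getProcessor() and stored in an ImageProcessor[].

```java
import java.util.ArrayList;
import java.util.List;

public class ImagePaths {

    /** Builds dir/ImageNmbr0001<ext> ... dir/ImageNmbr<count><ext> with 4-digit zero padding. */
    static List<String> numberedPaths(String dir, String extension, int count) {
        List<String> paths = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            paths.add(String.format("%s/ImageNmbr%04d%s", dir, i, extension));
        }
        return paths;
    }

    public static void main(String[] args) {
        // "images" and ".png" are assumptions made for this example only.
        List<String> paths = numberedPaths("images", ".png", 1000);
        System.out.println(paths.get(0));
        System.out.println(paths.get(999));
    }
}
```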
I'm not sure I understand exactly what you want, but I definitely would not save the information for each image in a separate file, for two reasons:
- Saving and reading the content of many files is slower than doing so with one medium-sized file
- Each file adds overhead (each needs a path, a minimum size on disk, etc.)
If you want performance, group multiple image descriptions into a single description file.
If you don't want to create a binary description file, you can always use a database, which is built for this and performs well on reads and normally on writes.
I don't know exactly what your needs are, but you could try writing a binary file with fixed-size records and reading it back later.
Example:
public static void main(String[] args) throws IOException {
FileOutputStream fout = null;
FileInputStream fin = null;
try {
fout = new FileOutputStream("description.bin");
DataOutputStream dout = new DataOutputStream(fout);
for (int x = 0; x < 1000; x++) {
dout.writeInt(10); // Write Int data
}
fin = new FileInputStream("description.bin");
DataInputStream din = new DataInputStream(fin);
for (int x = 0; x < 1000; x++) {
System.out.println(din.readInt()); // Read Int data
}
} catch (Exception e) {
} finally {
if (fout != null) {
fout.close();
}
if (fin != null) {
fin.close();
}
}
}
In this example, the code writes integers to the "description.bin" file and then reads them back.
This is pretty fast in Java, since Java uses "channels" for files by default.

computing checksum for an input stream

I need to compute a checksum for an input stream (or a file) to check whether the file contents have changed. The code below generates a different value on each call even though I'm using the same stream. Can someone help me do this right?
public class CreateChecksum {
public static void main(String args[]) {
String test = "Hello world";
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
}
public static String checkSum(InputStream fis){
String checksum = null;
try {
MessageDigest md = MessageDigest.getInstance("MD5");
//Using MessageDigest update() method to provide input
byte[] buffer = new byte[8192];
int numOfBytesRead;
while( (numOfBytesRead = fis.read(buffer)) > 0){
md.update(buffer, 0, numOfBytesRead);
}
byte[] hash = md.digest();
checksum = new BigInteger(1, hash).toString(16); //don't use this, truncates leading zero
} catch (Exception ex) {
}
return checksum;
}
}
You're using the same stream object for both calls - after you've called checkSum once, the stream will not have any more data to read, so the second call will be creating a hash of an empty stream. The simplest approach would be to create a new stream each time:
String test = "Hello world";
byte[] bytes = test.getBytes(StandardCharsets.UTF_8);
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
Note that your exception handling in checkSum really needs fixing, along with your hex conversion...
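Both points can be addressed with the standard library alone. The sketch below is one possible shape, not the answerer's code: the class name StreamChecksum and the method name md5Hex are made up for the example. Exceptions are propagated to the caller instead of being swallowed, and each byte is formatted as exactly two hex digits so leading zeros survive.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamChecksum {

    /** Reads the stream to its end and returns the MD5 digest as 32 lowercase hex digits. */
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b)); // two digits per byte keeps leading zeros
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "a".getBytes(StandardCharsets.UTF_8);
        // The MD5 of "a" starts with a zero digit, which BigInteger.toString(16) would drop.
        System.out.println(md5Hex(new ByteArrayInputStream(bytes)));
    }
}
```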
Check out org.apache.commons.codec.digest.DigestUtils from Apache Commons Codec.
Changes to a file are relatively easy to monitor: File.lastModified() changes each time a file is changed (and closed). There is even a built-in API to get notified of selected changes to the file system: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
The hashCode of an InputStream is not suitable for detecting changes (there is no definition of how an InputStream should calculate its hashCode; quite likely it's using Object.hashCode, meaning the hashCode doesn't depend on anything but object identity).
Building an MD5 the way you are trying to works, but it requires reading the entire file every time. That is quite a performance killer if the file is large and/or you are watching multiple files.
You are confusing two related, but different responsibilities.
First you have a Stream which provides stuff to be read. Then you have a checksum on that stream; however, your implementation is a static method call, effectively divorcing it from a class, meaning that nobody has the responsibility for maintaining the checksum.
Try reworking your solution like so
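public class ChecksumInputStream extends InputStream {
    private final InputStream in;
    public ChecksumInputStream(InputStream source) {
        this.in = source;
    }
    @Override
    public int read() throws IOException {
        int value = in.read();
        if (value != -1) {
            updateChecksum(value); // skip the end-of-stream marker
        }
        return value;
    }
    // and repeat for all the other read methods.
    // updateChecksum(int) is left to be implemented, e.g. with a MessageDigest.
}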
Note that now you only do one read, with the checksum calculator decorating the original input stream.
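The JDK already ships a decorator of this kind, java.security.DigestInputStream, so a hand-written wrapper is often unnecessary. A minimal sketch of its use follows; the class name DecoratedDigest and the method name consumeAndDigest are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DecoratedDigest {

    /** Drains the stream through a DigestInputStream and returns the raw MD5 bytes. */
    static byte[] consumeAndDigest(InputStream source)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(source, md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // A real consumer would process the bytes here;
                // the digest is updated as a side effect of each read.
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "Hello world".getBytes(StandardCharsets.UTF_8);
        byte[] digest = consumeAndDigest(new ByteArrayInputStream(data));
        System.out.println(digest.length + " digest bytes"); // MD5 digests are 16 bytes long
    }
}
```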
The issue arises after the first read of the input stream: the position has reached the end. The quick way to resolve your issue is:
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
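For a ByteArrayInputStream specifically, an alternative to creating a second stream is to rewind the existing one with reset(). This is a sketch under that assumption only: many other streams, FileInputStream among them, do not support mark/reset, which markSupported() reports. The class name ResetStream and the helper drain are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ResetStream {

    /** Reads everything that is left in the stream and returns the number of bytes read. */
    static int drain(ByteArrayInputStream in) {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ByteArrayInputStream bis =
                new ByteArrayInputStream("Hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(drain(bis)); // first pass consumes all bytes
        System.out.println(drain(bis)); // stream is exhausted, nothing left to read
        bis.reset();                    // no mark was set, so this rewinds to the start
        System.out.println(drain(bis)); // full content is available again
    }
}
```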

Google Docs API "setMd5Checksum" not working

Recently I've implemented an application in Java that uses the Google Docs API v3.0. New entries are created like this:
DocumentListEntry newEntry = new DocumentListEntry();
newEntry.setFile(file, Common.resolveMimeType(file)); //Common is a custom class
newEntry.setFilename(entryTitle.getPlainText()); //entryTitle is a TextConstruct
newEntry.setTitle(entryTitle);
newEntry.setDraft(false);
newEntry.setHidden(file.isHidden());
newEntry.setMd5Checksum(Common.getMD5HexDigest(file));
Trust me when I tell you that Common.getMD5HexDigest(file) returns a valid and unique MD5 Hexadecimal hash.
Now, the file uploads properly yet when retrieving the file and checking the MD5 checksum through the entry.getMd5Checksum() method, it always returns null.
I've tried EVERYTHING, even setting the ETag, ResourceID and VersionID, but they all get overridden with default values (null or server-generated strings).
I would guess that you need to set the checksum to the md5 hash of the file's contents, not the hash of the path-name.
Why would they (Google) care about the path? It makes no sense at all. Forgive me if I misinterpreted your code, but I think you have misunderstood the concept of file checksums.
Anyway, what you need to do is eat (digest) the file and not the path:
import java.security.*;
import java.util.*;
import java.math.*;
import java.io.*;
public class MD5 {
private MessageDigest mDigest;
private File openFile;
private FileInputStream ofis;
private int fSize;
private byte[] fBytes;
public MD5(String filePath) {
try { mDigest = MessageDigest.getInstance("MD5"); }
catch (NoSuchAlgorithmException e) { System.exit(1); }
openFile = new File(filePath);
}
public String toString() {
try {
ofis = new FileInputStream(openFile);
fSize = ofis.available();
fBytes = new byte[fSize];
ofis.read(fBytes);
} catch (Throwable t) {
return "Can't read file or something";
}
mDigest.update(fBytes);
return String.format("%032x", new BigInteger(1, mDigest.digest())); // pad to 32 hex digits so leading zeros are kept
}
public static void main(String[] argv){
MD5 md5 = new MD5("someFile.ext");
System.out.println(md5);
}
}
So the error in your snippet above is here:
messageDigest.update(String.valueOf(file.hashCode()).getBytes());
Now, I can show that my class gives the correct md5sum of the file which is most likely what you need. Just read the javadoc of the method if you don't trust me:
http://gdata-java-client.googlecode.com/svn/trunk/java/src/com/google/gdata/data/docs/DocumentListEntry.java
What it says is:
* Set the MD5 checksum calculated for the document.
... nothing about the path's checksum :)
Here is a demonstration:
$ echo "Two dogs are sleeping on my couch" > someFile.ext
$ echo "Two dogs are sleeping on my couch" |md5sum
1d81559b611e0079bf6c16a2c09bd994 -
$ md5sum someFile.ext
1d81559b611e0079bf6c16a2c09bd994 someFile.ext
$ javac MD5.java && java MD5
1d81559b611e0079bf6c16a2c09bd994
After struggling a few weeks with the MD5 checksum problem (to verify if the content of the file changed over time), I came up with a solution that doesn't rely on the MD5 checksum of the file but on the client last-update attribute of the file.
This solution is for anyone who wants to check whether a file has changed over time. However, "an update" on any operating system can mean simply opening the file and saving it, with or without changes to its content. So it's not perfect, but it does save some time and bandwidth.
Solution:
long lastModified = new DateTime(
new Date(file.lastModified()), TimeZone.getDefault()
).getValue();
if(lastModified > entry.getUpdated().getValue()) {
//update the file
}
Here, file is a File instance for the desired file and entry is the DocumentListEntry associated with the local file.

Java Apache FileUtils readFileToString and writeStringToFile problems

I need to read a file (actually a .pdf) into a String and write it back to a file. Between those steps I'll apply some patches to the string, but that is not important here.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encoding, but FileUtils doesn't provide much detail about this.
Can anyone help?
Thanks!
Added for further understanding:
Why do I need this?
I have very large PDFs on one machine that are replicated on another one. The first machine is in charge of creating those PDFs. Due to the low connectivity of the second machine and the large size of the PDFs, I don't want to sync the whole files, only the changes.
To create and apply patches, I'm using the Google library DiffMatchPatch. This library creates patches between two strings, so I need to load a PDF into a string, apply a generated patch, and write it back to a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
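The asymmetry is easy to reproduce with the standard library. In this small sketch the sample bytes are made up: an ASCII "%PDF" prefix followed by two bytes that do not form valid UTF-8. Decoding replaces the invalid bytes with the replacement character, so re-encoding cannot restore them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {

    /** Decodes bytes as UTF-8 text and encodes the text again. */
    static byte[] throughUtf8String(byte[] original) {
        String text = new String(original, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "%PDF" followed by two bytes that are not valid UTF-8 (invented sample data).
        byte[] original = {0x25, 0x50, 0x44, 0x46, (byte) 0xE2, (byte) 0xE3};
        byte[] roundTripped = throughUtf8String(original);
        System.out.println(Arrays.equals(original, roundTripped)); // false: invalid bytes were replaced
    }
}
```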
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in the file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files were created on, i.e. one might have CR LF while the other only has LF or CR (a 1-byte difference per line break).
According to the JavaDoc both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (JVM's default encoding would be queried using System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution is not working and suggests using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException{
String f1String="";
byte[] bytes=FileUtils.readFileToByteArray(f1);
for(byte b:bytes){
f1String=f1String+((char)b);
}
File temp=File.createTempFile("deleteme", "deleteme");
byte[] newBytes=new byte[f1String.length()];
for(int i=0; i<f1String.length(); ++i){
char c=f1String.charAt(i);
newBytes[i]= (byte)c;
}
FileUtils.writeByteArrayToFile(temp, newBytes);
assertTrue(FileUtils.contentEquals(f1, temp));
}
By casting between byte and char, the conversion is symmetric.
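A similar lossless mapping is available without explicit loops by decoding and encoding with ISO-8859-1, which maps every byte value to the char with the same numeric value. This is a sketch of that alternative, not the poster's code; the class and method names are made up. It only round-trips as long as the patched string contains no chars above 255, since those would be replaced with '?' on encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    /** Maps each byte to the char with the same value (0 to 255). */
    static String bytesToString(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Inverse of bytesToString, valid as long as every char is in the range 0 to 255. */
    static byte[] stringToBytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i; // every possible byte value once
        }
        System.out.println(Arrays.equals(all, stringToBytes(bytesToString(all))));
    }
}
```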
Thank you all!
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
File inboundDoc = new File(path);
byte[] pdfData;
try {
pdfData = FileUtils.readFileToByteArray(inboundDoc);
} catch (IOException e) {
throw new RuntimeException(e);
}
byte[] encodedPdfData = Base64.encodeBase64(pdfData);
String attachment = new String(encodedPdfData);
return attachment;
}
//How to decode it
public void testConversionPDFtoBase64() throws IOException
{
String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
File origFile = new File(path);
String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
//now decode it
byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
String decodedString = new String(decodeData);
//or actually give the path to pdf file.
File decodedfile = File.createTempFile("DECODED", ".pdf");
FileUtils.writeByteArrayToFile(decodedfile,decodeData);
Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
// Frame frame = new Frame("PDF Viewer");
// frame.setLayout(new BorderLayout());
}
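The code above relies on Base64 from Apache Commons Codec. On Java 8 or later the same round trip can be sketched with java.util.Base64 from the standard library; the class name and the sample bytes below are invented for the example.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {

    /** Encodes arbitrary bytes as a Base64 string. */
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Decodes a Base64 string back into the original bytes. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, 0x00}; // invented binary sample
        String text = encode(data);
        System.out.println(text);
        System.out.println(Arrays.equals(data, decode(text)));
    }
}
```

Base64 text is about a third larger than the binary input, which matters when the goal is to transfer as little data as possible.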
