I need to compute a checksum for an InputStream (or a file) to check whether the file contents have changed. The code below generates a different value on each execution, even though I'm using the same stream. Can someone help me do this right?
public class CreateChecksum {

    public static void main(String args[]) {
        String test = "Hello world";
        ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
    }

    public static String checkSum(InputStream fis) {
        String checksum = null;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Using MessageDigest update() method to provide input
            byte[] buffer = new byte[8192];
            int numOfBytesRead;
            while ((numOfBytesRead = fis.read(buffer)) > 0) {
                md.update(buffer, 0, numOfBytesRead);
            }
            byte[] hash = md.digest();
            checksum = new BigInteger(1, hash).toString(16); // don't use this, truncates leading zero
        } catch (Exception ex) {
        }
        return checksum;
    }
}
You're using the same stream object for both calls - after you've called checkSum once, the stream will not have any more data to read, so the second call will be creating a hash of an empty stream. The simplest approach would be to create a new stream each time:
String test = "Hello world";
byte[] bytes = test.getBytes(StandardCharsets.UTF_8);
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
Note that your exception handling in checkSum really needs fixing, along with your hex conversion...
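For example, a sketch of what a fixed-up checkSum could look like - propagating the exceptions instead of swallowing them, and producing a fixed-width hex string (names match your code; assumes the usual java.io and java.security imports):

public static String checkSum(InputStream fis) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    int numOfBytesRead;
    while ((numOfBytesRead = fis.read(buffer)) != -1) {
        md.update(buffer, 0, numOfBytesRead);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
        hex.append(String.format("%02x", b)); // keeps leading zeros, unlike BigInteger.toString(16)
    }
    return hex.toString();
}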
Check out the code in org.apache.commons.codec.digest.DigestUtils.
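For instance, assuming commons-codec is on the classpath (and the usual java.io imports), hashing a stream is a one-liner; the file name here is just an example:

try (InputStream in = new FileInputStream("somefile.bin")) {
    String md5 = DigestUtils.md5Hex(in); // reads the whole stream, returns a lower-case hex string
    System.out.println(md5);
}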
Changes to a file are relatively easy to monitor: File.lastModified() changes each time the file is modified (and closed). There is even a built-in API to get notified of selected changes to the file system: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
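A minimal sketch of that WatchService API (the directory path is just an example; this would live in a method that may throw IOException and InterruptedException):

WatchService watcher = FileSystems.getDefault().newWatchService();
Path dir = Paths.get("/some/watched/dir");
dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
WatchKey key = watcher.take(); // blocks until something in the directory changes
for (WatchEvent<?> event : key.pollEvents()) {
    System.out.println("Changed: " + event.context()); // context() is the file name relative to dir
}
key.reset(); // re-arm the key to receive further events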
The hashCode of an InputStream is not suitable for detecting changes (there is no definition of how an InputStream should calculate its hashCode - quite likely it's using Object.hashCode, meaning the hashCode doesn't depend on anything but object identity).
Building an MD5 hash as you are attempting works, but it requires reading the entire file every time - quite a performance killer if the file is large and/or you are watching multiple files.
You are confusing two related, but different responsibilities.
First you have a Stream which provides stuff to be read. Then you have a checksum on that stream; however, your implementation is a static method call, effectively divorcing it from a class, meaning that nobody has the responsibility for maintaining the checksum.
Try reworking your solution like so
public class ChecksumInputStream extends InputStream { // InputStream is an abstract class, so extend it

    private final InputStream in;
    private final MessageDigest md; // holds the running checksum state

    public ChecksumInputStream(InputStream source) throws NoSuchAlgorithmException {
        this.in = source;
        this.md = MessageDigest.getInstance("MD5");
    }

    @Override
    public int read() throws IOException {
        int value = in.read();
        if (value != -1) {
            md.update((byte) value); // don't feed the -1 end-of-stream marker into the digest
        }
        return value;
    }

    // ... and repeat for the other read methods.
}
Note that now you only do one read, with the checksum calculator decorating the original input stream.
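Note that the JDK already ships a decorator along these lines: java.security.DigestInputStream. A minimal sketch (the file name is just an example, and this belongs in a method that can throw IOException/NoSuchAlgorithmException):

MessageDigest md = MessageDigest.getInstance("MD5");
try (DigestInputStream dis = new DigestInputStream(new FileInputStream("somefile.bin"), md)) {
    byte[] buffer = new byte[8192];
    while (dis.read(buffer) != -1) {
        // every read also updates the digest
    }
}
byte[] hash = md.digest();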
The issue is that after you first read the InputStream, its position has reached the end. The quick way to resolve your issue is:
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
Related
I am using the Deflater class to try to compress a large set of random strings. My compression and decompression methods look like this:
public static String compressAndEncodeBase64(String text) {
try {
ByteArrayOutputStream os = new ByteArrayOutputStream();
try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
dos.write(text.getBytes());
}
byte[] bytes = os.toByteArray();
return new String(Base64.getEncoder().encode(bytes));
} catch (Exception e){
log.info("Caught exception when trying to compress {}: ", text, e);
}
return null;
}
public static String decompressB64(String compressedAndEncodedText) {
try {
byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);
ByteArrayOutputStream os = new ByteArrayOutputStream();
try (OutputStream ios = new InflaterOutputStream(os)) {
ios.write(decodedText);
}
byte[] decompressedBArray = os.toByteArray();
return new String(decompressedBArray, StandardCharsets.UTF_8);
} catch (Exception e){
log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
}
}
However, when I test this on a large set of random strings, my "compressed" string is larger than the original string. Even for a relatively small random string, the compressed data is longer. For example, this unit test fails:
@Test
public void testCompressDecompressRandomString() {
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue((orig.length() - compressedString.length()) > 0,
            "The original string has length " + orig.length()
            + ", while the compressed string has length " + compressedString.length());
}
Can anyone explain what's going on, and suggest a possible alternative?
Note: I tried using the deflater without the base64 encoding:
public static String compress(String data) {
Deflater new_deflater = new Deflater();
new_deflater.setInput(data.getBytes(StandardCharsets.UTF_8));
new_deflater.finish();
byte compressed_string[] = new byte[1024];
int compressed_size = new_deflater.deflate(compressed_string);
byte[] returnValues = new byte[compressed_size];
System.arraycopy(compressed_string, 0, returnValues, 0, compressed_size);
log.info("The Original String: " + data + "\n Size: " + data.length());
log.info("The Compressed String Output: " + new String(compressed_string) + "\n Size: " + compressed_size);
return new String(returnValues, StandardCharsets.UTF_8);
}
My test still fails however.
First off, you aren't going to get much or any compression on short strings. Compressors need more data to both collect statistics on the data and to have previous data in which to look for repeated strings.
Second, if you're testing with random data, you are further crippling the compressor, since now there are no repeated strings. For your test case with random alphanumeric strings, the only compression you can get is to take advantage of the fact that there are only 62 possible values for each byte. That can be compressed by a factor of log(62)/log(256) = 0.744. Even then, you need to have enough input to cancel the overhead of the code description. Your test case of 71 characters will always be compressed to 73 bytes by deflate, which is essentially just copying the data with a small overhead. There isn't enough input to justify the code description to take advantage of the limited character set. If I have 1,000,000 random characters from that set of 62, then deflate can compress that to about 752,000 bytes.
Third, you are then expanding the resulting compressed data by a factor of 1.333 by encoding it using Base64. So if I take that compression by a factor of 0.752 and then expand it by 1.333, I get an overall expansion of 1.002! You won't get anywhere that way on random characters from a set of 62, no matter how long the input is.
Given all that, you need to do your testing on real-world inputs. I suspect that your application does not have randomly-generated data. Don't attempt compression on short strings. Combine your strings into much longer input, so that the compressor has something to work with. If you must encode with Base64, then you must. But expect that there may be expansion instead of compression. You could include in your output format an option for chunks to be compressed or not compressed, indicated by a leading byte. Then when compressing, if it doesn't compress, send it without compression instead. You can also try a more efficient encoding, e.g. Base85, or whatever number of characters you can transmit transparently.
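As a sketch of that "leading byte" idea (the format, names, and marker values here are illustrative only, not a standard):

// Requires java.io.ByteArrayOutputStream, java.io.IOException, java.util.zip.DeflaterOutputStream.
static byte[] maybeCompress(byte[] raw) throws IOException {
    ByteArrayOutputStream deflatedBuf = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(deflatedBuf)) {
        dos.write(raw);
    }
    byte[] deflated = deflatedBuf.toByteArray();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (deflated.length < raw.length) {
        out.write(1);        // 1 = payload is deflated
        out.write(deflated);
    } else {
        out.write(0);        // 0 = payload is stored as-is
        out.write(raw);
    }
    return out.toByteArray();
}

The reader then checks the first byte and either inflates the rest or uses it directly.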
How can we write a byte array to a file (and read it back from that file) in Java?
Yes, we all know there are already lots of questions like that, but they get very messy and subjective due to the fact that there are so many ways to accomplish this task.
So let's reduce the scope of the question:
Domain:
Android / Java
What we want:
Fast (as possible)
Bug-free (in a rigidly meticulous way)
What we are not doing:
Third-party libraries
Any libraries that require Android API later than 23 (Marshmallow)
(So, that rules out Apache Commons, Google Guava, java.nio, and leaves us with good ol' java.io)
What we need:
Byte array is always exactly the same (content and size) after going through the write-then-read process
Write method only requires two arguments: File file, and byte[] data
Read method returns a byte[] and only requires one argument: File file
In my particular case, these methods are private (not a library) and are NOT responsible for the following, (but if you want to create a more universal solution that applies to a wider audience, go for it):
Thread-safety (file will not be accessed by more than one process at once)
File being null
File pointing to non-existent location
Lack of permissions at the file location
Byte array being too large
Byte array being null
Dealing with any "index," "length," or "append" arguments/capabilities
So... we're sort of in search of the definitive bullet-proof code that people in the future can assume is safe to use because your answer has lots of up-votes and there are no comments that say, "That might crash if..."
This is what I have so far:
Write Bytes To File:
private void writeBytesToFile(final File file, final byte[] data) {
try {
FileOutputStream fos = new FileOutputStream(file);
fos.write(data);
fos.close();
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
}
Read Bytes From File:
private byte[] readBytesFromFile(final File file) {
RandomAccessFile raf;
byte[] bytesToReturn = new byte[(int) file.length()];
try {
raf = new RandomAccessFile(file, "r");
raf.readFully(bytesToReturn);
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
return bytesToReturn;
}
From what I've read, the possible Exceptions are:
FileNotFoundException : Am I correct that this should not happen as long as the file path being supplied was derived using Android's own internal tools and/or if the app was tested properly?
IOException : I don't really know what could cause this... but I'm assuming that there's no way around it if it does.
So with that in mind... can these methods be improved or replaced, and if so, with what?
It looks like these are going to be core utility/library methods which must run on Android API 23 or later.
Concerning library methods, I find it best to make no assumptions on how applications will use these methods. In some cases the applications may want to receive checked IOExceptions (because data from a file must exist for the application to work), in other cases the applications may not even care if data is not available (because data from a file is only cache that is also available from a primary source).
When it comes to I/O operations, there is never a guarantee that operations will succeed (e.g. user dropping phone in the toilet). The library should reflect that and give the application a choice on how to handle errors.
To optimize I/O performance, always assume the "happy path" and catch errors to figure out what went wrong. This is counterintuitive to normal programming but essential when dealing with storage I/O. For example, just checking whether a file exists before reading it can make your application twice as slow - all these kinds of I/O actions add up fast and slow your application down. Just assume the file exists, and if you get an error, only then check whether the file exists.
So given those ideas, the main functions could look like:
public static void writeFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileOutputStream out = new FileOutputStream(f)) {
out.write(data);
}
}
public static int readFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileInputStream in = new FileInputStream(f)) {
return in.read(data);
}
}
Notes about the implementation:
The methods can also throw runtime-exceptions like NullPointerExceptions - these methods are never going to be "bug free".
I do not think buffering is needed/wanted in the methods above since only one native call is done
(see also here).
The application now also has the option to read only the beginning of a file.
To make it easier for an application to read a file, an additional method can be added. But note that it is up to the library to detect any errors and report them to the application since the application itself can no longer detect those errors.
public static byte[] readFile(File f) throws FileNotFoundException, IOException {
int fsize = verifyFileSize(f);
byte[] data = new byte[fsize];
int read = readFile(f, data);
verifyAllDataRead(f, data, read);
return data;
}
private static int verifyFileSize(File f) throws IOException {
long fsize = f.length();
if (fsize > Integer.MAX_VALUE) {
throw new IOException("File size (" + fsize + " bytes) for " + f.getName() + " too large.");
}
return (int) fsize;
}
public static void verifyAllDataRead(File f, byte[] data, int read) throws IOException {
if (read != data.length) {
throw new IOException("Expected to read " + data.length
+ " bytes from file " + f.getName() + " but got only " + read + " bytes from file.");
}
}
This implementation adds another hidden point of failure: OutOfMemory at the point where the new data array is created.
To accommodate applications further, additional methods can be added to help with different scenarios. For example, let's say the application really does not want to deal with checked exceptions:
public static void writeFileData(File f, byte[] data) {
try {
writeFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
}
public static byte[] readFileData(File f) {
try {
return readFile(f);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return null;
}
public static int readFileData(File f, byte[] data) {
try {
return readFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return -1;
}
private static void fileExceptionToRuntime(Exception e) {
if (e instanceof RuntimeException) { // e.g. NullPointerException
throw (RuntimeException)e;
}
RuntimeException re = new RuntimeException(e.toString());
re.setStackTrace(e.getStackTrace());
throw re;
}
The method fileExceptionToRuntime is a minimal implementation, but it shows the idea here.
The library could also help an application to troubleshoot when an error does occur. For example, a method canReadFile(File f) could check if a file exists and is readable and is not too large. The application could call such a function after a file-read fails and check for common reasons why a file cannot be read. The same can be done for writing to a file.
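A sketch of such a troubleshooting helper might look like this (the method name, return convention, and checks are just examples):

// Hypothetical helper: called after a read fails, to report common causes.
public static String canReadFile(File f) {
    if (f == null) return "File is null.";
    if (!f.exists()) return "File does not exist.";
    if (!f.isFile()) return "Not a regular file.";
    if (!f.canRead()) return "File is not readable.";
    if (f.length() > Integer.MAX_VALUE) return "File is too large to read into a byte array.";
    return null; // no obvious problem found
}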
Although you can't use third party libraries, you can still read their code and learn from their experience. In Google Guava for example, you usually read a file into bytes like this:
FileInputStream reader = new FileInputStream("test.txt");
byte[] result = ByteStreams.toByteArray(reader);
The core implementation of this is toByteArrayInternal. Before calling this, you should check:
A not null file is passed (NullPointerException)
The file exists (FileNotFoundException)
After that, it is reduced to handling an InputStream, and this is where IOExceptions come from. When reading streams, a lot of things outside the control of your application can go wrong (bad sectors and other hardware issues, malfunctioning drivers, OS access rights) and manifest themselves as an IOException.
I am copying here the implementation:
private static final int BUFFER_SIZE = 8192;
/** Max array length on JVM. */
private static final int MAX_ARRAY_LEN = Integer.MAX_VALUE - 8;
private static byte[] toByteArrayInternal(InputStream in, Queue<byte[]> bufs, int totalLen)
throws IOException {
// Starting with an 8k buffer, double the size of each successive buffer. Buffers are retained
// in a deque so that there's no copying between buffers while reading and so all of the bytes
// in each new allocated buffer are available for reading from the stream.
for (int bufSize = BUFFER_SIZE;
totalLen < MAX_ARRAY_LEN;
bufSize = IntMath.saturatedMultiply(bufSize, 2)) {
byte[] buf = new byte[Math.min(bufSize, MAX_ARRAY_LEN - totalLen)];
bufs.add(buf);
int off = 0;
while (off < buf.length) {
// always OK to fill buf; its size plus the rest of bufs is never more than MAX_ARRAY_LEN
int r = in.read(buf, off, buf.length - off);
if (r == -1) {
return combineBuffers(bufs, totalLen);
}
off += r;
totalLen += r;
}
}
// read MAX_ARRAY_LEN bytes without seeing end of stream
if (in.read() == -1) {
// oh, there's the end of the stream
return combineBuffers(bufs, MAX_ARRAY_LEN);
} else {
throw new OutOfMemoryError("input is too large to fit in a byte array");
}
}
As you can see most of the logic has to do with reading the file in chunks. This is to handle situations, where you don't know the size of the InputStream, before starting reading. In your case, you only need to read files and you should be able to know the length beforehand, so this complexity could be avoided.
The other check is for OutOfMemoryError. In standard Java the array-size limit is very large, but on Android the available heap will be much smaller. You should check, before trying to read the file, that there is enough memory available.
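A rough sketch of such a pre-check, assuming the file length is known up front (the headroom factor is an arbitrary assumption, and maxMemory()/freeMemory() only approximate what is really available):

// Requires java.io.DataInputStream, java.io.FileInputStream, java.io.File, java.io.IOException.
public static byte[] readFileChecked(File f) throws IOException {
    long size = f.length();
    Runtime rt = Runtime.getRuntime();
    long roughlyAvailable = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
    if (size > roughlyAvailable / 2) { // leave generous headroom; the factor 2 is arbitrary
        throw new IOException("File of " + size + " bytes is too large to buffer in memory.");
    }
    byte[] data = new byte[(int) size];
    try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
        in.readFully(data); // throws EOFException if the file is shorter than expected
    }
    return data;
}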
I want to download large files from Google Cloud Storage using the google provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is: when is the file content actually downloaded? During (references to the code below) storage.get(blobId), during blob.reader(), or during reader.read(bytes)? This becomes very important when it comes to handling an invalid checksum: what do I need to do in order to actually trigger the file being fetched over the network again?
The simpler question is: Is there built in functionality to do md5 (or crc32c) check on the received file in the google library? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
// In my real code, this is a field populated in the constructor.
Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
BlobId blobId = BlobId.of(bucketName, storageFileName);
Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
int retryCounter = 1;
Blob blob;
boolean checksumOk;
MessageDigest messageDigest;
try {
messageDigest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException ex) {
throw new RuntimeException(ex);
}
do {
LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
blob = storage.get(blobId);
if (null == blob) {
throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
}
if (Files.exists(outputFile)) {
Files.delete(outputFile);
}
try (ReadChannel reader = blob.reader();
FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
int bytesRead = reader.read(bytes);
while (bytesRead > 0) {
bytes.flip();
messageDigest.update(bytes.array(), 0, bytesRead);
channel.write(bytes);
bytes.clear();
bytesRead = reader.read(bytes);
}
}
String checksum = Base64.encodeBase64String(messageDigest.digest());
checksumOk = checksum.equals(blob.getMd5());
if (!checksumOk) {
Files.delete(outputFile);
messageDigest.reset();
}
} while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
if (!checksumOk) {
throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
}
return outputFile;
}
The google-cloud-java storage library does not validate checksums on its own when reading data beyond normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which for very large files would be infeasible.
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.
As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes or nothing and just fetch byte for byte when you call read(). You will never know and you shouldn't care.
As only read() throws an IOException and the other methods you used do not, I'd say that only calling read() will actually download stuff. You can also see this in the sources of the lib.
Btw. despite the example in the JavaDocs of the library, you should check for >= 0, not > 0. 0 just means nothing was read, not that end of stream is reached. End of stream is signaled by returning -1.
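With that fix, the copy loop from the question would look like this (a sketch reusing the variables from the code above):

int bytesRead = reader.read(bytes);
while (bytesRead >= 0) { // -1 signals end of stream; 0 just means nothing was read this time
    bytes.flip();
    messageDigest.update(bytes.array(), 0, bytesRead);
    channel.write(bytes);
    bytes.clear();
    bytesRead = reader.read(bytes);
}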
For retrying after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself, so obtaining a new reader from the blob means the file will be downloaded again from the remote.
We generate a file in our code. Sometimes the file coincides with one we have generated before. The question is: how can we check whether the files are the same and skip writing?
The only way I see is:
read the saved file into a string and generate its hash
generate a hash of the string we want to save into the new file
compare the hashes
Maybe there are better ways?
In my opinion, a hash is the best way to detect modifications/updates. Alternatively, if there is a specific line or character that changes whenever there is an update, you can just check that against the newly generated file and decide whether to proceed with the write operation. You could also introduce a parameter such as a counter when you write a file, but updating the counter requires logic tied to the changes made before writing. The answer to this question depends on the context and workings of the application.
An MD5 checksum is the easiest way; I think your approach is valid.
Example I use in a unit test:
/**
 * Returns an MD5 checksum for a file.
 *
 * @param filename name of the file to hash
 * @return the checksum as a hex string
 * @throws Exception if the file cannot be read or MD5 is unavailable
 */
private static String createChecksumForFile(String filename) throws Exception {
    MessageDigest complete = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[1024];
    int numRead;
    try (InputStream fis = new FileInputStream(filename)) {
        do {
            numRead = fis.read(buffer);
            if (numRead > 0) {
                complete.update(buffer, 0, numRead);
            }
        } while (numRead != -1);
    }
    byte[] b = complete.digest();
    StringBuilder result = new StringBuilder();
    for (byte aB : b) {
        result.append(Integer.toString((aB & 0xff) + 0x100, 16).substring(1));
    }
    return result.toString();
}
Unless there is an easy way to determine whether the data is still up to date, it will be more efficient to simply overwrite it with the newly generated data, since reading and hashing a complete file is quite likely to be slower than simply overwriting. This is highly dependent on the file size, though.
How do you write a custom deflation dictionary and implement it
{"playerX":"64","playerY":"224","playerTotalHealth":"100","playerCurrentHealth":"100","playerTotalMana":"50","playerCurrentMana":"50","playerExp":"0","playerExpTNL":"20","playerLevel":"1","points":"0","strength":"1","dexterity":"1","constitution":"1","intelligence":"1","wisdom":"1","items":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"],"currentMapX":"0","currentMapY":"0","playerBody":"1","playerHair":"6","playerClothes":"7"}
This is the String I am trying to compress.
Things that will never change are the names of the variables, so I want to add those to a dictionary (this is a JSON object).
There are a lot of things I could put into a dictionary, such as
"playerX":
"playerY":
I am trying to compress this to as small as I can get it.
I just don't know how to implement it as a dictionary. I know that I have to use a byte[], but how do I separate words in a byte[]?
Currently the code I provided below compresses it from a length of 494 down to 253. I want to get it as small as I can; since it is a small String I would rather have more compression than speed.
You don't have to solve it for me, but maybe provide hints and sources on what I can do to make this string mega small.
public static void main(String[] args)
{
deflater("String");
}
public static String deflater(String str)
{
System.out.println("Original: " + str + ":End");
System.out.println("Length: " + str.length());
byte[] input = str.getBytes();
Deflater d = new Deflater();
d.setInput(input);
d.setLevel(1);
d.finish();
ByteArrayOutputStream dbos = new ByteArrayOutputStream(input.length);
byte[] buffer = new byte[1024];
while(d.finished() == false)
{
int bytesCompressed = d.deflate(buffer);
System.out.println("Total Bytes: " + bytesCompressed);
dbos.write(buffer, 0, bytesCompressed);
}
try
{
dbos.close();
}
catch(IOException e1)
{
e1.printStackTrace();
System.exit(0);
}
//Dictionary implementation required!
byte[] compressedArray = dbos.toByteArray();
String compStr = new String(compressedArray);
System.out.println("Compressed: " + compStr + ":End");
System.out.println("Length: " + compStr.length());
return null;
}
The dictionary is simply your common strings concatenated to make a sequence of bytes less than or equal to 32K in length. You do not need to separate words. There is no structure to the dictionary. It is simply used as a source of data to match the current string to. You should put the more common strings at the end of the dictionary, since it takes fewer bits to encode shorter distances back.
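A minimal sketch of wiring a preset dictionary in with java.util.zip (the dictionary bytes are just an example built from a few of the JSON keys above, not a tuned dictionary; note the Inflater must be given the exact same bytes, and inflate() returns 0 with needsDictionary() == true until it gets them):

// Requires java.util.zip.Deflater, Inflater, DataFormatException and java.nio.charset.StandardCharsets.
static String dictionaryRoundTrip(String json) throws DataFormatException {
    byte[] dictionary = "\"playerTotalHealth\":\"playerCurrentHealth\":\"playerX\":\"playerY\":"
            .getBytes(StandardCharsets.UTF_8);
    byte[] input = json.getBytes(StandardCharsets.UTF_8);

    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setDictionary(dictionary);
    deflater.setInput(input);
    deflater.finish();
    byte[] compressed = new byte[4096]; // assumes the result fits in one buffer, for brevity
    int compressedLen = deflater.deflate(compressed);
    deflater.end();

    Inflater inflater = new Inflater();
    inflater.setInput(compressed, 0, compressedLen);
    byte[] out = new byte[4096];
    int n = inflater.inflate(out); // returns 0 the first time...
    if (n == 0 && inflater.needsDictionary()) {
        inflater.setDictionary(dictionary); // ...because the same dictionary must be supplied here
        n = inflater.inflate(out);
    }
    inflater.end();
    return new String(out, 0, n, StandardCharsets.UTF_8); // the round-tripped JSON
}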