Hashing a file with SHA256


I need to add a digital signature to a picture using the RSA method. To do this, I first need to hash the file with SHA256. But how can this be done? As I understand it, I should get an array of byte hash[] with hashed bytes.
For example, in C# there is a way to do this with MD5:
byte[] hash = MD5.Create().ComputeHash(bytesOfFile);
I tried the following, but how can I get the array of hash bytes, and how can I change the size of the final one?
MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
byte[] array = Files.readAllBytes(Path.of("C:\\Users\\User\\IdeaProjects\\CryptoLab1\\src\\crypto\\dino.png"));
sha256.digest(array);
System.out.println(Arrays.toString(sha256.digest()));

You're almost there. sha256.digest(array) itself returns the result, and it resets the MessageDigest afterwards, so your second digest() call returns the hash of empty input rather than the hash of your file. The API of MessageDigest is: send as many bytes as you want in chunks via the update method, and then do a final call with a digest method. Purely as a convenience, digest(byteArr) is identical to update(byteArr); digest();.
So, replace your last 2 lines with just System.out.println(Arrays.toString(sha256.digest(array)));.
A few more notes:
This code eats a ton of memory if the file is large, and there is absolutely no need for it; you can just repeatedly read a smallish byte array, send it to the MessageDigest object via an update method. Imagine this file is 4GB large; such a setup can easily do it in a tiny memory footprint, whereas your code would require 4GB worth of memory.
A hash's size is what it is (32 bytes for SHA-256); you can't "resize" it, as that doesn't make sense. Because raw bytes are awkward to display, various text encodings exist, and there is no one unified standard. Whatever tool gives you the supposed hash will have picked a way to turn a bunch of bytes (the hash) into string form. The common candidates are hex nibbles, which look like a sequence of symbols where each symbol is 0-9 or A-F, with the letters either all uppercase or all lowercase, and base64, which is a mix of letters, digits, and a few symbols such as + and /.
To turn a byte array to hex-nibble form in java, follow this guide. To turn a byte array into base64, it's simply Base64.getEncoder().encodeToString(byteArr).
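The notes above can be sketched roughly as follows: read the file in small chunks, feed each chunk to update, call digest once at the end, and then pick a text encoding. This is a minimal sketch, not a definitive implementation; the class name StreamingSha256, the 8 KiB buffer size, and the toHex helper are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper class; the name and buffer size are illustrative choices.
final class StreamingSha256 {

    // Hashes a file chunk by chunk, so memory use stays constant regardless of file size.
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read); // feed only the bytes actually read
            }
        }
        return md.digest(); // always 32 bytes for SHA-256
    }

    // Lowercase hex form: two characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] hash = sha256(Path.of(args[0])); // file path supplied on the command line
        System.out.println(toHex(hash));
        System.out.println(Base64.getEncoder().encodeToString(hash));
    }
}
```

The byte array returned by sha256 is what you would pass on to the signing step; the hex and base64 strings are only for display.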

Related

Is a non-random initialization vector still bad when encrypting short strings?

We encrypt with AES 256 CBC PKCS5PADDING in Java with the libraries one has to download from Oracle, with Base64 encoding of the resulting byte arrays. I have read that a static common initialization vector drastically decreases the security, as texts that start with the same characters will look the same when encrypted. Is this still true for short strings (12 numeric chars)?
I have encrypted a large set and I cannot find any reoccurring substrings in the resulting encrypted strings, even when they start with the same sequence.
Example (plaintext on the left and resulting encrypted string on the right)
555555555501 -> U0Mkd0PPloB5iLBy5jM6nw==
555555555502 -> NUHWaFs62LMEeyoGA0mGoQ==
555555555503 -> X3/XJNd4TzEsMv7V0bXwqg==
Albeit separate from the question, but to preempt some suggestions: we need to be able to do look ups based on plaintext strings and to be able to decrypt. We could do both hashing and encryption, but prefer to avoid it if it does not improve security significantly as it adds complexity.

I have read that static common initialization vector are bad as one can derive the key from encrypted strings.
I'm curious: where have you read that?
With short (<=16 bytes) plaintext, a random IV effectively works as a salt, i.e. it causes the ciphertext to differ even if the plaintext is the same. This is an important feature in a lot of applications. But you write:
We need to be able to do look ups based on plaintext strings.
So you want to build some sort of pseudonymization database? If that is a requirement for you, the feature that salt, and in your case random IV adds, is actually one that you specifically don't want. Depending on your other requirements you can probably get away with using a static IV here. But for pseudonymization in general, it is recommended to use a dedicated pseudonym. In your case the data seems to be atomic. But in the general case of, for example, address data, you want to hash the name, the zip code, the city and whatever else your pseudonym is, separately, both to allow more specific queries, and to keep access to and information flow from your data under strict control.
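To make the trade-off concrete: a 12-character plaintext fits inside a single 16-byte AES block, so with a fixed key and a fixed IV, equal plaintexts always encrypt to equal ciphertexts (which is what makes lookups possible), while plaintexts that differ anywhere produce ciphertexts with no visible shared prefix. The following is a minimal sketch under stated assumptions: the class name StaticIvDemo and the all-zero key and IV are placeholders for illustration only, not values to use in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; key and IV are all-zero placeholders, never use them for real data.
final class StaticIvDemo {
    private static final byte[] KEY = new byte[32]; // 256-bit AES key (placeholder)
    private static final byte[] IV = new byte[16];  // static IV (placeholder)

    // Encrypts with AES/CBC/PKCS5Padding using the fixed key and IV, then Base64-encodes.
    static String encrypt(String plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        byte[] encrypted = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encrypt("555555555501"));
        System.out.println(encrypt("555555555501")); // same output again: deterministic
        System.out.println(encrypt("555555555502")); // unrelated-looking output
    }
}
```

The deterministic output is exactly the property a random IV would remove, which is why it conflicts with the lookup requirement.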

Avoiding line breaks in encrypted and encoded URL string

I am trying to implement a simple string encoder to obfuscate some parts of a URL string (to prevent them from getting mucked with by a user). I'm using code nearly identical to the sample in the JCA guide, except:
using DES (assuming it's a little faster than AES, and requires a smaller key) and
Base64 en/decoding the string to make sure it stays safe for a URL.
For reasons I can't understand, the output string ends up with line breaks, which I presume won't work. I can't figure out what's causing this. Suggestions on something similar that's easier, or pointers to some other resources to read? I'm finding all the cryptography references a bit over my head (and overkill), but a simple ROT13 implementation won't work since I want to deal with a larger character set (and don't want to waste time implementing something likely to have issues with obscure characters I didn't think of).
Sample input (no line break):
http://maps.google.com/maps?q=kansas&hl=en&sll=42.358431,-71.059773&sspn=0.415552,0.718918&hnear=Kansas&t=m&z=7
Sample Output (line breaks as shown below):
GstikIiULcJSGEU2NWNTpyucSWUFENptYk4m5lD8RJl8l1CuspiuXiE9a07fUEAGM/tC7h0Vzus+
jAH6cT4Wtz2RUlBdGf8WtQxVDKZVOzKwi84eQh2kZT9T3KomlnPOu2owJ/2RAEvG+QuGem5UGw==
my encode snippet:
final Key key = new SecretKeySpec(seed.getBytes(), "DES");
final Cipher c = Cipher.getInstance("DES");
c.init(Cipher.ENCRYPT_MODE, key);
final byte[] encVal = c.doFinal(s.getBytes());
return new BASE64Encoder().encode(encVal);

Simply perform base64Str = base64Str.replaceAll("(?:\\r\\n|\\n\\r|\\n|\\r)", "")
on the encoded string.
It works fine when you try to decode it back to bytes. I tested it several times with randomly generated byte arrays. Evidently the decoding process just ignores the newlines, whether they are present or not.
I tested this "confirmed working" by using com.sun.org.apache.xml.internal.security.utils.Base64
Other encoders not tested.

Base64 encoders usually impose some maximum line (chunk) length and add newlines when necessary. You can normally configure that, but it depends on the particular encoder implementation.
For example, the class from Apache Commons has a linelength attribute, setting it to zero (or negative) disables the line separation.
BTW: I agree with the other answer in that DES is hardly advisable today. Further, are you just "obfuscating" or really encrypting? Who has the key? The whole thing does not smell very well to me.

import android.util.Base64;
...
return Base64.encodeToString(encVal, Base64.NO_WRAP);

Though it's unrelated to your actual question, DES is generally slower than AES (at least in software), so unless you really need to keep the key small, AES is almost certainly a better choice.
Second, it's perfectly normal that encryption (DES or AES) would/will produce new-line characters in its output. Producing output without them will be entirely up to the base-64 encoder, so that's where you clearly need to look.
It's not particularly surprising to see a base-64 insert new-line characters at regular intervals in its output though. The most common use for base-64 encoding is putting raw data into something like the body of an email, where a really long line would cause a problem. To prevent that, the data is broken up into pieces, typically no more than 80 columns (and usually a bit less). In this case, the new-lines should be ignored, however, so you should be able to just delete them, if memory serves.
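On Java 8 and later, one option not mentioned in the answers above is java.util.Base64, whose basic and URL encoders never insert line separators (only the MIME encoder wraps lines). The sketch below assumes that class is available; the class name UrlSafeBase64Demo and its method names are made up for illustration. The URL-safe variant also replaces + and / with - and _, which matters when the result goes into a URL.

```java
import java.util.Base64;

// Hypothetical helper class; assumes Java 8+ where java.util.Base64 exists.
final class UrlSafeBase64Demo {

    // Encodes without line breaks, using the URL-safe alphabet and no '=' padding.
    static String encode(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    // Decodes text produced by encode().
    static byte[] decode(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] sample = new byte[100]; // stands in for the encrypted bytes (encVal)
        String encoded = encode(sample);
        System.out.println(encoded.contains("\n")); // false: no line breaks inserted
        System.out.println(decode(encoded).length); // 100
    }
}
```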

Is it possible to search a file with compressed objects in java?

I read from ORACLE of the following bit:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea, say if I have a char array of 4M something long, is it possible for me to compress it to several hundreds of bytes using GZIPOutputStream, and then map the whole file into memory, and do random search on it by comparing bits? Say if I am looking for a char sequence of "abcd", could I somehow get the bit sequence of compressed version of "abcd", and then just search the file for it? Thanks.

You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed, i.e. the only way to determine what a byte means is to read all the previous bytes.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relatively short section of data.
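The sectioning idea can be sketched as follows: compress each fixed-size section independently, so that reading one section only requires decompressing that section. This is an illustrative sketch, not a tuned design; the class name ChunkedGzip, the method names, and the choice of splitting by character count are assumptions. A search for "abcd" would still have to decompress sections and handle matches that straddle a section boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class illustrating independently compressed sections.
final class ChunkedGzip {

    // Splits text into sections of at most chunkChars characters and gzips each one separately.
    // Assumes no surrogate pair straddles a section boundary.
    static List<byte[]> compress(String text, int chunkChars) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkChars) {
            String part = text.substring(start, Math.min(text.length(), start + chunkChars));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(part.getBytes(StandardCharsets.UTF_8));
            }
            chunks.add(out.toByteArray());
        }
        return chunks;
    }

    // Decompresses a single section without touching the others.
    static String decompress(byte[] chunk) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(chunk))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Smaller sections mean less work per lookup but a worse compression ratio, since each section starts with no history.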

Could anyone verify the correctness of getting a md5 hash using this method?

MessageDigest m=MessageDigest.getInstance("MD5");
StringBuffer sb = new StringBuffer();
if(nodeName!=null) sb.append(nodeName);
if(nodeParentName!=null) sb.append(nodeParentName);
if(nodeParentFieldName!=null) sb.append(nodeParentFieldName);
if(nodeRelationName!=null) sb.append(nodeRelationName);
if(nodeViewName!=null) sb.append(nodeViewName);
if(treeName!=null) sb.append(treeName);
if(nodeValue!=null && nodeValue.trim().length()>0) sb.append(nodeValue);
if(considerParentHash) sb.append(parentHash);
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
BigInteger i = new BigInteger(1,m.digest());
hash = String.format("%1$032X", i);
The idea behind these lines of code is that we append all the values of a class/model into a StringBuilder and then return the padded hash of that (the Java implementation returns md5 hashes that are length 30 or 31, so the last line formats the hash to be padded with 0s).
I can verify that this works, but I have a feeling it fails at one point (our application fails and I believe this to be the probable cause).
Can anyone see a reason why this wouldn't work? Are there any workarounds to make this code less prone to errors (e.g. removing the need for the strings to be UTF-8).

There are a few weird things in your code.
UTF-8 encoding of a character may use more than one byte. So you should not use the string length as final parameter to the update() call, but the length of the array of bytes that getBytes() actually returned. As suggested by Paŭlo, use the update() method which takes a single byte[] as parameter.
The output of MD5 is a sequence of 16 bytes with quite arbitrary values. If you interpret it as an integer (that's what you do with your call to BigInteger()), then you will get a numerical value which will be smaller than 2^128, possibly much smaller. When converted back to hexadecimal digits, you may get 32, 31, 30... or fewer than 30 characters. Your usage of the "%032X" format string left-pads with enough zeros, so your code works, but it is kind of indirect (the output of MD5 has never been an integer to begin with).
You assemble the hash input elements with raw concatenation. This may induce issues. For instance, if nodeName is "foo" and nodeParentName is "barqux", then the MD5 input will begin with (the UTF-8 encoding of) "foobarqux". If nodeName is "foobar" and nodeParentName is "qux", then the MD5 input will also begin with "foobarqux". You do not tell why you want to use a hash function, but usually, when one uses a hash function, it is to have a unique trace of some piece of data; two distinct data elements should yield distinct hash inputs.
When handling nodeValue, you call trim(), which means that this string could begin and/or end with whitespace, and you do not want to include that whitespace into the hash input -- but you do include it, since you append nodeValue and not nodeValue.trim().
If what you are trying to do has any relation to security then you should not use MD5, which is cryptographically broken. Use SHA-256 instead.
Hashing an XML element is normally done through canonicalization (which handles whitespace, attribute order, text representation, and so on). See this question on the topic of canonicalizing XML data with Java.
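One way to address the points above together is to length-prefix each field before hashing, so that ("foo", "barqux") and ("foobar", "qux") produce different hash inputs, and to pass the whole byte array to the digest instead of a character count. The sketch below is only one possible encoding, not the canonicalization approach mentioned above; the class name FieldHasher, the -1 marker for null fields, and the switch to SHA-256 are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the field encoding is an illustrative choice.
final class FieldHasher {

    // Hashes the fields so that field boundaries are part of the hash input.
    static String hashFields(String... fields) throws IOException, NoSuchAlgorithmException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (String field : fields) {
            if (field == null) {
                out.writeInt(-1); // marks a missing field, distinct from an empty string
            } else {
                byte[] utf8 = field.getBytes(StandardCharsets.UTF_8);
                out.writeInt(utf8.length); // byte length, not character count
                out.write(utf8);
            }
        }
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(buffer.toByteArray());

        // Format each byte as two uppercase hex digits; no BigInteger detour needed.
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }
}
```

With this encoding, hashFields("foo", "barqux") and hashFields("foobar", "qux") give different results, whereas plain concatenation would hash the same bytes for both.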

One possible problem is here:
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
As said by Robing Green, the UTF-8 encoding can produce a byte[] which is longer than your original string (it will do this exactly when the String contains non-ASCII characters). In this case, you are only hashing the start of your String.
Better write it like this:
m.update(sb.toString().getBytes("UTF-8"));
Of course, this would not cause an exception; if you have non-ASCII characters in your string, it would simply produce a different hash than expected. You should try to boil your failure down to an SSCCE, like lesmana recommended.

How does a person go about learning Java? (convert byte array to hex string)

I know this sounds like a broad question but I can narrow it down with an example. I am VERY new at Java. For one of my "learning" projects, I wanted to create an in-house MD5 file hasher for us to use. I started off very simple by attempting to hash a string and then moving on to a file later. I created a file called MD5Hasher.java and wrote the following:
import java.security.*;
import java.io.*;
public class MD5Hasher{
    public static void main(String[] args){
        String myString = "Hello, World!";
        byte[] myBA = myString.getBytes();
        MessageDigest myMD;
        try{
            myMD = MessageDigest.getInstance("MD5");
            myMD.update(myBA);
            byte[] newBA = myMD.digest();
            String output = newBA.toString();
            System.out.println("The Answer Is: " + output);
        } catch(NoSuchAlgorithmException nsae){
            // print error here
        }
    }
}
I visited java.sun.com to view the javadocs for java.security to find out how to use MessageDigest class. After reading I knew that I had to use a "getInstance" method to get a usable MessageDigest object I could use. The Javadoc went on to say "The data is processed through it using the update methods." So I looked at the update methods and determined that I needed to use the one where I fed it a byte array of my string, so I added that part. The Javadoc went on to say "Once all the data to be updated has been updated, one of the digest methods should be called to complete the hash computation." I, again, looked at the methods and saw that digest returned a byte array, so I added that part. Then I used the "toString" method on the new byte array to get a string I could print. However, when I compiled and ran the code all that printed out was this:
The Answer Is: [B@4cb162d5
I have done some looking around here on StackOverflow and found some information here:
How can I generate an MD5 hash?
that gave the following example:
String plaintext = "your text here";
MessageDigest m = MessageDigest.getInstance("MD5");
m.reset();
m.update(plaintext.getBytes());
byte[] digest = m.digest();
BigInteger bigInt = new BigInteger(1,digest);
String hashtext = bigInt.toString(16);
// Now we need to zero pad it if you actually want the full 32 chars.
while(hashtext.length() < 32 ){
hashtext = "0"+hashtext;
}
It seems the only part I MAY be missing is the "BigInteger" part, but I'm not sure.
So, after all of this, I guess what I am asking is, how do you know to use the "BigInteger" part? I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java? I have a background in C so this Java thing seems pretty weird. Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
Thank you all for taking the time to read. :-)

The key in this particular case is that you need to realize that bytes are not "human readable", but characters are. So you need to convert bytes to characters in a certain format. For arbitrary bytes like hashes, hexadecimal is usually used as the "human readable" format. Every byte is then converted to a 2-character hexadecimal string, and these are concatenated together.
This is unrelated to the language you use. You just have to understand how it works "under the hood" in a language-agnostic way. You have to understand what you have (a byte array) and what you want (a hex string). The programming language is just a tool to achieve the desired result. You just google the "functional requirement" along with the programming language you'd like to use to achieve the requirement, e.g. "convert byte array to hex string in java".
That said, the code example you found is wrong. You should actually determine each byte inside a loop and test if it is less than 0x10 and then pad it with zero instead of only padding the zero depending on the length of the resulting string (which may not necessarily be caused by the first byte being less than 0x10!).
StringBuilder hex = new StringBuilder(bytes.length * 2);
for (byte b : bytes) {
    if ((b & 0xff) < 0x10) hex.append("0");
    hex.append(Integer.toHexString(b & 0xff));
}
String hexString = hex.toString();
Update: as per the comments on the answer of @extraneon, using new BigInteger(byte[]) is also the wrong solution. This doesn't unsign the bytes. Bytes (like all primitive integer types except char) in Java are signed. They have a negative range. The byte in Java ranges from -128 to 127, while you want to have a range of 0 to 255 to get a proper hex string. You basically just need to remove the sign to make them unsigned. The & 0xff in the above example does exactly that.
The hexstring as obtained from new BigInteger(bytes).toString(16) is NOT compatible with the result of all other hexstring producing MD5 generators the world is aware of. They will differ whenever you've a negative byte in the MD5 digest.
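A small comparison illustrates the difference described in this answer. The class name HexDemo and the two-byte sample are made up for illustration; the sample stands in for a digest whose first byte has its high bit set.

```java
import java.math.BigInteger;

// Hypothetical demo class comparing two ways of turning bytes into hex.
final class HexDemo {

    // Per-byte conversion: mask off the sign, pad each byte to two hex digits.
    static String perByteHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            int unsigned = b & 0xff;
            if (unsigned < 0x10) hex.append('0');
            hex.append(Integer.toHexString(unsigned));
        }
        return hex.toString();
    }

    // Signed interpretation: treats the array as a two's-complement number.
    static String signedBigIntegerHex(byte[] bytes) {
        return new BigInteger(bytes).toString(16);
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xff, 0x01};
        System.out.println(perByteHex(sample));          // prints "ff01"
        System.out.println(signedBigIntegerHex(sample)); // prints "-ff"
    }
}
```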

You have actually successfully digested the message. You just don't know how to present the found digest value properly. What you have is a byte array. That's a bit difficult to read, and a toString of a byte array yields [B@somewhere which is not useful at all.
The BigInteger comes into it as a tool to format the byte array to a single number.
What you do is:
Construct a BigInteger with the proper value (in this case that value happens to be encoded in the form of a byte array: your digest)
Instruct the BigInteger object to return a String representation (i.e. plain, readable text) of that number, base 16 (i.e. hex)
And the while loop prefixes that value with 0-characters to get a width of 32. I'd probably use String.format for that, but whatever floats your boat :)

MessageDigests compute a byte array of something, the string that you usually see (such as 1f3870be274f6c49b3e31a0c6728957f) is actually just a conversion of the byte array to a hexadecimal string.
When you call toString() on the byte[] returned by MessageDigest.digest(), you get the default Object.toString() result, which in Java is a sort of reference to the array (its type name and hash code), not the actual bytes.
In the code you posted, the byte array is changed to an integer (in this case a BigInteger because it would be extremely large), and then converted to hexadecimal to be printed to a String.
The byte array computed by the digest represents a number (a 128-bit number according to http://en.wikipedia.org/wiki/MD5), and that number can be converted to any other base, so the result of the MD5 could be represented as a base-10 number, a base-2 number (as in a byte array), or, most commonly, a base-16 number.

It is OK to google for answers as long as you (eventually) understand what you copy-pasted into your app :-)
In general, I recommend starting with a good Java introductory book, or web tutorial. See these threads for more tips:
https://stackoverflow.com/questions/77839/what-are-the-best-resources-for-learning-java-books-websites-etc
Learning Java
https://stackoverflow.com/questions/78293/good-book-to-learn-to-program-well-in-java-engineering-or-architecture-wise-not

Though I'm afraid that I have no experience whatsoever using Java to play with MD5 hashes, I can recommend Sun's Java Tutorials as a fantastic resource for learning Java. They go through most of the language, and helped me out a ton when I was learning Java.
Also look around for other posts asking the same thing and see what suggestions popped up there.

The reason BigInteger is used is that the byte array is too long to fit into an int or long. However, if you do want to see everything in the byte array, there's an alternate approach. You could just replace the line:
String output = newBA.toString();
with:
String output = Arrays.toString(newBA);
This will print out the contents of the array, not the reference address.

Use an IDE that shows you where the "toString()" method is coming from. In most cases it's just from the Object class and won't be very useful. It's generally recommended to override the toString method to provide some clean output, but many classes don't do this.

I'm also a newbie to development. For the current problem, I suggest the Book "Introduction To Cryptography With Java Applets" by David Bishop. It demonstrates what you need and so forth...

Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
By not starting out with an MD5 hasher! Seriously, work your way up little by little on programs that you can complete without worrying about domain-specific stuff like MD5.
If you're dumping everything into main, you're not programming Java.
In a program of this scale, your main() should do one thing: create an MD5Hasher object and then call some methods on it. You should have a constructor that takes an initial string, a method to "do the work" (update, digest), and a method to print the result.
Get some tutorials and spend time on simple, traditional exercises (a Fibonacci generator, a program to solve some logic puzzle), so you understand the language basics before bothering with the libraries, which is what you are struggling with now. Then you can start doing useful stuff.

I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java?
You could replace Java here with any language that you don't know or haven't mastered yet. Even if you have worked 10 years in a specific language, you will still get those "Aha! This is the way it's working!" moments, though not as often as in the beginning.
The point you need to learn here is that toString() does not return the representation you want or expect, but whatever the implementer has chosen. The default implementation of toString() is like this (javadoc):
Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object. The result should be a concise but informative representation that is easy for a person to read. It is recommended that all subclasses override this method.
The toString method for class Object returns a string consisting of the name of the class of which the object is an instance, the at-sign character `@', and the unsigned hexadecimal representation of the hash code of the object. In other words, this method returns a string equal to the value of:
getClass().getName() + '@' + Integer.toHexString(hashCode())
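Applied to a byte array, the class name is "[B", which is where output such as [B@4cb162d5 comes from. A minimal sketch of the difference between the default form and the array contents; the class name ArrayToStringDemo and its method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class contrasting the default toString with Arrays.toString.
final class ArrayToStringDemo {

    // Default Object.toString(): class name, '@', hash code in hex.
    static String defaultForm(byte[] array) {
        return array.toString();
    }

    // Arrays.toString(): the actual element values as signed decimals.
    static String contents(byte[] array) {
        return Arrays.toString(array);
    }

    public static void main(String[] args) {
        byte[] sample = {1, -2};
        System.out.println(defaultForm(sample)); // starts with "[B@", the rest varies per run
        System.out.println(contents(sample));    // prints "[1, -2]"
    }
}
```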

How is a person supposed to know which way to go in Java? I have a background in C so this Java thing seems pretty weird. Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
Obvious answers are 1- google when you have questions (and it's not considered cheating imo) and 2- read books on the subject matter.
Apart from these two, I would recommend trying to find a mentor for yourself. If you do not have experienced Java developers at work, then try to join a local Java developer user group. You can find more experienced developers there and perhaps pick their brains to get answers to your questions.
