Differences between Crypt.crypt() and DigestUtils.md5() in apache.commons.Codec - java

I am writing a basic password cracker for the MD5 hashing scheme against a Linux /etc/shadow file. When I use commons.codec's DigestUtils or Crypt libraries, the hash lengths they produce are different (among other things).
When I use Crypt.crypt(passwordToHash, "$1$Jhe937$") the output is a 22-character string. When I use DigestUtils.md5[Hex](passwordToHash + "Jhe937") (or the Java MessageDigest class) the output is a 32-character string (after conversion). This makes no sense to me.
Aside: is there no easy way to convert the byte[] from DigestUtils.md5(passwordToHash) to a String? I've tried all* the ways and I only get invalid output: Nz_èJÓ_µù[î¬y
*all being: new String(byte[], "UTF-8") and converting to char then to String

The executive summary is that while they'll perform the same hashing, the output format is different between the two so the lengths will be different. Read on for details.
MD5 is a message digest algorithm that always produces a 16-byte hash value (assuming valid input, etc.). Those bytes aren't all printable characters: each byte can take any value from 0-255, while the printable characters in ASCII are in the range 32-126.
DigestUtils.md5(String) generates the MD5 of the string and returns a 16 element byte array. DigestUtils.md5Hex(String) is a convenience wrapper (I'm assuming, I haven't looked at the source, but that's how I'd write it :-) ) around DigestUtils.md5 that takes the 16 element byte array md5 produces and base16 encodes it (also known as hex encoding). That replaces each byte with the equivalent two hex characters, which is why you get a 32 character String out of it.
Crypt.crypt uses a special format that goes back to the original Unix method of storing passwords. It's been extended over the years to use different hash/encryption algorithms, longer salts, and additional features. It also encodes its output as printable text, which is where the length difference comes from. By using a salt of "$1$...", you're saying to use MD5, so the password plus the salt will be hashed with MD5 (strictly, the crypt(3) MD5 scheme runs many rounds of MD5, but the final digest is still 16 bytes). Because those bytes aren't necessarily printable, the hash is base64 encoded (using a slightly different alphabet than standard base64), which replaces 3 bytes with 4 printable characters. So 16 bytes becomes 16 / 3 * 4 = 21⅓ characters, rounded up to 22.
On your aside, DigestUtils.md5 produces 16 bytes, but those bytes can have any value from 0 to 255 and are (effectively) random. new String(byte[], "UTF-8") says the bytes in the byte array are a UTF-8 encoding, which is a very specific format. new String does its best to treat the bytes as a UTF-8 encoded string, but because they're really not, you generally get gibberish out. If you want something printable, you'll have to use something that takes random bytes, not bytes in a specific format (like UTF-8). Two popular options are base16/hex encoding, which you can get with DigestUtils.md5Hex, or base64, which you can get with Base64.encodeBase64String(DigestUtils.md5(pwd + salt)).
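To make the length difference concrete, here is a minimal sketch (the password is a placeholder; the salt string is the one from the question) showing the three output formats side by side with commons-codec:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.digest.Crypt;
import org.apache.commons.codec.digest.DigestUtils;

public class HashFormatDemo {
    public static void main(String[] args) {
        String password = "secret";        // placeholder password
        String cryptSalt = "$1$Jhe937$";   // MD5-crypt salt from the question

        // crypt(3)-style: "$1$Jhe937$" followed by 22 printable characters
        System.out.println(Crypt.crypt(password, cryptSalt));

        // Raw MD5, hex encoded: always 32 characters
        System.out.println(DigestUtils.md5Hex(password + "Jhe937"));

        // Raw MD5, Base64 encoded: 24 characters (22 plus "==" padding)
        System.out.println(Base64.encodeBase64String(DigestUtils.md5(password + "Jhe937")));
    }
}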

Related

Byte array with negative byte values can't be converted to String using UTF-8 [closed]

Consider this byte array,
byte[] by = {2, 126, 33, -66, -100, 4, -39, 108};
then if we execute the below code and print it,
String utf8_str = new String(by, StandardCharsets.UTF_8);
System.out.println(utf8_str);
the output is:
~!���l
All the negative values are converted to '�', which suggests that bytes with negative values are not in the UTF-8 character set.
But the UTF-8 character set has a range of 0 to 255.
If only 0-127 can be represented as positive values in the byte datatype, then numbers greater than 127 can never be used when encoding to the UTF-8 character set, as Java does not support unsigned byte values.
Any solution for this?
I needed to encode a byte array to UTF-8 character String and get the byte array back from the UTF-8 character String.
But all the characters are encoded and retrieved properly except '�'.
When I try to retrieve '�' (i.e., print its UTF-8 code point), it gives some other code point rather than that of the encoded character.
tl;dr: You can't decode arbitrary bytes as UTF-8, because some byte streams are not conforming UTF-8 streams. If you need to represent arbitrary bytes as String, use something like Base64:
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
Not all byte sequences are valid UTF-8
UTF-8 has very specific rules about what byte sequences are allowed. The short version is:
a byte in the range 0x00-0x7F can stand alone (and represents the equivalent character as its ASCII encoding).
a byte in the range 0xC2-0xF4 is a leading byte that starts a multi-byte sequence with the exact value indicating the number of continuation bytes
a byte in the range 0x80-0xBF is a continuation byte that has to come after a leading byte and possibly some other continuation bytes.
There are a few more rules and nuances to it, but that's the basic idea.
As you can see there are several byte values (0xC0, 0xC1, 0xF5-0xFF) that can't appear in a well-formed UTF-8 stream at all. Additionally some other bytes can only occur in specific sequences. For example a leading byte can never be followed by another leading byte or a stand-alone byte. Similarly a stand-alone byte must never be followed by a continuation byte.
Note about "negative values": byte in Java is a signed data type. But the signed/unsigned debate is not relevant for this topic, as it only matters when calculating with the value or when printing it. It's the 8-bit type to use in Java and the fact that the byte 0xBE is represented as -66 in Java is mostly a visual distinction. For the purposes of this discussion "negative values" is equivalent to "byte values between 0x80 and 0xFF". It just so happens that the non-negative values are exactly the stand alone bytes in UTF-8 and are converted just fine.
All this means that decoding an arbitrary byte[] as UTF-8 will not work in most cases!
Then why doesn't new String(...) throw an exception?
But if arbitraryBytes contains a byte[] that isn't valid UTF-8, then why doesn't new String(arbitraryBytes, StandardCharsets.UTF_8) throw an exception?
Good question! Maybe it should, but the designers of Java have decided that this specific way of decoding a byte[] into a String should be lenient:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
The "default replacement string" in this case is simply the Unicode character U+FFFD Replacement Character, which looks like a question mark in a filled rhombus: �
And as the documentation states, there is of course a way to decode a byte[] to a String and get a real exception when it doesn't go right:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
String string = decoder.decode(ByteBuffer.wrap(arbitraryBytes)).toString();
This code will throw an exception:
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:820)
at org.example.Main.main(Main.java:13)
Okay, but I really need a String!
We have realized that decoding your byte[] to a String using UTF-8 doesn't work. One could use ISO-8859-1, which maps all 256 byte values to characters, but that would result in Strings with many unprintable control characters, which would be quite cumbersome to handle.
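If all you need is a lossless round trip and you don't care about readability, ISO-8859-1 does technically work, since every byte value maps to some character; this sketch only illustrates that trade-off, it is not a recommendation:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };

// Every byte value maps to a character in ISO-8859-1, so this round-trips...
String latin1 = new String(arbitraryBytes, StandardCharsets.ISO_8859_1);
byte[] back = latin1.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.equals(arbitraryBytes, back)); // true
// ...but the resulting String is full of unprintable control characters.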
Use Base64
The usual solution for this is to use Base64:
// encode byte[] to Base64
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
System.out.println(base64);
// decode Base64 to byte[]
byte[] decoded = Base64.getDecoder().decode(base64);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
With the same arbitraryBytes as before this will print
An4hvpwE2Ww=
true
Base64 is a common choice because it is able to represent arbitrary bytes with a reasonable number of characters (on average it will take about a third more characters than it has input bytes, depending on the exact formatting and/or padding used).
There are a few variations of Base64, which are used in various situations. Particularly common is the use of the URL- and filename-safe variant, which ensures that no characters with any special meaning in URLs and file names are used. Luckily it is directly supported in Java.
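For example, the URL- and filename-safe variant is available directly from java.util.Base64:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };

// '+' and '/' are replaced by '-' and '_', so the result is safe in URLs and file names
String urlSafe = Base64.getUrlEncoder().withoutPadding().encodeToString(arbitraryBytes);
System.out.println(urlSafe);

byte[] roundTripped = Base64.getUrlDecoder().decode(urlSafe);
System.out.println(Arrays.equals(arbitraryBytes, roundTripped)); // true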
Format as a hex string
Base64 is neat and useful, but it somewhat obfuscates the individual byte values. Occasionally we want a format that allows us to interpret the values in some way. For this a hexadecimal representation of the data might be more useful, even though it takes up more characters than Base64:
// encode byte[] to hex
String hexFormatted = HexFormat.of().formatHex(arbitraryBytes);
System.out.println(hexFormatted);
// decode hex to byte[]
byte[] decoded = HexFormat.of().parseHex(hexFormatted);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
This will print
027e21be9c04d96c
true
This hex format (without separator) will take exactly 2 characters per input byte, making this format more verbose than Base64.
If you're not yet on Java 17 or later, there are plenty of other ways to do this.
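For example, a plain String.format loop (or a library such as commons-codec's Hex class) does the same job on older JDKs; a quick sketch:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };

// Manual hex formatting, works on any Java version
StringBuilder sb = new StringBuilder(arbitraryBytes.length * 2);
for (byte b : arbitraryBytes) {
    sb.append(String.format("%02x", b & 0xFF)); // mask to get the unsigned value 0-255
}
System.out.println(sb); // 027e21be9c04d96c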
But I've already converted my byte[] to String using UTF-8 and I really need my original data back.
Sorry, but you most likely can't. Unless you were very lucky and your original byte[] happened to be a well-formed UTF-8 stream, the conversion to String will have lost some data and you will only be able to recover a fraction of your original byte[].
String badString = new String(arbitraryBytes, StandardCharsets.UTF_8);
byte[] recoveredBytes = badString.getBytes(StandardCharsets.UTF_8);
This will give you something, but every time your input contained an encoding error, the result will contain the byte sequence 0xEF 0xBF 0xBD (or -17 -65 -67, when interpreted as signed bytes and printed in decimal). That byte sequence is how UTF-8 encodes the U+FFFD Replacement Character.
Depending on the specific input (and even the specific implementation of the UTF-8 decoder!) each replacement character can replace one or more bytes, so you can't even reliably tell the size of the original input array like this.

Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

When an I/O stream manages 8-bit bytes of raw binary data, it is
called a byte stream. And, when the I/O stream manages 16-bit Unicode
characters, it is called a character stream.
Byte stream is clear. It uses 8-bit bytes. So if I were to write a character that uses 3 bytes it would only write its last 8 bits! Thus producing incorrect output.
So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store.
Are you seeing the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? In the case where I wrote Ạ, which takes 3 bytes, it didn't cut it down to the last 16 bits the way byte streams cut to the last 8 bits. What does that quote even mean then?
In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();
Then you would get the 3-byte sequence 0xE1 0xBA 0xA0 as expected.
In Java, a character (char) is always 16 bits, as can be seen from its max value, 65535. This is why the quote is not wrong: 16 bits is indeed a character.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane (BMP) is encoded in 16 bits (yes, 16 bits is enough for that). Every code point outside the BMP is encoded with a pair of 16-bit chars, called a surrogate pair.
"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it from Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, i.e., a bunch of 16-bit values. So it really is managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character", which would refer to the high/low surrogates of the surrogate pair that you are printing.
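A small example of the BMP/surrogate distinction (the emoji is just an arbitrary code point outside the BMP):
String bmp = "Ạ";       // U+1EA0, inside the BMP
String outside = "😀";  // U+1F600, outside the BMP

System.out.println(bmp.length());      // 1  -> one 16-bit char
System.out.println(outside.length());  // 2  -> a surrogate pair of two chars
System.out.println(outside.codePointCount(0, outside.length())); // 1 code point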

Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);
    System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)
Java stores all its "chars" internally as two bytes. However, when they become strings etc., the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode; thus, according to the
Java Character Docs
the max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 in binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes into characters. If you know your file contains text that is encoded in a different charset, you can use String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).
In an ASCII text file each character is just one byte
Looks like your file contains ASCII characters, which are encoded in just 1 byte each. If the text file contained a non-ASCII character, e.g. one that is 2 bytes in UTF-8, then you would get just the first byte, not the whole character.
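A quick way to see that effect, using 'é' (2 bytes in UTF-8) as an arbitrary example:
byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8);   // 0xC3 0xA9
System.out.println(utf8.length);                      // 2

// Decoding only the first byte loses the character entirely
System.out.println(new String(utf8, 0, 1, StandardCharsets.UTF_8)); // prints the replacement character �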
There are some great answers here but I wanted to point out the jvm is free to store a char value in any size space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that Java characters are often misused. People don't realize they are writing code that won't properly handle code points that need more than 16 bits.
Java allocates 2 bytes per char because it uses UTF-16. A stored character occupies a minimum of 2 bytes and a maximum of 4 bytes (a surrogate pair); there is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded; in that case you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, the little endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65. After padding with 0 for the second byte, you will get A.
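For instance, here is the byte layout of "A" under a few encodings, which shows why reading single bytes happens to work for ASCII/UTF-8 and why the little-endian UTF-16 case depends on the 65 coming first:
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));      // [65]
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE)));   // [65, 0]
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE)));   // [0, 65]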

Base64 vs HEX for sending binary content over the internet in XML doc

What is the best way of sending binary content between systems inside an XML document?
I know of Base64 and hex; what is the real difference? I am currently using Base64 but need to include an external commons library for it, whereas with hex I think I could just write a function myself.
You could just write your own method for Base64 as well... but I'd generally recommend using external, well-tested libraries for both. (It's not like there's any shortage of them.)
The difference between Base64 and hex is really just how bytes are represented. Hex is another way of saying "Base16". Hex will take two characters for each byte - Base64 takes 4 characters for every 3 bytes, so it's more efficient than hex. Assuming you're using UTF-8 to encode the XML document, a 100K file will take 200K to encode in hex, or 133K in Base64. Of course it may well be that you don't care about the space efficiency - in many cases it won't matter. If it does matter, then clearly Base64 is better on that front. (There are alternatives which are even more efficient, but they're not as common.)
I was curious how on EARTH base64 can convert 3 input bytes into 4 output bytes for just 33% space growth (whereas hex converts 1 input byte into 2 output bytes for 100% space growth). Why specifically 3 input bytes?
The answer is:
3 bytes = 3 x 8 bits = 24 bits.
Why that magic "24 bits" number? Well, base 64 represents the numbers 0 to 63. How are those represented in binary? With 000000 (0) to 111111 (63).
Bingo! Each base64 character represents 6 bits of input data using a single output byte (a single character such as "Z", etc).
So 24 bits (3 full 8-bit bytes of input) / 6 bits (base64 alphabet) = 4 base64 characters. That's it!
Or, described another way, every Base64 character (which is 1 byte, i.e. 8 bits) encodes 6 bits of real data. And if we divide 8 bits / 6 bits we see where the 33% growth comes from, as mentioned at the top of this post... So yes, Base64 always increases data size by 33% (plus some potential padding by the = characters that are sometimes added at the end of the base64 output).
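A quick sanity check in Java, using the classic 3-byte input "Man" as an arbitrary example:
byte[] threeBytes = "Man".getBytes(StandardCharsets.US_ASCII);      // 3 bytes = 24 bits
String encoded = Base64.getEncoder().encodeToString(threeBytes);
System.out.println(encoded);           // TWFu -> 4 characters, one per 6-bit group
System.out.println(encoded.length());  // 4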
You may think "Why not base128 (7 bits of input = 8 bits of output), at just 14% size growth when encoding?". The answer for that is that base64 is the best we can find, since the lower 128 ASCII characters aren't all printable. Many are control characters such as NULL etc.
There are obviously ways to create other systems such as perhaps "base81" etc, since you can do anything you want if you create a custom encoding algorithm. But the beauty of base64 is how it encodes data so cleanly in chunks of 6 bits, and how you simply have to "read 3 bytes and output 4" to encode, and "read 4 bytes and output 3" to decode. So that encoding scheme became popular.
Now you are hopefully wiser after having read this.
Fun Update: Speaking of other encoding styles with more characters... It's come to my attention that Ascii85 aka Base85 exists and is slightly more efficient (25% data size growth when encoding as Base85 instead of 33% for Base64): https://en.wikipedia.org/wiki/Ascii85
There are only two 'real differences':
The radix. Base64 is base-64, surprise, and hex is base-16.
The encoding: base-64 encodes 3 source bytes into 4 base-64 characters (http://en.wikipedia.org/wiki/Base64#Examples); hex encodes 1 byte into 2 hex characters.
So base64 is more compact than hex.
Other answers made clear the efficiency difference between base16 and base64.
There is more to base selection than efficiency.
Base64 uses more than just letters and numbers. Different implementations use different punctuation characters for indicating padding and for making up the last two characters of the set of 64. These can include plus "+" and equals "=", both problematic in HTTP query strings.
So one reason to favour base16 over base64 is that base16 values can be composed directly into HTTP query strings without requiring additional encoding. Is that important to you?
Notice that this is an additional concern, over and above efficiency. Neither base is inherently better or worse; they're just two different points on a scale, at which you'll find different properties that will be more or less attractive in different situations.
For example, consider base32. It's 20% less efficient than base64, but is still suitable for use in HTTP query strings. Most of its inefficiency comes from being case-insensitive and avoiding zero "0" and one "1", to reduce mistakes when humans reproduce the values.
So base32 introduces a new concern: ease of reproduction by humans. Is that a concern for you? If it's not, you could go for something like base62, which is still convenient in HTTP query strings but is case-sensitive and includes zero "0" and one "1".
Hopefully, I've clarified that the selection of your encoding base is a matter of sliding along a scale until you get the best efficiency you can have before sacrificing what's important to you.
Wikipedia has a fun list of numeral systems.
Is size important to you?
Base64 is more space efficient, using 4 characters to represent 3 bytes, whereas hex uses 2 characters for each byte. In other words: hex increases the size of the data by 100%. For small strings that fit as params in URL requests I wouldn't mind the extra cost/size.
Is ease of use important to you?
Hex is easier to use than Base64 because you don't need to escape it (Base64 output may contain +, = and /) when using the string as a GET parameter in URL requests.
Is widespread use important to you?
I don't have the numbers, but Base64 might be more known to the general developer than hex depending on several factors. I knew about base64 long before hex (base16).
Base64 has less overhead (it produces 4 characters for every 3 bytes of original data, while hex produces 2 characters for every byte). Hex is more readable: you can look at two characters and immediately know which byte is behind them, whereas with Base64 you need some effort to decode a 4-character group, so debugging will be easier with hex.

SHA-1 Hashes Mixed with Strings

I have to parse something like the following: "some text <40 byte hash>". Can I read this whole thing into a string without corrupting the 40-byte hash part?
The thing is, the hash is not going to be there, so I don't want to process it while reading.
EDIT: I forgot to mention that the 40-byte hash is 2 x 20-byte hashes: no encoding, raw bytes.
Read it from your input stream as a byte stream, and then strip the String out of the stream like this:
String s = new String(Arrays.copyOfRange(bytes, 0, bytes.length-40));
Then get your bytes as:
byte[] hash = Arrays.copyOfRange(bytes, bytes.length - 40, bytes.length);
SHA-1 hashes are 20 bytes (160 bits) in length. If you are dealing with 40 character hashes, then they are probably an ASCII representation of the hash, and therefore only contain the characters 0-9 and a-f. If this is the case, then you should be able to read and manipulate the strings in Java without any trouble.
Some more details could be useful, but I think the answer is that you should be okay.
You didn't say how the SHA-1 hash was encoded (common possibilities include "none" (the raw bytes), Base64 and hex). Since SHA-1 produces a 20 byte (160 bit) hash, I am guessing that it will be encoded using hex, since that doubles the space needed to the 40 bytes you mentioned. With that encoding, 2 characters will be used to encode each byte from the hash, using the symbols 0 through 9 and A through F. Those are all ASCII characters so you are safe.
Base64 encoding would also work (though probably not what you asked about since it increases the size by about 1/3 leaving you at well less than 40 bytes) as each of the characters used in Base64 are also ASCII.
If the raw bytes were used directly, you would have a problem, as some of the values are not valid characters.
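If the hashes really are hex encoded, you can reproduce that 40-character length with commons-codec (the input string here is arbitrary):
String hex = DigestUtils.sha1Hex("some text");
System.out.println(hex);           // 40 hex characters, all ASCII
System.out.println(hex.length());  // 40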
OK, now that you've clarified that these are raw bytes
No, you cannot read this into Java as a string, you will need to read it as raw bytes.
WORKING CODE:
Converts byte string inputs into hex characters which should be safe in almost all string encodings. Use the code I posted in your other question to decode the hex chars back to raw bytes.
/** Lookup table: character for a half-byte */
static final char[] CHAR_FOR_BYTE = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};

/** Encode byte data as a hex string... hex chars are UPPERCASE */
public static String encode(byte[] data) {
    if (data == null || data.length == 0) {
        return null;
    }
    char[] store = new char[data.length * 2];
    for (int i = 0; i < data.length; i++) {
        final int val = (data[i] & 0xFF);
        final int charLoc = i << 1;
        store[charLoc] = CHAR_FOR_BYTE[val >>> 4];
        store[charLoc + 1] = CHAR_FOR_BYTE[val & 0x0F];
    }
    return new String(store);
}
