byte[] commonsDecode = Base64.decodeBase64(data);
Log.debug("The data is " + commonsDecode.length + " bytes long for the apache commons base64 decoder.");
BASE64Decoder decoder = new BASE64Decoder();
byte[] sunDecode = decoder.decodeBuffer(data);
Log.debug("The data is " + sunDecode.length + " bytes long for the SUN base64 decoder.");
Please explain to me why these two method calls would produce different lengths for the resulting byte arrays. I initially thought it might have to do with character encodings, but if so I don't understand all of the issues properly. The above code was executed on the same system and in the same application, in the order shown above, so the default character encoding on that system would be the same.
The input (test) data:
The below is a System.out.println of the Java String.
qFkIQgDq jk3ScHpqx8BPVS97YE4pP/nBl5Qw7mBnpSGqNqSdGIkLPVod0pBl Uz7NgpizHDicGzNCaauefAdwGklpPr0YdwCu4wRkwyAuvtDmL0BYASOn2tDw72LMz5FChtSa0CoCBQ2ARsFG2GdflnIWsUuBQapX73ZBMiqqm ZCOnMRv9Ol8zT1TECddlKZMYAvmjANgq0sBPyUMF7co XY9BYAjV3L/cA8CGQpXGdrsAgjPKMhzk4hh1GAoQ1soX2Dva8p3erPJ4sy2Vcb6lS1Hap9FR0AZFawbJ10FFSTg10wxc24539kYA6xxq/TFqkhaEoSyTqjXjvo1SA==
Apache commons decoder says it's 252 length byte array.
Java Sun decoder says 256.
The input data is not valid Base64 data.
Valid Base64 data can contain whitespace. Line-wrapped variants such as MIME insert a newline every 76 characters. However, your data contains spaces in random places. If they are removed (as a lenient Base64 decoder would do), 339 characters remain. Yet the length of valid, padded Base64 data has to be a multiple of 4 characters.
Interestingly, your data contains no plus signs. I suspect it once contained them but they have probably been replaced with spaces somewhere in transmission. If you replace all spaces with plus signs, the Base64 data is valid and the decoded data will have a length of 256 bytes: 344 characters / 4 * 3 = 258, minus 2 bytes for the two padding characters.
I further suspect that the Base64 data was used in a URL without proper URL encoding. That's a probable cause for the missing plus signs. Note that standard Base64 encoded data is not URL safe. The plus, slash and equal signs need to be escaped.
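The suspected failure mode can be reproduced in a few lines. The following is only a sketch under assumptions: the class and method names are made up for illustration, it uses java.util.Base64 rather than either decoder from the question, and URLDecoder.decode(String, Charset) requires Java 10 or later. It shows URL decoding turning '+' into ' ', and the repair of replacing spaces with plus signs before decoding.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class PlusRepairSketch {
    // Hypothetical helper: undo the suspected "'+' became ' '" damage, then decode.
    static byte[] decodeRepaired(String damaged) {
        return Base64.getDecoder().decode(damaged.replace(' ', '+'));
    }

    // Simulates the suspected cause: URL/form decoding of a value
    // that was never URL-encoded in the first place.
    static String damage(String base64) {
        return URLDecoder.decode(base64, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // These bytes are chosen so that the Base64 text contains '+' characters.
        byte[] original = {(byte) 0xfb, (byte) 0xef, (byte) 0xbe, 1};
        String encoded = Base64.getEncoder().encodeToString(original);
        String damaged = damage(encoded);
        System.out.println("[" + encoded + "] -> [" + damaged + "]");
        System.out.println(Arrays.equals(original, decodeRepaired(damaged)));
    }
}
```

The repair only works if every space really was a plus sign, so fixing the sender (URL-encode the value, or use the URL-safe alphabet) is the more robust option.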
Consider this is byte array,
byte[] by = {2, 126, 33, -66, -100, 4, -39, 108};
then if we execute the below code and print it,
String utf8_str = new String(by, StandardCharsets.UTF_8);
System.out.println(utf8_str);
the output is:
~!���l
All the negative values are converted to '�', which suggests that a byte with a negative value is not in the UTF-8 character set.
But the UTF-8 character set has a range of 0 to 255.
If only 0-127 can be represented as positive values in the byte datatype, then numbers greater than 127 can never be used when encoding to the UTF-8 character set, as Java does not support unsigned byte values.
Any solution for this?
I need to encode a byte array to a UTF-8 String and get the byte array back from that String.
All the characters are encoded and retrieved properly except '�'.
When I try to retrieve '�' (i.e. print its Unicode value), it gives some other value rather than that of the originally encoded character.
tl;dr: You can't decode arbitrary bytes as UTF-8, because some byte streams are not conforming UTF-8 streams. If you need to represent arbitrary bytes as String, use something like Base64:
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
Not all byte sequences are valid UTF-8
UTF-8 has very specific rules about which byte sequences are allowed. The short version is:
A byte in the range 0x00-0x7F can stand alone (and represents the same character as in ASCII).
A byte in the range 0xC2-0xF4 is a leading byte that starts a multi-byte sequence, with the exact value indicating the number of continuation bytes.
A byte in the range 0x80-0xBF is a continuation byte that has to come after a leading byte and possibly some other continuation bytes.
There are a few more rules and nuances to it, but that's the basic idea.
As you can see there are several byte values (0xC0, 0xC1, 0xF5-0xFF) that can't appear in a well-formed UTF-8 stream at all. Additionally some other bytes can only occur in specific sequences. For example a leading byte can never be followed by another leading byte or a stand-alone byte. Similarly a stand-alone byte must never be followed by a continuation byte.
Note about "negative values": byte in Java is a signed data type. But the signed/unsigned debate is not relevant for this topic, as it only matters when calculating with the value or when printing it. It's the 8-bit type to use in Java and the fact that the byte 0xBE is represented as -66 in Java is mostly a visual distinction. For the purposes of this discussion "negative values" is equivalent to "byte values between 0x80 and 0xFF". It just so happens that the non-negative values are exactly the stand-alone bytes in UTF-8 and are converted just fine.
All this means that decoding an arbitrary byte[] as UTF-8 will not work in most cases!
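To make the byte ranges above concrete, here is a small sketch that classifies each byte of the array from the question. The class, enum and method names are invented for this illustration, and it only looks at single bytes, so it does not implement the additional sequence rules mentioned above.

```java
public class Utf8ByteClasses {
    enum Kind { STAND_ALONE, CONTINUATION, LEADING, NEVER_VALID }

    // Classifies a single byte by the ranges listed above.
    static Kind classify(byte b) {
        int v = b & 0xFF; // view the signed Java byte as 0..255
        if (v <= 0x7F) return Kind.STAND_ALONE;
        if (v <= 0xBF) return Kind.CONTINUATION;
        if (v >= 0xC2 && v <= 0xF4) return Kind.LEADING;
        return Kind.NEVER_VALID; // 0xC0, 0xC1 and 0xF5..0xFF
    }

    public static void main(String[] args) {
        byte[] sample = {2, 126, 33, -66, -100, 4, -39, 108};
        for (byte b : sample) {
            System.out.printf("%4d (0x%02X) -> %s%n", b, b & 0xFF, classify(b));
        }
    }
}
```

In the sample, -66 and -100 are continuation bytes with no leading byte before them, and -39 is a leading byte followed by a stand-alone byte, which is why the question's output shows three replacement characters.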
Then why doesn't new String(...) throw an exception?
But if arbitraryBytes contains a byte[] that isn't valid UTF-8, then why doesn't new String(arbitraryBytes, StandardCharsets.UTF_8) throw an exception?
Good question! Maybe it should, but the designers of Java have decided that this specific way of decoding a byte[] into a String should be lenient:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
The "default replacement string" in this case is simply the Unicode character U+FFFD Replacement Character, which looks like a question mark in a filled rhombus: �
And as the documentation states, there is of course a way to decode a byte[] to a String and get a real exception when it doesn't go right:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
String string = decoder.decode(ByteBuffer.wrap(arbitraryBytes)).toString();
This code will throw an exception:
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:820)
at org.example.Main.main(Main.java:13)
Okay, but I really need a String!
We have realized that decoding your byte[] to a String using UTF-8 doesn't work. One could use ISO-8859-1, which maps all 256 byte values to characters, but that would result in Strings with many unprintable control characters, which would be quite cumbersome to handle.
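For completeness, the ISO-8859-1 claim can be checked with a short sketch (class and method names are made up for this example). It round-trips all 256 byte values through a String, which works because that charset maps each byte to exactly one character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Returns true if the bytes survive a trip through an ISO-8859-1 String unchanged.
    static boolean roundTrips(byte[] data) {
        String asText = new String(data, StandardCharsets.ISO_8859_1);
        return Arrays.equals(data, asText.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] everyValue = new byte[256];
        for (int i = 0; i < 256; i++) {
            everyValue[i] = (byte) i;
        }
        System.out.println(roundTrips(everyValue));
    }
}
```

The resulting String is lossless but still full of control characters, so it remains a poor fit for logs, files or anything a human has to read.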
Use Base64
The usual solution for this is to use Base64:
// encode byte[] to Base64
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
System.out.println(base64);
// decode Base64 to byte[]
byte[] decoded = Base64.getDecoder().decode(base64);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
With the same arbitraryBytes as before this will print
An4hvpwE2Ww=
true
Base64 is a common choice because it is able to represent arbitrary bytes with a reasonable number of characters (on average it will take about a third more characters than it has input bytes, depending on the exact formatting and/or padding used).
There are a few variations of Base64, which are used in various situations. Particularly common is the use of the URL- and filename-safe variant, which ensures that no characters with any special meaning in URLs and file names are used. Luckily it is directly supported in Java.
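As a sketch of that built-in support (the class and method names here are illustrative only), java.util.Base64 offers getUrlEncoder() and getUrlDecoder() next to the standard pair. The sample bytes are chosen so that the standard alphabet produces '+' and '/':

```java
import java.util.Arrays;
import java.util.Base64;

public class UrlSafeBase64Sketch {
    static String standard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // URL- and filename-safe alphabet: '-' instead of '+', '_' instead of '/'.
    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    static byte[] fromUrlSafe(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xfb, (byte) 0xef, (byte) 0xff};
        System.out.println(standard(data)); // ++//
        System.out.println(urlSafe(data));  // --__
        System.out.println(Arrays.equals(data, fromUrlSafe(urlSafe(data))));
    }
}
```

withoutPadding() is optional. It drops the trailing '=' characters, which would otherwise still need escaping in a URL.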
Format as a hex string
Base64 is neat and useful, but it somewhat obfuscates the individual byte values. Occasionally we want a format that allows us to interpret the values in some way. For this a hexadecimal representation of the data might be more useful, even though it takes up more characters than Base64:
// encode byte[] to hex
String hexFormatted = HexFormat.of().formatHex(arbitraryBytes);
System.out.println(hexFormatted);
// decode hex to byte[]
byte[] decoded = HexFormat.of().parseHex(hexFormatted);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
This will print
027e21be9c04d96c
true
This hex format (without separator) will take exactly 2 characters per input byte, making this format more verbose than Base64.
If you're not yet on Java 17 or later, there are plenty of other ways to do this.
But I've already converted my byte[] to String using UTF-8 and I really need my original data back.
Sorry, but you most likely can't. Unless you were very lucky and your original byte[] happened to be a well-formed UTF-8 stream, the conversion to String will have lost some data and you will only be able to recover a fraction of your original byte[].
String badString = new String(arbitraryBytes, StandardCharsets.UTF_8);
byte[] recoveredBytes = badString.getBytes(StandardCharsets.UTF_8);
This will give you something, but every time your input contained an encoding error, the result will contain the byte sequence 0xEF 0xBF 0xBD (or -17 -65 -67, when interpreted as signed bytes and printed in decimal). That byte sequence is how UTF-8 encodes the U+FFFD Replacement Character.
Depending on the specific input (and even the specific implementation of the UTF-8 decoder!) each replacement character can replace one or more bytes, so you can't even reliably tell the size of the original input array like this.
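A quick way to see the loss is to compare the input with what comes back. This is a minimal sketch with an invented class and method name. The exact bytes printed can depend on the decoder implementation, so no specific output is claimed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Decodes the bytes leniently as UTF-8 and encodes the resulting String again.
    static byte[] throughUtf8String(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = {2, 126, 33, -66, -100, 4, -39, 108};
        byte[] recovered = throughUtf8String(arbitraryBytes);
        System.out.println(Arrays.toString(arbitraryBytes));
        System.out.println(Arrays.toString(recovered));
        System.out.println("identical: " + Arrays.equals(arbitraryBytes, recovered));
    }
}
```

Input that already is well-formed UTF-8 survives the same trip unchanged, which is why this kind of bug often goes unnoticed in tests that only use text.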
I am writing a basic password cracker for the MD5 hashing scheme against a Linux /etc/shadow file. When I use commons.codec's DigestUtils or Crypt libraries, the hash lengths they produce are different (among other things).
When I use Crypt.crypt(passwordToHash, "$1$Jhe937$") the output is a 22-character string. When I use DigestUtils.md5[Hex](passwordToHash + "Jhe937") (or the Java MessageDigest class) the output is a 32-character string (after conversion). This makes no sense to me.
Aside: is there no easy way to convert the byte[] from DigestUtils.md5(passwordToHash) to a String? I've tried all* the ways and I only get invalid output: Nz_èJÓ_µù[î¬y
*all being: new String(byte[], "UTF-8"), and converting to char and then to String
The executive summary is that both are based on MD5 and produce a 16-byte result, but the output format is different between the two, so the lengths will be different. Read on for details.
MD5 is a message digesting algorithm that produces a 16 byte hash value, always (assuming valid input, etc.) Those bytes aren't all printable characters, they can take any value from 0-255 for any of the bytes, while the printable characters in ASCII are in the range 32-126.
DigestUtils.md5(String) generates the MD5 of the string and returns a 16 element byte array. DigestUtils.md5Hex(String) is a convenience wrapper (I'm assuming, I haven't looked at the source, but that's how I'd write it :-) ) around DigestUtils.md5 that takes the 16 element byte array md5 produces and base16 encodes it (also known as hex encoding). That replaces each byte with the equivalent two hex characters, which is why you get a 32 character String out of it.
Crypt.crypt uses a special format that goes back to the original Unix method of storing passwords. It's been extended over the years to use different hash/encryption algorithms, longer salts, and additional features. It also encodes its output to be printable text, which is where the length difference is coming from. By using a salt of "$1$...", you're saying to use the MD5-based scheme, so the password and the salt are run through an iterated MD5 construction (not a single MD5 of password + salt, so the value itself will not match DigestUtils.md5 either), resulting in 16 bytes as expected. Because those bytes aren't necessarily printable, the hash is base64 encoded (using a slightly different alphabet than the standard base64 encoding), which replaces 3 bytes with 4 printable characters. So 16 bytes becomes 16 / 3 * 4 = 21-1/3 characters, rounded up to 22.
On your aside, DigestUtils.md5 produces 16 bytes, but those bytes can have any value from 0 to 255 and are (effectively) random. new String(byte[], "UTF-8") says the bytes in the byte array are a UTF-8 encoding, which is a very specific format. new String does its best to treat the bytes as a UTF-8 encoded string, but because they're really not, you generally get gibberish out. If you want something printable, you'll have to use something that takes random bytes, not bytes in a specific format (like UTF-8). Two popular options are base16/hex encoding, which you can get with DigestUtils.md5Hex, or base64, which you can get with Base64.encodeBase64String(DigestUtils.md5(pwd + salt)).
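The length arithmetic can be checked with the JDK alone. This sketch uses MessageDigest and java.util.Base64 instead of commons-codec, and the class and helper names are assumptions made for the example. It does not reproduce the value that Crypt.crypt computes, only the 16 / 32 / 22 lengths discussed above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class Md5Lengths {
    // Plain MD5 of the UTF-8 bytes of the input: always 16 bytes.
    static byte[] md5(String input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Hex: two characters per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Base64 without '=' padding: four characters per three bytes, rounded up.
    static String base64NoPadding(byte[] bytes) {
        return Base64.getEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] digest = md5("password" + "Jhe937");
        System.out.println(digest.length);                  // 16
        System.out.println(hex(digest).length());           // 32
        System.out.println(base64NoPadding(digest).length()); // 22
    }
}
```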
I need to calculate the length of base64 decoded data.
I have Base64 data, and I am sending the decoded data as the body of an HTTP response (typo: I meant request, but same idea).
I need to send a Content-Length header.
In the interest of memory usage and performance I'd rather not actually Base-64 decode the data all at once, but rather stream it.
Given Base64 data, how do I calculate what the length of the decoded data will be? I need either a general algorithm, or a Java/Scala solution.
EDIT: This is similar to, but not a duplicate of Calculate actual data size from Base64 encoded string length, where the OP asks
...can I calculate the length of the raw data that has been encoded only by looking at the length of the Base64-encoded string?
The answer is no. It is necessary to look at the padding as well.
I want to know how the length and the base64 data can be used to calculate the original length.
Assuming that you can't just use chunked encoding (and thereby avoid sending a Content-Length header), you need to consult the padding thus:
Base64 encodes three binary octets into four characters. You have N Base64 characters. Let k be the number of trailing '=' chars (i.e. padding chars: 0, 1 or 2).
Let M = 3*floor((N-k)/4), i.e. the number of octets in "complete" 3-octet chunks.
If you have 2 padding chars then you have M + 1 bytes.
If you have 1 padding char then you have M + 2 bytes.
If you have 0 padding chars then you have M bytes.
Of course, floor() in this case means truncating integer division, i.e. the normal / operator.
Presumably you can count the padding characters relatively easily (e.g. by seeking to the end of a file, or by looking at the end of a byte array), without having to read the whole Base64-encoded thing sequentially.
I arrived at this simple calculation.
If L is the length of the Base-64 encoded data, and p is the number of padding characters (which will be 0, 1, or 2), then the length of the unencoded data is
L * 3 / 4 - p
In my case (with Scala),
bytes.length * 3 / 4 - bytes.reverseIterator.takeWhile(_ == '=').length
NOTE: This assumes that the data does not have line separators. (Often, Base64 data will have a newline every 64 or 76 characters.) If it does, exclude the line separators from the length L.
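In Java the same calculation might look like the sketch below. The class and method names are hypothetical, and it assumes padded Base64 without line separators or other whitespace, as stated in the note above. It only needs the total length and the last two characters, so the data itself can still be streamed.

```java
import java.util.Base64;

public class Base64DecodedLength {
    // Counts trailing '=' characters (0, 1 or 2).
    static int countPadding(CharSequence encoded) {
        int n = encoded.length();
        int p = 0;
        while (p < 2 && p < n && encoded.charAt(n - 1 - p) == '=') {
            p++;
        }
        return p;
    }

    // L / 4 * 3 - p, assuming L is a multiple of 4 (padded, no line breaks).
    static long decodedLength(long encodedLength, int paddingChars) {
        return encodedLength / 4 * 3 - paddingChars;
    }

    public static void main(String[] args) {
        // Compare the formula against actually decoding, for a few input sizes.
        for (int size = 0; size <= 6; size++) {
            String encoded = Base64.getEncoder().encodeToString(new byte[size]);
            long predicted = decodedLength(encoded.length(), countPadding(encoded));
            System.out.println(size + " bytes -> \"" + encoded + "\" -> predicted " + predicted);
        }
    }
}
```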
I am trying to convert a byte[] to a Base64 string using
org.apache.commons.codec.binary.Base64. My Java code looks like this:
base64String = Base64.encodeBase64URLSafeString(myByteArray);
But what I see is some invalid characters in the generated Base64 string.
Why do I see these ____ lines in my generated base64 String?
Is it a valid string?
Note that the length of the generated string is divisible by four.
Have you tried the encodeBase64String method instead of encodeBase64URLSafeString?
encodeBase64URLSafeString:
Encodes binary data using a URL-safe variation of the base64 algorithm
but does not chunk the output. The url-safe variation emits - and _
instead of + and / characters.
encodeBase64String:
Encodes binary data using the base64 algorithm but does not chunk the
output. NOTE: We changed the behaviour of this method from multi-line
chunking (commons-codec-1.4) to single-line non-chunking
(commons-codec-1.5).
Source: org.apache.commons.codec.binary.Base64 Javadoc
Use
String sCert = javax.xml.bind.DatatypeConverter.printBase64Binary(pk);
This may be helpful
sun.misc.BASE64Encoder encoder = new sun.misc.BASE64Encoder();
encoder.encode(byteArray);
You've got (at least) two flavours of Base64: the original one using '+' and '/' in addition to alphanumeric characters, and the "URL-safe" one, using '-' and '_' so that the content can be enclosed in a URL (or used as a filename, btw).
It looks like you're using a Base64 encoder that has been turned into "URL-safe mode".
See Apache's Javadoc for encodeBase64URLSafeString().
Oh, and '_' being the last character of the base64 alphabet, seeing strings of "____" means you've been encoding chunks of 0xffffff ... , just like seeing "AAAAAA" means there's a lot of consecutive zeroes.
If you want to be convinced that it's a normal case, just pick your favourite hexadecimal dumper/editor and check what your binary input looked like.
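The same check can be done in code. This sketch uses java.util.Base64 rather than commons-codec (an assumption made to keep the example free of dependencies; the class and method names are invented) and shows what runs of 0xff and 0x00 bytes look like in both alphabets.

```java
import java.util.Base64;

public class AlphabetEdges {
    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }

    static String standard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] allOnes = {(byte) 0xff, (byte) 0xff, (byte) 0xff};
        byte[] allZeros = new byte[3];
        System.out.println(urlSafe(allOnes));   // ____  (index 63 in the URL-safe alphabet)
        System.out.println(standard(allOnes));  // ////  (index 63 in the standard alphabet)
        System.out.println(urlSafe(allZeros));  // AAAA  (index 0 in both alphabets)
    }
}
```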
With the code below you can convert a byte array to Base64 and back:
// Convert a byte array to base64 string
byte[] bytearray= new byte[]{0x12, 0x23};
String s = new sun.misc.BASE64Encoder().encode(bytearray);
// Convert base64 string to a byte array
bytearray = new sun.misc.BASE64Decoder().decodeBuffer(s);
Edit:
See also the Apache Commons Codec guide.
SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
byte[] salt = new byte[16];
random.nextBytes(salt);
I would like to convert salt to a string to store/read. I don't seem to be able to get this to work. I have read that I need to use the right encoding but I'm not sure what encoding to use. I have tried the following but get junk:
String s = new String(salt, "UTF-8");
String s = new String(salt, "UTF-16");
String s = new String(salt);
Edit: For context, I'm trying to work through and understand this code. I'm trying to view the salt and password so I can monkey with the code.
You need to use the Base64 class (Apache Commons) or sun.misc.BASE64Encoder/BASE64Decoder to encode the byte array.
Like AVD says, the solution is to use Base64 encoding or some other binary-as-text encoding. (For example, Hex encoding.)
Why? Because binary data is not text!
What you are currently doing is telling the String constructor that the bytes are text that has been correctly encoded as UTF-8 or UTF-16 or (in the last case) the platform's default encoding. This is patently false. The "junk" you are seeing is what you get if you attempt to decode random binary stuff as text.
Worse still, the decoding process is probably lossy when you apply it to random binary data. For instance, some sequences of bytes are simply invalid if you try to treat them as UTF-8. (The spec for UTF-8 says so!) When the UTF-8 decoder sees one of these invalid sequences, it replaces it with a character (such as a '?') that means "invalid character". If you then turn the characters in the string back into bytes, you will get a different byte sequence to the one that you started with. That's probably a disaster for your use-case.
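For the salt from the question, a binary-as-text round trip could look like this sketch. It uses java.util.Base64 (available since Java 8) instead of the classes named in the answers, and the class and method names are made up for illustration.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

public class SaltAsText {
    // Binary salt -> printable text that is safe to store or log.
    static String toText(byte[] salt) {
        return Base64.getEncoder().encodeToString(salt);
    }

    // Printable text -> the exact original bytes.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);

        String stored = toText(salt);
        System.out.println(stored); // 24 printable characters for 16 bytes
        System.out.println(Arrays.equals(salt, fromText(stored)));
    }
}
```

Unlike new String(salt, "UTF-8"), this conversion is reversible for every possible salt value.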