Encode byte[] as String - java

Heyho,
I want to convert byte data, which can be anything, to a String. My question is, whether it is "secure" to encode the byte data with UTF-8 for example:
String s1 = new String(data, "UTF-8");
or by using base64:
String s2 = Base64.encodeToString(data, false); //migbase64
I'm just afraid that using the first method has negative side effects. I mean both variants work perfectly, but s1 can contain any character of the UTF-8 charset, while s2 only uses "readable" characters. I'm just not sure whether it's really necessary to use base64. Basically I just need to create a String, send it over the network, and receive it again. (There is no other way in my situation :/)
The question is only about negative side effects, not if it's possible!

You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)
You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.
Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.
Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
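To see the difference concretely, here is a minimal sketch using the JDK's java.util.Base64 as a stand-in for migbase64 (the class and method names are just illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class RoundTripDemo {
    // Encode arbitrary bytes as text and back; Base64 is lossless by design.
    static byte[] viaBase64(byte[] data) {
        String s = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(s);
    }

    // Decoding arbitrary bytes as UTF-8 replaces invalid sequences, losing data.
    static byte[] viaUtf8(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0x80, (byte) 0xC0, 0x41}; // not a valid UTF-8 sequence
        System.out.println(Arrays.equals(data, viaBase64(data))); // true
        System.out.println(Arrays.equals(data, viaUtf8(data)));   // false
    }
}
```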

You can store bytes in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes; a faster and more efficient way is to use ISO-8859-1 encoding or plain 8-bit. The simplest way to do this is to use
String s1 = new String(data, 0); // deprecated "hibyte" constructor, treats data as plain 8-bit
or
String s1 = new String(data, "ISO-8859-1");
From UTF-8 on Wikipedia: as Jon Skeet notes, these overlong encodings are not valid under the standard, and their behaviour in Java varies. DataInputStream treats the first three forms below as the same character, while the next two make it throw an exception. The Charset decoder silently treats them as separate characters.
00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0
This means if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.
BTW DataOutputStream is not aware of code points, so it writes characters above the BMP as a UTF-16 surrogate pair, with each surrogate then UTF-8 encoded.
0xFE and 0xFF are never valid anywhere in UTF-8. Bytes of the form 11xxxxxx (0xC0 and above) can only appear at the start of a multi-byte character, never inside one.
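A quick sketch of the overlong-NUL point, as the standard Charset decoder behaves in current JDKs (it rejects the two-byte form; OverlongDemo is just an illustrative name):

```java
import java.nio.charset.StandardCharsets;

public class OverlongDemo {
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String plain = decode(new byte[] {0});                           // 1-byte NUL
        String overlong = decode(new byte[] {(byte) 0xC0, (byte) 0x80}); // 2-byte "modified UTF-8" NUL
        System.out.println(plain.equals("\0"));          // true
        // The strict Charset decoder does not accept the overlong form;
        // it substitutes replacement characters instead of \0:
        System.out.println(overlong.contains("\uFFFD")); // true
    }
}
```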

Confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all byte values; ISO-8859-1 does preserve all byte values. But if the encoded bytes are to be transported beyond the JVM, use Base64.
import static org.junit.Assert.*;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.codec.binary.Base64;
import org.junit.Test;

@Test
public void testBase64() {
    final byte[] original = enumerate();
    final String encoded = Base64.encodeBase64String( original );
    final byte[] decoded = Base64.decodeBase64( encoded );
    assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testIso8859() {
    final byte[] original = enumerate();
    final String s = new String( original, StandardCharsets.ISO_8859_1 );
    final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
    assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf16() {
    final byte[] original = enumerate();
    final String s = new String( original, StandardCharsets.UTF_16 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
    assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf8() {
    final byte[] original = enumerate();
    final String s = new String( original, StandardCharsets.UTF_8 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
    assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testEnumerate() {
    final Set<Byte> byteSet = new HashSet<>();
    final byte[] bytes = enumerate();
    for ( byte b : bytes ) {
        byteSet.add( b );
    }
    assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
}

/**
 * Enumerates all the byte values.
 */
private byte[] enumerate() {
    final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1; // 256
    final byte[] bytes = new byte[length];
    for ( int i = 0; i < length; i++ ) {
        bytes[i] = (byte)(i + Byte.MIN_VALUE);
    }
    return bytes;
}

Related

Byte array is a valid UTF8 encoded String in Java but not in Python

When I run the following in Python 2.7.6, I get an exception:
import base64

some_bytes = b"\x80\x02\x03"
print("base 64 of the bytes:")
print(base64.b64encode(some_bytes))
try:
    print(some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
The output:
base 64 of the bytes:
gAID
'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
When I try it in Java 8 (HotSpot 1.8.0_74), using this code:
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
I don't get any exception.
How so? Why is the same byte array valid in Java and invalid in Python, using UTF-8 decoding?
This is because the String constructor in Java just doesn't throw exceptions for invalid input. See the documentation:
public String(byte[] bytes, Charset charset)
... This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
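A minimal sketch of both behaviours on the bytes from the question; the strict CharsetDecoder configuration mirrors what Python does by default (class and method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Lenient, like the String constructor: bad input is replaced silently.
    static String lenient(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Strict, like Python's decode(): bad input throws an exception.
    static String strict(byte[] b) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(b)).toString();
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0x80, 0x02, 0x03}; // same bytes as the question
        System.out.println(lenient(bad).charAt(0) == '\uFFFD'); // true: replacement char
        try {
            strict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder throws: " + e);
        }
    }
}
```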
It's not valid UTF8. https://en.wikipedia.org/wiki/UTF-8
Bytes between 0x80 and 0xBF cannot be the first byte of a multi-byte character. They can only be the second byte or later.
Java replaces bytes which cannot be decoded with a ? rather than throw an exception.
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
This is wrong if you decode the Base64-encoded bytes themselves: the ASCII bytes of the text gAID are perfectly valid UTF-8. It is the bytes they encode that are invalid:
import base64

some_bytes = b"\x80\x02\x03"
print("base 64 of the bytes:")
print(base64.b64encode(some_bytes))
# keep the Base64-encoded bytes
some_bytes = base64.b64encode(some_bytes)
encoded_bytes = [hex(ord(c)) for c in some_bytes]
print("encoded bytes: ")
print(encoded_bytes)
try:
    print(some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
output
gAID
['0x67', '0x41', '0x49', '0x44']
gAID
In Java you try to create a String from the Base64-decoded bytes using the UTF-8 charset, which results (as already answered) in the default replacement character �.
Running following snippet
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
System.out.println("base 64 of the bytes:");
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
produce following output
base 64 of the bytes:
x80 x02 x03
?
There you can see the same bytes you are using in the Python snippet. The bytes which in Python lead to 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte lead here to a ? (which stands for the default replacement character on a non-Unicode console).
The following snippet uses the bytes of the string gAID to construct a String with the UTF-8 character set.
byte[] bytes = "gAID".getBytes(StandardCharsets.ISO_8859_1);
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
output
x67 x41 x49 x44
gAID

Conversion of UTF-8 char to ISO-8859-1

I have tried to convert UTF-8 chars to ISO-8859-1, but some characters (like 0x84 and 0x96) are not converted to ISO-8859-1. See the Java code below:
static byte[] encode(byte[] arr) throws CharacterCodingException {
    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");
    ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
    // decode UTF-8
    CharBuffer data = utf8charset.decode(inputBuffer);
    // encode ISO-8859-1
    ByteBuffer outputBuffer = iso88591charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .encode(data);
    data = iso88591charset.decode(outputBuffer);
    byte[] outputData = outputBuffer.array();
    return outputData;
}
Please help to resolve it.
Thanks.
First you might use StandardCharsets.UTF_8 and StandardCharsets.ISO_8859_1.
However, better replace "ISO-8859-1" with "Windows-1252".
The reason is that browsers and others interpret an indication of ISO-8859-1 (Latin-1) as Windows-1252 (Windows Latin-1). In Windows Latin-1 the range 0x80 - 0x9f is used for comma-like quotes and such.
So with a bit of luck (I do not think you meant browsers) this will work.
BTW in browsers this will even work on the Mac, and is official since HTML5.
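A small sketch of the difference on the two bytes from the question, using charset names the JDK supports (Cp1252Demo is an illustrative name):

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x84, (byte) 0x96};
        // ISO-8859-1 maps 0x80-0x9F to invisible C1 control characters...
        System.out.println((int) decode(bytes, "ISO-8859-1").charAt(0)); // 132, a control char
        // ...while Windows-1252 maps them to printable punctuation:
        // U+201E (low double quote) and U+2013 (en dash).
        System.out.println(decode(bytes, "windows-1252"));
    }
}
```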
Give the following a go,
String str = new String(utf8Bytes, "UTF-8");
byte[] isoBytes = str.getBytes( "ISO-8859-1" );
If it gives you exactly the same result, then you have characters that do not map between those character sets.
My guess is that when you say "0x84, 0x96" you mean bytes in the byte array.
If that is the case, you are taking those bytes and try to interpret them as UTF-8, but
that sequence of bytes is not a valid UTF-8 sequence.
from U+0000 to U+007F : 1 byte : 0xxxxxxx
from U+0080 to U+07FF : 2 bytes : 110xxxxx 10xxxxxx
from U+0800 to U+FFFF : 3 bytes : 1110xxxx 10xxxxxx 10xxxxxx
from U+10000 to U+1FFFFF : 4 bytes : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Since 0x84 0x96 is 10000100 10010110 in binary, it does not match the bit patterns above
(note the 0... or 11... in a lead byte, never 10..., which marks a "trailing byte").
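The bit patterns above can be checked mechanically. The following sketch (names are illustrative) classifies a single byte by its high bits:

```java
public class Utf8ByteClass {
    // Classify a single byte according to the UTF-8 bit patterns above.
    static String classify(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return "single (0xxxxxxx)";
        if (v < 0xC0) return "trailing (10xxxxxx)";
        if (v < 0xE0) return "lead of 2 (110xxxxx)";
        if (v < 0xF0) return "lead of 3 (1110xxxx)";
        if (v < 0xF8) return "lead of 4 (11110xxx)";
        return "invalid";
    }

    public static void main(String[] args) {
        // Both bytes from the question are trailing bytes, so a sequence
        // starting with either of them cannot be valid UTF-8.
        System.out.println(classify((byte) 0x84)); // trailing (10xxxxxx)
        System.out.println(classify((byte) 0x96)); // trailing (10xxxxxx)
    }
}
```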

Decoding multibyte UTF8 symbols with charset decoder in byte-by-byte manner?

I am trying to decode UTF8 byte by byte with charset decoder. Is this possible?
The following code
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;

    byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8
    byte[] b = new byte[1];
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();
    b[0] = source[0];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());

    b[0] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
}
gives the following output.
UNDERFLOW
1
MALFORMED[1]
1
Why?
My theory is that the problem is that in the "underflow" condition, the decoder leaves the unconsumed bytes in the input buffer. At least, that is my reading.
Note this sentence in the javadoc:
"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "
But you are clobbering the (presumably) unread byte.
You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in bb after the first decode(...) call.
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
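A sketch of that approach: feed one byte at a time, and preserve unconsumed bytes with compact() instead of clobbering them with rewind() (buffer sizes and names are arbitrary):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ByteByByte {
    static String decode(byte[] source) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bb = ByteBuffer.allocate(8);
        CharBuffer cb = CharBuffer.allocate(source.length);
        for (int i = 0; i < source.length; i++) {
            bb.put(source[i]);
            bb.flip();                                    // switch to reading
            decoder.decode(bb, cb, i == source.length - 1);
            bb.compact();                                 // keep unconsumed bytes
        }
        decoder.flush(cb);
        cb.flip();
        return cb.toString();
    }

    public static void main(String[] args) {
        byte[] ae = {(byte) 0xC3, (byte) 0xA6}; // LATIN SMALL LETTER AE in UTF-8
        System.out.println(decode(ae));         // æ
    }
}
```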
As has been said, UTF-8 uses 1 to 4 bytes per character, so you need to add all the bytes to the ByteBuffer before you decode. Try this:
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;

    byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8
    byte[] b = new byte[2]; // two bytes for this char
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();
    b[0] = source[0];
    b[1] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false); // translates 2 bytes to 1 char
    System.out.println(cb.remaining()); // prints 0
    System.out.println(cb.get(0)); // prints latin ae
}

Cannot output correct hash in Java. What is wrong?

In my Android app I have a SHA256 hash which I must further hash with the RIPEMD160 message digest algorithm.
I can output the correct sha256 and ripemd160 hash of any string, but when I try to hash the sha256 hash with ripemd160 I get a hash which is incorrect.
According to online hash calculators, the SHA256 value of the string 'test'(all lowercase) is:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
And the RIPEMD160 value of the string 'test' is:
5e52fee47e6b070565f74372468cdc699de89107
The value from hashing the resulting sha256 hash with ripemd160 according to online calcs is:
4efc1c36d3349189fb3486d2914f56e05d3e66f8
And the one my app gives me is:
cebaa98c19807134434d107b0d3e5692a516ea66
which is obviously wrong.
Here is my code:
public static String toRIPEMD160(String in)
{
    byte[] addr = in.getBytes();
    byte[] out = new byte[20];
    RIPEMD160Digest digest = new RIPEMD160Digest();
    byte[] sha256 = sha256(addr);
    digest.update(sha256, 0, sha256.length);
    digest.doFinal(out, 0);
    return getHexString(out);
}

public static byte[] sha256(byte[] data)
{
    byte[] sha256 = new byte[32];
    try
    {
        sha256 = MessageDigest.getInstance("SHA-256").digest(data);
    }
    catch (NoSuchAlgorithmException e)
    {}
    return sha256;
}
For the ripemd160 algorithm you need Bouncy Castle, and java.security.MessageDigest for sha256.
Your "online calculator" result is the result of hashing the bytes of the string "test" with SHA-256, converting the result of that hash to a hex string, then taking the bytes corresponding to the ASCII characters of that hex string and hashing those a second time. This is very different from your Java code, which passes the bytes that come out of the first hash directly to the second one, without printing them as hex and turning those characters back into bytes in between. The single byte with value 254 (decimal) becomes "fe" in hex, which becomes the two-byte sequence [0x66, 0x65] when converted back to bytes.
Your hash is working fine. The problem is that the online calculators that you're using are treating your input:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
as a string instead of an array of bytes. In other words, it's treating each character as a byte instead of parsing character pairs as bytes in hexadecimal. If I give this as a string to online calculators, I indeed get exactly what you got:
4efc1c36d3349189fb3486d2914f56e05d3e66f8
However, you're treating the output as an array of bytes instead of a String and that's giving you different results. You should encode your raw SHA256 hash as a string, then pass the encoded string to the hash function. I see you have a getHexString method, so we'll just use that.
public static String toRIPEMD160(String in) {
    try {
        byte[] addr = in.getBytes();
        byte[] out = new byte[20];
        RIPEMD160Digest digest = new RIPEMD160Digest();
        // These are the lines that changed
        byte[] rawSha256 = sha256(addr);
        String encodedSha256 = getHexString(rawSha256);
        byte[] strBytes = encodedSha256.getBytes("UTF-8");
        digest.update(strBytes, 0, strBytes.length);
        digest.doFinal(out, 0);
        return getHexString(out);
    } catch (UnsupportedEncodingException ex) {
        // Never happens, everything supports UTF-8
        return null;
    }
}
If you want to know it's working, take the value of encodedSha256 and put that into an online hash calculator. As long as the calculator uses UTF-8 encoding to turn the string into a byte array, it will match your output.
To get a printable version of the byte[] digest, use this code:
StringBuffer hexString = new StringBuffer();
for (int i = 0; i < out.length; i++) {
    hexString.append( String.format("%02x", 0xFF & out[i]) );
}
and then call hexString.toString();

Java String to SHA1

I'm trying to make a simple String to SHA1 converter in Java and this is what I've got...
public static String toSHA1(byte[] convertme) {
    MessageDigest md = null;
    try {
        md = MessageDigest.getInstance("SHA-1");
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
    }
    return new String(md.digest(convertme));
}
When I pass it toSHA1("password".getBytes()), I get [�a�ɹ??�%l�3~��. I know it's probably a simple encoding fix like UTF-8, but could someone tell me what I should do to get what I want which is 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8? Or am I doing this completely wrong?
UPDATE
You can use Apache Commons Codec (version 1.7+) to do this job for you.
DigestUtils.sha1Hex(stringToConvertToSHexRepresentation)
Thanks to @Jon Onstott for this suggestion.
Old Answer
Convert your Byte Array to Hex String. Real's How To tells you how.
return byteArrayToHexString(md.digest(convertme))
and (copied from Real's How To)
public static String byteArrayToHexString(byte[] b) {
    String result = "";
    for (int i = 0; i < b.length; i++) {
        result += Integer.toString( ( b[i] & 0xff ) + 0x100, 16).substring( 1 );
    }
    return result;
}
BTW, you may get a more compact representation using Base64. Apache Commons Codec API 1.4 has a nice utility to take away all the pain.
This is my solution for converting a string to SHA-1. It works well in my Android app:
private static String encryptPassword(String password)
{
    String sha1 = "";
    try
    {
        MessageDigest crypt = MessageDigest.getInstance("SHA-1");
        crypt.reset();
        crypt.update(password.getBytes("UTF-8"));
        sha1 = byteToHex(crypt.digest());
    }
    catch (NoSuchAlgorithmException e)
    {
        e.printStackTrace();
    }
    catch (UnsupportedEncodingException e)
    {
        e.printStackTrace();
    }
    return sha1;
}

private static String byteToHex(final byte[] hash)
{
    Formatter formatter = new Formatter();
    for (byte b : hash)
    {
        formatter.format("%02x", b);
    }
    String result = formatter.toString();
    formatter.close();
    return result;
}
Using Guava Hashing class:
Hashing.sha1().hashString( "password", Charsets.UTF_8 ).toString()
SHA-1 (and all other hashing algorithms) return binary data. That means that (in Java) they produce a byte[]. That byte array does not represent any specific characters, which means you can't simply turn it into a String like you did.
If you need a String, then you have to format that byte[] in a way that can be represented as a String (otherwise, just keep the byte[] around).
Two common ways of representing arbitrary byte[] as printable characters are BASE64 or simple hex-Strings (i.e. representing each byte by two hexadecimal digits). It looks like you're trying to produce a hex-String.
There's also another pitfall: if you want to get the SHA-1 of a Java String, then you need to convert that String to a byte[] first (as the input of SHA-1 is a byte[] as well). If you simply use myString.getBytes() as you showed, then it will use the platform default encoding and as such will be dependent on the environment you run it in (for example, it could return different data based on the language/locale setting of your OS).
A better solution is to specify the encoding to use for the String-to-byte[] conversion like this: myString.getBytes("UTF-8"). Choosing UTF-8 (or another encoding that can represent every Unicode character) is the safest choice here.
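Putting those two points together, here is a sketch of hashing a String safely; since Java 7, StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException (Sha1Hex is an illustrative name, and the expected value is the one from the question):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hex {
    // SHA-1 of a String with an explicit encoding, hex-encoded.
    static String sha1Hex(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Expected: 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
        System.out.println(sha1Hex("password"));
    }
}
```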
This is a simple solution that can be used when converting a string to a hex format:
private static String encryptPassword(String password) throws NoSuchAlgorithmException, UnsupportedEncodingException {
    MessageDigest crypt = MessageDigest.getInstance("SHA-1");
    crypt.reset();
    crypt.update(password.getBytes("UTF-8"));
    // Caveat: BigInteger drops leading zeros, so the result can be shorter than 40 chars
    return new BigInteger(1, crypt.digest()).toString(16);
}
Just use the apache commons codec library. They have a utility class called DigestUtils
No need to get into details.
As mentioned before use apache commons codec. It's recommended by Spring guys as well (see DigestUtils in Spring doc). E.g.:
DigestUtils.sha1Hex(b);
Definitely wouldn't use the top rated answer here.
It is not printing correctly because the raw digest needs a binary-to-text encoding, such as Base64. With Java 8 you can encode using the Base64 encoder class.
public static String toSHA1(byte[] convertme) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    return Base64.getEncoder().encodeToString(md.digest(convertme));
}
Result
Note that Base64 is a different textual encoding from hex: for "password" this prints W6ph5Mm5Pz8GgiULbPgzG37mj9g=, not 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8. If you want that hex form, hex-encode the digest instead.
Message Digest (hash) is byte[] in byte[] out
A message digest is defined as a function that takes a raw byte array and returns a raw byte array (aka byte[]). For example, SHA-1 (Secure Hash Algorithm 1) has a digest size of 160 bit, or 20 bytes. Raw byte arrays cannot usually be interpreted as character encodings like UTF-8, because not every byte sequence is legal in that encoding. So converting them to a String with:
new String(md.digest(subject), StandardCharsets.UTF_8)
might create some illegal sequences or map to undefined Unicode code points:
[�a�ɹ??�%l�3~��.
Binary-to-text Encoding
For that, binary-to-text encodings are used. With hashes, the one used most is hex encoding, or Base16. Basically a byte can have a value from 0 to 255 (or -128 to 127 signed), which is equivalent to the hex representation 0x00-0xFF. Therefore hex doubles the required length of the output: a 20-byte output becomes a 40-character hex string, e.g.:
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
Note that it is not required to use hex encoding. You could also use something like Base64. Hex is often preferred because it is easier for humans to read and has a defined output length without the need for padding.
You can convert a byte array to hex with JDK functionality alone:
new BigInteger(1, token).toString(16)
Note however that BigInteger will interpret the given byte array as a number, not as a byte string. That means leading zeros will not be output, and the resulting string may be shorter than 40 chars.
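A sketch of the problem and a common workaround, left-padding with String.format (the sample digest is made up so that it starts with a zero nibble; names are illustrative):

```java
import java.math.BigInteger;

public class HexLeadingZero {
    // BigInteger treats the bytes as a number, so leading zero digits vanish.
    static String viaBigInteger(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // Left-pad to two hex chars per byte to restore the fixed width.
    static String padded(byte[] digest) {
        return String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
    }

    public static void main(String[] args) {
        byte[] digest = {0x0F, (byte) 0xA2, 0x00}; // first nibble is zero
        System.out.println(viaBigInteger(digest)); // fa200  (5 chars, zero lost)
        System.out.println(padded(digest));        // 0fa200 (fixed width restored)
    }
}
```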
Using Libraries to Encode to HEX
You could now copy and paste an untested byte-to-hex method from Stack Overflow or use massive dependencies like Guava.
To have a go-to solution for most byte related issues I implemented a utility to handle these cases: bytes-java (Github)
To convert your message digest byte array you could just do
String hex = Bytes.wrap(md.digest(subject)).encodeHex();
or you could just use the built-in hash feature
String hex = Bytes.from(subject).hashSha1().encodeHex();
Base 64 Representation of SHA1 Hash:
String hashedVal = Base64.getEncoder().encodeToString(DigestUtils.sha1(stringValue.getBytes(Charset.forName("UTF-8"))));
Convert byte array to hex string.
public static String toSHA1(byte[] convertme) {
    final char[] HEX_CHARS = "0123456789ABCDEF".toCharArray();
    MessageDigest md = null;
    try {
        md = MessageDigest.getInstance("SHA-1");
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
    }
    byte[] buf = md.digest(convertme);
    char[] chars = new char[2 * buf.length];
    for (int i = 0; i < buf.length; ++i) {
        chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
        chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
    }
    return new String(chars);
}
The reason this doesn't work is that when you call new String(md.digest(convertme)), you are telling Java to interpret a sequence of arbitrary hash bytes as a String in the platform encoding. What you want is to convert the bytes into hexadecimal characters.
Maybe this helps (works on java 17):
import org.apache.tomcat.util.codec.binary.Base64;
return new String(Base64.encodeBase64(md.digest(convertme)));
